- Go Machine Learning Projects
- Xuanyi Chew
- 2021-06-10 18:46:39
Ingesting the data
Now, without much further ado, let's write some code to ingest the data. First, we need a data structure for a training example:
// Example is a tuple representing a classification example
type Example struct {
	Document []string
	Class
}
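The Class type embedded in Example is only introduced later in the chapter. To make the struct easier to read in isolation, here is a minimal sketch that assumes Class is a small integer type with two named values, Ham and Spam. The underlying type (byte) and the String method are assumptions made for illustration, not necessarily the definition the chapter settles on.

```go
package main

import "fmt"

// Class is a hypothetical stand-in for the chapter's Class type.
// Assumption: a small integer type with two named values.
type Class byte

const (
	Ham Class = iota
	Spam
)

// String makes a Class print as a readable label.
func (c Class) String() string {
	switch c {
	case Ham:
		return "Ham"
	case Spam:
		return "Spam"
	default:
		return "Unknown"
	}
}

// Example is a tuple representing a classification example.
type Example struct {
	Document []string
	Class
}

func main() {
	ex := Example{[]string{"win", "a", "prize"}, Spam}
	// The embedded field is promoted, so ex.Class reads the label directly.
	fmt.Println(ex.Class, len(ex.Document))
}
```

Because Class is embedded rather than named, its methods are promoted to Example as well. With the String method sketched here, printing an Example value directly would print only its class label, so it is safer to print the fields explicitly.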
We need this structure so that we can parse our files into a list of Example values. The function is shown here:
func ingest(typ string) (examples []Example, err error) {
	// only the four preprocessed variants of the corpus are accepted
	switch typ {
	case "bare", "lemm", "lemm_stop", "stop":
	default:
		return nil, errors.Errorf("expected \"bare\", \"lemm\", \"lemm_stop\" or \"stop\", got %q", typ)
	}

	var errs errList
	start, end := 0, 11
	// this range scans every part; narrow it to hold some parts out for cross-validation
	for i := start; i < end; i++ {
		matches, err := filepath.Glob(fmt.Sprintf("data/lingspam_public/%s/part%d/*.txt", typ, i))
		if err != nil {
			errs = append(errs, err)
			continue
		}
		for _, match := range matches {
			str, err := ingestOneFile(match)
			if err != nil {
				// record the failing file and keep going
				errs = append(errs, errors.WithMessage(err, match))
				continue
			}
			if strings.Contains(match, "spmsg") {
				// is spam
				examples = append(examples, Example{str, Spam})
			} else {
				// is ham
				examples = append(examples, Example{str, Ham})
			}
		}
	}
	if errs != nil {
		err = errs
	}
	return
}
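The ingest function collects failures in an errList, whose definition is not shown in this section. The following is a minimal sketch, assuming errList is simply a slice of errors that itself satisfies the error interface. The collect function and its inputs are hypothetical and exist only to show the pattern; the sketch uses the standard library errors package rather than the third-party package that provides Errorf and WithMessage.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errList is a hypothetical definition: a slice of errors that
// itself satisfies the error interface.
type errList []error

// Error joins the individual messages, one per line.
func (e errList) Error() string {
	msgs := make([]string, 0, len(e))
	for _, err := range e {
		msgs = append(msgs, err.Error())
	}
	return strings.Join(msgs, "\n")
}

// collect mimics the pattern used in ingest: keep going after a
// failure and report everything at the end.
func collect(inputs []string) (ok []string, err error) {
	var errs errList
	for _, in := range inputs {
		if in == "" {
			errs = append(errs, errors.New("empty input"))
			continue
		}
		ok = append(ok, in)
	}
	// Assign only when non-nil: a nil errList stored in an error
	// interface would compare unequal to nil at the call site.
	if errs != nil {
		err = errs
	}
	return
}

func main() {
	ok, err := collect([]string{"a", "", "b"})
	fmt.Println(ok, err)
}
```

The guard before the assignment matters. Returning a nil errList directly as an error would produce an interface value that holds a type but no data, and callers checking err != nil would see a failure that never happened.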
In ingest, I used filepath.Glob to find the files that match the pattern within a specific directory, which is hardcoded. It doesn't have to be hardcoded in your actual code, but hardcoding the path makes for simpler demo programs. For each matching filename, we parse the file using the ingestOneFile function. Then we check whether the path contains spmsg, the prefix that marks spam filenames in this corpus. If it does, we create an Example that has Spam as its class. Otherwise, it is marked as Ham. Later sections of this chapter walk through the Class type and the rationale for choosing it. For now, here's the ingestOneFile function. Take note of its simplicity:
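// ingestOneFile reads a file and splits its contents on single spaces.
func ingestOneFile(abspath string) ([]string, error) {
	// since Go 1.16, os.ReadFile is the preferred equivalent of ioutil.ReadFile
	bs, err := ioutil.ReadFile(abspath)
	if err != nil {
		return nil, err
	}
	return strings.Split(string(bs), " "), nil
}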
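Two details of the code above are easy to miss: the spam check matches spmsg anywhere in the path, and splitting on a single space leaves newlines and empty strings in the token list. The sketch below illustrates both with stricter alternatives. The helper names isSpam and tokenize, the sample text, and the example paths are all hypothetical; this is one possible refinement, not the approach the chapter takes.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isSpam reports whether the file name itself (not the whole path)
// starts with "spmsg".
func isSpam(path string) bool {
	return strings.HasPrefix(filepath.Base(path), "spmsg")
}

// tokenize splits on any run of whitespace, so newlines and repeated
// spaces never end up inside or between tokens.
func tokenize(text string) []string {
	return strings.Fields(text)
}

func main() {
	text := "Subject: hello\n\nfree  money now\n"

	// Splitting on a single space keeps "hello\n\nfree" as one token
	// and yields an empty string for the doubled space.
	fmt.Printf("%q\n", strings.Split(text, " "))
	fmt.Printf("%q\n", tokenize(text))

	// A directory name containing "spmsg" fools a substring check on
	// the full path, but not a prefix check on the base name.
	p := "data/spmsg_backup/part1/3-1msg1.txt"
	fmt.Println(strings.Contains(p, "spmsg"), isSpam(p))
}
```

For the hardcoded corpus layout used in this section, the substring check is sufficient because no directory name contains spmsg. The stricter check only starts to matter once the data path becomes configurable.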