
Ingesting the data

Now, without further ado, let's write some code to ingest the data. First, we need a data structure for a training example:

// Example is a tuple representing a classification example
type Example struct {
	Document []string
	Class
}

With this structure, we can parse our files into a list of Example values. The function is shown here:

func ingest(typ string) (examples []Example, err error) {
	// only the four preprocessed variants of the corpus are accepted
	switch typ {
	case "bare", "lemm", "lemm_stop", "stop":
	default:
		return nil, errors.New("Expected only \"bare\", \"lemm\", \"lemm_stop\" or \"stop\"")
	}

	var errs errList
	start, end := 0, 11

	// as written, this range visits part0 through part10;
	// narrow it to hold some parts out for cross-validation
	for i := start; i < end; i++ {
		matches, err := filepath.Glob(fmt.Sprintf("data/lingspam_public/%s/part%d/*.txt", typ, i))
		if err != nil {
			errs = append(errs, err)
			continue
		}

		for _, match := range matches {
			str, err := ingestOneFile(match)
			if err != nil {
				errs = append(errs, errors.WithMessage(err, match))
				continue
			}

			if strings.Contains(match, "spmsg") {
				// is spam
				examples = append(examples, Example{str, Spam})
			} else {
				// is ham
				examples = append(examples, Example{str, Ham})
			}
		}
	}
	if errs != nil {
		err = errs
	}
	return
}

Here, I used filepath.Glob to find the files that match the pattern within a specific, hardcoded directory. The path doesn't have to be hardcoded in your actual code, but hardcoding it makes for simpler demo programs. For each matching filename, we parse the file using the ingestOneFile function. Then we check whether the path contains spmsg, which the corpus uses as a filename prefix for spam messages. Note that strings.Contains looks at the whole path, not only the start of the filename. If spmsg is found, we create an Example that has Spam as its class. Otherwise, the example is marked as Ham. In the later sections of this chapter, I will walk through the Class type and the rationale for choosing it. For now, here's the ingestOneFile function. Take note of its simplicity:

func ingestOneFile(abspath string) ([]string, error) {
	bs, err := ioutil.ReadFile(abspath)
	if err != nil {
		return nil, err
	}
	return strings.Split(string(bs), " "), nil
}