
Ingesting the data

Now, without further ado, let's write some code to ingest the data. First, we need a data structure that represents a training example:

// Example is a tuple representing a classification example
type Example struct {
	Document []string
	Class
}

With this structure, we can parse our files into a list of Example values. The function that does so is shown here:

// ingest reads every file of the given corpus variant and returns the labeled examples.
func ingest(typ string) (examples []Example, err error) {
	switch typ {
	case "bare", "lemm", "lemm_stop", "stop":
		// valid variant: nothing to do
	default:
		return nil, errors.Errorf("Expected only \"bare\", \"lemm\", \"lemm_stop\" or \"stop\"")
	}

	var errs errList
	start, end := 0, 11

	for i := start; i < end; i++ { // hold 30% for crossval
		matches, err := filepath.Glob(fmt.Sprintf("data/lingspam_public/%s/part%d/*.txt", typ, i))
		if err != nil {
			errs = append(errs, err)
			continue
		}

		for _, match := range matches {
			str, err := ingestOneFile(match)
			if err != nil {
				errs = append(errs, errors.WithMessage(err, match))
				continue
			}

			if strings.Contains(match, "spmsg") {
				// is spam
				examples = append(examples, Example{str, Spam})
			} else {
				// is ham
				examples = append(examples, Example{str, Ham})
			}
		}
	}
	if errs != nil {
		err = errs
	}
	return
}

Here, I used filepath.Glob to find the files that match the pattern within a specific, hardcoded directory. The path doesn't have to be hardcoded in your actual code, but hardcoding it makes for simpler demo programs. For each matching filename, we parse the file using the ingestOneFile function. Then we check whether the path contains spmsg, the prefix used for spam file names. If it does, we create an Example that has Spam as its class. Otherwise, the example is marked as Ham. In the later sections of this chapter, I will walk through the Class type and the rationale for choosing it. For now, here's the ingestOneFile function. Take note of its simplicity:

func ingestOneFile(abspath string) ([]string, error) {
	bs, err := ioutil.ReadFile(abspath)
	if err != nil {
		return nil, err
	}
	return strings.Split(string(bs), " "), nil
}
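
The ingestion code relies on a few names that have not been defined yet: Class, Spam, Ham and errList. To make the labeling and splitting logic easier to try in isolation, here is a minimal sketch. The definitions of Class and errList are assumptions made for illustration only (the chapter's own Class type is discussed later and may differ), classOf and tokenize are hypothetical helpers that mirror the logic above, and the file names in main are illustrative:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Class is an assumed stand-in for the chapter's Class type.
type Class byte

const (
	Ham Class = iota
	Spam
)

// String makes a Class print as a readable label.
func (c Class) String() string {
	if c == Spam {
		return "Spam"
	}
	return "Ham"
}

// errList is an assumed helper: a list of errors that itself satisfies error.
type errList []error

func (errs errList) Error() string {
	msgs := make([]string, 0, len(errs))
	for _, err := range errs {
		msgs = append(msgs, err.Error())
	}
	return strings.Join(msgs, "\n")
}

// classOf (hypothetical) labels a path the way ingest does:
// any path containing "spmsg" is treated as Spam.
func classOf(path string) Class {
	if strings.Contains(path, "spmsg") {
		return Spam
	}
	return Ham
}

// tokenize (hypothetical) mirrors ingestOneFile's splitting on single spaces.
func tokenize(s string) []string {
	return strings.Split(s, " ")
}

func main() {
	// Illustrative paths; the real corpus files may be named differently.
	paths := []string{
		"data/lingspam_public/bare/part1/spmsga1.txt",
		"data/lingspam_public/bare/part1/3-1msg1.txt",
	}
	for _, p := range paths {
		fmt.Printf("%s -> %v\n", filepath.Base(p), classOf(p))
	}
	fmt.Printf("%q\n", tokenize("free  money\nnow"))
}
```

Two details are worth noticing. Splitting on a single space leaves newlines attached to the neighbouring words and produces empty strings between consecutive spaces, which is the price of keeping ingestOneFile this simple. And the errs != nil check in ingest matters: assigning a nil errList directly to err would yield a non-nil error value.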