Ingesting the data

Now, without further ado, let's write some code to ingest the data. First, we need a data structure for a training example:

// Example is a tuple representing a classification example
type Example struct {
	Document []string
	Class
}
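To make the shape of Example concrete, here is a minimal sketch of how a value might be built and read. The Class definition below is a hypothetical stand-in (the chapter introduces the real one later and it may differ); only the Example struct itself comes from the text.

```go
package main

import "fmt"

// Class is a hypothetical stand-in for the chapter's Class type,
// which is introduced later; the real definition may differ.
type Class byte

const (
	Ham Class = iota
	Spam
)

// Example is a tuple representing a classification example.
type Example struct {
	Document []string
	Class    // embedded, so the field is read as ex.Class
}

func main() {
	// An unkeyed literal fills the fields in declaration order.
	ex := Example{[]string{"buy", "now"}, Spam}
	fmt.Println(len(ex.Document), ex.Class == Spam)
}
```

Because Class is embedded rather than named, the field takes the name of its type, which is why the unkeyed literal form works with just a document and a class.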

We define Example so that we can parse our files into a list of Example values. The function is shown here:

func ingest(typ string) (examples []Example, err error) {
	switch typ {
	case "bare", "lemm", "lemm_stop", "stop":
	default:
		return nil, errors.Errorf("Expected only \"bare\", \"lemm\", \"lemm_stop\" or \"stop\"")
	}

	var errs errList
	start, end := 0, 11

	// as written, every part directory is read; narrow start and end to hold data out for crossval
	for i := start; i < end; i++ {
		matches, err := filepath.Glob(fmt.Sprintf("data/lingspam_public/%s/part%d/*.txt", typ, i))
		if err != nil {
			errs = append(errs, err)
			continue
		}

		for _, match := range matches {
			str, err := ingestOneFile(match)
			if err != nil {
				errs = append(errs, errors.WithMessage(err, match))
				continue
			}

			if strings.Contains(match, "spmsg") {
				// is spam
				examples = append(examples, Example{str, Spam})
			} else {
				// is ham
				examples = append(examples, Example{str, Ham})
			}
		}
	}
	if errs != nil {
		err = errs
	}
	return
}

Here, I used filepath.Glob to find the files that match the pattern within a specific directory, which is hardcoded. It doesn't have to be hardcoded in your actual code, but hardcoding the path makes for simpler demo programs. For each matching filename, we parse the file using the ingestOneFile function. Then we check whether the path contains spmsg, the prefix that marks spam filenames. If it does, we create an Example that has Spam as its class. Otherwise, it is marked as Ham. Note that a file that fails to parse is recorded in errs and skipped, so ingest can return a partial list of examples together with a non-nil error. Later sections of this chapter walk through the Class type and the rationale for choosing it. For now, here's the ingestOneFile function. Take note of its simplicity:

// ingestOneFile reads the whole file and splits its contents on single spaces
func ingestOneFile(abspath string) ([]string, error) {
	bs, err := ioutil.ReadFile(abspath)
	if err != nil {
		return nil, err
	}
	return strings.Split(string(bs), " "), nil
}