
Linear regression

Now that that's all done, let's do some linear regression! But first, let's clean up our code. We'll move our exploratory work so far into a function called exploratory(). Then we will reread the file, split the dataset into training and testing sets, and perform all the transformations before finally running the regression. For the regression itself, we will use the github.com/sajari/regression package.

The first part looks like this:

func main() {
	// exploratory() // commented out because we're done with exploratory work.

	f, err := os.Open("train.csv")
	mHandleErr(err)
	defer f.Close()

	hdr, data, indices, err := ingest(f)
	mHandleErr(err)
	rows, cols, XsBack, YsBack, newHdr, newHints := clean(hdr, data, indices, datahints, ignored)
	Xs := tensor.New(tensor.WithShape(rows, cols), tensor.WithBacking(XsBack))
	it, err := native.MatrixF64(Xs)
	mHandleErr(err)

	// transform the Ys
	for i := range YsBack {
		YsBack[i] = math.Log1p(YsBack[i])
	}
	// transform the Xs
	transform(it, newHdr, newHints)

	// partition the data: hold out 20% of the rows for testing
	shuffle(it, YsBack)
	testingRows := int(float64(rows) * 0.2)
	trainingRows := rows - testingRows
	testingSet := it[trainingRows:]
	testingYs := YsBack[trainingRows:]
	it = it[:trainingRows]
	YsBack = YsBack[:trainingRows]
	log.Printf("len(it): %d || %d", len(it), len(YsBack))
	...

We first ingest and clean the data, then we create an iterator for the matrix of Xs for easier access. We then transform both the Xs and the Ys. Finally, we shuffle the rows (keeping each X row paired with its Y) and partition them into a training dataset and a testing dataset.

Recall from the first chapter what makes a model good: a good model must be able to generalize to previously unseen combinations of values. To guard against overfitting, we must cross-validate our model.

In order to achieve that, we must train on only a subset of the data, then use the model to predict on the held-out test set. Scoring those predictions tells us how well the model generalizes.

Ideally, this split would be done before the data is parsed into the Xs and Ys. But since we'd like to reuse the functions we wrote earlier, we won't do that here. Because ingest and clean are separate functions, however, you can do so yourself; if you visit the repository on GitHub, you will find all the functions needed for that.

For now, we simply set aside 20% of the dataset for testing. A shuffle resamples the rows so that we don't train on the same 80% every time.
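The shuffle must keep each row of Xs aligned with its Y. Here is a minimal sketch of what such a shuffle might look like (our own illustration, assuming it is the [][]float64 returned by native.MatrixF64): a Fisher-Yates shuffle that applies the same swaps to both slices:

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffle permutes the rows of it and the entries of ys with the same
// Fisher-Yates swaps, so row i of it always stays paired with ys[i].
func shuffle(it [][]float64, ys []float64) {
	for i := len(it) - 1; i > 0; i-- {
		j := rand.Intn(i + 1)
		it[i], it[j] = it[j], it[i]
		ys[i], ys[j] = ys[j], ys[i]
	}
}

func main() {
	xs := [][]float64{{1}, {2}, {3}, {4}}
	ys := []float64{10, 20, 30, 40}
	shuffle(xs, ys)
	for i := range xs {
		fmt.Printf("x=%v y=%v\n", xs[i], ys[i])
	}
}
```

Swapping both slices inside the same loop is what guarantees that, after shuffling, slicing both at trainingRows still yields matched (X, Y) pairs.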

Also, note that the clean function now takes ignored, whereas in exploratory mode it took nil. This, along with the shuffle, is important for cross-validation later on.
