Understanding model capacity trade-offs

Let's train trees with different depths, from a depth of 1 up to a maximum depth of 22:

In []: 
train_losses = [] 
test_losses = [] 
depths = range(1, 23) 
for depth in depths: 
    tree_model.max_depth = depth 
    tree_model = tree_model.fit(X_train, y_train) 
    train_losses.append(1 - tree_model.score(X_train, y_train)) 
    test_losses.append(1 - tree_model.score(X_test, y_test)) 
figure = plt.figure()  
plt.plot(depths, train_losses, label="training loss", linestyle='--') 
plt.plot(depths, test_losses, label="test loss") 
plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3, ncol=2, mode="expand", borderaxespad=0.) 
Out[]: 
Figure 2.8: Training loss versus test loss, depending on the maximum tree depth

On the x axis, we've plotted the maximum tree depth, and on the y axis, the model's error. The phenomenon we're observing here is familiar to any machine learning practitioner: as the model gets more complex, it becomes more prone to overfitting. At first, as the model's capacity grows, both training and test loss (error) decrease, but then something strange happens: while the error on the training set continues to go down, the test error starts growing. This means that the model fits the training examples so well that it can no longer generalize to unseen data. That's why it's so important to keep a held-out dataset and perform your model validation on it. From the plot above, we can see that our more-or-less arbitrary choice of max_depth=4 was lucky: at this point, the test error even dropped below the training error.
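Reading the best depth off the plot by eye works, but we can also select it programmatically by taking the depth whose held-out error is smallest. A minimal sketch, using made-up loss values in place of the `test_losses` list computed in the loop above (the numbers below are illustrative, not the book's actual results):

```python
# Hypothetical test losses for depths 1..10 (illustrative values only)
test_losses = [0.40, 0.31, 0.25, 0.22, 0.24, 0.27, 0.30, 0.33, 0.35, 0.38]
depths = range(1, len(test_losses) + 1)

# The best capacity is the depth with the lowest held-out error
best_depth, best_loss = min(zip(depths, test_losses), key=lambda pair: pair[1])
print(best_depth, best_loss)  # → 4 0.22
```

With the real `test_losses` from the training loop, `best_depth` would give the `max_depth` to use when refitting the final model, rather than relying on a lucky guess.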