
K-fold cross-validation

This method was invented and gained popularity in the days before big data became a problem, when everyone had small datasets but still needed to build reliable models. First, we shuffle our dataset well and then divide it randomly into several equal parts, say 10 (this is the k in k-fold). We hold out the first part as a test set and train the model on the remaining nine parts. The trained model is then assessed as usual on the test set, which did not participate in the training. Next, we hold out the second of the 10 parts and train the model on the remaining nine (including the part that previously served as the test set). We validate the new model, again on the part that did not participate in the training. We continue this process until each of the 10 parts has played the role of the test set. The final quality metrics are determined by averaging the metrics from each of the 10 tests:

In []: 
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn.model_selection import cross_val_score 

# Run 10-fold cross-validation and average the 10 accuracy scores 
scores = cross_val_score(tree_model, features, df.label, cv=10) 
np.mean(scores) 
Out[]: 
0.88300000000000001 
In []: 
# One bar per fold: the model's accuracy on each held-out part 
plot = plt.bar(range(1, 11), scores) 
Out[]: 
Figure 2.10: Cross-validation results

From the preceding graph, you can see that the model's accuracy depends on how you split the data, but not by much. By taking the mean and variance of the cross-validation results, you can get a sense of how well your model generalizes to different data, and how stable it is.
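To make the mechanics explicit, the following is a minimal sketch of the same procedure written out by hand with scikit-learn's KFold splitter. It reuses the tree_model, features, and df objects from the earlier examples, and is only roughly equivalent to cross_val_score (which, for classifiers, uses stratified folds and no shuffling by default):

In []: 
import numpy as np 
from sklearn.base import clone 
from sklearn.model_selection import KFold 

# Plain arrays, so that integer fold indices work regardless of input type 
X = np.asarray(features) 
y = np.asarray(df.label) 

scores = [] 
kf = KFold(n_splits=10, shuffle=True, random_state=1) 
for train_idx, test_idx in kf.split(X): 
    # Train a fresh copy of the model on 9 of the 10 parts... 
    model = clone(tree_model) 
    model.fit(X[train_idx], y[train_idx]) 
    # ...and assess it on the one held-out part 
    scores.append(model.score(X[test_idx], y[test_idx])) 

# The mean estimates the expected quality, the standard deviation its stability 
np.mean(scores), np.std(scores) 

Cloning the estimator before each fit ensures that every fold starts from an untrained model, rather than continuing from the model fitted on the previous fold.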
