
Applying random forests

Random forests in scikit-learn use the Estimator interface, allowing us to use almost exactly the same code as before to perform cross-validation:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This gives an immediate improvement to 65.3 percent accuracy, up 2.5 percentage points just from swapping the classifier.

Random forests, which use subsets of the features, should be able to learn more effectively with more features than normal decision trees can. We can test this by giving the algorithm more features and seeing how the accuracy changes:

X_all = np.hstack([X_lastwinner, X_teams])
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This results in 63.3 percent, a drop in performance! One cause is the randomness inherent in random forests: each tree only chooses some of the features to use rather than others, and a different random state can lead to a different choice. Further, there are many more features in X_teams than in X_lastwinner, so the extra features dilute the relevant information. That said, don't get too excited by small changes in percentages, either up or down. Changing the random state value will have more of an impact on the accuracy than the slight difference between these feature sets that we just observed. Instead, you should run many tests with different random states to get a good sense of the mean and spread of the accuracy values.
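As a minimal sketch of this idea, assuming the X_all and y_true arrays from above, we could repeat the evaluation over a range of random states and report the mean and standard deviation:

# Sketch: repeat the evaluation across several random states to see
# both the average accuracy and how much it varies between runs
results = []
for random_state in range(20):
    clf = RandomForestClassifier(random_state=random_state)
    scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
    results.append(np.mean(scores))
print("Mean accuracy: {0:.1f}%".format(np.mean(results) * 100))
print("Standard deviation: {0:.1f}%".format(np.std(results) * 100))

If the standard deviation is larger than the gap between two feature sets, the apparent difference between them is probably just noise.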

We can also try some other parameters using the GridSearchCV class, as we introduced in Chapter 2, Classifying using scikit-learn Estimators:

from sklearn.model_selection import GridSearchCV

parameter_space = {
    "max_features": [2, 10, 'sqrt'],
    "n_estimators": [100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))

This has a much better accuracy of 67.4 percent!

If we want to see the parameters that were used, we can print out the best model found during the grid search. The code is as follows:

print(grid.best_estimator_)

The result shows the parameters that were used in the best-scoring model:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=2, max_leaf_nodes=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=14, verbose=0, warm_start=False)
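To reuse these parameters, one option (a minimal sketch, assuming the grid, X_all, and y_true objects from the grid search above) is to build a fresh classifier from grid.best_params_ and verify it with cross-validation:

# Sketch: rebuild a classifier with the best parameters found by the
# grid search and re-check its accuracy with cross-validation
best_clf = RandomForestClassifier(random_state=14, **grid.best_params_)
scores = cross_val_score(best_clf, X_all, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))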