
Applying random forests

Random forests in scikit-learn use the Estimator interface, allowing us to use almost exactly the same code as before to perform cross-validation:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This gives an immediate improvement to 65.3 percent accuracy, up 2.5 points just from swapping in the new classifier.

Because random forests consider only a subset of the features at each split, they should be able to learn more effectively from larger feature sets than ordinary decision trees. We can test this by throwing more features at the algorithm and seeing how it fares:

X_all = np.hstack([X_lastwinner, X_teams])
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

This results in 63.3 percent accuracy, a drop in performance! One cause is the randomness inherent in random forests: each split considers only some of the features rather than others. Further, there are many more features in X_teams than in X_lastwinner, and the extra features result in less relevant information being used. That said, don't get too excited by small changes in percentages, either up or down. Changing the random state value will have more of an impact on the accuracy than the slight difference between these feature sets that we just observed. Instead, you should run many tests with different random states to get a good sense of the mean and spread of the accuracy values, as in the sketch below.
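
As a minimal sketch of that idea, assuming the X_all and y_true arrays from above, we could sweep over a set of random states (20 here, an arbitrary choice) and summarise the resulting accuracies:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Evaluate the same feature set under many random states to estimate
# the mean and spread of the accuracy, rather than trusting a single run
all_scores = []
for random_state in range(20):
    clf = RandomForestClassifier(random_state=random_state)
    scores = cross_val_score(clf, X_all, y_true, scoring='accuracy')
    all_scores.append(np.mean(scores))

print("Mean accuracy: {0:.1f}%".format(np.mean(all_scores) * 100))
print("Standard deviation: {0:.1f}%".format(np.std(all_scores) * 100))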

We can also try some other parameters using the GridSearchCV class, as we introduced in Chapter 2, Classifying using scikit-learn Estimators:

from sklearn.model_selection import GridSearchCV

parameter_space = {
    "max_features": [2, 10, 'sqrt'],
    "n_estimators": [100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2, 4, 6],
}

clf = RandomForestClassifier(random_state=14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(X_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))

This has a much better accuracy of 67.4 percent!

If we want to see the parameters used, we can print out the best model found in the grid search. The code is as follows:

print(grid.best_estimator_)

The result shows the parameters that were used in the best-scoring model:

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features=2, max_leaf_nodes=None,
min_samples_leaf=2, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
oob_score=False, random_state=14, verbose=0, warm_start=False)
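
If we only want the parameter values that the search actually tuned, rather than the full model, GridSearchCV also exposes them as a dictionary through its best_params_ attribute:

print(grid.best_params_)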