  • The Data Science Workshop
  • Anthony So Thomas V. Joseph Robert Thas John Andrew Worsley Dr. Samuel Asare
  • 2021-06-11 18:27:26

Summary

We have reached the end of this chapter on multiclass classification with Random Forest. We learned that multiclass classification is an extension of binary classification: instead of predicting only two classes, the target variable can take many more values. We saw how to train a Random Forest model in just a few lines of code and assess its performance by calculating the accuracy score on the training and testing sets. Finally, we learned how to tune some of its most important hyperparameters: n_estimators, max_depth, min_samples_leaf, and max_features. We also saw that their values can have a significant impact not only on the predictive power of a model but also on its ability to generalize to unseen data.
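As a brief recap, the workflow described above can be sketched as follows. This is a minimal illustration rather than the chapter's exact code: the synthetic three-class dataset from make_classification and the specific hyperparameter values are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data stands in for a real dataset (assumption for illustration)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The four hyperparameters covered in the chapter; the values are illustrative only
rf_model = RandomForestClassifier(
    n_estimators=50,
    max_depth=5,
    min_samples_leaf=10,
    max_features=0.5,
    random_state=42,
)
rf_model.fit(X_train, y_train)

# Compare accuracy on both sets: a large gap between them suggests overfitting
train_accuracy = accuracy_score(y_train, rf_model.predict(X_train))
test_accuracy = accuracy_score(y_test, rf_model.predict(X_test))
print(f"Training accuracy: {train_accuracy:.3f}")
print(f"Testing accuracy: {test_accuracy:.3f}")
```

Comparing the two printed scores is what tells you whether a given combination of hyperparameters generalizes or merely memorizes the training set.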

In real projects, it is extremely important to choose a valid testing set. It is your final proxy before putting a model into production, so you want it to reflect the kind of data you expect the model to receive in the future. For instance, if your dataset has a date field, you can use the last few weeks or months as your testing set and everything before that date as the training set. If you don't choose the testing set properly, you may end up with a model that scores well and does not appear to overfit, yet generates incorrect results once in production. In that case, the problem doesn't come from the model but from the fact that the testing set was chosen poorly.
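A date-based split of this kind might look like the sketch below. The DataFrame, its column names (date, feature, target), and the four-week cutoff are all hypothetical and serve only to illustrate the idea.

```python
import pandas as pd

# Hypothetical dataset with one row per day (column names are assumptions)
df = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=120, freq="D"),
    "feature": range(120),
    "target": [i % 3 for i in range(120)],
})

# Hold out the final four weeks as the testing set (the cutoff length is an assumption)
cutoff = df["date"].max() - pd.Timedelta(weeks=4)
train_df = df[df["date"] <= cutoff]
test_df = df[df["date"] > cutoff]

print(f"Training rows: {len(train_df)}, testing rows: {len(test_df)}")
```

Because every testing row is more recent than every training row, the evaluation mimics the situation in production, where the model only ever scores data that arrived after it was trained.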

In some projects, you may see the dataset split into three different sets: training, validation, and testing. The validation set is used to tune the hyperparameters, and once you are confident enough in the chosen values, you evaluate your model on the testing set. As mentioned earlier, we don't want the model to see too much of the testing set, but hyperparameter tuning requires you to run a model several times until you find the optimal values. This is why most data scientists create a validation set for this purpose and only use the testing set a handful of times. This will be explained in more depth in Chapter 7, The Generalization of Machine Learning Models.
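One possible way to set up such a three-way split is sketched below. The 60/20/20 proportions, the synthetic data, and the candidate max_depth values are assumptions for illustration; the point is that only the validation set is consulted during tuning, and the testing set is scored once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data stands in for a real dataset (assumption for illustration)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_classes=3, random_state=1
)

# Set the testing set aside first, then split the remainder into training and validation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1
)

# Tune max_depth using the validation set only (candidate values are illustrative)
best_depth, best_val_accuracy = None, -1.0
for depth in [3, 5, 10]:
    model = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    val_accuracy = accuracy_score(y_val, model.predict(X_val))
    if val_accuracy > best_val_accuracy:
        best_depth, best_val_accuracy = depth, val_accuracy

# Score the chosen configuration on the testing set a single time
final_model = RandomForestClassifier(
    n_estimators=50, max_depth=best_depth, random_state=1
)
final_model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(f"Best max_depth: {best_depth}, testing accuracy: {test_accuracy:.3f}")
```

Splitting in two steps keeps the testing set untouched while the loop runs as many times as needed against the validation set.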

In the next chapter, you will be introduced to unsupervised learning and will learn how to build a clustering model with the k-means algorithm.
