- Learning Data Mining with Python (Second Edition)
- Robert Layton
Random forests
A single Decision Tree can learn quite complex functions. However, decision trees are prone to overfitting: learning rules that work only for the specific training set and don't generalize well to new data.
One of the ways that we can adjust for this is to limit the number of rules that the tree learns. For instance, we could limit the depth of the tree to just three layers. Such a tree will learn the best rules for splitting the dataset at a global level, but won't learn highly specific rules that separate the dataset into highly accurate groups. This trade-off results in trees that may generalize well, but have slightly poorer overall performance on the training dataset.
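As a concrete illustration, here is a minimal sketch of a depth-limited tree using scikit-learn. The breast cancer dataset is just a stand-in for whatever data you are working with, not part of this section's discussion:

```python
# A minimal sketch of a depth-limited tree, assuming scikit-learn is installed.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Limiting the depth to three layers keeps the learned rules general.
shallow_tree = DecisionTreeClassifier(max_depth=3)
scores = cross_val_score(shallow_tree, X, y, scoring='accuracy')
print("Depth-3 tree accuracy: {:.3f}".format(scores.mean()))
```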
To compensate for this, we could create many of these limited decision trees and then ask each to predict the class value. We could take a majority vote and use that answer as our overall prediction. Random Forests is an algorithm developed from this insight.
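The majority-vote step itself is simple; here is a small sketch using made-up predictions from three hypothetical trees, just to show the mechanics:

```python
# Majority voting over per-tree predictions, sketched with NumPy.
import numpy as np

# Hypothetical binary predictions from three trees for five samples.
predictions = np.array([[0, 1, 1, 0, 1],
                        [0, 1, 0, 0, 1],
                        [1, 1, 1, 0, 0]])

# For binary labels, the majority vote is whether more than half voted 1.
majority = (predictions.mean(axis=0) > 0.5).astype(int)
print(majority)  # [0 1 1 0 1]
```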
There are two problems with the aforementioned procedure. The first problem is that building decision trees is largely deterministic: using the same input will result in the same output each time. We only have one training dataset, which means our input (and therefore the output) will be the same if we try to build multiple trees. We can address this by choosing a random sample of our dataset, drawn with replacement, effectively creating a new training set for each tree. This process is called bagging (short for bootstrap aggregating), and it can be very effective in many situations in data mining.
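A minimal bagging sketch, assuming scikit-learn's BaggingClassifier and the same stand-in dataset as before; each tree is trained on its own bootstrap sample:

```python
# Bagging shallow trees, assuming scikit-learn is installed.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Ten depth-limited trees, each fit on a sample drawn with replacement.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                                 n_estimators=10)
scores = cross_val_score(bagged_trees, X, y, scoring='accuracy')
print("Bagged trees accuracy: {:.3f}".format(scores.mean()))
```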
The second problem we might run into with creating many decision trees from similar data is that the features that are used for the first few decision nodes in our tree will tend to be similar. Even if we choose random subsamples of our training data, it is still quite possible that the decision trees built will be largely the same. To compensate for this, we also choose a random subset of the features to perform our data splits on.
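The per-split feature sampling can be sketched directly with NumPy. The feature count and the square-root subset size below are illustrative assumptions (the square root is a common default), not values from the text:

```python
# Choosing a random subset of candidate features for one split.
import numpy as np

n_features = 30
subset_size = int(np.sqrt(n_features))  # a common default heuristic

rng = np.random.RandomState(14)
candidate_features = rng.choice(n_features, size=subset_size, replace=False)
print(candidate_features)  # indices considered for this split only
```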
We now have randomly built trees, each using randomly chosen samples and (nearly) randomly chosen features. This is a random forest and, perhaps unintuitively, this algorithm is very effective on many datasets, with little need to tune the many parameters of the model.
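scikit-learn packages this combination of bootstrap sampling and per-split feature sampling as RandomForestClassifier. A minimal usage sketch, again on a stand-in dataset:

```python
# A random forest with scikit-learn; the defaults already handle the
# bootstrap sampling and feature subsetting described above.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(forest, X, y, scoring='accuracy')
print("Random forest accuracy: {:.3f}".format(scores.mean()))
```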