
Next model – tree ensembles

Random Forest (RF) and Gradient Boosted Machine (GBM) (also referred to as Gradient Boosted Trees) are two examples of ensemble tree-based models currently available in MLlib; you can think of an ensemble as an uber-model that represents a collection of base models. The best way to understand what an ensemble is doing behind the scenes is to consider a simple analogy:

"Suppose that you are the head coach of a famous soccer club and you have heard rumors of an incredible athlete from Brazil and it may be advantageous to sign this young athlete to your club before the other teams do; but your schedule is incredibly busy and instead, you send 10 of your assistant coaches to assess the player. Each one of your assistant coaches grades the player based on his/her coaching philosophy - maybe one coach wants to measure how fast the player can run 40 yards while another coach thinks height and arm-reach are important. Regardless of how each coach defines "athlete potential" you, as head coach, just want to know if you should sign the player to a contract now or wait. And so it goes that your coaches fly down to Brazil and each coach makes an assessment; upon arrival you go up to each of your coaches and ask "should we draft this player now or wait?" and, based on a simple rule like majority vote, you can make your decision. This an example of what an ensemble is doing behind the scenes with respect to a classification task."

You can think of each coach as a decision tree and, therefore, you will have an ensemble of 10 trees (one for each of the 10 coaches). How each coach assesses the player is highly specific, and this holds true for our trees as well: for each of the 10 trees created, features are selected randomly at each node (hence the random in Random Forest; forest because there are many trees!). The reason for introducing this randomness across the base models is to prevent over-fitting the data.
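To make this concrete, here is a minimal sketch of how such a forest could be trained with the DataFrame-based spark.ml API; the trainingDF DataFrame and its label/features column names are hypothetical placeholders, not something defined earlier in this chapter:

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Ten trees, mirroring the ten coaches in the analogy; at each node only a
// random subset of features is considered (featureSubsetStrategy), which is
// the "random" in Random Forest.
val rf = new RandomForestClassifier()
  .setLabelCol("label")              // assumed column names in trainingDF
  .setFeaturesCol("features")
  .setNumTrees(10)
  .setFeatureSubsetStrategy("sqrt")  // sample sqrt(#features) at each split

val rfModel = rf.fit(trainingDF)     // trainingDF is a hypothetical DataFrame
// For classification, rfModel.transform(testDF) predicts by majority vote
// across the ten trees.
```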

While RF and GBM are both tree-based ensembles, the manner in which they go about training is slightly different and deserves mention. GBMs must be trained one tree at a time in order to minimize a loss function (for example, log-loss, squared error, and so on), and they usually take longer to train than an RF, which can generate multiple trees in parallel. However, when training a GBM, it is recommended to use shallow trees, which in turn lends itself to faster training.
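As a rough illustration of those knobs (assuming the same hypothetical trainingDF and column names as the earlier sketch), a boosted-tree learner could be configured with shallow trees like this:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Boosted trees are built sequentially: each new tree fits the loss gradient
// left by the current ensemble, so keeping the trees shallow (small maxDepth)
// keeps each boosting iteration cheap.
val gbt = new GBTClassifier()
  .setLabelCol("label")          // assumed column names, as above
  .setFeaturesCol("features")
  .setMaxIter(50)                // number of trees, added one at a time
  .setMaxDepth(3)                // shallow trees
  .setStepSize(0.1)              // learning rate
  .setLossType("logistic")       // loss minimized at each iteration

val gbtModel = gbt.fit(trainingDF)
```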

Two further points of comparison are worth noting:

  • RFs generally do not overfit the data compared to a GBM; that is, we can add more trees to our forest and be less prone to over-fitting than if we added more trees to our GBM.
  • Hyper-parameter tuning for an RF is much simpler than for a GBM. In their paper, Influence of Hyperparameters on Random Forest Accuracy, Bernard et al. show via experimentation that the number K of random features to select at each node is a key influencer with respect to model accuracy (https://hal.archives-ouvertes.fr/hal-00436358/document). Conversely, a GBM has many more hyper-parameters that must be considered, such as the loss function, learning rate, number of iterations, and so on (a small tuning sketch follows this list).
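To illustrate the difference in tuning effort, the following sketch reuses the hypothetical rf and gbt estimators and trainingDF from the earlier snippets and contrasts a one-parameter grid for the forest with a multi-parameter grid for the boosted trees:

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// RF: essentially one key knob, the number of features sampled at each node.
val rfGrid = new ParamGridBuilder()
  .addGrid(rf.featureSubsetStrategy, Array("sqrt", "log2", "onethird"))
  .build()

// GBM: several interacting knobs must be searched together.
val gbtGrid = new ParamGridBuilder()
  .addGrid(gbt.maxIter, Array(20, 50, 100))
  .addGrid(gbt.stepSize, Array(0.05, 0.1, 0.2))
  .addGrid(gbt.maxDepth, Array(2, 3, 5))
  .build()

// Cross-validate the larger GBM grid; the RF grid would be searched the same way.
val cv = new CrossValidator()
  .setEstimator(gbt)
  .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy"))
  .setEstimatorParamMaps(gbtGrid)
  .setNumFolds(3)

val bestGbtModel = cv.fit(trainingDF)
```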

As with most "which is better" questions in data science, choosing between an RF and a GBM is open-ended and depends heavily on the task and the dataset.
