Our first model – decision tree

Our first attempt at distinguishing the Higgs-Boson from background noise will use a decision tree algorithm. We purposely refrain from explaining the intuition behind this algorithm, as it is already well documented elsewhere (http://www.saedsayad.com/decision_tree.htm, http://spark.apache.org/docs/latest/mllib-decision-tree.html). Instead, we will focus on the hyper-parameters and on how to interpret the model's efficacy with respect to certain criteria and error measures. Let's start with the basic parameters:

val numClasses = 2 
val categoricalFeaturesInfo = Map[Int, Int]() 
val impurity = "gini" 
val maxDepth = 5 
val maxBins = 10 

Now we are explicitly telling Spark that we wish to build a decision tree classifier that looks to distinguish between two classes. Let's take a closer look at some of the hyper-parameters for our decision tree and see what they mean:

  • numClasses: How many classes are we trying to classify? In this example, we wish to distinguish between the Higgs-Boson particle and background noise, and thus there are two classes.
  • categoricalFeaturesInfo: A map from feature index to number of categories, in which we declare which features are categorical and should not be treated as numbers (ZIP code is a popular example). There are no categorical features in this dataset, so we pass an empty map.
  • impurity: A measure of the homogeneity of the labels at a node. Currently, Spark offers two impurity measures for classification, Gini and entropy, and one for regression, variance.
  • maxDepth: A stopping criterion that limits the depth of the constructed trees. Generally, deeper trees fit the training data more closely but run the risk of overfitting.
  • maxBins: Number of bins (think "values") used to discretize continuous features when the tree considers splits. Generally, increasing the number of bins allows the tree to consider more candidate split values but also increases computation time.
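
To make the impurity parameter more concrete, here is a minimal sketch of the two classification measures in plain Scala, using the standard textbook formulas (Gini as 1 - sum(p_i^2) and entropy as -sum(p_i * log2(p_i)), where p_i is the fraction of labels at the node that belong to class i). The object and method names (ImpuritySketch, proportions, gini, entropy) are hypothetical and chosen only for illustration; this is not Spark's internal implementation.

```scala
object ImpuritySketch {
  // Fraction of each distinct label among the samples that reach a node.
  def proportions(labels: Seq[Int]): Seq[Double] = {
    require(labels.nonEmpty, "a node must contain at least one label")
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map(_.size / n).toSeq
  }

  // Gini impurity: 1 - sum(p_i^2). Zero means the node holds a single class.
  def gini(labels: Seq[Int]): Double =
    1.0 - proportions(labels).map(p => p * p).sum

  // Entropy: -sum(p_i * log2(p_i)). Zero means the node holds a single class.
  def entropy(labels: Seq[Int]): Double =
    proportions(labels).map(p => -p * math.log(p) / math.log(2)).sum

  def main(args: Array[String]): Unit = {
    val pureNode  = Seq(1, 1, 1, 1) // only signal
    val mixedNode = Seq(0, 0, 1, 1) // half background, half signal
    println(s"pure:  gini=${gini(pureNode)}, entropy=${entropy(pureNode)}")
    println(s"mixed: gini=${gini(mixedNode)}, entropy=${entropy(mixedNode)}")
  }
}
```

Both measures are zero for a node that contains a single class and reach their maximum for a two-class node split evenly, which is why a tree prefers splits whose child nodes have lower impurity than the parent.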