Our first model – decision tree

Our first attempt at distinguishing the Higgs-Boson from background noise will use a decision tree algorithm. We deliberately refrain from explaining the intuition behind this algorithm, as it is already well documented elsewhere (http://www.saedsayad.com/decision_tree.htm, http://spark.apache.org/docs/latest/mllib-decision-tree.html). Instead, we will focus on the hyper-parameters and on how to interpret the model's efficacy with respect to certain criteria and error measures. Let's start with the basic parameters:

val numClasses = 2 
val categoricalFeaturesInfo = Map[Int, Int]() 
val impurity = "gini" 
val maxDepth = 5 
val maxBins = 10 

Now we are explicitly telling Spark that we wish to build a decision tree classifier that looks to distinguish between two classes. Let's take a closer look at some of the hyper-parameters for our decision tree and see what they mean:

  • numClasses: How many classes are we trying to classify? In this example, we wish to distinguish between the Higgs-Boson particle and background noise, and thus there are two classes.
  • categoricalFeaturesInfo: A specification whereby we declare what features are categorical features and should not be treated as numbers (for example, ZIP code is a popular example). There are no categorical features in this dataset that we need to worry about.
  • impurity: A measure of the homogeneity of the labels at the node. Currently in Spark, there are two measures of impurity for classification, Gini and entropy, and one for regression, variance.
  • maxDepth: A stopping criterion which limits the depth of constructed trees. Generally, deeper trees fit the training data more closely but run the risk of overfitting.
  • maxBins: Number of bins (think "values") for the tree to consider when making splits. Generally, increasing the number of bins allows the tree to consider more values but also increases computation time.
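
To make the impurity measure a little more concrete, the following is a minimal sketch in plain Scala of how Gini and entropy can be computed from the label counts at a single node. It is not Spark's internal implementation; the object name ImpuritySketch, the helper functions, and the example counts are all hypothetical and serve only as an illustration.

```scala
// Illustrative sketch only: impurity of a node computed from its per-class label counts.
// This is NOT Spark's internal code; all names and counts here are hypothetical.
object ImpuritySketch {

  // Gini impurity: 1 - sum(p_i^2). It is 0.0 when every label at the node is the same.
  def gini(counts: Seq[Long]): Double = {
    val total = counts.sum.toDouble
    if (total == 0.0) 0.0
    else 1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  // Entropy in bits: sum(p_i * log2(1 / p_i)), skipping classes with no samples.
  def entropy(counts: Seq[Long]): Double = {
    val total = counts.sum.toDouble
    if (total == 0.0) 0.0
    else counts.filter(_ > 0).map { c =>
      val p = c / total
      p * math.log(1.0 / p) / math.log(2.0)
    }.sum
  }

  def main(args: Array[String]): Unit = {
    val mixed = Seq(50L, 50L)  // hypothetical node: half signal, half background
    val pure  = Seq(100L, 0L)  // hypothetical node: a single class only
    println(s"mixed node: gini=${gini(mixed)}, entropy=${entropy(mixed)}")
    println(s"pure node:  gini=${gini(pure)}, entropy=${entropy(pure)}")
  }
}
```

For a two-class problem, an evenly mixed node is the most impure case (Gini of 0.5, entropy of 1 bit), while a node holding a single class has an impurity of zero under both measures. A tree learner prefers splits that reduce impurity the most. In MLlib's RDD-based API, the five values defined earlier are typically passed, together with an RDD[LabeledPoint] of training data, to DecisionTree.trainClassifier.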