
Random forest model

Now, let's try building a random forest using 10 decision trees.

import org.apache.spark.mllib.tree.RandomForest

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 5
val maxBins = 10
val seed = 42

val rfModel = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
 

Just like with our single decision tree model, we start by declaring the hyper-parameters, many of which should already be familiar from the decision tree example. The preceding code creates a random forest of 10 trees to solve a two-class problem. One key difference is the feature subset strategy, described as follows:

The featureSubsetStrategy parameter gives the number of features to use as candidates for making splits at each node. It can either be a fraction (for example, 0.5) or a function of the number of features in your dataset. The setting auto lets the algorithm choose this number for you, but a common soft rule is to use the square root of the number of features you have.
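The square-root soft rule mentioned above can be sketched in plain Scala. Note that numFeatures = 16 here is an assumed example value, not a property of our dataset:

```scala
// Hypothetical illustration of the square-root soft rule for
// choosing the number of candidate features per split.
val numFeatures = 16
val candidatesPerSplit = math.sqrt(numFeatures).ceil.toInt
println(s"Consider $candidatesPerSplit of $numFeatures features at each split")
// With 16 features, the rule suggests 4 candidate features per split.
```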

Now that we have trained our model, let's score it against our hold-out set and compute the total error:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import scala.language.reflectiveCalls

// Works with any model exposing predict(Vector): Double, for example
// DecisionTreeModel and RandomForestModel
def computeError(model: { def predict(features: Vector): Double },
                 data: RDD[LabeledPoint]): Double = {
  val labelAndPreds = data.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
}

val rfTestErr = computeError(rfModel, testData)
println(f"RF Model: Test Error = ${rfTestErr}%.3f")
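The error computation itself is just the fraction of mismatched (label, prediction) pairs. A minimal sketch of the same logic on plain Scala collections, using made-up pairs instead of an RDD, looks like this:

```scala
// Hypothetical (label, prediction) pairs; one of the four is misclassified.
val labelAndPreds = Seq((1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0))
val testErr =
  labelAndPreds.count { case (label, pred) => label != pred }.toDouble / labelAndPreds.size
println(f"Test Error = $testErr%.3f")  // 1 mismatch out of 4 -> 0.250
```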

The output is as follows:

We can also compute the AUC by using the previously defined method computeMetrics:
val rfMetrics = computeMetrics(rfModel, testData) 
println(f"RF Model: AUC on Test Data = ${rfMetrics.areaUnderROC}%.3f") 

Our random forest, where we hardcoded the hyper-parameters, performs much better than our single decision tree with respect to both overall model error and AUC. In the next section, we will introduce the concept of a grid search and show how we can vary hyper-parameter values and combinations and measure their impact on model performance.

Again, results can differ slightly between runs. However, in contrast to the decision tree, a run can be made deterministic by passing a seed as a parameter to the RandomForest.trainClassifier method.
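The reason a fixed seed makes runs reproducible is the usual property of pseudo-random generators: the same seed always produces the same sequence of draws. A small sketch of this, using Scala's standard Random rather than Spark itself:

```scala
import scala.util.Random

// Two generators seeded identically (42, as in our model above)
// produce identical draws, so any seeded sampling is reproducible.
val a = new Random(42).nextInt(1000)
val b = new Random(42).nextInt(1000)
println(a == b)  // true: identical seeds give identical sequences
```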