- Mastering Machine Learning with Spark 2.x
- Alex Tellez, Max Pumperla, Michal Malohlava
Random forest model
Now, let's try building a random forest using 10 decision trees.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 5
val maxBins = 10
val seed = 42

val rfModel = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins, seed)
Just like the single decision tree model, we start by declaring the hyper-parameters, most of which should already be familiar from the decision tree example. In the preceding code, we create a random forest of 10 trees to solve a two-class problem. One setting that is new is the feature subset strategy, described as follows:
The featureSubsetStrategy parameter controls the number of features to consider as candidates for making splits at each node. It can be either a fraction (for example, 0.5) or a function of the number of features in your dataset. The setting auto lets the algorithm choose this number for you, but a common rule of thumb is to use the square root of the number of features.
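For instance, if you prefer to make the square-root rule explicit rather than relying on auto, you can pass "sqrt" as the strategy directly. The following is a minimal variation of the preceding call; the model name rfModelSqrt is ours, and every other value is reused from the declarations above:

// Variation of the preceding call: request the square-root strategy
// explicitly instead of letting "auto" choose it for a multi-tree forest.
val rfModelSqrt = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, "sqrt", impurity,
  maxDepth, maxBins, seed)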
Now that we have trained our model, let's score it against our hold-out set and compute the total error:
def computeError(model: Predictor, data: RDD[LabeledPoint]): Double = {
  val labelAndPreds = data.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  labelAndPreds.filter(r => r._1 != r._2).count.toDouble / data.count
}

val rfTestErr = computeError(rfModel, testData)
println(f"RF Model: Test Error = ${rfTestErr}%.3f")
The output is as follows:

We also compute the AUC using the previously defined computeMetrics method:
val rfMetrics = computeMetrics(rfModel, testData)
println(f"RF Model: AUC on Test Data = ${rfMetrics.areaUnderROC}%.3f")

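The computeMetrics helper was defined earlier in the chapter. If you are reading this section in isolation, a minimal sketch of such a helper, assuming it simply wraps Spark's BinaryClassificationMetrics around the scored hold-out set and reuses the same Predictor type as computeError, might look like this:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch of computeMetrics: score each point, pair the
// prediction with its label, and hand the pairs to
// BinaryClassificationMetrics so that areaUnderROC can be read off.
def computeMetrics(model: Predictor, data: RDD[LabeledPoint]): BinaryClassificationMetrics = {
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  new BinaryClassificationMetrics(scoreAndLabels)
}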
Our random forest, with its hardcoded hyper-parameters, performs much better than our single decision tree with respect to both overall model error and AUC. In the next section, we will introduce the concept of a grid search and show how to try different hyper-parameter values and combinations and measure their impact on model performance.
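To give a flavour of what is coming, the following hand-rolled sketch loops over a small, illustrative grid of numTrees and maxDepth values and reuses computeError to compare the resulting models. The grid values and names here are ours, not the ones used in the next section:

// Illustrative, hand-rolled grid search over two hyper-parameters,
// reusing the values and helpers declared earlier in this section.
val gridResults = for {
  candidateTrees <- Seq(10, 25, 50)
  candidateDepth <- Seq(3, 5, 7)
} yield {
  val candidate = RandomForest.trainClassifier(trainingData, numClasses,
    categoricalFeaturesInfo, candidateTrees, featureSubsetStrategy,
    impurity, candidateDepth, maxBins, seed)
  (candidateTrees, candidateDepth, computeError(candidate, testData))
}

// Report the combination with the lowest hold-out error.
val (bestTrees, bestDepth, bestErr) = gridResults.minBy(_._3)
println(f"Best RF: numTrees=$bestTrees, maxDepth=$bestDepth, error=$bestErr%.3f")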