
Preprocessing and feature engineering

As per the dataset description on the UCI machine learning repository, there are no null values. Also, Spark ML-based classifiers expect numeric values for modeling. Fortunately, as seen in the schema, all the required fields are already numeric (that is, either integers or floating-point values). In addition, Spark ML algorithms expect a label column, which in our case is Result_of_Treatment. Let's rename it to label using the Spark-provided withColumnRenamed() method:

// Spark ML algorithms expect a 'label' column, which in our case is 'Result_of_Treatment'. Let's rename it to 'label'
CryotherapyDF = CryotherapyDF.withColumnRenamed("Result_of_Treatment", "label")
CryotherapyDF.printSchema()
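
Although the dataset description already states that there are no null values, a quick sanity check is cheap. The following is a minimal sketch on our part (the column-wise null count is not part of the original workflow), assuming CryotherapyDF has been loaded as shown earlier:

import org.apache.spark.sql.functions.{col, count, when}

// Count null entries per column; every value should be 0 for this dataset
val nullCounts = CryotherapyDF.columns.map(c => count(when(col(c).isNull, c)).alias(c))
CryotherapyDF.select(nullCounts: _*).show()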

All Spark ML-based classifiers expect the training data to contain two columns called label (which we already have) and features. We have seen that we have six features, but they have to be assembled into a single feature vector. This can be done using VectorAssembler(), a transformer from the Spark ML library. First, we need to select all the columns except the label column:

val selectedCols = Array("sex", "age", "Time", "Number_of_Warts", "Type", "Area")
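
Alternatively (a sketch on our part, assuming the rename above has already been applied), the same array can be derived from the schema instead of being listed by hand, so it stays in sync with the DataFrame:

// Alternative (assumed sketch): take every column except the label
val selectedCols = CryotherapyDF.columns.filterNot(_ == "label")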

Then we instantiate a VectorAssembler() transformer and transform as follows:

import org.apache.spark.ml.feature.VectorAssembler

val vectorAssembler = new VectorAssembler()
  .setInputCols(selectedCols)
  .setOutputCol("features")
val numericDF = vectorAssembler.transform(CryotherapyDF)
  .select("label", "features")
numericDF.show()

As expected, the last line of the preceding code segment shows the assembled DataFrame with the label and features columns, which are what we need to train an ML algorithm.
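
If you want to confirm the assembly programmatically, a minimal sketch (on our part, reusing the variable names from the snippet above) is to inspect the schema and the first row's feature vector:

// Sanity check (a sketch): the assembled column should be a 6-element vector
numericDF.printSchema()
val firstVector = numericDF.select("features").head.getAs[org.apache.spark.ml.linalg.Vector](0)
println(s"Feature vector size: ${firstVector.size}") // expected to be 6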
