Implementing linear regression through Apache Spark
You are likely interested in training regression models on datasets too large to handle with scikit-learn on a single machine. Apache Spark is a good candidate for this scenario. As we mentioned in the previous chapter, Apache Spark can easily run training algorithms on a cluster of machines using Elastic MapReduce (EMR) on AWS; we will explain how to set up EMR clusters in the next chapter. In this section, we'll explain how you can use the Spark ML library to train a linear regression model:
- The first step is to create a dataframe from our training data:
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)
The following screenshot shows the first few rows of the dataset:

[Screenshot: first few rows of housing_df]
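Note that the sql reader object and SRC_PATH are defined earlier in the book. If you are running this section on its own, a minimal sketch of that setup might look as follows (the app name and data path here are placeholder assumptions, not the book's values):
from pyspark.sql import SparkSession
# Minimal standalone setup; the book creates these objects in an earlier section.
spark = SparkSession.builder.appName('housing-regression').getOrCreate()
sql = spark  # a SparkSession exposes the same read.csv() API used above
SRC_PATH = 'data/'  # placeholder: directory containing train.csv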
- Typically, Apache Spark requires the input dataset to have a single column containing a vector of numbers that represents all the training features. In Chapter 2, Classifying Twitter Feeds with Naive Bayes, we used CountVectorizer to create such a column. In this chapter, since the feature values are already available as numeric columns in our dataset, we just need to assemble them into a single vector column using a VectorAssembler transformer:
from pyspark.ml.feature import VectorAssembler
training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
                     'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']
vector_assembler = VectorAssembler(inputCols=training_features,
                                   outputCol="features")
df_with_features_vector = vector_assembler.transform(housing_df)
The following screenshot shows the first few rows of the df_with_features_vector dataset:

[Screenshot: first few rows of df_with_features_vector]
Note how the vector assembler created a new column called features, which packs all the features used for training into a single vector per row.
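If you want to verify the assembled column yourself, one illustrative way (using standard DataFrame methods, not a step from the book) is to print a few rows with truncation disabled so the full vectors are visible:
df_with_features_vector.select('features', 'medv').show(3, truncate=False)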
- As usual, we will split our dataframe into training and testing sets:
train_df, test_df = df_with_features_vector.randomSplit([0.8, 0.2],
                                                        seed=17)
- We can now instantiate our regressor and fit a model:
from pyspark.ml.regression import LinearRegression
linear = LinearRegression(featuresCol="features", labelCol="medv")
linear_model = linear.fit(train_df)
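Before generating predictions, it can be useful to sanity-check what was learned. As an optional, illustrative step, the fitted model exposes its parameters and a training summary:
print(linear_model.coefficients)  # one weight per entry in training_features
print(linear_model.intercept)
print(linear_model.summary.r2)  # R2 on the training data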
- Using this model, we can generate predictions for each row in the test dataset:
predictions_df = linear_model.transform(test_df)
predictions_df.show(3)
The output of the show() command is as follows:

[Output: first three rows of predictions_df, including the new prediction column]
- We can easily find the R2 value by using RegressionEvaluator:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="medv",
                                predictionCol="prediction",
                                metricName="r2")
evaluator.evaluate(predictions_df)
In this case, we get an R2 of 0.688, which is similar to the result we obtained with scikit-learn.
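The evaluator is not limited to R2; changing metricName yields other standard regression metrics. As a brief illustrative example (not part of the book's walkthrough), the following computes the root mean squared error on the same predictions:
rmse_evaluator = RegressionEvaluator(labelCol="medv",
                                     predictionCol="prediction",
                                     metricName="rmse")
print(rmse_evaluator.evaluate(predictions_df))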