- Mastering Machine Learning on AWS
- Dr. Saket S.R. Mengle Maximo Gurmendez
- 355字
- 2021-06-24 14:23:19
Implementing linear regression through Apache Spark
You are likely interested in training regression models that can take huge datasets as input, beyond what you can do in scikit-learn. Apache Spark is a good candidate for this scenario. As we mentioned in the previous chapter, Apache Spark can easily run training algorithms on a cluster of machines using Elastic MapReduce (EMR) on AWS. We will explain how to set up EMR clusters in the next chapter. In this section, we'll explain how you can use the Spark ML library to train linear regression algorithms:
- The first step is to create a dataframe from our training data:
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)
The following screenshot shows the first few rows of the dataset:

- Typically, Apache Spark requires the input dataset to have a single column with a vector of numbers representing all the training features. In Chapter 2, Classifying Twitter Feeds with Naive Bayes, we used CountVectorizer to create such a column. In this chapter, since the vector values are already available in our dataset, we just need to construct such a column using a VectorAssembler transformer:
from pyspark.ml.feature import VectorAssembler
training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']
vector_assembler = VectorAssembler(inputCols=training_features,
outputCol="features")
df_with_features_vector = vector_assembler.transform(housing_df)
The following screenshot shows the first few rows of the df_with_features_vector dataset:

Note how the vector assembler created a new column called features, which assembles all the features that are used for training as vectors.
- As usual, we will split our dataframe into testing and training:
train_df, test_df = df_with_features_vector.randomSplit([0.8, 0.2],
seed=17)
- We can now instantiate our regressor and fit a model:
from pyspark.ml.regression import LinearRegression
linear = LinearRegression(featuresCol="features", labelCol="medv")
linear_model = linear.fit(train_df)
- By using this model, we find predictions for each value in the test dataset:
predictions_df = linear_model.transform(test_df)
predictions_df.show(3)
The output of the show() command is as follows:

- We can easily find the R2 value by using RegressionEvaluator:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="medv",
predictionCol="prediction",
metricName="r2")
evaluator.evaluate(predictions_df)
In this case, we get an R2 of 0.688, which is a similar result to that of scikit-learn.
- INSTANT Wijmo Widgets How-to
- 現(xiàn)代辦公設(shè)備使用與維護
- 計算機維修與維護技術(shù)速成
- 基于Apache Kylin構(gòu)建大數(shù)據(jù)分析平臺
- 電腦高級維修及故障排除實戰(zhàn)
- BeagleBone Robotic Projects
- 微型計算機系統(tǒng)原理及應用:國產(chǎn)龍芯處理器的軟件和硬件集成(基礎(chǔ)篇)
- 微服務架構(gòu)基礎(chǔ)(Spring Boot+Spring Cloud+Docker)
- The Applied Artificial Intelligence Workshop
- Advanced Machine Learning with R
- MicroPython Cookbook
- 快·易·通:2天學會電腦組裝·系統(tǒng)安裝·日常維護與故障排除
- Hands-On Unsupervised Learning with Python
- SOA架構(gòu):服務和微服務分析及設(shè)計(原書第2版)
- 3D打印:從全面了解到親手制作(全彩版)