
Implementing linear regression through Apache Spark

You are likely interested in training regression models on datasets that are too large to handle with scikit-learn on a single machine. Apache Spark is a good candidate for this scenario. As we mentioned in the previous chapter, Apache Spark can easily run training algorithms on a cluster of machines using Elastic MapReduce (EMR) on AWS. We will explain how to set up EMR clusters in the next chapter. In this section, we'll explain how you can use the Spark ML library to train linear regression algorithms:

  1. The first step is to create a dataframe from our training data:
# sql is the Spark session handle created earlier in the book;
# SRC_PATH points to the directory that holds the training data
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)

The following screenshot shows the first few rows of the dataset:
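
Note that the read.csv snippet assumes that sql (the Spark session handle) and SRC_PATH were defined earlier in the book's notebook. If you are starting from a fresh session, a minimal sketch of that setup might look as follows; the application name and data path here are assumptions, not values from the book:

from pyspark.sql import SparkSession

# Assumed setup: the book's notebook defines these elsewhere
sql = SparkSession.builder.appName('linear-regression').getOrCreate()
SRC_PATH = './data/'  # assumed location of train.csv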

  2. Typically, Apache Spark requires the input dataset to have a single column containing a vector of numbers that represents all the training features. In Chapter 2, Classifying Twitter Feeds with Naive Bayes, we used CountVectorizer to create such a column. In this chapter, since the vector values are already available in our dataset, we just need to construct such a column using a VectorAssembler transformer:
from pyspark.ml.feature import VectorAssembler

training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
                     'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']

vector_assembler = VectorAssembler(inputCols=training_features,
                                   outputCol="features")

df_with_features_vector = vector_assembler.transform(housing_df)

The following screenshot shows the first few rows of the df_with_features_vector dataset:

Note how the vector assembler created a new column called features, which packs all of the training features into a single vector per row.
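
To double-check what the assembler produced, you can inspect the new column directly; the following one-liner (standard DataFrame API, shown here as a quick sanity check) prints the assembled vectors alongside the medv label:

df_with_features_vector.select('features', 'medv').show(3, truncate=False)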

  3. As usual, we will split our dataframe into training and testing sets:
train_df, test_df = df_with_features_vector.randomSplit([0.8, 0.2],
                                                        seed=17)
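
If you want to sanity-check the split, counting the rows in each dataframe is a cheap way to do it (note that count() triggers a Spark job):

# Roughly 80% and 20% of the rows should land in each split
print(train_df.count(), test_df.count())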

  4. We can now instantiate our regressor and fit a model:
from pyspark.ml.regression import LinearRegression

linear = LinearRegression(featuresCol="features", labelCol="medv")
linear_model = linear.fit(train_df)
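
Before evaluating on the test set, it can be helpful to inspect the fitted model. The attributes below are part of Spark ML's linear regression model; the printout itself is just a sketch:

print(linear_model.coefficients)  # one weight per training feature
print(linear_model.intercept)  # the fitted bias term
print(linear_model.summary.rootMeanSquaredError)  # RMSE on the training data
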
  5. By using this model, we generate predictions for each row in the test dataset:
predictions_df = linear_model.transform(test_df)
predictions_df.show(3)

The output of the show() command is as follows:

  6. We can easily find the R2 value by using RegressionEvaluator:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="medv",
                                predictionCol="prediction",
                                metricName="r2")
evaluator.evaluate(predictions_df)
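
The same evaluator instance can report other metrics by overriding its metricName parameter at evaluation time; for example, the following returns the test-set RMSE instead:

# Override the metric for a single call without mutating the evaluator
evaluator.evaluate(predictions_df, {evaluator.metricName: "rmse"})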

In this case, we get an R2 of 0.688, which is similar to the result we obtained with scikit-learn.
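
As a closing note, the assembling and regression steps can also be chained into a single estimator. The following sketch (not part of the original walkthrough) wraps the same two stages in a Spark ML Pipeline, which becomes convenient once more preprocessing stages are added:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Same stages as above, chained so that fit() assembles and trains in one go
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=training_features, outputCol="features"),
    LinearRegression(featuresCol="features", labelCol="medv"),
])

# Split the raw dataframe; the assembler now runs inside the pipeline
raw_train_df, raw_test_df = housing_df.randomSplit([0.8, 0.2], seed=17)
pipeline_model = pipeline.fit(raw_train_df)
pipeline_predictions_df = pipeline_model.transform(raw_test_df)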
