Implementing linear regression through Apache Spark

You are likely interested in training regression models on datasets that are too large for scikit-learn to handle. Apache Spark is a good candidate for this scenario. As we mentioned in the previous chapter, Apache Spark can easily run training algorithms on a cluster of machines using Elastic MapReduce (EMR) on AWS. We will explain how to set up EMR clusters in the next chapter. In this section, we'll explain how you can use the Spark ML library to train linear regression models:
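The snippets that follow assume a SparkSession is available under the variable name sql, matching the code in this section. A minimal sketch of how such a session might be created (the application name is an arbitrary choice for illustration):

from pyspark.sql import SparkSession

# Hypothetical setup; in the book's environment this session is created earlier.
sql = SparkSession.builder.appName("linear-regression").getOrCreate()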

  1. The first step is to create a dataframe from our training data:
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)

The following screenshot shows the first few rows of the dataset:
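If you are running the code yourself, you can reproduce this view with show(); the choice of five rows is just illustrative:

housing_df.show(5)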

  2. Typically, Apache Spark requires the input dataset to have a single column containing a vector of numbers that represents all the training features. In Chapter 2, Classifying Twitter Feeds with Naive Bayes, we used CountVectorizer to create such a column. In this chapter, since the feature values are already available in our dataset, we just need to construct this column using a VectorAssembler transformer:
from pyspark.ml.feature import VectorAssembler

training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
                     'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']

vector_assembler = VectorAssembler(inputCols=training_features,
                                   outputCol="features")

df_with_features_vector = vector_assembler.transform(housing_df)

The following screenshot shows the first few rows of the df_with_features_vector dataset:

Note how the vector assembler created a new column called features, which assembles all the training features into a single vector per row.
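To verify this, you can inspect the assembled column directly; passing truncate=False (optional) prints the full vectors:

df_with_features_vector.select("features").show(3, truncate=False)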

  3. As usual, we will split our dataframe into training and testing sets:
train_df, test_df = df_with_features_vector.randomSplit([0.8, 0.2],
                                                        seed=17)
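The 80/20 proportions and the fixed seed come from the original code; as a quick sanity check (our addition), you can confirm the split sizes:

print(train_df.count(), test_df.count())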

  4. We can now instantiate our regressor and fit a model:
from pyspark.ml.regression import LinearRegression

linear = LinearRegression(featuresCol="features", labelCol="medv")
linear_model = linear.fit(train_df)
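Before moving on, it can be instructive to inspect the fitted model. The attributes below belong to Spark's LinearRegressionModel API; printing them is a sanity check we are adding here, not a step from the original walkthrough:

# Learned weights, one per feature, in the order passed to VectorAssembler
print(linear_model.coefficients)
print(linear_model.intercept)
# Training-time metrics are also exposed on the model summary
print(linear_model.summary.r2)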
  5. Using this model, we can obtain predictions for each row in the test dataset:
predictions_df = linear_model.transform(test_df)
predictions_df.show(3)

The output of the show() command is as follows:
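The predictions dataframe keeps the original columns and adds a prediction column. To compare predicted against actual values side by side, you could select just those two columns (a convenience we are adding, not a required step):

predictions_df.select("medv", "prediction").show(3)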

  6. We can easily find the R2 value by using RegressionEvaluator:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="medv",
                                predictionCol="prediction",
                                metricName="r2")
evaluator.evaluate(predictions_df)

In this case, we get an R2 of 0.688, which is similar to the result we obtained with scikit-learn.
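R2 is not the only metric RegressionEvaluator supports. If you also want an error measure in the units of the target variable, you can evaluate the root mean squared error the same way; this extra step is our suggestion rather than part of the original walkthrough:

rmse_evaluator = RegressionEvaluator(labelCol="medv",
                                     predictionCol="prediction",
                                     metricName="rmse")
print(rmse_evaluator.evaluate(predictions_df))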
