- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
Predicting house prices in Boston
Now that we understand how linear regression works, let's move on to looking at a real dataset where we can demonstrate a more practical use case.
The Boston dataset is a small set representing the house prices in the city of Boston. It contains 506 samples and 13 features. Let's load the data into a DataFrame, as follows:
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
df_dataset = pd.DataFrame(
    boston.data,
    columns=boston.feature_names,
)
df_dataset['target'] = boston.target
Data exploration
It's important to make sure you do not have any null values in your data; otherwise, scikit-learn will complain about it. Here, I will count the sum of the null values in each column, then take the sum of it. If I get 0, then I am a happy man:
df_dataset.isnull().sum().sum() # Luckily, the result is zero
For a regression problem, the most important thing to do is to understand the distribution of your target. If a target ranges between 1 and 10, and after training our model we get a mean absolute error of 5, we can tell that the error is large in this context.
However, the same error for a target that ranges between 500,000 and 1,000,000 is negligible. Histograms are your friend when you want to visualize distributions. In addition to the target's distribution, let's also plot the mean values for each feature:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(16, 8))
df_dataset['target'].plot(
    title='Distribution of target prices', kind='hist', ax=axs[0]
)
df_dataset[boston.feature_names].mean().plot(
    title='Mean of features', kind='bar', ax=axs[1]
)
fig.show()
This gives us the following graphs:

From the preceding graphs, we can observe the following:
- The prices range between 5 and 50. Obviously, these are not real prices, probably normalized values, but this doesn't matter for now.
- Furthermore, we can tell from the histogram that most of the prices are below 35. We can use the following code snippet to see that 90% of the prices are below 34.8:
df_dataset['target'].describe(percentiles=[.9, .95, .99])
You can always go deeper with your data exploration, but we will stop here on this occasion.
Splitting the data
When it comes to small datasets, it's advised that you allocate enough data for testing. So, we will split our data into 60% for training and 40% for testing using the train_test_split function:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df_dataset, test_size=0.4)
x_train = df_train[boston.feature_names]
x_test = df_test[boston.feature_names]
y_train = df_train['target']
y_test = df_test['target']
Once you have the training and test sets, we split each of them further into an x set (the features) and a y set (the target), as shown above. Then, we are ready to move to the next step.
Calculating a baseline
The distribution of the target gave us an idea of what level of error we can tolerate. Nevertheless, it is always useful to compare our final model to something. If we were in the real estate business and human agents were used to estimate house prices, then we would most likely be expected to build a model that can do better than the human agents. Nevertheless, since we do not know any real estimations to compare our model to, we can come up with our own baseline instead. The mean house price is 22.5. If we build a dummy model that returns the mean price regardless of the data given to it, then it would make a reasonable baseline.
Keep in mind that the value of 22.5 is calculated for the entire dataset, but since we are pretending to only have access to the training data, then it makes sense to calculate the mean price for the training set only. To save us all this effort, scikit-learn has dummy regressors available that do all this work for us.
Here, we will create a dummy regressor and use it to calculate baseline predictions for the test set:
from sklearn.dummy import DummyRegressor
baselin = DummyRegressor(strategy='mean')
baselin.fit(x_train, y_train)
y_test_baselin = baselin.predict(x_test)
There are other strategies that we can use, such as finding the median (the 50th percentile) or any other Nth percentile. Keep in mind that for the same data, using the mean as the estimation gives a lower MSE than using the median, while the median gives a lower Mean Absolute Error (MAE). We want our model to beat the baseline for both the MAE and MSE.
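To see this effect in practice, here is a small check of my own (not from the book) that scores the two constant predictions against the training target; the mean wins on MSE and the median wins on MAE:
import numpy as np

# The training target and two constant predictions: its mean and its median
y = y_train.values

def mse(pred):
    return ((y - pred) ** 2).mean()

def mae(pred):
    return np.abs(y - pred).mean()

print('MSE: mean = {:.2f} vs median = {:.2f}'.format(mse(y.mean()), mse(np.median(y))))
print('MAE: mean = {:.2f} vs median = {:.2f}'.format(mae(y.mean()), mae(np.median(y))))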
Training the linear regressor
Isn't the code for the baseline model almost identical to the one for the actual models? That's the beauty of scikit-learn's API. It means that when we decide to try a different algorithm—say, the decision tree algorithm from the previous chapter—we only need to change a few lines of code. Anyway, here is the code for the linear regressor:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)
y_test_pred = reg.predict(x_test)
We are going to stick to the default configuration for now.
Evaluating our model's accuracy
There are three commonly used metrics for regression: R2, MAE, and MSE. Let's first write the code that calculates the three metrics and prints the results:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
print(
    'R2 Regressor = {:.2f} vs Baseline = {:.2f}'.format(
        r2_score(y_test, y_test_pred),
        r2_score(y_test, y_test_baselin)
    )
)
print(
    'MAE Regressor = {:.2f} vs Baseline = {:.2f}'.format(
        mean_absolute_error(y_test, y_test_pred),
        mean_absolute_error(y_test, y_test_baselin)
    )
)
print(
    'MSE Regressor = {:.2f} vs Baseline = {:.2f}'.format(
        mean_squared_error(y_test, y_test_pred),
        mean_squared_error(y_test, y_test_baselin)
    )
)
Here are the results we get:
R2 Regressor = 0.74 vs Baseline = -0.00
MAE Regressor = 3.19 vs Baseline = 6.29
MSE Regressor = 19.70 vs Baseline = 76.11
By now, you should already know how MAE and MSE are calculated. Just keep in mind that MSE is more sensitive to outliers than MAE. That's why the mean estimations for the baseline scored badly there. As for the R2, let's look at its formula:
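$$R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$$
Here, $y_i$ are the actual values, $\hat{y}_i$ are the predicted values, and $\bar{y}$ is the mean of the actual values; this is the same definition used by scikit-learn's r2_score.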
Here's an explanation of the preceding formula:
- The numerator probably reminds you of MSE. We basically calculate the squared differences between all the predicted values and their corresponding actual values.
- As for the denominator, we use the mean of the actual values as pseudo estimations.
- Basically, this metric tells us how much better our predictions are compared to using the target's mean as an estimation.
- An R2 score of 1 is the best we could get, and a score of 0 means that we offered no additional value in comparison to using a biased model that just relies on the mean as an estimation.
- A negative score means that we should throw our model in the trash and use the target's mean instead.
- Obviously, in the baseline model, we already used the target's mean as the prediction. That's why its R2 score is 0.
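As a quick sanity check (a sketch of my own, not from the book), you can compute R2 by hand and confirm that it matches scikit-learn's r2_score:
ss_residual = ((y_test - y_test_pred) ** 2).sum()   # squared errors of our predictions
ss_total = ((y_test - y_test.mean()) ** 2).sum()    # squared errors of the mean "prediction"
print(1 - ss_residual / ss_total)                   # same value as r2_score(y_test, y_test_pred)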
Now, if we compare the scores, it is clear that our model did better than the baseline on all three metrics. Congratulations!
Showing feature coefficients
We know that a linear model multiplies each of the features by a certain coefficient, and then takes the sum of these products as its final prediction. Once the model is trained, we can use the regressor's coef_ attribute to print these coefficients:
df_feature_importance = pd.DataFrame(
    {
        'Features': x_train.columns,
        'Coeff': reg.coef_,
        'ABS(Coeff)': abs(reg.coef_),
    }
).set_index('Features').sort_values('Coeff', ascending=False)
As we can see in these results, some coefficients are positive and others are negative. A positive coefficient means that the feature correlates positively with the target and vice versa. I also added another column for the absolute values of the coefficients:

In the preceding screenshot, we can observe the following:
- Ideally, the value for each coefficient should tell us how important each feature is. A higher absolute value, regardless of its sign, reflects high importance.
- However, I made a mistake here. If you check the data, you will notice that the maximum value for NOX is 0.87, while TAX goes up to 711 (you can verify this with the snippet after this list). This means that even if NOX is only marginally important, its coefficient still has to be large to compensate for the feature's small scale, while the coefficient for TAX will always be small relative to the feature's large values.
- So, we want to scale the features to keep them all in the comparable ranges. In the next section, we are going to see how to scale our features.
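Here is a quick way (my own check, not from the book) to see how different the feature scales are:
# Minimum and maximum of each feature; note the very different scales (e.g. NOX vs TAX)
print(df_dataset[boston.feature_names].agg(['min', 'max']).T)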
Scaling for more meaningful coefficients
scikit-learn has a number of scalers. We are going to use MinMaxScaler for now. Using it with its default configuration will squeeze the values of all the features into the range between 0 and 1. The scaler needs to be fitted first to learn the features' ranges. Fitting should be done on the training x set only. Then, we use the scaler's transform function to scale both the training and test x sets:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
reg = LinearRegression()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
reg.fit(x_train_scaled, y_train)
y_test_pred = reg.predict(x_test_scaled)
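For reference, with the default feature_range of (0, 1), MinMaxScaler applies the following transformation to each feature, using the minimum and maximum learned from the training set:
$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$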
There is a shorthand version of this code for fitting one dataset and then transforming it. In other words, the following uncommented line takes the place of the two commented ones:
# scaler.fit(x_train)
# x_train_scaled = scaler.transform(x_train)
x_train_scaled = scaler.fit_transform(x_train)
We will be using the fit_transform() function a lot from now on where needed.
Now that we have scaled our features and retrained the model, we can print the features and their coefficients again:
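The table can be reproduced with the same pattern we used earlier, this time reading the coefficients of the regressor trained on the scaled features (a small sketch of my own; df_feature_importance_scaled is just a name I picked):
df_feature_importance_scaled = pd.DataFrame(
    {
        'Features': x_train.columns,
        'Coeff': reg.coef_,
        'ABS(Coeff)': abs(reg.coef_),
    }
).set_index('Features').sort_values('Coeff', ascending=False)
print(df_feature_importance_scaled)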

Notice how NOX is less important now than before.
Adding polynomial features
Now that we know which features are the most important, we can plot the target against them to see how they correlate with it:
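One way to produce such plots (a sketch of my own, assuming the df_feature_importance_scaled DataFrame from the previous snippet) is to scatter the target against the features with the largest absolute coefficients:
top_features = (
    df_feature_importance_scaled['ABS(Coeff)'].sort_values(ascending=False).head(3).index
)
fig, axs = plt.subplots(1, 3, figsize=(16, 5))
for ax, feature in zip(axs, top_features):
    df_train.plot(kind='scatter', x=feature, y='target', ax=ax, title=f'target vs {feature}')
fig.show()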

In the preceding plots, we can observe the following:
- These plots don't seem to be very linear to me, and a linear model will not be able to capture this non-linearity.
- Although we cannot turn a linear model into a non-linear one, we can still transform the data instead.
- Think of it this way: if y is a function of x², we can either use a non-linear model, one that is capable of capturing the quadratic relation between x and y, or we can simply calculate x² and give it to a linear model instead of x (see the toy sketch after this list).
- Furthermore, linear regression algorithms do not capture feature interactions out of the box, so the current model cannot learn anything from the combined effect of multiple features.
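To make the x² idea concrete, here is a tiny toy example of my own (not from the book): a plain linear regressor fails on a purely quadratic relationship, but fits it perfectly once we hand it x² as an additional feature.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = (x ** 2).ravel()                                 # y is a function of x squared

print(LinearRegression().fit(x, y).score(x, y))      # ~0.0: a straight line cannot fit a parabola
x_quad = np.hstack([x, x ** 2])                      # add x squared as a second feature
print(LinearRegression().fit(x_quad, y).score(x_quad, y))  # ~1.0: now the problem is linear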
A polynomial transformation can solve both the non-linearity and the feature interaction issues for us. Given the original data, scikit-learn's polynomial transformer will transform the features into a higher-dimensional space (for example, it will add the quadratic and cubic values for each feature). Additionally, it will also add the products of each pair (or triplet) of features. PolynomialFeatures works in a similar fashion to the scaler we used earlier in this chapter. We are going to use its fit_transform() and transform() methods, as follows:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
x_train_poly = poly.fit_transform(x_train)
x_test_poly = poly.transform(x_test)
To get both the quadratic and cubic feature transformation, we set the degree parameter to 3.
One annoying thing about PolynomialFeatures is that it doesn't keep track of the DataFrame's column names. It replaces the feature names with x0, x1, x2, and so on. However, with our Python skills at hand, we can reclaim our column names. Let's do exactly that using the following block of code:
feature_translator = [
    (f'x{i}', feature) for i, feature in enumerate(x_train.columns, 0)
]

def translate_feature_names(s):
    # Iterate in reverse so that, for instance, 'x12' is replaced before 'x1';
    # otherwise 'x1' would also match inside 'x10', 'x11', and 'x12'
    for key, val in reversed(feature_translator):
        s = s.replace(key, val)
    return s

poly_features = [
    translate_feature_names(f) for f in poly.get_feature_names()
]
x_train_poly = pd.DataFrame(x_train_poly, columns=poly_features)
x_test_poly = pd.DataFrame(x_test_poly, columns=poly_features)
We can now use the newly derived polynomial features instead of the original ones.
Fitting the linear regressor with the derived features
"When I was six, my sister was half my age. Now I am 60 years old, how old is my sister?"
This is a puzzle found on the internet. If your answer is 30, then you forgot to fit an intercept into your linear regression model.
Now, we are ready to use our linear regressor with the newly transformed features. One thing to keep in mind is that the PolynomialFeatures transformer adds one additional column where all the values are 1. The coefficient this column gets after training is equivalent to the intercept. So, we will not fit an intercept this time, by setting fit_intercept=False when training our regressor:
from sklearn.linear_model import LinearRegression
reg = LinearRegression(fit_intercept=False)
reg.fit(x_train_poly, y_train)
y_test_pred = reg.predict(x_test_poly)
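If you want to double-check the claim about the column of ones, here is a quick sanity check of my own (not from the book):
# The first polynomial feature is the bias term, which get_feature_names labels '1'
print((x_train_poly.iloc[:, 0] == 1).all())   # True: a constant column of ones
print(reg.coef_[0])                           # its coefficient plays the role of the intercept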
Finally, as we print the R2, MAE, and MSE results, we face the following unpleasant surprise:
R2 Regressor = -84.887 vs Baseline = -0.0
MAE Regressor = 37.529 vs Baseline = 6.2
MSE Regressor = 6536.975 vs Baseline = 78.1
The regressor is way worse than before and even worse than the baseline. What did the polynomial features do to our model?
One major problem with the ordinary least squares regression algorithm is that it doesn't work well with highly correlated features (multicollinearity).
The polynomial feature transformation takes a kitchen-sink approach: we add the features, their squared and cubic values, and the products of the feature pairs and triples. This will very likely give us multiple highly correlated features, and this multicollinearity harms the model's performance. Furthermore, if you print the shape of x_train_poly, you will see that it has 303 samples and 560 features. This is another problem, known as the curse of dimensionality.
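As a rough illustration (my own sketch, not from the book), you can confirm both points by printing the shape of the derived features and looking at their pairwise correlations:
import numpy as np

print(x_train_poly.shape)        # (303, 560): far more features than before

corr = x_train_poly.corr().abs()                            # absolute pairwise correlations (the constant '1' column gives NaNs, which max() ignores)
off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))   # mask each feature's correlation with itself
print(off_diagonal.max().max())                             # very close to 1: some features are near duplicates of each other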
Thankfully, two centuries is long enough for people to find solutions to these two problems. Regularization is the solution we are going to have fun with in the next section.