- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
Understanding linear models
To be able to explain linear models well, I would like to start with an example where the solution can be found using a system of linear equations—a technique we all learned in school when we were around 12 years old. We will then see why this technique doesn't always work with real-life problems, and so a linear regression model is needed. Then, we will apply the regression model to a real-life regression problem and learn how to improve our solution along the way.
Linear equations
In this example, we have five passengers who have each taken a taxi trip. Here, we have a record of the distance each taxi covered in kilometers and the fare displayed on its meter at the end of each trip:

We know that taxi meters usually start with a certain amount and then add a fixed charge for each kilometer traveled. We can model the meter using the following equation:

Meter Value = A + B * Distance (in kilometers)

Here, A is the meter's starting value and B is the charge added per kilometer. We also know that with two unknowns, A and B, we only need two data samples to figure out that A is 5 and B is 2.5. We can also plot the formula with the values for A and B, as follows:

The line meets the y-axis at the value of A (5), which is why we call A the intercept, and its slope equals B (2.5).
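If we take any two of the recorded trips, we can recover A and B by solving the resulting pair of linear equations. Here is a minimal sketch using NumPy's equation solver; the two trips used here (2 kilometers costing 10 and 4 kilometers costing 15) are illustrative values chosen to be consistent with A being 5 and B being 2.5:
import numpy as np

# Each trip gives one equation of the form A + B * distance = meter value
# Two illustrative trips: 2 km -> 10 and 4 km -> 15
coefficients = np.array([
    [1, 2],
    [1, 4],
])
meter_values = np.array([10, 15])

A, B = np.linalg.solve(coefficients, meter_values)
print(A, B)  # 5.0 and 2.5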
The passengers didn't always have change, so they sometimes rounded up the amount shown on the meter to add a tip for the driver. Here is the data for the amount each passenger ended up paying:

After we add the tips, it's clear that the relationship between the distance traveled and the amount paid is no longer linear. The plot on the right-hand side shows that a straight line cannot be drawn to capture this relationship:

We now know that our usual method of solving equations will not work this time. Nevertheless, we can tell that there is still a line that can somewhat approximate this relationship. In the next section, we will use a linear regression algorithm to find this approximation.
Linear regression
Algorithms are all about objectives. Our objective earlier was to find a single line that goes through all the points in the graph. We have seen that this objective is not feasible if a linear relationship does not exist between the points. Therefore, we will use the linear regression algorithm since it has a different objective. The linear regression algorithm tries to find a line where the mean of the squared errors between the estimated points on the line and the actual points is minimal. Visually speaking, in the following graph, we want a dotted line that makes the average squared lengths of the vertical lines minimal:

The method of ordinary least squares is about two centuries old and it uses simple mathematics to estimate the parameters. That's why some may argue that this algorithm is not actually a machine learning one. Personally, I follow a more liberal approach when categorizing what is machine learning and what is not. As long as the algorithm automatically learns from data and we use that data to evaluate it, then for me, it falls within the machine learning paradigm.
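To make the least-squares objective concrete, the following sketch fits a line with NumPy and compares the mean of the squared errors of that line against an arbitrarily chosen alternative; the data points are illustrative, not the book's table:
import numpy as np

# Illustrative points that do not fall exactly on a straight line
x = np.array([2, 4, 6, 8, 10])
y = np.array([11, 15, 21, 25, 31])

# np.polyfit with degree 1 returns the least-squares slope and intercept
slope, intercept = np.polyfit(x, y, 1)

def mean_squared_error(a, b):
    # Mean of the squared vertical distances between the points and the line a + b * x
    return np.mean((y - (a + b * x)) ** 2)

print(mean_squared_error(intercept, slope))  # error of the least-squares line
print(mean_squared_error(4.0, 3.0))          # an arbitrary alternative line gives a larger error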
Estimating the amount paid to the taxi driver
Now that we know how linear regression works, let's take a look at how to estimate the amount paid to the taxi driver.
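The steps below assume a DataFrame called df_taxi holding the trip records. Since the book's table is not reproduced here, the following sketch builds an illustrative version of it: the Meter column follows the 5 plus 2.5 per kilometer rule from the previous section, and the Paid (incl. tips) column contains rounded-up amounts standing in for the tips:
import pandas as pd

# Illustrative stand-in for the book's taxi table
df_taxi = pd.DataFrame({
    'Kilometres': [2, 4, 6, 8, 10],
    'Meter': [10.0, 15.0, 20.0, 25.0, 30.0],              # 5 + 2.5 per kilometer
    'Paid (incl. tips)': [11.0, 15.0, 21.0, 25.0, 31.0],  # some amounts rounded up as tips
})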
- Let's use scikit-learn to build a regression model to estimate the amount paid to the taxi driver:
from sklearn.linear_model import LinearRegression
# Initialize and train the model
reg = LinearRegression()
reg.fit(df_taxi[['Kilometres']], df_taxi['Paid (incl. tips)'])
# Make predictions
df_taxi['Paid (Predicted)'] = reg.predict(df_taxi[['Kilometres']])
Clearly, scikit-learn has a consistent interface. We have used the same fit() and predict() methods as in the previous chapter, but this time with the LinearRegression object.
We only have one feature this time, Kilometres; nevertheless, the fit() and predict() methods expect a two-dimensional array, which is why we enclosed Kilometres in an extra set of square brackets, as in df_taxi[['Kilometres']].
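The difference is easy to check: single brackets return a one-dimensional Series, while double brackets return a two-dimensional DataFrame with a single column. Using the illustrative df_taxi from above:
print(df_taxi['Kilometres'].shape)    # (5,)  -> one-dimensional Series
print(df_taxi[['Kilometres']].shape)  # (5, 1) -> two-dimensional DataFrame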
- We put our predictions in the same DataFrame under Paid (Predicted). We can then plot the actual values versus the estimated ones using the following code:
import matplotlib.pyplot as plt

# Plot the meter readings and the actual versus estimated amounts paid, side by side
fig, axs = plt.subplots(1, 2, figsize=(16, 5))
df_taxi.set_index('Kilometres')['Meter'].plot(
    title='Meter', kind='line', ax=axs[0]
)
df_taxi.set_index('Kilometres')['Paid (incl. tips)'].plot(
    title='Paid (incl. tips)', label='actual', kind='line', ax=axs[1]
)
df_taxi.set_index('Kilometres')['Paid (Predicted)'].plot(
    title='Paid (incl. tips)', label='estimated', kind='line', ax=axs[1]
)
fig.show()
I cut out the formatting parts of the code to keep it short and to the point. Here is the final result:

- Once a linear model is trained, you can get its intercept and coefficients via the intercept_ and coef_ attributes. So, we can use the following code snippet to print the linear equation of the estimated line:
print(
    'Amount Paid = {:.1f} + {:.1f} * Distance'.format(
        reg.intercept_, reg.coef_[0],
    )
)
The following equation is then printed:
Getting the parameters for the linear equation can be handy in cases where you want to build a model in scikit-learn and then use it in another language or even in your favorite spreadsheet software. Knowing the coefficient also helps us understand why the model made certain decisions. More on this later in this chapter.
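For example, once you know the intercept and the coefficient, you can reproduce the model's estimates with plain arithmetic, which is all a spreadsheet would need. Here is a minimal sketch; the distance of 7 kilometers is an arbitrary example:
import pandas as pd

# Manually apply the linear equation: intercept + coefficient * distance
distance = 7
manual_estimate = reg.intercept_ + reg.coef_[0] * distance

# The same estimate via the trained model's predict() method
model_estimate = reg.predict(pd.DataFrame({'Kilometres': [distance]}))[0]

print(manual_estimate, model_estimate)  # the two values should match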