
Understanding linear models

To be able to explain linear models well, I would like to start with an example where the solution can be found using a system of linear equations—a technique we all learned in school when we were around 12 years old. We will then see why this technique doesn't always work with real-life problems, and so a linear regression model is needed. Then, we will apply the regression model to a real-life regression problem and learn how to improve our solution along the way.

Linear equations

"Mathematics is the most beautiful and most powerful creation of the human spirit."
– Stefan Banach

In this example, we have five passengers who have taken a taxi trip. Here, we have a record of the distance each taxi covered in kilometers and the fare displayed on its meter at the end of each trip:

We know that taxi meters usually start with a certain amount and then add a fixed charge for each kilometer traveled. We can model the meter using the following equation:

Meter Amount = A + B * Distance

Here, A is the meter's starting value and B is the charge added per kilometer. With two unknowns, A and B, we only need two data samples to solve for them, which tells us that A is 5 and B is 2.5. We can also plot the formula with these values for A and B, as follows:

The blue line meets the y-axis at the value of A (5), which is why we call A the intercept. Similarly, the slope of the line equals B (2.5).
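If you would like to see this calculation in code, here is a minimal sketch using NumPy's linear-algebra solver. The two trips used here (2 kilometers showing 10 on the meter, and 4 kilometers showing 15) are hypothetical values chosen only to be consistent with A = 5 and B = 2.5; any two distinct trips would work just as well:

import numpy as np

# Each row encodes one equation of the form A + B * distance = meter value.
# The first column multiplies A (always 1) and the second multiplies B.
equations = np.array([
    [1, 2],  # hypothetical 2 km trip
    [1, 4],  # hypothetical 4 km trip
])
meter_values = np.array([10, 15])  # hypothetical meter readings

A, B = np.linalg.solve(equations, meter_values)
print(A, B)  # 5.0 2.5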

The passengers didn't always have change, so they sometimes rounded up the amount shown on the meter to add a tip for the driver. Here is the data for the amount each passenger ended up paying:

After we add the tips, it's clear that the relationship between the distance traveled and the amount paid is no longer linear. The plot on the right-hand side shows that a straight line cannot be drawn to capture this relationship:

We now know that our usual method of solving equations will not work this time. Nevertheless, we can tell that there is still a line that approximates this relationship reasonably well. In the next section, we will use a linear regression algorithm to find this approximation.

Linear regression

Algorithms are all about objectives. Our objective earlier was to find a single line that goes through all the points in the graph. We have seen that this objective is not feasible if a linear relationship does not exist between the points. Therefore, we will use the linear regression algorithm since it has a different objective. The linear regression algorithm tries to find a line where the mean of the squared errors between the estimated points on the line and the actual points is minimal. Visually speaking, in the following graph, we want a dotted line that makes the average squared lengths of the vertical lines minimal:

The method used here to find a line that minimizes the Mean Squared Error (MSE) is known as ordinary least squares. Often, linear regression simply means ordinary least squares. Nevertheless, throughout this chapter, I will use the term LinearRegression (as a single word) to refer to scikit-learn's implementation of ordinary least squares, and I will reserve the term linear regression (as two separate words) for the general concept, whether it uses ordinary least squares or a different method.
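To make the objective concrete, the following sketch computes the MSE for a candidate intercept and slope. The numbers here are placeholders rather than our taxi data; the point is only that ordinary least squares searches for the intercept and slope that make this quantity as small as possible:

import numpy as np

def mean_squared_error(intercept, slope, distances, amounts_paid):
    # The vertical gap between each actual point and the line's estimate
    errors = amounts_paid - (intercept + slope * distances)
    # Average of the squared gaps
    return np.mean(errors ** 2)

# Placeholder data for illustration only
distances = np.array([1.0, 2.0, 3.0, 4.0])
amounts_paid = np.array([8.0, 10.0, 13.0, 15.0])

print(mean_squared_error(5.0, 2.5, distances, amounts_paid))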

The method of ordinary least squares is about two centuries old and it uses simple mathematics to estimate the parameters. That's why some may argue that this algorithm is not actually a machine learning one. Personally, I follow a more liberal approach when categorizing what is machine learning and what is not. As long as the algorithm automatically learns from data and we use that data to evaluate it, then for me, it falls within the machine learning paradigm.

Estimating the amount paid to the taxi driver

Now that we know how linear regression works, let's take a look at how to estimate the amount paid to the taxi driver.

  1. Let's use scikit-learn to build a regression model to estimate the amount paid to the taxi driver:
from sklearn.linear_model import LinearRegression

# Initialize and train the model
reg = LinearRegression()
reg.fit(df_taxi[['Kilometres']], df_taxi['Paid (incl. tips)'])

# Make predictions
df_taxi['Paid (Predicted)'] = reg.predict(df_taxi[['Kilometres']])

Clearly, scikit-learn has a consistent interface. We have used the same fit() and predict() methods as in the previous chapter, but this time with the LinearRegression object.

We only have one feature this time, Kilometres; nevertheless, the fit() and predict() methods expect a two-dimensional input, which is why we enclosed Kilometres in an extra set of square brackets—df_taxi[['Kilometres']].
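Here is a quick way to see the difference between the two forms, assuming df_taxi is the same DataFrame we have been using. Single brackets return a one-dimensional Series, while double brackets return a two-dimensional DataFrame:

print(df_taxi['Kilometres'].shape)    # one-dimensional, e.g. (5,)
print(df_taxi[['Kilometres']].shape)  # two-dimensional, e.g. (5, 1), as fit() expects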

  2. We put our predictions in the same DataFrame under Paid (Predicted). We can then plot the actual values versus the estimated ones using the following code:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2, figsize=(16, 5))

# Plot the original meter readings on the left
df_taxi.set_index('Kilometres')['Meter'].plot(
    title='Meter', kind='line', ax=axs[0]
)

# Plot the actual amounts paid versus the model's estimates on the right
df_taxi.set_index('Kilometres')['Paid (incl. tips)'].plot(
    title='Paid (incl. tips)', label='actual', kind='line', ax=axs[1]
)
df_taxi.set_index('Kilometres')['Paid (Predicted)'].plot(
    title='Paid (incl. tips)', label='estimated', kind='line', ax=axs[1]
)

fig.show()

I cut out the formatting parts of the code to keep it short and to the point. Here is the final result:

  3. Once a linear model is trained, you can get its intercept and coefficients via its intercept_ and coef_ attributes. So, we can use the following code snippet to print the linear equation of the estimated line:
print(
    'Amount Paid = {:.1f} + {:.1f} * Distance'.format(
        reg.intercept_, reg.coef_[0],
    )
)

The following equation is then printed:

Getting the parameters for the linear equation can be handy in cases where you want to build a model in scikit-learn and then use it in another language or even in your favorite spreadsheet software. Knowing the coefficient also helps us understand why the model made certain decisions. More on this later in this chapter.
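As a quick sanity check, and as a sketch of how the model could be reproduced outside scikit-learn, we can recompute the predictions by hand using nothing but the intercept and the coefficient; the result should match reg.predict():

# Manual estimate: intercept + coefficient * distance
manual_estimate = reg.intercept_ + reg.coef_[0] * df_taxi['Kilometres']

# The largest difference from the model's own predictions should be ~0
print((manual_estimate - df_taxi['Paid (Predicted)']).abs().max())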

In software, the inputs to functions and methods are referred to as parameters. In machine learning, the weights learned for a model are also referred to as parameters. When creating a model, we pass its configuration to its __init__ method. Thus, to prevent any confusion, the model's configuration settings are called hyperparameters.
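To see this distinction in scikit-learn's own terms: the hyperparameters we pass to __init__ can be retrieved with get_params(), while the learned parameters are stored in attributes that end with an underscore, such as intercept_ and coef_:

# Hyperparameters: the configuration passed when the model was created
print(reg.get_params())  # e.g. {'fit_intercept': True, ...}

# Parameters: the weights learned from the data during fit()
print(reg.intercept_, reg.coef_)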