
Extending to linear regression

Linear regression tries to fit a line through a given set of points, choosing the best fit. The best fit is the line that minimizes the sum of the squared differences between each observed y value and the value the line predicts for the corresponding x. (It is optimizing the same squared error that we met before when checking how good the mean was as a predictor.)

Since a linear regression is a line, in bi-dimensional space (x, y) it takes the form of the classical formula of a line in a Cartesian plane: y = mx + q, where m is the angular coefficient (the slope, expressing how steeply the line rises with respect to the x axis) and q is the intercept, the value of y where the line crosses the y axis.

Formally, machine learning indicates the correct expression for a linear regression as follows:

y = βX + β0

Here, again, X is a matrix of the predictors, β is a vector of coefficients, and β0 is a constant value called the bias (it plays the same role as q in the Cartesian formulation; only the notation is different).
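To make the notation concrete, here is a minimal sketch in NumPy (with made-up numbers, not the Boston data) showing that the formula is nothing more than a dot product between the predictor matrix and the coefficient vector, plus the bias:

In: import numpy as np
 X_toy = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5]])  # three observations, two hypothetical predictors
 betas_toy = np.array([4.0, -1.0])                       # one coefficient per predictor
 bias_toy = 0.5                                          # the constant term, beta_0
 y_toy = np.dot(X_toy, betas_toy) + bias_toy             # y_toy is array([ 2.5, 8. , 11. ])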

We can better understand its functioning mechanism by seeing it in action with Python, first using the Statsmodels package and then the Scikit-learn one.

Regressing with Statsmodels

Statsmodels is a package designed with statistical analysis in mind; therefore, its functions offer quite a rich output of statistical checks and information. Scalability, however, is not one of the package's priorities; it is a really good starting point for learning, but certainly not the optimal solution if you have to crunch quite large datasets (or even big data), because of its optimization algorithms.

There are two different methods (two modules) to work out a linear regression with Statsmodels:

  • statsmodels.api: This works with distinct predictor and answer variables and requires you to define any transformation of the variables on the predictor matrix yourself, including adding the intercept
  • statsmodels.formula.api: This works in a similar way to R, allowing you to specify a functional form (the formula of the summation of the predictors)

We will illustrate our example using statsmodels.api; however, we will also show you the alternative method with statsmodels.formula.api.

As a first step, let's upload both the modules of Statsmodels, naming them as conventionally indicated in the package documentation:

In: import statsmodels.api as sm
 import statsmodels.formula.api as smf

As a second step, it is necessary to define the y and X variables:

In: y = dataset['target']
 X = dataset['RM']
 X = sm.add_constant(X)

The X variable needs to be extended by a constant value (a column of ones); the bias will be calculated accordingly. In fact, as you remember, the formula of a linear regression is as follows:

y = βX + β0

However, using statsmodels.api, the formula actually becomes the following:

y = βX

This can be interpreted as a combination of the variables in X, each multiplied by its corresponding β value.

Consequently, the predictor X now contains both the predictive variable and a unit constant. Also, β is no longer a single coefficient, but a vector of coefficients.

Let's have a visual confirmation of this by requesting the first rows of the pandas DataFrame using the head method:

In: X.head()

(The output shows the first rows of X: a const column of ones alongside the original RM values.)

At this point, we just need to set the initialization of the linear regression calculation:

In: linear_regression = sm.OLS(y,X)

Also, we need to ask for the estimation of the regression coefficients, the β vector:

In: fitted_model = linear_regression.fit()

If we had wanted to achieve the same result using statsmodels.formula.api, we would have typed the following:

In: linear_regression = smf.ols(formula='target ~ RM', data=dataset)
 fitted_model = linear_regression.fit()

The previous two lines of code comprise both the steps seen before, without requiring any particular variable preparation, since the bias is automatically incorporated. In fact, the specification of how the linear regression should work is contained in the string target ~ RM, where the variable name to the left of the tilde (~) indicates the answer variable and the variable name to the right (or names, in the case of a multiple regression analysis) the predictors.
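As a hint of what is coming in the next chapter, the same formula syntax extends naturally to multiple predictors by summing them on the right of the tilde. The following sketch assumes your dataset DataFrame also contains the LSTAT column from the Boston data:

In: multiple_regression = smf.ols(formula='target ~ RM + LSTAT', data=dataset)
 multiple_fit = multiple_regression.fit()
 print (multiple_fit.params)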

Actually, smf.ols expects quite a different input compared to sm.OLS, because it can accept our entire original dataset (it selects what variables are to be used by using the provided formula), whereas sm.OLS expects a matrix containing just the features to be used for prediction. Consequently, some caution has to be exercised when using two such different approaches.

A summary (a method of the fitted model) can quickly tell you everything that you need to know about the regression analysis. In case you have tried statsmodels.formula.api, we also re-initialize the linear regression using statsmodels.api, since the two approaches do not work on the same X and our following code relies on the sm.OLS specification:

In: linear_regression = sm.OLS(y,X)
 fitted_model = linear_regression.fit()
 fitted_model.summary()

(The summary output consists of three tables: general information and fit statistics for the model, the coefficient estimates with their standard errors and tests, and diagnostics on the residuals.)

You will receive quite a long series of tables containing many statistical tests and information. Though quite daunting at the beginning, you actually do not need all these outputs, unless the purpose of your analysis is a statistical one. Data science is mainly concerned with real models working on predicting real data, not on formally correct specifications of statistical problems. Nevertheless, some of these outputs are still useful for successful model building and we are going to provide you with an insight into the main figures.

Before explaining the outputs, we first need to extract two elements from the fitted model: the coefficients and the predictions calculated on the data on which we built the model. They both are going to come in very handy during the following explanations:

In: print (fitted_model.params)
 betas = np.array(fitted_model.params)
 fitted_values = fitted_model.predict(X)

The coefficient of determination

Let's start from the first table of results. The first table is divided into two columns. The first one contains a description of the fitted model:

  • Dep. Variable: It just reminds you what the target variable was
  • Model: Another reminder of the model that you have fitted; OLS stands for ordinary least squares, another way to refer to linear regression
  • Method: The parameters fitting method (in this case least squares, the classical computation method)
  • No. Observations: The number of observations that have been used
  • DF Residuals: The degrees of freedom of the residuals, which is the number of observations minus the number of parameters
  • DF Model: The number of estimated parameters in the model (excluding the constant term from the count)

The second table gives a more interesting picture, focusing on how good the fit of the linear regression model is and pointing out any possible problems with the model:

  • R-squared: This is the coefficient of determination, a measure of how well the regression does with respect to a simple mean.
  • Adj. R-squared: This is the coefficient of determination adjusted based on the number of parameters in a model and the number of observations that helped build it.
  • F-statistic: This is a measure telling you if, from a statistical point of view, all your coefficients, apart from the bias and taken together, are different from zero. In simple words, it tells you if your regression is really better than a simple average.
  • Prob (F-statistic): This is the probability that you got that F-statistic just by lucky chance due to the observations that you have used (such a probability is actually called the p-value of the F-statistic). If it is low enough, you can be confident that your regression is really better than a simple mean. Usually in statistics and science, a test probability has to be equal to or lower than 0.05 (a conventional criterion of statistical significance) to grant such confidence.
  • AIC: This is the Akaike Information Criterion. AIC is a score that evaluates the model based on the number of observations and the complexity of the model itself. The lower the AIC score, the better. It is very useful for comparing different models and for statistical variable selection.
  • BIC: This is the Bayesian Information Criterion. It works as AIC, but it presents a higher penalty for models with more parameters.

Most of these statistics make sense when we are dealing with more than one predictor variable, so they will be discussed in the next chapter. Thus, for the moment, as we are working with a simple linear regression, the two measures that are worth examining closely are F-statistic and R-squared. F-statistic is actually a test that doesn't tell you too much if you have enough observations and you can count on a minimally correlated predictor variable. Usually it shouldn't be much of a concern in a data science project.
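If you just need these figures, without reading through the whole summary, the fitted model exposes them as attributes; the following is a quick sketch using the standard Statsmodels attribute names:

In: print (fitted_model.rsquared)       # coefficient of determination
 print (fitted_model.rsquared_adj)   # adjusted for the number of parameters
 print (fitted_model.fvalue, fitted_model.f_pvalue)   # F-statistic and its p-value
 print (fitted_model.aic, fitted_model.bic)           # information criteria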

R-squared is instead much more interesting because it tells you how much better your regression model is compared to a simple mean. It does so by reporting the percentage of the variance left unexplained by the mean (used as a predictor) that your model was actually able to explain.

If you want to compute the measure yourself, you just have to calculate the sum of squared errors of the mean of the target variable. That's your baseline of unexplained variance (the variability in house prices that, in our example, we want to explain by a model). If from that baseline you subtract the sum of squared errors of your regression model, you get the explained sum of squared errors, which you can then divide by the baseline:

In: mean_sum_squared_errors = np.sum((dataset['target'] - dataset['target'].mean())**2)
 regr_sum_squared_errors = np.sum((dataset['target'] - fitted_values)**2)
 (mean_sum_squared_errors - regr_sum_squared_errors) / mean_sum_squared_errors

Out: 0.48352545599133412
Tip

When working with floats, rounding errors are possible, so don't be afraid if some of the lesser decimals don't match in your calculations; if they match to the eighth decimal place, you can be quite confident that the result is the same.

Ideally, if you can reduce your sum of squared errors of the regression to zero, you will get the maximum percentage of explained variance—that is, a score of 1.

For a simple linear regression, the R-squared measure also equals the square of the correlation between your predictor and the target variable.

In our example, it is 0.484, which is exactly the square of the correlation:

In: (pearsonr(dataset['RM'], dataset['target'])[0])**2

Out: 0.4835254559913339

As we have seen, R-squared is perfectly aligned with the squared errors that the linear regression is trying to minimize; thus, a better R-squared means a better model. However, there are some problems with the measure (and with linear regression itself, actually) when working with more predictors at once; we will have to wait until we model more predictors to discuss them. For a simple linear regression, then, a better R-squared should hint at a better model.

Meaning and significance of coefficients

The second output table informs us about the coefficients and provides us with a series of tests. These tests can make us confident that we have not been fooled by a few extreme observations in the foundations of our analysis or by some other problem:

  • coef: The estimated coefficient
  • std err: The standard error of the estimate of the coefficient; the larger it is, the more uncertain the estimation of the coefficient
  • t: The t-statistic value, a measure indicating whether the coefficient true value is different from zero
  • P > |t|: The p-value, indicating the probability of observing a coefficient this far from zero just by chance (that is, if its true value were zero)
  • [95.0% Conf. Interval]: The lower and upper values of the coefficient, considering 95% of all the chances of having different observations and so different estimated coefficients

From a data science viewpoint, t-tests and confidence bounds are not very useful because we are mostly interested in verifying whether our regression is working while predicting answer variables. Consequently, we will focus just on the coef value (the estimated coefficients) and on their standard error.

The coefficients are the most important output that we can obtain from our regression model because they allow us to re-create the weighted summation that can predict our outcomes.

In our example, our coefficients are -34.6706 for the bias (also called the intercept, recalling the formula for a line in a Cartesian space) and 9.1021 for the RM variable. Recalling our formula, we can plug in the numbers we obtained:

y = βRM * xRM + β0

Now, if you replace the betas with the estimated coefficients, -34.6706 and 9.1021, and keep the variable's name, everything becomes the following:

y = 9.1021 * xRM - 34.6706

Now, if you know the average number of rooms in an area of Boston, you can make a quick estimate of the expected value. For instance, if xRM is 4.55:

In: 9.1021*4.55-34.6706

Out: 6.743955

We have to notice two points here. First, in such a formulation, the beta of each variable becomes its unit change measure, which corresponds to the change the outcome will undergo if the variable increases by one unit. Let's see what happens in our case if the average number of rooms becomes 5.55:

In: 9.1021*5.55-34.6706

Out: 15.846055

The increase for a unit change in xRM corresponds to a change in the outcome equivalent to βRM. The other point to be noticed is that, if our average number of rooms becomes 1 or 2, our estimated value will turn negative, which is completely unrealistic. This is because the mapping between the predictor and the target variable was learned within a delimited range of values of the predictor:

In: (np.min(dataset['RM']),np.max(dataset['RM']))
Out: (3.5609999999999999, 8.7799999999999994)
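As a quick check of this point, plugging a value of RM well below the observed minimum into the estimated formula produces a clearly impossible negative price:

In: 9.1021*2 - 34.6706   # RM = 2, far below the observed minimum; the result is about -16.5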

Whenever we try to estimate our answer values using an x (or a set of X) that is outside the boundaries we used for fitting the model, we risk a response that has not been optimized at all by the linear regression calculations. Expressed another way, linear regression can learn what it sees and, unless there is a clear linear functional form between the predictor and the target (they can be truly expressed as a line), you risk weird estimations when your predictors take unusual values. In other words, a linear regression can always work within the range of values it learned from (this is called interpolation) but can provide correct values beyond its learning boundaries (a different predictive activity called extrapolation) only in certain conditions.

Tip

As we previously mentioned, the number of observations used for fitting the model is of paramount importance to obtain a robust and reliable linear regression model. The more observations, the less likely the model is to be surprised by unusual values when running in production.

Standard errors, instead, are very important because they signal a weak or unclear relationship between the predictor and the answer. You can notice this by dividing the standard error by its beta. If the ratio is 0.5 or even larger, then it's a clear sign that the model has little confidence that it provided you with the right coefficient estimates. Having more cases is always a solution because it can reduce the standard errors of the coefficients and improve our estimates; however, there are also other methods to reduce errors, such as removing the redundant variance present among the features by a principal component analysis or selecting a parsimonious set of predictors by greedy selection. All these topics will be discussed when we work with multiple predictors; at that point in the book, we will illustrate the remedies to such a problem.
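A minimal sketch of that check, using the standard errors that Statsmodels stores in the bse attribute of the fitted model:

In: se_ratio = np.abs(fitted_model.bse / fitted_model.params)
 print (se_ratio)   # values around 0.5 or larger flag coefficients estimated with little confidence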

Evaluating the fitted values

The last table deals with an analysis of the residuals of the regression. The residuals are the difference between the target values and the predicted fitted values (a short sketch after the following list shows how to recompute a few of these figures by hand):

  • Skewness: This is a measure of the symmetry of the residuals around the mean. For symmetric distributed residuals, the value should be around zero. A positive value indicates a long tail to the right; a negative value a long tail to the left.
  • Kurtosis: This is a measure of the shape of the distribution of the residuals. A normal, bell-shaped distribution has zero excess kurtosis. A negative value points to a distribution that is too flat; a positive one to too sharp a peak.
  • Omnibus D'Agostino's test: This is a combined statistical test for skewness and kurtosis.
  • Prob(Omnibus): This is the Omnibus statistic turned into a probability.
  • Jarque-Bera: This is another test of skewness and kurtosis.
  • Prob (JB): This is the JB statistic turned into a probability.
  • Durbin-Watson: This is a test for the presence of correlation among the residuals (relevant during analysis of time-based data).
  • Cond. No: This is a test for multicollinearity (we will deal with the concept of multicollinearity when working with many predictors).
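If you want to recompute a few of these diagnostics by hand, the following sketch uses SciPy on the residuals stored in the fitted model (note that scipy.stats.kurtosis returns excess kurtosis, which is zero for a normal distribution):

In: from scipy import stats
 resid = fitted_model.resid            # residuals kept by Statsmodels
 print (stats.skew(resid))             # skewness
 print (stats.kurtosis(resid))         # excess kurtosis
 print (stats.jarque_bera(resid))      # Jarque-Bera statistic and its p-value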

A close analysis of residuals is quite relevant in statistical practice since it can highlight the presence of serious problems with regression analysis. When working with a single variable it is interesting to visually check its residuals to figure out if there are strange cases or if the residuals don't distribute randomly. In particular, it is important to keep an eye out for any of these three problems showing up:

  1. Values too far from the average. Large standardized residuals hint at a serious difficulty when modeling such observations. Also, in the process of learning these values, the regression coefficients may have been distorted.
  2. Different variance with respect to the value of the predictor. If the linear regression is an average conditioned on the predictor, non-homogeneous variance points out that the regression is not working properly when the predictor takes certain values.
  3. Strange shapes in the cloud of residual points may indicate that you need a more complex model for the data you are analyzing.

In our case, we can easily compute the residuals by subtracting the fitted values from the answer variable and then plotting the resulting standardized residuals in a graph:

In: residuals = dataset['target'] - fitted_values
 normalized_residuals = standardize(residuals)  # standardize: subtract the mean and divide by the standard deviation

In: residual_scatter_plot = plt.plot(dataset['RM'], normalized_residuals, 'bp')
 mean_residual = plt.plot([int(x_range[0]), round(x_range[1],0)], [0,0], '-', color='red', linewidth=2)  # x_range: the min and max of RM
 upper_bound = plt.plot([int(x_range[0]), round(x_range[1],0)], [3,3], '--', color='red', linewidth=1)
 lower_bound = plt.plot([int(x_range[0]), round(x_range[1],0)], [-3,-3], '--', color='red', linewidth=1)
 plt.grid()

(A scatterplot of the standardized residuals against RM, with a solid red line at zero and dashed red lines marking the -3 and +3 bounds.)

The resulting scatterplot indicates that the residuals show some of the problems we previously indicated as a warning that something is not going well with your regression analysis. First, there are a few points lying outside the band delimited by the two dotted lines at normalized residual values -3 and +3 (a range that should hypothetically cover 99.7% of values if the residuals have a normal distribution). These are surely influential points with large errors and they can actually make the entire linear regression under-perform. We will talk about possible solutions to this problem when we discuss outliers in the next chapter.

Then, the cloud of points is not at all randomly scattered: it shows different variances at different values of the predictor variable (on the abscissa axis), and you can spot unexpected patterns (points arranged in a straight line, or the core points placed in a kind of U shape).

We are not at all surprised; the average number of rooms is likely a good predictor but it is not the only cause, or it has to be rethought as a direct cause (the number of rooms indicates a larger house, but what if the rooms are smaller than average?). This leads us to discuss whether a strong correlation really makes a variable a good working candidate for a linear relationship.

Correlation is not causation

Actually, seeing a correlation between your predictor and your target variable, and managing to model it successfully using a linear regression, doesn't really mean that there is a causal relation between the two (though your regression may work very well, and even optimally).

Though using a data science approach, instead of a statistical one, will guarantee a certain efficacy for your model, it is easy to make mistakes when you have no clue why your target variable is correlated with a predictor.

We will tell you about six different reasons for a correlation, and offer a cautionary word to help you handle such predictors without difficulty:

  • Direct causation: x causes y; for instance, in the real estate business the value is directly proportional to the size of the house in square meters.
  • Reciprocal effects: x causes y but it is also influenced by y. This is quite typical of many macro-economic dynamics where the effect of a policy augments or diminishes its effects. As an example in real estate, high crime rates in an area can lower its prices but lower prices mean that the area could quickly become even more degraded and dangerous.
  • Spurious causation: This happens when the real cause is actually z, which causes both x and y; consequently it is just a fallacious illusion that x implies y because it is z behind the scenes. For instance, the presence of expensive art shops and galleries may seem to correlate with house prices; in reality, both are determined by the presence of affluent residents.
  • Indirect causation: x in reality is not causing y but it is causing something else, which then causes y. A good municipality investing in infrastructures after higher taxes can indirectly affect house prices because the area becomes more comfortable to live in, thus attracting more demand. Higher taxes, and thus more investments, indirectly affect house prices.
  • Conditional effect: x causes y depending on the values of another variable z; for instance, when z has certain values, x does not influence y but, when z takes particular values, x starts impacting y. We also call this situation an interaction. For instance, the presence of schools in an area can become an attractor when the crime rate is low, so it affects house prices only when there is little criminality.
  • Random effect: Any recorded correlation between x and y has been due to a lucky sampling selection; in reality there is no relationship with y at all.

The ideal case is when you have a direct causation; then, you will have a predictor in your model that will always provide you with the best values to derive your responses.

In the other cases, it is likely that the imperfect cause-effect relationship with the target variable will lead to more noisy estimates, especially in production when you will have to work with data not seen before by the model.

Reciprocal effects are more typical of econometric models. They require special types of regression analysis. Including them in your regression analysis may improve your model; however, their role may be underestimated.

Spurious and indirect causes will add some noise to your x and y relationship; this could bring noisier estimates (larger standard errors). Often, the solution is to get more observations for your analysis.

Conditional effects, if not caught, can limit your model's ability to produce accurate estimates. If you are not aware of any of them, given your domain knowledge of the problem, it is a good step to check for any of them using some automatic procedure to test possible interactions between the variables.
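One simple way to run such a check, sketched here with the Statsmodels formula API (and again assuming the LSTAT column from the Boston data is available in your dataset DataFrame), is to add an interaction term to the formula; in the formula syntax, RM:LSTAT denotes the product of the two variables:

In: interaction = smf.ols(formula='target ~ RM + LSTAT + RM:LSTAT', data=dataset).fit()
 print (interaction.pvalues['RM:LSTAT'])   # a small p-value hints at a conditional effect of RM depending on LSTAT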

Random effects are the worst possible thing that could happen to your model, but they are easily avoided if you follow the data science procedure that we will be describing in Chapter 6, Achieving Generalization, when we deal with all the actions necessary to validate your model's results.

Predicting with a regression model

When we plug the coefficients into the regression formula, predicting is just a matter of multiplying the new data by the vector of coefficients (a matrix multiplication).

First, you can rely on the fitted model by providing it with an array containing new cases. In the following example, you can see how, given the Xp variable with a single new case, this is easily predicted using the predict method on the fitted model:

In: RM = 5
 Xp = np.array([1, RM])
 print ("Our model predicts if RM = %01.f the answer value is %0.1f" % (RM, fitted_model.predict(Xp)))

Out: Our model predicts if RM = 5 the answer value is 10.8

A nice usage of the predict method is to project the fitted predictions on our previous scatterplot to allow us to visualize the price dynamics in respect of our predictor, the average number of rooms:

In: x_range = [dataset['RM'].min(), dataset['RM'].max()]
 y_range = [dataset['target'].min(), dataset['target'].max()]
 scatter_plot = dataset.plot(kind='scatter', x='RM', y='target', xlim=x_range, ylim=y_range)
 meanY = scatter_plot.plot(x_range, [dataset['target'].mean(), dataset['target'].mean()], '--', color='red', linewidth=1)
 meanX = scatter_plot.plot([dataset['RM'].mean(), dataset['RM'].mean()], y_range, '--', color='red', linewidth=1)
 regression_line = scatter_plot.plot(dataset['RM'], fitted_values, '-', color='orange', linewidth=1)

(The scatterplot of RM versus target, with the x and y means as dashed red lines and the fitted regression line in orange.)

Using the previous code snippet, we obtain the preceding graphical representation of the regression line, which also gives an indication of how the line crosses the cloud of data points. For instance, thanks to this graphical display, we can notice that the regression line passes exactly through the point given by the means of x and y.
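You can verify this numerically with a quick sketch: plugging the mean of RM into the estimated coefficients should return, up to rounding, the mean of the target:

In: print (betas[0] + betas[1]*dataset['RM'].mean())   # intercept + slope * mean(RM)
 print (dataset['target'].mean())                    # the two values should match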

Besides the predict method, generating the predictions is quite easy using the dot function in NumPy. After preparing an X matrix containing both the variable data and the bias (a column of ones), along with the coefficient vector, all you have to do is multiply the matrix by the vector. The result will itself be a vector of length equal to the number of observations:

In: predictions_by_dot_product = np.dot(X, betas)
 print ("Using the prediction method: %s" % fitted_values[:10])
 print ("Using betas and a dot product: %s" % predictions_by_dot_product[:10])

Out: Using the prediction method: [ 25.17574577 23.77402099 30.72803225 29.02593787 30.38215211
 23.85593997 20.05125842 21.50759586 16.5833549 19.97844155]
Using betas and a dot product: [ 25.17574577 23.77402099 30.72803225 29.02593787 30.38215211
 23.85593997 20.05125842 21.50759586 16.5833549 19.97844155]

A comparison of the results obtained by the predict method and this simple multiplication reveals a perfect match. Because predicting from a linear regression is so simple, if necessary you could even implement this multiplication in your application in a language other than Python. In that case, you will just need to find a matrix calculation library or program the function yourself. To our knowledge, you can easily write such a function even in the SQL scripting language.
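For instance, a minimal sketch of such a function in plain Python, without any matrix library (shown only as an illustration of how little is needed), could look like this:

In: def predict_linear(row, coefficients):
     # dot product of one observation (with its constant 1) and the coefficients
     return sum(value*coef for value, coef in zip(row, coefficients))
 print (predict_linear([1, 5], [-34.6706, 9.1021]))  # about 10.8, as obtained before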

Regressing with Scikit-learn

As we have seen while working with the Statsmodels package, a linear model can also be built using a more machine learning-oriented package such as Scikit-learn. Using the linear_model module, we can set up a linear regression model, specifying that the predictors shouldn't be normalized and that our model should have a bias:

In: from sklearn import linear_model
 linear_regression = linear_model.LinearRegression(normalize=False, fit_intercept=True)

Data preparation, instead, requires counting the observations and carefully preparing the predictor array to specify its two dimensions (if left as a vector, the fitting procedure will raise an error):

In: observations = len(dataset)
 X = dataset['RM'].values.reshape((observations, 1))  # X should always be a matrix, never a vector
 y = dataset['target'].values                         # y can be a vector

After completing all the previous steps, we can fit the model using the fit method:

In: linear_regression.fit(X,y)

A very convenient feature of the Scikit-learn package is that all the models, no matter their type of complexity, share the same methods. The fit method is always used for fitting and it expects an X and a y (when the model is a supervised one). Instead, the two common methods for making an exact prediction (always for regression) and its probability (when the model is probabilistic) are predict and predict_proba, respectively.
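To appreciate this uniformity, here is a sketch that swaps in a different linear model, a ridge regression, purely as an illustration of the shared fit/predict interface; nothing else in the workflow changes:

In: from sklearn.linear_model import Ridge
 ridge_regression = Ridge(alpha=1.0)   # a regularized linear model, used here only to show the common API
 ridge_regression.fit(X, y)
 print (ridge_regression.predict(X)[:3])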

After fitting the model, we can inspect the vector of the coefficients and the bias constant:

In: print (linear_regression.coef_)
 print (linear_regression.intercept_)

Out: [ 9.10210898]
 -34.6706207764

Using the predict method and slicing the first 10 elements of the resulting list, we output the first 10 predicted values:

In: print (linear_regression.predict(X)[:10])

Out: [ 25.17574577 23.77402099 30.72803225 29.02593787 30.38215211
 23.85593997 20.05125842 21.50759586 16.5833549 19.97844155]

As previously seen, if we prepare a new matrix and we add a constant, we can calculate the results by ourselves using a simple matrix–vector multiplication:

In: Xp = np.column_stack((X, np.ones(observations)))
 v_coef = list(linear_regression.coef_) + [linear_regression.intercept_]

As expected, the result of the product provides us with the same estimates as the predict method:

In: np.dot(Xp,v_coef)[:10]

Out: array([ 25.17574577, 23.77402099, 30.72803225, 29.02593787,
 30.38215211, 23.85593997, 20.05125842, 21.50759586,
 16.5833549 , 19.97844155])

At this point, it would be natural to question the usefulness of the linear_model module. Compared with the functions offered by Statsmodels, Scikit-learn seems to offer little statistical output, and a seemingly stripped-down linear regression implementation. In reality, it offers exactly what is needed in data science, and it performs very fast when dealing with large datasets.

If you are working on IPython, just try the following simple test to generate a large dataset and check the performance of the two versions of linear regression:

In: from sklearn.datasets import make_regression
 HX, Hy = make_regression(n_samples=10000000, n_features=1, n_targets=1, random_state=101)

After generating ten million observations of a single variable, start by measuring the Scikit-learn version using the %%time magic function for IPython. This magic function automatically computes how long it takes to complete the calculations in the IPython cell:

In: %%time
 sk_linear_regression = linear_model.LinearRegression(normalize=False, fit_intercept=True)
 sk_linear_regression.fit(HX, Hy)

Out: Wall time: 647 ms

Now, it is the turn of the Statsmodels package:

In: %%time
 sm_linear_regression = sm.OLS(Hy,sm.add_constant(HX))
 sm_linear_regression.fit()

Out: Wall time: 2.13 s

Though only a single variable is involved in the model, Statsmodels's default algorithm proves to be about three times slower than Scikit-learn's. We will repeat this test in the next chapter, when using more predictive variables in one go and other, different fit methods.
