Regression Analysis with Python
Luca Massaron, Alberto Boschetti
Extending to linear regression
Linear regression tries to fit a line through a given set of points, choosing the best fit. The best fit is the line that minimizes the summed squared difference between the values dictated by the line for each value of x and the corresponding observed y values. (It is optimizing the same squared error that we met before when checking how good the mean was as a predictor.)
Since a linear regression is a line, in bi-dimensional space (x, y) it takes the form of the classical formula of a line in a Cartesian plane: y = mx + q, where m is the angular coefficient (the slope, expressing how steeply the line rises with respect to the x axis) and q is the intercept, the point where the line crosses the y axis.
Formally, machine learning indicates the correct expression for a linear regression as follows:

ŷ = Xβ + β0

Here, again, X is a matrix of the predictors, β is a vector of coefficients, and β0 is a constant value called the bias (it is the same as the Cartesian formulation, only the notation is different).
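To make the matrix notation concrete, here is a minimal NumPy sketch; the numbers are invented purely for illustration and are not taken from the example that follows:

import numpy as np

# Three hypothetical observations of a single predictor
X = np.array([[6.5], [5.0], [7.2]])
beta = np.array([9.0])    # one coefficient per predictor (invented value)
beta_0 = -34.0            # the bias, or intercept (invented value)

# y_hat = X * beta + beta_0, computed as a matrix-vector product
y_hat = X.dot(beta) + beta_0
print(y_hat)              # one predicted value per observation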
We can better understand its functioning mechanism by seeing it in action with Python, first using the Statsmodels package and then the Scikit-learn one.
Regressing with Statsmodels
Statsmodels is a package designed with statistical analysis in mind; therefore, its functions offer quite a rich output of statistical checks and information. Scalability was not a priority in its design; it is really a good starting point for learning, but it is certainly not the optimal solution if you have to crunch quite large datasets (or even big data), because of its optimization algorithm.
There are two different methods (two modules) to work out a linear regression with Statsmodels:
- statsmodels.api: This works with distinct predictor and answer variables and requires you to define any transformation of the variables on the predictor side, including adding the intercept
- statsmodels.formula.api: This works in a similar way to R, allowing you to specify a functional form (the formula of the summation of the predictors)
We will illustrate our example using statsmodels.api; however, we will also show you an alternative method with statsmodels.formula.api.
As a first step, let's upload both the modules of Statsmodels, naming them as conventionally indicated in the package documentation:
In:
import statsmodels.api as sm
import statsmodels.formula.api as smf
As a second step, it is necessary to define the y and X variables:
In:
y = dataset['target']
X = dataset['RM']
X = sm.add_constant(X)
The X variable needs to be extended by a constant value (1); the bias will be calculated accordingly. In fact, as you remember, the formula of a linear regression is as follows:

ŷ = Xβ + β0

However, using statsmodels.api, the formula actually becomes the following:

ŷ = Xβ

This can be interpreted as a combination of the variables in X, each multiplied by its corresponding β value. Consequently, the predictor X now contains both the predictive variable and a unit constant. Also, β is no longer a single coefficient, but a vector of coefficients.
Let's have a visual confirmation of this by requesting the first rows of the Pandas DataFrame using the head method:
In: X.head()
At this point, we just need to initialize the linear regression calculation:
In: linear_regression = sm.OLS(y,X)
Also, we need to ask for the estimation of the regression coefficients, the β vector:
In: fitted_model = linear_regression.fit()
If we had wanted to obtain the same result using statsmodels.formula.api, we should have typed the following:
In:
linear_regression = smf.ols(formula='target ~ RM', data=dataset)
fitted_model = linear_regression.fit()
The previous two lines of code condense both of the steps seen earlier into one, without requiring any particular preparation of the variables, since the bias is automatically incorporated. In fact, the specification of how the linear regression should work is contained in the string target ~ RM, where the variable name on the left of the tilde (~) indicates the answer variable, and the variable name (or names, in the case of a multiple regression analysis) on the right indicates the predictors.
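As a side note, the formula mini-language extends naturally beyond a single predictor; the following variants are illustrative sketches only (the LSTAT column is used here purely as a hypothetical example, and multiple regression is the subject of the next chapter):

# 'target ~ RM'            answer variable target, single predictor RM
# 'target ~ RM + LSTAT'    two predictors (hypothetical example)
# 'target ~ RM - 1'        the same regression without the automatically added intercept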
Actually, smf.ols expects quite a different input compared to sm.OLS, because it can accept our entire original dataset (it selects which variables are to be used based on the provided formula), whereas sm.OLS expects a matrix containing just the features to be used for the prediction. Consequently, some caution has to be exercised when switching between the two approaches.
A summary (a method of the fitted model) can quickly tell you everything that you need to know about the regression analysis. In case you have tried statsmodels.formula.api, we also re-initialize the linear regression using statsmodels.api, since the two do not work on the same X and our following code relies on the sm.OLS specification:
In:
linear_regression = sm.OLS(y, X)
fitted_model = linear_regression.fit()
fitted_model.summary()
You will receive quite a long series of tables containing many statistical tests and pieces of information. Though it is quite daunting at first, you do not actually need all of these outputs, unless the purpose of your analysis is a statistical one. Data science is mainly concerned with real models working on predicting real data, not with formally correct specifications of statistical problems. Nevertheless, some of these outputs are still useful for successful model building, and we are going to provide you with an insight into the main figures.
Before explaining the outputs, we first need to extract two elements from the fitted model: the coefficients and the predictions calculated on the data on which we built the model. They both are going to come in very handy during the following explanations:
In:
print (fitted_model.params)
betas = np.array(fitted_model.params)
fitted_values = fitted_model.predict(X)
The coefficient of determination
Let's start with the first table of results. It is divided into two columns; the first one contains a description of the fitted model:
- Dep. Variable: It just reminds you what the target variable was
- Model: Another reminder of the model that you have fitted; OLS stands for ordinary least squares, another way to refer to linear regression
- Method: The parameters fitting method (in this case least squares, the classical computation method)
- No. Observations: The number of observations that have been used
- DF Residuals: The degrees of freedom of the residuals, which is the number of observations minus the number of parameters
- DF Model: The number of estimated parameters in the model (excluding the constant term from the count)
The second table gives a more interesting picture, focusing on how good the fit of the linear regression model is and pointing out any possible problems with the model (the sketch after this list shows how to read the main figures programmatically):
- R-squared: This is the coefficient of determination, a measure of how well the regression does with respect to a simple mean.
- Adj. R-squared: This is the coefficient of determination adjusted based on the number of parameters in a model and the number of observations that helped build it.
- F-statistic: This is a measure telling you if, from a statistical point of view, all your coefficients, apart from the bias and taken together, are different from zero. In simple words, it tells you if your regression is really better than a simple average.
- Prob (F-statistic): This is the probability that you got that F-statistic just by lucky chance due to the observations that you have used (such a probability is actually called the p-value of the F-statistic). If it is low enough, you can be confident that your regression is really better than a simple mean. Usually, in statistics and science, a test probability has to be equal to or lower than 0.05 (a conventional criterion of statistical significance) to allow such confidence.
- AIC: This is the Akaike Information Criterion. AIC is a score that evaluates the model based on the number of observations and the complexity of the model itself. The lower the AIC score, the better. It is very useful for comparing different models and for statistical variable selection.
- BIC: This is the Bayesian Information Criterion. It works as AIC, but it presents a higher penalty for models with more parameters.
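As anticipated before the list, if you prefer plain numbers to the full summary table, the same headline figures can be read directly as attributes of the statsmodels results object; a minimal sketch:

print(fitted_model.rsquared)       # R-squared
print(fitted_model.rsquared_adj)   # adjusted R-squared
print(fitted_model.fvalue)         # F-statistic
print(fitted_model.f_pvalue)       # Prob (F-statistic)
print(fitted_model.aic)            # Akaike Information Criterion
print(fitted_model.bic)            # Bayesian Information Criterion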
Most of these statistics make sense when we are dealing with more than one predictor variable, so they will be discussed in the next chapter. Thus, for the moment, as we are working with a simple linear regression, the two measures that are worth examining closely are F-statistic and R-squared. F-statistic is actually a test that doesn't tell you too much if you have enough observations and you can count on a minimally correlated predictor variable. Usually it shouldn't be much of a concern in a data science project.
R-squared is instead much more interesting because it tells you how much better your regression model is in comparison to a single mean. It does so by reporting the percentage of the variance that is left unexplained when using the mean as a predictor and that your regression model was able to explain instead.
If you want to compute the measure yourself, you just have to calculate the sum of squared errors of the mean of the target variable. That's your baseline of unexplained variance (the variability in house prices that, in our example, we want to explain by a model). If from that baseline you subtract the sum of squared errors of your regression model, you will get the amount of squared error that your regression managed to explain, which can be compared, using a division, with your baseline:
In:
mean_sum_squared_errors = np.sum((dataset['target'] -
                                  dataset['target'].mean())**2)
regr_sum_squared_errors = np.sum((dataset['target'] -
                                  fitted_values)**2)
(mean_sum_squared_errors - regr_sum_squared_errors) / mean_sum_squared_errors

Out: 0.48352545599133412
Tip
When working with floats, rounding errors are possible, so don't be afraid if some of the lesser decimals don't match in your calculations; if they match the 8th decimal, you can be quite confident that the result is the same.
Ideally, if you can reduce your sum of squared errors of the regression to zero, you will get the maximum percentage of explained variance—that is, a score of 1.
The R-squared measure also matches the value you obtain by squaring the correlation between your predictor and the target variable. In our example, it is 0.484, which is exactly our squared Pearson correlation:
In: (pearsonr(dataset['RM'], dataset['target'])[0])**2

Out: 0.4835254559913339
As we have seen, R-squared is perfectly aligned with the squared errors that the linear regression is trying to minimize; thus, a better R-squared means a better model. However, there are some problems with the measure (and with linear regression itself, actually) when working with more predictors at once. Again, we have to wait until we model more predictors at once; therefore, just for a simple linear regression, a better R-squared should hint at a better model.
Meaning and significance of coefficients
The second output table informs us about the coefficients and provides us with a series of tests. These tests can make us confident that we have not been fooled by a few extreme observations in the foundations of our analysis or by some other problem:
- coef: The estimated coefficient
- std err: The standard error of the estimate of the coefficient; the larger it is, the more uncertain the estimation of the coefficient
- t: The t-statistic value, a measure indicating whether the coefficient true value is different from zero
- P > |t|: The p-value, indicating the probability of observing a coefficient of that size just by chance if its true value were zero
- [95.0% Conf. Interval]: The lower and upper values of the coefficient, considering 95% of all the chances of having different observations and so different estimated coefficients
From a data science viewpoint, t-tests and confidence bounds are not very useful because we are mostly interested in verifying whether our regression is working while predicting answer variables. Consequently, we will focus just on the coef value (the estimated coefficients) and on their standard error.
The coefficients are the most important output that we can obtain from our regression model because they allow us to re-create the weighted summation that can predict our outcomes.
In our example, our coefficients are -34.6706 for the bias (also called the intercept, recalling the formula for a line in a Cartesian space) and 9.1021 for the RM variable. Recalling our formula, we can plug in the numbers we obtained:

ŷ = βRM * xRM + β0

Now, if you replace the betas with the estimated coefficients 9.1021 and -34.6706, everything becomes the following:

ŷ = 9.1021 * xRM - 34.6706

Now, if you know the average number of rooms in an area of Boston, you can make a quick estimate of the expected value. For instance, if xRM is 4.55:
In: 9.1021*4.55 - 34.6706

Out: 6.743955
We have to notice two points here. First, in such a formulation, the beta of each variable becomes its unit change measure, which corresponds to the change the outcome will undergo if the variable increases by one unit. In our case, our average room space becomes 5.55:
In: 9.1021*5.55 - 34.6706

Out: 15.846055
The increase for a unit change in xRM corresponds to a change in the outcome equivalent to βRM. The other point to notice is that, if our average room space becomes 1 or 2, our estimated value will turn negative, which is completely unrealistic. This is because the mapping between the predictor and the target variable was learned within a delimited range of values of the predictor:
In: (np.min(dataset['RM']), np.max(dataset['RM']))

Out: (3.5609999999999999, 8.7799999999999994)
Whenever we try to estimate our answer values using an x (or a set of X) that is outside the boundaries we used for fitting the model, we risk a response that has not been optimized at all by the linear regression calculations. Expressed in another way, linear regression can learn what it sees, and, unless there is a clear linear functional form between the predictor and the target (they can be truly expressed as a line), you risk weird estimations when your predictors have an unusual value. In other words, a linear regression can always work within the range of values it learned from (this is called interpolation) but can provide correct values beyond its learning boundaries (a different predictive activity called extrapolation) only in certain conditions.
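As a quick illustration of the extrapolation risk, simple arithmetic with the coefficients estimated earlier shows what happens well below the observed minimum of about 3.56 rooms:

# An average of 2 rooms is outside the learned range,
# and the fitted line returns a negative, meaningless price estimate
9.1021 * 2 - 34.6706    # roughly -16.47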
Tip
As we previously mentioned, the number of observations used for fitting the model is of paramount importance to obtain a robust and reliable linear regression model. The more observations, the less likely the model is to be surprised by unusual values when running in production.
Standard errors instead are very important because they signal a weak or unclear relationship between the predictor and the answer. You can notice this by dividing the standard error by its beta. If the ratio is 0.5 or even larger, then it's a clear sign that the model has little confidence that it provided you with the right coefficient estimates. Having more cases is always a solution because it can reduce the standard errors of the coefficients and improve our estimates; however, there are also other methods to reduce errors, such as removing the redundant variance present among the features by a principal component analysis or selecting a parsimonious set of predictors by greedy selection. All these topics will be discussed when we work with multiple predictors; at that point in the book, we will illustrate the remedies to such problems.
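As a rough check (an illustrative sketch, not code from the book), you can compute the ratio of each standard error to the absolute value of its coefficient directly from the fitted model:

# Ratios around 0.5 or larger signal poorly estimated coefficients
ratios = fitted_model.bse / np.abs(fitted_model.params)
print(ratios)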
Evaluating the fitted values
The last table deals with an analysis of the residuals of the regression. The residuals are the differences between the target values and the predicted fitted values (a short sketch after this list shows how to compute some of these figures directly):
- Skewness: This is a measure of the symmetry of the residuals around the mean. For symmetric distributed residuals, the value should be around zero. A positive value indicates a long tail to the right; a negative value a long tail to the left.
- Kurtosis: This is a measure of the shape of the distribution of the residuals. A bell-shaped distribution has a zero measure. A negative value points to a too flat distribution; a positive one has too great a peak.
- Omnibus D'Agostino's test: This is a combined statistical test for skewness and kurtosis.
- Prob(Omnibus): This is the Omnibus statistic turned into a probability.
- Jarque-Bera: This is another test of skewness and kurtosis.
- Prob (JB): This is the JB statistic turned into a probability.
- Durbin-Watson: This is a test for the presence of correlation among the residuals (relevant during analysis of time-based data).
- Cond. No: This is a test for multicollinearity (we will deal with the concept of multicollinearity when working with many predictors).
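As anticipated before the list, if you want to reproduce some of these diagnostics yourself rather than read them off the summary, a minimal sketch using SciPy and a statsmodels utility could look like the following (the conventions may differ slightly from the summary table, for instance excess versus plain kurtosis):

from scipy.stats import skew, kurtosis, normaltest, jarque_bera
from statsmodels.stats.stattools import durbin_watson

residuals = fitted_model.resid      # residuals of the fitted OLS model
print(skew(residuals))              # skewness: about 0 for symmetric residuals
print(kurtosis(residuals))          # excess kurtosis: about 0 for a bell shape
print(normaltest(residuals))        # D'Agostino's omnibus test (statistic, p-value)
print(jarque_bera(residuals))       # Jarque-Bera test (statistic, p-value)
print(durbin_watson(residuals))     # Durbin-Watson statistic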
A close analysis of residuals is quite relevant in statistical practice since it can highlight the presence of serious problems with regression analysis. When working with a single variable it is interesting to visually check its residuals to figure out if there are strange cases or if the residuals don't distribute randomly. In particular, it is important to keep an eye out for any of these three problems showing up:
- Values too far from the average. Large standardized residuals hint at a serious difficulty when modeling such observations. Also, in the process of learning these values, the regression coefficients may have been distorted.
- Different variance with respect to the value of the predictor. Since the linear regression is an average conditioned on the predictor, a non-homogeneous variance points out that the regression is not working properly when the predictor has certain values.
- Strange shapes in the cloud of residual points may indicate that you need a more complex model for the data you are analyzing.
In our case, we can easily compute the residuals by subtracting the fitted values from the answer variable and then plotting the resulting standardized residuals in a graph:
In:
residuals = dataset['target'] - fitted_values
normalized_residuals = standardize(residuals)

In:
residual_scatter_plot = plt.plot(dataset['RM'], normalized_residuals, 'bp')
mean_residual = plt.plot([int(x_range[0]), round(x_range[1], 0)], [0, 0],
                         '-', color='red', linewidth=2)
upper_bound = plt.plot([int(x_range[0]), round(x_range[1], 0)], [3, 3],
                       '--', color='red', linewidth=1)
lower_bound = plt.plot([int(x_range[0]), round(x_range[1], 0)], [-3, -3],
                       '--', color='red', linewidth=1)
plt.grid()
The resulting scatterplot indicates that the residuals show some of the problems we previously indicated as a warning that something is not going well with your regression analysis. First, there are a few points lying outside the band delimited by the two dotted lines at normalized residual values of -3 and +3 (a range that should hypothetically cover 99.7% of values if the residuals have a normal distribution). These are surely influential points with large errors and they can actually make the entire linear regression under-perform. We will talk about possible solutions to this problem when we discuss outliers in the next chapter.
Then, the cloud of points is not at all randomly scattered, showing different variances at different values of the predictor variable (the abscissa axis) and you can spot unexpected patterns (points in a straight line, or the core points placed in a kind of U shape).
We are not at all surprised; the average number of rooms is likely a good predictor but it is not the only cause, or it has to be rethought as a direct cause (the number of rooms indicates a larger house, but what if the rooms are smaller than average?). This leads us to discuss whether a strong correlation really makes a variable a good working candidate for a linear relationship.
Correlation is not causation
Actually, seeing a correlation between your predictor and your target variable, and managing to model it successfully using a linear regression, doesn't really mean that there is a causal relation between the two (though your regression may work very well, and even optimally).
Though using a data science approach, instead of a statistical one, will guarantee a certain efficacy in your model, it is easy to fall into some mistakes when having no clue why your target variable is correlated with a predictor.
We will tell you about six different reasons, and offer a cautionary word to help you handle such predictors without difficulty:
- Direct causation: x causes y; for instance, in the real estate business the value is directly proportional to the size of the house in square meters.
- Reciprocal effects: x causes y but it is also influenced by y. This is quite typical of many macro-economic dynamics where the effect of a policy augments or diminishes its effects. As an example in real estate, high crime rates in an area can lower its prices but lower prices mean that the area could quickly become even more degraded and dangerous.
- Spurious causation: This happens when the real cause is actually z, which causes both x and y; consequently it is just a fallacious illusion that x implies y because it is z behind the scenes. For instance, the presence of expensive art shops and galleries may seem to correlate with house prices; in reality, both are determined by the presence of affluent residents.
- Indirect causation: x in reality is not causing y but it is causing something else, which then causes y. A good municipality investing in infrastructures after higher taxes can indirectly affect house prices because the area becomes more comfortable to live in, thus attracting more demand. Higher taxes, and thus more investments, indirectly affect house prices.
- Conditional effect: x causes y in respect of the values of another variable z; for instance, when z has certain values x is not influencing y but, when z takes particular values, the x starts impacting y. We also call this situation interaction. For instance the presence of schools in an area can become an attractor when the crime rate is low, so it affects house prices only when there is little criminality.
- Random effect: Any recorded correlation between x and y has been due to a lucky sampling selection; in reality, there is no relationship between x and y at all.
The ideal case is when you have a direct causation; then, you will have a predictor in your model that will always provide you with the best values to derive your responses.
In the other cases, it is likely that the imperfect cause-effect relationship with the target variable will lead to more noisy estimates, especially in production when you will have to work with data not seen before by the model.
Reciprocal effects are more typical of econometric models. They require special types of regression analysis. Including them in your regression analysis may improve your model; however, their role may be underestimated.
Spurious and indirect causes will add some noise to your x and y relationship; this could bring noisier estimates (larger standard errors). Often, the solution is to get more observations for your analysis.
Conditional effects, if not caught, can limit your model's ability to produce accurate estimates. If you are not aware of any of them, given your domain knowledge of the problem, it is a good step to check for any of them using some automatic procedure to test possible interactions between the variables.
Random effects are the worst possible thing that could happen to your model, but they are easily avoided if you follow the data science procedure that we will be describing in Chapter 6, Achieving Generalization, when we deal with all the actions necessary to validate your model's results.
Predicting with a regression model
When we plug the coefficients into the regression formula, predicting is just a matter of applying new data to the vector of coefficients by a matrix multiplication.
First, you can rely on the fitted model by providing it with an array containing new cases. In the following example, you can see how, given the Xp variable with a single new case, this is easily predicted using the predict method on the fitted model:
In:
RM = 5
Xp = np.array([1, RM])
print ("Our model predicts if RM = %01.f the answer value is %0.1f" %
       (RM, fitted_model.predict(Xp)))

Out: Our model predicts if RM = 5 the answer value is 10.8
A nice usage of the predict method is to project the fitted predictions on our previous scatterplot to allow us to visualize the price dynamics in respect of our predictor, the average number of rooms:
In:
x_range = [dataset['RM'].min(), dataset['RM'].max()]
y_range = [dataset['target'].min(), dataset['target'].max()]
scatter_plot = dataset.plot(kind='scatter', x='RM', y='target',
                            xlim=x_range, ylim=y_range)
meanY = scatter_plot.plot(x_range,
                          [dataset['target'].mean(), dataset['target'].mean()],
                          '--', color='red', linewidth=1)
meanX = scatter_plot.plot([dataset['RM'].mean(), dataset['RM'].mean()],
                          y_range, '--', color='red', linewidth=1)
regression_line = scatter_plot.plot(dataset['RM'], fitted_values,
                                    '-', color='orange', linewidth=1)
Using the previous code snippet, we obtain the graphical representation of the regression line, together with an indication of how such a line crosses the cloud of data points. For instance, thanks to this graphical display, we can notice that the regression line passes exactly through the intersection of the x and y averages.
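A quick numeric confirmation of this property (a sketch using the betas extracted earlier, not code from the book): the OLS line always passes through the point of means, so evaluating it at the mean of RM returns approximately the mean of the target:

x_mean = dataset['RM'].mean()
y_mean = dataset['target'].mean()
print(betas[0] + betas[1] * x_mean)   # prediction at the average RM
print(y_mean)                         # average target: the two values should match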
Besides the predict method, generating the predictions is quite easy by just using the dot function in NumPy. After preparing an X matrix containing both the variable data and the bias (a column of ones), and the coefficient vector, all you have to do is to multiply the matrix by the vector. The result will itself be a vector of length equal to the number of observations:
In:
predictions_by_dot_product = np.dot(X, betas)
print ("Using the prediction method: %s" % fitted_values[:10])
print ("Using betas and a dot product: %s" % predictions_by_dot_product[:10])

Out:
Using the prediction method: [ 25.17574577 23.77402099 30.72803225 29.02593787 30.38215211 23.85593997 20.05125842 21.50759586 16.5833549 19.97844155]
Using betas and a dot product: [ 25.17574577 23.77402099 30.72803225 29.02593787 30.38215211 23.85593997 20.05125842 21.50759586 16.5833549 19.97844155]
A comparison of the results obtained by the predict method and this simple multiplication will reveal a perfect match. Because predicting from a linear regression is so simple, if necessary you could even implement this multiplication in your application in a language different from Python. In such a case, you will just need to find a matrix calculation library or program the function yourself. To our knowledge, you can easily write such a function even in the SQL script language.
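For instance, a minimal pure-Python sketch of such a prediction function (no NumPy required, and easy to port to another language) might look like this; the function name is our own invention:

def predict_linear(row, coefficients):
    """Multiply one observation (including the constant term) by the
    vector of estimated coefficients and sum up the products."""
    return sum(value * beta for value, beta in zip(row, coefficients))

# A single observation [constant, RM] against the betas extracted earlier
print(predict_linear([1.0, 5.0], betas))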
Regressing with Scikit-learn
As we have seen while working with the Statsmodels package, a linear model can also be built using a more machine learning-oriented package such as Scikit-learn. Using the linear_model module, we can set up a linear regression model, specifying that the predictors shouldn't be normalized and that our model should have a bias:
In:
from sklearn import linear_model
linear_regression = linear_model.LinearRegression(normalize=False,
                                                  fit_intercept=True)
Data preparation, instead, requires counting the observations and carefully preparing the predictor array to specify its two dimensions (if left as a vector, the fitting procedure will raise an error):
In:
observations = len(dataset)
X = dataset['RM'].values.reshape((observations, 1))  # X should always be a matrix, never a vector
y = dataset['target'].values                         # y can be a vector
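Equivalently (an alternative not used in the book), selecting the column with a list of names keeps the result two-dimensional without an explicit reshape:

# dataset[['RM']] returns a DataFrame, so .values is already a 2-D array
X = dataset[['RM']].values
y = dataset['target'].values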
After completing all the previous steps, we can fit the model using the fit method:
In: linear_regression.fit(X,y)
A very convenient feature of the Scikit-learn package is that all the models, no matter their type or complexity, share the same methods. The fit method is always used for fitting, and it expects an X and a y (when the model is a supervised one). The two common methods for making an exact prediction (always used for regression) and for estimating its probability (when the model is probabilistic) are predict and predict_proba, respectively.
After fitting the model, we can inspect the vector of the coefficients and the bias constant:
In:
print (linear_regression.coef_)
print (linear_regression.intercept_)

Out:
[ 9.10210898]
-34.6706207764
Using the predict method and slicing the first 10 elements of the resulting list, we output the first 10 predicted values:
In: print (linear_regression.predict(X)[:10])

Out: [ 25.17574577 23.77402099 30.72803225 29.02593787 30.38215211 23.85593997 20.05125842 21.50759586 16.5833549 19.97844155]
As previously seen, if we prepare a new matrix and we add a constant, we can calculate the results by ourselves using a simple matrix–vector multiplication:
In:
Xp = np.column_stack((X, np.ones(observations)))
v_coef = list(linear_regression.coef_) + [linear_regression.intercept_]
As expected, the result of the product provides us with the same estimates as the predict method:
In: np.dot(Xp, v_coef)[:10]

Out: array([ 25.17574577, 23.77402099, 30.72803225, 29.02593787, 30.38215211, 23.85593997, 20.05125842, 21.50759586, 16.5833549 , 19.97844155])
At this point, it would be natural to question the usefulness of such a linear_model module. Compared with the functions offered by Statsmodels, Scikit-learn seems to offer little statistical output, and one with many linear regression features seemingly stripped out. In reality, it offers exactly what is needed in data science, and it performs very fast when dealing with large datasets.
If you are working on IPython, just try the following simple test to generate a large dataset and check the performance of the two versions of linear regression:
In:
from sklearn.datasets import make_regression
HX, Hy = make_regression(n_samples=10000000, n_features=1,
                         n_targets=1, random_state=101)
After generating ten million observations of a single variable, start by measuring using the %%time magic function for IPython. This magic function automatically computes how long it takes to complete the calculations in the IPython cell:
In:
%%time
sk_linear_regression = linear_model.LinearRegression(normalize=False,
                                                     fit_intercept=True)
sk_linear_regression.fit(HX, Hy)

Out: Wall time: 647 ms
Now, it is the turn of the Statsmodels package:
In:
%%time
sm_linear_regression = sm.OLS(Hy, sm.add_constant(HX))
sm_linear_regression.fit()

Out: Wall time: 2.13 s
Though a single variable is involved in the model, Statsmodels' default algorithm proves to be about three times slower than Scikit-learn's. We will repeat this test in the next chapter, too, when using more predictive variables in one go and other different fit methods.