
Assumptions of linear regression

Linear regression rests on the following assumptions; if they fail to hold, the linear regression model does not hold true:

  • The dependent variable should be a linear combination of independent variables
  • No autocorrelation in error terms
  • Errors should have zero mean and be normally distributed
  • No or little multi-collinearity
  • Error terms should be homoscedastic

These are explained in detail as follows:

  • The dependent variable should be a linear combination of independent variables: Y should be a linear combination of the X variables (more precisely, linear in the coefficients). Please note that in the following equation, even though X2 is raised to the power of 2, the model still satisfies this assumption because it remains linear in its coefficients:
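A representative form of such an equation, with placeholder coefficients and an error term (the specific symbols here are illustrative assumptions, not necessarily the book's exact notation), is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2^{2} + \varepsilon$$

The model is still linear in the coefficients, even though X2 enters as a squared term.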

How to diagnose: Look at plots of the residuals versus each independent variable. Also try including polynomial terms and check whether the residuals decrease; polynomial terms may capture signal in the data that a simple linear model misses.

In the preceding sample graph, a simple linear regression was applied first, and the errors show a pattern rather than being pure white noise; this simply indicates the presence of non-linearity. After increasing the power of the polynomial term, the errors now look like white noise.
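The following is a minimal sketch of this diagnostic; the synthetic data and the degree-2 polynomial are illustrative assumptions rather than the book's example. It compares residuals from a plain linear fit against a fit that includes a squared term:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a quadratic relationship (illustrative assumption)
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 200)
y = 1.5 + 2.0 * x + 0.8 * x**2 + rng.normal(0, 1, 200)

# Fit a simple linear model and a model that adds a squared term
lin_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=2)
resid_lin = y - np.polyval(lin_coeffs, x)
resid_poly = y - np.polyval(poly_coeffs, x)

# Residuals versus the independent variable: a curved pattern in the left
# panel suggests non-linearity; the right panel should look like noise
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, resid_lin, s=10)
axes[0].set_title("Residuals: linear fit")
axes[1].scatter(x, resid_poly, s=10)
axes[1].set_title("Residuals: with squared term")
for ax in axes:
    ax.axhline(0, color="grey", linewidth=1)
    ax.set_xlabel("x")
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```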

  • No autocorrelation in error terms: The presence of correlation in the error terms reduces the model's accuracy.

How to diagnose: Use the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated. d can lie between 0 and 4: d ≈ 2 indicates no autocorrelation, 0 < d < 2 implies positive autocorrelation, and 2 < d < 4 indicates negative autocorrelation.
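As a minimal sketch, assuming a statsmodels OLS fit on synthetic data (the variables and data are illustrative, not from the text), the statistic can be computed with statsmodels' durbin_watson function:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.5]) + rng.normal(size=200)

# Fit OLS and compute the Durbin-Watson statistic on the residuals
model = sm.OLS(y, sm.add_constant(X)).fit()
d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.2f}")  # values near 2 suggest no autocorrelation
```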

  • Errors should have zero mean and be normally distributed: Errors should have zero mean for the model to produce unbiased estimates. Plotting the errors will show their distribution. If the error terms are not normally distributed, confidence intervals become too wide or too narrow, which makes it difficult to estimate coefficients by minimizing least squares:

How to diagnose: Look at a Q-Q plot; tests such as the Kolmogorov-Smirnov test are also helpful. Looking at the preceding Q-Q plots, the left-hand chart shows that the errors are normally distributed, as the residuals do not deviate much from the diagonal line, whereas the right-hand chart clearly shows that the errors are not normally distributed. In such scenarios, we need to re-evaluate the variables, for example by taking log transformations, until the residuals look like those in the left-hand chart.
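A minimal sketch of both checks, using synthetic residuals as an illustrative stand-in for residuals from a fitted model, with SciPy's probplot and kstest:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Synthetic residuals stand in for residuals from a fitted model
rng = np.random.default_rng(1)
resid_normal = rng.normal(0, 1, 300)          # roughly normal errors
resid_skewed = rng.exponential(1, 300) - 1.0  # clearly non-normal errors

# Q-Q plots: points close to the diagonal line suggest normality
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(resid_normal, dist="norm", plot=axes[0])
axes[0].set_title("Q-Q plot: normal-looking residuals")
stats.probplot(resid_skewed, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q plot: non-normal residuals")
plt.tight_layout()
plt.show()

# Kolmogorov-Smirnov test against a normal distribution parameterized by
# the residuals' own mean and standard deviation
stat, p_value = stats.kstest(
    resid_skewed, "norm", args=(resid_skewed.mean(), resid_skewed.std())
)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")
```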

  • No or little multi-collinearity: Multi-collinearity is the case in which independent variables are correlated with each other, and this situation creates unstable models by inflating the magnitude of the coefficients/estimates. It also becomes difficult to determine which variable is contributing to predicting the response variable. VIF is calculated for each independent variable by computing the R-squared value of a regression of that variable on all the other independent variables (VIF = 1 / (1 - R²)), and the variable with the highest VIF value is eliminated, one at a time:

How to diagnose: Look at scatter plots and compute the correlation coefficients between all the variables in the data. Also calculate the variance inflation factor (VIF). VIF <= 4 suggests no multi-collinearity; in banking scenarios, people also use the stricter threshold VIF <= 2!
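As a minimal sketch, assuming a pandas DataFrame of predictors (the column names and synthetic data are illustrative assumptions), VIF can be computed per column with statsmodels' variance_inflation_factor:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic predictors; x2 is deliberately almost collinear with x1
rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=300)
x3 = rng.normal(size=300)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF is computed for each column against all the others; the constant
# column accounts for the intercept
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif.round(2))  # x1 and x2 should show inflated values
```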

  • Errors should be homoscedastic: Errors should have constant variance with respect to the independent variables. If they do not, confidence intervals for the estimates become impractically wide or narrow, which degrades the model's performance. One reason homoscedasticity fails to hold is the presence of outliers in the data, which drag the model fit toward them with higher weights:

How to diagnose: Look at the plot of residuals versus the fitted values of the dependent variable; if any cone-shaped or diverging pattern exists, it indicates that the errors do not have constant variance, which impacts the model's predictions.
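A minimal sketch of that plot, using synthetic data whose noise grows with the predictor (an illustrative assumption chosen so the cone-shaped pattern is visible):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic data with error variance that grows with x (heteroscedastic)
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 300)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=300)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals versus fitted values: a cone-shaped spread indicates
# non-constant variance (heteroscedasticity)
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```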
