- Statistics for Machine Learning
- Pratap Dangeti
Assumptions of linear regression
Linear regression rests on the following assumptions; if they are violated, the linear regression model does not hold true:
- The dependent variable should be a linear combination of independent variables
- No autocorrelation in error terms
- Errors should have zero mean and be normally distributed
- No or little multi-collinearity
- Error terms should be homoscedastic
These are explained in detail as follows:
- The dependent variable should be a linear combination of independent variables: Y should be a linear combination of the X variables. Note that even though X2 is raised to the power of 2 in the following equation, the model is still linear in its coefficients, so the assumption holds:
Y = β0 + β1X1 + β2X2²

[Figure: residual plots from a linear fit versus a polynomial fit]
How to diagnose: Examine plots of residuals versus the independent variables. Also try including polynomial terms and check whether the residuals decrease, as polynomial terms may capture signal in the data that a simple linear model misses.
In the preceding sample graph, linear regression was applied first and the errors show a pattern rather than pure white noise, which indicates the presence of non-linearity. After raising the power of the polynomial term, the errors look like white noise.
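As a rough illustration of this diagnostic, the following sketch (synthetic data, NumPy only; the coefficients and variable names are my own) fits a straight line and a quadratic to data generated from a quadratic relationship and compares the residual spread:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
# True relationship is quadratic; the coefficients below are illustrative.
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)

lin_coef = np.polyfit(x, y, deg=1)    # misses the curvature
quad_coef = np.polyfit(x, y, deg=2)   # captures it

lin_resid = y - np.polyval(lin_coef, x)
quad_resid = y - np.polyval(quad_coef, x)

# The linear fit leaves a systematic pattern (large residual spread);
# the quadratic fit leaves residuals close to the noise level.
print(lin_resid.std(), quad_resid.std())
```

Plotting `lin_resid` against `x` would show the tell-tale U-shaped pattern; `quad_resid` would look like white noise.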
- No autocorrelation in error terms: The presence of correlation in the error terms reduces model accuracy.
How to diagnose: Apply the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated. d lies between 0 and 4: d ≈ 2 indicates no autocorrelation, 0 < d < 2 implies positive autocorrelation, and 2 < d < 4 indicates negative autocorrelation.
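The d statistic has a simple closed form: the sum of squared successive residual differences divided by the sum of squared residuals. A minimal hand-rolled sketch (the helper name and the AR coefficient 0.8 are illustrative):

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson d: values near 2 mean no autocorrelation."""
    diff = np.diff(resid)
    return np.sum(diff**2) / np.sum(resid**2)

rng = np.random.default_rng(1)
white = rng.normal(size=500)       # independent errors

ar = np.empty(500)                 # positively autocorrelated errors (AR(1))
ar[0] = white[0]
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + white[t]

print(durbin_watson(white))  # near 2
print(durbin_watson(ar))     # well below 2: positive autocorrelation
```

In practice the same statistic is available as `statsmodels.stats.stattools.durbin_watson`.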
- Errors should have zero mean and be normally distributed: Errors should have zero mean for the model to produce unbiased estimates; plotting the errors shows their distribution. If the error terms are not normally distributed, confidence intervals become too wide or too narrow, which makes it difficult to estimate coefficients by minimizing least squares:
[Figure: Q-Q plots of normally distributed (left) and non-normally distributed (right) residuals]
How to diagnose: Examine a Q-Q plot; tests such as the Kolmogorov-Smirnov test are also helpful. In the preceding Q-Q plots, the left-hand chart shows normally distributed errors, as the residuals do not deviate much from the diagonal reference line, whereas the right-hand chart clearly shows errors that are not normally distributed; in that scenario, we need to re-express the variables (for example, with log transformations) until the residuals look like those in the left-hand chart.
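A small sketch of the test-based route, applying SciPy's Kolmogorov-Smirnov test to standardized residuals (synthetic residuals; the helper name is my own):

```python
import numpy as np
from scipy import stats

def ks_normal_pvalue(resid):
    """KS test of standardized residuals against the standard normal."""
    z = (resid - resid.mean()) / resid.std()
    return stats.kstest(z, 'norm').pvalue

rng = np.random.default_rng(2)
normal_resid = rng.normal(size=300)
skewed_resid = rng.exponential(size=300) - 1.0  # zero mean, but right-skewed

print(ks_normal_pvalue(normal_resid))  # large p: no evidence against normality
print(ks_normal_pvalue(skewed_resid))  # tiny p: reject normality
```

Note that a skewed error distribution can still have zero mean, which is why the distribution check matters beyond the mean check.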
- No or little multi-collinearity: Multi-collinearity arises when independent variables are correlated with each other; it creates unstable models by inflating the magnitude of the coefficients/estimates, and it becomes difficult to determine which variable contributes to predicting the response variable. VIF is calculated for each independent variable from the R-squared value obtained by regressing it on all the other independent variables, and the variable with the highest VIF value is eliminated one at a time:
VIF_i = 1 / (1 - R_i²)
How to diagnose: Look at scatter plots and run correlation coefficients on all pairs of variables. Calculate the variance inflation factor (VIF): VIF <= 4 suggests no serious multi-collinearity, and in banking scenarios a stricter cutoff of VIF <= 2 is also used.
- Errors should be homoscedastic: Errors should have constant variance with respect to the independent variables; when they do not, confidence intervals for the estimates become impractically wide or narrow, which degrades the model's performance. One reason homoscedasticity fails is the presence of outliers in the data, which drag the model fit toward them with higher weight:
[Figure: residual plot showing a cone-shaped, heteroscedastic error pattern]
How to diagnose: Look at the plot of residuals versus predicted values; if a cone-like or diverging pattern exists, the errors do not have constant variance, which impacts the model's predictions.
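A quick sketch of the cone-pattern check, comparing residual spread at low versus high values of the predictor on synthetic heteroscedastic data (the noise model and split point are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 300)
# Noise standard deviation grows with x, producing a cone-shaped residual plot.
y = 2.0 + 3.0 * x + rng.normal(scale=x, size=x.size)

coef = np.polyfit(x, y, deg=1)
resid = y - np.polyval(coef, x)

# Split the residuals at the midpoint of x and compare their spread.
lo, hi = resid[:150], resid[150:]
print(lo.std(), hi.std())  # hi is much larger: variance is not constant
```

Formal tests such as Breusch-Pagan (available in statsmodels) make the same comparison with a proper test statistic.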