
Assumptions of linear regression

Linear regression rests on the following assumptions; if they fail to hold, the linear regression model does not hold true:

  • The dependent variable should be a linear combination of independent variables
  • No autocorrelation in error terms
  • Errors should have zero mean and be normally distributed
  • No or little multi-collinearity
  • Error terms should be homoscedastic

These are explained in detail as follows:

  • The dependent variable should be a linear combination of independent variables: Y should be a linear combination of the X variables (more precisely, linear in the coefficients). Please note that in the following equation, even though X2 is raised to the power of 2, the model still satisfies this assumption because it remains linear in its coefficients:
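A representative form of such an equation, with placeholder coefficients and an error term (the specific symbols here are illustrative assumptions, not necessarily the book's exact notation), is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2^{2} + \varepsilon$$

The model is still linear in the coefficients, even though X2 enters as a squared term.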

How to diagnose: Look at plots of the residuals versus each independent variable. Also try including polynomial terms and check whether the residuals decrease; polynomial terms may capture signal in the data that a simple linear model misses.

In the preceding sample graph, a simple linear regression was applied first, and the errors show a pattern rather than being pure white noise; this simply indicates the presence of non-linearity. After increasing the power of the polynomial term, the errors now look like white noise.
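The following is a minimal sketch of this diagnostic; the synthetic data and the degree-2 polynomial are illustrative assumptions rather than the book's example. It compares residuals from a plain linear fit against a fit that includes a squared term:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data with a quadratic relationship (illustrative assumption)
rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 200)
y = 1.5 + 2.0 * x + 0.8 * x**2 + rng.normal(0, 1, 200)

# Fit a simple linear model and a model that adds a squared term
lin_coeffs = np.polyfit(x, y, deg=1)
poly_coeffs = np.polyfit(x, y, deg=2)
resid_lin = y - np.polyval(lin_coeffs, x)
resid_poly = y - np.polyval(poly_coeffs, x)

# Residuals versus the independent variable: a curved pattern in the left
# panel suggests non-linearity; the right panel should look like noise
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(x, resid_lin, s=10)
axes[0].set_title("Residuals: linear fit")
axes[1].scatter(x, resid_poly, s=10)
axes[1].set_title("Residuals: with squared term")
for ax in axes:
    ax.axhline(0, color="grey", linewidth=1)
    ax.set_xlabel("x")
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```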

  • No autocorrelation in error terms: The presence of correlation in the error terms reduces the model's accuracy.

How to diagnose: Use the Durbin-Watson test. Durbin-Watson's d tests the null hypothesis that the residuals are not linearly autocorrelated. d can lie between 0 and 4: d ≈ 2 indicates no autocorrelation, 0 < d < 2 implies positive autocorrelation, and 2 < d < 4 indicates negative autocorrelation.
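As a minimal sketch, assuming a statsmodels OLS fit on synthetic data (the variables and data are illustrative, not from the text), the statistic can be computed with statsmodels' durbin_watson function:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic regression data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.5]) + rng.normal(size=200)

# Fit OLS and compute the Durbin-Watson statistic on the residuals
model = sm.OLS(y, sm.add_constant(X)).fit()
d = durbin_watson(model.resid)
print(f"Durbin-Watson d = {d:.2f}")  # values near 2 suggest no autocorrelation
```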

  • Errors should have zero mean and be normally distributed: Errors should have zero mean for the model to produce unbiased estimates. Plotting the errors will show their distribution. If the error terms are not normally distributed, confidence intervals become too wide or too narrow, which makes it difficult to estimate coefficients by minimizing least squares:

How to diagnose: Look at a Q-Q plot; tests such as the Kolmogorov-Smirnov test are also helpful. Looking at the preceding Q-Q plots, the left-hand chart shows that the errors are normally distributed, as the residuals do not deviate much from the diagonal line, whereas the right-hand chart clearly shows that the errors are not normally distributed. In such scenarios, we need to re-evaluate the variables, for example by taking log transformations, until the residuals look like those in the left-hand chart.
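A minimal sketch of both checks, using synthetic residuals as an illustrative stand-in for residuals from a fitted model, with SciPy's probplot and kstest:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Synthetic residuals stand in for residuals from a fitted model
rng = np.random.default_rng(1)
resid_normal = rng.normal(0, 1, 300)          # roughly normal errors
resid_skewed = rng.exponential(1, 300) - 1.0  # clearly non-normal errors

# Q-Q plots: points close to the diagonal line suggest normality
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(resid_normal, dist="norm", plot=axes[0])
axes[0].set_title("Q-Q plot: normal-looking residuals")
stats.probplot(resid_skewed, dist="norm", plot=axes[1])
axes[1].set_title("Q-Q plot: non-normal residuals")
plt.tight_layout()
plt.show()

# Kolmogorov-Smirnov test against a normal distribution parameterized by
# the residuals' own mean and standard deviation
stat, p_value = stats.kstest(
    resid_skewed, "norm", args=(resid_skewed.mean(), resid_skewed.std())
)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.4f}")
```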

  • No or little multi-collinearity: Multi-collinearity is the case in which independent variables are correlated with each other, and this situation creates unstable models by inflating the magnitude of the coefficients/estimates. It also becomes difficult to determine which variable is contributing to predicting the response variable. VIF is calculated for each independent variable by computing the R-squared value of a regression of that variable on all the other independent variables (VIF = 1 / (1 - R²)), and the variable with the highest VIF value is eliminated, one at a time:

How to diagnose: Look at scatter plots and compute the correlation coefficients between all the variables in the data. Also calculate the variance inflation factor (VIF). VIF <= 4 suggests no multi-collinearity; in banking scenarios, people also use the stricter threshold VIF <= 2!
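As a minimal sketch, assuming a pandas DataFrame of predictors (the column names and synthetic data are illustrative assumptions), VIF can be computed per column with statsmodels' variance_inflation_factor:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Synthetic predictors; x2 is deliberately almost collinear with x1
rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.2, size=300)
x3 = rng.normal(size=300)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF is computed for each column against all the others; the constant
# column accounts for the intercept
X_const = add_constant(X)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vif.round(2))  # x1 and x2 should show inflated values
```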

  • Errors should be homoscedastic: Errors should have constant variance with respect to the independent variables. If they do not, confidence intervals for the estimates become impractically wide or narrow, which degrades the model's performance. One reason homoscedasticity fails to hold is the presence of outliers in the data, which drag the model fit toward them with higher weights:

How to diagnose: Look at the plot of residuals versus the fitted values of the dependent variable; if any cone-shaped or diverging pattern exists, it indicates that the errors do not have constant variance, which impacts the model's predictions.
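A minimal sketch of that plot, using synthetic data whose noise grows with the predictor (an illustrative assumption chosen so the cone-shaped pattern is visible):

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic data with error variance that grows with x (heteroscedastic)
rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 300)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x, size=300)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals versus fitted values: a cone-shaped spread indicates
# non-constant variance (heteroscedasticity)
plt.scatter(model.fittedvalues, model.resid, s=10)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```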
