Addressing multicollinearity
The following is the definition of multicollinearity according to Wikipedia (https://en.wikipedia.org/wiki/Multicollinearity):
Multicollinearity is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy.
Basically, let's say you have a model with three predictors:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$$

And one of the predictors is a linear combination (perfect multicollinearity) or is approximated by a linear combination (near multicollinearity) of the two other predictors. For instance:

$$x_3 = a_1 x_1 + a_2 x_2 + \eta$$

Here, $\eta$ is some noise variable.
In that case, changes in $x_1$ and $x_2$ will drive changes in $x_3$, and as a consequence, $x_3$ will be tied to $x_1$ and $x_2$. The information already contained in $x_1$ and $x_2$ will be shared with the third predictor $x_3$, which will cause high uncertainty and instability in the model. Small changes in the predictors' values will bring large variations in the coefficients, and the regression may no longer be reliable. In more technical terms, the standard errors of the coefficients would increase, which would lower the significance of otherwise important predictors.
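To see this instability concretely, here is a minimal sketch (my own illustration, assuming NumPy and scikit-learn rather than anything in Amazon ML). It builds a near-collinear $x_3$ and refits the regression on three subsamples; the fitted coefficients swing from one fit to the next:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(scale=0.01, size=n)  # near multicollinearity
y = 2 * x1 - x2 + rng.normal(size=n)           # x3 plays no real role
X = np.column_stack([x1, x2, x3])

# Refit on three slightly different subsamples: the coefficients swing
# widely because x3 carries (almost) the same information as x1 + x2.
for seed in range(3):
    idx = np.random.default_rng(seed).choice(n, size=150, replace=False)
    model = LinearRegression().fit(X[idx], y[idx])
    print(np.round(model.coef_, 2))
```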
There are several ways to detect multicollinearity. Calculating the correlation matrix of the predictors is a first step, but that would only detect collinearity between pairs of predictors.
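For example, with pandas (an assumption here for illustration, not an Amazon ML feature), the correlation matrix takes one call, and the synthetic data below also shows its blind spot:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.01, size=200)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# x3 is almost exactly x1 + x2, yet its pairwise correlations with x1 and
# x2 are only about 0.7: a pairwise check can miss collinearity that
# involves three or more variables at once.
print(df.corr().round(2))
```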
A widely used detection method for multicollinearity is to calculate the variance inflation factor, or VIF, for each of the predictors:

$$VIF_i = \frac{1}{1 - R_i^2}$$

Here, $R_i^2$ is the coefficient of determination of the regression equation with $x_i$ on the left-hand side and all the other predictor variables on the right-hand side.
A large value for one of the VIFs is an indication that the variance (the square of the standard error) of a particular coefficient is larger than it would be if that predictor were completely uncorrelated with all the other predictors. For instance, a VIF of 1.8 indicates that the variance of a predictor's coefficient is 80% larger than it would be in the uncorrelated case.
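As a quick illustration (outside the Amazon ML workflow), statsmodels provides a variance_inflation_factor helper that runs this auxiliary regression for each column; the synthetic predictors below are assumptions for the sake of the example:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2,
                  "x3": x1 + x2 + rng.normal(scale=0.1, size=200)})

# Add an intercept so each auxiliary regression includes a constant term,
# then compute VIF_i = 1 / (1 - R_i^2) for every predictor column.
exog = sm.add_constant(X)
for i, name in enumerate(X.columns, start=1):   # index 0 is the constant
    print(name, round(variance_inflation_factor(exog.values, i), 1))
```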
Once the attributes with high collinearity have been identified, the following options will reduce collinearity:
- Removing the highly collinear predictors from the dataset
- Using Partial Least Squares (PLS) regression or Principal Component Analysis (PCA) regression (https://en.wikipedia.org/wiki/Principal_component_analysis); PLS and PCA reduce the predictors to a smaller set of uncorrelated variables (see the PCA sketch after this list)
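For the PCA route, here is a minimal scikit-learn sketch (again my own illustration, with an arbitrary 95% explained-variance threshold):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([x1, x2, x1 + x2 + rng.normal(scale=0.1, size=200)])

# Keep enough principal components to explain 95% of the variance; the
# near-redundant third direction is dropped, and the components that
# remain are uncorrelated by construction.
pca = PCA(n_components=0.95)
Z = pca.fit_transform(X)
print(Z.shape)                                    # (200, 2)
print(np.round(np.corrcoef(Z, rowvar=False), 2))  # ~identity matrix
```

Note that the principal components are linear mixes of the original predictors, so some interpretability of individual coefficients is traded away.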
Unfortunately, detection and removal of multicollinear variables are not available in the Amazon ML platform.