官术网_书友最值得收藏!

Checking no collinearity with correlations

To check no collinearity, we could use a number of different techniques. For example, for those familiar with linear algebra, the condition number is a measure of how singular a matrix is, where singularity would imply perfect collinearity among the covariates. This number could provide a measure of this collinearity. Another technique is to use the Variance Inflation Factor, which is a more formal technique that provides a measure of how much a regression's variance is increased because of collinearity. Another, and a more common, way of checking this is with simple correlations. Are any variables strongly correlated among themselves in the sense that there could be a direct relation among them? If so, then we may have a multicollinearity problem. To get a sense of how correlated our variables are, we will use the correlations matrix techniques shown in Chapter 2, Understanding Votes with Descriptive Statistics.

The following code shows how correlations work in R:

library(corrplot)
corrplot(corr = cor(data[, numerical_variables]), tl.col = "black", tl.cex = 0.6)

As you can see, the strong correlations (either positive or negative) are occurring intra-groups not inter-groups, meaning that variables that measure the same thing in different ways appear to be highly correlated, while variables that measure different things don't appear to be highly correlated.

For example, Age_18to44 and Age_45plus are variables that measure age, and we expect them to have a negative relation since the higher the percentage of young people in a ward is, by necessity, the percentage of older people is lower. The same relation can be seen in the housing group (Owned, OwnedOutright, SocialRent, and PrivateRent), the employment group (Unemp, UnempRate_EA, and HigherOccup), the deprived group (Deprived and MultiDepriv), ethnic group (White and NonWhite), the residency group (Residents and Households), and the education group (LowEducationLevel and HighEducationLevel). If you pick variables belonging to different groups, the number of strong correlations is significantly lower, but it's there. For example, HigherOccup is strongly correlated to HighEducationLevel and LowEducationLevel, positively and negatively, respectively. Also, variables in the housing group seem to be correlated with variables in the age group. These kinds of relations are expected and natural since highly educated people will most probably have better jobs, and young people probably can't afford a house yet, so they rent. As analysts, we can assume that these variables are in fact measuring different aspects of society and continue on with our analysis. However, these are still things you may want to keep in mind when interpreting the results, and we may also want to only include one of the variables in each group to avoid inter-group collinearity, but we'll avoid these complexities and continue with our analysis for now.

Linear regression is one of those types of models that require criteria from the analyst to be accepted or rejected. In our specific case, it seems that our model's assumptions are valid enough and we may safely use it to provide credible predictions as we will do in the following sections.

主站蜘蛛池模板: 浠水县| 原阳县| 疏附县| 改则县| 丰原市| 镇平县| 浑源县| 凤城市| 黑河市| 扶余县| 河北区| 泰宁县| 昭通市| 阳信县| 新余市| 新龙县| 贵港市| 宜丰县| 澄城县| 泰来县| 柳河县| 楚雄市| 尚志市| 卓资县| 白朗县| 陇西县| 大英县| 格尔木市| 乡城县| 叙永县| 北流市| 东城区| 新密市| 和林格尔县| 修文县| 寿阳县| 荔浦县| 洪江市| 郴州市| 巴中市| 长岛县|