官术网_书友最值得收藏!

Data understanding and preparation

To begin, we will load the dataset named water and define the structure of the str() function as follows:

    > data(water)

> str(water)
'data.frame': 43 obs. of 8 variables:
$ Year : int 1948 1949 1950 1951 1952 1953 1954
1955 1956 1957 ...

$ APMAM : num 9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7
10.5 9.1 ...

$ APSAB : num 3.58 4.82 3.77 4.46 4.99 5.65 1.45
7.44 5.85 6.13 ...

$ APSLAKE: num 3.91 5.2 3.67 3.93 4.88 4.91 1.77
6.51 3.38 4.08 ...

$ OPBPC : num 4.1 7.55 9.52 11.14 16.34 ...
$ OPRC : num 7.43 11.11 12.2 15.15 20.05 ...
$ OPSLAKE: num 6.47 10.26 11.35 11.13 22.81 ...
$ BSAAM : int 54235 67567 66161 68094 107080
67594 65356 67909 92715 70024 ...

Here we have eight features and one response variable, BSAAM. The observations start in 1943 and run for 43 consecutive years. Since for this exercise we are not concerned with what year the observations occurred in, it makes sense to create a new data frame excluding the year vector. This is quite easy to do. With one line of code, we can create the new data frame, and then verify that it works with the head() function:

    > socal.water <- water[ ,-1] #new dataframe with 
the deletion of
column 1


> head(socal.water)
APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
1 9.13 3.58 3.91 4.10 7.43 6.47 54235
2 5.28 4.82 5.20 7.55 11.11 10.26 67567
3 4.20 3.77 3.67 9.52 12.20 11.35 66161
4 4.60 4.46 3.93 11.14 15.15 11.13 68094
5 7.15 4.99 4.88 16.34 20.05 22.81 107080
6 9.70 5.65 4.91 8.88 8.15 7.41 67594

With all the features being quantitative, it makes sense to look at the correlation statistics and then produce a matrix of scatterplots. The correlation coefficient or Pearson's r, is a measure of both the strength and direction of the linear relationship between two variables. The statistic will be a number between -1 and 1, where -1 is the total negative correlation and +1 is the total positive correlation. The calculation of the coefficient is the covariance of the two variables Pided by the product of their standard deviations. As previously discussed, if you square the correlation coefficient, you will end up with R-squared.

There are a number of ways to produce a matrix of correlation plots. Some prefer to produce heatmaps, but I am a big fan of what is produced with the corrplot package. It can produce a number of different variations including ellipse, circle, square, number, shade, color, and pie. I like the ellipse method, but feel free to experiment with the others. Let's load the corrplot package, create a correlation object using the base cor() function, and examine the following results:

    > library(corrplot)

> water.cor <- cor(socal.water)

> water.cor
APMAM APSAB APSLAKE OPBPC
APMAM 1.0000000 0.82768637 0.81607595 0.12238567
APSAB 0.8276864 1.00000000 0.90030474 0.03954211
APSLAKE 0.8160760 0.90030474 1.00000000 0.09344773
OPBPC 0.1223857 0.03954211 0.09344773 1.00000000
OPRC 0.1544155 0.10563959 0.10638359 0.86470733
OPSLAKE 0.1075421 0.02961175 0.10058669 0.94334741
BSAAM 0.2385695 0.18329499 0.24934094 0.88574778
OPRC OPSLAKE BSAAM
APMAM 0.1544155 0.10754212 0.2385695
APSAB 0.1056396 0.02961175 0.1832950
APSLAKE 0.1063836 0.10058669 0.2493409
OPBPC 0.8647073 0.94334741 0.8857478
OPRC 1.0000000 0.91914467 0.9196270
OPSLAKE 0.9191447 1.00000000 0.9384360
BSAAM 0.9196270 0.93843604 1.0000000

So, what does this tell us? First of all, the response variable is highly and positively correlated with the OP features with OPBPC as 0.8857, OPRC as 0.9196, and OPSLAKE as 0.9384. Also note that the AP features are highly correlated with each other and the OP features as well. The implication is that we may run into the issue of multi-collinearity. The correlation plot matrix provides a nice visual of the correlations as follows:

    > corrplot(water.cor, method = "ellipse")

The output of the preceding code snippet is as follows:

Another popular visual is a scatterplot matrix. This can be called with the pairs() function. It reinforces what we saw in the correlation plot in the previous output:

    > pairs(~ ., data = socal.water)

The output of the preceding code snippet is as follows:

主站蜘蛛池模板: 罗甸县| 同江市| 青海省| 丽江市| 通江县| 沅陵县| 绩溪县| 石家庄市| 吴堡县| 建水县| 格尔木市| 综艺| 梨树县| 武定县| 页游| 大安市| 田东县| 湛江市| 清流县| 潮安县| 龙游县| 上犹县| 汽车| 界首市| 庆云县| 尼勒克县| 花垣县| 新宁县| 万盛区| 新龙县| 长治市| 淄博市| 梅河口市| 无棣县| 辰溪县| 仙桃市| 新田县| 青龙| 时尚| 山丹县| 肥乡县|