
Univariate linear regression

We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as follows:

$$Y = \beta_0 + \beta_1 x + e$$

We can state it as follows: Y equals the parameter $\beta_0$ (the intercept) plus $\beta_1$ (the slope) times x, plus an error term e, so the expected value of Y is $\beta_0 + \beta_1 x$. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences:

$$RSS = (Y_1 - y_1)^2 + (Y_2 - y_2)^2$$

This, with simple substitution, yields the following:

$$RSS = (10 - 12)^2 + (20 - 18)^2 = 4 + 4 = 8$$
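The RSS arithmetic for this two-point example can be checked directly in R. This is a minimal sketch; the variable names `actual`, `predicted`, `res`, and `rss` are ours and do not appear in the original text:

```r
# Actual responses (Y1, Y2) and predictions (y1, y2) from the example
actual    <- c(10, 20)
predicted <- c(12, 18)

# Residuals: actual minus predicted
res <- actual - predicted

# RSS: the sum of the squared residuals
rss <- sum(res^2)
rss
# [1] 8
```

Least squares picks the intercept and slope for which this quantity is as small as possible over all the observations.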

Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind, as the conclusions put forth by the media may not be valid. As we shall see, R, and any other software for that matter, will give us a solution regardless of the input. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.

To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have nearly identical summary statistics but, when plotted, show something very different. I've used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is an excellent place to start should you have a similar need. It's a brief digression before moving on to serious modeling:

> #call up and explore the data
> data(anscombe)
> attach(anscombe)
> anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
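Before turning to correlation, the claim that the four pairs share their basic summary statistics can be checked with a few lines of base R. This is a sketch only; the names `means`, `variances`, and `stats_table` are ours, and the code reads the columns from the data frame rather than relying on `attach()`:

```r
data(anscombe)

# Mean and sample variance of every column (x1..x4, y1..y4)
means     <- colMeans(anscombe)
variances <- sapply(anscombe, var)

# Stack the two vectors so matching columns line up
stats_table <- rbind(mean = means, variance = variances)
round(stats_table, 2)
```

The four x columns should agree with each other on both rows, and the four y columns should agree to about two decimal places, even though the raw values printed above clearly differ.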

As we shall see, each of the pairs has essentially the same correlation coefficient, roughly 0.816. The first two are as follows:

> cor(x1, y1) #correlation of x1 and y1
[1] 0.8164205
> cor(x2, y2) #correlation of x2 and y2
[1] 0.8162365
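To see the remaining pairs without repeating the call by hand, the four correlations can be computed in one pass. This is a sketch under our own naming (`cors` and the `pair1` to `pair4` labels are not from the original text), and it indexes the data frame directly instead of depending on `attach()`:

```r
data(anscombe)

# Correlation of x_i with y_i for each of the four pairs
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(cors) <- paste0("pair", 1:4)

cors
```

All four values should land within about a thousandth of 0.816, which is the point: correlation alone cannot tell these datasets apart.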

The real insight, as Anscombe intended, comes when we plot all four pairs together, as follows:

> par(mfrow = c(2,2)) #create a 2x2 grid for plotting
> plot(x1, y1, main = "Plot 1")
> plot(x2, y2, main = "Plot 2")
> plot(x3, y3, main = "Plot 3")
> plot(x4, y4, main = "Plot 4")

The preceding code produces a 2x2 grid of scatterplots, one per pair, in which we can see the following:

  • Plot 1 appears to have a true linear relationship
  • Plot 2 is curvilinear
  • Plot 3 has a dangerous outlier
  • Plot 4 is driven by one outlier

There you have it: a cautionary tale about the dangers of relying solely on correlation.
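The same caution extends to regression output. As a rough illustration, we can fit a least squares line to each pair with `lm()` and compare the estimates and R-squared values. This is a sketch, not part of the original walkthrough; `fit_summary` and its column names are ours:

```r
data(anscombe)

# Fit y_i ~ x_i for each pair and keep intercept, slope, and R-squared
fit_summary <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
}))
rownames(fit_summary) <- paste0("pair", 1:4)

round(fit_summary, 2)
```

Every row should show an intercept near 3, a slope near 0.5, and an R-squared near 0.67, so the fitted lines are practically indistinguishable even though only Plot 1 looks like data that a straight line describes well.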
