
Univariate linear regression

We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as follows:

$$Y = \beta_0 + \beta_1 x + e$$

We can state it as follows: the expected value of Y is a function of the parameters $\beta_0$ (the intercept) plus $\beta_1$ (the slope) times x, plus an error term e. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences:

$$RSS = (Y_1 - y_1)^2 + (Y_2 - y_2)^2$$

This, with simple substitution, yields the following:

$$RSS = (10 - 12)^2 + (20 - 18)^2 = 8$$
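The same arithmetic can be checked in R. This is only a small sketch of the two-point example; the variable names `actual`, `predicted`, and `rss` are our own and not part of any package:

```r
# Actual responses (Y1, Y2) and their predictions (y1, y2) from the example
actual    <- c(10, 20)
predicted <- c(12, 18)

# Residuals are the differences between actual and predicted values
residuals <- actual - predicted

# RSS is the sum of the squared residuals
rss <- sum(residuals^2)
print(rss)
```

Least squares picks the intercept and slope that make this quantity as small as possible over all the observations.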

 

Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind, as the conclusions put forth by the media may not be valid. As we shall see, R (and any other software, for that matter) will give us a solution regardless of the input. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.

To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have the same statistical properties but when plotted show something very different. I've used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is an excellent place to start should you have a similar need. It's a brief digression before moving on to serious modeling:

> #call up and explore the data
> data(anscombe)
> attach(anscombe)
> anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
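To see what "the same statistical properties" means in practice, here is a minimal sketch that tabulates the mean and variance of each x and y column. The names `x_cols`, `y_cols`, and `summary_stats` are our own, chosen for illustration:

```r
data(anscombe)

# Column names for the four x variables and the four y variables
x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# One row per pair: mean and variance of x, then mean and variance of y
summary_stats <- data.frame(
  mean_x = unname(colMeans(anscombe[x_cols])),
  var_x  = unname(sapply(anscombe[x_cols], var)),
  mean_y = unname(colMeans(anscombe[y_cols])),
  var_y  = unname(sapply(anscombe[y_cols], var)),
  row.names = paste("Pair", 1:4)
)
print(round(summary_stats, 2))
```

Every x column has a mean of 9 and a variance of 11, and the y columns agree with one another to about two decimal places, so these summaries alone give no hint that the pairs differ.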

As we shall see, each of the pairs has nearly the same correlation coefficient, about 0.816. The first two are as follows:

> cor(x1, y1) #correlation of x1 and y1
[1] 0.8164205
> cor(x2, y2) #correlation of x2 and y2
[1] 0.8162365
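Rather than typing each call by hand, all four correlations can be computed in one pass. The following is a sketch that indexes the data frame by column name instead of relying on `attach()`; the name `cors` is our own:

```r
data(anscombe)

# Correlation of x_i with y_i for each of the four pairs
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(cors) <- paste("Pair", 1:4)
print(cors)
```

The four values agree to roughly three decimal places, which is exactly why the correlation by itself tells us so little here.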

The real insight here, as Anscombe intended, is when we plot all four pairs together, as follows:

> par(mfrow = c(2,2)) #create a 2x2 grid for plotting
> plot(x1, y1, main = "Plot 1")
> plot(x2, y2, main = "Plot 2")
> plot(x3, y3, main = "Plot 3")
> plot(x4, y4, main = "Plot 4")

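The output of the preceding code is as follows:

[Figure: a 2x2 grid of scatter plots titled Plot 1, Plot 2, Plot 3, and Plot 4, one for each x-y pair]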

We can see the following:

  • Plot 1 appears to have a true linear relationship
  • Plot 2 is curvilinear
  • Plot 3 has a dangerous outlier
  • Plot 4 is driven by one outlier

There you have it: a cautionary tale about the dangers of solely relying on correlation.
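The same caution applies to a fitted line and its R-squared. As a rough sketch, we can fit a least squares line to each pair with `lm()` and compare the results; the names `fits` and `results` are our own:

```r
data(anscombe)

# Fit y_i ~ x_i for each of the four pairs
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect the intercept, slope, and R-squared from each fit
results <- t(sapply(fits, function(fit) {
  c(intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
}))
rownames(results) <- paste("Pair", 1:4)
print(round(results, 2))
```

All four fits come out close to y = 3 + 0.5x with an R-squared of about 0.67, even though the plots show that a straight line is a sensible description of only the first pair.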
