- Advanced Machine Learning with R
- Cory Lesmeister, Dr. Sunil Kumar Chinnamgari
Univariate linear regression
We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model for this can be written as follows:

Y = B0 + B1x + e
We can state this as the expected value of Y being a function of the parameters B0 (the intercept) plus B1 (the slope) times x, plus an error term, e. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) of the predicted y values versus the actual Y values. For a simple example, let's say we have the actual values of Y1 and Y2 equal to 10 and 20 respectively, along with the predictions of y1 and y2 as 12 and 18. To calculate RSS, we add the squared differences:

RSS = (Y1 - y1)^2 + (Y2 - y2)^2

This, with simple substitution, yields the following:

(10 - 12)^2 + (20 - 18)^2 = 4 + 4 = 8
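If you want to confirm the arithmetic yourself, a quick check in base R does the job (a throwaway snippet with hypothetical vector names, not part of the original example):

> actual <- c(10, 20) #actual values of Y1 and Y2
> predicted <- c(12, 18) #predicted values of y1 and y2
> sum((predicted - actual)^2) #residual sum of squares
[1] 8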
Before we begin with an application, I want to point out that if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind as the conclusion put forth by the media may not be valid. As we shall see, R—and any other software, for that matter—will give us a solution regardless of the input. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.
To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have the same statistical properties but when plotted show something very different. I've used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is an excellent place to start should you have a similar need. It's a brief digression before moving on to serious modeling:
> #call up and explore the data
> data(anscombe)
> attach(anscombe)
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
As we shall see, each of the pairs has the same correlation coefficient: 0.816. The first two are as follows:
> cor(x1, y1) #correlation of x1 and y1
[1] 0.8164205
> cor(x2, y2) #correlation of x2 and y2
[1] 0.8164205
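The resemblance goes beyond correlation. If you fit a simple linear model to each pair, the estimated coefficients are also essentially identical; this check is my own aside rather than part of the original output, but it uses nothing beyond base R's lm():

> lm(y1 ~ x1)$coefficients #intercept and slope for the first pair
> lm(y2 ~ x2)$coefficients #intercept and slope for the second pair

Each pair (and the remaining two, should you care to check) returns an intercept of roughly 3.0 and a slope of roughly 0.5.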
The real insight here, as Anscombe intended, is when we plot all four pairs together, as follows:
> par(mfrow = c(2,2)) #create a 2x2 grid for plotting
> plot(x1, y1, main = "Plot 1")
> plot(x2, y2, main = "Plot 2")
> plot(x3, y3, main = "Plot 3")
> plot(x4, y4, main = "Plot 4")
The output of the preceding code is a 2 x 2 grid of scatterplots, Plot 1 through Plot 4.
We can see the following:
- Plot 1 appears to have a true linear relationship
- Plot 2 is curvilinear
- Plot 3 has a dangerous outlier
- Plot 4 is driven by one outlier (the quick check after this list bears this out numerically)
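One way to quantify how misleading the summary statistics can be is to look at R-squared for the fourth pair; this check is my own addition to Anscombe's example, but it relies only on base R's lm() and summary():

> summary(lm(y4 ~ x4))$r.squared #fit rests almost entirely on the point at x4 = 19

The value comes back at roughly 0.67, essentially the same as for the other three pairs, even though the apparent relationship hinges on a single observation.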
There you have it: a cautionary tale about the dangers of solely relying on correlation.