
Building a univariate model

Our first case focuses on the goal of predicting the water yield (in inches) of the Snake River Watershed in Wyoming, USA, as a function of the water content of the year's snowfall. This forecast will be useful in managing the water flow and reservoir levels, as the Snake River provides much-needed irrigation water for the farms and ranches of several western states. The snake dataset is available in the alr3 package (note that alr stands for applied linear regression):

> install.packages("alr3")
> library(alr3)
> data(snake)
> dim(snake)
[1] 17 2
> head(snake)
     X    Y
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

Now that we have 17 observations, data exploration can begin. But first, let's change X and Y to meaningful variable names, as follows:

> names(snake) <- c("content", "yield")
> attach(snake) # attach data with new names
> head(snake)
  content yield
1    23.1  10.5
2    32.8  16.7
3    31.8  18.2
4    32.0  17.0
5    30.4  16.3
6    24.0  10.5

> plot(content, yield,
       main = "Scatterplot of Snow vs. Yield",
       xlab = "water content of snow",
       ylab = "water yield")

The output of the preceding code is as follows:

This is an intriguing plot: the relationship looks broadly linear, with a slight curvilinear shape driven by two potential outliers, one at each extreme of the data.

To perform a linear regression in R, we use the lm() function to create a model in the standard form fit = lm(Y ~ X). You can then examine the fitted model and test your assumptions by calling various functions on it, starting with summary(), as in the following code:

> yield_fit <- lm(yield ~ content)

> summary(yield_fit)

Call:
lm(formula = yield ~ content)

Residuals:
    Min      1Q  Median      3Q     Max
-2.1793 -1.5149 -0.3624  1.6276  3.1973

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.72538    1.54882   0.468    0.646
content      0.49808    0.04952  10.058 4.63e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.743 on 15 degrees of freedom
Multiple R-squared: 0.8709, Adjusted R-squared: 0.8623
F-statistic: 101.2 on 1 and 15 DF, p-value: 4.632e-08

With the summary() function, we can examine several items, including the model specification, descriptive statistics about the residuals, the coefficients, the significance codes, and a summary of the model error and fit. Right now, let's focus on the coefficient estimates and see whether our predictor variable has a significant p-value and whether the overall model F-test has a significant p-value. Looking at the parameter estimates, the model tells us that yield is equal to 0.72538 plus 0.49808 times content. We can state that for every one-unit increase in content, the expected yield increases by 0.49808 units. The F-statistic is used to test the null hypothesis that all of the model's slope coefficients are zero.
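To make the fitted equation concrete, here is a minimal sketch that plugs a value into it by hand. The coefficients are copied from the summary() output above. The helper name predict_yield and the example content value of 30 are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary() output in the text
intercept <- 0.72538
slope     <- 0.49808

# Hypothetical helper: applies yield = intercept + slope * content
predict_yield <- function(content) {
  intercept + slope * content
}

# Predicted yield for an illustrative snow water content of 30
yield_at_30 <- predict_yield(30)
print(yield_at_30)    # 0.72538 + 0.49808 * 30 = 15.66778

# A one-unit increase in content changes the prediction by the slope
step_change <- predict_yield(31) - predict_yield(30)
print(step_change)
```

With the fitted model object available, predict(yield_fit, newdata = data.frame(content = 30)) should return essentially the same value, differing only by the rounding of the printed coefficients.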

Since the p-value of the F-test is highly significant, we can reject the null and move on to the t-test for content, which tests the null hypothesis that its coefficient is zero. Again, we can reject the null. Additionally, we can see the Multiple R-squared and Adjusted R-squared values. Adjusted R-squared will be covered under the multivariate regression topic, so let's zero in on Multiple R-squared; here, we see that it's 0.8709. In theory, it can range from zero to one and is a measure of the strength of the association between X and Y. The interpretation, in this case, is that the water content of snow explains 87 percent of the variation in the water yield. On a side note, in a univariate model such as this one, R-squared is nothing more than the square of the correlation coefficient between X and Y.
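The side note about R-squared can be checked numerically. The sketch below is self-contained, so it uses only the six rows printed by head(snake) earlier rather than the full 17-row dataset. Its numbers will therefore differ from the summary() output above, and the object names snake6, fit6, and s6 are assumptions made for illustration. It also checks a related fact for one-predictor models: the overall F-statistic equals the square of the slope's t-statistic, which is consistent with the reported values, since 10.058 squared is roughly 101.2.

```r
# Only the six rows shown by head(snake); NOT the full dataset
snake6 <- data.frame(
  content = c(23.1, 32.8, 31.8, 32.0, 30.4, 24.0),
  yield   = c(10.5, 16.7, 18.2, 17.0, 16.3, 10.5)
)

# Passing data = avoids relying on attach()
fit6 <- lm(yield ~ content, data = snake6)
s6   <- summary(fit6)

# R-squared versus the squared Pearson correlation
r_squared <- s6$r.squared
cor_sq    <- cor(snake6$content, snake6$yield)^2
print(all.equal(r_squared, cor_sq))

# Overall F-statistic versus the squared t-statistic of the slope
t_slope <- s6$coefficients["content", "t value"]
f_stat  <- unname(s6$fstatistic["value"])
print(all.equal(f_stat, t_slope^2))
```

Both comparisons use all.equal() rather than ==, because the two sides are computed along different numerical paths and may differ by floating-point rounding.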

We can recall our scatter plot and now add the best fit line produced by our model using the following code:

> plot(content, yield)

> abline(yield_fit, lwd = 3, col = "red")

The output of the preceding code is as follows:
