官术网_书友最值得收藏!

Building a univariate model

Our first case focuses on the goal of predicting the water yield (in inches) of the Snake River Watershed in Wyoming, USA, as a function of the water content of the year's snowfall. This forecast will be useful in managing the water flow and reservoir levels, as the Snake River provides much-needed irrigation water for the farms and ranches of several western states. The snake dataset is available in the alr3 package (note that alr stands for applied linear regression):

> install.packages("alr3")
> library(alr3)
> data(snake)
> dim(snake)
[1] 17 2
> head(snake)
X Y
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

Now that we have 17 observations, data exploration can begin. But first, let's change X and Y to meaningful variable names, as follows:

> names(snake) <- c("content", "yield")
> attach(snake) # attach data with new names
> head(snake)

content yield
1 23.1 10.5
2 32.8 16.7
3 31.8 18.2
4 32.0 17.0
5 30.4 16.3
6 24.0 10.5

> plot(content,
yield, main = "Scatterplot of Snow vs. Yield",
xlab = "water content of snow",
ylab = "water yield")

The output of the preceding code is as follows:

This is an intriguing plot as the data is linear and has a slight curvilinear shape driven by two potential outliers at both ends of the extreme. 

To perform a linear regression in R, we use the lm() function to create a model in the standard form of fit = lm(Y ~ X). You can then test your assumptions using various functions on your fitted model by using the following code:

> yield_fit <- lm(yield ~ content)

> summary(yield_fit)

Call:
lm(formula = yield ~ content)

Residuals:
Min 1Q Median 3Q Max
-2.1793 -1.5149 -0.3624 1.6276 3.1973

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.72538 1.54882 0.468 0.646
content 0.49808 0.04952 10.058 4.63e-08
***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05
'.' 0.1 ' ' 1

Residual standard error: 1.743 on 15 degrees of
freedom
Multiple R-squared: 0.8709, Adjusted R-squared:
0.8623
F-statistic: 101.2 on 1 and 15 DF, p-value:
4.632e-08

With the summary() function, we can examine some items, including the model specification, descriptive statistics about the residuals, the coefficients, codes to model significance, and a summary of the model error and fit. Right now, let's focus on the parameter coefficient estimates, and see whether our predictor variable has a significant p-value and whether the overall model F-test has a significant p-value. Looking at the parameter estimates, the model tells us that yield is equal to 0.72538 plus 0.49808 times content. We can state that for every one unit change in the content, the yield will increase by 0.49808 units. F-statistic is used to test the null hypothesis that the model coefficients are all zero.

Since p-value is highly significant, we can reject the null and move on to the t-test for content, which tests the null hypothesis that it's zero. Again, we can reject the null. Additionally, we can see the Multiple R-squared and Adjusted R-squared values. Adjusted R-squared will be covered under the multivariate regression topic, so let's zero in on Multiple R-squared; here, we see that it's 0.8709. In theory, it can range from zero to one and is a measure of the strength of the association between X and Y. The interpretation, in this case, is that the water content of snow can explain 87 percent of the variation in the water yield. On a side note, R-squared is nothing more than the correlation coefficient of [X, Y] squared.

We can recall our scatter plot and now add the best fit line produced by our model using the following code:

> plot(content, yield)

> abline(yield_fit, lwd = 3, col = "red")

The output of the preceding code is as follows:

主站蜘蛛池模板: 定结县| 商丘市| 武隆县| 郧西县| 微博| 利津县| 九江县| 五原县| 陕西省| 治县。| 额济纳旗| 水富县| 揭西县| 平罗县| 鲁山县| 布尔津县| 定结县| 永和县| 阿克苏市| 霍山县| 邵东县| 名山县| 吴堡县| 高雄县| 绩溪县| 新建县| 稻城县| 临沧市| 华蓥市| 永仁县| 湘潭县| 大埔县| 菏泽市| 闸北区| 阿拉善右旗| 双柏县| 体育| 筠连县| 名山县| 黄平县| 峨边|