
Solving OLS for simple linear regression

In this section, we will work through solving OLS for simple linear regression. Recall that simple linear regression is given by the equation y = α + βx and that our goal is to solve for the values of β and α to minimize the cost function. We will solve for β first. To do so, we will calculate the variance of x and the covariance of x and y. Variance is a measure of how far a set of values are spread out. If all the numbers in the set are equal, the variance of the set is zero. A small variance indicates that the numbers are near the mean of the set, while a set containing numbers that are far from the mean and from each other will have a large variance. Variance can be calculated using the following equation:

$$\mathrm{var}(x) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

Here, $\bar{x}$ is the mean of x, $x_i$ is the value of x for the $i$th training instance, and $n$ is the number of training instances. Let's calculate the variance of the pizza diameters in our training set:

# In[2]: 
import numpy as np

X = np.array([[6], [8], [10], [14], [18]]).reshape(-1, 1)
x_bar = X.mean()
print(x_bar)

# Note that we subtract one from the number of training instances when
# calculating the sample variance. This technique is called Bessel's
# correction. It corrects the bias in the estimation of the population
# variance from a sample.
variance = ((X - x_bar)**2).sum() / (X.shape[0] - 1)
print(variance)

# Out[2]:
11.2
23.2

NumPy also provides the method var for calculating variance. Setting the keyword parameter ddof to 1 applies Bessel's correction to calculate the sample variance:

# In[3]:
print(np.var(X, ddof=1))

# Out[3]:
23.2

Covariance is a measure of how much two variables change together. If the variables increase together, their covariance is positive. If one variable tends to increase while the other decreases, their covariance is negative. If there is no linear relationship between the two variables, their covariance will be equal to zero; they are linearly uncorrelated but not necessarily independent. Covariance can be calculated using the following formula:

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

As with variance, $x_i$ is the diameter of the $i$th training instance, $\bar{x}$ is the mean of the diameters, $y_i$ is the price of the $i$th training instance, $\bar{y}$ is the mean of the prices, and $n$ is the number of training instances. Let's calculate the covariance of the diameters and prices of the pizzas in the training set:

# In[4]:
# We previously used a List to represent y.
# Here we switch to a NumPy ndarray, which provides a method to calculate the sample mean.
y = np.array([7, 9, 13, 17.5, 18])

y_bar = y.mean()
# We transpose X because both operands must be row vectors.
covariance = np.multiply((X - x_bar).transpose(), y - y_bar).sum() / (X.shape[0] - 1)
print(covariance)
print(np.cov(X.transpose(), y)[0][1])

# Out[4]:
22.65
22.65

Now that we have calculated the variance of our explanatory variable and the covariance of the response and explanatory variables, we can solve for β using the following:

$$\beta = \frac{\mathrm{cov}(x, y)}{\mathrm{var}(x)}$$

Having solved for β, we can solve for α using this formula:

$$\alpha = \bar{y} - \beta\bar{x}$$

Here, $\bar{y}$ is the mean of y and $\bar{x}$ is the mean of x; $(\bar{x}, \bar{y})$ are the coordinates of the centroid, a point that the model must pass through.
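The two formulas above can be evaluated in the same notebook style as the previous cells. This cell is not in the original text; it restates the data so the snippet runs on its own:

```python
# In[5]:
import numpy as np

X = np.array([[6], [8], [10], [14], [18]])
y = np.array([7, 9, 13, 17.5, 18])

x_bar = X.mean()
y_bar = y.mean()
variance = ((X - x_bar) ** 2).sum() / (X.shape[0] - 1)
covariance = ((X.flatten() - x_bar) * (y - y_bar)).sum() / (X.shape[0] - 1)

# beta is the ratio of the covariance to the variance;
# alpha then follows from the centroid equation.
beta = covariance / variance
alpha = y_bar - beta * x_bar
print(round(beta, 4))
print(round(alpha, 4))

# Out[5]:
# 0.9763
# 1.9655
```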

Now that we have solved for the values of the model's parameters that minimize the cost function, we can plug in the diameters of the pizzas and predict their prices. For instance, an 11" pizza should be expected to cost about $12.70, and an 18" pizza should be expected to cost $19.54. Congratulations! You used simple linear regression to predict the price of a pizza.
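As an optional cross-check (not part of the original text), NumPy's polyfit performs the same ordinary least squares fit directly and reproduces these prices:

```python
# In[6]:
import numpy as np

x = np.array([6, 8, 10, 14, 18])
y = np.array([7, 9, 13, 17.5, 18])

# A degree-1 polyfit is an ordinary least squares fit of a line;
# it returns the coefficients [slope, intercept].
beta, alpha = np.polyfit(x, y, 1)

# Predict the prices of 11" and 18" pizzas.
for diameter in (11, 18):
    print('%d": $%.2f' % (diameter, alpha + beta * diameter))

# Out[6]:
# 11": $12.70
# 18": $19.54
```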
