
Finding regression intervals

"Exploring the unknown requires tolerating uncertainty."
– Brian Greene

We are not always guaranteed to have accurate models. Sometimes, our data is inherently noisy and no regressor can model it precisely. In these cases, it is important to be able to quantify how certain we are in our estimations. Usually, regressors make point predictions. These are the expected values (typically the mean) of the target (y) at each value of x. A Bayesian ridge regressor returns the expected values as usual, yet it can also return the standard deviation of the target (y) at each value of x.

To demonstrate how this works, let's create a noisy dataset, where the target y is x plus normally distributed noise:

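import numpy as np
import pandas as pd

df_noisy = pd.DataFrame(
    {
        # np.random.random_integers is deprecated; randint excludes its
        # upper bound, so 31 yields integers from 0 to 30 inclusive
        'x': np.random.randint(0, 31, size=150),
        'noise': np.random.normal(loc=0.0, scale=5.0, size=150)
    }
)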

df_noisy['y'] = df_noisy['x'] + df_noisy['noise']

Then, we can plot it in the form of a scatter plot:

df_noisy.plot(
    kind='scatter', x='x', y='y'
)

Plotting the resulting data frame will give us the following plot:

Now, let's train two regressors on the same data—LinearRegression and BayesianRidge. I will stick to the default values for the Bayesian ridge hyperparameters here:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge

lr = LinearRegression()
br = BayesianRidge()

lr.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_lr_pred'] = lr.predict(df_noisy[['x']])

br.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_br_pred'], df_noisy['y_br_std'] = br.predict(df_noisy[['x']], return_std=True)

Notice how the Bayesian ridge regressor returns two values when predicting.

The Bayesian approach to linear regression differs from the aforementioned algorithms in the way that it sees its coefficients. For all the algorithms we have seen so far, each coefficient takes a single value after training, but for a Bayesian model, a coefficient is rather a distribution with an estimated mean and standard deviation. A coefficient is initialized using a prior distribution, which gets updated by the training data to reach a posterior distribution via Bayes' theorem. The Bayesian ridge regressor is a regularized Bayesian regressor.
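To make the idea of a coefficient as a distribution more concrete, here is a minimal sketch that reads the posterior mean and standard deviation of the slope from a fitted model. The toy data and the names rng, coef_mean, and coef_std are assumptions made for illustration; coef_ and sigma_ are attributes of scikit-learn's BayesianRidge, where sigma_ holds the estimated variance-covariance matrix of the weights.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Hypothetical toy data mirroring this section's setup: y = x + Gaussian noise
rng = np.random.default_rng(0)
x = rng.integers(0, 31, size=150).astype(float)
y = x + rng.normal(loc=0.0, scale=5.0, size=150)

br = BayesianRidge()
br.fit(x.reshape(-1, 1), y)

# Posterior mean of the single slope coefficient
coef_mean = br.coef_[0]

# sigma_ is the estimated variance-covariance matrix of the weights;
# the square root of its diagonal gives a posterior standard deviation
coef_std = np.sqrt(br.sigma_[0, 0])

print(f"slope: {coef_mean:.3f} +/- {coef_std:.3f}")
```

With 150 samples, the posterior standard deviation of the slope should be small compared with the slope itself, which is one reason the two models end up making such similar predictions.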

The predictions made by the two models are very similar. Nevertheless, we can use the returned standard deviation to calculate a range around the predictions into which we expect most future data to fall. The following code snippet creates plots for the two models and their predictions:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 3, figsize=(16, 6), sharex=True, sharey=True)

# Sort once so that the shaded band is drawn from left to right
df_sorted = df_noisy.sort_values('x')

# We plot the data 3 times
df_sorted.plot(
    title='Data', kind='scatter', x='x', y='y', ax=axs[0]
)
df_sorted.plot(
    kind='scatter', x='x', y='y', ax=axs[1], marker='o', alpha=0.25
)
df_sorted.plot(
    kind='scatter', x='x', y='y', ax=axs[2], marker='o', alpha=0.25
)

# Here we plot the Linear Regression predictions
df_sorted.plot(
    title='LinearRegression', kind='scatter', x='x', y='y_lr_pred',
    ax=axs[1], marker='o', color='k', label='Predictions'
)

# Here we plot the Bayesian Ridge predictions
df_sorted.plot(
    title='BayesianRidge', kind='scatter', x='x', y='y_br_pred',
    ax=axs[2], marker='o', color='k', label='Predictions'
)

# Here we plot the range around the expected values
# We multiply by 1.96 for a 95% interval under a normal distribution
axs[2].fill_between(
    df_sorted['x'],
    df_sorted['y_br_pred'] - 1.96 * df_sorted['y_br_std'],
    df_sorted['y_br_pred'] + 1.96 * df_sorted['y_br_std'],
    color="k", alpha=0.2, label="Predictions +/- 1.96 * Std Dev"
)

fig.show()

Running the preceding code gives us the following graphs. In the BayesianRidge case, the shaded area shows where we expect 95% of our targets to fall:
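One way to sanity-check the 95% claim is to count how many targets actually fall inside the band. The following is a rough sketch on freshly generated toy data; the seed and the names lower, upper, and coverage are assumptions made for illustration. It measures in-sample coverage only, so it is an optimistic check and not a proper validation.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Hypothetical toy data with the same shape as df_noisy: y = x + noise
rng = np.random.default_rng(42)
x = rng.integers(0, 31, size=150).astype(float)
y = x + rng.normal(loc=0.0, scale=5.0, size=150)
X = x.reshape(-1, 1)

br = BayesianRidge().fit(X, y)
y_pred, y_std = br.predict(X, return_std=True)

# Bounds of the band: prediction +/- 1.96 standard deviations
lower = y_pred - 1.96 * y_std
upper = y_pred + 1.96 * y_std

# Fraction of observed targets that land inside the band
coverage = np.mean((y >= lower) & (y <= upper))
print(f"Share of targets inside the band: {coverage:.2%}")
```

If the normality assumption behind the 1.96 multiplier holds, the printed share should be close to 95%. A much lower value would suggest that the noise is not Gaussian or that the model is a poor fit.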

Regression intervals are handy when we want to quantify our uncertainties. In Chapter 8, Ensembles – When One Model Is Not Enough, we will revisit regression intervals.
