官术网_书友最值得收藏!

Finding regression intervals

"Exploring the unknown requires tolerating uncertainty."
– Brian Greene

It's not always guaranteed that we have accurate models. Sometimes, our data is inherently noisy and we cannot model it using a regressor. In these cases, it is important to be able to quantify how certain we arein our estimations. Usually, regressors make point predictions. These are the expected values (typically the mean) of the target (y) at each value of x. A Bayesian ridge regressor is capable of returning the expected values as usual, yet it also returns the standard deviation of the target (y) at each value of x.

To demonstrate how this works, let's create a noisy dataset, where :

import numpy as np
import pandas as pd

df_noisy = pd.DataFrame(
{
'x': np.random.random_integers(0, 30, size=150),
'noise': np.random.normal(loc=0.0, scale=5.0, size=150)
}
)

df_noisy['y'] = df_noisy['x'] + df_noisy['noise']

Then, we can plot it in the form of a scatter plot:

df_noisy.plot(
kind='scatter', x='x', y='y'
)

Plotting the resulting data frame will give us the following plot:

Now, let's train two regressors on the same data—LinearRegression and BayesianRidge. I will stick to the default values for the Bayesian ridge hyperparameters here:

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge

lr = LinearRegression()
br = BayesianRidge()

lr.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_lr_pred'] = lr.predict(df_noisy[['x']])

br.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_br_pred'], df_noisy['y_br_std'] = br.predict(df_noisy[['x']], return_std=True)

Notice how the Bayesian ridge regressor returns two values when predicting.

The Bayesian approach to linear regression differs from the aforementioned algorithms in the way that it sees its coefficients. For all the algorithms we have seen so far, each coefficient takes a single value after training, but for a Bayesian model, a coefficient is rather a distribution with an estimated mean and standard deviation. A coefficient is initialized using a prior distribution, which gets updated by the training data to reach a posterior distribution via Bayes' theorem. The Bayesian ridge regressor is a regularized Bayesian regressor.

The predictions made by the two models are very similar. Nevertheless, we can use the standard deviation returned to calculate a range around the values that we expect most of the future data to fall into.The following code snippet creates plots for the two models and their predictions:

fig, axs = plt.subplots(1, 3, figsize=(16, 6), sharex=True, sharey=True)

# We plot the data 3 times
df_noisy.sort_values('x').plot(
title='Data', kind='scatter', x='x', y='y', ax=axs[0]
)
df_noisy.sort_values('x').plot(
kind='scatter', x='x', y='y', ax=axs[1], marker='o', alpha=0.25
)
df_noisy.sort_values('x').plot(
kind='scatter', x='x', y='y', ax=axs[2], marker='o', alpha=0.25
)

# Here we plot the Linear Regression predictions
df_noisy.sort_values('x').plot(
title='LinearRegression', kind='scatter', x='x', y='y_lr_pred',
ax=axs[1], marker='o', color='k', label='Predictions'
)

# Here we plot the Bayesian Ridge predictions
df_noisy.sort_values('x').plot(
title='BayesianRidge', kind='scatter', x='x', y='y_br_pred',
ax=axs[2], marker='o', color='k', label='Predictions'
)

# Here we plot the range around the expected values
# We multiply by 1.96 for a 95% Confidence Interval
axs[2].fill_between(
df_noisy.sort_values('x')['x'],
df_noisy.sort_values('x')['y_br_pred'] - 1.96 *
df_noisy.sort_values('x')['y_br_std'],
df_noisy.sort_values('x')['y_br_pred'] + 1.96 *
df_noisy.sort_values('x')['y_br_std'],
color="k", alpha=0.2, label="Predictions +/- 1.96 * Std Dev"
)

fig.show()

Running the preceding code gives us the following graphs. In the BayesianRidge case, the shaded area shows where we expect 95% of our targets to fall:

Regression intervals are handy when we want to quantify our uncertainties. In Chapter 8, Ensembles – When One Model Is Not Enough, we will revisit regression intervals

主站蜘蛛池模板: 舒城县| 邻水| 敦化市| 郧西县| 高邮市| 苗栗市| 双辽市| 靖西县| 波密县| 滨海县| 永丰县| 甘孜县| 大厂| 察雅县| 贡山| 鸡泽县| 宽甸| 沽源县| 抚宁县| 河东区| 娄烦县| 巧家县| 大冶市| 阿巴嘎旗| 德惠市| 西乡县| 内江市| 永安市| 南陵县| 景洪市| 临安市| 阿合奇县| 贺州市| 怀来县| 上栗县| 冕宁县| 嘉义市| 屏南县| 垫江县| 曲阳县| 连江县|