- Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits
- Tarek Amr
Finding regression intervals
It's not always guaranteed that we have accurate models. Sometimes, the data is inherently noisy and cannot be modeled well by a regressor. In these cases, it is important to be able to quantify how certain we are in our estimates. Usually, regressors make point predictions: the expected values (typically the mean) of the target (y) at each value of x. A Bayesian ridge regressor returns the expected values as usual, yet it also returns the standard deviation of the target (y) at each value of x.
To demonstrate how this works, let's create a noisy dataset, where y = x + noise, and noise is drawn from a normal distribution:
import numpy as np
import pandas as pd

df_noisy = pd.DataFrame(
    {
        # np.random.random_integers is deprecated; randint's upper bound
        # is exclusive, so we use 31 to keep the same 0..30 range
        'x': np.random.randint(0, 31, size=150),
        'noise': np.random.normal(loc=0.0, scale=5.0, size=150)
    }
)
df_noisy['y'] = df_noisy['x'] + df_noisy['noise']
Then, we can plot it in the form of a scatter plot:
df_noisy.plot(
    kind='scatter', x='x', y='y'
)
Plotting the resulting data frame will give us the following plot:

[Figure: scatter plot of the noisy data, y versus x]
Now, let's train two regressors on the same data—LinearRegression and BayesianRidge. I will stick to the default values for the Bayesian ridge hyperparameters here:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge

lr = LinearRegression()
br = BayesianRidge()

# The linear regressor returns point predictions only
lr.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_lr_pred'] = lr.predict(df_noisy[['x']])

# With return_std=True, the Bayesian ridge regressor also returns
# the standard deviation of its predictive distribution
br.fit(df_noisy[['x']], df_noisy['y'])
df_noisy['y_br_pred'], df_noisy['y_br_std'] = br.predict(
    df_noisy[['x']], return_std=True
)
Notice how the Bayesian ridge regressor returns two arrays when predicting: the expected values and their standard deviations.
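Before moving on, here is a minimal sanity check that quantifies how close the two sets of point predictions are; the mean absolute difference should be small compared to the noise's standard deviation:
# Quick sanity check: the average gap between the two models'
# point predictions should be tiny relative to the noise scale
print(
    'Mean absolute difference between the two predictions:',
    (df_noisy['y_lr_pred'] - df_noisy['y_br_pred']).abs().mean()
)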
The predictions made by the two models are very similar. Nevertheless, we can use the returned standard deviation to calculate a range that we expect most of the future data to fall into. The following code snippet creates plots for the two models and their predictions:
import matplotlib.pyplot as plt

df_sorted = df_noisy.sort_values('x')

fig, axs = plt.subplots(1, 3, figsize=(16, 6), sharex=True, sharey=True)

# We plot the data 3 times
df_sorted.plot(
    title='Data', kind='scatter', x='x', y='y', ax=axs[0]
)
df_sorted.plot(
    kind='scatter', x='x', y='y', ax=axs[1], marker='o', alpha=0.25
)
df_sorted.plot(
    kind='scatter', x='x', y='y', ax=axs[2], marker='o', alpha=0.25
)

# Here we plot the linear regression predictions
df_sorted.plot(
    title='LinearRegression', kind='scatter', x='x', y='y_lr_pred',
    ax=axs[1], marker='o', color='k', label='Predictions'
)

# Here we plot the Bayesian ridge predictions
df_sorted.plot(
    title='BayesianRidge', kind='scatter', x='x', y='y_br_pred',
    ax=axs[2], marker='o', color='k', label='Predictions'
)

# Here we plot the range around the expected values
# We multiply by 1.96 for a 95% confidence interval
axs[2].fill_between(
    df_sorted['x'],
    df_sorted['y_br_pred'] - 1.96 * df_sorted['y_br_std'],
    df_sorted['y_br_pred'] + 1.96 * df_sorted['y_br_std'],
    color='k', alpha=0.2, label='Predictions +/- 1.96 * Std Dev'
)

plt.show()
Running the preceding code gives us the following graphs. In the BayesianRidge case, the shaded area shows where we expect 95% of our targets to fall; under a normal distribution, about 95% of values lie within 1.96 standard deviations of the mean, which is why the code multiplies the standard deviation by 1.96:

[Figure: three panels titled Data, LinearRegression, and BayesianRidge; the third panel shows the predictions with the 95% interval shaded]
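We can also put that 95% figure to a quick empirical test; assuming the Gaussian predictive distribution holds, the fraction of observed targets inside the band should land near 0.95:
# Empirical coverage of the 95% interval; with Gaussian noise,
# this fraction should be close to 0.95
lower = df_noisy['y_br_pred'] - 1.96 * df_noisy['y_br_std']
upper = df_noisy['y_br_pred'] + 1.96 * df_noisy['y_br_std']
within = (df_noisy['y'] >= lower) & (df_noisy['y'] <= upper)
print('Fraction of targets inside the interval:', within.mean())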
Regression intervals are handy when we want to quantify our uncertainties. In Chapter 8, Ensembles – When One Model Is Not Enough, we will revisit regression intervals.