Starting from the basics

We will start exploring the first dataset, the Boston dataset, but before delving into the numbers, we will import a series of helpful packages that will be used during the rest of the chapter:

In: import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
 import matplotlib as mpl

If you are working from an IPython Notebook, running the following command in a cell will instruct the Notebook to display any graphic output inside the Notebook itself (otherwise, if you are not working in IPython, just ignore the command because it won't work in IDEs such as Python's IDLE or Spyder):

In: %matplotlib inline
 # If you are using IPython, this will make the images available in the Notebook

To immediately select the variables that we need, we just frame all the available data into a Pandas data structure, the DataFrame.

Inspired by a similar data structure in the R statistical language, a DataFrame makes data vectors of different types easy to handle under the same dataset variable, while offering a lot of convenient functionality for handling missing values and manipulating data:

In: # boston is assumed to have been loaded earlier, for example with:
 # from sklearn.datasets import load_boston; boston = load_boston()
 dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
 dataset['target'] = boston.target
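
As a quick check (a minimal sketch, not part of the original flow), you can inspect the resulting DataFrame with standard Pandas methods to confirm that the data has been framed as expected:

In: print (dataset.shape)                 # number of observations and columns
 print (dataset.head())                # first five rows of the DataFrame
 print (dataset.isnull().sum().sum())  # total count of missing values (zero for this dataset)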

At this point, we are ready to build our first regression model, learning directly from the data present in our Pandas DataFrame.

As we mentioned, linear regression is just a simple summation, yet it is not the simplest possible model. The simplest is the statistical mean: in fact, you can simply guess by always predicting the same constant number, and the mean fulfils such a role very well because it is a powerful descriptive statistic for summarizing data.

The mean works very well with normally distributed data, but it is often quite suitable for other distributions too. A normal distribution is symmetric and has specific characteristics regarding its shape (a certain height and spread).

The characteristics of a normal distribution are defined by precise formulas, and there are appropriate statistical tests to find out whether your variable is normal or not, since many other distributions resemble the bell shape of the normal one, and different mean and variance parameters generate many different normal distributions.
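
As an illustration of such a test (a minimal sketch, not used in the original text), the D'Agostino-Pearson test available in SciPy as scipy.stats.normaltest checks whether a sample departs from normality; a very small p-value suggests that the variable is unlikely to be normally distributed:

In: from scipy import stats
 # test of the null hypothesis that the sample comes from a normal distribution
 statistic, p_value = stats.normaltest(dataset['target'])
 print ('Normality test statistic: %0.3f, p-value: %0.5f' % (statistic, p_value))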

The key to understanding whether a distribution is normal is the probability density function (PDF), a function describing the relative likelihood of the different values in the distribution.

In the case of a normal distribution, the PDF is as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$

In such a formulation, the symbol μ represents the mean (which coincides with the median and the mode) and the symbol σ is the standard deviation (its square, σ², is the variance). Based on different means and standard deviations, we can calculate different value distributions, as the following code demonstrates and visualizes:

In: import matplotlib.pyplot as plt
 import numpy as np
 from scipy.stats import norm  # norm.pdf replaces matplotlib.mlab.normpdf, removed from recent matplotlib
 x = np.linspace(-4, 4, 100)
 # plot a normal PDF for each pair of mean and standard deviation
 for mean, std in [(0, 0.7), (0, 1), (1, 1.5), (-2, 0.5)]:
     plt.plot(x, norm.pdf(x, mean, std))
 plt.show()

(Figure: normal PDF curves for the different mean and standard deviation pairs)

Because of its properties, the normal distribution is a fundamental distribution in statistics, since many statistical models rely on working with normal variables. In particular, when the mean is zero and the variance is one (unit variance), the normal distribution, called a standard normal distribution under such conditions, has even more favorable characteristics for statistical models.

Anyway, in the real world, normally distributed variables are actually rare. Consequently, it is important to verify that the distribution we are working on is not too far from an ideal normal one; otherwise, it will pose problems for your expected results. Normally distributed variables are an important requirement for statistical models (such as the mean and, in certain respects, linear regression). Machine learning models, on the contrary, do not depend on any previous assumption about how your data should be distributed. Yet, as a matter of fact, even machine learning models work better if the data has certain characteristics, so a normally distributed variable is preferable to other distributions. Throughout the book, we will provide warnings about what to look for and check when building and applying machine learning solutions.

Relevant problems can arise when calculating a mean if the distribution is not symmetric and there are extreme cases. In such an occurrence, the extreme cases will tend to draw the mean estimate towards them and, consequently, away from the bulk of the data. Let's then calculate the mean of the target value for the 506 tracts in Boston:

In: mean_expected_value = dataset['target'].mean()

In this case, we calculated the mean using a method available in the Pandas DataFrame; however, the NumPy function mean can also be called to calculate a mean from an array of data:

In: np.mean(dataset['target'])
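
To illustrate the earlier point about extreme cases dragging the mean away from the bulk of the data, here is a small sketch with made-up numbers (not taken from the Boston dataset) comparing the mean and the median of a tiny sample before and after adding a single extreme value:

In: values = np.array([20.0, 21.0, 22.0, 23.0, 24.0])
 with_outlier = np.append(values, 200.0)  # add one extreme observation
 print ('Mean: %0.2f / with outlier: %0.2f' % (np.mean(values), np.mean(with_outlier)))
 print ('Median: %0.2f / with outlier: %0.2f' % (np.median(values), np.median(with_outlier)))

A single extreme value moves the mean from 22 to over 50, while the median barely changes, which is exactly the kind of distortion to watch out for in skewed distributions.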

In terms of a mathematical formulation, we can express this simple solution as follows:

$$\hat{y} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

We can now evaluate the results by measuring the error this rule produces when predicting the real y values. Statistics suggests that, to measure the difference between the predictions and the real values, we should square the differences and then sum them all. This is called the sum of squared errors (SSE):

In: Squared_errors = pd.Series(mean_expected_value -\
 dataset['target'])**2
 SSE = np.sum(Squared_errors)
 print ('Sum of Squared Errors (SSE): %01.f' % SSE)

Now that we have calculated it, we can visualize it as a distribution of errors:

In: density_plot = Squared_errors.plot(kind='hist')

(Figure: histogram of the squared errors)

The plot shows how frequent errors of a certain magnitude are. You will immediately notice that most errors are around zero (there is a high density around that value). Such a situation can be considered a good one, since in most cases the mean is a good approximation, but some errors are very far from zero and can reach considerable values (don't forget that the errors are squared, so the effect is emphasized). When trying to predict such values, our rule will surely lead to a relevant error, and we should find a way to minimize it using a more sophisticated approach.

A measure of linear relationship

Evidently, the mean is not a good representative of certain values, but it is certainly a good baseline to start from. An important problem with the mean is that it is fixed, whereas the target variable is changeable. However, if we assume that the target variable changes because of the effect of some other variable that we are measuring, then we can adjust the mean with respect to the variations in its cause.

One improvement on our previous approach could be to build a mean conditional on certain values of another variable (or even more than one) that is actually related to our target and whose variation is somehow similar to the variation of the target.

Intuitively, if we know the dynamics behind what we want to predict with our model, we can try to look for variables that we know can impact the target values.

In the real estate business, we actually know that, usually, the larger a house is, the more expensive it is; however, this rule is just part of the story, and the price is affected by many other considerations. For the moment, we will keep it simple and just assume that extra space in a house is a factor that positively affects the price: more space means higher building costs (more land, more construction materials, more work, and consequently a higher price).

Now, we have a variable that we know should change together with our target; we just need to measure it and extend our initial formula, based on a constant value, with something else.
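
As a rough sketch of the conditional mean idea introduced above (not part of the original example), we can split the tracts according to whether RM, the average number of rooms, is above its median, and compute a separate mean of the target for each group:

In: many_rooms = dataset['RM'] > dataset['RM'].median()
 # mean of the target, conditional on having more or fewer rooms than the median tract
 print (dataset.groupby(many_rooms)['target'].mean())

If the two conditional means differ noticeably, the average number of rooms indeed carries information about the target, and the next step is to quantify how strongly the two variables move together.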

In statistics, there is a measure that tells us how (in the sense of how much and in what direction) two variables relate to each other: correlation.

To compute a correlation, a few steps have to be considered. First, your variables have to be standardized (or your result will not be a correlation but a covariance, a measure of association that is affected by the scale of the variables you are working with).

In statistical Z-score standardization, you subtract its mean from each variable and then divide the result by the standard deviation. The resulting transformed variable will have a mean of 0 and a standard deviation of 1 (or unit variance, since variance is the squared standard deviation).

The formula for standardizing a variable is as follows:

$$z = \frac{x - \mu}{\sigma}$$

This can be achieved in Python using a simple function:

In: def standardize(x):
 return (x-np.mean(x))/np.std(x)

After standardizing, for each observation you take the difference of each variable from its own mean and multiply the two differences together. If the two differences agree in sign, their product will be positive (evidence that they move in the same direction); if they differ in sign, the product will be negative. By summing all these products and dividing the total by the number of observations, you will finally get the correlation, which is a number ranging from -1 to 1.

The absolute value of the correlation tells you the intensity of the relation between the two variables, 1 being the sign of a perfect match and zero the sign of no linear relation between them. The sign instead hints at the direction of the relation: positive is direct (when one grows, the other does the same), negative is inverse (when one grows, the other shrinks).

Covariance can be expressed as follows:

$$\mathrm{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$$

Whereas, Pearson's correlation can be expressed as follows:

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x\,\sigma_y}$$

Let's check these two formulations directly in Python. As you may have noticed, Pearson's correlation is really the covariance calculated on standardized variables, so we define the correlation function as a wrapper around the covariance and standardize functions (all these functions are ready to be imported from Scipy; we are recreating them here just to help you understand how they work):

In:
 def covariance(variable_1, variable_2, bias=0):
     observations = float(len(variable_1))
     return np.sum((variable_1 - np.mean(variable_1)) *
                   (variable_2 - np.mean(variable_2))) / (observations - min(bias, 1))

 def standardize(variable):
     return (variable - np.mean(variable)) / np.std(variable)

 def correlation(var1, var2, bias=0):
     return covariance(standardize(var1), standardize(var2), bias)

 from scipy.stats import pearsonr
 print ('Our correlation estimation: %0.5f' %
        (correlation(dataset['RM'], dataset['target'])))
 print ('Correlation from Scipy pearsonr estimation: %0.5f' %
        pearsonr(dataset['RM'], dataset['target'])[0])

Out: Our correlation estimation: 0.69536
 Correlation from Scipy pearsonr estimation: 0.69536

Our correlation estimation for the relation between the value of the target variable and the average number of rooms in houses in the area is 0.695, which is positive and remarkably strong, since the maximum positive score of a correlation is 1.0.

Tip

As a way to estimate whether a correlation is relevant or not, just square it; the result represents the proportion of variance shared by the two variables (multiply by 100 to express it as a percentage).
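
As a quick check of this tip (a short sketch using the correlation function defined above), we can square the correlation between RM and the target:

In: r = correlation(dataset['RM'], dataset['target'])
 print ('Share of variance in common: %0.3f' % (r ** 2))

With the 0.695 correlation reported above, this amounts to roughly 0.48, meaning the two variables share a bit less than half of their variance.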

Let's graph what happens when we correlate two variables. Using a scatterplot, we can easily visualize the two involved variables. A scatterplot is a graph where the values of two variables are treated as Cartesian coordinates; thus, for every pair of (x, y) values, a point is drawn in the graph:

In: x_range = [dataset['RM'].min(), dataset['RM'].max()]
 y_range = [dataset['target'].min(), dataset['target'].max()]
 scatter_plot = dataset.plot(kind='scatter', x='RM', y='target',
                             xlim=x_range, ylim=y_range)
 meanY = scatter_plot.plot(x_range, [dataset['target'].mean(),
                           dataset['target'].mean()], '--', color='red', linewidth=1)
 meanX = scatter_plot.plot([dataset['RM'].mean(), dataset['RM'].mean()],
                           y_range, '--', color='red', linewidth=1)

(Figure: scatterplot of RM versus the target, with dashed red lines marking the mean of each variable)

The scatterplot also shows the average value of both the target and the predictor variables as dashed lines. This divides the plot into four quadrants. If we compare it with the previous covariance and correlation formulas, we can understand why the correlation value is positive and fairly high: in the bottom-right and the top-left quadrants there are just a few mismatching points, where one of the variables is above its average while the other is below its own.

A perfect match (a correlation value of 1 or -1) is possible only when the points lie on a straight line (for a positive correlation, all points are therefore concentrated in the right-uppermost and left-lowermost quadrants). Thus, correlation is a measure of linear association, of how close to a straight line your points are. Ideally, having all your points on a single line favors a perfect mapping of your predictor variable to your target.
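
To make the notion of linear association concrete, here is a small sketch (not from the original text): a variable built as an exact linear transformation of RM lies perfectly on a straight line against RM, so its correlation with RM is 1 (up to floating-point precision), whereas the target, as we saw, only reaches about 0.695:

In: perfectly_linear = 2.0 * dataset['RM'] + 3.0  # illustrative slope and intercept
 print ('Correlation with an exact linear transform: %0.5f' %
        correlation(dataset['RM'], perfectly_linear))
 print ('Correlation with the target: %0.5f' %
        correlation(dataset['RM'], dataset['target']))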
