Regression Analysis with Python
Luca Massaron, Alberto Boschetti
Starting from the basics
We will start exploring the first dataset, the Boston dataset, but before delving into numbers, we will upload a series of helpful packages that will be used during the rest of the chapter:
In: import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib as mpl
If you are working from an IPython Notebook, running the following command in a cell will instruct the Notebook to represent any graphic output in the Notebook itself (otherwise, if you are not working on IPython, just ignore the command because it won't work in IDEs such as Python's IDLE or Spyder):
In: %matplotlib inline # If you are using IPython, this will make the images available in the Notebook
To immediately select the variables that we need, we just frame all the data available into a Pandas data structure, a DataFrame. Inspired by a similar data structure present in the R statistical language, a DataFrame renders data vectors of different types easy to handle under the same dataset variable, offering at the same time very convenient functionality for handling missing values and manipulating data:
In: dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
    dataset['target'] = boston.target
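The boston object used above is assumed to have been loaded earlier in the chapter; a minimal sketch of that step, assuming scikit-learn's load_boston loader (available at the time of writing, but removed from scikit-learn 1.2 onwards):
In: # Assumption: the Boston dataset was loaded earlier in the chapter with
    # scikit-learn's load_boston (removed from scikit-learn 1.2 onwards).
    from sklearn.datasets import load_boston
    boston = load_boston()
    print(boston.data.shape)      # (506, 13): 506 tracts, 13 features
    print(boston.feature_names)   # includes 'RM', used later in this chapter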
At this point, we are ready to build our first regression model, learning directly from the data present in our Pandas DataFrame.
As we mentioned, linear regression is just a simple summation, but it is indeed not the simplest model possible. The simplest is the statistical mean. In fact, you can make your guess by always using the same constant number, and the mean fulfills such a role very well because it is a powerful descriptive number for summarizing data.
The mean works very well with normally distributed data, but it is often quite suitable even for different distributions. A normally distributed curve is a distribution of data that is symmetric and has certain characteristics regarding its shape (a certain height and spread).
The characteristics of a normal distribution are defined by formulas and there are appropriate statistical tests to find out if your variable is normal or not, since many other distributions resemble the bell shape of the normal one and many different normal distributions are generated by different mean and variance parameters.
The key to understanding if a distribution is normal is the probability density function (PDF), a function describing the probability of values in the distribution.
In the case of a normal distribution, the PDF is as follows:

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)

In such a formulation, the symbol μ represents the mean (which coincides with the median and the mode) and the symbol σ is the standard deviation (its square, σ², is the variance). Based on different means and standard deviations, we can calculate different value distributions, as the following code demonstrates and visualizes:
In: import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm
    x = np.linspace(-4, 4, 100)
    # Each pair is (mean, standard deviation). The original code used
    # matplotlib.mlab.normpdf, which has been removed from recent matplotlib
    # releases, so scipy.stats.norm.pdf is used here instead.
    for mean, std in [(0, 0.7), (0, 1), (1, 1.5), (-2, 0.5)]:
        plt.plot(x, norm.pdf(x, mean, std))
    plt.show()
Because of its properties, the normal distribution is a fundamental distribution in statistics, since many statistical models assume that the variables involved are normally distributed. In particular, when the mean is zero and the variance is one (unit variance), the normal distribution, called a standard normal distribution under such conditions, has even more favorable characteristics for statistical models.
Anyway, in the real world, normally distributed variables are rare. Consequently, it is important to verify that the actual distribution we are working on is not too far from an ideal normal one, or problems will arise in your expected results. Normally distributed variables are an important requirement for statistical models (such as the mean and, in certain respects, linear regression). On the contrary, machine learning models do not depend on any prior assumption about how your data should be distributed. But, as a matter of fact, even machine learning models work better if the data has certain characteristics, so working with a normally distributed variable is preferable to other distributions. Throughout the book, we will provide warnings about what to look for and check when building and applying machine learning solutions.
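Since normality tests were mentioned above, here is a brief illustrative sketch (an addition, not the book's code) that applies SciPy's D'Agostino-Pearson test to the target variable defined earlier:
In: # Illustrative addition: a quick normality check with SciPy's
    # D'Agostino-Pearson test; a small p-value suggests the target
    # deviates from a normal distribution.
    from scipy.stats import normaltest
    statistic, p_value = normaltest(dataset['target'])
    print('statistic: %0.3f, p-value: %0.5f' % (statistic, p_value))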
For the calculation of a mean, relevant problems can arise if the distribution is not symmetric and there are extreme cases. In such a situation, the extreme cases will tend to draw the mean estimate towards them, so that it won't match the bulk of the data. Let's then calculate the mean of the target value over the 506 tracts in Boston:
In: mean_expected_value = dataset['target'].mean()
In this case, we calculated the mean using a method available in the Pandas DataFrame; however, the NumPy function mean can also be called to calculate a mean from an array of data:
In: np.mean(dataset['target'])
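As a quick sanity check (an addition, not from the book), the two calls return the same value:
In: # The Pandas method and the NumPy function compute the same mean.
    assert np.isclose(np.mean(dataset['target']), dataset['target'].mean())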
In terms of a mathematical formulation, we can express this simple solution as follows:

\hat{y}_i = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
We can now evaluate the results by measuring the error produced in predicting the real y values by this rule. Statistics suggests that, to measure the difference between the prediction and the real value, we should square the differences and then sum them all. This is called the sum of squared errors (SSE):
In: Squared_errors = pd.Series(mean_expected_value - dataset['target'])**2
    SSE = np.sum(Squared_errors)
    print('Sum of Squared Errors (SSE): %01.f' % SSE)
Now that we have calculated it, we can visualize it as a distribution of errors:
In: density_plot = Squared_errors.plot(kind='hist')
The plot shows how frequent certain errors are with respect to their values. You will immediately notice that most errors are around zero (there is a high density around that value). Such a situation can be considered a good one, since in most cases the mean is a good approximation, but some errors are really very far from zero and they can attain considerable values (don't forget that the errors are squared, so their effect is emphasized). When trying to predict such values, our approach will surely lead to a relevant error, and we should find a way to minimize it using a more sophisticated approach.
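Before moving on, a brief numerical aside (not from the book): among all possible constant guesses, the mean is the one that minimizes the SSE, which is why it makes a sensible baseline. A minimal sketch using the variables defined above:
In: # Illustrative addition: compare the SSE of the mean against two other
    # constant guesses; the mean always yields the smallest SSE.
    for guess in [mean_expected_value - 2, mean_expected_value,
                  mean_expected_value + 2]:
        sse = np.sum((dataset['target'] - guess)**2)
        print('constant %0.2f -> SSE %0.1f' % (guess, sse))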
A measure of linear relationship
Evidently, the mean is not a good representative of certain values, but it is certainly a good baseline to start from. An important problem with the mean is that it is fixed, whereas the target variable varies. However, if we assume that the target variable changes because of the effect of some other variable we are measuring, then we can adjust the mean with respect to the variations in that cause.
One improvement on our previous approach could be to build a mean conditional on certain values of another variable (or even more than one) actually related to our target, whose variation is somehow similar to the variation of the target one.
Intuitively, if we know the dynamics we want to predict with our model, we can try to look for variables that we know can impact the answer values.
In the real estate business, we actually know that usually the larger a house is, the more expensive it is; however, this rule is just part of the story, and the price is affected by many other considerations. For the moment, we will keep it simple and just assume that the extent of a house is a factor that positively affects the price; consequently, more space equals more costs when building the house (more land, more construction materials, more work, and consequently a higher price).
Now, we have a variable that we know should change with our target and we just need to measure it and extend our initial formula based on constant values with something else.
In statistics, there is a measure that quantifies how (in the sense of how much and in what direction) two variables relate to each other: correlation.
In computing a correlation, a few steps are to be considered. First, your variables have to be standardized (or your result won't be a correlation but a covariance, a measure of association that is affected by the scale of the variables you are working with).
In statistical Z-score standardization, you subtract from each variable its mean and then you divide the result by the standard deviation. The resulting transformed variable will have a mean of 0 and a standard deviation of 1 (or unit variance, since the variance is the squared standard deviation).
The formula for standardizing a variable is as follows:

z_i = \frac{x_i - \mu_x}{\sigma_x}
This can be achieved in Python using a simple function:
In: def standardize(x):
        return (x - np.mean(x)) / np.std(x)
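A quick sanity check of this function (an illustrative addition, not from the book): the standardized variable ends up with a mean of (numerically) zero and a standard deviation of one:
In: # Illustrative addition: after standardization the mean is close to zero
    # (up to floating-point error) and the standard deviation is one.
    z = standardize(dataset['target'])
    print('mean: %0.5f, std: %0.5f' % (np.mean(z), np.std(z)))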
After standardizing, for each observation you take the difference of each variable from its own mean. If the two differences agree in sign, their multiplication will be positive (evidence that they have the same directionality); however, if they differ in sign, the multiplication will turn negative. By summing all the multiplications of these differences, and dividing the total by the number of observations, you will finally get the correlation, which will be a number ranging from -1 to 1.
The absolute value of the correlation tells you the intensity of the relation between the two variables, 1 being a sign of a perfect match and zero a sign of no linear relation between them. The sign, instead, hints at the type of proportionality: positive is direct (when one grows, the other does the same), negative is inverse (when one grows, the other shrinks).
Covariance can be expressed as follows:

\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)
Whereas Pearson's correlation can be expressed as follows:

r_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}
Let's check these two formulations directly in Python. As you may have noticed, Pearson's correlation is really the covariance calculated on standardized variables, so we define the correlation function as a wrapper of both the covariance and standardize ones (you can find all these functions ready to be imported from SciPy; we are actually recreating them here just to help you understand how they work):
In: def covariance(variable_1, variable_2, bias=0):
        observations = float(len(variable_1))
        return np.sum((variable_1 - np.mean(variable_1)) *
                      (variable_2 - np.mean(variable_2))) / (observations - min(bias, 1))

    def standardize(variable):
        return (variable - np.mean(variable)) / np.std(variable)

    def correlation(var1, var2, bias=0):
        return covariance(standardize(var1), standardize(var2), bias)

    from scipy.stats import pearsonr
    print('Our correlation estimation: %0.5f' %
          correlation(dataset['RM'], dataset['target']))
    print('Correlation from Scipy pearsonr estimation: %0.5f' %
          pearsonr(dataset['RM'], dataset['target'])[0])

Out: Our correlation estimation: 0.69536
     Correlation from Scipy pearsonr estimation: 0.69536
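A small usage note (an addition, not from the book): calling covariance with bias=1 makes the denominator n-1 instead of n, which matches NumPy's default sample covariance:
In: # With bias=1 the denominator becomes n-1, matching np.cov's default
    # (ddof=1); with the default bias=0 it is n.
    print('%0.5f' % covariance(dataset['RM'], dataset['target'], bias=1))
    print('%0.5f' % np.cov(dataset['RM'], dataset['target'])[0, 1])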
Our correlation estimation for the relation between the value of the target variable and the average number of rooms in houses in the area is 0.695, which is positive and remarkably strong, since the maximum positive score of a correlation is 1.0.
Tip
As a way to estimate whether a correlation is relevant or not, just square it; the result represents the proportion of the variance shared by the two variables.
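For example, squaring the estimate obtained above, 0.695² ≈ 0.48, so the average number of rooms and the target house value share roughly 48 percent of their variance.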
Let's graph what happens when we correlate two variables. Using a scatterplot, we can easily visualize the two involved variables. A scatterplot is a graph where the values of two variables are treated as Cartesian coordinates; thus, for every (x, y) value a point is represented in the graph:
In: x_range = [dataset['RM'].min(), dataset['RM'].max()]
    y_range = [dataset['target'].min(), dataset['target'].max()]
    scatter_plot = dataset.plot(kind='scatter', x='RM', y='target',
                                xlim=x_range, ylim=y_range)
    meanY = scatter_plot.plot(x_range,
                              [dataset['target'].mean(), dataset['target'].mean()],
                              '--', color='red', linewidth=1)
    meanX = scatter_plot.plot([dataset['RM'].mean(), dataset['RM'].mean()],
                              y_range, '--', color='red', linewidth=1)
The scatterplot also plots the average value of both the target and the predictor variable as dashed lines. This divides the plot into four quadrants. If we compare it with the previous covariance and correlation formulas, we can understand why the correlation value was close to 1: in the bottom-right and top-left quadrants there are just a few mismatching points, where one of the variables is above its average while the other is below its own.
A perfect match (a correlation value of 1 or -1) is possible only when the points lie on a straight line (for a value of 1, all points would therefore be concentrated in the upper-right and lower-left quadrants). Thus, correlation is a measure of linear association, of how close to a straight line your points are. Ideally, having all your points on a single line favors a perfect mapping of your predictor variable to your target.