Explore the Boston housing dataset

  1. Navigate to the Data exploration subtopic in the Jupyter Notebook and run the cell containing df.describe().T:

This computes various properties including the mean, standard deviation, minimum, and maximum for each column. The resulting table gives a high-level idea of how each column is distributed. Note that we have taken the transpose of the result by adding a .T to the output; this swaps the rows and columns. Going forward with the analysis, we will specify a set of columns to focus on.
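To make the effect of .T concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and the name toy is hypothetical; it only stands in for df.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame; these values are made up.
toy = pd.DataFrame({
    'RM': [6.5, 6.4, 7.1, 6.9],
    'MEDV': [24.0, 21.6, 34.7, 33.4],
})

summary = toy.describe()       # one row per statistic, one column per variable
summary_t = toy.describe().T   # transposed: one row per variable

print(summary.shape)    # (8, 2)
print(summary_t.shape)  # (2, 8)
print(summary_t[['mean', 'min', 'max']])
```

With many columns, the transposed layout is usually easier to scan because each variable occupies a single row.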

  1. Run the cell where these "focus columns" are defined:
    cols = ['RM', 'AGE', 'TAX', 'LSTAT', 'MEDV'] 
  1. This subset of columns can be selected from df using square brackets. Display this subset of the DataFrame by running df[cols].head() :

As a reminder, let's recall what each of these columns is. From the dataset documentation, we have the following:

    • RM: average number of rooms per dwelling
    • AGE: proportion of owner-occupied units built prior to 1940
    • TAX: full-value property-tax rate per $10,000
    • LSTAT: % lower status of the population
    • MEDV: median value of owner-occupied homes in $1000's

To look for patterns in this data, we can start by calculating the pairwise correlations using pd.DataFrame.corr.

  1. Calculate the pairwise correlations for our selected columns by running the cell containing the following code:
   df[cols].corr()

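The resulting table shows the correlation score between each pair of columns. Large positive scores indicate a strong positive (that is, in the same direction) correlation, and large negative scores indicate a strong correlation in the opposite direction. As expected, we see maximum values of 1 on the diagonal, where each column is compared with itself.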
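The properties of the correlation table can be checked on data where the relationships are known in advance. The sketch below uses synthetic columns (x, y_up, and y_down are hypothetical names, not part of the housing dataset) rather than df.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'y_up' rises with x, 'y_down' falls with x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'y_up': 2 * x + rng.normal(scale=0.5, size=200),
    'y_down': -3 * x + rng.normal(scale=0.5, size=200),
})

corr = toy.corr()
print(corr.round(2))

# Each column is perfectly correlated with itself, and the matrix is symmetric.
diag_is_one = np.allclose(np.diag(corr.to_numpy()), 1.0)
is_symmetric = np.allclose(corr.to_numpy(), corr.to_numpy().T)
print(diag_is_one, is_symmetric)

# Rank the other columns by their correlation with 'x'.
print(corr['x'].drop('x').sort_values())
```

The same sorting idiom, applied to the MEDV column of df[cols].corr(), would list the focus columns from most negatively to most positively correlated with the house value.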

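The Pearson coefficient is defined as the covariance between two variables, divided by the product of their standard deviations:

    \rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}

The covariance, in turn, is defined as follows:

    \mathrm{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

Here, n is the number of samples, x_i and y_i are the individual samples being summed over, and \bar{x} and \bar{y} are the means of each set. (Some texts divide by n - 1 instead of n; the choice cancels in the ratio above as long as the standard deviations use the same convention.)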
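As a sanity check on this definition, the following sketch computes the coefficient by hand with NumPy and compares it with the pandas result. The function name pearson and the toy values are hypothetical, and the 1/n normalization is an assumption; it cancels in the ratio as long as the covariance and the standard deviations use the same convention.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))  # 1/n normalization (assumed)
    return cov / (x.std() * y.std())                # np.std also defaults to 1/n

# Made-up values standing in for two columns of the housing data.
toy = pd.DataFrame({
    'RM': [6.5, 6.4, 7.1, 6.9, 7.1],
    'MEDV': [24.0, 21.6, 34.7, 33.4, 36.2],
})

manual = pearson(toy['RM'], toy['MEDV'])
builtin = toy['RM'].corr(toy['MEDV'])  # pandas uses the Pearson method by default
print(manual, builtin)
```

The two numbers should agree to within floating-point error.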

Instead of straining our eyes to look at the preceding table, it's nicer to visualize it with a heatmap. This can be done easily with Seaborn.

  1. Run the next cell to initialize the plotting environment, as discussed earlier in the chapter. Then, to create the heatmap, run the cell containing the following code:
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    ax = sns.heatmap(df[cols].corr(),
                     cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
    ax.xaxis.tick_top()  # move labels to the top
    plt.savefig('../figures/chapter-1-boston-housing-corr.png',
                bbox_inches='tight', dpi=300)

We call sns.heatmap and pass the pairwise correlation matrix as input. We use a custom color palette here to override the Seaborn default. The function returns a Matplotlib Axes object, which is referenced by the variable ax. The final figure is then saved as a high-resolution PNG to the figures folder.

  1. For the final step in our dataset exploration exercise, we'll visualize our data using Seaborn's pairplot function. Run the cell containing the following code:
    sns.pairplot(df[cols],
                 plot_kws={'alpha': 0.6},
                 diag_kws={'bins': 30})

Whereas the heatmap gave a simple overview of the correlations, this plot lets us see the relationships in far more detail.
Looking at the histograms on the diagonal, we see the following:

    • a: RM and MEDV have the closest shape to normal distributions.
    • b: AGE is skewed to the left and LSTAT is skewed to the right (this may seem counterintuitive, but skew is named for the side with the longer tail, that is, where the mean sits relative to the peak of the distribution).
    • c: For TAX, we find that a large part of the distribution sits around 700. This is also evident from the scatter plots.
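The direction of skew can also be checked numerically rather than by eye. The sketch below uses synthetic samples (not the housing data) to show that a long right tail gives a positive skewness and pulls the mean above the median, while the mirror image does the opposite.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic samples, chosen only to illustrate the two shapes.
right_skewed = pd.Series(rng.exponential(scale=10.0, size=5000))  # long tail to the right
left_skewed = 100.0 - right_skewed                                # mirror image: long tail to the left

for name, s in [('right', right_skewed), ('left', left_skewed)]:
    print(name,
          'skew=%.2f' % s.skew(),
          'mean=%.1f' % s.mean(),
          'median=%.1f' % s.median())
```

Running df[cols].skew() on the housing DataFrame would give the same kind of summary for the focus columns.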

Taking a closer look at the MEDV histogram in the bottom right, we see something similar to TAX: there is a large bin at the upper limit, around $50,000. Recall that when we ran df.describe(), the min and max of MEDV were 5 and 50 (in $1000s), respectively. This suggests that median house values in the dataset were capped at $50,000.
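One quick way to test for this kind of ceiling is to count how many rows sit exactly at the maximum. The sketch below uses a made-up Series in $1000s (the name medv and its values are hypothetical); on the real data, df['MEDV'] would take its place.

```python
import pandas as pd

# Made-up values in $1000s, with several entries at the ceiling.
medv = pd.Series([21.6, 34.7, 50.0, 18.9, 50.0, 27.1, 50.0, 15.0])

cap = medv.max()
n_at_cap = int((medv == cap).sum())   # rows exactly at the maximum
share_at_cap = n_at_cap / len(medv)

print(cap, n_at_cap, share_at_cap)    # 50.0 3 0.375
```

A maximum shared by many rows, with nothing above it, is a typical sign of censored (capped) values rather than a natural peak.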
