
Graphing Anscombe's quartet

Anscombe's quartet is a classic example that illustrates why visualizing data is important. The quartet consists of four datasets with nearly identical summary statistics. Each dataset has a series of x values and dependent y values. We will tabulate these statistics in an IPython notebook. When you plot the datasets, however, they look surprisingly different from each other.

How to do it...

For this recipe, you need to perform the following steps:

  1. Start with the following imports:
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    from dautil import report
    from dautil import plotting
    import numpy as np
    from tabulate import tabulate
  2. Define the following function to compute the mean and variance of x and y, their correlation, and the slope and intercept of a linear fit for each of the datasets:
    def aggregate():
        df = sns.load_dataset("anscombe")

        # Mean and variance of x and y, with one column per dataset
        agg = df.groupby('dataset')\
            .agg(['mean', 'var'])\
            .transpose()
        groups = df.groupby('dataset')

        # Correlation between x and y within each dataset
        corr = [g['x'].corr(g['y']) for _, g in groups]
        builder = report.DFBuilder(agg.columns)
        builder.row(corr)

        # np.polyfit() with degree 1 returns (slope, intercept)
        fits = [np.polyfit(g['x'], g['y'], 1) for _, g in groups]
        builder.row([f[0] for f in fits])
        builder.row([f[1] for f in fits])
        bottom = builder.build(['corr', 'slope', 'intercept'])

        return df, pd.concat((agg, bottom))
  3. The following function returns a string that is partly Markdown, partly reStructuredText, and partly HTML, because core Markdown does not officially support tables:
    def generate(table):
        writer = report.RSTWriter()
        writer.h1('Anscombe Statistics')
        writer.add(tabulate(table, tablefmt='html', floatfmt='.3f'))
        
        return writer.rst
  4. Plot the data and corresponding linear fits with the Seaborn lmplot() function:
    def plot(df):
        sns.set(style="ticks")
        # Note: seaborn 0.9 renamed the size parameter to height
        g = sns.lmplot(x="x", y="y", col="dataset",
                       hue="dataset", data=df,
                       col_wrap=2, ci=None, palette="muted", height=4,
                       scatter_kws={"s": 50, "alpha": 1})
    
        plotting.embellish(g.fig.axes)
  5. Display a table with statistics, as follows:
    df, table = aggregate()
    from IPython.display import display_markdown
    display_markdown(generate(table), raw=True)

    The following table shows practically identical statistics for each dataset (I modified the custom.css file in my IPython profile to get the colors):

  6. The following lines plot the datasets:
    %matplotlib inline
    plot(df)

Refer to the following plot for the end result:

A picture says more than a thousand words. The source code is in the anscombe.ipynb file in this book's code bundle.
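
If you want to check the summary statistics without dautil, seaborn, or a network connection, the following is a minimal sketch that uses only the Python standard library. It is not the code from the book's bundle. The data values are typed in by hand and are assumed to match the anscombe dataset that Seaborn loads, and the names ANSCOMBE and summarize() are made up for this illustration. The slope and intercept come from the ordinary least squares formulas (slope = cov(x, y) / var(x)), which should agree with a degree 1 np.polyfit() fit:

```python
from statistics import mean, variance

# Anscombe's quartet, typed in by hand for this sketch (assumed to match
# the "anscombe" dataset that seaborn loads).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return sample statistics and a least squares line for one dataset."""
    n = len(x)
    mx, my = mean(x), mean(y)
    # Sample covariance (n - 1 in the denominator, like variance())
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    vx, vy = variance(x), variance(y)
    slope = cov / vx
    return {
        "mean_x": mx, "var_x": vx,
        "mean_y": my, "var_y": vy,
        "corr": cov / (vx * vy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


stats = {name: summarize(x, y) for name, (x, y) in ANSCOMBE.items()}
for name, s in stats.items():
    print(name, {k: round(v, 2) for k, v in s.items()})
```

Rounded to two decimals, the four rows should come out practically identical, which is the point of the quartet: the summary numbers alone cannot tell the datasets apart.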

See also
