- Python Data Analysis Cookbook
- Ivan Idris
- 2021-07-14 11:05:39
Graphing Anscombe's quartet
Anscombe's quartet is a classic example that illustrates why visualizing data is important. The quartet consists of four datasets with nearly identical summary statistics. Each dataset has a series of x values and dependent y values. We will tabulate these statistics in an IPython notebook. However, if you plot the datasets, they look surprisingly different from each other.
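To make "similar statistical properties" concrete, here is a minimal sketch that uses only the Python standard library. It assumes the quartet values are typed in directly rather than loaded through seaborn, and `summarize` is a hypothetical helper name, not part of the recipe's code.

```python
from statistics import mean, variance

# Anscombe's quartet; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return mean, sample variance, Pearson correlation, slope, intercept."""
    mx, my = mean(x), mean(y)
    # Sums of squared deviations and cross products around the means
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx  # least-squares slope of y on x
    return {
        "mean_x": mx, "mean_y": my,
        "var_x": variance(x), "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope, "intercept": my - slope * mx,
    }


summary = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimal places, the four rows printed by the loop are hard to tell apart, which is why the recipe goes on to plot the data.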
How to do it...
For this recipe, you need to perform the following steps:
- Start with the following imports:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from dautil import report
from dautil import plotting
import numpy as np
from tabulate import tabulate
- Define the following function to compute the mean and variance of x and y within each dataset, the correlation between them, and the slope and intercept of a linear fit:
def aggregate():
    df = sns.load_dataset("anscombe")
    # Mean and variance of x and y, with one column per dataset
    agg = df.groupby('dataset')\
        .agg(['mean', 'var'])\
        .transpose()
    groups = df.groupby('dataset')
    # Pearson correlation between x and y within each dataset
    corr = [g['x'].corr(g['y']) for _, g in groups]
    builder = report.DFBuilder(agg.columns)
    builder.row(corr)
    # For a degree-1 fit, np.polyfit returns (slope, intercept)
    fits = [np.polyfit(g['x'], g['y'], 1) for _, g in groups]
    builder.row([f[0] for f in fits])
    builder.row([f[1] for f in fits])
    bottom = builder.build(['corr', 'slope', 'intercept'])
    return df, pd.concat((agg, bottom))
- The following function returns a string that is partly Markdown, partly reStructuredText, and partly HTML, because core Markdown does not officially support tables:
def generate(table):
    writer = report.RSTWriter()
    writer.h1('Anscombe Statistics')
    writer.add(tabulate(table, tablefmt='html', floatfmt='.3f'))
    return writer.rst
- Plot the data and corresponding linear fits with the Seaborn lmplot() function:
def plot(df):
    sns.set(style="ticks")
    # Note: the size parameter was renamed to height in seaborn 0.9
    g = sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=df,
                   col_wrap=2, ci=None, palette="muted", size=4,
                   scatter_kws={"s": 50, "alpha": 1})
    plotting.embellish(g.fig.axes)
- Display a table with statistics, as follows:
df, table = aggregate()
from IPython.display import display_markdown
display_markdown(generate(table), raw=True)
The following table shows practically identical statistics for each dataset (I modified the custom.css file in my IPython profile to get the colors):
- The following lines plot the datasets:
%matplotlib inline
plot(df)
Refer to the following plot for the end result:

A picture says more than a thousand words. The source code is in the anscombe.ipynb file in this book's code bundle.
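If dautil or seaborn is not available, the fit and plot steps above can be approximated with NumPy and matplotlib alone. This is a rough sketch rather than the recipe's own code: it assumes the quartet values are typed in directly, `plot_quartet` is a hypothetical helper, and the styling added by plotting.embellish() is not reproduced.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def plot_quartet(quartet):
    """Scatter each dataset with its least-squares line on a 2x2 grid."""
    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 8))
    fits = {}
    for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
        # polyfit lists coefficients from the highest degree down,
        # so a degree-1 fit yields (slope, intercept).
        slope, intercept = np.polyfit(x, y, 1)
        fits[name] = (slope, intercept)
        xs = np.array([min(x), max(x)])
        ax.scatter(x, y, s=50)
        ax.plot(xs, slope * xs + intercept)
        ax.set_title(f"dataset = {name}")
    return fig, fits


fig, fits = plot_quartet(QUARTET)
```

The four fitted lines nearly coincide, while the scatter in each panel shows a very different pattern.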
See also
- The Anscombe's quartet Wikipedia page at https://en.wikipedia.org/wiki/Anscombe%27s_quartet (retrieved July 2015)
- The seaborn documentation for the lmplot() function at https://web.stanford.edu/~mwaskom/software/seaborn/generated/seaborn.lmplot.html (retrieved July 2015)