Highlighting data points with influence plots

Influence plots combine three quantities for each data point: the residual after a fit, the leverage, and the influence. In that respect they resemble bubble plots. The size of the residual is plotted on the vertical axis and can indicate that a data point is an outlier. To understand influence plots, take a look at the following equations:
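In standard textbook notation, the quantities behind an influence plot can be written as below. This is a sketch under stated assumptions: the numbering is assumed to match the references (2.1) to (2.5) in the text, X is taken to be the design matrix, h_ii the i-th diagonal element of the hat matrix, the hatted epsilon the residual of observation i, s squared the usual residual variance estimate, and a subscript (i) marks a quantity computed from a fit that leaves out observation i. Whether p counts the intercept depends on the convention in use.

```latex
\begin{align}
t_i &= \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}} \tag{2.1} \\
\hat{\sigma}_{(i)}^2 &= \frac{1}{n - p - 1} \sum_{j \neq i} \hat{\varepsilon}_{j(i)}^{\,2} \tag{2.2} \\
H &= X \left(X^{T} X\right)^{-1} X^{T} \tag{2.3} \\
D_i &= \frac{\hat{\varepsilon}_i^{\,2}}{p\, s^2} \cdot \frac{h_{ii}}{\left(1 - h_{ii}\right)^2} \tag{2.4} \\
\mathrm{DFFITS}_i &= \frac{\hat{y}_i - \hat{y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} \tag{2.5}
\end{align}
```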

According to the statsmodels documentation, the residuals are scaled by their standard deviation (2.1). In (2.2), n is the number of observations and p is the number of regressors. The so-called hat matrix is given by (2.3).

The diagonal elements of the hat matrix give a special metric called leverage. Leverage is plotted on the horizontal axis of an influence plot and indicates the potential influence of a data point. The influence itself determines the size of the plotted points. Influential points tend to have both high residuals and high leverage. To measure influence, statsmodels can use either Cook's distance (2.4) or DFFITS (2.5).
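To make these definitions concrete, the following minimal sketch computes leverage, studentized residuals, Cook's distance, and DFFITS with plain NumPy. The data is synthetic and purely illustrative (it is not the World Bank data used in this recipe), and `k` here counts all columns of the design matrix including the intercept, which is an assumption about the convention.

```python
import numpy as np

# Synthetic, hypothetical data: an intercept plus two regressors
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[0] += 5.0  # plant an artificial outlier in the first observation

k = X.shape[1]  # number of fitted parameters, intercept included

# Hat matrix H = X (X'X)^-1 X'; its diagonal holds the leverages
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Ordinary residuals and the usual residual variance estimate
resid = y - H @ y
s2 = resid @ resid / (n - k)

# Internally studentized residuals and Cook's distance
r_int = resid / np.sqrt(s2 * (1 - leverage))
cooks_d = r_int**2 / k * leverage / (1 - leverage)

# Leave-one-out variance, externally studentized residuals, and DFFITS
s2_loo = (resid @ resid - resid**2 / (1 - leverage)) / (n - k - 1)
r_ext = resid / np.sqrt(s2_loo * (1 - leverage))
dffits = r_ext * np.sqrt(leverage / (1 - leverage))

print(np.column_stack([leverage, r_ext, cooks_d, dffits])[:5])
```

The leave-one-out variance uses a closed-form shortcut, so no model has to be refitted for each observation. The leverages always sum to the number of fitted parameters, which is a quick sanity check on the hat matrix.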

How to do it...

  1. The imports are as follows:
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from statsmodels.formula.api import ols
    from dautil import data
  2. Get the available country codes:
    dawb = data.Worldbank()
    
    countries = dawb.get_countries()[['name', 'iso2c']]
  3. Load the data from the World Bank:
    population = dawb.download(indicator=[dawb.get_name('pop_grow'), dawb.get_name('gdp_pcap'),
                                        dawb.get_name('primary_education')],
                             country=countries['iso2c'], start=2014, end=2014)
    
    population = dawb.rename_columns(population)
  4. Define an ordinary least squares model, as follows:
    population_model = ols("pop_grow ~ gdp_pcap + primary_education",
                           data=population).fit()
  5. Display an influence plot of the model using Cook's distance:
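    %matplotlib inline
    # A large figure keeps the point labels readable
    fig, ax = plt.subplots(figsize=(19.2, 14.4))
    # criterion="cooks" sizes the points by Cook's distance
    fig = sm.graphics.influence_plot(population_model, ax=ax, criterion="cooks")
    plt.grid()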

Refer to the following plot for the end result:

The code is in the highlighting_influence.ipynb file in this book's code bundle.
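The numbers behind the plot can also be inspected as a table. The following sketch uses the `get_influence()` method of a fitted statsmodels OLS result and its `summary_frame()` output. To stay self-contained it fits the same formula on synthetic stand-in data (an assumption for illustration only, not the World Bank download); in this recipe the analogous call would be `population_model.get_influence()`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in data with the same column names as in the recipe
rng = np.random.default_rng(42)
n = 50
df = pd.DataFrame({
    "gdp_pcap": rng.normal(size=n),
    "primary_education": rng.normal(size=n),
})
df["pop_grow"] = (1.0 + 0.5 * df["gdp_pcap"]
                  - 0.3 * df["primary_education"]
                  + rng.normal(scale=0.2, size=n))

model = ols("pop_grow ~ gdp_pcap + primary_education", data=df).fit()
influence = model.get_influence()

# One row per observation; keep the columns an influence plot is built from
summary = influence.summary_frame()
table = summary[["student_resid", "hat_diag", "cooks_d", "dffits"]]

# The five observations with the largest Cook's distance
top = table.sort_values("cooks_d", ascending=False).head(5)
print(top)
```

Here `hat_diag` corresponds to the horizontal axis of the influence plot, `student_resid` to the vertical axis, and `cooks_d` or `dffits` to the point size, depending on the chosen criterion.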

See also
