
Correlating variables with the Spearman rank correlation

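The Spearman rank correlation is the Pearson correlation applied to the ranks of two variables rather than to their raw values. Ranks are the positions of values in sorted order. Items with equal values share a rank, which is the average of their positions. For instance, if two items of equal value occupy positions 2 and 3, both items get rank 2.5. Have a look at the following equations:

$$ r = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)} \tag{3.17} $$

$$ d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i) \tag{3.18} $$

$$ \sigma = \frac{0.6325}{\sqrt{n - 1}} \tag{3.19} $$

$$ z = \sqrt{\frac{n - 3}{1.06}}\, F(r) \tag{3.20} $$

In these equations, n is the sample size and d_i is the difference between the two ranks of the i-th observation. (3.17) shows how the correlation is calculated. (3.19) gives the standard error. (3.20) defines the z-score, which we assume to be normally distributed. F(r) is the same Fisher transformation as in (3.14), since this is the same correlation, only applied to ranks.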
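To make the ranking rule concrete, here is a minimal sketch that uses only the Python standard library. The helper names `average_ranks`, `pearson`, and `spearman` are illustrative and are not part of dautil or SciPy; in practice `scipy.stats.rankdata` and `scipy.stats.spearmanr` cover the same ground. The sketch assumes plain lists of numbers with no missing values.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# The two tied items in positions 2 and 3 both get rank 2.5
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]

# A monotonic but nonlinear relationship gives a coefficient close to 1
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))
```

For data without ties, the value returned by `spearman` agrees with the closed form based on squared rank differences.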

How to do it...

In this recipe, we calculate the Spearman correlation between wind speed and temperature, both aggregated by day of the year, together with the corresponding confidence interval. Then, we display the correlation matrix for all the weather data. The steps are as follows:

  1. The imports are as follows:
    import dautil as dl
    from scipy import stats
    import numpy as np
    import math
    import seaborn as sns
    import matplotlib.pyplot as plt
    # IPython.html.widgets moved to the ipywidgets package as of IPython 4.0
    from ipywidgets import widgets
    from IPython.display import display
    from IPython.display import HTML
  2. Define the following function to compute the confidence interval:
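    def get_ci(n, corr):
        # z-score of the Fisher-transformed correlation, as in (3.20)
        z = math.sqrt((n - 3)/1.06) * np.arctanh(corr)
        # Standard error of the correlation, as in (3.19)
        se = 0.6325/(math.sqrt(n - 1))
        # Bounds at the 2.5 and 97.5 percentiles of the normal distribution
        ci = z + np.array([-1, 1]) * se * stats.norm.ppf((1 + 0.95)/2)
    
        # Map the bounds back with tanh, the inverse of arctanh
        return np.tanh(ci)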
  3. Load the data and display widgets so that you can correlate a different pair if you want:
    df = dl.data.Weather.load().dropna()
    df = dl.ts.groupby_yday(df).mean()
    
    # value sets the initial selection; with a plain list of options,
    # each label is also its value
    drop1 = widgets.Dropdown(options=dl.data.Weather.get_headers(), 
                             value='TEMP', description='Variable 1')
    drop2 = widgets.Dropdown(options=dl.data.Weather.get_headers(), 
                             value='WIND_SPEED', description='Variable 2')
    display(drop1)
    display(drop2)
  4. Compute the Spearman rank correlation with SciPy:
    var1 = df[drop1.value].values
    var2 = df[drop2.value].values
    stats_corr = stats.spearmanr(var1, var2)
    dl.options.set_pd_options()
    html_builder = dl.report.HTMLBuilder()
    html_builder.h1('Spearman Correlation between {0} and {1}'.format(
        dl.data.Weather.get_header(drop1.value), dl.data.Weather.get_header(drop2.value)))
    html_builder.h2('scipy.stats.spearmanr()')
    dfb = dl.report.DFBuilder(['Correlation', 'p-value'])
    dfb.row([stats_corr[0], stats_corr[1]])
    html_builder.add_df(dfb.build())
  5. Compute the confidence interval as follows:
    n = len(df.index)
    # spearmanr() returns (correlation, p-value); pass only the correlation
    ci = get_ci(n, stats_corr[0])
    html_builder.h2('Confidence interval')
    dfb = dl.report.DFBuilder(['2.5 percentile', '97.5 percentile'])
    dfb.row(ci)
    html_builder.add_df(dfb.build())
  6. Display the correlation matrix as a Seaborn heatmap:
    corr = df.corr(method='spearman')
    
    %matplotlib inline
    plt.title('Spearman Correlation Matrix')
    sns.heatmap(corr)
    HTML(html_builder.html)

Refer to the following screenshot for the end result (see the correlating_spearman.ipynb file in this book's code bundle):
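The `get_ci()` function in step 2 adds a standard error on the correlation scale to a z-score that is already scaled by sqrt((n - 3)/1.06), so for large n its bounds tend to land close to -1 or 1 after the tanh step. As a cross-check, the following sketch shows a more conventional construction, offered as an assumption and not as this recipe's method: build the interval around arctanh(r) itself, with standard error sqrt(1.06/(n - 3)), and map the bounds back with tanh. The names `fisher_ci` and `z_score`, the `level` parameter, and the sample numbers are illustrative; only the standard library is used.

```python
import math
from statistics import NormalDist


def fisher_ci(n, corr, level=0.95):
    """Approximate confidence interval for a Spearman correlation.

    Assumes arctanh(corr) is roughly normal with variance 1.06/(n - 3).
    """
    center = math.atanh(corr)
    se = math.sqrt(1.06 / (n - 3))
    # Two-sided critical value, about 1.96 for a 95% interval
    crit = NormalDist().inv_cdf((1 + level) / 2)
    return math.tanh(center - crit * se), math.tanh(center + crit * se)


def z_score(n, corr):
    """z-score from (3.20), used to test the correlation against zero."""
    return math.sqrt((n - 3) / 1.06) * math.atanh(corr)


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Illustrative numbers only, not results from the weather data
low, high = fisher_ci(n=366, corr=-0.3)
print(low, high)
print(two_sided_p(z_score(366, -0.3)))
```

The interval always contains the observed correlation and narrows as n grows, which makes it easy to sanity-check against the values reported by the recipe.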

See also
