
Correlating a binary and a continuous variable with the point-biserial correlation

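The point-biserial correlation measures the association between a binary variable Y and a continuous variable X. The coefficient is calculated as follows:

$$ r_{pb} = \frac{M_1 - M_0}{s_n} \sqrt{\frac{n_1 n_0}{n^2}} $$ (3.21)

The subscripts in (3.21) correspond to the two groups of the binary variable. M1 is the mean of X for values corresponding to group 1 of Y. M0 is the mean of X for values corresponding to group 0 of Y. n1 and n0 are the sizes of the two groups, n is the total number of observations, and s_n is the population standard deviation of X.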
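
As a rough check of the definition, the coefficient can be computed directly from the two group means and compared with scipy.stats.pointbiserialr(). The following is a minimal sketch, not part of the recipe: the helper point_biserial() and the synthetic rain and temperature arrays are assumptions made up for illustration, and the helper uses the standard form of the coefficient, (M1 - M0) / s_n * sqrt(n1 * n0 / n^2), with s_n the population standard deviation of X.

```python
import numpy as np
from scipy import stats


def point_biserial(binary, continuous):
    """Compute the point-biserial coefficient from the group means.

    Hypothetical helper for illustration; not part of SciPy or dautil.
    """
    binary = np.asarray(binary, dtype=bool)
    continuous = np.asarray(continuous, dtype=float)

    n1 = binary.sum()                  # size of group 1
    n0 = (~binary).sum()               # size of group 0
    n = n1 + n0
    m1 = continuous[binary].mean()     # mean of X where Y is 1
    m0 = continuous[~binary].mean()    # mean of X where Y is 0
    s_n = continuous.std(ddof=0)       # population standard deviation of X

    return (m1 - m0) / s_n * np.sqrt(n1 * n0 / n ** 2)


# Synthetic data, invented for this example only
rng = np.random.default_rng(42)
rain = rng.random(500) < 0.4
temp = 10 + 3 * rain + rng.normal(0, 5, size=500)

manual = point_biserial(rain, temp)
scipy_r, p_value = stats.pointbiserialr(rain.astype(int), temp)
print(manual, scipy_r)
```

The two values should agree up to floating-point error, because the point-biserial coefficient is the Pearson correlation applied to a variable coded as 0 and 1.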

In this recipe, the binary variable we will use is rain or no rain. We will correlate this variable with temperature.

How to do it...

We will calculate the correlation with the scipy.stats.pointbiserialr() function. We will also compute the rolling correlation over a 2-year window, using the np.roll() function to shift the data. The steps are as follows:

  1. The imports are as follows:
    import dautil as dl
    from scipy import stats
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    from IPython.display import HTML
  2. Load the data and correlate the two relevant arrays:
    df = dl.data.Weather.load().dropna()
    df['RAIN'] = df['RAIN'] > 0
    
    stats_corr = stats.pointbiserialr(df['RAIN'].values, df['TEMP'].values)
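  3. Compute the 2-year rolling correlation as follows:
    N = 2 * 365
    rain = df['RAIN'].values
    temp = df['TEMP'].values
    corrs = []
    
    for i in range(len(df.index) - N):
        # A negative shift moves element i to the front, so the first
        # N elements form the window that starts at observation i
        x = np.roll(rain, -i)[:N]
        y = np.roll(temp, -i)[:N]
        corrs.append(stats.pointbiserialr(x, y)[0])
    
    # Average the rolling correlations per year; recent pandas versions
    # need the explicit .mean() after resample()
    corrs = pd.DataFrame(corrs,
                         index=df.index[N:],
                         columns=['Correlation']).resample('A').mean()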
  4. Plot the results with the following code:
    plt.plot(corrs.index.values, corrs.values)
    plt.hlines(stats_corr[0], corrs.index.values[0], corrs.index.values[-1],
               label='Correlation using the whole data set')
    plt.title('Rolling Point-biserial Correlation of Rain and Temperature with a 2 Year Window')
    plt.xlabel('Year')
    plt.ylabel('Correlation')
    plt.legend(loc='best')
    HTML(dl.report.HTMLBuilder().watermark())

Refer to the following screenshot for the end result (see correlating_pointbiserial.ipynb file in this book's code bundle):
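
The rolling computation can also be tried without the weather data. The following sketch uses synthetic daily series (made up for illustration, not the book's data set) and compares three ways of evaluating the same windows: plain slicing, np.roll() with a negative shift, and the pandas rolling Pearson correlation. It assumes that the observations are daily, so a 2-year window holds 2 * 365 values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic daily data standing in for the weather DataFrame (invented)
rng = np.random.default_rng(0)
index = pd.date_range('2000-01-01', periods=4 * 365, freq='D')
rain = (rng.random(len(index)) < 0.5).astype(int)
temp = 10 + 2 * rain + rng.normal(0, 4, size=len(index))

N = 2 * 365  # window length in days

# Window i covers observations i .. i + N - 1
sliced = np.array([
    stats.pointbiserialr(rain[i:i + N], temp[i:i + N])[0]
    for i in range(len(index) - N)
])

# np.roll() with a negative shift brings element i to the front,
# so the first N elements form the same window as the slice above
i = 100
rolled = stats.pointbiserialr(np.roll(rain, -i)[:N],
                              np.roll(temp, -i)[:N])[0]

# pandas computes the Pearson correlation over trailing windows;
# the value at position k covers observations k - N + 1 .. k
frame = pd.DataFrame({'RAIN': rain.astype(float), 'TEMP': temp},
                     index=index)
pandas_corr = frame['RAIN'].rolling(N).corr(frame['TEMP'])

print(sliced[i], rolled, pandas_corr.iloc[i + N - 1])
```

Slicing avoids copying the whole array for every window, which np.roll() does, and the pandas version avoids the Python loop altogether, so either may be preferable for long series.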

See also

  • The SciPy documentation for scipy.stats.pointbiserialr() (retrieved 2015).