
Categorical Dependent and Numeric/Continuous Independent Variables

Hypotheses 1 and 2 have a continuous independent variable and a categorical dependent variable. With the figure in the previous section as a guide for choosing a test, we will use logistic regression here; the chi-squared test of independence applies when both variables are categorical. In hypothesis testing, we start by defining a null hypothesis and an alternate hypothesis. We take a negative approach, that is, we set the null hypothesis to be the outcome we don't expect: no relationship. The test then quantifies, as a probability (the p-value), how likely a pattern at least as strong as the one observed would be if it arose from random chance alone. If this probability is less than 5% (or another suitable cut-off), we reject the null hypothesis in favor of the alternate hypothesis.

Let's begin; for hypothesis 1, we define the following:

  • Null hypothesis: The campaign outcome has no relationship with the employee variance rate.
  • Alternate hypothesis: The campaign outcome has a relationship with the employee variance rate.

We test the validity of our null hypothesis with simple logistic regression. We will discuss this topic in more detail in the following chapters. For now, we will quickly perform a simple check to test our hypothesis. The following exercise leverages R's built-in function for performing logistic regression.

Exercise 36: Hypothesis 1 Testing for Categorical Dependent Variables and Continuous Independent Variables

To perform hypothesis testing for categorical dependent variables and continuous independent variables, we will use the glm() function to fit the logistic regression model (more on this in Chapter 5, Classification). This exercise will help us statistically test whether a categorical dependent variable (for example, y) has any relationship with a continuous independent variable (for example, emp.var.rate).

Perform the following steps to complete the exercise:

  1. Import the required libraries and create the DataFrame objects.
  2. First, convert the dependent variable into a factor type:

    df$y <- factor(df$y)

  3. Next, perform logistic regression:

    h.test <- glm(y ~ emp.var.rate, data = df, family = "binomial")

  4. Print the test summary:

    summary(h.test)

    The output is as follows:

    Call:
    glm(formula = y ~ emp.var.rate, family = "binomial", data = df)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.0047  -0.4422  -0.3193  -0.2941   2.5150

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -2.33228    0.01939 -120.31   <2e-16 ***
    emp.var.rate -0.56222    0.01018  -55.25   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 28999  on 41187  degrees of freedom
    Residual deviance: 25597  on 41186  degrees of freedom
    AIC: 25601

    Number of Fisher Scoring iterations: 5

We convert the target variable, y, to a factor (if it is not one already). We use the glm function provided by R to fit the logistic regression. Because glm also fits other kinds of generalized linear models, we specify family = "binomial" to request logistic regression. The formula, passed as the first argument, names the dependent variable to the left of ~ and the independent variable to the right.
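The same steps can be tried end to end without the campaign data. The following is a minimal sketch on simulated data: the data frame sim, the column x, the sample size, and the true slope of -0.5 are all assumptions made for illustration and are not taken from the exercise. It also shows one way to read the p-value out of the summary object rather than from the printed table.

```r
# Simulated stand-in for the campaign data (all values here are assumptions).
set.seed(42)
n <- 2000
sim <- data.frame(x = rnorm(n))

# The log-odds of "yes" fall as x rises; the slope -0.5 is an arbitrary choice.
sim$y <- factor(ifelse(runif(n) < plogis(-2 - 0.5 * sim$x), "yes", "no"))

# glm() models the probability of the second factor level ("yes" here).
levels(sim$y)

# Same call shape as in the exercise: formula first, then data and family.
fit <- glm(y ~ x, data = sim, family = "binomial")

# The coefficient table is a matrix, so the p-value can be indexed by name.
coef_table <- coef(summary(fit))
p_value <- coef_table["x", "Pr(>|z|)"]

# Decision rule at the 5% cut-off.
reject_null <- p_value < 0.05
reject_null
```

Indexing the coefficient matrix by row and column name avoids copying numbers by hand from the printed summary.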

The output reports quite a few results. We will ignore most of them for now and focus only on the Coefficients table. Its last column, Pr(>|z|), is the significance probability (p-value): if the null hypothesis were true, the chance of seeing an association at least this strong would be less than 2e-16, far below our 5% cut-off, so we reject the null hypothesis. Therefore, the campaign outcome has a statistically significant relationship with the employee variance rate and, because the estimated coefficient is negative (-0.56222), there is a higher chance of campaign conversion as the rate decreases.
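To make the direction and size of the effect more concrete, the numbers printed in the summary can be turned into an odds ratio and a likelihood-ratio check. This is a sketch that simply copies the estimate, standard error, and deviances from the output above; the variable names are illustrative, and the Wald interval is an approximation.

```r
# Values copied from the printed summary(h.test) output.
beta <- -0.56222       # estimate for emp.var.rate
se <- 0.01018          # its standard error
null_dev <- 28999      # null deviance
resid_dev <- 25597     # residual deviance

# Odds ratio: multiplicative change in the odds of conversion
# for a one-unit increase in emp.var.rate.
odds_ratio <- exp(beta)

# Approximate 95% Wald confidence interval for the odds ratio.
ci <- exp(beta + c(-1, 1) * qnorm(0.975) * se)

# Likelihood-ratio test: the drop in deviance from adding one predictor
# is compared with a chi-squared distribution on 1 degree of freedom.
lr_stat <- null_dev - resid_dev
lr_p <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

round(odds_ratio, 2)
round(ci, 2)
lr_stat
```

An odds ratio below 1 agrees with the negative coefficient: each one-unit increase in the rate multiplies the odds of conversion by roughly 0.57.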

Similarly, let's repeat the same test for our second hypothesis. We define the following:

  • Null hypothesis: The campaign outcome has no relationship with the euro interest rate.
  • Alternate hypothesis: The campaign outcome has a relationship with the euro interest rate.

Exercise 37: Hypothesis 2 Testing for Categorical Dependent Variables and Continuous Independent Variables

Once again, we will use logistic regression to statistically test whether there is a relationship between the target variable, y, and the independent variable. In this exercise, we will use the euribor3m variable.

Perform the following steps:

  1. Import the required libraries and create the DataFrame objects.
  2. First, convert the dependent variable into a factor type:

    df$y <- factor(df$y)

  3. Next, perform logistic regression:

    h.test2 <- glm(y ~ euribor3m, data = df, family = "binomial")

  4. Print the test summary:

    summary(h.test2)

    The output is as follows:

    Call:
    glm(formula = y ~ euribor3m, family = "binomial", data = df)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -0.8568  -0.3730  -0.2997  -0.2917   2.5380

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.472940   0.027521  -17.18   <2e-16 ***
    euribor3m   -0.536582   0.009547  -56.21   <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 28999  on 41187  degrees of freedom
    Residual deviance: 25343  on 41186  degrees of freedom
    AIC: 25347

    Number of Fisher Scoring iterations: 5

Focusing on the Coefficients table in the previous output, the p-value for euribor3m is again less than 2e-16, so we reject the null hypothesis in favor of the alternate hypothesis. Therefore, the campaign outcome has a statistically significant relationship with the euro interest rate and, because the estimated coefficient is negative (-0.536582), there is a higher chance of campaign conversion as the rate decreases.
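Since both exercises follow the same steps, the test can be wrapped in a small function and applied to several predictors at once. The helper test_predictor below is hypothetical (it is not part of R or of the exercises), and it is demonstrated on simulated data whose columns rate_a and rate_b are placeholders chosen for illustration.

```r
# Hypothetical helper: fit response ~ predictor with logistic regression
# and return the slope estimate and its p-value as a one-row data frame.
test_predictor <- function(data, predictor, response = "y") {
  fit <- glm(reformulate(predictor, response = response),
             data = data, family = "binomial")
  row <- coef(summary(fit))[predictor, ]
  data.frame(predictor = predictor,
             estimate = unname(row["Estimate"]),
             p_value = unname(row["Pr(>|z|)"]))
}

# Simulated data (an assumption): both placeholder rates lower the odds of "yes".
set.seed(1)
n <- 2000
sim <- data.frame(rate_a = rnorm(n), rate_b = rnorm(n))
log_odds <- -2 - 0.5 * sim$rate_a - 0.5 * sim$rate_b
sim$y <- factor(ifelse(runif(n) < plogis(log_odds), "yes", "no"))

# Run one test per predictor and stack the results.
results <- do.call(rbind,
                   lapply(c("rate_a", "rate_b"), test_predictor, data = sim))
results$reject_null <- results$p_value < 0.05
print(results)
```

With the campaign data, the equivalent call would pass c("emp.var.rate", "euribor3m") and data = df. Each predictor is still tested in its own single-variable model, as in the two exercises.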
