Analysis of Social Drinkers and Smokers

Let's begin with an analysis of the impact of being a drinker or smoker on employee absenteeism. As smoking and frequent drinking have a negative impact on health, we would expect certain diseases to be more frequent among smokers and drinkers than among other employees. Note that in the absenteeism dataset, approximately 57% of the entries belong to social drinkers, while only about 7% belong to social smokers. We can produce figures similar to Figure 2.6 for social drinkers and smokers with the following code:

# plot reasons for absence against being a social drinker/smoker

plt.figure(figsize=(8, 6))

sns.countplot(data=preprocessed_data, x="Reason for absence", \

              hue="Social drinker", hue_order=["Yes", "No"])

plt.savefig('figs/absence_reasons_drinkers.png', \

            format='png', dpi=300)

plt.figure(figsize=(8, 6))

sns.countplot(data=preprocessed_data, x="Reason for absence", \

              hue="Social smoker", hue_order=["Yes", "No"])

plt.savefig('figs/absence_reasons_smokers.png', \

            format='png', dpi=300)

The following is the output of the preceding code:

Figure 2.7: Distribution of diseases over social drinkers

Similarly, the distribution of diseases for social smokers can be visualized as follows:

Figure 2.8: Distribution of diseases over social smokers

Next, calculate the actual proportions of social drinkers and smokers in the preprocessed data:

print(preprocessed_data["Social drinker"]\

      .value_counts(normalize=True))

print(preprocessed_data["Social smoker"]\

      .value_counts(normalize=True))

The output will be as follows:

Yes 0.567568

No 0.432432

Name: Social drinker, dtype: float64

No 0.927027

Yes 0.072973

Name: Social smoker, dtype: float64

As we can see from the resulting plots, a noticeable difference between drinkers and non-drinkers can be observed in absences related to dental consultations (28). Furthermore, as the number of social smokers is quite small (only 7% of the entries), it is very hard to say whether there is actually a relationship between the absence reasons and smoking. A more rigorous approach in this direction would be to analyze the conditional probabilities of the different absence reasons, conditioned on being a social drinker or smoker.

Conditional probability is a measure that tells us the probability of an event's occurrence, assuming that another event has occurred. From a mathematical perspective, given a set of events Ω and a probability measure P on Ω and given two events A and B in Ω with the unconditional probability of B being greater than zero (that is, P(B) > 0), we can define the conditional probability of A given B as follows:

Figure 2.9: Formula for conditional probability
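
Written out in standard notation, the definition above reads as follows:

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
```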

In other words, the probability of A given B is equal to the probability of A and B both happening, divided by the probability of B happening. Let's consider a simple example that will help us understand the usage of conditional probability. This is a classic probability problem. Suppose that your friend has two children, and you know that at least one of them is a boy. We want to know the probability that your friend has two sons. First, we have to identify all the possible outcomes in our event space Ω. If we denote a boy with B and a girl with G, and list the two children in order, then the event space contains four possible outcomes:

Figure 2.10: Event space Ω

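Each outcome has a probability of 0.25. Following the notation from the definition, event A (both children are boys) is defined as follows:

Figure 2.11: Event A

Event B (at least one child is a boy, not to be confused with the letter B that labels a boy in each outcome) is defined as follows: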

Figure 2.12: Event B

Now, our initial problem translates into computing P(A|B). With this, we get the following equation:

Figure 2.13: Probability of event A conditioned to B
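
Written out explicitly, with each outcome listing the two children in order, the events and the resulting computation read as follows:

```latex
\Omega = \{BB, BG, GB, GG\}, \quad A = \{BB\}, \quad B = \{BB, BG, GB\}

P(A \mid B) = \frac{P(A \cap B)}{P(B)}
            = \frac{P(\{BB\})}{P(\{BB, BG, GB\})}
            = \frac{0.25}{0.75} = \frac{1}{3}
```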

We can also perform this example computationally:

# computation of conditional probability

sample_space = set(["BB", "BG", "GB", "GG"])

event_a = set(["BB"])

event_b = set(["BB", "BG", "GB"])

cond_prob = (0.25*len(event_a.intersection(event_b))) \

            / (0.25*len(event_b))

print(round(cond_prob, 4))

The output will be as follows:

0.3333
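
The same result can be obtained by enumerating the outcomes and counting, which makes the roles of P(A and B) and P(B) explicit. The following is only a minimal sketch: the helper functions probability and conditional_probability are illustrative names introduced here, and the sketch assumes that all outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Enumerate all equally likely outcomes for two children, e.g. ("B", "G").
sample_space = list(product("BG", repeat=2))


def probability(event, space):
    """Probability of an event (a predicate) under equally likely outcomes."""
    favourable = [outcome for outcome in space if event(outcome)]
    return Fraction(len(favourable), len(space))


def conditional_probability(event_a, event_b, space):
    """P(A | B) = P(A and B) / P(B); requires P(B) > 0."""
    p_b = probability(event_b, space)
    if p_b == 0:
        raise ValueError("P(B) must be greater than zero")
    p_a_and_b = probability(lambda o: event_a(o) and event_b(o), space)
    return p_a_and_b / p_b


def both_boys(outcome):
    """Event A: both children are boys."""
    return outcome == ("B", "B")


def at_least_one_boy(outcome):
    """Event B: at least one child is a boy."""
    return "B" in outcome


result = conditional_probability(both_boys, at_least_one_boy, sample_space)
print(result)  # 1/3
```

Using Fraction keeps the result exact (1/3) instead of a rounded decimal.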

Note that by using the definition of conditional probability, we could address questions such as, "What is the probability of a reason for absence being related to laboratory examinations, assuming that an employee is a social drinker?" In other words, if we denote the "employee is absent for laboratory examinations" event with A, and the "employee is a social drinker" event with B, the probability of the "employee is absent due to laboratory examinations, given that the employee is a social drinker" event can be computed with the previous formula.

The following exercise illustrates how the conditional probability formula can identify reasons for absence with higher probability among smokers and drinkers.

Exercise 2.02: Identifying Reasons of Absence with Higher Probability Among Drinkers and Smokers

In this exercise, you will compute the conditional probabilities of the different reasons for absence, assuming that the employee is a social drinker or smoker. Please execute the code mentioned in the previous section and Exercise 2.01, Identifying Disease Reasons for Absence before attempting this exercise. Now, follow these steps:

  1. To identify the conditional probabilities, first compute the unconditional probabilities of being a social drinker or smoker. Verify that both probabilities are greater than zero, as they appear in the denominator of the conditional probabilities. Do this by counting the number of social drinkers and smokers and dividing these values by the total number of entries, like so:

    Figure 2.14: Probability of being a social drinker

    Figure 2.15: Probability of being a social smoker

    The following code snippet does this for you:

    # compute probabilities of being a drinker and smoker

    drinker_prob = preprocessed_data["Social drinker"]\

                   .value_counts(normalize=True)["Yes"]

    smoker_prob = preprocessed_data["Social smoker"]\

                  .value_counts(normalize=True)["Yes"]

    print(f"P(social drinker) = {drinker_prob:.3f} \

    | P(social smoker) = {smoker_prob:.3f}")

    The output will be as follows:

    P(social drinker) = 0.568 | P(social smoker) = 0.073

    As you can see, the probability of being a drinker is almost 57%, while the probability of being a smoker is quite low (only 7.3%).

  2. Next, compute the probabilities of being a social drinker/smoker and being absent for each reason of absence. For a specific reason of absence (say Ri), these probabilities are defined as follows:

    Figure 2.16: Probability of being a drinker and absent

    Figure 2.17: Probability of being a smoker and absent

  3. In order to carry out the required computations, define masks in the data, which only account for entries where employees are drinkers or smokers:

    # create masks for social drinkers/smokers

    drinker_mask = preprocessed_data["Social drinker"] == "Yes"

    smoker_mask = preprocessed_data["Social smoker"] == "Yes"

  4. Compute the total number of entries and the number of absence reasons, masked by drinkers/smokers:

    total_entries = preprocessed_data.shape[0]

    absence_drinker_prob = preprocessed_data["Reason for absence"]\

                           [drinker_mask].value_counts()/total_entries

    absence_smoker_prob = preprocessed_data["Reason for absence"]\

                          [smoker_mask].value_counts()/total_entries

  5. Compute the conditional probabilities by dividing the joint probabilities for each reason of absence (defined in Step 2 and computed in Step 4) by the unconditional probabilities obtained in Step 1:

    # compute conditional probabilities

    cond_prob = pd.DataFrame(index=range(0,29))

    cond_prob["P(Absence | social drinker)"] = absence_drinker_prob\

                                               /drinker_prob

    cond_prob["P(Absence | social smoker)"] = absence_smoker_prob\

                                              /smoker_prob

  6. Create bar plots for the conditional probabilities:

    # plot probabilities

    plt.figure()

    ax = cond_prob.plot.bar(figsize=(10,6))

    ax.set_ylabel("Conditional probability")

    plt.savefig('figs/conditional_probabilities.png', \

                format='png', dpi=300)

    The output will be as follows:

Figure 2.18: Bar plots for conditional probabilities

As we can observe from the previous plot, the most probable reason for absence among drinkers is dental consultations (28), followed by medical consultations (23). Smokers' absences, however, are mostly due to unknown reasons (0) and laboratory examinations (25).
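
As a cross-check on the steps of this exercise, the same conditional probabilities can also be obtained with the pandas crosstab() function, normalizing each column so that it sums to one. The following sketch uses a small made-up DataFrame (toy_data is hypothetical and is not the absenteeism dataset), with column names mirroring those used in the exercise:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the absenteeism dataset.
toy_data = pd.DataFrame({
    "Reason for absence": [23, 28, 28, 23, 0, 28, 25, 23],
    "Social drinker": ["Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Manual route, as in the exercise: P(reason and drinker) / P(drinker)
drinker_mask = toy_data["Social drinker"] == "Yes"
drinker_prob = drinker_mask.mean()
joint_prob = (toy_data.loc[drinker_mask, "Reason for absence"]
              .value_counts() / len(toy_data))
manual = (joint_prob / drinker_prob).sort_index()

# Crosstab route: normalize="columns" divides each count by its column
# total, which yields P(reason | drinker status) directly.
table = pd.crosstab(toy_data["Reason for absence"],
                    toy_data["Social drinker"], normalize="columns")

print(manual)
print(table["Yes"])
```

Both routes should agree for every reason that occurs among drinkers. The crosstab version additionally reports a probability of zero for reasons that never occur in that group, whereas the manual route leaves those reasons out (or as NaN once assigned to a DataFrame with a fixed index).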

Note

To access the source code for this specific section, please refer to https://packt.live/2Y7KQhv.

You can also run this example online at https://packt.live/3d7pFk3. You must execute the entire Notebook in order to get the desired result.

In the previous exercise, we saw how to compute the conditional probabilities of the reason for absence, conditioned on the employee being a social smoker or drinker. Furthermore, we saw that in order to perform the computation, we had to compute the probability of being absent and being a social smoker/drinker. Due to the nature of the problem, computing this value might be difficult, or we may only have one conditional probability (say, P(A|B)) where we actually need P(B|A). In these cases, Bayes' theorem can be used.

Let Ω denote a set of events with probability measure P on Ω. Given two events A and B in Ω with P(B) > 0, Bayes' theorem states the following:

Figure 2.19: Bayesian theorem
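
Written out in standard notation, the statement reads as follows (note that P(B | A) is itself only defined when P(A) > 0):

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```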

Before proceeding further, let's look at a practical example of applying Bayes' theorem. Suppose that we have two bags. The first one contains four blue and three red balls, while the second one contains two blue and five red balls. Let's assume that a ball is drawn at random from one of the two bags, and its color is blue. We want to know the probability that the ball was drawn from the first bag. Let's use B1 to denote the "ball is drawn from the first bag" event and B2 to denote the "ball is drawn from the second bag" event. Since the number of balls is equal in both bags, each of the two events has a probability of 0.5, as follows:

Figure 2.20: Probability of both events

If we use A to denote the "a blue ball has been drawn" event, then we have the following:

Figure 2.21: Probability of event A, where a blue ball is drawn

This is because the first bag contains four blue balls, while the second one contains only two. Furthermore, based on the defined events, the probability we need to compute translates into P(B1 | A). By applying Bayes' theorem, we get the following:

Figure 2.22: Probability of the event that a blue ball is drawn
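
The bag example can also be checked computationally. The following is a minimal sketch that uses exact fractions. It assumes, in line with the text, that each bag is chosen with probability 0.5, and it obtains P(A) from the law of total probability, P(A) = P(A | B1)P(B1) + P(A | B2)P(B2). The variable names are illustrative only.

```python
from fractions import Fraction

# Prior probabilities of drawing from each bag
p_b1 = Fraction(1, 2)
p_b2 = Fraction(1, 2)

# Probability of a blue ball given the bag
p_a_given_b1 = Fraction(4, 7)  # 4 blue out of 7 balls in the first bag
p_a_given_b2 = Fraction(2, 7)  # 2 blue out of 7 balls in the second bag

# Law of total probability: overall chance of drawing a blue ball
p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2

# Bayes' theorem: P(B1 | A) = P(A | B1) * P(B1) / P(A)
p_b1_given_a = p_a_given_b1 * p_b1 / p_a

print(p_a)           # 3/7
print(p_b1_given_a)  # 2/3
```

In other words, once a blue ball has been observed, the first bag is twice as likely as the second to have been its source.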

Now, let's apply Bayes' theorem to our dataset in the following exercise. In addition to applying Bayes' theorem, we will also be using the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is used to determine whether two samples are statistically different from each other, i.e. whether or not they follow the same distribution. We can implement the Kolmogorov-Smirnov test directly from SciPy, as we will see in the exercise.

Exercise 2.03: Identifying the Probability of Being a Drinker/Smoker, Conditioned to Absence Reason

In this exercise, you will compute the conditional probability of being a social drinker or smoker, conditioned on the reason for absence. In other words (where Ri is the reason for which an employee is absent), we want to compute the probabilities of an employee being a social drinker P(social drinker |Ri), or smoker P(social smoker |Ri), as follows:

Figure 2.23: Conditional probability of being a drinker, conditioned to an absence reason Ri

Figure 2.24: Conditional probability of being a smoker, conditioned to an absence reason Ri
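
Written out in standard notation, applying Bayes' theorem to the quantities named above gives the following two equations:

```latex
P(\text{social drinker} \mid R_i)
  = \frac{P(R_i \mid \text{social drinker})\, P(\text{social drinker})}{P(R_i)}

P(\text{social smoker} \mid R_i)
  = \frac{P(R_i \mid \text{social smoker})\, P(\text{social smoker})}{P(R_i)}
```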

Execute the code mentioned in the previous section, as well as the previous exercises, before attempting this exercise. Now, follow these steps:

  1. Since you already computed P(Ri | social drinker), P(Ri | social smoker), P(social drinker), and P(social smoker) in the previous exercise, you only need to compute P(Ri) for each reason of absence Ri:

    # compute reason for absence probabilities

    absence_prob = preprocessed_data["Reason for absence"]\

                   .value_counts(normalize=True)

  2. Now that you have all the necessary values, compute the conditional probabilities according to the equations given at the beginning of this exercise, and plot them:

    # compute conditional probabilities for drinker/smoker

    cond_prob_drinker_smoker = pd.DataFrame(index=range(0,29))

    cond_prob_drinker_smoker["P(social drinker | Absence)"] = \

    cond_prob["P(Absence | social drinker)"]*drinker_prob/absence_prob

    cond_prob_drinker_smoker["P(social smoker | Absence)"] = \

    cond_prob["P(Absence | social smoker)"]*smoker_prob/absence_prob

    plt.figure()

    ax = cond_prob_drinker_smoker.plot.bar(figsize=(10,6))

    ax.set_ylabel("Conditional probability")

    plt.savefig('figs/conditional_probabilities_drinker_smoker.png', \

                format='png', dpi=300)

    The following is the output of the preceding code:

    Figure 2.25: Conditional probabilities of being a drinker/smoker, conditioned to being absent

    As you can see from the resulting plot, the conditional probabilities of being a social drinker/smoker are quite high, once an absence with a certain reason occurs. This is due to the fact that the number of entries is very small; as such, if all the registered employees who were absent for a certain reason are smokers, the probability of being a smoker, once that reason has been registered, will be equal to one (based on the available data).

  3. To complete your analysis on the social drinkers and smokers, analyze the distribution of the hours of absenteeism based on the two classes (being a social drinker/smoker versus not being). A useful type of plot for this type of comparison is the violin plot, which can be produced using the seaborn violinplot() function:

    # create violin plots of the absenteeism time in hours

    plt.figure(figsize=(8,6))

    sns.violinplot(x="Social drinker", y="Absenteeism time in hours", \

                   data=preprocessed_data, order=["No", "Yes"])

    plt.savefig('figs/drinkers_hour_distribution.png', \

                format='png', dpi=300)

    plt.figure(figsize=(8,6))

    sns.violinplot(x="Social smoker", y="Absenteeism time in hours", \

                   data=preprocessed_data, order=["No", "Yes"])

    plt.savefig('figs/smokers_hour_distribution.png', \

                format='png', dpi=300)

    The following is the output of the preceding code:

    Figure 2.26: Violin plots of the absenteeism time in hours for social drinkers

    Figure 2.27: Violin plots of the absenteeism time in hours for social smokers

    As you can observe from Figure 2.26 and Figure 2.27, despite some differences in the outliers between smokers and non-smokers, there is no substantial visible difference in the distribution of absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.

  4. To assess this statement in a rigorous statistical way, perform hypothesis testing on the absenteeism hours (with a null hypothesis stating that the average absenteeism time in hours is the same for drinkers and non-drinkers):

    from scipy.stats import ttest_ind

    hours_col = "Absenteeism time in hours"

    # test mean absenteeism time for drinkers

    drinkers_mask = preprocessed_data["Social drinker"] == "Yes"

    hours_drinkers = preprocessed_data.loc[drinkers_mask, hours_col]

    hours_non_drinkers = preprocessed_data\

                         .loc[~drinkers_mask, hours_col]

    drinkers_test = ttest_ind(hours_drinkers, hours_non_drinkers)

    print(f"Statistic value: {drinkers_test[0]}, \

    p-value: {drinkers_test[1]}")

    The output will be as follows:

    Statistic value: 1.7713833295243993, p-value: 0.07690961828294651

  5. Perform the same test on the social smokers:

    # test mean absenteeism time for smokers

    smokers_mask = preprocessed_data["Social smoker"] == "Yes"

    hours_smokers = preprocessed_data.loc[smokers_mask, hours_col]

    hours_non_smokers = preprocessed_data\

                        .loc[~smokers_mask, hours_col]

    smokers_test = ttest_ind(hours_smokers, hours_non_smokers)

    print(f"Statistic value: {smokers_test[0]}, \

    p-value: {smokers_test[1]}")

    The output will be as follows:

    Statistic value: -0.24277795417700243, p-value: 0.8082448720154971

    As you can see, the p-values of both tests are above the significance level of 0.05, which means that you cannot reject the null hypothesis. In other words, you cannot say that there is a statistically significant difference in the average absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.

    Note that in the previous steps, you performed hypothesis tests with the null hypothesis that the average absenteeism hours are equal for drinkers (and smokers) and non-drinkers (and non-smokers). However, even if the averages are equal, the distributions themselves may still differ.

  6. Perform a Kolmogorov-Smirnov test to assess the difference in the distributions of two samples:

    # perform Kolmogorov-Smirnov test for comparing the distributions

    from scipy.stats import ks_2samp

    ks_drinkers = ks_2samp(hours_drinkers, hours_non_drinkers)

    ks_smokers = ks_2samp(hours_smokers, hours_non_smokers)

    print(f"Drinkers comparison: statistics={ks_drinkers[0]:.3f}, \

    pvalue={ks_drinkers[1]:.3f}")

    print(f"Smokers comparison: statistics={ks_smokers[0]:.3f}, \

    pvalue={ks_smokers[1]:.3f}")

    The output will be as follows:

    Drinkers comparison: statistics=0.135, pvalue=0.002

    Smokers comparison: statistics=0.104, pvalue=0.607

The p-value for the drinkers comparison is lower than the significance level of 0.05, which is strong evidence against the null hypothesis that the two distributions are equal. On the other hand, as the p-value for the smokers comparison is higher than 0.05, you cannot reject the null hypothesis.
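
The point made in step 5, that two samples can share the same average and still follow different distributions, can be illustrated on artificial data. In the following sketch, the samples narrow and wide are made up for illustration (evenly spaced values, not drawn from the absenteeism dataset). Both are centered at zero, but one is three times as spread out as the other:

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

# Two artificial samples with the same mean (0) but different spread.
narrow = np.linspace(-1, 1, 501)
wide = np.linspace(-3, 3, 501)

# The t-test compares means only, so it finds no difference here.
t_result = ttest_ind(narrow, wide)

# The Kolmogorov-Smirnov test compares the whole empirical distributions.
ks_result = ks_2samp(narrow, wide)

print(f"t-test:  statistic={t_result.statistic:.3f}, "
      f"p-value={t_result.pvalue:.3f}")
print(f"KS test: statistic={ks_result.statistic:.3f}, "
      f"p-value={ks_result.pvalue:.3g}")
```

Here, the t-test gives no reason to reject equal means, while the Kolmogorov-Smirnov test clearly rejects equal distributions. This is the same pattern, in a more extreme form, as the one observed for the drinkers.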

Note

To access the source code for this specific section, please refer to https://packt.live/3hxt3I6.

You can also run this example online at https://packt.live/2BeAweq. You must execute the entire Notebook in order to get the desired result.

In this section, we investigated the relationship between the different reasons for absence and social information about the employees (such as being smokers or drinkers). In the next section, we will analyze the impact of the employees' body mass index on their absenteeism.
