- The Data Analysis Workshop
- Gururajan Govindan Shubhangi Hora Konstantin Palagachev
- 2889 words
- 2021-06-18 18:18:26
Analysis of Social Drinkers and Smokers
Let's begin with an analysis of the impact of being a drinker or smoker on employee absenteeism. As smoking and frequent drinking have a negative impact on health, we would expect certain diseases to be more frequent among smokers and drinkers than among other employees. Note that in the absenteeism dataset, about 57% of the registered employees are drinkers, while only 7% are smokers. We can produce a figure similar to Figure 2.6 for the social drinkers and smokers with the following code:
# plot reasons for absence against being a social drinker/smoker
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
hue="Social drinker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_drinkers.png', \
format='png', dpi=300)
plt.figure(figsize=(8, 6))
sns.countplot(data=preprocessed_data, x="Reason for absence", \
hue="Social smoker", hue_order=["Yes", "No"])
plt.savefig('figs/absence_reasons_smokers.png', \
format='png', dpi=300)
The following is the output of the preceding code:

Figure 2.7: Distribution of diseases over social drinkers
Similarly, the distribution of diseases for social smokers can be visualized as follows:

Figure 2.8: Distribution of diseases over social smokers
Next, calculate the actual proportions of social drinkers and smokers from the preprocessed data:
print(preprocessed_data["Social drinker"]\
.value_counts(normalize=True))
print(preprocessed_data["Social smoker"]\
.value_counts(normalize=True))
The output will be as follows:
Yes 0.567568
No 0.432432
Name: Social drinker, dtype: float64
No 0.927027
Yes 0.072973
Name: Social smoker, dtype: float64
As we can see from the resulting plots, a notable difference between drinkers and non-drinkers can be observed in absences related to dental consultations (28). Furthermore, as the number of social smokers is quite small (only 7% of the entries), it is very hard to say whether there is actually a relationship between the absence reasons and smoking. A more rigorous approach would be to analyze the conditional probabilities of the different absence reasons, conditioned on being a social drinker or smoker.
Conditional probability is a measure that tells us the probability of an event's occurrence, assuming that another event has occurred. From a mathematical perspective, given a set of events Ω and a probability measure P on Ω and given two events A and B in Ω with the unconditional probability of B being greater than zero (that is, P(B) > 0), we can define the conditional probability of A given B as follows:

P(A | B) = P(A ∩ B) / P(B)
Figure 2.9: Formula for conditional probability
In other words, the probability of A given B is equal to the probability of A and B both happening, divided by the probability of B happening. Let's consider a simple example that will help us understand the usage of conditional probability. This is a classic probability problem. Suppose that your friend has two children, and you know that at least one of them is male. We want to know the probability that your friend has two sons. First, we have to identify all the possible outcomes in our event space Ω. If we denote with B the event of having a boy, and with G the event of having a girl, then the event space contains four possible outcomes:

Ω = {BB, BG, GB, GG}
Figure 2.10: Event space Ω
Each of these outcomes has a probability of 0.25. Following the notation from the definition, we can define event A (both children are boys) like so:

A = {BB}
Figure 2.11: Event A
We can define event B (at least one child is a boy) like so:

B = {BB, BG, GB}
Figure 2.12: Event B
Now, our initial problem translates into computing P(A|B). With this, we get the following equation:

P(A | B) = P(A ∩ B) / P(B) = 0.25 / 0.75 = 1/3
Figure 2.13: Probability of event A conditioned to B
We can also perform this example computationally:
# computation of conditional probability
sample_space = set(["BB", "BG", "GB", "GG"])
event_a = set(["BB"])
event_b = set(["BB", "BG", "GB"])
cond_prob = (0.25*len(event_a.intersection(event_b))) \
/ (0.25*len(event_b))
print(round(cond_prob, 4))
The output will be as follows:
0.3333
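The same answer can be sanity-checked by simulation. The following is a minimal sketch, assuming each child is independently a boy or a girl with probability 0.5; the variable names, the number of simulated families, and the fixed seed are illustrative choices and not part of the original example:

```python
import random

random.seed(0)  # fixed seed so that repeated runs give the same estimate

n_families = 100_000
at_least_one_boy = 0  # number of families in event B
two_boys = 0          # number of families in event A (a subset of B)

for _ in range(n_families):
    # draw the sex of the two children independently
    children = (random.choice("BG"), random.choice("BG"))
    if "B" in children:
        at_least_one_boy += 1
        if children == ("B", "B"):
            two_boys += 1

# relative frequency of A among the families where B occurred
estimate = two_boys / at_least_one_boy
print(f"Estimated P(A | B) = {estimate:.4f}")
```

With this many simulated families, the estimate should land close to the exact value of 1/3, although it will not match it digit for digit.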
Note that by using the definition of conditional probability, we could address questions such as, "What is the probability of a reason for absence being related to laboratory examinations, assuming that an employee is a social drinker?" In other words, if we denote the "employee is absent for laboratory examinations" event with A, and the "employee is a social drinker" event with B, the probability of the "employee is absent due to laboratory examination reasons, given that employee is a social drinker" event can be computed by the previous formula.
The following exercise illustrates how the conditional probability formula can identify reasons for absence with higher probability among smokers and drinkers.
Exercise 2.02: Identifying Reasons of Absence with Higher Probability Among Drinkers and Smokers
In this exercise, you will compute the conditional probabilities of the different reasons for absence, assuming that the employee is a social drinker or smoker. Please execute the code mentioned in the previous section and Exercise 2.01, Identifying Disease Reasons for Absence before attempting this exercise. Now, follow these steps:
- To identify the conditional probabilities, first compute the unconditional probabilities of being a social drinker or smoker. Verify that both probabilities are greater than zero, as they appear in the denominator of the conditional probabilities. Do this by counting the number of social drinkers and smokers and dividing these values by the total number of entries, like so:
P(social drinker) = (number of social drinkers) / (total number of entries)
Figure 2.14: Probability of being a social drinker
P(social smoker) = (number of social smokers) / (total number of entries)
Figure 2.15: Probability of being a social smoker
The following code snippet does this for you:
# compute probabilities of being a drinker and smoker
drinker_prob = preprocessed_data["Social drinker"]\
.value_counts(normalize=True)["Yes"]
smoker_prob = preprocessed_data["Social smoker"]\
.value_counts(normalize=True)["Yes"]
print(f"P(social drinker) = {drinker_prob:.3f} \
| P(social smoker) = {smoker_prob:.3f}")
The output will be as follows:
P(social drinker) = 0.568 | P(social smoker) = 0.073
As you can see, the probability of being a drinker is almost 57%, while the probability of being a smoker is quite low (only 7.3%).
- Next, compute the probabilities of being a social drinker/smoker and being absent for each reason of absence. For a specific reason of absence (say Ri), these probabilities are defined as follows:
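P(social drinker ∩ Ri) = (number of entries with a social drinker and reason Ri) / (total number of entries)
Figure 2.16: Probability of being a drinker and absent
P(social smoker ∩ Ri) = (number of entries with a social smoker and reason Ri) / (total number of entries)
Figure 2.17: Probability of being a smoker and absent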
- In order to carry out the required computations, define masks in the data, which only account for entries where employees are drinkers or smokers:
# create mask for social drinkers/smokers
drinker_mask = preprocessed_data["Social drinker"] == "Yes"
smoker_mask = preprocessed_data["Social smoker"] == "Yes"
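- Compute the total number of entries and, for each absence reason, the joint probability of that reason occurring together with the employee being a drinker/smoker (the masked counts divided by the total number of entries):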
total_entries = preprocessed_data.shape[0]
absence_drinker_prob = preprocessed_data["Reason for absence"]\
[drinker_mask].value_counts()/total_entries
absence_smoker_prob = preprocessed_data["Reason for absence"]\
[smoker_mask].value_counts()/total_entries
- Compute the conditional probabilities by dividing the joint probabilities computed for each reason of absence in the previous step by the unconditional probabilities obtained in the first step:
# compute conditional probabilities
cond_prob = pd.DataFrame(index=range(0,29))
cond_prob["P(Absence | social drinker)"] = absence_drinker_prob\
/drinker_prob
cond_prob["P(Absence | social smoker)"] = absence_smoker_prob\
/smoker_prob
- Create bar plots for the conditional probabilities:
# plot probabilities
plt.figure()
ax = cond_prob.plot.bar(figsize=(10,6))
ax.set_ylabel("Conditional probability")
plt.savefig('figs/conditional_probabilities.png', \
format='png', dpi=300)
The output will be as follows:
Figure 2.18: Bar plots for conditional probabilities
As we can observe from the previous plot, the most probable reason for absence among drinkers is dental consultations (28), followed by medical consultations (23). Smokers' absences, however, are mostly due to unknown reasons (0) and laboratory examinations (25).
Note
To access the source code for this specific section, please refer to https://packt.live/2Y7KQhv.
You can also run this example online at https://packt.live/3d7pFk3. You must execute the entire Notebook in order to get the desired result.
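The masking and dividing steps of this exercise can be cross-checked on a small table. The sketch below uses a made-up toy DataFrame (hypothetical values, not the absenteeism dataset) that only borrows the two column names; it compares the manual computation with pd.crosstab(), whose normalize="columns" option divides each column by its total and therefore yields P(Ri | social drinker) directly:

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "Reason for absence": [23, 23, 28, 28, 28, 0, 23, 28],
    "Social drinker": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# manual route: joint probability divided by the unconditional probability
is_drinker = toy["Social drinker"] == "Yes"
p_drinker = is_drinker.mean()
joint = toy.loc[is_drinker, "Reason for absence"].value_counts() / len(toy)

# one-step route: each column of the crosstab sums to one
table = pd.crosstab(toy["Reason for absence"], toy["Social drinker"],
                    normalize="columns")
shortcut = table["Yes"]

# reasons with no drinkers are missing from the manual result, so fill with 0
manual = (joint / p_drinker).reindex(table.index, fill_value=0.0)

print(pd.DataFrame({"manual": manual, "crosstab": shortcut}))
```

Both columns of the printed table should agree. Note that a reason with no drinkers at all does not appear in the masked value counts, which is why the manual route needs an explicit fill value (in the exercise, such reasons show up as missing values in cond_prob).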
In the previous exercise, we saw how to compute the conditional probabilities of the reason for absence, conditioned on the employee being a social smoker or drinker. Furthermore, we saw that in order to perform the computation, we had to compute the joint probability of being absent and being a social smoker/drinker. Depending on the problem, computing this value might be difficult, or we may only have one conditional probability (say, P(A|B)) when we actually need P(B|A). In these cases, Bayes' theorem can be used:
Let Ω denote a set of events with probability measure P on Ω. Given two events A and B in Ω with P(B) > 0, Bayes' theorem states the following:

P(A | B) = P(B | A) P(A) / P(B)
Figure 2.19: Bayesian theorem
Before proceeding further, let's look at a practical example of applying Bayes' theorem. Suppose that we have two bags. The first one contains four blue and three red balls, while the second one contains two blue and five red balls. Let's assume that a ball is drawn at random from one of the two bags, and its color is blue. We want to know the probability that the ball was drawn from the first bag. Let's use B1 to denote the "ball is drawn from the first bag" event and B2 to denote the "ball is drawn from the second bag" event. Since the number of balls is equal in both bags, the probability of each of the two events is equal to 0.5, as follows:

P(B1) = P(B2) = 0.5
Figure 2.20: Probability of both events
If we use A to denote the "a blue ball has been drawn" event, then we have the following:

P(A | B1) = 4/7, P(A | B2) = 2/7
Figure 2.21: Probability of event A, where a blue ball is drawn
This is because four of the seven balls in the first bag are blue, compared to only two of the seven in the second one. Furthermore, based on the defined events, the probability we need to compute translates into P(B1 | A). By applying Bayes' theorem, we get the following:

P(B1 | A) = P(A | B1) P(B1) / (P(A | B1) P(B1) + P(A | B2) P(B2)) = (4/7 × 0.5) / (4/7 × 0.5 + 2/7 × 0.5) = 2/3 ≈ 0.667
Figure 2.22: Probability of the event that a blue ball is drawn
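The bag example can also be worked through computationally. This is a minimal sketch that uses Python's fractions module to keep the arithmetic exact; the variable names are illustrative, and the denominator P(A) is obtained from the law of total probability over the two bags:

```python
from fractions import Fraction

# prior probability of drawing from each bag
p_b1 = Fraction(1, 2)
p_b2 = Fraction(1, 2)

# probability of a blue ball, given the bag
p_a_given_b1 = Fraction(4, 7)  # 4 blue balls out of 7
p_a_given_b2 = Fraction(2, 7)  # 2 blue balls out of 7

# law of total probability: P(A) = P(A|B1)P(B1) + P(A|B2)P(B2)
p_a = p_a_given_b1 * p_b1 + p_a_given_b2 * p_b2

# Bayes' theorem: P(B1|A) = P(A|B1)P(B1) / P(A)
p_b1_given_a = p_a_given_b1 * p_b1 / p_a

print(p_a, p_b1_given_a)  # prints: 3/7 2/3
```

In other words, observing a blue ball raises the probability that the first bag was used from 1/2 to 2/3.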
Now, let's apply Bayes' theorem to our dataset in the following exercise. In addition to applying Bayes' theorem, we will also be using the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test is used to determine whether two samples are statistically different from each other, i.e. whether or not they follow the same distribution. We can implement the Kolmogorov-Smirnov test directly from SciPy, as we will see in the exercise.
Exercise 2.03: Identifying the Probability of Being a Drinker/Smoker, Conditioned to Absence Reason
In this exercise, you will compute the conditional probability of being a social drinker or smoker, conditioned on the reason for absence. In other words (where Ri is the reason for which an employee is absent), we want to compute the probabilities of an employee being a social drinker P(social drinker |Ri), or smoker P(social smoker |Ri), as follows:

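P(social drinker | Ri) = P(Ri | social drinker) P(social drinker) / P(Ri)
Figure 2.23: Conditional probability of being a drinker, conditioned to an absence reason Ri

P(social smoker | Ri) = P(Ri | social smoker) P(social smoker) / P(Ri)
Figure 2.24: Conditional probability of being a smoker, conditioned to an absence reason Ri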
Execute the code mentioned in the previous section, as well as the previous exercises, before attempting this exercise. Now, follow these steps:
- Since you already computed P(Ri | social drinker), P(Ri | social smoker), P(social drinker), and P(social smoker) in the previous exercise, you only need to compute P(Ri) for each reason of absence Ri:
# compute reason for absence probabilities
absence_prob = preprocessed_data["Reason for absence"]\
.value_counts(normalize=True)
- Now that you have all the necessary values, compute the conditional probabilities by applying Bayes' theorem, as stated at the beginning of this exercise:
# compute conditional probabilities for drinker/smoker
cond_prob_drinker_smoker = pd.DataFrame(index=range(0,29))
cond_prob_drinker_smoker["P(social drinker | Absence)"] = \
cond_prob["P(Absence | social drinker)"]*drinker_prob/absence_prob
cond_prob_drinker_smoker["P(social smoker | Absence)"] = \
cond_prob["P(Absence | social smoker)"]*smoker_prob/absence_prob
plt.figure()
ax = cond_prob_drinker_smoker.plot.bar(figsize=(10,6))
ax.set_ylabel("Conditional probability")
plt.savefig('figs/conditional_probabilities_drinker_smoker.png', \
format='png', dpi=300)
The following is the output of the preceding code:
Figure 2.25: Conditional probabilities of being a drinker/smoker, conditioned to being absent
As you can see from the resulting plot, the conditional probabilities of being a social drinker/smoker are quite high, once an absence with a certain reason occurs. This is due to the fact that the number of entries is very small; as such, if all the registered employees who were absent for a certain reason are smokers, the probability of being a smoker, once that reason has been registered, will be equal to one (based on the available data).
- To complete your analysis on the social drinkers and smokers, analyze the distribution of the hours of absenteeism based on the two classes (being a social drinker/smoker versus not being). A useful type of plot for this type of comparison is the violin plot, which can be produced using the seaborn violinplot() function:
# create violin plots of the absenteeism time in hours
plt.figure(figsize=(8,6))
sns.violinplot(x="Social drinker", y="Absenteeism time in hours", \
data=preprocessed_data, order=["No", "Yes"])
plt.savefig('figs/drinkers_hour_distribution.png', \
format='png', dpi=300)
plt.figure(figsize=(8,6))
sns.violinplot(x="Social smoker", y="Absenteeism time in hours", \
data=preprocessed_data, order=["No", "Yes"])
plt.savefig('figs/smokers_hour_distribution.png', \
format='png', dpi=300)
The following is the output of the preceding code:
Figure 2.26: Violin plots of the absenteeism time in hours for social drinkers
Figure 2.27: Violin plots of the absenteeism time in hours for social smokers
As you can observe from Figures 2.26 and 2.27, despite some differences in the outliers between smokers and non-smokers, there is no substantial difference in the distribution of absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.
- To assess this statement in a rigorous statistical way, perform hypothesis testing on the absenteeism hours (with a null hypothesis stating that the average absenteeism time in hours is the same for drinkers and non-drinkers):
from scipy.stats import ttest_ind
hours_col = "Absenteeism time in hours"
# test mean absenteeism time for drinkers
drinkers_mask = preprocessed_data["Social drinker"] == "Yes"
hours_drinkers = preprocessed_data.loc[drinkers_mask, hours_col]
hours_non_drinkers = preprocessed_data\
.loc[~drinkers_mask, hours_col]
drinkers_test = ttest_ind(hours_drinkers, hours_non_drinkers)
print(f"Statistic value: {drinkers_test[0]}, \
p-value: {drinkers_test[1]}")
The output will be as follows:
Statistic value: 1.7713833295243993, p-value: 0.07690961828294651
- Perform the same test on the social smokers:
# test mean absenteeism time for smokers
smokers_mask = preprocessed_data["Social smoker"] == "Yes"
hours_smokers = preprocessed_data.loc[smokers_mask, hours_col]
hours_non_smokers = preprocessed_data\
.loc[~smokers_mask, hours_col]
smokers_test = ttest_ind(hours_smokers, hours_non_smokers)
print(f"Statistic value: {smokers_test[0]}, \
p-value: {smokers_test[1]}")
The output will be as follows:
Statistic value: -0.24277795417700243, p-value: 0.8082448720154971
As you can see, the p-values of both tests are above the significance level of 0.05, which means that you cannot reject the null hypothesis. In other words, you cannot say that there is a statistically significant difference in the mean absenteeism hours between drinkers and non-drinkers, or between smokers and non-smokers.
Note that in the previous steps, you performed hypothesis tests with the null hypothesis that the average absenteeism hours are equal for drinkers and non-drinkers (and for smokers and non-smokers). However, even if the average hours are equal, the underlying distributions may still differ.
- Perform a Kolmogorov-Smirnov test to assess the difference in the distributions of two samples:
# perform Kolmogorov-Smirnov test for comparing the distributions
from scipy.stats import ks_2samp
ks_drinkers = ks_2samp(hours_drinkers, hours_non_drinkers)
ks_smokers = ks_2samp(hours_smokers, hours_non_smokers)
print(f"Drinkers comparison: statistics={ks_drinkers[0]:.3f}, \
pvalue={ks_drinkers[1]:.3f}")
print(f"Smokers comparison: statistics={ks_smokers[0]:.3f}, \
pvalue={ks_smokers[1]:.3f}")
The output will be as follows:
Drinkers comparison: statistics=0.135, pvalue=0.002
Smokers comparison: statistics=0.104, pvalue=0.607
The p-value for the drinkers dataset is lower than the critical 0.05, which is strong evidence against the null hypothesis of the two distributions being equal. On the other hand, as the p-value for the smokers dataset is higher than 0.05, you cannot reject the null hypothesis.
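To see how the two tests can disagree, consider the following sketch on synthetic data (not the absenteeism dataset). It assumes two normally distributed samples with different spreads, both centered so that their sample means coincide by construction; the sample sizes, scales, seed, and variable names are illustrative choices:

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

rng = np.random.default_rng(42)

# two synthetic samples: same center, very different spread
narrow = rng.normal(loc=0.0, scale=1.0, size=1000)
wide = rng.normal(loc=0.0, scale=3.0, size=1000)

# subtract the sample means so that both samples have the same mean
narrow -= narrow.mean()
wide -= wide.mean()

# the t-test compares means only, the KS test compares whole distributions
t_res = ttest_ind(narrow, wide)
ks_res = ks_2samp(narrow, wide)

print(f"t-test p-value:  {t_res.pvalue:.3f}")
print(f"KS test p-value: {ks_res.pvalue:.3g}")
```

Because the means are identical, the t-test finds no evidence of a difference, while the Kolmogorov-Smirnov test picks up the difference in spread. This mirrors the drinkers comparison, where the two tests lead to different conclusions.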
Note
To access the source code for this specific section, please refer to https://packt.live/3hxt3I6.
You can also run this example online at https://packt.live/2BeAweq. You must execute the entire Notebook in order to get the desired result.
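As a consistency check on the Bayes computation in this exercise, the result can be compared with a direct count: P(social drinker | Ri) is simply the share of drinkers among the entries with reason Ri. The sketch below does this on a made-up toy DataFrame (hypothetical values that only reuse the dataset's column names), so the numbers are illustrative and not results from the absenteeism data:

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "Reason for absence": [23, 23, 28, 28, 28, 0, 23, 28],
    "Social drinker": ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})
reason = toy["Reason for absence"]
is_drinker = toy["Social drinker"] == "Yes"

p_drinker = is_drinker.mean()                    # P(social drinker)
p_reason = reason.value_counts(normalize=True)   # P(Ri)
p_reason_given_drinker = reason[is_drinker]\
    .value_counts(normalize=True)                # P(Ri | social drinker)

# Bayes' theorem; reasons with no drinkers come out as NaN, that is, 0
bayes = (p_reason_given_drinker * p_drinker / p_reason)\
    .fillna(0).sort_index()

# direct route: fraction of drinkers within each reason
direct = is_drinker.groupby(reason).mean().sort_index()

print(pd.DataFrame({"bayes": bayes, "direct": direct}))
```

The two columns should match. The toy data also shows the small-sample effect discussed in the exercise: a reason recorded only a handful of times can easily produce a conditional probability of 0 or 1.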
In this section, we investigated the relationship between the different reasons for absence and social information about the employees (such as being smokers or drinkers). In the next section, we will analyze the impact of the employees' body mass index on their absenteeism.