- Mastering Python for Data Science
- Samir Madhavan
- 1158 words
- 2021-07-16 20:14:18
Various forms of distribution
There are various kinds of probability distributions, and each one gives the probability of the different outcomes of a random experiment. In this section, we'll explore several of them.
A normal distribution
A normal distribution is the most common and widely used distribution in statistics. It is also called a "bell curve" or a "Gaussian curve", after the mathematician Carl Friedrich Gauss. Normal distributions occur commonly in nature. Let's take the height example we saw previously. If you have data for the heights of all the people of a particular gender in Hong Kong, and you plot a bar chart where each bar represents the number of people at a particular height, then the curve that is obtained will look very similar to the following graph. The numbers in the plot are the number of standard deviations from the mean, which is zero. The concept will become clearer as we proceed through the chapter.

Also, if you take an hourglass and observe the way the sand stacks up when the hourglass is inverted, the pile roughly takes the shape of a normal distribution. This is a good example of how the normal distribution shows up in nature.

Take the following figure: it shows three normal distribution curves. Curve A has a standard deviation of 1, curve C has a standard deviation of 2, and curve B has a standard deviation of 3, which means that curve B has the greatest spread of values, whereas curve A has the least. Another way of looking at it: if curve B represented the heights of the people of a country, then that country would have people with very diverse heights, whereas a country with the curve A distribution would have people whose heights are similar to each other.
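The link between standard deviation and spread can also be checked numerically. The following is a minimal sketch, assuming SciPy is installed; the mean of 0 and the interval of one unit on either side of it are choices made here for illustration and are not taken from the figure.

```python
from scipy.stats import norm

# Standard deviations of the three curves described in the text (mean assumed to be 0)
curves = {"A": 1, "C": 2, "B": 3}

# Probability that a value falls within one unit of the mean, for each curve
within_one = {
    name: norm.cdf(1, loc=0, scale=sd) - norm.cdf(-1, loc=0, scale=sd)
    for name, sd in curves.items()
}

for name, prob in within_one.items():
    print(f"Curve {name} (sd={curves[name]}): P(-1 <= X <= 1) = {prob:.3f}")
```

The larger the standard deviation, the smaller the share of values that stays close to the mean, which is what a wider curve expresses.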

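Let's take a coin and flip it. The probability of getting a head is 50%, and so is the probability of getting a tail. If you take the same coin and flip it six times, the probability of getting a head exactly three times can be computed using the following formula:

$$P(k) = \binom{n}{k} p^{k} q^{n-k} = \frac{n!}{k!\,(n-k)!}\, p^{k} q^{n-k}$$

In the preceding formula, n is the number of times the coin is flipped, k is the number of heads (successes) for which the probability is computed, p is the probability of success, and q is (1 - p), which is the probability of failure.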
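As a quick check of the six-flip example, the calculation can be sketched in plain Python. This assumes the standard binomial formula, C(n, k) * p^k * q^(n - k), and Python 3.8 or later for math.comb; the function name binomial_probability is a hypothetical helper, not part of any library.

```python
from math import comb

def binomial_probability(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    q = 1 - p  # probability of failure
    return comb(n, k) * p**k * q**(n - k)

# Six flips of a fair coin, exactly three heads
prob_three_heads = binomial_probability(6, 3, 0.5)
print(prob_three_heads)  # 0.3125
```

So there is a little over a 31% chance of seeing exactly three heads in six flips.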
The SciPy package for Python provides useful functions to perform statistical computations. You can install it from http://www.scipy.org/. The following commands help in plotting the binomial distribution:
>>> from scipy.stats import binom
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1, 1)
>>> x = [0, 1, 2, 3, 4, 5, 6]
>>> n, p = 6, 0.5
>>> rv = binom(n, p)
>>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='Probability')
>>> ax.legend(loc='best', frameon=False)
>>> plt.show()
The binom function in the SciPy package helps generate binomial distributions and the necessary statistics related to them. If you observe the preceding commands, parts of them come from matplotlib, which we use here to plot the binomial distribution. The matplotlib library will be covered in detail in later chapters. The plt.subplots function creates a figure with one or more plots (here, a single one). The binom function takes the number of attempts and the probability of success. The ax.vlines function is used to plot vertical lines, and rv.pmf within it calculates the probability at the various values of x. The ax.legend function adds a legend to the graph, and finally, plt.show displays the graph. The result is as follows:

As you can see in the graph, if the coin is flipped six times, then getting three heads has the maximum probability, whereas getting no heads or six heads has the least probability.
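The probabilities behind the plot can also be printed directly. The following is a small sketch that reuses the same binom object for six flips of a fair coin; the names probabilities and most_likely are introduced here only for illustration.

```python
from scipy.stats import binom

n, p = 6, 0.5
rv = binom(n, p)

# Probability of each possible number of heads, 0 through 6
probabilities = {k: rv.pmf(k) for k in range(n + 1)}
for k, prob in probabilities.items():
    print(f"{k} heads: {prob:.4f}")

# Number of heads with the highest probability
most_likely = max(probabilities, key=probabilities.get)
print("Most likely number of heads:", most_likely)
```

The values are symmetric around three heads because the coin is fair, and they fall off toward the two extremes.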
Now, let's increase the number of attempts and see the distribution:
>>> fig, ax = plt.subplots(1, 1)
>>> x = range(101)
>>> n, p = 100, 0.5
>>> rv = binom(n, p)
>>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='Probability')
>>> ax.legend(loc='best', frameon=False)
>>> plt.show()
Here, we try to flip the coin 100 times and see the distribution:

When the probability of success is changed to 0.4, this is what you see:

When the probability is 0.6, this is what you see:

When you flip the coin 1,000 times with a probability of 0.5, this is what you see:

As you can see, the binomial distribution has started to resemble a normal distribution.
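The resemblance can be quantified with a rough sketch that compares the binomial probabilities for 1,000 flips with a normal curve. It assumes the usual matching choice of mean n * p and standard deviation sqrt(n * p * (1 - p)), which is not stated in the text above, and it needs NumPy and SciPy.

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.5
mean = n * p                     # mean of the binomial distribution
sd = np.sqrt(n * p * (1 - p))    # standard deviation of the binomial distribution

x = np.arange(n + 1)
binom_pmf = binom.pmf(x, n, p)                  # exact binomial probabilities
normal_pdf = norm.pdf(x, loc=mean, scale=sd)    # matching normal curve

# Largest absolute gap between the two curves over all values of x
max_gap = np.max(np.abs(binom_pmf - normal_pdf))
print("Peak binomial probability:", binom_pmf.max())
print("Largest gap to the normal curve:", max_gap)
```

The largest gap is tiny compared with the peak probability, which is why the two plots are hard to tell apart by eye at this number of flips.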
A Poisson distribution
A Poisson distribution is the probability distribution of the number of independent event occurrences in an interval. A binomial distribution is used to determine the probability of binary occurrences, whereas a Poisson distribution is used for count-based distributions. If lambda is the mean number of occurrences of the events per interval, then the probability of having k occurrences within a given interval is given by the following formula:

$$P(k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

Here, e is Euler's number, k is the number of occurrences for which the probability is going to be determined, and lambda ($\lambda$) is the mean number of occurrences.
Let's understand this with an example. On average, 20 cars pass over a bridge in an hour. What is the probability of exactly 23 cars passing over the bridge in an hour?
For this, we'll use the poisson function from SciPy:
>>> from scipy.stats import poisson
>>> rv = poisson(20)
>>> rv.pmf(23)
0.066881473662401172
With the poisson function, we define the mean value, which is 20 cars. The rv.pmf function gives the probability, which is around 6.7%, that exactly 23 cars will pass over the bridge.
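The same number can be reproduced by hand from the Poisson formula, lambda^k * e^(-lambda) / k!. The following is a minimal sketch using only the standard library; poisson_probability is a hypothetical helper name chosen for this example.

```python
from math import exp, factorial

def poisson_probability(k, lam):
    """Probability of exactly k occurrences when the mean count is lam."""
    return lam**k * exp(-lam) / factorial(k)

# Mean of 20 cars per hour, exactly 23 cars observed
prob_23_cars = poisson_probability(23, 20)
print(round(prob_23_cars, 4))  # 0.0669
```

This agrees with the value returned by rv.pmf(23), so the SciPy call is doing nothing more than evaluating the formula.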
A Bernoulli distribution
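You can perform an experiment with two possible outcomes: success or failure. Success has a probability of p, and failure has a probability of 1 - p. A random variable that takes the value 1 in case of a success and 0 in case of a failure is said to follow a Bernoulli distribution. The probability distribution function can be written as:

$$P(n) = \begin{cases} 1 - p & \text{for } n = 0 \\ p & \text{for } n = 1 \end{cases}$$

It can also be written like this:

$$P(n) = p^{n} (1 - p)^{1 - n}, \quad n \in \{0, 1\}$$

The distribution function can be written like this:

$$D(n) = \begin{cases} 1 - p & \text{for } n = 0 \\ 1 & \text{for } n = 1 \end{cases}$$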

The following plot shows a Bernoulli distribution:

Voting in an election is a good example of the Bernoulli distribution.
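The probability function and the distribution function of a Bernoulli variable can be evaluated with SciPy. The following is a minimal sketch, assuming a success probability of 0.7 (for instance, the chance that a voter picks a given candidate); that value is an assumption made here for illustration.

```python
from scipy.stats import bernoulli

p = 0.7  # assumed probability of success, for illustration only
rv = bernoulli(p)

pmf_failure = rv.pmf(0)  # probability of the outcome 0, equal to 1 - p
pmf_success = rv.pmf(1)  # probability of the outcome 1, equal to p
cdf_at_zero = rv.cdf(0)  # probability of an outcome <= 0, equal to 1 - p
cdf_at_one = rv.cdf(1)   # probability of an outcome <= 1, equal to 1

print(pmf_failure, pmf_success, cdf_at_zero, cdf_at_one)
```

Because there are only two outcomes, the whole distribution is determined by the single parameter p.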
A Bernoulli distribution can be generated using the bernoulli.rvs() function of the SciPy package. The following call draws 100 samples from a Bernoulli distribution with a success probability of 0.7:
>>> from scipy import stats
>>> stats.bernoulli.rvs(0.7, size=100)
array([1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1])
If each 1 in the preceding output is a vote for a candidate, then the candidate receives roughly 70% of the votes; the exact share varies from one random sample to the next.
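How close a sample comes to the 70% figure can be checked with a larger, repeatable draw. The sketch below assumes NumPy and a SciPy version whose rvs accepts a NumPy Generator through random_state; the seed of 42 and the sample size of 10,000 are arbitrary choices made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed, arbitrary seed so the run is repeatable

# Each 1 is a vote for the candidate, each 0 a vote against
votes = stats.bernoulli.rvs(0.7, size=10_000, random_state=rng)

# Fraction of sampled votes that went to the candidate
vote_share = votes.mean()
print(f"Share of votes for the candidate: {vote_share:.3f}")
```

With more samples, the observed share tends to settle closer to the underlying probability of 0.7.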