- Mastering Python for Data Science
- Samir Madhavan
- 2021-07-16 20:14:18
Various forms of distribution
There are various kinds of probability distributions, each of which gives the probabilities of the different outcomes of a random experiment. In this section, we'll explore several of them.
A normal distribution
A normal distribution is the most common and widely used distribution in statistics. It is also called a "bell curve" or a "Gaussian curve", after the mathematician Carl Friedrich Gauss. Normal distributions occur commonly in nature. Take the height example we saw previously: if you have the heights of all the people of a particular gender in Hong Kong and plot a bar chart in which each bar represents the number of people of a given height, the resulting curve will look very similar to the following graph. The numbers in the plot are distances from the mean, measured in standard deviations; the mean is at zero. The concept will become clearer as we proceed through the chapter.

Also, if you take an hourglass and observe the way the sand stacks up when the hourglass is inverted, the pile takes a shape resembling a normal distribution. This is a good example of how the normal distribution appears in nature.

Take the following figure: it shows three curves, each with a normal distribution. Curve A has a standard deviation of 1, curve C has a standard deviation of 2, and curve B has a standard deviation of 3, which means that curve B has the widest spread of values and curve A the narrowest. Put another way, if curve B represented the heights of the people of a country, that country would have people of widely varying heights, whereas a country with the curve A distribution would have people whose heights are similar to each other.
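The effect of the standard deviation on spread can also be checked numerically. The sketch below is an illustration rather than part of the original text: it uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) instead of SciPy, assumes that all three curves share a mean of zero, and the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist


def share_within(sigma, half_width=1.0, mu=0.0):
    """Fraction of a normal(mu, sigma) population that lies within
    half_width of the mean (hypothetical helper for illustration)."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(mu + half_width) - dist.cdf(mu - half_width)


# Standard deviations of curves A, C, and B as given in the text
for label, sigma in [("A", 1), ("C", 2), ("B", 3)]:
    share = share_within(sigma)
    print(f"curve {label}: sigma={sigma}, share within 1 unit of the mean = {share:.3f}")
```

The share close to the mean shrinks as the standard deviation grows: roughly 68% for curve A, falling to about a quarter for curve B, which is the wider spread described above.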

Let's take a coin and flip it. The probability of getting a head is 50%, as is the probability of getting a tail. If you take the same coin and flip it six times, the probability of getting a head exactly three times can be computed using the following formula:
$$P(X = k) = \binom{n}{k} p^k q^{n-k}$$
In the preceding formula, n is the number of times the coin is flipped, k is the number of successes (here, heads) whose probability is wanted, p is the probability of success, and q is (1 - p), which is the probability of failure.
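As a worked example of the coin-flip calculation, the following sketch evaluates the binomial probability C(n, k) · p^k · q^(n - k) directly, where k is the number of heads. It is an illustration only: it relies on `math.comb` from the standard library (Python 3.8 or later), and the function name `binomial_pmf` is an assumption of this example, not something defined in the text.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    q = 1 - p  # probability of failure
    return comb(n, k) * p**k * q**(n - k)


# Six flips of a fair coin: probability of each possible number of heads
for k in range(7):
    print(k, binomial_pmf(k, 6, 0.5))
```

For three heads, C(6, 3) = 20 and 0.5^6 = 1/64, so the probability is 20/64 = 0.3125.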
The SciPy package for Python provides useful functions for statistical computations. You can install it from http://www.scipy.org/. The following commands help in plotting the binomial distribution:
>>> from scipy.stats import binom
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(1, 1)
>>> x = [0, 1, 2, 3, 4, 5, 6]
>>> n, p = 6, 0.5
>>> rv = binom(n, p)
>>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='Probability')
>>> ax.legend(loc='best', frameon=False)
>>> plt.show()
The binom function in the SciPy package helps generate binomial distributions and the statistics related to them. Some of the preceding commands come from matplotlib, which we use here to plot the binomial distribution; the matplotlib library will be covered in detail in later chapters. The plt.subplots function creates a figure together with one or more plots on it; here, it creates a single plot. The binom function takes the number of attempts and the probability of success. The ax.vlines function plots vertical lines, and rv.pmf within it calculates the probability at each value of x. The ax.legend function adds a legend to the graph, and finally, plt.show displays the graph. The result is as follows:

As you can see in the graph, if the coin is flipped six times, getting three heads has the maximum probability, whereas getting no heads or six heads has the least probability.
Now, let's increase the number of attempts and see the distribution:
>>> fig, ax = plt.subplots(1, 1)
>>> x = range(101)
>>> n, p = 100, 0.5
>>> rv = binom(n, p)
>>> ax.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-', lw=1, label='Probability')
>>> ax.legend(loc='best', frameon=False)
>>> plt.show()
Here, the coin is flipped 100 times, and the distribution looks like this:

When the probability of success is changed to 0.4, this is what you see:

When the probability is 0.6, this is what you see:

When you flip the coin 1000 times with a probability of 0.5, this is what you see:

As you can see, the binomial distribution has started to resemble a normal distribution.
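The resemblance can be quantified as well as seen. The following sketch, which is an addition for illustration and not taken from the book, compares the binomial probabilities with the density of a normal curve that has the same mean (n · p) and standard deviation (the square root of n · p · q). It uses only the standard library; the log-space calculation with `math.lgamma` is a choice made here to avoid overflow for large n, and the function names are assumptions of this example.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Binomial probability computed in log space, so that large n
    does not overflow or underflow."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log(1 - p))


def max_gap_to_normal(n, p):
    """Largest absolute difference between the binomial probabilities
    and the density of a normal curve with the same mean and spread."""
    normal = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))
    return max(abs(binomial_pmf(k, n, p) - normal.pdf(k)) for k in range(n + 1))


# The gap shrinks as the number of flips grows
for n in (6, 100, 1000):
    print(n, max_gap_to_normal(n, 0.5))
```

The printed gap gets smaller with each increase in n, which matches the visual impression from the plots.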
A Poisson distribution
A Poisson distribution is the probability distribution of the number of independent events that occur in a fixed interval. A binomial distribution is used to determine the probability of binary occurrences, whereas a Poisson distribution is used for count-based distributions. If lambda is the mean number of events per interval, then the probability of k occurrences within a given interval is given by the following formula:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Here, e is Euler's number, k is the number of occurrences for which the probability is to be determined, and lambda ($\lambda$) is the mean number of occurrences per interval.
Let's understand this with an example. On average, 20 cars pass over a bridge in an hour. What is the probability of exactly 23 cars passing over the bridge in an hour?
For this, we'll use the poisson function from SciPy:
>>> from scipy.stats import poisson
>>> rv = poisson(20)
>>> rv.pmf(23)
0.066881473662401172
With the poisson function, we define the mean value, which is 20 cars. The rv.pmf function gives the probability that exactly 23 cars will pass over the bridge, which is around 6.7%.
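The same figure can be reproduced by hand from the Poisson formula, lambda^k · e^(-lambda) / k!. The sketch below is only an illustration using the standard library; the function name `poisson_pmf` is an assumption of this example and is not SciPy's API.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean count per interval is lam."""
    return lam**k * math.exp(-lam) / math.factorial(k)


# Mean of 20 cars per hour; probability of exactly 23 cars in an hour
print(poisson_pmf(23, 20))
```

The result should agree with the value returned by rv.pmf(23) to within floating-point rounding.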
A Bernoulli distribution
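Consider an experiment with two possible outcomes: success or failure. Success has a probability of p, and failure has a probability of 1 - p. A random variable that takes the value 1 in case of success and 0 in case of failure is called a Bernoulli random variable, and its distribution is the Bernoulli distribution. The probability distribution function can be written as:
$$P(X = k) = \begin{cases} 1 - p & \text{for } k = 0 \\ p & \text{for } k = 1 \end{cases}$$
It can also be written like this:
$$P(X = k) = p^k (1 - p)^{1 - k}, \quad k \in \{0, 1\}$$
The distribution function can be written like this:
$$P(X \le k) = \begin{cases} 1 - p & \text{for } k = 0 \\ 1 & \text{for } k = 1 \end{cases}$$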
The following plot shows a Bernoulli distribution:

Voting in an election is a good example of the Bernoulli distribution: each vote is either for a particular candidate or not.
Samples from a Bernoulli distribution can be generated using the bernoulli.rvs() function of the SciPy package. The following call generates 100 samples from a Bernoulli distribution with a success probability of 0.7:
>>> from scipy import stats
>>> stats.bernoulli.rvs(0.7, size=100)
array([1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1])
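If each value in the preceding output is one person's vote, with 1 meaning a vote for the candidate, then the candidate receives about 70% of the votes on average, although any single sample of 100 votes will deviate somewhat from that figure.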
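To see how the observed share of votes relates to the underlying probability, the following sketch draws Bernoulli samples with the standard library's `random` module instead of SciPy. It is an illustration under stated assumptions: the function names `bernoulli_pmf` and `bernoulli_sample` are made up for this example, and the seed is arbitrary.

```python
import random


def bernoulli_pmf(k, p):
    """P(X = k) for a Bernoulli variable: p when k is 1, 1 - p when k is 0."""
    return p**k * (1 - p) ** (1 - k)


def bernoulli_sample(p, size, seed=None):
    """Draw `size` Bernoulli(p) outcomes as a list of 0s and 1s."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(size)]


votes = bernoulli_sample(0.7, size=100, seed=42)  # seed chosen arbitrarily
print(sum(votes) / len(votes))  # observed share of votes for the candidate
```

With only 100 samples, the observed share usually lands near 0.7 without matching it exactly; increasing size brings it closer.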