If we would like to understand the variability of our data, there are three important measures we can use, and those include the following:
Variance
Standard deviation
Coefficient of variation
Variance measures how far the data points spread around their mean value; formally, it is the average of the squared deviations from the mean. Variance has a central role in statistics, where ideas that use it include descriptive statistics, statistical inference, hypothesis testing, goodness of fit, and Monte Carlo sampling. Variance is an important tool in the sciences, where statistical analysis of data is common. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself.
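To make the definition concrete, here is a minimal sketch that computes the population variance from first principles; it assumes a table named dataset with a numeric column datapoint, the same names used in the queries later in this section:
-- Population variance = the average of the squared deviations from the mean
SELECT AVG(SQUARE(d.datapoint - m.mean_value)) AS population_variance
FROM dataset AS d
CROSS JOIN (SELECT AVG(CAST(datapoint AS float)) AS mean_value FROM dataset) AS m;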
Standard deviation is a measure of how spread out the numbers are: the larger the standard deviation, the further apart the values in the set lie. In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions. Unlike variance, the standard deviation is expressed in the same units as the data, which makes it easy to interpret.
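Since the standard deviation is simply the square root of the variance, the two SQL Server built-in functions covered below should agree; a quick check, again assuming the dataset table:
-- Both columns should return the same value
SELECT SQRT(VARP(datapoint)) AS sqrt_of_variance,
       STDEVP(datapoint) AS standard_deviation
FROM dataset;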
The coefficient of variation is a measure of the dispersion of a probability or frequency distribution. It is defined as the ratio of the standard deviation to the mean and is usually expressed as a percentage.
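SQL Server has no built-in function for the coefficient of variation, but it can be derived from the built-in functions covered below; a minimal sketch, assuming the same dataset table and using the population standard deviation (NULLIF guards against a zero mean):
-- Coefficient of variation as a percentage: standard deviation / mean * 100
SELECT STDEVP(datapoint) / NULLIF(AVG(CAST(datapoint AS float)), 0) * 100 AS coefficient_of_variation_pct
FROM dataset;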
SQL Server has several built-in functions that can be used to calculate variance and standard deviation. To get the variability measures for our table, we can run the following query:
SELECT AVG(datapoint), VARP(datapoint), STDEVP(datapoint) FROM dataset
Don't get confused by those function names: for variance there are two functions, VAR() and VARP(), and the same goes for standard deviation, STDEV() and STDEVP(). The distinction is very important from a statistics perspective: the P-suffixed functions treat the data as the whole population, while VAR() and STDEV() treat it as just a sample of the population.
Variance for a sample and for a population are not the same; if you check the formulas behind these functions, they have different denominators: the population variance is σ² = Σ(xᵢ − μ)² / N, while the sample variance is s² = Σ(xᵢ − x̄)² / (n − 1), where μ and x̄ are the population and sample means.
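The relationship between the two denominators is easy to verify in SQL Server itself; rescaling the population variance by n / (n − 1) should reproduce the sample variance exactly (assuming the dataset table holds more than one row):
-- s^2 = population variance * n / (n - 1), so both columns should match
SELECT VAR(datapoint) AS sample_variance,
       VARP(datapoint) * COUNT(datapoint) / (COUNT(datapoint) - 1.0) AS rescaled_population_variance
FROM dataset;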
The full query to get the variability measures for our table, including both population and sample measures, would look like this:
SELECT AVG(datapoint) AS Mean,
       VAR(datapoint) AS 'Sample variation',
       VARP(datapoint) AS 'Population variation',
       STDEV(datapoint) AS 'Sample standard deviation',
       STDEVP(datapoint) AS 'Population standard deviation'
FROM dataset
In such cases, the sample variability measures will be higher than the population measures, because their denominator (n − 1 rather than n) is lower. With only a sample, where you don't have all of the population's data points, the extremes and the central tendency can differ from those of the whole range of values. When using these functions, you need to know whether you're working with just a sample of the data or with the whole population available, so that the results are not skewed and the functions give you the proper outcomes!