
Chi-square tests

The chi-square test is a statistical test commonly used to compare observed data with the data expected under a given hypothesis, so it is itself a hypothesis test. You assume a hypothesis that the data should follow and calculate the expected values under it; the observed values you already have. The deviation between the observed and expected data is measured with the following statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Here, O is an observed value, E is the corresponding expected value, and the summation runs over all the categories (data points).
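As a minimal sketch of this statistic in Python, the helper below sums the squared deviations scaled by the expected values. The function name `chi_square_statistic` and the three-category counts are illustrative assumptions, not something taken from the text.

```python
def chi_square_statistic(observed, expected):
    """Return the sum of (O - E)^2 / E over paired observed/expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories (assumed for illustration only).
observed = [18, 22, 20]
expected = [20, 20, 20]
print(chi_square_statistic(observed, expected))
```

The expected counts must all be non-zero, otherwise the division fails.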

The chi-square test can be used to do the following things:

  • Test for a relationship or independence between an input and an output variable. We assume that they are independent (the null hypothesis) and calculate the expected values under that assumption. Then we calculate the chi-square value. If the null hypothesis is rejected, it suggests a relationship between the two variables that is statistically significant rather than just due to chance. Note that this indicates an association, not necessarily causation.
  • Check whether the observed data comes from a fair/unbiased source. If the observed data deviates strongly from the expected data, the source is probably not fair. If it is close to the expected values, there is no evidence of bias.
  • Check whether data is too good to be true. Because the experiment is random, we don't expect the observed values to match the assumed hypothesis exactly. If they match it almost perfectly, the data has probably been tampered with to make it look good.
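For the independence use case, the expected count of each cell under the null hypothesis is usually taken as (row total x column total) / grand total. The sketch below illustrates that step on a hypothetical 2 x 2 table; the helper name `expected_counts` and all the counts are assumptions made for this example.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Hypothetical table: rows are input levels, columns are output levels.
observed = [[30, 20],
            [10, 40]]
expected = expected_counts(observed)

# Chi-square statistic summed over every cell of the table.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(expected)
print(round(chi_square, 3))
```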

Let us create a hypothetical experiment where a coin is tossed 10 times. How many times do you expect it to turn up heads? Five, right? Now, what if we toss the coin 1000 times and record the number of heads and tails? For a fair coin, we expect 500 of each. Suppose we observed heads 553 times and tails in the remaining 447 trials.

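Let us calculate the chi-square value, taking 500 heads and 500 tails as the expected counts for a fair coin:

$$\chi^2 = \frac{(553 - 500)^2}{500} + \frac{(447 - 500)^2}{500} = 5.618 + 5.618 = 11.236$$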
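The same calculation can be sketched in a few lines of Python, assuming a fair coin so that both expected counts are half the number of tosses. The variable names are illustrative.

```python
tosses = 1000
observed = {"heads": 553, "tails": tosses - 553}
# Under the null hypothesis of a fair coin, each side is expected half the time.
expected = {"heads": tosses / 2, "tails": tosses / 2}

chi_square = sum(
    (observed[side] - expected[side]) ** 2 / expected[side] for side in observed
)
print(round(chi_square, 3))  # 11.236
```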

This chi-square value is compared to the critical value of a chi-square distribution for a given number of degrees of freedom and a given significance level. The number of degrees of freedom is the number of categories minus 1. In this case, it is 2-1=1. Let us assume a significance level of 0.05.

The chi-square distribution looks a little different from the normal distribution. It also has a peak, but it takes only non-negative values and has a much longer tail on the right. As the number of degrees of freedom increases, it starts to look more like a normal distribution:
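To make the change in shape concrete, the following sketch evaluates the standard chi-square density and its skewness, sqrt(8/k), for a few degrees of freedom k. A skewness close to 0 indicates a nearly symmetric, normal-like shape. The function names and the chosen values of k are assumptions for illustration.

```python
import math


def chi2_pdf(x, k):
    """Density of the chi-square distribution with k degrees of freedom (x > 0)."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))


def chi2_skewness(k):
    """Skewness of the chi-square distribution; it shrinks toward 0 as k grows."""
    return math.sqrt(8 / k)


# Print k, the density evaluated at x = k (the mean), and the skewness.
for k in (1, 2, 5, 10, 50):
    print(k, round(chi2_pdf(k, k), 4), round(chi2_skewness(k), 3))
```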

Fig. 4.6: Chi-square distribution with different degrees of freedom

When we look at the chi-square distribution table for 1 degree of freedom and a significance level of 0.05, we get a critical value of 3.841. At a significance level of 0.01, we get 6.635. In both cases, the calculated chi-square statistic (11.236) is greater than the critical value from the table, meaning that it lies to the right of the critical value on the distribution.

Hence, the null hypothesis is rejected. That suggests that the coin is not fair.

Fig. 4.7: Null hypothesis is rejected because the critical value of the chi-square distribution at the significance level is less than the calculated chi-square statistic
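Instead of reading a table, the comparison can also be sketched with a p-value. For 1 degree of freedom only, the right-tail probability of the chi-square distribution has the closed form erfc(sqrt(x/2)), which the standard library can evaluate. The helper name `chi2_sf_df1` is an assumption, and the critical values are the table values quoted above.

```python
import math


def chi2_sf_df1(x):
    """P(X >= x) for a chi-square variable with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


chi_square = 11.236  # statistic for 553 heads in 1000 tosses
p_value = chi2_sf_df1(chi_square)

# Both views agree: statistic above the critical value <=> p-value below alpha.
for alpha, critical in ((0.05, 3.841), (0.01, 6.635)):
    print(alpha, chi_square > critical, p_value < alpha)
```

For other degrees of freedom this closed form does not apply, and a statistics library or a distribution table would be needed.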

Let us look at another example where we want to test whether the gender of a student and the subject they choose are independent.

Suppose, in a group of students, the following table represents the number of boys and girls who have taken Maths, Arts, and Commerce, as their main subjects.

The observed number of boys and girls in each subject is as shown in the following table:

On calculating and summing up all the values, the chi-square value comes out to be 5.05. For a test of independence, the number of degrees of freedom is (number of rows-1) x (number of columns-1), which amounts to [(2-1)x(3-1)=2]. Let us assume a significance level of 0.05.

Looking at the chi-square distribution table, one can find that for a chi-square distribution with 2 degrees of freedom, the critical value at a significance level of 0.05 is 5.991.

The calculated chi-square statistic < critical chi-square value (at significance level=0.05).

Since the calculated chi-square statistic lies to the left of the critical value at the significance level, the null hypothesis can't be rejected. Hence, the data gives no evidence that the choice of subject depends on gender.
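The decision can be checked with a short sketch. For a test of independence on a table with r rows and c columns, the degrees of freedom are commonly taken as (r-1) x (c-1), which is 2 for the 2 x 3 gender-by-subject table. With exactly 2 degrees of freedom, both the critical value and the p-value have simple closed forms, used below. The variable names are illustrative, and 5.05 is the statistic quoted in the text.

```python
import math

rows, cols = 2, 3  # gender (2 levels) by subject (3 levels)
dof = (rows - 1) * (cols - 1)

chi_square = 5.05  # statistic reported in the text
alpha = 0.05

# These closed forms hold only when dof == 2.
critical = -2 * math.log(alpha)       # value with right-tail area alpha
p_value = math.exp(-chi_square / 2)   # right-tail area beyond the statistic

reject = chi_square > critical
print(dof, round(critical, 3), round(p_value, 3), reject)
```

Since the statistic is below the critical value, the sketch reports that the null hypothesis of independence is not rejected at the 0.05 level.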
