
Type I versus type II error

Binary classifiers have an intuitive interpretation, since they try to separate data points into two groups. This sounds simple, but we need some way to measure the quality of the separation. Furthermore, an important characteristic of binary classification problems is that the two groups of labels are often far from equal in size. That is, the dataset may be imbalanced with respect to one label, which calls for careful interpretation by the data scientist.

Suppose, for example, we are trying to detect the presence of a particular rare disease in a population of 15 million people and we discover that, using a large subset of the population, only 10,000 of 10 million individuals (0.1%) actually carry the disease. Without taking this huge disproportion into consideration, the most naive algorithm would guess "no presence of disease" for the remaining five million people, simply because only 0.1% of the subset carried the disease. Suppose that, of the remaining five million people, the same proportion, 0.1%, carried the disease. These 5,000 people would then be misdiagnosed, because the naive algorithm would simply guess that no one carries the disease. Is this acceptable? In this situation, the cost of the errors made by the binary classifier is an important factor to consider, and that cost depends on the question being asked.
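The arithmetic behind this example can be checked with a short sketch. It uses only the figures from the text (five million remaining people, a 0.1% disease rate); the variable names are illustrative, and Python is used here purely for convenience.

```python
# Figures from the text: 5 million remaining people, 0.1% of whom carry the disease.
remaining_population = 5_000_000
carriers_per_thousand = 1  # 0.1% is 1 person in 1,000

carriers = remaining_population * carriers_per_thousand // 1000
non_carriers = remaining_population - carriers

# The naive rule predicts "no presence of disease" for everyone.
true_positives = 0               # it never predicts the disease
false_positives = 0
true_negatives = non_carriers    # healthy people labelled correctly
false_negatives = carriers       # carriers the rule misses

accuracy = (true_positives + true_negatives) / remaining_population
recall = true_positives / (true_positives + false_negatives)

print(f"carriers missed: {false_negatives}")
print(f"accuracy: {accuracy:.1%}")
print(f"recall: {recall:.1%}")
```

The naive rule is right for 99.9% of the people, yet it identifies none of the 5,000 carriers. This is why accuracy alone can be misleading on an imbalanced dataset.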

Given that we are only dealing with two outcomes for this type of problem, we can create a 2-D representation of the different types of errors that are possible. Keeping our preceding example of the people carrying / not carrying the disease, we can think about the outcome of our classification rule as follows:

Figure 1 - Relation between predicted and actual values

From the preceding table, the green area represents where we are correctly predicting the presence / absence of disease in the individual whereas the white areas represent where our prediction was incorrect. These false predictions fall into two categories known as Type I and Type II errors:

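  • Type I error: When we reject the null hypothesis (that is, that a person does not carry the disease) when it is in fact true; in other words, we predict the presence of the disease when the individual does not carry it
  • Type II error: When we fail to reject the null hypothesis when it is in fact false; in other words, we predict the absence of the disease when the individual does carry it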
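As a minimal sketch, the four possible outcomes can be tallied from paired actual and predicted labels. It assumes the standard statistical convention, with "does not carry the disease" as the null hypothesis: a Type I error is a false positive and a Type II error is a false negative. The function name `count_outcomes` and the six sample labels are made up for illustration.

```python
def count_outcomes(actual, predicted):
    """Tally the four cells of the 2x2 table.

    Both arguments are sequences of booleans, where True means
    "carries the disease".
    """
    counts = {
        "true_positive": 0,   # predicted disease, has disease
        "true_negative": 0,   # predicted no disease, has no disease
        "type_I_error": 0,    # predicted disease, has no disease (false positive)
        "type_II_error": 0,   # predicted no disease, has disease (false negative)
    }
    for has_disease, predicted_disease in zip(actual, predicted):
        if predicted_disease and has_disease:
            counts["true_positive"] += 1
        elif not predicted_disease and not has_disease:
            counts["true_negative"] += 1
        elif predicted_disease and not has_disease:
            counts["type_I_error"] += 1
        else:
            counts["type_II_error"] += 1
    return counts


# Hypothetical labels for six individuals, for illustration only.
actual = [True, False, False, True, False, False]
predicted = [True, True, False, False, False, False]

counts = count_outcomes(actual, predicted)
print(counts)
```

For these six made-up individuals, the tally is one true positive, three true negatives, one Type I error, and one Type II error.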

Clearly, neither error is desirable, but in practice some errors are often more acceptable than others.

Consider the situation where our model makes significantly more Type I errors than Type II errors; in this case, our model would be predicting that more people carry the disease than actually do. Such a conservative approach may be more acceptable than making Type II errors, where we fail to identify the presence of the disease. Determining the cost of each type of error depends on the question being asked and is something the data scientist must consider. We will revisit this topic of errors and some other metrics of model quality after we build our first binary classification model, which tries to predict the presence / non-presence of the Higgs-Boson particle.
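One way to make the cost of each error type concrete is to weight the error counts. The sketch below is illustrative only: the cost weights and the error counts for the two models are hypothetical numbers chosen for the example, not results from any dataset, and `total_error_cost` is a made-up helper.

```python
def total_error_cost(false_alarms, missed_cases, cost_false_alarm, cost_missed_case):
    """Weighted sum of the two kinds of error."""
    return false_alarms * cost_false_alarm + missed_cases * cost_missed_case


# Hypothetical costs: assume missing a carrier is 50 times as costly
# as raising a false alarm on a healthy person.
COST_FALSE_ALARM = 1
COST_MISSED_CASE = 50

# Hypothetical error counts for two candidate models.
cautious_model = {"false_alarms": 400, "missed_cases": 10}
lenient_model = {"false_alarms": 50, "missed_cases": 40}

cautious_cost = total_error_cost(
    cautious_model["false_alarms"], cautious_model["missed_cases"],
    COST_FALSE_ALARM, COST_MISSED_CASE,
)
lenient_cost = total_error_cost(
    lenient_model["false_alarms"], lenient_model["missed_cases"],
    COST_FALSE_ALARM, COST_MISSED_CASE,
)

print(f"cautious model cost: {cautious_cost}")
print(f"lenient model cost: {lenient_cost}")
```

Under these assumed weights, the cautious model makes more errors in total (410 versus 90) but incurs the lower cost (900 versus 2,050). With different weights the ranking could reverse, which is the sense in which the cost depends on the question being asked.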
