Type I versus Type II error

Binary classifiers have an intuitive interpretation, since they try to separate data points into two groups. This sounds simple, but we need some way of measuring the quality of this separation. Furthermore, one important characteristic of a binary classification problem is that the two groups of labels are often present in very different proportions. That means the dataset may be imbalanced with respect to one label, which necessitates careful interpretation by the data scientist.

Suppose, for example, we are trying to detect the presence of a particular rare disease in a population of 15 million people and we discover, using a large subset of the population, that only 10,000 of 10 million individuals actually carry the disease. Without taking this huge disproportion into consideration, the most naive algorithm would guess "no presence of disease" for the remaining five million people simply because only 0.1% of the subset carried the disease. Suppose that the same proportion, 0.1%, of the remaining five million people carried the disease; these 5,000 people would then be incorrectly diagnosed because the naive algorithm would simply guess that no one carries the disease. Is this acceptable? In this situation, the cost of the errors made by the binary classifier is an important factor to consider, and it depends on the question being asked.
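The arithmetic behind this example can be checked with a short sketch. It uses only the figures quoted above; the variable names are illustrative, and the "always predict no disease" rule is the naive algorithm described in the text, not a real model.

```python
# Figures taken from the example in the text.
population = 15_000_000
screened = 10_000_000          # the large subset that was examined
carriers_in_screened = 10_000  # individuals found to carry the disease

# Prevalence observed in the screened subset (0.1%).
prevalence = carriers_in_screened / screened

# People not yet screened, and how many of them we expect to be carriers
# if the same prevalence holds.
remaining = population - screened
expected_carriers = round(remaining * prevalence)

# The naive classifier predicts "no disease" for everyone, so every
# carrier is missed while every non-carrier is classified correctly.
missed_carriers = expected_carriers
naive_accuracy = (remaining - missed_carriers) / remaining

print(f"prevalence:        {prevalence:.1%}")
print(f"missed carriers:   {missed_carriers}")
print(f"naive accuracy:    {naive_accuracy:.1%}")
```

The naive rule scores 99.9% accuracy while missing every one of the 5,000 carriers, which is why accuracy alone can mislead on imbalanced data.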

Given that we are only dealing with two outcomes for this type of problem, we can create a 2-D representation of the different types of errors that are possible. Keeping our preceding example of the people carrying / not carrying the disease, we can think about the outcome of our classification rule as follows:

Figure 1 - Relation between predicted and actual values

In the preceding figure, the green area represents where we correctly predict the presence or absence of the disease in the individual, whereas the white areas represent where our prediction is incorrect. These false predictions fall into two categories known as Type I and Type II errors:

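  • Type I error: When we reject the null hypothesis (that is, that a person does not carry the disease) when it is in fact true; we predict the presence of the disease in an individual who does not carry it (a false positive)
  • Type II error: When we fail to reject the null hypothesis when it is in fact false; we predict the absence of the disease in an individual who does carry it (a false negative)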
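As a minimal sketch, the four outcomes of the predicted-versus-actual grid can be tallied as follows. The function name `confusion_counts`, the dictionary keys, and the toy labels are hypothetical, chosen only for illustration; `1` stands for "carries the disease" and `0` for "does not", with the null hypothesis being "does not carry the disease".

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Tally the four outcomes of a binary classifier.

    A Type I error is a false positive (predicted 1, actual 0).
    A Type II error is a false negative (predicted 0, actual 1).
    """
    counts = Counter(true_positive=0, true_negative=0, type_i=0, type_ii=0)
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["true_positive"] += 1
        elif a == 0 and p == 0:
            counts["true_negative"] += 1
        elif a == 0 and p == 1:
            counts["type_i"] += 1   # disease predicted, but absent
        else:
            counts["type_ii"] += 1  # disease missed, but present
    return counts


# Toy labels, made up for illustration only.
actual = [1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1]
result = confusion_counts(actual, predicted)
print(dict(result))
```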

Clearly, neither error is desirable, but in practice some errors are more acceptable than others.

Consider the situation where our model makes significantly more Type I errors than Type II errors; in this case, our model would be predicting that more people carry the disease than actually do. Such a conservative approach may be more acceptable than making Type II errors, where we fail to identify the presence of the disease. Determining the cost of each type of error is a function of the question being asked and is something the data scientist must consider. We will revisit this topic of errors, along with some other metrics of model quality, after we build our first binary classification model, which tries to predict the presence or absence of the Higgs boson particle.
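One way to make the cost of each error type concrete is to weight the error counts and compare the totals. The sketch below assumes a simple linear cost; the function `total_cost`, the cost values, and the error counts for the two models are all hypothetical numbers invented for illustration, not results from any real classifier.

```python
def total_cost(type_i_count, type_ii_count, cost_type_i, cost_type_ii):
    """Linear cost of a classifier's mistakes (an assumed cost model)."""
    return type_i_count * cost_type_i + type_ii_count * cost_type_ii


# Hypothetical costs: a missed diagnosis (Type II) is assumed to be
# 50 times as costly as an unnecessary follow-up test (Type I).
COST_TYPE_I = 1
COST_TYPE_II = 50

# Hypothetical error counts for two models on the same data.
# The conservative model raises many false alarms but misses few carriers.
conservative_cost = total_cost(400, 10, COST_TYPE_I, COST_TYPE_II)
# The lenient model raises few false alarms but misses many carriers.
lenient_cost = total_cost(20, 100, COST_TYPE_I, COST_TYPE_II)

print("conservative model cost:", conservative_cost)
print("lenient model cost:     ", lenient_cost)
```

Under these assumed costs the conservative model is preferable even though it makes more errors in total; with different costs the ranking could reverse, which is why the costs must come from the question being asked.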
