Type I versus Type II error

Binary classifiers have an intuitive interpretation because they try to separate data points into two groups. This sounds simple, but we need some way of measuring the quality of this separation. Furthermore, one important characteristic of binary classification problems is that the two groups of labels often occur in very different proportions. That is, the dataset may be imbalanced with respect to one label, which calls for careful interpretation by the data scientist.

Suppose, for example, we are trying to detect the presence of a particular rare disease in a population of 15 million people and we discover that - using a large subset of the population - only 10,000 of 10 million individuals (0.1%) actually carry the disease. Without taking this huge disproportion into consideration, the most naive algorithm would guess "no presence of disease" for every one of the remaining five million people, simply because only 0.1% of the subset carried the disease. Suppose that the same proportion, 0.1%, of the remaining five million people carried the disease; these 5,000 people would then not be correctly diagnosed, because the naive algorithm would simply guess that no one carries the disease. Is this acceptable? In this situation, the cost of the errors made by a binary classifier is an important factor to consider, and that cost depends on the question being asked.
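The arithmetic behind this example can be checked with a few lines of Python. This is only a minimal sketch: the function name `naive_baseline` and its result keys are hypothetical, chosen for illustration, and the inputs are the figures from the paragraph above (five million people, 0.1% prevalence).

```python
def naive_baseline(population: int, prevalence: float) -> dict:
    """Score a classifier that always predicts "no presence of disease"."""
    carriers = round(population * prevalence)  # people who actually carry the disease
    healthy = population - carriers            # people who do not
    return {
        "missed_carriers": carriers,           # every carrier is misdiagnosed
        "correct_predictions": healthy,        # every healthy person is "right" by default
        "accuracy": healthy / population,
    }


result = naive_baseline(population=5_000_000, prevalence=0.001)
print(result)
```

The naive rule is right for 99.9% of the people and still misses all 5,000 carriers, which is why accuracy alone says little on an imbalanced dataset.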

Given that we are only dealing with two outcomes for this type of problem, we can create a 2-D representation of the different types of errors that are possible. Keeping our preceding example of the people carrying / not carrying the disease, we can think about the outcome of our classification rule as follows:

Figure 1 - Relation between predicted and actual values

In the preceding figure, the green area represents where we correctly predict the presence / absence of disease in the individual, whereas the white areas represent where our prediction is incorrect. These incorrect predictions fall into two categories, known as Type I and Type II errors:

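  • Type I error: When we reject the null hypothesis (that is, a person not carrying the disease) when, in fact, it is true; we predict the presence of the disease when the individual does not carry it (a false positive)
  • Type II error: When we fail to reject the null hypothesis when, in fact, it is false; we predict the absence of the disease when the individual does carry it (a false negative)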
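As a rough illustration of these categories, the following Python sketch tallies the four possible outcomes from paired lists of actual and predicted labels. It follows the standard convention that a Type I error is a false positive and a Type II error is a false negative. The function `error_counts`, the key names, and the six toy labels are all made up for this example; `True` stands for "carries the disease".

```python
from collections import Counter


def error_counts(actual, predicted):
    """Tally the four outcomes of a binary prediction (True = carries the disease)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p and a:
            counts["true_positive"] += 1
        elif p and not a:
            counts["type_1_false_positive"] += 1   # predicted disease, none present
        elif a:
            counts["type_2_false_negative"] += 1   # missed a real carrier
        else:
            counts["true_negative"] += 1
    return counts


# Toy labels, invented purely for illustration.
actual = [True, False, False, True, False, False]
predicted = [True, True, False, False, False, False]

counts = error_counts(actual, predicted)
print(dict(counts))
```

The two correct cells (true positives and true negatives) correspond to the correct predictions in the figure, and the two error cells to the incorrect ones.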

Clearly, neither error is good, but in practice some errors are often more acceptable than others.

Consider the situation where our model makes significantly more Type I errors than Type II errors; in this case, our model would be predicting that more people carry the disease than actually do - a conservative approach that may be more acceptable than making Type II errors, where we fail to identify the presence of the disease. The cost of each type of error depends on the question being asked and is something the data scientist must consider. We will revisit this topic of errors and some other metrics of model quality after we build our first binary classification model, which tries to predict the presence / non-presence of the Higgs boson particle.
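One simple way to make the cost trade-off concrete is to weight each error type by a cost and compare totals. The sketch below is only an illustration under stated assumptions: the function `total_error_cost`, the two models, their error counts, and the cost values are all hypothetical and do not come from any real dataset.

```python
def total_error_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Weighted sum of Type I (false positive) and Type II (false negative) errors."""
    return false_positives * cost_fp + false_negatives * cost_fn


# Hypothetical costs: a missed carrier is assumed to be 20 times as costly
# as an unnecessary follow-up test.
COST_FP, COST_FN = 1.0, 20.0

# Hypothetical error counts for two models scored on the same data.
conservative = total_error_cost(2000, 50, COST_FP, COST_FN)   # many Type I, few Type II
permissive = total_error_cost(100, 500, COST_FP, COST_FN)     # few Type I, many Type II

print(conservative, permissive)
```

Under these assumed costs, the conservative model is cheaper overall even though it makes far more errors in total; with different costs, the ranking could reverse.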
