書(shū)名： Mastering Machine Learning with Spark 2.x
作者名： Alex Tellez Max Pumperla Michal Malohlava
本章字?jǐn)?shù)： 512字
更新時(shí)間： 2021-07-02 18:46:06

Type I versus type II error

Binary classifiers have intuitive interpretation since they are trying to separate data points into two groups. This sounds simple, but we need to have some notion of measuring the quality of this separation. Furthermore, one important characteristic of a binary classification problem is that, often, the proportion of one group of labels versus the other can be disproportionate. That means the dataset may be imbalanced with respect to one label which necessitates careful interpretation by the data scientist.

Suppose, for example, we are trying to detect the presence of a particular rare disease in a population of 15 million people and we discover that - using a large subset of the population - only 10,000 or 10 million individuals actually carry the disease. Without taking this huge disproportion into consideration, the most naive algorithm would guess "no presence of disease" on the remaining five million people simply because 0.1% of the subset carried the disease. Suppose that of the remaining five million people, the same proportion, 0.1%, carried the disease, then these 5,000 people would not be correctly diagnosed because the naive algorithm would simply guess no one carries the disease. Is this acceptable? In this situation, the cost of the errors posed by binary classification is an important factor to consider, which is relative to the question being asked.

Given that we are only dealing with two outcomes for this type of problem, we can create a 2-D representation of the different types of errors that are possible. Keeping our preceding example of the people carrying / not carrying the disease, we can think about the outcome of our classification rule as follows:

Figure 1 - Relation between predicted and actual values

From the preceding table, the green area represents where we are correctly predicting the presence / absence of disease in the individual whereas the white areas represent where our prediction was incorrect. These false predictions fall into two categories known as Type I and Type II errors:

Type I error: When we reject the null hypothesis (that is, a person not carrying the disease) when in fact, it is true in actuality
Type II error: Where we predict the presence of the disease when the individual does not carry the disease

Clearly, both errors are not good but often, in practice, some errors are more acceptable than others.

Consider the situation where our model makes significantly more Type II errors than Type I errors; in this case, our model would be predicting more people are carrying the disease than actually are - a conservative approach may be more acceptable than a Type II error where we are failing to identify the presence of the disease. Determining the cost of each type of error is a function of the question being asked and is something the data scientist must consider. We will revisit this topic of errors and some other metrics of model quality after we build our first binary classification model which tries to predict the presence / non-presence of the Higgs-Boson particle.

官术网_书友最值得收藏!

Mastering Machine Learning with Spark 2.x

Type I versus type II error