- Mastering Machine Learning with Spark 2.x
- Alex Tellez, Max Pumperla, Michal Malohlava
Type I versus Type II error
Binary classifiers have an intuitive interpretation, since they try to separate data points into two groups. This sounds simple, but we need some notion of how to measure the quality of this separation. Furthermore, an important characteristic of many binary classification problems is that the proportion of one label versus the other can be highly disproportionate. That is, the dataset may be imbalanced with respect to one label, which necessitates careful interpretation by the data scientist.
Suppose, for example, we are trying to detect the presence of a particular rare disease in a population of 15 million people, and we discover that - using a large subset of 10 million people from that population - only 10,000 of the 10 million individuals actually carry the disease. Because only 0.1% of the subset carried the disease, the most naive algorithm would simply guess "no presence of disease" for everyone, including the remaining five million people. Suppose that, of those remaining five million people, the same proportion, 0.1%, carried the disease; then those 5,000 people would never be correctly diagnosed, because the naive algorithm guesses that no one carries the disease. Is this acceptable? In this situation, the cost of the errors made by a binary classifier is an important factor to consider, and it is relative to the question being asked.
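The arithmetic above can be checked directly. This is a minimal sketch that reproduces the figures from the example: a 10-million-person subset with a 0.1% disease rate, and a naive "always healthy" rule applied to the remaining five million people. It shows why raw accuracy is misleading on imbalanced data.

```python
# Figures taken from the example in the text.
subset_size = 10_000_000
disease_rate = 0.001                                    # 0.1% carry the disease
carriers_in_subset = int(subset_size * disease_rate)    # 10,000 carriers

remaining = 5_000_000
missed_carriers = int(remaining * disease_rate)         # 5,000 undiagnosed people

# Accuracy of always guessing "no disease" on the remaining population:
# very high, despite the rule never detecting a single carrier.
accuracy = (remaining - missed_carriers) / remaining
print(carriers_in_subset, missed_carriers, accuracy)    # 10000 5000 0.999
```

A 99.9% accurate classifier that misses every single carrier is exactly the failure mode this section is warning about.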
Given that we are only dealing with two outcomes for this type of problem, we can create a 2-D representation of the different types of errors that are possible. Keeping our preceding example of the people carrying / not carrying the disease, we can think about the outcome of our classification rule as follows:
|                        | Predicted: disease           | Predicted: no disease        |
|------------------------|------------------------------|------------------------------|
| Actually has disease   | True positive (correct)      | False negative (Type II)     |
| Actually disease-free  | False positive (Type I)      | True negative (correct)      |
In the preceding table, the diagonal cells represent where we correctly predict the presence / absence of the disease in an individual, whereas the off-diagonal cells represent where our prediction was incorrect. These false predictions fall into two categories, known as Type I and Type II errors:
- Type I error (false positive): We reject the null hypothesis (that is, that the person does not carry the disease) when it is in fact true; in other words, we predict the disease in an individual who does not carry it
- Type II error (false negative): We fail to reject the null hypothesis when it is in fact false; in other words, we predict no disease in an individual who actually carries it
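The four cells of the 2-D table can be tallied directly from a vector of true labels and a vector of predictions. This is a minimal pure-Python sketch (the function name and the small label vectors are illustrative, not from the book), using the convention that label 1 means the individual carries the disease:

```python
def confusion_counts(labels, preds):
    """Tally the four cells of the 2x2 table for binary labels (1 = disease)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)  # correct detection
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)  # correct rejection
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)  # Type I error
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)  # Type II error
    return tp, tn, fp, fn

labels = [1, 0, 0, 1, 0, 1, 0, 0]
preds  = [1, 0, 1, 0, 0, 1, 0, 0]
print(confusion_counts(labels, preds))  # (2, 4, 1, 1)
```

Spark's own evaluation APIs compute the same quantities at scale; counting them by hand once makes the later metrics (precision, recall, and so on) much easier to interpret.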
Clearly, neither error is good, but in practice, some errors are more acceptable than others.
Consider the situation where our model makes significantly more Type I errors than Type II errors; in this case, our model would be predicting that more people are carrying the disease than actually are. This conservative approach may be more acceptable than a Type II error, where we fail to identify the presence of the disease altogether. Determining the cost of each type of error is a function of the question being asked, and is something the data scientist must consider. We will revisit this topic of errors, along with some other metrics of model quality, after we build our first binary classification model, which tries to predict the presence / non-presence of the Higgs boson particle.
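One way to make "the cost of each type of error" concrete is to attach an explicit weight to each error type and rank models by total cost rather than raw error count. The costs below are hypothetical, chosen only to illustrate the idea that a missed diagnosis (Type II) might be judged far more expensive than a false alarm (Type I):

```python
def total_cost(fp, fn, cost_fp=1.0, cost_fn=10.0):
    """Weighted error cost: hypothetical 10x penalty for a missed diagnosis."""
    return fp * cost_fp + fn * cost_fn

# Model A: many false alarms, few missed diagnoses. Model B: the reverse.
# Both make 105 errors in total, yet their costs differ sharply.
model_a = total_cost(fp=100, fn=5)    # 100*1.0 + 5*10.0  = 150.0
model_b = total_cost(fp=5, fn=100)    # 5*1.0 + 100*10.0  = 1005.0
print(model_a, model_b)               # 150.0 1005.0
```

Under this (assumed) cost structure, the "conservative" model A is strongly preferred even though it raises twenty times as many false alarms, which is precisely the trade-off described above.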