- Machine Learning for OpenCV
- Michael Beyeler
Scoring classifiers using accuracy, precision, and recall
In a binary classification task, where there are only two different class labels, there are a number of different ways to measure classification performance. Some common metrics are as follows:
- accuracy_score: Accuracy counts the number of data points in the test set that have been predicted correctly, and returns that number as a fraction of the test set size. Sticking to the example of classifying pictures as cats or dogs, accuracy indicates the fraction of pictures that have been correctly classified as containing either a cat or a dog. This is the most basic scoring function for classifiers.
- precision_score: Precision describes the ability of a classifier not to label as cat a picture that contains a dog. In other words, out of all the pictures in the test set that the classifier thinks contain a cat, precision is the fraction of pictures that actually contain a cat.
- recall_score: Recall (or sensitivity) describes the ability of a classifier to retrieve all the pictures that contain a cat. In other words, out of all the pictures of cats in the test set, recall is the fraction of pictures that have been correctly identified as pictures of cats (see the short sketch after this list).
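As a quick preview, all three scores are available as functions in scikit-learn's metrics module. Here is a minimal sketch; the label arrays are made up purely for illustration, and the rest of this section walks through the same functions step by step:

# Minimal sketch: the three scorers from scikit-learn's metrics module.
# The label arrays below are hypothetical, chosen only to illustrate the calls.
from sklearn import metrics

y_true = [0, 1, 1, 0, 1]    # hypothetical ground-truth labels (1 = cat, 0 = dog)
y_pred = [0, 1, 0, 0, 0]    # hypothetical predictions from some classifier

print(metrics.accuracy_score(y_true, y_pred))     # 3 of 5 labels correct -> 0.6
print(metrics.precision_score(y_true, y_pred))    # 1 of 1 predicted cats is a cat -> 1.0
print(metrics.recall_score(y_true, y_pred))       # 1 of 3 actual cats retrieved -> 0.333...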
Let's say we have some ground truth class labels that are either zeros or ones. We can generate them at random using NumPy's random number generator. Obviously, this means that whenever we rerun our code, new data points will be generated at random. However, for the purpose of this book, this is not very helpful, as I want you to be able to run the code and always get the same result as me. A nice trick to achieve this is to fix the seed of the random number generator. This will make sure the generator is initialized the same way every time you run the script.
We can fix the seed of the random number generator using the following code:
In [1]: import numpy as np
In [2]: np.random.seed(42)
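To see what fixing the seed buys us, here is a small illustrative check: resetting the generator to the same seed and drawing again yields exactly the same numbers:

# Illustrative check: the same seed always produces the same draws.
import numpy as np

np.random.seed(42)
first_draw = np.random.randint(0, 2, size=5)

np.random.seed(42)                               # reset the generator to the same state
second_draw = np.random.randint(0, 2, size=5)

print(np.array_equal(first_draw, second_draw))   # True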
Then we can generate five random labels that are either zeros or ones by picking random integers from the half-open range [0, 2) (the upper bound is exclusive, so only 0 and 1 can occur):
In [3]: y_true = np.random.randint(0, 2, size=5)
... y_true
Out[3]: array([0, 1, 0, 0, 0])
Let's assume we have a classifier that tries to predict the class labels mentioned earlier. For the sake of argument, let's say the classifier is not very smart, and always predicts label 1. We can mock this behavior by hard-coding the prediction labels:
In [4]: y_pred = np.ones(5, dtype=np.int32)
... y_pred
Out[4]: array([1, 1, 1, 1, 1])
What is the accuracy of our prediction?
As mentioned earlier, accuracy counts the number of data points in the test set that have been predicted correctly, and returns that number as a fraction of the test set size. We correctly predicted only the second data point (where the true label is 1). In all other cases, the true label was a 0, yet we predicted 1. Hence, our accuracy should be 1/5 or 0.2.
A naive implementation of an accuracy metric might sum up all occurrences where the predicted class label matched the true class label:
In [5]: np.sum(y_true == y_pred) / len(y_true)
Out[5]: 0.20000000000000001
Close enough, Python. (The trailing digits are just floating-point round-off: 0.2 cannot be represented exactly as a binary floating-point number.)
A smarter, and more convenient, implementation is provided by scikit-learn's metrics module:
In [6]: from sklearn import metrics
In [7]: metrics.accuracy_score(y_true, y_pred)
Out[7]: 0.20000000000000001
That wasn't too hard, was it? However, in order to understand precision and recall, we need a general understanding of type I and type II errors. Let's recall that data points with class label 1 are often called positives, and data points with class label 0 (or -1) are often called negatives. Then classifying a specific data point can have one of four possible outcomes, as illustrated with the following confusion matrix:
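|                    | Predicted positive | Predicted negative |
|--------------------|--------------------|--------------------|
| **Truly positive** | true positive      | false negative     |
| **Truly negative** | false positive     | true negative      |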
Let's break this down. If a data point was truly a positive, and we predicted a positive, we got it all right! In this case, the outcome is called a true positive. If we thought the data point was a positive, but it was really a negative, we falsely predicted a positive (hence the term, false positive). Analogously, if we thought the data point was a negative, but it was really a positive, we falsely predicted a negative (false negative). Finally, if we predicted a negative and the data point was truly a negative, we found a true negative.
Let's quickly calculate these four metrics on our mock-up data. We have a true positive, where the true label is a 1 and we also predicted a 1:
In [8]: truly_a_positive = (y_true == 1)
In [9]: predicted_a_positive = (y_pred == 1)
In [10]: true_positive = np.sum(predicted_a_positive * truly_a_positive)
... true_positive
Out[10]: 1
Similarly, a false positive is where we predicted a 1 but the ground truth was really a 0:
In [11]: false_positive = np.sum((y_pred == 1) * (y_true == 0))
... false_positive
Out[11]: 4
I'm sure by now you've got the hang of it. But do we even have to do math in order to know about predicted negatives? Our not-so-smart classifier never predicted 0, so (y_pred == 0) should never be true:
In [12]: false_negative = np.sum((y_pred == 0) * (y_true == 1))
... false_negative
Out[12]: 0
In [13]: true_negative = np.sum((y_pred == 0) * (y_true == 0))
... true_negative
Out[13]: 0
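If you would rather not count these four quantities by hand, scikit-learn can do it in one call. The following sketch uses metrics.confusion_matrix on the y_true and y_pred defined above; for binary labels 0 and 1, the rows are the true classes and the columns are the predicted classes, so the result has the layout [[TN, FP], [FN, TP]]:

# Sketch: let scikit-learn count the four outcomes for us.
# Rows are true classes (0, then 1), columns are predicted classes,
# so the layout for binary labels is [[TN, FP], [FN, TP]].
from sklearn import metrics

conf_mat = metrics.confusion_matrix(y_true, y_pred)
print(conf_mat)    # [[0 4]
                   #  [0 1]]  -> 0 TN, 4 FP, 0 FN, 1 TP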
To make sure we did everything right, let's calculate accuracy one more time. Accuracy should be the number of true positives plus the number of true negatives (that is, everything we got right) divided by the total number of data points:
In [14]: accuracy = (true_positive + true_negative) / len(y_true)
... accuracy
Out[14]: 0.20000000000000001
Success! Precision is then given as the number of true positives divided by the number of all positive predictions, that is, true positives plus false positives:
In [15]: precision = true_positive / (true_positive + false_positive)
... precision
Out[15]: 0.20000000000000001
Turns out that precision isn't better than accuracy in our case. Let's check our math with scikit-learn:
In [16]: metrics.precision_score(y_true, y_pred)
Out[16]: 0.20000000000000001
Finally, recall is given as the fraction of all positives that we correctly classified as positives:
In [17]: recall = true_positive / (true_positive + false_negative)
... recall
Out[17]: 1.0
In [18]: metrics.recall_score(y_true, y_pred)
Out[18]: 1.0
Perfect recall! But, going back to our mock-up data, it should be clear that this excellent recall score was mere luck. Since there was only a single 1 in our mock-up dataset, and we happened to correctly classify it, we got a perfect recall score. Does that mean our classifier is perfect? Not really! But we have found three useful metrics that seem to measure complementary aspects of our classification performance.
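To wrap up, here is a compact sketch that reproduces all three scores for our mock-up data in one go; it simply restates the results computed above:

# Recap: accuracy, precision, and recall for the mock-up data in this section.
import numpy as np
from sklearn import metrics

np.random.seed(42)
y_true = np.random.randint(0, 2, size=5)    # array([0, 1, 0, 0, 0])
y_pred = np.ones(5, dtype=np.int32)         # the classifier that always predicts 1

print(metrics.accuracy_score(y_true, y_pred))     # 0.2 -- 1 of 5 labels correct
print(metrics.precision_score(y_true, y_pred))    # 0.2 -- TP / (TP + FP) = 1 / 5
print(metrics.recall_score(y_true, y_pred))       # 1.0 -- TP / (TP + FN) = 1 / 1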