
Assessing a model

Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining some quantitative reliability criteria, setting a strategy such as a K-fold cross-validation scheme and selecting the appropriate labeled data.

Validation

The purpose of this section is to create a reusable Scala class to validate models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.

Key quality metrics

Let's consider a simple classification model with two classes, positive and negative, represented respectively by the black and white colors in the diagram below. Data scientists use the following terminology:

  • True Positives (TPs): These are observations that are correctly labeled as belonging to the positive class (black dots on a dark background)
  • True Negatives (TNs): These are observations that are correctly labeled as belonging to the negative class (white dots on a light background)
  • False Positives (FPs): These are observations that are incorrectly labeled as belonging to the positive class (white dots on a dark background)
  • False Negatives (FNs): These are observations that are incorrectly labeled as belonging to the negative class (black dots on a light background)

    Categorization of validation results

This simplistic representation can be extended to classification problems involving more than two classes. For instance, a false positive for a given class is an observation that is incorrectly assigned to that class when it actually belongs to another one. These four counts are used to evaluate accuracy, precision, recall, and the F and G measures as follows:

  • Accuracy: Represented as ac, this is the percentage of observations correctly classified
  • Precision: Represented as p, this is the percentage of observations correctly classified as positive in the group that the classifier has declared positive
  • Recall: Represented as r, this is the percentage of observations labeled as positive that are correctly classified
  • F1-measure or F1-score: Represented as F1, this measure strikes a balance between precision and recall. It is computed as the harmonic mean of the precision and recall with values ranging between 0 (worst score) and 1 (best score).
  • Fn score: Represented as Fn, this is the generic F-scoring method with an arbitrary degree n.
  • G measure: Represented as G, this is like the F-measure but is computed as the geometric mean of precision p and recall r.

Note

Precision, recall, and F1-score

M3: Accuracy ac, precision p, recall r, F1, Fn, and G scores:

$ac = \frac{TP + TN}{TP + TN + FP + FN}$   $p = \frac{TP}{TP + FP}$   $r = \frac{TP}{TP + FN}$

$F_1 = \frac{2\,p\,r}{p + r}$   $G = \sqrt{p\,r}$

Fn generalizes F1 by weighting precision and recall unequally according to the degree n.
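The following snippet is a minimal sketch of these formulas applied to hypothetical counts of validation outcomes; the counts are made-up values used purely for illustration:

val (tp, tn, fp, fn) = (120.0, 140.0, 30.0, 10.0)   // hypothetical counts

val accuracy = (tp + tn)/(tp + tn + fp + fn)         // fraction of correct predictions
val precision = tp/(tp + fp)                         // correctness of positive predictions
val recall = tp/(tp + fn)                            // coverage of actual positives
val f1 = 2.0*precision*recall/(precision + recall)   // harmonic mean of p and r
val g = math.sqrt(precision*recall)                  // geometric mean of p and r

println(f"ac=$accuracy%.3f p=$precision%.3f r=$recall%.3f F1=$f1%.3f G=$g%.3f")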

The computation of the precision, recall, and F1 score depends on the number of classes used in the classifier. We will consider the following implementations:

  • F-score validation for binomial (two classes) classification (that is, positive and negative outcome)
  • F-score validation for multinomial (more than two classes) classification

F-score for binomial classification

The binomial F validation computes the precision, recall, and F-score for the positive class.

Let's implement the F-score or F-measure as a specialized validation:

trait Validation { def score: Double }

The class BinaryValidation encapsulates the computation of the Fn score as well as precision and recall by counting the occurrences of validation labels TP, TN, FP, and FN. It implements the M3 formula. In the tradition of Scala programming, the class is immutable; it computes the counters for TP, TN, FP, and FN when the class is instantiated. The class takes three parameters:

  • The expected values with value 0 for negative outcome and 1 for positive outcome
  • The set of observations, xt, used for validating the model
  • The predictive function, predict, that classifies observations (line 1):
    class BinaryValidation[T: ToDouble](
     expected: Vector[Int],
     xt: Vector[Array[T]])(predict: Array[T] => Int)
    extends AValidation[T](expected, xt)(predict) { //1
    
      val counters = expected.zip(xt.map(predict(_)))
          .aggregate(new Counter[VLabel])((cnt, ap) =>
             cnt + classify(ap._1, ap._2), _ ++ _
      ) //2
    
      override def score: Double = f1 //3
      lazy val f1 = 2.0*precision*recall/(precision + recall)
      lazy val precision = compute(FP())  //4
      lazy val recall = compute(FN())
    
      def compute(n: VLabel): Double = 
        1.0/(1.0 + counters(n).toDouble/counters(TP()))
    
      private def classify(expected: Int, predicted: Int): VLabel = //5
       if(expected == predicted)
          if(expected == POSITIVE) TP() else TN()
       else if(expected == POSITIVE) FN() else FP()
    }

The constructor counts the number of occurrences of each of the four outcome labels {TP, TN, FP, FN} (line 2). The values precision, recall, and f1 are defined as lazy values so they are computed only once, when they are accessed directly or when the method score is invoked (line 4). The F1 measure is the most commonly used scoring value for validating classifiers. Therefore, it is the default score (line 3). The private method classify extracts the qualifier from the expected and predicted values (line 5).

The class BinaryValidation is independent of the type of classifier, its training, the labeling process, and the type of observations.
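Here is a hypothetical usage sketch of the class. The feature values, expected labels, and the threshold-based predict function are made up for illustration, and an implicit ToDouble[Double] instance from the library is assumed to be in scope:

val expected = Vector(1, 0, 1, 1, 0, 1, 0, 0)   // hypothetical expected labels
val xt: Vector[Array[Double]] = Vector(
  Array(0.9), Array(0.2), Array(0.7), Array(0.4),
  Array(0.1), Array(0.8), Array(0.6), Array(0.3)
)
// Toy decision rule: classify as positive if the single feature exceeds 0.5
val predict = (x: Array[Double]) => if(x(0) > 0.5) 1 else 0

val validator = new BinaryValidation[Double](expected, xt)(predict)
println(s"precision=${validator.precision} recall=${validator.recall} F1=${validator.score}")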

The validation labels of type VLabel are defined as sealed case classes (line 6). Although Scala supports enumerations in a fashion similar to Java, Scala programmers prefer case classes and pattern matching as an alternative to extending the Enumeration type:

sealed trait VLabel   //6
case class TP() extends VLabel
case class TN() extends VLabel
case class FP() extends VLabel
case class FN() extends VLabel

The F-score formula of higher degree, Fn with n > 1, favors precision over recall, as illustrated in the following chart:

Comparative analysis of impact of precision on F1, F2, and F3 score for a given recall

Tip

Multiclass scoring

Our implementation of the binomial validation computes the precision, recall, and F1 score for the positive class only. The generic multinomial validation class presented in the next section computes these quality metrics for both positive and negative classes.

F-score for multinomial classification

The validation metrics are defined by the formulas M3. The idea is quite simple: the precision and recall are computed for all the classes and then averaged to produce a single precision and a single recall value for the entire model. The precision and recall for the entire model leverage the counts of TP, FP, FN, and TN introduced in the previous section.

There are two commonly used sets of formulas to compute the precision and recall for a model:

  • Macro: This method computes the precision and recall for each class, then averages them over the number of classes.
  • Micro: This method sums the numerators and denominators of the precision and recall formulas across all the classes before computing the precision and recall.

We will use the macro formulas from now on.
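The difference between the two averaging schemes can be illustrated with a small, self-contained sketch over a hypothetical 3-class confusion matrix (rows are predicted classes, columns are expected classes, following the convention adopted below); the matrix values are made up:

// Hypothetical 3-class confusion matrix: rows = predicted, columns = expected
val cm = Array(
  Array(8, 2, 0),
  Array(1, 7, 3),
  Array(1, 1, 7)
)
val n = cm.length

// Macro-precision: average of the per-class precision values
val macroPrecision = (0 until n).map(i => {
  val tp = cm(i)(i).toDouble
  val fp = cm(i).sum - tp     // observations of other classes predicted as class i
  tp/(tp + fp)
}).sum/n

// Micro-precision: sum numerators and denominators across classes first
val microPrecision = {
  val tp = (0 until n).map(i => cm(i)(i)).sum.toDouble
  val total = cm.map(_.sum).sum.toDouble   // every prediction counted once
  tp/total
}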

Note

Macro formulas for multinomial precision and recall

M4: Macro version of the precision p and recall r for a model of c classes:

$p = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FP_i}$   $r = \frac{1}{c}\sum_{i=1}^{c}\frac{TP_i}{TP_i + FN_i}$

The computation of the precision and recall factors for a classifier with more than two classes requires the extraction and manipulation of the confusion matrix. We use the following convention: expected values are defined as columns and predicted values are defined as rows:

Confusion matrix for six-class classification

The multinomial validation class, MulticlassValidation, takes four parameters:

  • The vector of expected class indices, expected, with values ranging from 0 to nClasses - 1
  • The set of observations, xt, used for validating the model
  • The number of classes in the model, nClasses
  • The predictive function, predict, that classifies observations (line 7):
    class MulticlassValidation[T: ToDouble](
     expected: Vector[Int],
     xt: Vector[Array[T]],
     nClasses: Int)(predict: Array[T] => Int) 
    extends Validation { //7
    
      val labeled = xt.zip(expected)
    
      val confusionMatrix: Matrix[Int] = //8
        labeled./:(Matrix[Int](nClasses)){
          case (m, (x, n)) => m + (predict(x), n, 1) //9
      }
    
      lazy val (precision, recall): DblPair = //10
        (0 until nClasses)./:((0.0, 0.0))((s, n) => {
          val tp = confusionMatrix(n, n)   //11
          val fn = confusionMatrix.col(n).sum - tp  //12
          val fp = confusionMatrix.row(n).sum - tp  //13
          (s._1 + tp.toDouble/(tp + fp)/nClasses, 
                s._2 + tp.toDouble/(tp + fn)/nClasses)
        })
    
       def score: Double = 
             2.0*precision*recall/(precision + recall)
    }

The core element of the multiclass validation is the confusion matrix confusionMatrix (line 8). Its element at indices (i, j) = (index of the predicted class for an observation, index of the expected class for the same observation) is incremented for each labeled observation using its expected class and its predicted outcome (line 9).

As stated in the introduction of this section, we use the macro definition of the precision and recall (line 10). The count of true positives, tp, for each class corresponds to the diagonal element of the confusion matrix (line 11). The count of false negatives, fn, for a class is computed as the sum of the counts over all the predicted classes (the column values) for the given expected class, minus the true positive count (line 12). The count of false positives, fp, for a class is computed as the sum of the counts over all the expected classes (the row values) for the given predicted class, minus the true positive count (line 13).

The formula for the computation of the F1 score is the same as the formula used in the binomial validation.
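A hypothetical invocation for a three-class problem follows; the observations, expected classes, and the rounding-based predict function are made up for illustration, and an implicit ToDouble[Double] instance from the library is assumed to be in scope:

val expected = Vector(0, 1, 2, 1, 0, 2, 2, 1)    // hypothetical expected classes
val xt: Vector[Array[Double]] = Vector(
  Array(0.1), Array(1.2), Array(2.3), Array(0.9),
  Array(0.2), Array(1.9), Array(2.1), Array(1.1)
)
// Toy decision rule: round the single feature to the nearest class index
val predict = (x: Array[Double]) => math.min(2, math.round(x(0)).toInt)

val validation = new MulticlassValidation[Double](expected, xt, 3)(predict)
println(s"precision=${validation.precision} recall=${validation.recall} F1=${validation.score}")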

Area under the curves

A more sophisticated measurement of the quality of a model is known as the area under the curves. There are two commonly used measures:

  • Area under the precision-recall curve (AuPRC)
  • Area under the receiver operating characteristics (AuROC)

Area under PRC

The previous section introduces the concepts of precision and recall and their application to the computation of the F1 score. There is a more appealing application of precision and recall that provides scientists with a more accurate evaluation and a convenient visualization of the performance of a binary classifier.

Let's consider two classes C0 and C1. The classification of a new observation is accomplished by setting a threshold on the predicted value (for instance, the probability that the observation belongs to class C1). The higher the threshold, the more selective the algorithm is in assigning a new observation to class C1.

The precision-recall curve (PRC) is generated by plotting the pair of (precision, recall) values on an xy graph and varying the threshold between 0 and 1.0:

Visualization of the performance of a binary classifier using precision-recall curve

Along the curve, the precision increases as the recall decreases, toward the upper-left corner of the graph, where the classifier is most selective. The overall performance of the classifier improves as the precision remains high for increasing values of recall. A pure random process, such as flipping a coin, produces similar values for precision and recall (the random line). An unusual case in which the recall is very high and the precision is close to zero is most often associated with an error in labeling the observations prior to training.
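The following sketch shows how such a curve can be generated for a binary classifier; prCurve is an illustrative helper, not part of the library, and the scores are assumed to be predicted probabilities of belonging to class C1:

// Sketch: (recall, precision) points generated by sweeping the decision threshold
def prCurve(scores: Vector[Double], labels: Vector[Int]): Vector[(Double, Double)] =
  (0 to 100).toVector.map(_*0.01).map { threshold =>
    val predicted = scores.map(s => if(s >= threshold) 1 else 0)
    val tp = predicted.zip(labels).count { case (p, e) => p == 1 && e == 1 }
    val fp = predicted.zip(labels).count { case (p, e) => p == 1 && e == 0 }
    val fn = predicted.zip(labels).count { case (p, e) => p == 0 && e == 1 }
    val precision = if(tp + fp == 0) 1.0 else tp.toDouble/(tp + fp)
    val recall = if(tp + fn == 0) 0.0 else tp.toDouble/(tp + fn)
    (recall, precision)
  }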

Note

M5: AuPRC, the area under the precision-recall curve, where p(r) denotes the precision as a function of the recall:

$AuPRC = \int_0^1 p(r)\,dr$

Let's look at a simple implementation of the AuPRC, after a list of validation instances, binValidations, has been generated through classification using a variable threshold:

def auPRC[T: ToDouble](
 binValidations: List[BinaryValidation[T]]
): Double = binValidations./:(0.5)(
  (s, valid) => s + valid.precision - valid.recall
)/binValidations.size

The performance of the classifier is quantified by computing the area under the PR curve (integral value). A value of 1.0 signifies a perfect classifier and a value of 0.5 represents a pure random process.
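A more conventional estimate of the area integrates the precision over the recall with the trapezoidal rule. The helper below, auPRCTrapezoid, is an illustrative sketch rather than the library's implementation; it expects the (recall, precision) points produced by a sweep such as prCurve above:

// Sketch: trapezoidal approximation of the area under a precision-recall curve
def auPRCTrapezoid(points: Vector[(Double, Double)]): Double =
  points.sortBy(_._1).sliding(2).collect {
    case Vector((r0, p0), (r1, p1)) => (r1 - r0)*(p0 + p1)/2.0
  }.sum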

Area under ROC

Receiver Operating Characteristics (ROC) is a very convenient tool for visualizing the performance of a binary classifier [2:7]. The curve or plot is generated by computing the true positive rate (TPR) and the false positive rate (FPR) for different classification thresholds on a validation set.

Note

M6: True positive and false positive rates:

$TPR = \frac{TP}{TP + FN}$   $FPR = \frac{FP}{FP + TN}$

The methodology that generates the ROC is identical to the process of creating the precision-recall curve: the pairs of (FPR, TPR) values are plotted on an xy graph while the threshold varies between 0 and 1.0:

Visualization of the performance of a binary classifier using ROC

The performance of the binary classifier is measured by computing the area or integral under the ROC in a similar fashion as the computation of the AuPRC.
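The generation of the (FPR, TPR) points mirrors the precision-recall sweep; rocCurve is again an illustrative helper with made-up assumptions about the scores, and the area under the resulting curve can be estimated with the same trapezoidal rule used for the PRC:

// Sketch: (FPR, TPR) points generated by sweeping the decision threshold
def rocCurve(scores: Vector[Double], labels: Vector[Int]): Vector[(Double, Double)] =
  (0 to 100).toVector.map(_*0.01).map { threshold =>
    val predicted = scores.map(s => if(s >= threshold) 1 else 0)
    val tp = predicted.zip(labels).count { case (p, e) => p == 1 && e == 1 }
    val fp = predicted.zip(labels).count { case (p, e) => p == 1 && e == 0 }
    val fn = predicted.zip(labels).count { case (p, e) => p == 0 && e == 1 }
    val tn = predicted.zip(labels).count { case (p, e) => p == 0 && e == 0 }
    val tpr = if(tp + fn == 0) 0.0 else tp.toDouble/(tp + fn)
    val fpr = if(fp + tn == 0) 0.0 else fp.toDouble/(fp + tn)
    (fpr, tpr)
  }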

Cross-validation

It is quite common that the labeled dataset (observations + expected outcome) available to the scientists is not very large. The solution is to break the original labeled dataset into K groups of data.

One-fold cross-validation

One-fold cross-validation is the simplest scheme for extracting a training set and a validation set from a labeled dataset as described in the following diagram:

Illustration of the generation of a one-fold validation set

The one-fold cross-validation methodology consists of the following three steps:

  1. Select the ratio of the size of the training set over the size of the validation set.
  2. Randomly select the labeled observations for the validation phase.
  3. Create the training set as the remaining labeled observations.

The one-fold cross-validation is implemented by the class OneFoldValidation. It takes the following three arguments: the vector of observations, xt, the vector of expected classes, expected, and the fraction, ratio, of the labeled dataset allocated to the training set (line 14):

type LabeledData[T] = Vector[(Array[T], Int)]

class OneFoldValidation[T: ToDouble](
 xt: Vector[Array[T]],
 expected: Vector[Int], 
 ratio: Double) {  //14
 
     val (trainSet, validSet): (LabeledData[T], LabeledData[T]) = dataSet //15
}

The constructor of the class OneFoldValidation generates the segregated training and validation set from the set of observations, xt, and expected classes (line 15):

lazy val dataSet: (LabeledData[T], LabeledData[T]) = { 
  val labeledData = xt.zip(expected) //16
  val trainingSize = (ratio*expected.size).floor.toInt //17

  val valSz = labeledData.size - trainingSize
  val adjValSz = if(valSz < 2) 1 
  else if(valSz >= labeledData.size) labeledData.size - 1 
  else valSz  //18

  val ordLabeledData = labeledData
     .map( (_, nextDouble) )  //19
     .sortWith( _._2 < _._2 ).unzip._1 //20

  (ordLabeledData.dropRight(adjValSz), 
      ordLabeledData.takeRight(adjValSz)) //21
}

The initialization of the class OneFoldValidation creates a vector of labeled observations, labeledData, by zipping the observations and the expected outcome (line 16). The training ratio is used to compute the respective size of the training set (line 17) and validation set, adjusted for small samples (line 18).

In order to randomly create training and validation sets, we zip the labeled dataset with a random generator (line 19), then reorder the labeled dataset by sorting the random values (line 20). Finally, the method returns the pair of training set and validation set (line 21).
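A hypothetical invocation of the class is shown below; the synthetic observations, labels, and the 0.8 training fraction are made up for illustration, and an implicit ToDouble[Double] instance from the library is assumed to be in scope:

// 100 synthetic single-feature observations with alternating labels
val xt: Vector[Array[Double]] = Vector.tabulate(100)(i => Array(i.toDouble))
val expected: Vector[Int] = Vector.tabulate(100)(_ % 2)

val oneFold = new OneFoldValidation[Double](xt, expected, 0.8)  // 80% training
val (trainSet, validSet) = (oneFold.trainSet, oneFold.validSet)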

K-fold cross-validation

The data scientist creates K training-validation datasets by selecting one of the groups as the validation set and then combining all remaining groups into a training set, as illustrated in the next diagram. The process is known as K-fold cross-validation [2:8]:

Illustration of the generation of a K-fold cross-validation set

In the diagram, the third segment, S3, is used as the validation set while the remaining segments are combined into a single training set. This process is repeated for each segment of the original labeled dataset.
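A minimal sketch of the partitioning step follows; kFolds is an illustrative helper, not part of the library, and each of the K groups serves once as the validation set while the remaining groups form the training set:

// Sketch: split a labeled dataset into K training-validation pairs
def kFolds[T](labeled: Vector[(Array[T], Int)], k: Int)
    : Seq[(Vector[(Array[T], Int)], Vector[(Array[T], Int)])] = {
  val shuffled = scala.util.Random.shuffle(labeled)
  // Assign observations to the K groups in a round-robin fashion
  val folds = (0 until k).map(i =>
    shuffled.zipWithIndex.collect { case (obs, n) if n % k == i => obs }
  )
  folds.indices.map { i =>
    val training = folds.indices.filter(_ != i).flatMap(folds).toVector
    (training, folds(i))   // (training set, validation set) for fold i
  }
}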

Bias-variance decomposition

The challenge is to create a model that fits both the training set and subsequent observations to be classified during the validation phase.

If the model tightly fits the observations selected for training, there is a high probability that new observations may not be correctly classified. This is usually the case when the model is complex. This model is characterized as having a low bias with a high variance. Such a scenario can be attributed to the fact that the scientist is overly confident that the observations s/he selected for training are representative of the real world.

Conversely, if the selected model fits the training set loosely, new observations may also be misclassified because the model captures too little of the underlying structure. In this case, the model is characterized as having a high bias with a low variance.

The bias, variance, and mean square error (MSE) of an estimator are defined by the following formulas:

Note

M7: Variance and bias of an estimator $\hat{\theta}$ of a true model θ:

$var(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]$   $bias(\hat{\theta}) = E[\hat{\theta}] - \theta$

M8: Mean square error:

$MSE = E\big[(\hat{\theta} - \theta)^2\big] = var(\hat{\theta}) + bias(\hat{\theta})^2$
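These formulas can be estimated empirically from a set of estimates of a known true value; the helper biasVarianceMSE below is an illustrative sketch:

// Sketch: empirical bias, variance, and MSE of estimates of a true value theta
def biasVarianceMSE(estimates: Seq[Double], theta: Double): (Double, Double, Double) = {
  val mean = estimates.sum/estimates.size
  val bias = mean - theta
  val variance = estimates.map(e => (e - mean)*(e - mean)).sum/estimates.size
  (bias, variance, variance + bias*bias)   // MSE = variance + bias^2
}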

Let's illustrate the concepts of bias, variance, and mean square error with an example. At this stage, you have not been introduced to most of the machine learning techniques. Therefore, we create a simulator to illustrate the relationship between the bias and variance of a classifier. The components of the simulation are as follows:

  • A training set, training
  • A simulated target model, target, of type Double => Double, from which the training set is generated
  • A set of possible models to evaluate

A model that exactly matches the training data overfits the target model, while models that merely approximate the target model will most likely underfit. The models in this example are defined as single-variable functions.

Note

Empirical estimation of overfitting

Overfitted models are specific to the training set. Their predictions over a validation set will have a low bias and a high variance.

These models are evaluated against a validation dataset. The class BiasVariance takes the target model, target, and the size of the validation set, nValues, as parameters (line 22). It merely implements the formulas to compute the bias and variance for each of the models:

type DblF = Double => Double 
class BiasVariance[T: ToDouble](target: DblF, nValues: Int){ //22
  def fit(models: List[DblF]): List[DblPair] = { //23
     models.map(accumulate(_, models.size)) //24
  }
}

The fit method computes the variance and bias for each of the models compared to the target model (line 23). The computation is delegated to the method accumulate (line 24):

def accumulate(f: DblF, numModels: Int): DblPair = 
  (0 until nValues)./:((0.0, 0.0)){ case ((s, t), x) => { 
    val diff = (f(x) - target(x))/numModels
    (s + diff*diff, t + abs(f(x) - target(x))) //25
  }}

The training data is generated by the following single-variable function, with noise components r1 and r2:

$training(x) = 0.2\,x\,\big(1 + \sin(0.1\,x + r_1)\big) + r_2$

The method accumulate returns a tuple (variance, bias) for each of the models f (line 25). The model candidates are defined by the following family of single-variable functions for the values n = 1, 2, 4:

$f_n(x) = 0.2\,x\,\big(1 + \sin(0.1\,x)/n\big)$

The target model (line 26) and the models (line 27) belong to the same family of single variable functions:

val template = (x: Double, n: Int) =>
     0.2*x*(1.0 + sin(x*0.1)/n)
val training = (x: Double) => {
   val r1 = 0.45*(nextDouble - 0.5)
   val r2 = 38.0*(nextDouble - 0.5) + sin(x*0.3)
   0.2*x*(1.0 + sin(x*0.1 + r1)) + r2
}

val target = (x: Double) => template(x, 1) //26

val models = List[(DblF, String)](  //27
   ((x: Double) => template(x, 4), "Underfit1"),
   ((x: Double) => template(x, 2), "Underfit2"),
   ((x: Double) => training(x), "Overfit"),
   (target, "Target")
)
val evaluator = new BiasVariance[Double](target, 200)
evaluator.fit(models.map( _._1)) match { /* … */ }

The JFreeChart library is used to display the training dataset and the models:

Fitting models to a dataset

The model that replicates the training data overfits. The models for which the sine component has a lower amplitude underfit.

The variance-bias trade-off for the different models and the training data is illustrated with the following scatter chart:

Scatter plot for the bias-variance trade-off for four models, one duplicating the training set.

The variance of each of the smoothing or approximating models is lower than the variance of the training set. As expected, the target model, 0.2*x*(1 + sin(x*0.1)), has no bias and no variance. The training set has a very high variance because it overfits any target model. The last chart compares the mean square error between each of the models, the training set, and the target model:

Comparative mean square error for four models

Note

Evaluating bias and variance

The section uses a fictitious target model and a training set to illustrate the concept of bias and variance of models. The bias and variance of machine learning models are estimated using validation data.

Overfitting

You can apply the methodology presented in the example to any classification or regression model. The list of models with low variance includes constant functions and models independent of the training set. High-degree polynomials, complex functions, and deep neural networks have high variance. Linear regression applied to linear data has a low bias, while linear regression applied to non-linear data has a higher bias [2:9].

Overfitting affects all aspects of the modeling process negatively, for example:

  • It renders debugging difficult
  • It makes the model too dependent on minor fluctuations (long tail) and noisy data
  • It may discover irrelevant relationships between observed and latent features
  • It leads to poor predictive performance

However, there are well-proven solutions to reduce overfitting [2:10]:

  • Increasing the size of the training set whenever possible
  • Reducing noise in labeled observations using smoothing and filtering techniques
  • Decreasing the number of features using techniques such as principal components analysis, as described in the Principal components analysis section of Chapter 5, Dimension Reduction
  • Modeling observable and latent noisy data using Kalman or auto-regressive models as described in Chapter 3, Data Pre-processing
  • Reducing inductive bias in training set by applying cross-validation
  • Penalizing extreme values for some of the model's features using regularization techniques, as described in the Regularization section of Chapter 9, Regression and Regularization