Title: Hands-On Machine Learning with ML.NET
Author: Jarred Capellman
Chapter word count: 667
Last updated: 2021-06-24 16:43:29

The Trainer class

The Trainer class contains a single method, Train. At a high level, the Train method does the following:

It loads the training data (in this case our CSV) into memory.
It builds a training set and a test set.
It creates the pipeline.
It trains and saves the model.
It performs an evaluation on the model.

This is the structure and flow we will follow throughout the rest of this book. Now, let's dive into the code behind the Train method:

First, we check to make sure that the training data file exists:

if (!File.Exists(trainingFileName)) {
    Console.WriteLine($"Failed to find training data file ({trainingFileName})");

    return;
}

Even though this is a simple test application, it is always good practice to treat it like a production-grade application. In addition, since this is a console application, it is easy to pass in an incorrect path for the training data, which would otherwise cause exceptions further on in the method.

Use the LoadFromTextFile helper method that ML.NET provides to assist with the loading of text files into an IDataView object:

IDataView trainingDataView = MlContext.Data.LoadFromTextFile<RestaurantFeedback>(trainingFileName);

As you can see, we are passing in both the training filename and the type; in this case, it is the RestaurantFeedback class mentioned earlier. It should be noted that this method has several other parameters, including the following:

separatorChar: This is the column separator character; it defaults to \t (in other words, a tab).
hasHeader: If set to true, the dataset's first row has the header; it defaults to false.
allowQuoting: This defines whether the source file can contain columns defined by a quoted string; it defaults to false.
trimWhitespace: This removes trailing whitespace from the rows; it defaults to false.
allowSparse: This defines whether the file can contain numerical vectors in sparse format; it defaults to false. The sparse format requires a new column to have the number of features.

For most projects used throughout this book, we will use the default settings.
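
When a file does not match those defaults, the parameters can be passed by name. The following is a minimal sketch, assuming a comma-separated file with a header row and quoted text; the SampleRow class, its column layout, and the temporary file are hypothetical stand-ins rather than the book's RestaurantFeedback data:

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type for illustration; the book's project uses RestaurantFeedback.
public class SampleRow
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(1)]
    public string Text { get; set; }
}

public static class LoadOptionsSketch
{
    public static IDataView LoadCsv(MLContext mlContext, string path)
    {
        // Override the tab separator, skip the header row, and allow quoted fields.
        return mlContext.Data.LoadFromTextFile<SampleRow>(
            path,
            separatorChar: ',',
            hasHeader: true,
            allowQuoting: true);
    }

    public static void Main()
    {
        // Write a small throwaway CSV so the sketch is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "sample-feedback.csv");
        File.WriteAllLines(path, new[]
        {
            "Label,Text",
            "true,\"Cold food, slow service\"",
            "false,Great meal"
        });

        IDataView data = LoadCsv(new MLContext(), path);

        // The schema is derived from the properties of SampleRow.
        foreach (var column in data.Schema)
        {
            Console.WriteLine($"{column.Name}: {column.Type}");
        }
    }
}
```

Naming each argument keeps the call readable and makes it obvious which defaults are being overridden.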

Given the IDataView object we created previously, use the TrainTestSplit method that ML.NET provides to create a test set from the main training data:

DataOperationsCatalog.TrainTestData dataSplit = MlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.2);

As mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, sample data is split into two sets: training and test. The testFraction parameter specifies the fraction of the dataset to hold back for testing, in our case, 20%. If omitted, this parameter defaults to 0.1, so we set it explicitly to 0.2 here.
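
To make the role of testFraction concrete, the following is a minimal sketch in plain C# of what holding back a fraction of the rows means. It is not ML.NET's implementation (TrainTestSplit operates on an IDataView); the SplitSketch class and its fixed seed are assumptions made for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Shuffles the rows, then holds back roughly testFraction of them for testing.
    public static (List<T> Train, List<T> Test) Split<T>(
        IEnumerable<T> rows, double testFraction, int seed)
    {
        if (testFraction <= 0 || testFraction >= 1)
        {
            throw new ArgumentOutOfRangeException(nameof(testFraction));
        }

        // A fixed seed makes the shuffle repeatable between runs.
        var random = new Random(seed);
        List<T> shuffled = rows.OrderBy(_ => random.Next()).ToList();

        // The first testCount shuffled rows become the test set; the rest are for training.
        int testCount = (int)Math.Round(shuffled.Count * testFraction);
        return (shuffled.Skip(testCount).ToList(), shuffled.Take(testCount).ToList());
    }

    public static void Main()
    {
        var (train, test) = Split(Enumerable.Range(1, 10), testFraction: 0.2, seed: 42);
        Console.WriteLine($"Train rows: {train.Count}, test rows: {test.Count}");
    }
}
```

With ten rows and a fraction of 0.2, eight rows go to training and two are held back, and no row appears in both sets.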

Next, we create the pipeline:

TextFeaturizingEstimator dataProcessPipeline = MlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features",
        inputColumnName: nameof(RestaurantFeedback.Text));

Future examples will have a much more complex pipeline. In this example, we simply featurize the Text property discussed earlier, converting it into a numeric vector stored in the Features output column.

Next, we instantiate our trainer:

SdcaLogisticRegressionBinaryTrainer sdcaRegressionTrainer = MlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
        labelColumnName: nameof(RestaurantFeedback.Label),
        featureColumnName: "Features");

As you might remember from Chapter 1, Getting Started with Machine Learning and ML.NET, the various algorithms found in ML.NET are referred to as trainers. In this project, we are using an SDCA (Stochastic Dual Coordinate Ascent) trainer.

Then, we complete the pipeline by appending the trainer we instantiated previously:

EstimatorChain<BinaryPredictionTransformer<CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> trainingPipeline = dataProcessPipeline.Append(sdcaRegressionTrainer);

Next, we train the model with the dataset we created earlier in the chapter:

ITransformer trainedModel = trainingPipeline.Fit(dataSplit.TrainSet);

We save our newly created model to the filename specified, matching the training set's schema:

MlContext.Model.Save(trainedModel, dataSplit.TrainSet.Schema, ModelPath);
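
The schema is stored with the model so that it can be recovered when the model is loaded later. As a minimal sketch, assuming the file was written by the Save call above, the model can be read back with the Load method; the ModelLoadSketch class and LoadModel helper are hypothetical names used only for illustration:

```csharp
using System.IO;
using Microsoft.ML;

public static class ModelLoadSketch
{
    // Loads a model saved with MLContext.Model.Save and returns the
    // input schema that was stored alongside it.
    public static ITransformer LoadModel(
        MLContext mlContext, string modelPath, out DataViewSchema inputSchema)
    {
        // Fail early with a clear message, mirroring the file check in Train.
        if (!File.Exists(modelPath))
        {
            throw new FileNotFoundException("Failed to find model file", modelPath);
        }

        return mlContext.Model.Load(modelPath, out inputSchema);
    }
}
```

The returned ITransformer can then be used in the same way as the trainedModel object in the Train method.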

Now, we use our newly created model to transform the test set we created earlier:

IDataView testSetTransform = trainedModel.Transform(dataSplit.TestSet);

Finally, we pass the testSetTransform object created previously into the BinaryClassification class's Evaluate method:

CalibratedBinaryClassificationMetrics modelMetrics = MlContext.BinaryClassification.Evaluate(
        data: testSetTransform,
        labelColumnName: nameof(RestaurantFeedback.Label),
        scoreColumnName: nameof(RestaurantPrediction.Score));

Console.WriteLine($"Area Under Curve: {modelMetrics.AreaUnderRocCurve:P2}{Environment.NewLine}" +
        $"Area Under Precision Recall Curve: {modelMetrics.AreaUnderPrecisionRecallCurve:P2}{Environment.NewLine}" +
        $"Accuracy: {modelMetrics.Accuracy:P2}{Environment.NewLine}" +
        $"F1Score: {modelMetrics.F1Score:P2}{Environment.NewLine}" +
        $"Positive Recall: {modelMetrics.PositiveRecall:#.##}{Environment.NewLine}" +
        $"Negative Recall: {modelMetrics.NegativeRecall:#.##}{Environment.NewLine}");

The Evaluate method generates the model metrics. We then print the main metrics, which are computed by running the trained model against the test set. We will dive into these properties specifically in the Evaluating the Model section of this chapter.
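
As a rough guide to reading that output, the sketch below computes several of these metrics from the four counts of a confusion matrix using their standard definitions. The MetricsSketch class and the example counts are hypothetical and not taken from the book's dataset; in the project itself, ML.NET computes these values for you:

```csharp
using System;

public static class MetricsSketch
{
    // Fraction of all predictions that were correct.
    public static double Accuracy(int tp, int tn, int fp, int fn) =>
        (double)(tp + tn) / (tp + tn + fp + fn);

    // Fraction of actual positives that were predicted positive.
    public static double PositiveRecall(int tp, int fn) => (double)tp / (tp + fn);

    // Fraction of actual negatives that were predicted negative.
    public static double NegativeRecall(int tn, int fp) => (double)tn / (tn + fp);

    // Fraction of predicted positives that were actually positive.
    public static double PositivePrecision(int tp, int fp) => (double)tp / (tp + fp);

    // Harmonic mean of positive precision and positive recall.
    public static double F1Score(int tp, int fp, int fn)
    {
        double precision = PositivePrecision(tp, fp);
        double recall = PositiveRecall(tp, fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void Main()
    {
        // Hypothetical counts: 40 true positives, 45 true negatives,
        // 5 false positives, and 10 false negatives.
        int tp = 40, tn = 45, fp = 5, fn = 10;

        Console.WriteLine($"Accuracy: {Accuracy(tp, tn, fp, fn):P2}");
        Console.WriteLine($"F1Score: {F1Score(tp, fp, fn):P2}");
        Console.WriteLine($"Positive Recall: {PositiveRecall(tp, fn):#.##}");
        Console.WriteLine($"Negative Recall: {NegativeRecall(tn, fp):#.##}");
    }
}
```

The two area-under-curve metrics are left out because they depend on the full set of scores rather than on a single confusion matrix.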
