
The Trainer class

The following is the sole method in the Trainer class. At a high level, the Train method does the following:

  • It loads the training data (in this case our CSV) into memory.
  • It builds a training set and a test set.
  • It creates the pipeline.
  • It trains and saves the model.
  • It performs an evaluation on the model.

This is the structure and flow we will follow throughout the rest of this book. Now, let's dive into the code behind the Train method:

  1. First, we check to make sure that the training data filename exists:
if (!File.Exists(trainingFileName)) {
    Console.WriteLine($"Failed to find training data file ({trainingFileName})");

    return;
}

Even though this is a simple test application, it is good practice to treat it like a production-grade application. In addition, because this is a console application, it is easy to pass in an incorrect path for the training data, which would otherwise cause exceptions later in the method.

  2. Next, we use the LoadFromTextFile helper method that ML.NET provides to load the text file into an IDataView object:
IDataView trainingDataView = MlContext.Data.LoadFromTextFile<RestaurantFeedback>(trainingFileName);

As you can see, we are passing in both the training filename and the type; in this case, it is the RestaurantFeedback class mentioned earlier. This method has several other parameters, including the following:

  • separatorChar: This is the column separator character; it defaults to \t (in other words, a tab).
  • hasHeader: If set to true, the dataset's first row has the header; it defaults to false.
  • allowQuoting: This defines whether the source file can contain columns defined by a quoted string; it defaults to false.
  • trimWhitespace: This removes trailing whitespace from the rows; it defaults to false.
  • allowSparse: This defines whether the file can contain numerical vectors in sparse format; it defaults to false. The sparse format requires a new column to have the number of features.

For most projects used throughout this book, we will use the default settings.
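When a file does not match these defaults, the parameters can be passed by name. The following is a minimal sketch, assuming a hypothetical comma-separated file with a header row and quoted text fields; the RestaurantFeedback shape shown here (a Label column followed by a Text column) and the LoaderSketch helper are assumptions made for illustration:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Assumed shape of the input class: column 0 is the label, column 1 is the text.
public class RestaurantFeedback
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(1)]
    public string Text { get; set; }
}

// Hypothetical helper, not part of the chapter's project.
public static class LoaderSketch
{
    public static IDataView LoadCommaSeparated(MLContext mlContext, string path)
    {
        return mlContext.Data.LoadFromTextFile<RestaurantFeedback>(
            path,
            separatorChar: ',',    // columns are split on commas instead of tabs
            hasHeader: true,       // the first row holds column names, not data
            allowQuoting: true,    // a quoted field may itself contain commas
            trimWhitespace: true); // drop trailing whitespace from each line
    }
}
```

With allowQuoting enabled, a row such as 1,"Great food, great service" should load as two columns rather than three.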

  3. Given the IDataView object we created previously, use the TrainTestSplit method that ML.NET provides to create a test set from the main training data:
DataOperationsCatalog.TrainTestData dataSplit = MlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.2);

As mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, sample data is split into two sets: training and test. The testFraction parameter specifies the fraction of the dataset to hold back for testing; in our case, 0.2 (20%). If omitted, this parameter defaults to 0.1.
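TrainTestSplit also accepts an optional seed parameter, which is useful when the same split should be reproduced between runs. The following is a minimal sketch using in-memory rows instead of a file; the Row class and the SplitSketch helper are hypothetical names introduced only for this illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type used only for this illustration.
public class Row
{
    public bool Label { get; set; }
    public string Text { get; set; }
}

public static class SplitSketch
{
    // Splits rowCount generated rows and returns how many landed in each set.
    public static (int TrainCount, int TestCount) CountSplit(int rowCount, double testFraction, int seed)
    {
        var mlContext = new MLContext();

        IEnumerable<Row> rows = Enumerable.Range(0, rowCount)
            .Select(i => new Row { Label = i % 2 == 0, Text = $"sample {i}" });

        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // A fixed seed requests the same row assignment on every run.
        DataOperationsCatalog.TrainTestData split =
            mlContext.Data.TrainTestSplit(data, testFraction: testFraction, seed: seed);

        int trainCount = mlContext.Data.CreateEnumerable<Row>(split.TrainSet, reuseRowObject: false).Count();
        int testCount = mlContext.Data.CreateEnumerable<Row>(split.TestSet, reuseRowObject: false).Count();

        return (trainCount, testCount);
    }
}
```

Because rows are assigned to a set at random, the test set holds approximately, not exactly, the requested fraction; the two counts always add up to the original row count.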

  4. We then create the pipeline:
TextFeaturizingEstimator dataProcessPipeline = MlContext.Transforms.Text.FeaturizeText(
    outputColumnName: "Features",
    inputColumnName: nameof(RestaurantFeedback.Text));

Future examples will have much more complex pipelines. In this example, we simply use FeaturizeText to convert the Text property discussed earlier into a numeric feature vector stored in the Features output column.

  5. Next, we instantiate our trainer:
SdcaLogisticRegressionBinaryTrainer sdcaRegressionTrainer = MlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
    labelColumnName: nameof(RestaurantFeedback.Label),
    featureColumnName: "Features");

As you might remember from Chapter 1, Getting Started with Machine Learning and ML.NET, the various algorithms found in ML.NET are referred to as trainers. In this project, we are using a Stochastic Dual Coordinate Ascent (SDCA) trainer.

  6. Then, we complete the pipeline by appending the trainer we instantiated previously:
EstimatorChain<BinaryPredictionTransformer<CalibratedModelParametersBase<LinearBinaryModelParameters, PlattCalibrator>>> trainingPipeline = dataProcessPipeline.Append(sdcaRegressionTrainer);
  7. Next, we train the model with the training set we created earlier:
ITransformer trainedModel = trainingPipeline.Fit(dataSplit.TrainSet);
  8. We save our newly created model to the filename specified, along with the training set's schema:
MlContext.Model.Save(trainedModel, dataSplit.TrainSet.Schema, ModelPath);
  9. Now, we transform the test set we created earlier with our newly created model:
IDataView testSetTransform = trainedModel.Transform(dataSplit.TestSet);
  10. Finally, we pass the testSetTransform object created previously into the BinaryClassification class's Evaluate method:
CalibratedBinaryClassificationMetrics modelMetrics = MlContext.BinaryClassification.Evaluate(
    data: testSetTransform,
    labelColumnName: nameof(RestaurantFeedback.Label),
    scoreColumnName: nameof(RestaurantPrediction.Score));

Console.WriteLine($"Area Under Curve: {modelMetrics.AreaUnderRocCurve:P2}{Environment.NewLine}" +
    $"Area Under Precision Recall Curve: {modelMetrics.AreaUnderPrecisionRecallCurve:P2}{Environment.NewLine}" +
    $"Accuracy: {modelMetrics.Accuracy:P2}{Environment.NewLine}" +
    $"F1Score: {modelMetrics.F1Score:P2}{Environment.NewLine}" +
    $"Positive Recall: {modelMetrics.PositiveRecall:#.##}{Environment.NewLine}" +
    $"Negative Recall: {modelMetrics.NegativeRecall:#.##}{Environment.NewLine}");

The Evaluate method generates the model metrics by comparing the model's predictions on the test set against the test set's labels. We then print the main metrics. We will dive into these properties specifically in the Evaluating the Model section of this chapter.
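As a rough preview of what some of these numbers mean, the sketch below computes accuracy, recall, and F1 score from confusion-matrix counts using their standard textbook definitions. It is an illustration only and not ML.NET's implementation; the MetricsSketch class and the example counts are hypothetical:

```csharp
using System;

// Hypothetical helper illustrating the standard metric definitions.
public static class MetricsSketch
{
    // Fraction of all predictions that were correct.
    public static double Accuracy(int tp, int fp, int tn, int fn) =>
        (double)(tp + tn) / (tp + fp + tn + fn);

    // Fraction of actual positives that were predicted positive.
    public static double PositiveRecall(int tp, int fn) =>
        (double)tp / (tp + fn);

    // Fraction of actual negatives that were predicted negative.
    public static double NegativeRecall(int tn, int fp) =>
        (double)tn / (tn + fp);

    // Harmonic mean of precision and positive recall.
    public static double F1Score(int tp, int fp, int fn) =>
        2.0 * tp / (2.0 * tp + fp + fn);
}
```

For example, with 40 true positives, 10 false positives, 35 true negatives, and 15 false negatives, MetricsSketch.Accuracy(40, 10, 35, 15) evaluates to 0.75.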
