The Trainer class contains a single method, Train. At a high level, the Train method does the following:
It loads the training data (in this case our CSV) into memory.
It builds a training set and a test set.
It creates the pipeline.
It trains and saves the model.
It performs an evaluation on the model.
This is the structure and flow we will follow throughout the rest of this book. Now, let's dive into the code behind the Train method:
First, we check that the training data file exists:
if (!File.Exists(trainingFileName))
{
    // Report the missing file and stop before any ML.NET calls are made.
    Console.WriteLine($"Failed to find training data file ({trainingFileName})");

    return;
}
Even though this is a simple test application, it is good practice to treat it like a production-grade application. In addition, because this is a console application, it is easy to pass in an incorrect path for the training data, which would otherwise cause exceptions further on in the method.
Use the LoadFromTextFile helper method that ML.NET provides to assist with the loading of text files into an IDataView object:
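A minimal sketch of such a call might look as follows. The RestaurantFeedback definition shown here is only a stand-in for the class mentioned earlier (its column indices are assumptions), and LoadExample is a hypothetical wrapper added so that the snippet compiles on its own:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Stand-in for the RestaurantFeedback class described earlier;
// the column indices are assumptions about the file layout.
public class RestaurantFeedback
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(1)]
    public string Text { get; set; }
}

// Hypothetical wrapper class, used for illustration only.
public static class LoadExample
{
    public static IDataView Load(MLContext mlContext, string trainingFileName)
    {
        // Only the path and the row type are supplied; every other
        // parameter keeps its default value.
        return mlContext.Data.LoadFromTextFile<RestaurantFeedback>(trainingFileName);
    }
}
```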
The call passes in both the training filename and the type; in this case, it is the RestaurantFeedback class mentioned earlier. It should be noted that this method has several other parameters, including the following:
separatorChar: This is the column separator character; it defaults to \t (in other words, a tab).
hasHeader: If set to true, the first row of the dataset is treated as a header; it defaults to false.
allowQuoting: This defines whether the source file can contain columns defined by a quoted string; it defaults to false.
trimWhitespace: This removes trailing whitespace from the rows; it defaults to false.
allowSparse: This defines whether the file can contain numerical vectors in sparse format; it defaults to false. The sparse format requires an additional column that holds the number of features.
For most projects used throughout this book, we will use the default settings.
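For a file that does not match the defaults, the parameters listed above can be overridden by name. The sketch below assumes a hypothetical comma-separated file with a header row; the FeedbackRow type, its column indices, and the LoadOptionsExample class are illustrative and not part of the project:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type; the column indices are assumptions.
public class FeedbackRow
{
    [LoadColumn(0)]
    public bool Label { get; set; }

    [LoadColumn(1)]
    public string Text { get; set; }
}

// Hypothetical wrapper class, used for illustration only.
public static class LoadOptionsExample
{
    public static IDataView LoadCommaSeparated(MLContext mlContext, string path)
    {
        return mlContext.Data.LoadFromTextFile<FeedbackRow>(
            path,
            separatorChar: ',',    // override the tab default
            hasHeader: true,       // the first row holds column names
            allowQuoting: true,    // permit quoted fields that contain commas
            trimWhitespace: false, // same as the default
            allowSparse: false);   // same as the default
    }
}
```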
Given the IDataView object we created previously, use the TrainTestSplit method that ML.NET provides to create a test set from the main training data:
As mentioned in Chapter 1, Getting Started with Machine Learning and ML.NET, sample data is split into two sets: training and test. The testFraction parameter specifies the fraction of the dataset to hold back for testing, in our case 0.2 (20%). If the parameter is omitted, ML.NET uses a default of 0.1.
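The split described above can be sketched as follows. SplitExample is a hypothetical wrapper, and the IDataView is assumed to be the one produced by the loading step:

```csharp
using Microsoft.ML;

// Hypothetical wrapper class, used for illustration only.
public static class SplitExample
{
    public static DataOperationsCatalog.TrainTestData Split(
        MLContext mlContext, IDataView trainingDataView)
    {
        // Hold back 20% of the rows for testing; the remainder is used for training.
        return mlContext.Data.TrainTestSplit(trainingDataView, testFraction: 0.2);
    }
}
```

The returned TrainTestData object exposes the two partitions through its TrainSet and TestSet properties.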
Next, we create the pipeline. Future examples will have a much more complex pipeline; in this example, we simply map the Text property discussed earlier to the Features output column.
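One way to express this mapping is ML.NET's FeaturizeText transform, sketched below. The choice of FeaturizeText and the literal column name "Text" are assumptions based on the description above (in the project itself, nameof(RestaurantFeedback.Text) would serve the same purpose), and PipelineExample is a hypothetical wrapper:

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

// Hypothetical wrapper class, used for illustration only.
public static class PipelineExample
{
    public static TextFeaturizingEstimator CreateFeaturizer(MLContext mlContext)
    {
        // Convert the raw text in the "Text" column into a numeric
        // feature vector stored in the "Features" column.
        return mlContext.Transforms.Text.FeaturizeText(
            outputColumnName: "Features",
            inputColumnName: "Text");
    }
}
```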
As you might remember from Chapter 1, Getting Started with Machine Learning and ML.NET, the various algorithms found in ML.NET are referred to as trainers. In this project, we are using an SDCA (Stochastic Dual Coordinate Ascent) trainer.
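Instantiating such a trainer might look like the following sketch. It assumes the logistic regression variant of SDCA for binary classification and the column names "Label" and "Features"; TrainerExample is a hypothetical wrapper:

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical wrapper class, used for illustration only.
public static class TrainerExample
{
    public static SdcaLogisticRegressionBinaryTrainer CreateTrainer(MLContext mlContext)
    {
        // The column names are assumptions; both match the defaults
        // of this trainer and are spelled out here for readability.
        return mlContext.BinaryClassification.Trainers.SdcaLogisticRegression(
            labelColumnName: "Label",
            featureColumnName: "Features");
    }
}
```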
Then, we complete the pipeline by appending the trainer we instantiated previously:
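The remaining training steps (appending the trainer, fitting the pipeline on the training set, and saving the model) can be sketched as follows. TrainExample, its parameter names, and the modelPath argument are hypothetical; the estimators are assumed to be the featurizer and trainer created in the previous steps:

```csharp
using Microsoft.ML;

// Hypothetical wrapper class, used for illustration only.
public static class TrainExample
{
    public static ITransformer TrainAndSave(
        MLContext mlContext,
        IEstimator<ITransformer> dataProcessPipeline,
        IEstimator<ITransformer> trainer,
        IDataView trainSet,
        string modelPath)
    {
        // Chain the trainer onto the end of the data-processing steps.
        var trainingPipeline = dataProcessPipeline.Append(trainer);

        // Fit the whole chain on the training partition only.
        ITransformer trainedModel = trainingPipeline.Fit(trainSet);

        // Persist the model together with the schema of its input data.
        mlContext.Model.Save(trainedModel, trainSet.Schema, modelPath);

        return trainedModel;
    }
}
```

Fitting on the training partition alone keeps the test rows unseen, so that the later evaluation reflects how the model handles new data.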
Finally, the evaluation step generates the model metrics: the trained model is applied to the test set, and we then print the main metrics. We will dive into these properties specifically in the Evaluating the Model section of this chapter.
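A sketch of the evaluation step is shown below. It assumes the binary classification Evaluate method with the column names "Label" and "Score"; EvaluateExample is a hypothetical wrapper, and the three metrics printed are one plausible selection rather than the definitive list:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical wrapper class, used for illustration only.
public static class EvaluateExample
{
    public static void PrintMetrics(
        MLContext mlContext, ITransformer trainedModel, IDataView testSet)
    {
        // Score the held-back rows with the trained model.
        IDataView testSetTransform = trainedModel.Transform(testSet);

        // Compare the scores against the true labels.
        CalibratedBinaryClassificationMetrics modelMetrics =
            mlContext.BinaryClassification.Evaluate(
                data: testSetTransform,
                labelColumnName: "Label",
                scoreColumnName: "Score");

        Console.WriteLine($"Area Under Curve: {modelMetrics.AreaUnderRocCurve:P2}");
        Console.WriteLine($"Accuracy: {modelMetrics.Accuracy:P2}");
        Console.WriteLine($"F1 Score: {modelMetrics.F1Score:P2}");
    }
}
```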