
The model development life cycle

When asked to solve a problem using machine learning, data scientists achieve this by following a sequence of steps. In this section, we are going to discuss those iterative steps.

Understanding a problem

"All models are wrong, but some are useful."
– George Box

The first thing to do when developing a model is to understand the problem you are trying to solve thoroughly. This not only involves understanding what problem you are solving, but also why you are solving it, what impact you are expecting to have, and what the currently available solution is that you are comparing your new solution to. My understanding of what Box said when he stated that all models are wrong is that a model is just an approximation of reality that captures one or more angles of it. By understanding the problem you are trying to solve, you can decide which angles of reality you need to model, and which ones you can tolerate ignoring.

You also need to understand the problem well to decide how to split your data for training and evaluation (more on that in the next section). You can then decide what kind of model to use. Is the problem suitable for supervised or unsupervised learning? Are we better off using classification or regression algorithms for this problem? What kind of classification algorithm will serve us best? Is a linear model good enough to approximate our reality? Do we need the most accurate model or one that we can easily explain to its users and to the business stakeholders?

Minimal exploratory data analysis can be done here: you can check whether you have labels and, if they are present, check their cardinality to decide whether you are dealing with a classification or a regression problem. I would still save any further data analysis until after the dataset is split into training and test sets. It is important to limit advanced data analysis to the training set only to ensure your model's generalizability.
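
As a quick sketch of that first check, the following code inspects a made-up pandas label column (the name y is purely illustrative); a small, fixed set of distinct values usually hints at classification, while many continuous values hint at regression:

```python
import pandas as pd

# A made-up label column, purely for illustration
y = pd.Series(['returned', 'lost', 'returned', 'returned', 'lost'])

print(y.nunique())       # 2 distinct values: this looks like classification
print(y.value_counts())  # how often each label occurs
```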

Finally, we need to understand what we are comparing our model to. What is the current baseline that we need to improve on? If there are already business rules in place, then our model has to be better at solving the problem at hand than these rules. To be able to decide how much better it is at solving the problem, we need to use evaluation metrics—metrics that are suitable for our model and also as close as possible to our business requirements. If our aim is to increase revenue, then our metric should be good at estimating the increase in revenue when our model is used, compared to the current status quo. If our aim is to increase repeat purchases regardless of the revenue, then other metrics may be more suitable.

Splitting our data

As we have seen in supervised learning, we train our model on a set of data where the correct answers (labels) are given. Learning, however, is only half of the problem. We also want to be able to tell whether the model we built is going to do a good job when used on future data. We cannot foresee the future, but we can use the data we already have to evaluate our model.

We do this by splitting our data into parts. We use one part of the data to train the model (the training set) and then use a separate part to evaluate the model (the test set). Since we want our test set to be as close as possible to the future data, there are two key points discussed in the following subsections to keep in mind when splitting our data:

  • Finding the best manner to split the data
  • Making sure the training and test datasets are separate

Finding the best manner to split the data

Say your users' data is sorted according to their country in alphabetical order. If you just take the first N records for training and the rest for testing, you will end up training the model on users from certain countries and will never let it learn anything about users from, say, Zambia and Zimbabwe. So, one common solution is to randomize your data before splitting it. Random split is not always the best option, however. Say we want to build a model to predict the stock prices or climate change phenomena a few years ahead. To be confident that our system will capture temporal trends such as global warming, we need to split our data based on time. We can train on earlier data and see whether the model can do a good job in predicting more recent data.
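
As a minimal sketch of the difference, the following code uses scikit-learn's train_test_split() on made-up, time-ordered data; the shuffle parameter decides between a random split and a split that respects the original order:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, time-ordered data: one feature per day and a value to predict
X = np.arange(100).reshape(-1, 1)
y = np.random.rand(100)

# Random split: shuffle=True (the default) mixes all time periods together
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Time-based split: shuffle=False trains on the earliest 75% of the records
# and tests on the most recent 25%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
```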

Sometimes, the incidents we want to predict are rare. It can be that the fraud cases that occur in your payment system make up only 0.1% of your transactions. If you randomly split your data, you may be unlucky and have the vast majority of the fraud cases in the training data and very few cases in the test data, or vice versa. So, it is advised that you use stratification when it comes to highly unbalanced data. Stratification makes sure that the distribution of your targets is more or less the same in both the training and test datasets.

A stratified sampling strategy is used to make sure that the different subgroups in our population are represented in our samples. If my dataset is made up of 99% males and 1% females, a random sample of the population may end up having only males in it. So, you should separate the male and female populations first, take a sample from each of the two, and combine them later to make sure they are both represented in the final sample. The same concept applies here if we want to make sure all the class labels are present in our training and test sets. Later on in this book, we will be splitting our data using the train_test_split() function. This function stratifies its samples by the class labels when you pass those labels to its stratify parameter.
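
Here is a minimal sketch of such a stratified split on a made-up, unbalanced target, passing the labels to the stratify parameter of train_test_split():

```python
import numpy as np
from sklearn.model_selection import train_test_split

# A made-up, unbalanced target: 95 legitimate payments and 5 fraud cases
y = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the fraud ratio roughly the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(y_train.mean(), y_test.mean())  # both close to 0.05
```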

Making sure the training and the test datasets are separate

One of the most common mistakes new data scientists may fall prey to is the look-ahead bias. We use the test dataset to simulate the data we will see in the future, but usually, the test dataset contains information that we can only know after time has passed. Take the case of our example space vehicles; we may have two columns—one saying whether the vehicle returns, and the other saying how long the vehicle will take to return. If we are to build a classifier to predict whether a vehicle will return, we will use the former column as our target, but we will never use the latter column as a feature. We can only know how long a vehicle stayed in outer space once it is actually back. This example looks trivial, but believe me, look-ahead bias is a very common mistake, especially when dealing with less obvious cases than this one.
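
A sketch of that example, with made-up column names, could look like the following; the point is simply that the after-the-fact column never makes it into the features:

```python
import pandas as pd

# Made-up space-vehicle records; the column names are hypothetical
df = pd.DataFrame({
    'engine_power': [120, 90, 150, 110],
    'launch_mass': [900, 750, 1200, 800],
    'returned': [1, 0, 1, 1],                 # our target
    'days_until_return': [12, None, 30, 18],  # only known after the vehicle is back
})

# Keep the target out of the features and drop the column that leaks the future
X = df.drop(columns=['returned', 'days_until_return'])
y = df['returned']
```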

Besides training the model itself, you also learn things from the data in order to preprocess it. Say, instead of users' heights in centimeters, you want a feature stating whether a user's height is above or below the median. To do that, you need to go through the data and calculate the median. Now, since anything that we learn has to come from the training set itself, we also need to learn this median from the training set and not from the entire dataset. Luckily, scikit-learn's preprocessing transformers and estimators provide separate fit(), transform(), and predict() methods. This ensures that anything learned from the data (via the fit() method) is learned from the training dataset only, and can then be applied to the test set (via the transform() and/or predict() methods).
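
The height example could be sketched as follows (the numbers are made up); the same pattern is what scikit-learn's transformers enforce through their fit() and transform() methods, shown here with StandardScaler as one example:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up heights in centimeters
heights = np.array([[150], [160], [165], [170], [172], [180], [185], [190]])
heights_train, heights_test = train_test_split(
    heights, test_size=0.25, random_state=0
)

# Learn the median from the training set only, then reuse it for the test set
median_height = np.median(heights_train)
above_median_train = (heights_train > median_height).astype(int)
above_median_test = (heights_test > median_height).astype(int)

# The same idea with a scikit-learn transformer: fit() learns from the
# training set, transform() applies what was learned to both sets
scaler = StandardScaler()
scaler.fit(heights_train)
heights_train_scaled = scaler.transform(heights_train)
heights_test_scaled = scaler.transform(heights_test)
```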

Development set

When developing a model, we need to try multiple configurations of the model to decide which configuration gives the best results. To be able to do so, we usually split the training dataset further into training and development sets. Having two new subsets allows us to try different configurations when training on one of the two subsets and evaluating the effect of those configuration changes on the other. Once we find the best configuration, we evaluate our model with its final configuration on the test set. In Chapter 2, Making Decisions with Trees, we will do all this in practice. Note that I will be using the terms model configuration and hyperparameters interchangeably.
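
One common way to carve out such a development set, sketched here on made-up data, is to call train_test_split() twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data, purely for illustration
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First, set the test set aside; it is only touched at the very end
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
# Then split the remaining data into training and development sets, and use
# the development set to compare different model configurations
X_train, X_dev, y_train, y_dev = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0
)
```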

Evaluating our model

Evaluating your model's performance is essential for picking the best algorithm for the job and for estimating how your model will perform in real life. As Box said, a model that is wrong can still be useful. Take the example of a web start-up. They run an ad campaign where they pay $1 for each view they get, and they know that for every 100 viewers, only one viewer signs up and buys stuff for $50. In other words, they have to spend $100 to make $50. Obviously, that's a bad Return on Investment (ROI) for their business. Now, what if you create a model for them that picks which users to target, but your new model is only correct 10% of the time? Is 10% precision good or bad, in this case? Well, of course, this model is wrong 90% of the time, which may sound like a very bad model, but if we calculate the ROI now, then for every $100 they spend, they make $500. Well, I would definitely pay you to build me this model that is quite wrong, yet quite useful!
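
The arithmetic behind that example, written out as a quick sketch:

```python
cost_per_view = 1        # dollars spent per targeted user
revenue_per_signup = 50  # dollars earned per user who signs up and buys
views = 100
cost = views * cost_per_view                           # $100 either way

baseline_revenue = views * 0.01 * revenue_per_signup   # 1 in 100 converts -> $50
model_revenue = views * 0.10 * revenue_per_signup      # 10 in 100 convert -> $500

print(baseline_revenue - cost, model_revenue - cost)   # -50 versus +400
```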

scikit-learn provides a large number of evaluation metrics that we will be using to evaluate the models we build in this book. But remember, a metric is only useful if you really understand the problem you are solving and its business impact.
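
As a small example of one such metric, here is precision computed with scikit-learn on made-up labels and predictions:

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 1, 0, 1, 0]  # made-up ground-truth labels
y_pred = [1, 1, 0, 0, 1, 0]  # made-up model predictions

# Precision: out of everything predicted as positive, how much really was positive
print(precision_score(y_true, y_pred))  # 2 correct out of 3 positive predictions
```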

Deploying in production and monitoring

The main reason that many data scientists use Python for machine learning instead of R, for example, is that it makes it easier to productionize your code. Python has plenty of web frameworks that you can use to build APIs and serve your machine learning models behind them. It is also supported by all cloud providers. I find it important that the team developing a model is also responsible for deploying it in production. Building your model in one language and then asking another team to port it into another language is error-prone. Of course, having one person or team building and deploying models may not be feasible in larger companies or due to other implementation constraints.

However, keeping the two teams in close contact and making sure that the ones developing the model can still understand the production code is essential and helps to minimize errors caused by inconsistencies between the development and production code.
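
As a minimal sketch of what putting a model behind an API can look like, here is a toy Flask endpoint; Flask is just one of the many frameworks mentioned above, and the model file name is hypothetical:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
# A previously trained and saved model; 'model.joblib' is a made-up file name
model = joblib.load('model.joblib')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON payload such as {"features": [1.2, 3.4, 5.6]}
    features = request.get_json()['features']
    prediction = model.predict([features])[0]
    return jsonify({'prediction': float(prediction)})

if __name__ == '__main__':
    app.run()
```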

We try our best not to have any look-ahead bias when training our models. We hope the data doesn't change after our models are trained, and we want our code to be bug-free. However, we cannot guarantee any of this. We may overlook the fact that a user's credit score is only added to the database after they make their first purchase. We may not know that our developers decided to switch to the metric system for our inventory's weights, while those weights were saved in pounds when the model was trained. Because of that, it is important to log all the predictions your model makes so that you can monitor its performance in real life and compare it to the test set's performance. You can also log the test set's performance every time you retrain the model, or keep track of the target's distribution over time.
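
A minimal sketch of such prediction logging, assuming a trained model object and Python's built-in logging module, could look like this:

```python
import logging

logging.basicConfig(filename='predictions.log', level=logging.INFO)

def predict_and_log(model, features):
    # Log every prediction so live behavior can later be compared with
    # the performance we measured on the test set
    prediction = model.predict([features])[0]
    logging.info('features=%s prediction=%s', features, prediction)
    return prediction
```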

Iterating

Often, when you deploy a model, you end up with more data. Furthermore, the performance of your model is not guaranteed to be the same when deployed in production. This can be due to some implementation issues or mistakes that took place during the evaluation process. Those two points mean that the first version of your solution is always up for improvement. Starting with simple solutions (that can be improved via iterations) is an important concept for agile programming and is a paramount concept for machine learning.

This whole process, from understanding the problem to monitoring the ongoing improvements on the solution, requires tools that allow us to iterate quickly and efficiently. In the next section, we will introduce you to scikit-learn and explain why many machine learning practitioners consider it the right tool for the job.

When to use machine learning

"Pretty much anything that a normal person can do in less than 1 second, we can now automate with AI."
– Andrew Ng

One additional note before moving on to the next section is that when faced with a problem, you have to decide whether machine learning is apt for the task. Andrew Ng's 1-second rule is a good heuristic for you to estimate whether a machine learning-based solution will work. The main reason behind this is that computers are good with patterns. They are way better than humans at picking repeated patterns and acting on them.

Once they identify the same pattern over and over again, it is easy to codify it so that the same decision is made every time. In the same manner, computers are also good with tactics. In 1908, Richard Teichmann stated that a game of chess is 99% based on tactics. Maybe that's why computers have beaten humans at chess since 1997. If we are to believe Teichmann's statement, then the remaining 1% is strategy. Unlike tactics, strategy is the arena where humans beat machines. If the problem you want to solve can be formulated as a set of tactics, then go for machine learning and leave the strategic decisions for humans to make. In the end, most of our day-to-day decisions are tactical. Furthermore, one man's strategy is often someone else's tactics.
