
Feature identification

Imagine that you are responsible for placing targeted product advertisements on an e-commerce store that you help run. The goal is to analyze a visitor's past shopping trends and select products to display that will increase the shopper's likelihood of making a purchase. Blessed with the gift of foresight, you've been collecting 50 different metrics on all of your shoppers for months: past purchases, the product categories of those purchases, the price tag on each purchase, the time each user spent on the site before making a purchase, and so on.

Believing that ML is a silver bullet, that more data is better, and that more training of your model is better, you load all 50 dimensions of data into an algorithm and train it for days on end. When testing the algorithm, you find that its accuracy is very high on the data points it was trained on, but that it fails spectacularly when evaluated against your validation set. Additionally, the model has taken a very long time to train. What went wrong here?
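You can reproduce this failure mode in a few lines. The following sketch uses scikit-learn and synthetic data in place of the real shopper metrics (50 features, only 10 of which carry signal), with an unconstrained decision tree standing in for the overtrained model; the exact numbers will vary, but the gap between training and validation accuracy is the symptom to look for.

```python
# A minimal sketch of diagnosing overfitting by comparing training accuracy
# to accuracy on a held-out validation set. The data is synthetic, standing
# in for the 50 shopper metrics described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 50 recorded dimensions, but only 10 of them actually carry signal
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_redundant=0, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# An unconstrained tree is free to memorize individual training points
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("Training accuracy:  ", model.score(X_train, y_train))  # typically near 1.0
print("Validation accuracy:", model.score(X_val, y_val))      # noticeably lower
```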

First, you've made the assumption that all of your 50 dimensions of data are relevant to the task at hand. It turns out that not all data is relevant. ML is great at finding patterns within data, but not all data actually contains patterns. Some data is random, and other data follows a pattern yet is still uninteresting. One example of uninteresting data that fits a pattern might be the time of day at which the shopper browses your site: users can only shop while they're awake, so most of your users shop between 7 a.m. and midnight. This data obviously follows a pattern, but may not actually affect the user's purchase intent. Of course, there may indeed be an interesting pattern: perhaps night owls tend to make late-night impulse purchases, but maybe not.
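One way to probe whether a feature like browsing hour carries any real signal is to measure its mutual information with the purchase label. The sketch below is a simulation with two hypothetical features: a browsing hour that follows the 7 a.m.-to-midnight pattern but is independent of purchasing, and a past-purchase count that actually drives it. Scikit-learn's mutual_info_classif is used here purely for illustration.

```python
# A minimal sketch of checking whether a feature carries information about
# the target. The "browse hour" feature is simulated so that it follows a
# clear daily pattern but is independent of whether a purchase happens.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 5000

# Most browsing happens between 7 a.m. and midnight: a pattern, but not a signal
browse_hour = rng.choice(np.arange(7, 24), size=n)

# A hypothetical relevant feature: number of past purchases
past_purchases = rng.poisson(2, size=n)

# Purchase intent depends on past purchases, not on the hour of browsing
purchase = (past_purchases + rng.normal(0, 1, size=n) > 2).astype(int)

X = np.column_stack([browse_hour, past_purchases])
scores = mutual_info_classif(X, purchase, random_state=0)
print("MI(browse_hour, purchase):   ", scores[0])  # expected to be close to zero
print("MI(past_purchases, purchase):", scores[1])  # expected to be noticeably higher
```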

Second, using all 50 dimensions and training your model for a long period of time may cause overfitting of your model: instead of generalizing behavioral patterns and making shopping predictions, your overfitted model is now very good at identifying that a certain behavior represents Steve Johnson (one specific shopper), rather than generalizing Steve's behavior into a widely applicable trend. This overfitting was caused by two factors: the long training time and the existence of irrelevant data in the training set. If one of the dimensions you've recorded is largely random and you spend a lot of time training a model on that data, the model may end up using that random data as an identifier for a user rather than filtering it out as a non-trend. The model may learn that, when a user's time on site is exactly 182 seconds, they will purchase a product worth $120, simply because it has seen that exact data point many thousands of times during training.
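The sketch below shows this memorization directly. It is again synthetic: a single noise column stands in for an exact time-on-site reading, and the purchase labels are generated independently of it, so there is nothing genuine to learn. An unconstrained decision tree nevertheless reaches perfect training accuracy by treating each noise value as an identifier, while doing no better than chance on held-out data.

```python
# A minimal sketch showing how a flexible model can memorize a purely random
# feature and use it as an identifier for individual rows.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000

# One completely random feature (e.g. time on site measured to the millisecond)
# and labels that have nothing to do with it
time_on_site = rng.uniform(0, 600, size=(n, 1))
purchased = rng.integers(0, 2, size=n)

X_train, X_val, y_train, y_val = train_test_split(
    time_on_site, purchased, test_size=0.25, random_state=1
)

model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

print("Training accuracy:  ", model.score(X_train, y_train))  # ~1.0: every row memorized
print("Validation accuracy:", model.score(X_val, y_val))      # ~0.5: no better than chance
```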

Let's consider a different example: face identification. You've got thousands of photos of people's faces and want to be able to analyze a photo and determine who the subject is. You train a CNN on your data, and find that the accuracy of your algorithm is quite low: it can only correctly identify the subject 60% of the time. The problem here may be that your CNN, working with raw pixel data, has not been able to automatically identify the features of a face that actually matter. For instance, Sarah Jane always takes her selfies in her kitchen, and her favorite spatula is always on display in the background. Any other user who also happens to have a spatula in the picture may be falsely identified as Sarah Jane, even if their faces are quite different. The data has overtrained the neural network to recognize spatulas as Sarah Jane, rather than actually looking at the subject's face.

In both of these examples, the problem starts with insufficient preprocessing of data. In the e-commerce store example, you have not correctly identified the features of a shopper that actually matter, and so have trained your model with a lot of irrelevant data. The same problem exists in the face identification example: not every pixel in the photograph represents a person or their features, and in seeing a reliable pattern of spatulas, the algorithm has learned that Sarah Jane is a spatula.

To solve both of these problems, you need to make a better selection of the features that you give to your ML model. In the e-commerce example, it may turn out that only 10 of your 50 recorded dimensions are relevant; to fix the problem, you must identify which 10 dimensions those are and use only those when training your model. In the face identification example, perhaps the neural network should not receive raw pixel intensity data but instead facial measurements such as nose bridge length, mouth width, distance between pupils, distance between pupil and eyebrow, distance between earlobes, distance from chin to hairline, and so on. Both of these examples demonstrate the need to select the most relevant and appropriate features of your data. Making the appropriate selection of features will serve to improve both the speed and accuracy of your model.
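As one concrete way to make that selection in the e-commerce case, the sketch below (synthetic data again, with scikit-learn's SelectKBest chosen here as the selection method) scores all 50 dimensions against the purchase label, keeps the 10 highest-scoring ones, and compares a model trained on those 10 against a model trained on all 50. Beyond any accuracy difference, the reduced model also has far fewer inputs to process, which is where the speed gain comes from.

```python
# A minimal sketch of feature selection: keep only the k most informative of
# the 50 recorded dimensions before training. The data is synthetic and the
# choice of k=10 mirrors the example above.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=10,
    n_redundant=0, random_state=7
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Baseline: train on all 50 dimensions
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Selected: score each feature against the label, keep the top 10, then train
selected = make_pipeline(
    SelectKBest(mutual_info_classif, k=10),
    LogisticRegression(max_iter=1000),
).fit(X_train, y_train)

print("Validation accuracy, all 50 features:", baseline.score(X_val, y_val))
print("Validation accuracy, top 10 features:", selected.score(X_val, y_val))
```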
