- Hands-On Machine Learning with JavaScript
- Burak Kanber
Feature selection and feature extraction
Both feature selection and feature extraction are techniques used to reduce dimensionality, though they are slightly different concepts. Feature selection is the practice of using only the variables or features that are relevant to the problem at hand. In general, feature selection looks at individual features (such as time on site) and makes a determination of the relevance of that single feature. Feature extraction is similar; however, feature extraction often looks at multiple correlated features and combines them into a single feature (like looking at hundreds of individual pixels and converting them into a distance between pupils measurement). In both cases, we are reducing the dimensionality of the problem, but the difference between the two is whether we are simply filtering out irrelevant dimensions (feature selection) or combining existing features in order to derive a new representative feature (feature extraction).
The goal of feature selection is to select the subset of features or dimensions of your data that optimizes the accuracy of your model. Let's take a look at the naive approach to solving this problem: an exhaustive, brute force search of all possible subsets of dimensions. This approach is not viable in real-world applications, but it serves to frame the problem for us. If we take the e-commerce store example, our goal is to find some subset of dimensions or features that gives us the best results from our model. We know we have 50 features to choose from, but we don't know how many are in the optimum set of features. Solving this problem by brute force, we would first pick only one feature at a time, and train and evaluate our model for each feature.
For instance, we would use only time on site as a data point, train the model on that data point, evaluate the model, and record the accuracy of the model. Then we move on to total past purchase amount, train the model, evaluate the model, and record results. We do this 48 more times for the remaining features and record the performance of each. Then we have to consider combinations of two features at a time, for instance by training and evaluating the model on time on site and total past purchase amount, and then training and evaluating on time on site and last purchase date, and so on. There are 1,225 unique pairs of features out of our set of 50, and we must repeat the procedure for each pair. Then we must consider groups of three features at a time, of which there are 19,600 combinations. Then we must consider groups of four features, of which there are 230,300 unique combinations. There are 2,118,760 combinations of five features, and nearly 16 million combinations of six features available to us, and so on. Obviously this exhaustive search for the optimal set of features cannot be done in a reasonable amount of time: with 50 features there are 2^50 - 1 non-empty subsets (more than a quadrillion), so we would have to train our model over a quadrillion times just to find out which subset of features is best! We must find a better approach.
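To get a feel for how quickly this blows up, here is a minimal sketch in plain JavaScript that counts the feature subsets of each size for 50 features. The code itself is illustrative (it is not from the text); the numbers it prints are the same ones quoted above.

```js
// Count how many feature subsets an exhaustive search over 50 features would test.
// Binomial coefficient "n choose k", computed iteratively to avoid huge factorials.
function choose(n, k) {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

const numFeatures = 50;

for (let k = 1; k <= 6; k++) {
  console.log(`Subsets of size ${k}: ${choose(numFeatures, k).toLocaleString()}`);
}

// Total non-empty subsets is 2^50 - 1 -- far too many models to train.
console.log(`All subsets: ${(2 ** numFeatures - 1).toLocaleString()}`);
```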
In general, feature selection techniques are split into three categories: filter methods, wrapper methods, and embedded methods. Each category has a number of techniques, and the technique you select will depend on the data, the context, and the algorithm of your specific situation.
Filter methods are the easiest to implement and are typically the fastest in terms of computation. Filter methods for feature selection analyze a single feature at a time and attempt to determine that feature's relevance to the data. Filter methods typically have no relation to the ML algorithm you use afterwards; rather, they are statistical methods that analyze the feature itself.
For instance, you may use the Pearson correlation coefficient to determine if a feature has a linear relationship with the output variable, and remove features with a correlation very close to zero. This family of approaches will be very fast in terms of computational time, but has the disadvantage of not being able to identify features that are cross-correlated with one another, and, depending on the filter algorithm you use, may not be able to identify nonlinear or complex relationships.
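As a sketch of how such a filter might look in JavaScript, you can compute the Pearson coefficient of each feature against the output variable and discard features whose correlation is near zero. The data layout, feature names, and the 0.1 cutoff below are illustrative assumptions, not values from the text.

```js
// Pearson correlation coefficient between two equal-length arrays of numbers.
function pearson(xs, ys) {
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0, varX = 0, varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Keep only features whose linear correlation with the target clears a threshold.
// `features` maps a feature name to its column of values; `target` is the output variable.
function filterByCorrelation(features, target, threshold = 0.1) {
  return Object.keys(features).filter(
    name => Math.abs(pearson(features[name], target)) >= threshold
  );
}

// Example: "timeOnSite" tracks the amount spent closely, "favoriteColorCode" does not.
const features = {
  timeOnSite:        [5, 12, 8, 20, 3, 15],
  favoriteColorCode: [3,  2, 4,  2, 2,  4],
};
const amountSpent =  [10, 35, 22, 60, 5, 44];

console.log(filterByCorrelation(features, amountSpent)); // => [ 'timeOnSite' ]
```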
Wrapper methods are similar to the brute force approach described earlier, however with the goal of avoiding a full exhaustive search of every combination of features as we did previously. For instance, you may use a genetic algorithm to select subsets of features, train and evaluate the model, and then use the evaluation of the model as evolutionary pressure to find the next subset of features to test.
The genetic algorithm approach may not find the perfect subset of features, but will likely discover a very good subset of features to use. Depending on the actual machine learning model you use and the size of the dataset, this approach may still take a long time, but it will not take an intractably long amount of time like the exhaustive search would. The advantage of wrapper methods is that they interact with the actual model you're training and therefore directly optimize your model, rather than simply attempting to filter out individual features statistically and independently of the model. The major disadvantage of these methods is the computational time it takes to achieve the desired results.
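A compressed sketch of the genetic algorithm wrapper idea, assuming a few things the text does not specify: `trainAndEvaluate` is a stand-in for whatever training and validation you actually run, and the population size, mutation rate, and generation count are arbitrary illustrative values.

```js
// A feature subset is a boolean mask over the 50 available features.
const NUM_FEATURES = 50;

// Stand-in: train the model using only the masked features and return a validation score.
// In a real project this would call into your actual training/evaluation pipeline.
function trainAndEvaluate(mask) {
  /* ... train and evaluate using only features where mask[i] === true ... */
  return Math.random(); // placeholder score
}

const randomMask = () =>
  Array.from({ length: NUM_FEATURES }, () => Math.random() < 0.5);

// Single-point crossover: splice two parent masks together.
function crossover(a, b) {
  const cut = Math.floor(Math.random() * NUM_FEATURES);
  return a.slice(0, cut).concat(b.slice(cut));
}

// Mutation: occasionally flip whether a feature is included.
const mutate = (mask, rate = 0.02) =>
  mask.map(bit => (Math.random() < rate ? !bit : bit));

function geneticFeatureSearch({ populationSize = 20, generations = 30 } = {}) {
  let population = Array.from({ length: populationSize }, randomMask);

  for (let gen = 0; gen < generations; gen++) {
    // Model accuracy acts as the evolutionary fitness of each candidate subset.
    const scored = population
      .map(mask => ({ mask, score: trainAndEvaluate(mask) }))
      .sort((a, b) => b.score - a.score);

    // Keep the best half, then breed children from pairs of surviving parents.
    const survivors = scored.slice(0, populationSize / 2).map(s => s.mask);
    const children = Array.from({ length: populationSize / 2 }, () => {
      const p1 = survivors[Math.floor(Math.random() * survivors.length)];
      const p2 = survivors[Math.floor(Math.random() * survivors.length)];
      return mutate(crossover(p1, p2));
    });
    population = survivors.concat(children);
  }

  // Return the best mask found in the final population.
  return population
    .map(mask => ({ mask, score: trainAndEvaluate(mask) }))
    .sort((a, b) => b.score - a.score)[0].mask;
}
```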
There is also a family of methods called embedded methods; however, these techniques rely on algorithms that have their own feature selection built in and are therefore quite specialized, so we will not discuss them here.
Feature extraction techniques focus on combining existing features into new, derived features that better represent your data while also eliminating extra or redundant dimensionality. Imagine that your e-commerce shopper data includes both time on site and total pixel scrolling distance while browsing as dimensions. Also imagine that both of these dimensions correlate strongly with the amount of money a shopper spends on the site. Naturally, these two features are related to each other: the more time a user spends on the site, the more likely they are to have scrolled a farther distance. Using only feature selection techniques, such as the Pearson correlation analysis, you would find that you should keep both time on site and total distance scrolled as features. The feature selection technique, which analyzes these features independently, has determined that both are relevant to your problem, but has not understood that the two features are actually highly related to each other and therefore redundant.
A more sophisticated feature extraction technique, such as Principal Component Analysis (PCA), would be able to identify that time on site and scroll distance can actually be combined into a single, new feature (let's call it site engagement) that encapsulates the data represented by what used to be two separate features. In this case we have extracted a new feature from the time on site and scrolling distance measurements, and we are using that single feature instead of the two original features separately. This differs from feature selection: in feature selection we are simply choosing which of the original features to use when training our model, whereas in feature extraction we are creating brand new features from related combinations of original features. Both feature selection and feature extraction therefore reduce the dimensionality of our data, but do so in different ways.
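The two-feature case is simple enough to sketch by hand: after standardizing time on site and scroll distance, their correlation matrix is [[1, r], [r, 1]], whose leading eigenvector is (1, 1)/sqrt(2) when r is positive, so projecting onto the first principal component reduces to a scaled sum of the two z-scores. The sketch below assumes that simplification; the siteEngagement name and the sample data are illustrative, not from the text.

```js
// Standardize a column to zero mean and unit variance (z-scores).
function standardize(values) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const std = Math.sqrt(values.reduce((s, v) => s + (v - mean) ** 2, 0) / n);
  return values.map(v => (v - mean) / std);
}

// For two standardized features the correlation matrix is [[1, r], [r, 1]];
// its top eigenvector is (1, 1)/sqrt(2) for r > 0 and (1, -1)/sqrt(2) for r < 0,
// so the first principal component is just a scaled sum (or difference) of z-scores.
function firstPrincipalComponent(featureA, featureB) {
  const a = standardize(featureA);
  const b = standardize(featureB);
  const r = a.reduce((s, v, i) => s + v * b[i], 0) / a.length; // correlation
  const sign = r >= 0 ? 1 : -1;
  return a.map((v, i) => (v + sign * b[i]) / Math.SQRT2);
}

// Illustrative data: the two measurements rise and fall together.
const timeOnSite     = [5, 12, 8, 20, 3, 15];              // minutes
const scrollDistance = [900, 2400, 1500, 4100, 600, 3000]; // pixels

// One derived "siteEngagement" feature replaces the two redundant originals.
const siteEngagement = firstPrincipalComponent(timeOnSite, scrollDistance);
console.log(siteEngagement.map(x => x.toFixed(2)));
```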