
Preprocessing using pipelines

When taking measurements of real-world objects, we can often get features in very different ranges. For instance, if we are measuring the qualities of an animal, we might have several features, as follows:

  • Number of legs: This ranges from 0 to 8 for most animals, although some have many more!
  • Weight: This ranges from just a few micrograms all the way up to a blue whale weighing 190,000 kilograms!
  • Number of hearts: This can range from zero up to five, in the case of the earthworm.

For a mathematically based algorithm to compare each of these features, the differences in scale, range, and units can be difficult to interpret. If we used the above features in many algorithms, the weight would probably be the most influential feature, purely because it has the larger numbers and not because of any actual effectiveness of the feature.

One method to overcome this is to use a process called preprocessing to normalize the features so that they all have the same range, or so that they are put into categories such as small, medium, and large. Once that is done, the large difference between the types of features has less of an impact on the algorithm, which can lead to large increases in accuracy.
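To see why scale matters, here is a minimal sketch using made-up values for the three animal features above (the numbers are purely illustrative):

import numpy as np

# Three made-up animals: [number of legs, weight in kg, number of hearts]
cat = np.array([4, 4.5, 1])
horse = np.array([4, 450.0, 1])
earthworm = np.array([0, 0.001, 5])

# On the raw features, Euclidean distance is dominated by weight
print(np.linalg.norm(cat - horse))      # about 445.5, driven almost entirely by weight
print(np.linalg.norm(cat - earthworm))  # about 7.2, so the cat looks closer to the earthworm

# Scale each feature to the range 0 to 1, then compare again
animals = np.vstack([cat, horse, earthworm])
mins, maxs = animals.min(axis=0), animals.max(axis=0)
scaled = (animals - mins) / (maxs - mins)
print(np.linalg.norm(scaled[0] - scaled[1]))  # cat versus horse is now the smaller distance
print(np.linalg.norm(scaled[0] - scaled[2]))  # cat versus earthworm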

Preprocessing can also be used to select only the most effective features, to create new features, and so on. Preprocessing in scikit-learn is done through Transformer objects, which take a dataset in one form and return an altered dataset after some transformation of the data. These transformations don't have to be numerical, as Transformers are also used to extract features; however, in this section, we will stick with preprocessing.
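Concretely, a transformer exposes a fit function, which learns any statistics it needs from the data, and a transform function, which returns the altered dataset. As a minimal sketch of that interface (the MeanCenterer class here is made up for illustration and is not part of scikit-learn):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that subtracts each feature's mean."""

    def fit(self, X, y=None):
        # Learn the statistics this transformer needs from the data
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        # Return an altered copy of the dataset
        return np.asarray(X) - self.means_

# TransformerMixin provides fit_transform, which calls fit and then transform
X_centered = MeanCenterer().fit_transform(np.array([[1.0, 10.0], [3.0, 30.0]]))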

An example

We can show an example of the problem by breaking the Ionosphere dataset. While this is only an example, many real-world datasets have problems of this form. First, we create a copy of the array so that we do not alter the original dataset:

X_broken = np.array(X)

Next, we break the dataset by dividing every second feature by 10:

X_broken[:,::2] /= 10

In theory, this should not have a great effect on the result. After all, the values for these features are still relatively the same. The major issue is that the scale has changed and the odd features are now larger than the even features. We can see the effect of this by computing the accuracy:

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("The original average accuracy for is {0:.1f}%".format(np.mean(original_scores) * 100))
broken_scores = cross_val_score(estimator, X_broken, y, scoring='accuracy')
print("The 'broken' average accuracy for is {0:.1f}%".format(np.mean(broken_scores) * 100))

This gives a score of 82.3 percent for the original dataset, which drops down to 71.5 percent on the broken dataset. We can fix this by scaling all the features to the range 0 to 1.

Standard preprocessing

The preprocessing we will perform for this experiment is called feature-based normalization through the MinMaxScaler class. Continuing with the IPython notebook from the rest of this chapter, first, we import this class:

from sklearn.preprocessing import MinMaxScaler

This class takes each feature and scales it to the range 0 to 1. The minimum value is replaced with 0, the maximum with 1, and the other values fall proportionally in between.
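As a quick illustration of what this scaling does (the feature column here is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single illustrative feature column
column = np.array([[1.0], [3.0], [5.0], [9.0]])

# MinMaxScaler applies (x - min) / (max - min) to each feature independently
print(MinMaxScaler().fit_transform(column).ravel())
# [0.   0.25 0.5  1.  ], so 1 maps to 0, 9 maps to 1, and the rest fall in between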

To apply our preprocessor, we run the transform function on it. A transformer first needs to be fitted to the data, in the same way that a classifier does; for MinMaxScaler, this fitting step is where the minimum and maximum of each feature are found. We can combine the two steps by running the fit_transform function instead:

X_transformed = MinMaxScaler().fit_transform(X)

Here, X_transformed will have the same shape as X. However, each column will have a maximum of 1 and a minimum of 0.
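If you want to confirm this, a quick sanity check (using the X and X_transformed variables from above) is:

# The transformed array keeps the original shape, and every value lies between 0 and 1
print(X_transformed.shape == X.shape)
print(X_transformed.min(), X_transformed.max())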

There are various other forms of normalization like this, each effective for different applications and feature types (a short sketch of all three follows this list):

  • Ensure the sum of the values for each sample equals 1, using sklearn.preprocessing.Normalizer
  • Force each feature to have a zero mean and a variance of 1, using sklearn.preprocessing.StandardScaler, which is a commonly used starting point for normalization
  • Turn numerical features into binary features, where any value above a threshold is 1 and any below is 0, using sklearn.preprocessing.Binarizer
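Here is a minimal sketch of these three transformers on a tiny made-up dataset. Note that Normalizer uses the L2 norm by default, so norm='l1' is passed to match the sum-to-one description above:

import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, Binarizer

# A tiny made-up dataset: three samples, two features
X_small = np.array([[1.0, 3.0],
                    [2.0, 2.0],
                    [0.0, 4.0]])

# Normalizer works per sample; norm='l1' makes each row sum to 1
print(Normalizer(norm='l1').fit_transform(X_small))

# StandardScaler works per feature; each column gets zero mean and unit variance
print(StandardScaler().fit_transform(X_small))

# Binarizer thresholds values; anything above the threshold becomes 1, the rest 0
print(Binarizer(threshold=1.5).fit_transform(X_small))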

We will use combinations of these preprocessors in later chapters, along with other types of Transformer objects.

Putting it all together

We can now create a workflow by combining the code from the previous sections, using the broken dataset previously calculated:

X_transformed = MinMaxScaler().fit_transform(X_broken)
estimator = KNeighborsClassifier()
transformed_scores = cross_val_score(estimator, X_transformed, y, scoring='accuracy')
print("The average accuracy for is {0:.1f}%".format(np.mean(transformed_scores) * 100))

This gives us back our score of 82.3 percent accuracy. The MinMaxScaler resulted in features of the same scale, meaning that no feature overpowered the others simply by having bigger values. While the Nearest Neighbor algorithm can be confused by features with larger values, some algorithms handle scale differences better. In contrast, some are affected much worse!
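The section title mentions pipelines, and this workflow is where they help: calling fit_transform on the whole dataset before cross_val_score means the scaler has already seen the test folds. A Pipeline chains the scaler and the classifier together, so that within each cross-validation fold the scaler is fitted on the training portion only. Here is a minimal sketch (the step names 'scale' and 'predict' are arbitrary labels, and X_broken and y come from earlier in this section):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Chain the scaler and the classifier into a single estimator
pipeline = Pipeline([('scale', MinMaxScaler()),
                     ('predict', KNeighborsClassifier())])

# cross_val_score fits the whole pipeline on each training fold
pipeline_scores = cross_val_score(pipeline, X_broken, y, scoring='accuracy')
print("The pipeline average accuracy is {0:.1f}%".format(np.mean(pipeline_scores) * 100))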
