
Preprocessing

When taking measurements of real-world objects, we can often get features in different ranges. For instance, if we measure the qualities of an animal, we might have several features, as follows:

  • Number of legs: This ranges from 0 to 8 for most animals, although some have many more!
  • Weight: This ranges from just a few micrograms all the way up to a blue whale weighing 190,000 kilograms!
  • Number of hearts: This can range from zero up to five, in the case of the earthworm.

For a mathematically based algorithm to compare these features, the differences in scale, range, and units can be difficult to handle. In many algorithms, the weight would probably be the most influential feature simply because its values are numerically larger, not because it is actually a more informative feature.
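To make this concrete, here is a small sketch (the animals and numbers below are made up purely for illustration): when we compute a Euclidean distance between raw feature vectors, the weight term swamps everything else.

import numpy as np

# Hypothetical (legs, weight in kg, hearts) vectors for three animals
cat = np.array([4, 4.0, 1])
dog = np.array([4, 30.0, 1])
whale = np.array([0, 150000.0, 1])

# The distances are decided almost entirely by the weight feature;
# legs and hearts contribute practically nothing.
print(np.linalg.norm(cat - dog))    # about 26
print(np.linalg.norm(cat - whale))  # about 150,000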

One possible strategy is to normalize the features so that they all have the same range, or to turn the values into categories such as small, medium, and large. Suddenly, the large differences between the types of features have less of an impact on the algorithm, which can lead to large increases in accuracy.

Pre-processing can also be used to choose only the more effective features, create new features, and so on. Pre-processing in scikit-learn is done through Transformer objects, which take a dataset in one form and return an altered dataset after some transformation of the data. These don't have to be numerical, as Transformers are also used to extract features; however, in this section, we will stick with pre-processing.
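As a quick illustration of that interface, here is a minimal sketch using MinMaxScaler, one of scikit-learn's Transformer objects, on a small made-up array:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

# A tiny made-up dataset: the second feature has a much larger range
X_example = np.array([[1.0,  20.0],
                      [2.0,  60.0],
                      [3.0, 100.0]])

# Every Transformer follows the same pattern: fit() learns any parameters
# it needs (here, each feature's minimum and maximum) and transform()
# returns the altered dataset. fit_transform() combines the two steps.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_example)
print(X_scaled)  # every column now lies in the range 0 to 1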

We can show an example of the problem by breaking the Ionosphere dataset. While this is only an example, many real-world datasets have problems of this form.

  1. First, we create a copy of the array so that we do not alter the original dataset:
X_broken = np.array(X)
  2. Next, we break the dataset by dividing every second feature by 10:
X_broken[:,::2] /= 10

In theory, this should not have a great effect on the result; the values within each feature keep the same relative ordering. The major issue is that the scale has changed: the features we divided are now ten times smaller, so they no longer sit on the same scale as the untouched features. We can see the effect of this by computing the accuracy:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

estimator = KNeighborsClassifier()
original_scores = cross_val_score(estimator, X, y, scoring='accuracy')
print("The original average accuracy is {0:.1f}%".format(np.mean(original_scores) * 100))
broken_scores = cross_val_score(estimator, X_broken, y, scoring='accuracy')
print("The 'broken' average accuracy is {0:.1f}%".format(np.mean(broken_scores) * 100))

This testing methodology gives a score of 82.3 percent for the original dataset, which drops down to 71.5 percent on the broken dataset. We can fix this by scaling all the features to the range 0 to 1.
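As a sketch of that fix, assuming MinMaxScaler is used to rescale each feature to the 0 to 1 range (the variable names here are illustrative):

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature of the broken dataset to the range 0 to 1,
# then repeat the same cross-validation as before
X_rescaled = MinMaxScaler().fit_transform(X_broken)
rescaled_scores = cross_val_score(estimator, X_rescaled, y, scoring='accuracy')
print("The rescaled average accuracy is {0:.1f}%".format(np.mean(rescaled_scores) * 100))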
