- Learning Data Mining with Python (Second Edition)
- Robert Layton
Setting parameters
Almost all data mining algorithms have parameters that the user can set, letting a generic algorithm be tuned to a specific dataset rather than only being applicable to a small and specific range of problems. Setting these parameters can be quite difficult, as choosing good parameter values often relies heavily on features of the dataset.
The nearest neighbor algorithm has several parameters, but the most important one is the number of nearest neighbors to use when predicting the class of an unseen sample. In scikit-learn, this parameter is called n_neighbors. In the following figure, we show that when this number is too low, a randomly labeled sample can cause an error. In contrast, when it is too high, the actual nearest neighbors have a lower effect on the result:

In figure (a), on the left-hand side, we would usually expect to classify the test sample (the triangle) as a circle. However, if n_neighbors is 1, the single red diamond in this area (likely a noisy sample) causes the sample to be predicted as a diamond. In figure (b), on the right-hand side, we would usually expect to classify the test sample as a diamond. However, if n_neighbors is 7, the three nearest neighbors (which are all diamonds) are overridden by a large number of circle samples. Choosing a good value for n_neighbors is therefore a difficult problem, as the parameter can make a huge difference. Luckily, most of the time the specific parameter value does not greatly affect the end result, and the standard values (usually 5 or 10) are often near enough.
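The effect described above can be reproduced on a small synthetic dataset. The data below is illustrative, not from the book: a cluster of circles (class 0) containing one noisy diamond (class 1), plus a diamond cluster further away. With n_neighbors=1 the noisy point decides the prediction; with n_neighbors=5 the surrounding circles outvote it.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy dataset: circles (class 0) cluster near the origin,
# diamonds (class 1) cluster around (3, 3), plus one noisy diamond
# placed inside the circle cluster.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.4, 0.6],  # circles
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0],                          # diamonds
              [0.5, 0.5]])                                                 # noisy diamond
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])

# A test sample sitting inside the circle cluster, right next to the noise.
test_sample = np.array([[0.5, 0.4]])

# With a single neighbor, the noisy diamond is the closest point and
# decides the class on its own.
knn1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)

# With five neighbors, four nearby circles outvote the one noisy diamond.
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(knn1.predict(test_sample))  # predicts the noisy class (diamond)
print(knn5.predict(test_sample))  # predicts the surrounding class (circle)
```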
With that in mind, we can test out a range of values, and investigate the impact that this parameter has on performance. If we want to test a number of values for the n_neighbors parameter, for example, each of the values from 1 to 20, we can rerun the experiment many times by setting n_neighbors and observing the result. The code below does this, storing the values in the avg_scores and all_scores variables.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)
We can then plot the relationship between the value of n_neighbors and the accuracy. First, we tell the Jupyter Notebook that we want to show plots inline in the notebook itself:
%matplotlib inline
We then import pyplot from the matplotlib library and plot the parameter values alongside average scores:
from matplotlib import pyplot as plt
plt.plot(parameter_values, avg_scores, '-o')

While there is a lot of variance, the plot shows a decreasing trend as the number of neighbors increases. With regard to the variance, you can expect large amounts of variance whenever you do evaluations of this nature. To compensate, update the code to run 100 tests per value of n_neighbors.
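One way to obtain 100 test results per parameter value is to pass an explicit cross-validation strategy to cross_val_score, such as ShuffleSplit with 100 random train/test splits. This is a sketch of that approach; the Iris dataset stands in here only so the code is self-contained, since the chapter's own dataset is loaded earlier.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset so the sketch runs on its own; in the chapter,
# X and y are already defined.
X, y = load_iris(return_X_y=True)

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    # ShuffleSplit draws 100 independent random train/test splits,
    # giving 100 accuracy estimates per parameter value and smoothing
    # out the variance seen with a handful of folds.
    cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=14)
    scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=cv)
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)
```

Plotting avg_scores against parameter_values as before should now produce a noticeably smoother curve, since each point averages 100 runs instead of a few folds.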