
Setting parameters

Almost all data mining algorithms have parameters that the user can set, letting the algorithm adapt to the specific dataset, rather than only being applicable across a small and specific range of problems. Setting these parameters can be quite difficult, as choosing good parameter values is often highly reliant on features of the dataset.

The nearest neighbor algorithm has several parameters, but the most important one is the number of nearest neighbors to use when predicting the class of an unseen sample. In scikit-learn, this parameter is called n_neighbors. In the following figure, we show that when this number is too low, a single randomly labeled sample can cause an error. In contrast, when it is too high, the actual nearest neighbors have a smaller effect on the result:

In figure (a), on the left-hand side, we would usually expect to classify the test sample (the triangle) as a circle. However, if n_neighbors is 1, the single red diamond in this area (likely a noisy sample) causes the sample to be predicted as a diamond. In figure (b), on the right-hand side, we would usually expect to classify the test sample as a diamond. However, if n_neighbors is 7, the three nearest neighbors (which are all diamonds) are outvoted by a large number of circle samples. This can make nearest neighbors a difficult algorithm to tune, as the parameter value can make a huge difference. Luckily, most of the time the specific parameter value does not greatly affect the end result, and the standard values (usually 5 or 10) are often near enough.
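To make the effect concrete, here is a minimal sketch using made-up two-dimensional points (not the figure's exact data) that reproduces the situation in figure (a): a noisy "diamond" sits right next to the test point, so n_neighbors=1 predicts a diamond, while n_neighbors=3 lets the surrounding circles outvote it:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Three "circles" (class 0) near the test point, one noisy "diamond"
# (class 1) among them, and a separate cluster of genuine diamonds.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],   # circles
                    [1.0, 1.1],                            # noisy diamond
                    [3.0, 3.0], [3.2, 2.8], [2.8, 3.2]])   # diamonds
y_train = np.array([0, 0, 0, 1, 1, 1, 1])
test_point = np.array([[1.0, 1.08]])  # closest to the noisy diamond

for n_neighbors in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X_train, y_train)
    print(n_neighbors, clf.predict(test_point))  # 1 -> diamond, 3 -> circle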

With that in mind, we can test out a range of values, and investigate the impact that this parameter has on performance. If we want to test a number of values for the n_neighbors parameter, for example, each of the values from 1 to 20, we can rerun the experiment many times by setting n_neighbors and observing the result. The code below does this, storing the values in the avg_scores and all_scores variables.

# These imports may already be in scope from earlier in the chapter
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    scores = cross_val_score(estimator, X, y, scoring='accuracy')
    avg_scores.append(np.mean(scores))
    all_scores.append(scores)

We can then plot the relationship between the value of n_neighbors and the accuracy. First, we tell the Jupyter Notebook that we want to show plots inline in the notebook itself:

%matplotlib inline

We then import pyplot from the matplotlib library and plot the parameter values alongside average scores:

from matplotlib import pyplot as plt

plt.plot(parameter_values, avg_scores, '-o')

While there is a lot of variance, the plot shows a decreasing trend in accuracy as the number of neighbors increases. With regard to the variance, you can expect large amounts of it whenever you perform evaluations of this nature. To compensate, update the code to run 100 tests per value of n_neighbors, as sketched below.
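The text does not prescribe exactly how to run the extra tests, so the following is a minimal sketch of one approach: repeating the cross-validation 100 times per parameter value, reshuffling the StratifiedKFold split on each run, then averaging. It assumes X and y are still loaded from the earlier experiment.

import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

avg_scores = []
all_scores = []
parameter_values = list(range(1, 21))  # Include 20
for n_neighbors in parameter_values:
    estimator = KNeighborsClassifier(n_neighbors=n_neighbors)
    run_means = []
    for run in range(100):
        # Reshuffle the folds each run so every test uses a different split
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=run)
        scores = cross_val_score(estimator, X, y, scoring='accuracy', cv=cv)
        run_means.append(np.mean(scores))
    all_scores.append(run_means)
    avg_scores.append(np.mean(run_means))

Plotting avg_scores from this version with the same plt.plot call should produce a noticeably smoother curve, since each point now averages 100 runs of fold scores rather than a single run.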
