
Predicting the label of a new data point

The other really helpful method that knn provides is called findNearest. It can be used to predict the label of a new data point based on its nearest neighbors.

Thanks to our generate_data function, it is actually really easy to generate a new data point! We can think of a new data point as a dataset of size 1:

In [17]: newcomer, _ = generate_data(1)

Our function also returns a random label, but we are not interested in it here. Instead, we want to predict the label using our trained classifier! In Python, we can discard a return value we don't need by assigning it to an underscore (_).
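In case you skipped the earlier sections, here is a minimal sketch of what generate_data looks like (assuming, as earlier in this chapter, integer coordinates in the range [0, 100) and binary labels, with the data cast to the float32 format that OpenCV expects):

import numpy as np

def generate_data(num_samples, num_features=2):
    """Randomly generate `num_samples` data points with binary labels."""
    data = np.random.randint(0, 100, size=(num_samples, num_features))
    labels = np.random.randint(0, 2, size=(num_samples, 1))
    return data.astype(np.float32), labels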

Let's have a look at our town map again. We will plot the training set as we did earlier, but also add the new data point as a green circle (since we don't know yet whether it is supposed to be a blue square or a red triangle):
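As a reminder, a minimal sketch of the plot_data helper from earlier in the chapter might look like this (assuming blue squares and red triangles are drawn as scatter plots on a shared set of axes):

import matplotlib.pyplot as plt

def plot_data(all_blue, all_red):
    """Scatter-plot the blue squares and red triangles on the town map."""
    plt.scatter(all_blue[:, 0], all_blue[:, 1], c='b', marker='s', s=180)
    plt.scatter(all_red[:, 0], all_red[:, 1], c='r', marker='^', s=180)
    plt.xlabel('x coordinate (feature 1)')
    plt.ylabel('y coordinate (feature 2)')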

In [18]: plot_data(blue, red)
... plt.plot(newcomer[0, 0], newcomer[0, 1], 'go', markersize=14);

You can add a semicolon to the plt.plot function call in order to suppress its output, the same as in MATLAB.

The preceding code will produce the following figure (minus the rings):

The entire training set, plus a new data point (green) whose label has yet to be determined

If you had to guess based on its neighbors, what label would you assign to the new data point: blue or red?

Well, it depends, doesn't it? If we look at the house closest to it (the one at roughly (x, y) = (85, 75), circled with a dotted line in the preceding figure), we would probably label the new data point a red triangle as well. This is exactly what our classifier would predict for k=1:

In [19]: ret, results, neighbor, dist = knn.findNearest(newcomer, 1)
... print("Predicted label:\t", results)
... print("Neighbor's label:\t", neighbor)
... print("Distance to neighbor:\t", dist)
Out[19]: Predicted label:        [[ 1.]]
         Neighbor's label:       [[ 1.]]
         Distance to neighbor:   [[ 250.]]

Here, knn reports that the nearest neighbor is 250 arbitrary units away, that this neighbor has label 1 (which we said corresponds to red triangles), and that the new data point should therefore also have label 1. The same would be true if we looked at the k=2 and k=3 nearest neighbors. But we want to be careful not to pick an arbitrary even number for k. Why is that? You can see the reason in the preceding figure: among the six nearest neighbors within the dashed circle, there are three blue squares and three red triangles, so we have a tie!

In the case of a tie, OpenCV's implementation of k-NN will prefer the neighbors with a closer overall distance to the data point.
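You can observe this tie directly with a quick follow-up query (hypothetical code, not from the chapter; compare the first six labels in the k=7 output further below, which split three against three):

# Query the six nearest neighbors and inspect their labels directly.
# With three 0s and three 1s among them, the prediction comes down to
# OpenCV's distance-based tie-breaking.
ret, results, neighbors, dist = knn.findNearest(newcomer, 6)
print("Neighbors' labels:", neighbors)
print("Predicted label:", results)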

Finally, what would happen if we dramatically widened our search window and classified the new data point based on its k=7 nearest neighbors (circled with a solid line in the figure mentioned earlier)?

Let's find out by calling the findNearest method with k=7 neighbors:

In [20]: ret, results, neighbors, dist = knn.findNearest(newcomer, 7)
... print("Predicted label:\t", results)
... print("Neighbors' labels:\t", neighbors)
... print("Distance to neighbors:\t", dist)
Out[20]: Predicted label:        [[ 0.]]
         Neighbors' labels:      [[ 1. 1. 0. 0. 0. 1. 0.]]
         Distance to neighbors:  [[ 250. 401. 784. 916. 1073. 1360. 4885.]]

Suddenly, the predicted label is 0 (blue square). The reason is that we now have four neighbors within the solid circle that are blue squares (label 0), and only three that are red triangles (label 1). So the majority vote would suggest making the newcomer a blue square as well.
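You can double-check the vote by counting the labels in the neighbors array returned above (a quick sanity check using plain NumPy):

import numpy as np

# Tally the votes among the seven returned neighbor labels.
num_blue = int(np.sum(neighbors == 0))  # 4 blue squares
num_red = int(np.sum(neighbors == 1))   # 3 red triangles
print('blue: %d, red: %d' % (num_blue, num_red))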

As you can see, the outcome of k-NN changes with the choice of k. However, we often do not know beforehand which value of k is the most suitable. A naive solution to this problem is simply to try a bunch of values for k and see which one performs best, as sketched below. We will learn more sophisticated solutions in later chapters of this book.
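As an illustration of that naive approach, the following sketch (hypothetical code, not from this chapter) scores a range of k values on a freshly generated test set; because the data is random, the accuracies will vary from run to run:

import numpy as np

# Generate a small held-out test set and score each candidate k on it.
test_data, test_labels = generate_data(20)
for k in range(1, 8):
    _, results, _, _ = knn.findNearest(test_data, k)
    acc = np.mean(results == test_labels.astype(np.float32))
    print('k=%d: accuracy=%.2f' % (k, acc))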
