
K-nearest neighbors

In our previous efforts, we built models with coefficients or, to put it another way, parameter estimates for each of our included features. With KNN, we have no parameter estimates, as the learning method is so-called instance-based learning. In short, labeled examples (inputs and corresponding output labels) are stored, and no action is taken until a new input pattern demands an output value (Battiti and Brunato, 2014, p. 11). This method is commonly called lazy learning, as no specific model parameters are produced. The training instances themselves represent the knowledge. To predict any new instance (a new data point), the training data is searched for the instances that most resemble the instance in question. KNN does this for a classification problem by looking at the closest points, the nearest neighbors, to determine the proper class. The k comes into play by deciding how many neighbors the algorithm should examine, so if k=5, it will consider the five nearest points. A weakness of this method is that all five points are given equal weight in the vote, even if some of them are less relevant to the prediction. We'll look at methods in R that try to alleviate this issue.
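
As a brief illustration (not the chapter's worked example), the following sketch uses the knn() function from the class package on the built-in iris data; the package, data, and k = 5 are assumptions chosen only to show that the labeled training points themselves act as the model:

```r
# A minimal KNN sketch with the class package; the training rows are
# stored as-is and simply vote for each new observation.
library(class)

set.seed(123)
data(iris)

# Split the data into training and test sets
idx   <- sample(nrow(iris), size = 100)
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Classify each test row by a majority vote of its k = 5 nearest
# training points (in practice, standardize the features first)
pred <- knn(train = train, test = test, cl = iris$Species[idx], k = 5)

# Confusion matrix of predicted versus actual classes
table(pred, iris$Species[-idx])
```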

The best way to understand how this works is with a simple visual example of a binary classification learning problem. In the following plot, a tumor is classified as benign or malignant based on two predictive features. The X in the plot indicates a new observation that we would like to predict. If our algorithm considers K=3, the circle encompasses the three observations nearest to the one that we want to score. As the most commonly occurring classification among those neighbors is malignant, the X data point is classified as malignant:
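
The vote just described can be sketched in a few lines of R; the two features and their values below are made up purely to mimic the plot (both the data and the variable names are hypothetical):

```r
# Hypothetical two-feature tumor data: three benign and three malignant points
train <- data.frame(size    = c(1.0, 1.2, 1.1, 3.5, 3.8, 4.0),
                    texture = c(1.1, 0.9, 1.3, 3.6, 3.9, 3.7),
                    class   = c("benign", "benign", "benign",
                                "malignant", "malignant", "malignant"))

# The new observation X that we want to score
new_x <- c(size = 3.4, texture = 3.5)

# Euclidean distance from X to every labeled point
d <- sqrt((train$size - new_x["size"])^2 + (train$texture - new_x["texture"])^2)

# With k = 3, the three nearest neighbors vote and the majority class wins
neighbors <- train$class[order(d)[1:3]]
names(which.max(table(neighbors)))   # "malignant"
```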

Even from this simple example, it's clear that the selection of k for the nearest neighbors is critical. If k is too small, you may have high variance on the test set observations even though you have low bias. On the other hand, as k grows, you may decrease your variance, but the bias may become unacceptable. Cross-validation is necessary to determine the proper k.
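
One way to pick k by cross-validation is sketched below with the caret package; caret, the iris data, and the grid of candidate k values are assumptions for illustration, and any resampling framework would serve the same purpose:

```r
# Tune k by 10-fold cross-validation with caret
library(caret)

set.seed(123)
ctrl <- trainControl(method = "cv", number = 10)

knn_fit <- train(Species ~ ., data = iris,
                 method     = "knn",
                 trControl  = ctrl,
                 preProcess = c("center", "scale"),        # standardize features
                 tuneGrid   = data.frame(k = seq(1, 25, by = 2)))

knn_fit$bestTune   # the k with the best cross-validated accuracy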

It's also important to point out how the distance, or nearness, of the data points in our feature space is calculated. The default distance is Euclidean distance. This is merely the straight-line distance from point A to point B (as the crow flies); equivalently, it's the square root of the sum of the squared differences between the corresponding coordinates. The formula for Euclidean distance, given points A and B with coordinates p1, p2, ... pn and q1, q2, ... qn respectively, would be as follows:
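
$$\mathrm{dist}(A, B) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}$$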

This distance is highly dependent on the scale that the features were measured on, so it's critical to standardize them. Other distance calculations can be used, and neighbors can be weighted according to their distance rather than counted equally. We'll explore this in the upcoming example.
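
A tiny sketch of why standardization matters is shown below, using hypothetical income and age features measured on very different scales:

```r
# Without standardization, the feature with the largest scale dominates
x <- data.frame(income = c(50000, 52000, 90000),
                age    = c(25, 60, 27))

dist(x)          # distances driven almost entirely by income
dist(scale(x))   # after standardizing, both features contribute
```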
