
  • Machine Learning with Swift
  • Alexander Sosnovshchenko

Understanding the KNN algorithm

To recognize different types of motion activity, we will train a KNN classifier. The idea of the method is to find the k training samples closest to the sample with an unknown label, and predict the label as the most frequent class among those k neighbors. That's it:

Figure 3.5: KNN classification algorithm. The new data point marked with ? gets classified based on the classes of its neighbors.
Note how the choice of the number of neighbors affects the classification result.

In fact, the algorithm is so simple that it's tempting to formulate it in more complicated terms. Let's do it. The secret sauce of KNN is a distance metric: a function that defines how close two samples are to each other. We have already discussed several of them: Euclidean, Manhattan, Minkowski, edit distance, and DTW. Following this terminology, samples are points in some n-dimensional space, where n equals the number of features in each sample. This space is called the feature space, and the samples are distributed in it as clouds of points. Classification of an unknown data point happens in three steps (a code sketch follows the list):

  1. Calculate the distances from the point to all points in the training set
  2. Choose the k closest neighbors of the unknown point
  3. Perform a majority vote among them
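
Here is a minimal Swift sketch of these three steps, assuming each sample is a plain array of Double features. The Sample type, the euclideanDistance helper, and the classify function are illustrative names, not code from the book:

```swift
import Foundation

// A labeled training sample: a point in feature space plus its class label.
struct Sample {
    let features: [Double]
    let label: String
}

// Euclidean distance between two feature vectors of equal length.
func euclideanDistance(_ a: [Double], _ b: [Double]) -> Double {
    precondition(a.count == b.count, "Samples must have the same number of features")
    return sqrt(zip(a, b).reduce(0.0) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) })
}

// Classify an unknown point by a majority vote among its k nearest neighbors.
func classify(_ point: [Double], trainingSet: [Sample], k: Int) -> String? {
    // Step 1: distances from the point to every training sample.
    let distances = trainingSet.map { sample in
        (sample: sample, distance: euclideanDistance(point, sample.features))
    }
    // Step 2: the k closest neighbors.
    let neighbors = distances.sorted { $0.distance < $1.distance }.prefix(k)
    // Step 3: majority vote over the neighbors' labels.
    var votes = [String: Int]()
    for neighbor in neighbors {
        votes[neighbor.sample.label, default: 0] += 1
    }
    return votes.max { $0.value < $1.value }?.key
}

// Toy usage with 2-D feature vectors.
let training = [
    Sample(features: [1.0, 1.0], label: "walking"),
    Sample(features: [1.2, 0.9], label: "walking"),
    Sample(features: [5.0, 5.1], label: "running"),
    Sample(features: [4.8, 5.3], label: "running")
]
print(classify([1.1, 1.0], trainingSet: training, k: 3) ?? "unknown")  // prints "walking"
```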

The surface that separates one class of points from another class is known as a decision boundary. The KNN algorithm creates piecewise linear decision boundaries that can approximate a decision boundary of any complexity by adding more and more training samples:

Figure 3.6: The Voronoi cell graph shows the closest neighbor at each point with a color. Depending on the distance metric you choose, the graph looks quite different. From left to right: Manhattan (c = 1), Euclidean (c = 2), and Minkowski (c = 3) distance metrics.
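As a rough illustration of how the c parameter generalizes the metrics named in the caption, here is a minimal Minkowski distance sketch in Swift; the function name and example points are illustrative, not taken from the book:

```swift
import Foundation

// Minkowski distance with parameter c: Manhattan is the c = 1 case,
// Euclidean is the c = 2 case.
func minkowskiDistance(_ a: [Double], _ b: [Double], c: Double) -> Double {
    let sum = zip(a, b).reduce(0.0) { $0 + pow(abs($1.0 - $1.1), c) }
    return pow(sum, 1.0 / c)
}

let p = [0.0, 0.0], q = [3.0, 4.0]
print(minkowskiDistance(p, q, c: 1))  // 7.0 (Manhattan)
print(minkowskiDistance(p, q, c: 2))  // 5.0 (Euclidean)
print(minkowskiDistance(p, q, c: 3))  // ≈ 4.5
```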
Algorithms similar to KNN are also known as non-generalizing machine learning. In Chapter 6, Linear Regression and Gradient Descent, we will discuss linear regression, an algorithm that constructs a general representation of all the data points (a straight line), because it assumes that the data points lie along a line. Unlike linear regression, KNN makes no assumption about the underlying structure of the data; it just stores all the training samples. Both approaches have their advantages and downsides.

You may think that this algorithm is too simple to be used for anything but toy tasks. But over the years, KNN has been successfully employed for a wide range of problems, such as handwriting recognition and satellite photo classification. It's also worth noting that it's easy to turn this classification algorithm into a regression: you just need to replace the categorical labels with real numbers and add an interpolation function.
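
A rough sketch of that regression variant, assuming real-valued targets and using an unweighted average of the neighbors' values as the interpolation function (all names here are illustrative, not the book's code):

```swift
import Foundation

// A training sample with a real-valued target instead of a class label.
struct RegressionSample {
    let features: [Double]
    let target: Double
}

func euclidean(_ a: [Double], _ b: [Double]) -> Double {
    sqrt(zip(a, b).reduce(0.0) { $0 + ($1.0 - $1.1) * ($1.0 - $1.1) })
}

// Predict a value for an unknown point as the mean target of its k nearest neighbors.
func knnRegress(_ point: [Double], trainingSet: [RegressionSample], k: Int) -> Double {
    let neighbors = trainingSet
        .map { sample in (target: sample.target, distance: euclidean(point, sample.features)) }
        .sorted { $0.distance < $1.distance }
        .prefix(k)
    // The simplest interpolation: an unweighted mean of the neighbors' targets.
    return neighbors.reduce(0.0) { $0 + $1.target } / Double(neighbors.count)
}

let samples = [
    RegressionSample(features: [1.0], target: 2.0),
    RegressionSample(features: [2.0], target: 4.0),
    RegressionSample(features: [3.0], target: 6.0)
]
print(knnRegress([1.5], trainingSet: samples, k: 2))  // 3.0
```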

Parametric versus non-parametric models
Many restrictions of linear regression come from the fact that it assumes the data is normally distributed. The class of statistical models that make explicit assumptions about the statistical distribution underlying the data is called parametric models.

Unlike linear regression, KNN makes no assumptions about the distribution from which the samples are generated. That's why we call such models non-parametric. This is the right tool to choose in situations where the data has an unusual distribution and the decision boundary is irregular.