- Learning Data Mining with Python(Second Edition)
- Robert Layton
- 329字
- 2021-07-02 23:40:05
Nearest neighbors
The Nearest neighbors algorithm is our new sample. We take the most similar samples
and predict the same class that most of these nearby samples have. This vote is often simply a simple count,although more complicated methods do exist such as weighted voting.
As an example in the below diagram, we wish to predict the class of the triangle, based on which class it is more like (represented here by having similar objects closer together). We seek the three nearest neighbors, which are the two diamonds and one square within the drawn circle. There are more diamonds than circles, and the predicted class for the triangle is, therefore, a diamond:

Nearest neighbors are used for nearly any dataset - however, it can be computationally expensive to compute the distance between all pairs of samples. For example, if there are ten samples in the dataset, there are 45 unique distances to compute. However, if there are 1000 samples, there are nearly 500,000! Various methods exist for improving this speed, such as the use of tree structures for distance computation. Some of these algorithms can be quite complex, but thankfully a version is implemented in scikit-learn already, enabling us to classify on larger datasets. As these tree structures are the default in scikit-learn, we do not need to configure anything to use it.
Nearest neighbors can do poorly in categorical-based datasets, with categorical features, and another algorithm should be used for these instead. Nearest Neighbor's issue is due to the difficulty in comparing differences in categorical values, something better left to an algorithm that gives weight to each feature's importance. Comparing categorical features can be done with some distance metrics or pre-processing steps such as one hot encoding that we use in later chapters. Choosing the correct algorithm for the task is one of the difficult issues in data mining, often it can be easiest to test a set of algorithms and see which performs best on your task.
- Docker and Kubernetes for Java Developers
- The Computer Vision Workshop
- Raspberry Pi 2 Server Essentials
- TMS320LF240x芯片原理、設計及應用
- Canvas Cookbook
- Web App Testing Using Knockout.JS
- Java自然語言處理(原書第2版)
- MongoDB Administrator’s Guide
- 走近SDN/NFV
- C++ Data Structures and Algorithm Design Principles
- Java EE互聯網輕量級框架整合開發:SSM+Redis+Spring微服務(上下冊)
- Python網絡爬蟲從入門到實踐
- Switching to Angular 2
- Instant SQL Server Analysis Services 2012 Cube Security
- Test-Driven Java Development(Second Edition)