- The Unsupervised Learning Workshop
- Aaron Jones, Christopher Kruger, Benjamin Johnston
Clusters as Neighborhoods
Until now, we have explored similarity as a function of Euclidean distance – data points that are closer to any one point can be seen as similar, while those that are farther away in Euclidean space can be seen as dissimilar. This notion appears once again in the DBSCAN algorithm. As its lengthy name (Density-Based Spatial Clustering of Applications with Noise) suggests, the DBSCAN approach expands upon basic distance metric evaluation by also incorporating the notion of density. If there are clumps of data points that all exist in the same area as one another, they can be seen as members of the same cluster:

Figure 3.1: Neighbors have a direct connection to clusters
In the preceding figure, we can see four neighborhoods. The density-based approach has a number of benefits compared to the approaches we've covered so far, which focus exclusively on distance. If you focus only on distance as a clustering threshold, you may find that your clusters make little sense when faced with a sparse feature space containing outliers. Both k-means and hierarchical clustering will group together every data point in the space until no points are left.
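To see this behavior concretely, here is a minimal sketch (assuming scikit-learn is installed, with an illustrative toy dataset) showing that k-means assigns every point, including an obvious outlier, to some cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight clumps plus one far-away outlier
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # clump A
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # clump B
    [50.0, 50.0],                          # outlier
])

# k-means has no concept of noise: every point receives a cluster label
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Note that the outlier is forced into one of the two clusters (or even given a cluster of its own), distorting the result; there is no way for k-means to simply leave it out.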
While hierarchical clustering does provide a partial path around this issue, since you can dictate where clusters are formed using a dendrogram after the clustering run, k-means is the most susceptible to failure because it is the simplest approach to clustering. These pitfalls are less evident when we begin evaluating neighborhood approaches to clustering. In the following dendrogram, you can see an example of the pitfall where all data points are grouped together. Clearly, as you travel down the dendrogram, a lot of potential variation gets grouped together, since every point must be a member of a cluster. This is less of an issue with neighborhood-based clustering:

Figure 3.2: Example dendrogram
By incorporating the notion of neighbor density, DBSCAN lets us leave outliers out of clusters if we choose to, based on the hyperparameters we set at runtime. Only the data points that have enough close neighbors will be seen as members of the same cluster, while those that are farther away can be left as unclustered outliers.
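The same toy scenario can be run through scikit-learn's DBSCAN implementation (a hedged sketch with illustrative values for the `eps` and `min_samples` hyperparameters) to show the outlier being labeled as noise rather than forced into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense clumps plus one far-away outlier
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # clump A
    [8.0, 8.0], [8.1, 7.9], [7.9, 8.1],   # clump B
    [50.0, 50.0],                          # outlier
])

# eps: neighborhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # points DBSCAN considers noise are labeled -1
```

Here the two clumps each form a cluster, while the isolated point receives the special label `-1`, meaning it belongs to no cluster at all. Tightening `eps` or raising `min_samples` makes the algorithm stricter about what counts as a dense neighborhood.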