The Unsupervised Learning Workshop
Aaron Jones, Christopher Kruger, Benjamin Johnston
Introduction
In previous chapters, we evaluated a number of different approaches to data clustering, including k-means and hierarchical clustering. While k-means is the simplest form of clustering, it is still extremely powerful in the right scenarios. In situations where k-means can't capture the complexity of the dataset, hierarchical clustering proves to be a strong alternative.
One of the key challenges in unsupervised learning is that you are presented with a collection of feature data but no accompanying labels describing the target. While you cannot see the target labels directly, you can still recover some structure from the data by grouping similar points into clusters and examining what the members of each cluster have in common. The first approach we covered to achieve this goal is k-means. K-means clustering works best for simple data challenges where speed is paramount: assigning each point to its closest cluster centroid requires little computational overhead. It becomes more challenging on higher-dimensional datasets, however, and it is also not ideal if you do not know how many clusters you are looking for. An example we worked with in Chapter 2, Hierarchical Clustering, entailed looking at chemical profiles to determine which wines belonged together in a disorganized shipment. That exercise worked well only because we knew that three wine types had been ordered; k-means would have been far less successful if we had known nothing about the original order.
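As a reminder of how little code this takes, here is a minimal sketch of k-means with scikit-learn. The synthetic two-dimensional dataset and the choice of three clusters are purely illustrative stand-ins for something like the wine chemical profiles, not data from the book's exercises:

```python
# Minimal k-means sketch; the data and parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

# Three loose groups of 2D points standing in for, e.g., chemical profiles
rng = np.random.default_rng(seed=42)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

# k-means needs the number of clusters up front - here we "know" it is 3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(data)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(labels[:10])              # cluster assignments for the first 10 points
```

Note that `n_clusters` must be supplied ahead of time, which is exactly the limitation discussed above when the original order of the shipment is unknown.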
The second clustering approach we explored was hierarchical clustering. This method can work in two ways: agglomerative or divisive. Agglomerative clustering takes a bottom-up approach, treating each data point as its own cluster and recursively merging clusters according to a linkage criterion. Divisive clustering works in the opposite direction, treating all data points as one large cluster and recursively splitting it into smaller clusters. The divisive approach has the benefit of a full view of the data distribution, since it evaluates where to split; however, it is rarely implemented in practice due to its greater complexity. Hierarchical clustering is a strong contender when you know little about the data: using a dendrogram, you can visualize every split and decide after the fact how many clusters make sense. This flexibility can be very helpful in your specific use case, but it comes at a higher computational cost than k-means.
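A minimal sketch of the agglomerative, bottom-up workflow with SciPy follows. The small synthetic dataset, the choice of Ward's linkage, and the cut into two clusters are all illustrative assumptions; the point is only that the cut is chosen after inspecting the dendrogram:

```python
# Minimal agglomerative clustering sketch; data and linkage choice are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(seed=0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(20, 2)),
])

# Bottom-up merges of points using Ward's linkage criterion
merge_history = linkage(data, method="ward")

# Visualize every merge, then decide on a sensible number of clusters
dendrogram(merge_history)
plt.title("Agglomerative clustering dendrogram")
plt.show()

# Cut the tree into two flat clusters once the structure is understood
labels = fcluster(merge_history, t=2, criterion="maxclust")
print(labels)
```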
In this chapter, we will cover a clustering approach that serves us best in the sphere of highly complex data: Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This method has long been regarded as a high performer on datasets with a lot of densely interspersed data. Let's walk through why it does so well in these use cases.
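To set the stage, here is a minimal sketch of DBSCAN with scikit-learn on the classic interleaved half-moons shape, which k-means typically handles poorly. The `eps` and `min_samples` values are illustrative starting points rather than tuned settings:

```python
# Minimal DBSCAN sketch; the dataset and parameter values are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons - dense, non-spherical clusters
data, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed to form a dense region
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels = dbscan.fit_predict(data)

# A label of -1 marks points DBSCAN treats as noise rather than cluster members
print(set(labels))
```

Unlike k-means, DBSCAN does not ask for the number of clusters up front; it discovers them from the density of the data and explicitly sets aside noise points, which is why it shines on densely interspersed datasets.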