
Introduction

In previous chapters, we evaluated a number of different approaches to data clustering, including k-means and hierarchical clustering. While k-means is the simplest form of clustering, it is still extremely powerful in the right scenarios. In situations where k-means can't capture the complexity of the dataset, hierarchical clustering proves to be a strong alternative.

One of the key challenges in unsupervised learning is that you are presented with a collection of feature data but no complementary labels telling you what the target is. While you may not get a discrete view of the target labels, you can recover some semblance of structure from the data by clustering similar data points together and examining what the members of each group have in common. The first approach we covered to achieve this goal is k-means. K-means clustering works best for simple data challenges where speed is paramount: simply assigning each data point to its closest cluster centroid does not require a lot of computational overhead. However, the approach struggles with higher-dimensional datasets, and it is also not ideal if you don't know in advance how many clusters you are looking for. An example we worked with in Chapter 2, Hierarchical Clustering, entailed looking at chemical profiles to determine which wines belonged together in a disorganized shipment. That exercise only worked well because we knew that three wine types had been ordered; k-means would have been far less successful if we had no idea what the original order contained.
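As a quick refresher, the following is a minimal sketch of that workflow using scikit-learn on a synthetic dataset. The data, the n_clusters=3 choice, and the other parameter values are illustrative assumptions, not the chapter's own code:

```python
# A minimal k-means refresher (an illustrative sketch). Note that n_clusters
# must be supplied up front -- the limitation discussed above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a feature matrix such as the wine chemical profiles
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Each point is assigned to its nearest cluster centroid
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster assignments for the first ten points
print(kmeans.cluster_centers_)   # the three learned centroids
```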

The second clustering approach we explored was hierarchical clustering. This method can work in two ways – either agglomerative or divisive. Agglomerative clustering works with a bottom-up approach, treating each data point as its own cluster and recursively grouping points together according to a linkage criterion. Divisive clustering works in the opposite way, treating all data points as one large cluster and recursively breaking them down into smaller clusters. This approach has the benefit of considering the entire data distribution when deciding where to split; however, it is typically not implemented in practice due to its greater complexity. Hierarchical clustering is a strong contender when you know little about the data in advance. Using a dendrogram, you can visualize all the splits in your data and decide what number of clusters makes sense after the fact. This can be really helpful in your specific use case; however, it comes at a higher computational cost than k-means.
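To make the dendrogram idea concrete, here is a minimal sketch using SciPy's linkage and dendrogram functions on a synthetic dataset; the ward linkage and the final cut into three clusters are illustrative assumptions rather than recommendations:

```python
# A minimal agglomerative-clustering sketch using SciPy (illustrative only).
# linkage() builds the bottom-up merge tree; dendrogram() visualizes every
# split so that the number of clusters can be chosen after the fact.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, n_features=2, random_state=42)

# Ward is one common linkage criterion; 'single', 'complete', and 'average'
# are alternatives
Z = linkage(X, method='ward')

dendrogram(Z)
plt.title('Dendrogram of the synthetic data')
plt.show()

# Cut the merge tree into a chosen number of clusters after inspecting the plot
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```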

In this chapter, we will cover a clustering approach that serves us best in the sphere of highly complex data: Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Canonically, this method has been seen as a high performer on datasets containing large amounts of densely interspersed data. Let's walk through why it does so well in these use cases.
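As a brief preview, the following is a hedged sketch of DBSCAN applied with scikit-learn to two interleaved half-moons, the kind of non-spherical, densely packed structure where k-means typically falls short; the eps and min_samples values are illustrative assumptions, not tuned recommendations:

```python
# A preview sketch of DBSCAN on densely interspersed, non-spherical data
# (two interleaved half-moons). The eps and min_samples values below are
# illustrative, not tuned.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# DBSCAN groups points that are densely packed together and labels
# sparse outliers as noise (cluster label -1)
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

print(set(labels))   # the cluster labels found; -1 marks noise points
```

We will unpack how the eps neighborhood and min_samples threshold drive this behavior over the course of the chapter.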
