
  • The Unsupervised Learning Workshop
  • Aaron Jones, Christopher Kruger, Benjamin Johnston

Introduction

In previous chapters, we evaluated a number of different approaches to data clustering, including k-means and hierarchical clustering. While k-means is the simplest form of clustering, it is still extremely powerful in the right scenarios. In situations where k-means can't capture the complexity of the dataset, hierarchical clustering proves to be a strong alternative.

One of the key challenges in unsupervised learning is that you are presented with a collection of feature data but no complementary labels telling you what the target values are. While you may not get a discrete view of the target labels, you can recover some semblance of structure from the data by clustering similar points together and examining what they have in common within each group. The first approach we covered to achieve this goal is k-means. K-means clustering works best for simple data challenges where speed is paramount. Assigning each point to the closest cluster centroid does not require a lot of computational overhead; however, this simplicity becomes a challenge with higher-dimensional datasets. K-means is also not ideal if you do not know how many clusters you are looking for. An example we worked with in Chapter 2, Hierarchical Clustering, entailed looking at chemical profiles to determine which wines belonged together in a disorganized shipment. That exercise only worked well because we knew that three wine types had been ordered; k-means would have been far less successful had we not known what the original order contained.
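To make this concrete, here is a minimal sketch of k-means with scikit-learn, loosely mirroring the wine example. The built-in wine dataset and the choice of n_clusters=3 are assumptions used purely for illustration; this is not the exact code from Chapter 2.

```python
# Minimal k-means sketch, assuming scikit-learn is installed.
# The wine dataset and n_clusters=3 stand in for the shipment example.
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X, _ = load_wine(return_X_y=True)        # 13 chemical features per wine
X = StandardScaler().fit_transform(X)    # scale so no single feature dominates

# k-means requires the cluster count up front -- here we "know" that
# three wine types were ordered.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels[:10])                       # cluster assignment for the first samples
```

The key limitation is visible in the call itself: n_clusters must be supplied before fitting, which is exactly the weakness discussed above when the number of groups is unknown.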

The second clustering approach we explored was hierarchical clustering. This method can work in two ways – either agglomerative or divisive. Agglomerative clustering takes a bottom-up approach, treating each data point as its own cluster and recursively merging clusters according to a linkage criterion. Divisive clustering works in the opposite direction, treating all data points as one large cluster and recursively splitting it into smaller ones. The divisive approach has the benefit of a full view of the entire data distribution, since it evaluates where to split; however, it is rarely implemented in practice due to its greater complexity. Hierarchical clustering is a strong contender for your clustering needs when you know nothing about the data. Using a dendrogram, you can visualize every split in your data and decide what number of clusters makes sense after the fact. This can be really helpful in your specific use case; however, it also comes at a higher computational cost than k-means.
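As a rough illustration of the agglomerative, bottom-up workflow, the following sketch builds a dendrogram with SciPy and then cuts the tree into a chosen number of clusters. The synthetic blob data, Ward linkage, and the three-cluster cut are illustrative assumptions rather than the book's exercise code.

```python
# Minimal agglomerative-clustering sketch, assuming SciPy, scikit-learn,
# and Matplotlib are installed. The blob data is purely illustrative.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Build the full merge tree with Ward linkage (one of several linkage criteria).
Z = linkage(X, method="ward")

# The dendrogram shows every merge, so you can decide on a cluster count
# after the fact by choosing where to cut the tree.
dendrogram(Z)
plt.title("Ward linkage dendrogram")
plt.show()

# Cutting the tree into three clusters yields flat labels.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])
```

Unlike k-means, no cluster count is needed to build the tree; the choice is deferred to the cut, at the cost of computing the full hierarchy.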

In this chapter, we will cover a clustering approach that serves us best in the sphere of highly complex data: Density-Based Spatial Clustering of Applications with Noise (DBSCAN). This method has long been regarded as a high performer on datasets that contain a lot of densely interspersed data. Let's walk through why it does so well in these use cases.
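As a preview, here is a minimal DBSCAN sketch using scikit-learn. The two-moons toy data and the eps and min_samples values are illustrative assumptions, chosen only to show that no cluster count needs to be specified and that sparse points are labeled as noise.

```python
# Minimal DBSCAN sketch, assuming scikit-learn is installed.
# The two-moons data and parameter values are illustrative only.
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# DBSCAN grows clusters from dense neighborhoods (radius eps, at least
# min_samples points) and marks points in sparse regions as noise (-1),
# with no need to specify the number of clusters up front.
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids found, plus -1 for any noise points
```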
