
Agglomerative versus Divisive Clustering

Our instances of hierarchical clustering so far have all been agglomerative – that is, they have been built from the bottom up. While this is the most common approach for this type of clustering, it is important to know that it is not the only way a hierarchy can be created. The opposite hierarchical approach, built from the top down, can also be used to create your taxonomy. This approach is called Divisive Hierarchical Clustering and works by starting with all the data points in your dataset in one massive cluster, which is then repeatedly split into smaller clusters. Many of the internal mechanics of the divisive approach will prove to be quite similar to the agglomerative approach:

Figure 2.15: Agglomerative versus divisive hierarchical clustering

As with most problems in unsupervised learning, the best approach is often highly dependent on the problem you are trying to solve.

Imagine that you are an entrepreneur who has just bought a new grocery store and needs to stock it with goods. You receive a large shipment of food and drink in a container, but you've lost track of all the shipment information! In order to most effectively sell your products, you must group similar products together (your store will be a huge mess if you just put everything on the shelves in a random order). Setting out on this organizational goal, you can take either a bottom-up or top-down approach. On the bottom-up side, you will go through the shipping container and think of everything as disorganized – you will then pick up a random object and find its most similar product. For example, you may pick up apple juice and realize that it makes sense to group it together with orange juice. With the top-down approach, you will view everything as organized in one large group. Then, you will move through your inventory and split the groups based on the largest differences in similarity. For example, you may originally think that apple juice and tofu go together, but on second thoughts, they are really different. Therefore, you will break them into smaller, dissimilar groups.

In general, it helps to think of agglomerative as the bottom-up approach and divisive as the top-down approach – but how do they trade off in performance? Due to its greedy nature, the agglomerative approach has the potential to be fooled by local neighbors and miss the larger implications of the clusters it forms at any given time. On the flip side, the divisive approach has the benefit of seeing the entire data distribution from the beginning and choosing the best way to break down clusters. This insight into what the entire dataset looks like is helpful for potentially creating more accurate clusters and should not be overlooked. Unfortunately, a top-down approach typically pays for its potential gain in accuracy with greater complexity. In practice, an agglomerative approach works most of the time and should be the preferred starting point when it comes to hierarchical clustering. If, after reviewing the hierarchies, you are unhappy with the results, it may help to take a divisive approach.
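The SciPy and scikit-learn tools used in this chapter are agglomerative, so if you want to experiment with the top-down idea, you will typically have to approximate it yourself. The following is a minimal sketch of a divisive-style scheme that recursively splits the data in two using 2-means partitions; the function name divisive_cluster and its stopping parameters are illustrative, not a library API, and a 2-means split is only a rough stand-in for choosing the single most dissimilar split:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_cluster(X, max_depth=3, min_size=20, depth=0):
    # Stop splitting once the tree is deep enough or the group is too small
    labels = np.zeros(len(X), dtype=int)
    if depth >= max_depth or len(X) < min_size:
        return labels
    # Approximate the "best" split of this group with a two-cluster k-means
    halves = KMeans(n_clusters=2, random_state=0).fit_predict(X)
    left, right = np.where(halves == 0)[0], np.where(halves == 1)[0]
    # Recurse into each half, offsetting the right-hand labels to keep them unique
    left_labels = divisive_cluster(X[left], max_depth, min_size, depth + 1)
    right_labels = divisive_cluster(X[right], max_depth, min_size, depth + 1)
    labels[left] = left_labels
    labels[right] = right_labels + left_labels.max() + 1
    return labels

X, _ = make_blobs(n_samples=300, centers=4, random_state=800)
print(np.unique(divisive_cluster(X), return_counts=True))

Each level of recursion corresponds to one layer of the top-down hierarchy, which hints at why the divisive route tends to be more expensive than the agglomerative one.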

Exercise 8: Implementing Agglomerative Clustering with scikit-learn

In most real-world use cases, you will likely find yourself implementing hierarchical clustering with a package that abstracts everything away, such as scikit-learn. Scikit-learn is a free package that is indispensable when it comes to machine learning in Python. It conveniently provides highly optimized forms of the most popular algorithms, such as regression, classification, and, of course, clustering. By using an optimized package such as scikit-learn, your work becomes much easier. However, you should only use it once you fully understand how hierarchical clustering works, as covered in the prior sections. The following exercise will compare two potential routes that you can take when forming clusters – using SciPy and scikit-learn. By completing the exercise, you will learn the pros and cons of each, and which suits you best from a user perspective:

  1. Scikit-learn makes implementation as easy as just a few lines of code:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # scikit-learn route: configure the model with a distance function and linkage criterion
    # (newer scikit-learn releases rename the affinity parameter to metric)
    ac = AgglomerativeClustering(n_clusters=8, affinity="euclidean", linkage="average")

    # Generate a toy dataset of 1,000 two-dimensional points spread across 8 blobs
    X, y = make_blobs(n_samples=1000, centers=8, n_features=2, random_state=800)

    # SciPy route: build the linkage matrix, then cut it at a distance threshold of 3
    distances = linkage(X, method="centroid", metric="euclidean")
    sklearn_clusters = ac.fit_predict(X)
    scipy_clusters = fcluster(distances, 3, criterion="distance")

    First, we assign the model to the ac variable by passing in parameters that we are familiar with, such as affinity (the distance function) and linkage (explore your options as we did in Activity 2, Implementing Linkage Criteria).

  2. After instantiating the model, we simply pass the dataset we are interested in to .fit_predict() to determine the cluster memberships, and assign the result to an additional variable.
  3. We can then compare how each of the approaches works by plotting the final cluster results. Let's take a look at the clusters from the scikit-learn approach:

    plt.figure(figsize=(6,4))
    plt.title("Clusters from Scikit-Learn Approach")
    plt.scatter(X[:, 0], X[:, 1], c=sklearn_clusters, s=50, cmap='tab20b')
    plt.show()

    Here is the output for the clusters from the scikit-learn approach:

Figure 2.16: A plot of the Scikit-Learn approach

Take a look at the clusters from the SciPy approach:

plt.figure(figsize=(6,4))
plt.title("Clusters from SciPy Approach")
plt.scatter(X[:, 0], X[:, 1], c=scipy_clusters, s=50, cmap='tab20b')
plt.show()

The output is as follows:

Figure 2.17: A plot of the SciPy approach

As you can see in our example problem, the two converge to basically the same clusters. While this is great from a toy-problem perspective, you will soon learn, in the next activity, that small changes to the input parameters can lead to wildly different results!
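If you would rather not judge the agreement by eye, one optional follow-up is to score how similar the two labelings are with the adjusted Rand index, which ignores the arbitrary numbering of the clusters. The snippet below reuses the sklearn_clusters and scipy_clusters arrays from the exercise:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the two partitions are identical (up to renaming the labels);
# values near 0.0 indicate no more agreement than random labeling
agreement = adjusted_rand_score(sklearn_clusters, scipy_clusters)
print("Agreement between the scikit-learn and SciPy clusters: {:.3f}".format(agreement))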

Activity 3: Comparing k-means with Hierarchical Clustering

You are managing a store's inventory and receive a large shipment of wine, but the brand labels have fallen off the bottles during transit. Fortunately, your supplier has provided you with the chemical readings for each bottle, along with their respective serial numbers. Unfortunately, you aren't able to open each bottle of wine and taste test the difference – you must find a way to group the unlabeled bottles back together according to their chemical readings! You know from the order list that you ordered three different types of wine and are given only two wine attributes to group the wine types back together. In this activity, we will be using the wine dataset.

Note

The wine dataset can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/wine/. It can be accessed at https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson02/Activity03.

UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science

The aim of this activity is to implement k-means and hierarchical clustering on the wine dataset and to explore which approach ends up being more accurate or easier for you to use. You can try different combinations of scikit-learn implementations and use helper functions in SciPy and NumPy. You can use the silhouette score to compare the different clustering methods and visualize the clusters on a graph.

Expected Outcome:

After completing this activity, you will have gained an understanding of how k-means and hierarchical clustering work on similar datasets. You will likely notice that one method performs better than the other depending on how the data is shaped. Another key outcome from this activity is gaining an understanding of how important hyperparameters are in any given use case.

Here are the steps to complete this activity:

  1. Import the necessary packages from scikit-learn (KMeans, AgglomerativeClustering, and silhouette_score).
  2. Read the wine dataset into a pandas DataFrame and print a small sample.
  3. Visualize the wine dataset to understand its data structure.
  4. Use the sklearn implementation of k-means on the wine dataset, knowing that there are three wine types.
  5. Use the sklearn implementation of hierarchical clustering on the wine dataset.
  6. Plot the predicted clusters from k-means.
  7. Plot the predicted clusters from hierarchical clustering.
  8. Compare the silhouette score of each clustering method.
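As a starting point only (the full solution is on page 312), here is a minimal sketch of how steps 4 to 8 might look, assuming the two chosen chemical-reading columns have already been loaded into a NumPy array named X_wine (an illustrative name, not part of the dataset):

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Fit both models with three clusters, since the order list mentions three wine types
km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X_wine)
ac_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X_wine)

# Higher silhouette scores indicate tighter, better-separated clusters
print("k-means silhouette: {:.3f}".format(silhouette_score(X_wine, km_labels)))
print("agglomerative silhouette: {:.3f}".format(silhouette_score(X_wine, ac_labels)))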

Plot the predicted clusters from the k-means clustering method as follows:

Figure 2.18: The expected clusters from the k-means method

Plot the predicted clusters from the agglomerative clustering method, as follows:

Figure 2.19: The expected clusters from the agglomerative method

Note

The solution for this activity is on page 312.
