Our examples of hierarchical clustering so far have all been agglomerative – that is, they have been built from the bottom up. While this is the most common approach for this type of clustering, it is important to know that it is not the only way a hierarchy can be created. The opposite hierarchical approach, built from the top down, can also be used to create your taxonomy. This approach is called Divisive Hierarchical Clustering and works by starting with all the data points in your dataset in one massive cluster, which is then repeatedly split into smaller clusters. Many of the internal mechanics of the divisive approach will prove to be quite similar to the agglomerative approach:
Figure 2.15: Agglomerative versus divisive hierarchical clustering
As with most questions in unsupervised learning, there is no single best approach – the right choice depends heavily on the problem you are trying to solve.
Imagine that you are an entrepreneur who has just bought a new grocery store and needs to stock it with goods. You receive a large shipment of food and drink in a container, but you've lost track of all the shipment information! In order to sell your products most effectively, you must group similar products together (your store will be a huge mess if you just put everything on the shelves in a random order). Setting out on this organizational goal, you can take either a bottom-up or a top-down approach. On the bottom-up side, you will go through the shipping container and think of everything as disorganized – you will then pick up a random object and find its most similar product. For example, you may pick up apple juice and realize that it makes sense to group it together with orange juice. With the top-down approach, you will view everything as organized in one large group. Then, you will move through your inventory and split the groups based on the largest differences in similarity. For example, you may originally think that apple juice and tofu go together but, on second thought, they are really quite different. Therefore, you will break them into smaller, dissimilar groups.
In general, it helps to think of agglomerative as the bottom-up approach and divisive as the top-down approach – but how do they trade off in performance? Due to its greedy nature, the agglomerative approach has the potential to be fooled by local neighbors and to miss the larger implications of the clusters it forms at any given step. On the flip side, the divisive approach has the benefit of seeing the entire data distribution from the beginning and choosing the best way to break down clusters. This insight into what the entire dataset looks like is helpful for potentially creating more accurate clusters and should not be overlooked. Unfortunately, the top-down approach typically trades that greater accuracy for greater complexity. In practice, an agglomerative approach works most of the time and should be the preferred starting point when it comes to hierarchical clustering. If, after reviewing the hierarchies, you are unhappy with the results, it may help to take a divisive approach.
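To make the top-down idea concrete, here is a minimal sketch of a divisive-style procedure that starts with every point in one cluster and repeatedly bisects the largest remaining cluster using two-cluster k-means. The bisection rule is only one of several reasonable splitting criteria, and this sketch is purely illustrative – it is not the implementation used later in this chapter:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_clustering(X, n_clusters):
    # Top-down: everything starts in a single cluster (label 0)
    labels = np.zeros(len(X), dtype=int)
    while len(np.unique(labels)) < n_clusters:
        # Choose the largest current cluster and split it in two with k-means
        largest = np.argmax(np.bincount(labels))
        mask = labels == largest
        split = KMeans(n_clusters=2, random_state=0).fit_predict(X[mask])
        # Points on the second side of the split get a brand-new label
        labels[np.where(mask)[0][split == 1]] = labels.max() + 1
    return labels

X, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=800)
print(np.bincount(divisive_clustering(X, 4)))  # sizes of the resulting clusters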
Exercise 8: Implementing Agglomerative Clustering with scikit-learn
In most real-world use cases, you will likely find yourself implementing hierarchical clustering with a package that abstracts everything away, such as scikit-learn. Scikit-learn is a free package that is indispensable when it comes to machine learning in Python. It conveniently provides highly optimized forms of the most popular algorithms, such as regression, classification, and, of course, clustering. By using an optimized package such as scikit-learn, your work becomes much easier. However, you should only use it once you fully understand how hierarchical clustering works, as covered in the prior sections. The following exercise will compare two potential routes that you can take when forming clusters – using SciPy and scikit-learn. By completing the exercise, you will learn the pros and cons of each, and which suits you best from a user perspective:
Scikit-learn makes implementation as easy as just a few lines of code:
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
# Instantiate the model: eight clusters, Euclidean distances, and average linkage
ac = AgglomerativeClustering(n_clusters=8, affinity="euclidean", linkage="average")

# Generate the toy dataset of eight blobs that we will cluster
X, y = make_blobs(n_samples=1000, centers=8, n_features=2, random_state=800)
First, we assign the model to the ac variable, passing in parameters that we are familiar with, such as affinity (the distance function) and linkage (explore your options as we did in Activity 2, Implementing Linkage Criteria).
After instantiating the model, we can simply pass the dataset we are interested in to its .fit_predict() method to determine the cluster memberships, and assign the result to an additional variable.
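The plots that follow assume two arrays of cluster labels, sklearn_clusters and scipy_clusters, matching the names used in the plotting code. As a minimal sketch of how they could be produced (the SciPy settings here mirror the average linkage chosen above and are one reasonable choice, not the only one):
# scikit-learn route: fit the model and return one cluster label per point
sklearn_clusters = ac.fit_predict(X)

# SciPy route: build the linkage matrix, then cut it into eight flat clusters
scipy_clusters = fcluster(linkage(X, method='average'), 8, criterion='maxclust')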
We can then compare how the two approaches perform by plotting their final cluster assignments. Let's take a look at the clusters from the scikit-learn approach:
plt.figure(figsize=(6,4))
plt.title("Clusters from Sci-Kit Learn Approach")
plt.scatter(X[:, 0], X[:, 1], c=sklearn_clusters, s=50, cmap='tab20b')
plt.show()
Here is the output for the clusters from the scikit-learn approach:
Figure 2.16: A plot of the Scikit-Learn approach
Take a look at the clusters from the SciPy approach:
plt.figure(figsize=(6,4))
plt.title("Clusters from SciPy Approach")
plt.scatter(X[:, 0], X[:, 1], c=scipy_clusters, s=50, cmap='tab20b')
plt.show()
The output is as follows:
Figure 2.17: A plot of the SciPy approach
As you can see in our example problem, the two approaches converge to basically the same clusters. While this is great from a toy-problem perspective, you will soon learn in the next activity that small changes to the input parameters can lead to wildly different results!
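As a quick illustration of that sensitivity, a sketch like the following (reusing the X generated in the exercise, with the silhouette score as one possible way to quantify the difference) shows how swapping only the linkage criterion changes the resulting clusters:
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Same data and same number of clusters -- only the linkage criterion changes
for link in ('average', 'single', 'complete', 'ward'):
    labels = AgglomerativeClustering(n_clusters=8, linkage=link).fit_predict(X)
    print(link, silhouette_score(X, labels))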
Activity 3: Comparing k-means with Hierarchical Clustering
You are managing a store's inventory and receive a large shipment of wine, but the brand labels have fallen off the bottles during transit. Fortunately, your supplier has provided you with the chemical readings for each bottle, along with their respective serial numbers. Unfortunately, you aren't able to open each bottle of wine and taste test the difference – you must find a way to group the unlabeled bottles back together according to their chemical readings! You know from the order list that you ordered three different types of wine and are given only two wine attributes to group the wine types back together. In this activity, we will be using the wine dataset.
UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science
The aim of this activity is to implement k-means and hierarchical clustering on the wine dataset and to explore which approach ends up being more accurate or easier for you to use. You can try different combinations of scikit-learn implementations and use helper functions in SciPy and NumPy. You can use the silhouette score to compare the different clustering methods and visualize the clusters on a graph.
Expected Outcome:
After completing this activity, you will have gained an understanding of how k-means and hierarchical clustering work on similar datasets. You will likely notice that one method performs better than the other depending on how the data is shaped. Another key outcome from this activity is gaining an understanding of how important hyperparameters are in any given use case.
Here are the steps to complete this activity:
Import the necessary packages from scikit-learn (KMeans, AgglomerativeClustering, and silhouette_score).
Read the wine dataset into a pandas DataFrame and print a small sample.
Visualize the wine dataset to understand its data structure.
Use the sklearn implementation of k-means on the wine dataset, knowing that there are three wine types.
Use the sklearn implementation of hierarchical clustering on the wine dataset.
Plot the predicted clusters from k-means.
Plot the predicted clusters from hierarchical clustering.
Compare the silhouette score of each clustering method (a minimal sketch of this comparison follows the expected output figures below).
Plot the predicted clusters from the k-means clustering method as follows:
Figure 2.18: The expected clusters from the k-means method
Plot the predicted clusters from the agglomerative clustering method, as follows:
Figure 2.19: The expected clusters from the agglomerative method
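One possible way to wire up the silhouette comparison described in the steps above is sketched below. The file name and column names are illustrative assumptions; load the UCI wine data however you prefer and substitute the two chemical attributes you decide to cluster on:
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Hypothetical file and column names -- adjust to match how you load the wine data
wine = pd.read_csv('wine_data.csv')
X = wine[['OD_Reading', 'Proline']]  # the two chemical attributes you choose

km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)
ac_labels = AgglomerativeClustering(n_clusters=3, linkage='average').fit_predict(X)

print('k-means silhouette:', silhouette_score(X, km_labels))
print('agglomerative silhouette:', silhouette_score(X, ac_labels))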