
A Beginner's Guide to Hierarchical Clustering and How to Perform It in Python


Introduction

Understanding customer behavior is of utmost importance in any industry. I realized this last year when my chief marketing officer asked me, "Which of our existing customers should we target for our new product?"

That was quite a learning experience for me. As a data scientist, I quickly realized how important it is to segment customers so that my organization can design tailored, strategic plans. This is precisely where clustering proved so effective!

Problems like customer segmentation can appear deceptively tricky, because we do not start with any outcome variable to predict. We are firmly in the territory of unsupervised learning, where the task is to uncover inherent patterns and structures without a pre-defined target variable. That undoubtedly makes it an exciting and intellectually stimulating challenge for data scientists.


Now, several distinct approaches exist to execute clustering tasks (as you will encounter in the following sections). In this article, I aim to acquaint you with a specific category of these methodologies – hierarchical clustering.

In this article, we will build a solid understanding of hierarchical clustering: what sets it apart from other clustering techniques, the different forms it takes, and the step-by-step process for performing it. Finally, we will apply hierarchical clustering in Python to a customer segmentation dataset. I really enjoy this technique, and I'm confident you will find it equally fascinating once you've read through this article.

Note: As noted, various methods for performing clustering exist. I invite you to take a look at our comprehensive resource on the different types of clustering available.

Table of Contents
  • Supervised vs Unsupervised Learning
  • Why Hierarchical Clustering?
  • What is Hierarchical Clustering?
  • Types of Hierarchical Clustering
  • What is Agglomerative Clustering?
  • Steps to Perform Hierarchical Clustering
  • How to Choose the Number of Clusters in Hierarchical Clustering?
  • Applying Hierarchical Clustering to Solve the Wholesale Customer Segmentation Problem
Supervised vs Unsupervised Learning

It is crucial to grasp the distinction between supervised and unsupervised learning before moving on to hierarchical clustering; in particular, you need a clear picture of what unsupervised learning is. Let me explain the difference with a simple example.

Suppose we want to estimate the number of bikes that will be rented in a city each day.


Or, let's say we want to predict whether a passenger on the Titanic survived.


We have a fixed target to achieve in both these examples:

  • In Example One: We aim to forecast the number of bikes rented based on factors such as the season, holidays, working days, weather conditions, and temperature.
  • In Example Two: We aim to predict whether a passenger survived or perished. The 'Survived' variable indicates survival status: 0 signifies non-survival and 1 signifies survival. The influencing factors here include passenger class, gender, age, ticket fare, and so on.

When we are given a target variable ('count' and 'Survived' in the two scenarios above), we aim to predict it from a set of independent variables (season, holiday, Sex, Age, etc.). Problems of this kind are categorized as supervised learning problems.

Let’s look at the figure below to understand this visually:

[Figure: supervised learning setup, features X mapped to target y]

Here, y denotes our response or outcome variable, while X represents the explanatory variables. The response variable depends on X and is therefore referred to as a dependent variable. Our model is built using these explanatory variables under the supervision of y, giving rise to the term supervised learning.

Our objective during model training is to design a function that maps the input space to the output space. Once the model has been established, it will accept new sample batches and predict their corresponding outputs. This essentially constitutes supervised learning.

There can also be cases where we do not have any target variable to predict. Such problems are classified as unsupervised learning problems. In these problems, we only have the independent variables and no dependent variable.


In these scenarios, we aim to partition the complete dataset into groups. These groups are referred to as clusters, and the process of forming them is called clustering.


This technique is typically employed to cluster a population into distinct groups. Some typical examples involve dividing customers into segments, grouping similar documents, suggesting analogous songs and movies, among others.

Unsupervised learning has numerous applications. If you come across an interesting one, feel free to share it in the comments section below.

Now, contemporary advancements have led to the development of diverse algorithms that assist in forming these clusters. K-means and Hierarchical clustering are widely regarded as the most popular clustering methods.

Why Hierarchical Clustering?

Before tackling hierarchical clustering, we should first understand how K-means clustering works. Trust me, it will make the concept of hierarchical clustering much easier to grasp.

Here’s a brief overview of how K-means works:

1. Decide the number of clusters (k).
2. Randomly choose k points from the dataset as the initial centroids.
3. Assign every data point to its nearest cluster centroid, measured by distance.
4. Recompute each centroid as the average of the points currently assigned to that cluster.
5. Repeat steps 3 and 4 until convergence, i.e. until the cluster assignments (or centroid positions) no longer change.

This is an iterative process. It continues until the centroids of the newly formed clusters stop changing or the maximum number of iterations is reached.
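To make these steps concrete, here is a minimal sketch using scikit-learn's KMeans on a handful of made-up points (the values are purely illustrative and not from any dataset used in this article):

    from sklearn.cluster import KMeans
    import numpy as np

    # A few made-up 2-D points, purely for illustration
    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]])

    # Step 1: pick k; steps 2-5 (initialise, assign, update, repeat) all happen inside fit()
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(kmeans.labels_)           # which cluster each point was assigned to
    print(kmeans.cluster_centers_)  # final centroids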

But K-means faces certain limitations. It tends to generate clusters of similar sizes, which might not always be ideal for real-world data. Additionally, determining the appropriate number of clusters is a critical step that must be decided upfront in the algorithm. Ideally, we would like to ascertain this optimal number without prior knowledge, making it a notable challenge inherent to the K-means methodology.

Hierarchical clustering does away with this problem: it does not require the number of clusters to be specified in advance, which is a significant advantage over methods that rely on an initial guess. Let's delve into what hierarchical clustering is and explore its advantages over the K-means algorithm.

What is Hierarchical Clustering?

Let’s say we have the below points and we want to cluster them into groups:

[Figure: scatter of points to be clustered]

We can assign each of these points to a separate cluster:

[Figure: each point assigned to its own cluster]

Given that these clusters exhibit similarities, we can merge the most similar clusters together and iterate upon this procedure until a sole cluster remains.

[Figure: similar clusters merged step by step until a single cluster remains]

We are constructing a series of hierarchical clusters to organize our data systematically. The term 'hierarchical clustering' is derived from the fact that this algorithm organizes data points into nested groups based on their similarities and differences. In upcoming sections, I will delve into the methodology for determining the optimal number of clusters for your dataset. For now, let's explore the various forms and applications of hierarchical clustering techniques in more detail.

Types of Hierarchical Clustering

There are mainly two types of hierarchical clustering:

  • Agglomerative hierarchical clustering
  • Divisive hierarchical clustering

Let’s understand each type in detail.

Agglomerative Hierarchical Clustering

In this technique, every data point starts as its own cluster. Suppose we have a dataset with four points; the method assigns each point to a separate cluster, so we initially have four clusters.


In each iteration, we group the pair of clusters that are nearest to each other and repeat this process until just one cluster remains.


In each step, the clusters are combined. This method is commonly referred to as additive hierarchical clustering.

Divisive Hierarchical Clustering

Divisive hierarchical clustering works in the opposite way to agglomerative hierarchical clustering. Instead of starting with n clusters (one per data point), we start with a single cluster to which all the data points are assigned.

It does not matter whether we have 10 or 1000 data points; at the start, all of them belong to the same single cluster.


In each iteration, we separate the point farthest from the rest of its cluster, and we repeat this process until each cluster contains only a single point.


At each iteration, we partition existing clusters into smaller groups, thereby establishing a hierarchical structure through divisive methods.
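Common libraries such as scikit-learn and SciPy only ship agglomerative implementations, so here is a toy sketch of the top-down idea described above (start with one cluster and repeatedly splinter off the point farthest from its cluster's centroid). It is only meant to illustrate the procedure, not to serve as a production implementation:

    import numpy as np

    def divisive_splinter(points):
        """Toy top-down clustering: repeatedly split off the point farthest from its cluster centroid."""
        points = np.asarray(points, dtype=float)
        clusters = [list(range(len(points)))]        # start with one cluster holding every point
        while any(len(c) > 1 for c in clusters):
            # find, across all clusters, the member farthest from its own cluster's centroid
            best = None
            for ci, members in enumerate(clusters):
                if len(members) == 1:
                    continue
                centroid = points[members].mean(axis=0)
                dists = np.linalg.norm(points[members] - centroid, axis=1)
                far = int(np.argmax(dists))
                if best is None or dists[far] > best[0]:
                    best = (dists[far], ci, members[far])
            _, ci, splinter = best
            clusters[ci].remove(splinter)            # split the farthest point into its own cluster
            clusters.append([splinter])
            print(clusters)                          # show the clusters after each split
        return clusters

    divisive_splinter([[10], [7], [28], [20], [35]])  # illustrative 1-D values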

The Agglomerative Clustering method is commonly employed across various industries, serving as a central topic in this article. The Divisive Hierarchical Clustering approach becomes relatively straightforward once one grasps the principles of Agglomerative methods.

Steps to Perform Hierarchical Clustering

In hierarchical clustering, as we have seen, we repeatedly merge the most similar points or clusters. This raises an essential question: how do we decide which points are similar and which are not? It is one of the most important questions in clustering!

One way to assess similarity is to measure the distance between cluster centroids: the clusters whose centroids are closest together are considered the most similar, and we merge them. Because it relies on inter-cluster distances, this approach belongs to the family of distance-based algorithms.

Hierarchical clustering also uses a structure known as the proximity matrix, which stores the distance between every pair of points. Let's take an example to understand this matrix as well as the steps for performing hierarchical clustering.

Setting up the Example


Suppose a teacher wants to divide her students into different groups. She has each student's marks on an assignment and wants to segment them based on those marks. There is no predetermined target for how many groups to form, and since she does not know which type of student belongs in which group, the problem cannot be framed as supervised learning. So we will use hierarchical clustering to partition the students into clusters.

Let’s take a sample of 5 students:

[Table: marks obtained by the 5 students]

Creating a Proximity Matrix

First, we will construct a proximity matrix to determine the distances between each pair of points. Since we calculate each point's distance from every other point, the resulting matrix will be square-shaped with dimensions n by n, where n represents the total number of observations.

Let’s make the 5 x 5 proximity matrix for our example:

[Table: the 5 x 5 proximity matrix]

The diagonal entries in this matrix are consistently zero because the distance from any point to itself is zero. The rest of these distances will be calculated using the Euclidean distance formula. Suppose we aim to compute the distance between points one and two:

√((10 − 7)²) = √9 = 3

By computing all pairwise distances between data points, we can complete the proximity matrix.
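As a quick sketch, the whole proximity matrix can be computed with SciPy. The marks below are hypothetical illustration values; only the 10 and 7 for points 1 and 2 come from the worked distance above:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Hypothetical marks for the 5 students; only the 10 and 7 come from the worked example above
    marks = np.array([[10], [7], [28], [20], [35]])

    # Pairwise Euclidean distances arranged as a 5 x 5 proximity matrix (diagonal is 0)
    proximity = squareform(pdist(marks, metric='euclidean'))
    print(proximity)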

Steps to Perform Hierarchical Clustering

Step 1: First, we assign each point to its own cluster:

[Figure: the 5 points, each shown as its own cluster in a different color]

Distinct color assignments here correspond to distinct cluster groups. We observe that there are 5 distinct cluster groups for the 5 points in our dataset.

Step 2: Next, we look for the smallest distance in the proximity matrix and merge the two points that are closest to each other. We then update the proximity matrix.

[Table: proximity matrix with the smallest distance highlighted]

Here, the smallest distance is 3, and hence we will merge points 1 and 2.


We then need to update the proximity matrix based on the new cluster groups:

[Table: updated marks after merging points 1 and 2]

Here, we have taken the maximum of the two scores (7 and 10) to represent the merged cluster. Instead of the maximum, we could just as well take the minimum or the average of the two values. We then recalculate the proximity matrix for these clusters:

[Table: recalculated proximity matrix]

Step 3: We will repeat step 2 until only a single cluster is left.

At each step, we look for the smallest distance in the proximity matrix and merge the two closest clusters. Repeating these operations, as shown below, we eventually arrive at the merged clusters:

[Figure: successive merges from 5 clusters down to a single cluster]

We started with 5 clusters and finally ended up with a single cluster.
This illustrates how agglomerative hierarchical clustering performs its operations.
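SciPy's linkage function carries out exactly this repeated merging and records every step in a linkage matrix. Here is a minimal sketch using the same hypothetical marks as above, with the 'ward' method that we will also use in the Python example later (SciPy additionally offers 'single', 'complete' and 'average' linkage, which roughly correspond to the minimum, maximum and average choices mentioned earlier):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    marks = np.array([[10], [7], [28], [20], [35]])   # hypothetical marks, as before

    # Each row of Z describes one merge: [cluster i, cluster j, merge distance, new cluster size]
    Z = linkage(marks, method='ward')
    print(Z)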
However, a central issue still persists—how do we determine the optimal number of clusters?
Moving forward, let us delve into this matter in detail.

How should we Choose the Number of Clusters in Hierarchical Clustering?

We have finally reached the question that has been hanging over us since the start: how do we decide the number of clusters in hierarchical clustering? For this, we rely on an essential tool called the dendrogram.

A dendrogram is a tree-like diagram that records the history of how the clusters were merged (or split).

Let's get back to our teacher-student example. Whenever two clusters are merged, the dendrogram records the distance between them and represents it graphically: it consists of vertical branches, one per cluster, joined by horizontal links that mark each merge.

[Figure: example dendrogram]

The samples of the dataset are shown on the x-axis and the distances on the y-axis. Whenever two clusters are merged, they are joined in the dendrogram, and the height of the join equals the distance between them. Let's build the dendrogram for our example:

[Figure: dendrogram axes for the student example, samples on the x-axis and distance on the y-axis]

Please pause for a moment to analyze this image. Initially, we combined samples 1 and 2, finding their distance to be 3 (as detailed in our earlier proximity matrix). Our next step is to represent this in a dendrogram.

[Figure: dendrogram with samples 1 and 2 joined at a distance of 3]

Here we can see that samples 1 and 2 have been merged. The height of the vertical link represents the distance between them. In the same way, all the merging steps are plotted, and we end up with a dendrogram like this:

[Figure: complete dendrogram for all 5 samples]

The dendrogram gives us a clear picture of the steps of hierarchical clustering: the longer a vertical line in the dendrogram, the larger the distance between the clusters it joins.

Now we can set a threshold distance and draw a horizontal line at that height. (Typically, the threshold is chosen so that the line cuts the tallest vertical line on the chart.) Let's set the threshold at 12 and draw the horizontal line:

[Figure: dendrogram with a horizontal threshold line drawn at a distance of 12]

The number of clusters equals the number of vertical lines the horizontal line crosses at the chosen threshold. In our example, the red line drawn at a threshold of 12 intersects two vertical lines, so we end up with two clusters, each containing its own subset of the samples. It really is that straightforward.
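The same cut can also be made programmatically: once we have the linkage matrix, SciPy's fcluster returns the cluster label of every sample for a given distance threshold. A minimal sketch, reusing the hypothetical marks and the threshold of 12 from above:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    marks = np.array([[10], [7], [28], [20], [35]])   # hypothetical marks, as before
    Z = linkage(marks, method='ward')

    # Cut the dendrogram at a distance of 12; each sample gets the label of the cluster it falls into
    labels = fcluster(Z, t=12, criterion='distance')
    print(labels)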

This is how we can use a dendrogram to decide the number of clusters in hierarchical clustering. In the next section, we will put these concepts into practice.

Applying Hierarchical Clustering to Solve the Wholesale Customer Segmentation Problem

Time to get our hands dirty in Python!

We are working on a wholesale customer segmentation problem. You can download the dataset from this link: https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale customers data.csv. The data is hosted by the UCI Machine Learning Repository. The aim of the problem is to segment the clients of a wholesale distributor based on their annual spending on product categories such as fresh produce, milk, and groceries.

First, we conduct an exploratory analysis of the dataset. Following this, we implement a hierarchical clustering approach to effectively categorize and segment the client base.

We will first import the required libraries:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline

Load the data and look at the first few rows:

    data = pd.read_csv('Wholesale customers data.csv')
    data.head()

[Output: first five rows of the dataset]

The dataset contains several product categories (Fresh, Milk, Grocery, and so on), and the values show how much of each product a client purchased annually. Our goal is to cluster the data so that similar clients end up in the same segment, and hierarchical clustering is well suited to this problem.

However, before applying hierarchical clustering we need to bring all variables to the same scale. Why does this matter? Because if the variables are on very different scales (compare, for example, the Fresh and Milk columns in the output above), the clustering will be biased towards the variables with the larger magnitudes.
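A quick way to see this problem for yourself (assuming the data frame loaded above) is to compare the mean and standard deviation of each column:

    # The spend columns span very different ranges, so an unscaled distance
    # would be dominated by the columns with the largest values
    print(data.describe().loc[['mean', 'std']])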

So, let us first normalize the data and bring all the variables to the same scale:

    from sklearn.preprocessing import normalize
    data_scaled = normalize(data)
    data_scaled = pd.DataFrame(data_scaled, columns=data.columns)
    data_scaled.head()

[Output: first five rows of the normalized data]

We can now see that the scale of all the features is roughly similar, so we are ready to move ahead. Let's first draw the dendrogram to help us decide the number of clusters for this problem:

    import scipy.cluster.hierarchy as shc
    plt.figure(figsize=(10, 7))
    plt.title("Dendrograms")
    dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))

[Output: dendrogram of the scaled data]

In this plot, the x-axis shows the samples and the y-axis the distances between them. The vertical line with the maximum distance is the blue one, so we can set a threshold of 6 and cut the dendrogram at that height:

    plt.figure(figsize=(10, 7))
    plt.title("Dendrograms")
    dend = shc.dendrogram(shc.linkage(data_scaled, method='ward'))
    plt.axhline(y=6, color='r', linestyle='--')

[Output: dendrogram with a horizontal threshold line at y=6]

This line cuts the dendrogram at two points, so we will form two clusters. Let's now apply hierarchical clustering with two clusters:

    from sklearn.cluster import AgglomerativeClustering
    # Note: in newer versions of scikit-learn, the 'affinity' parameter has been renamed to 'metric'
    cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
    cluster.fit_predict(data_scaled)

[Output: array of cluster labels (0s and 1s)]

We can see values of 0 and 1 in the output since we defined two clusters. A label of 0 means the point belongs to the first cluster and 1 means it belongs to the second. Let's now visualize the two clusters:

    plt.figure(figsize=(10, 7))
    plt.scatter(data_scaled['Milk'], data_scaled['Grocery'], c=cluster.labels_)

[Output: scatter plot of Milk vs Grocery colored by cluster label]

Awesome! We can clearly see the two clusters in the plot. This is how we implement agglomerative hierarchical clustering in Python: the algorithm repeatedly merges the most similar clients, and the dendrogram tells us where to stop.
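As a possible next step (not part of the original walkthrough), you could attach the cluster labels back to the unscaled data frame and compare the average spend of each segment. A small sketch, assuming the data and cluster objects created above:

    # Attach each client's cluster label to the original (unscaled) data
    data['Cluster'] = cluster.labels_

    # Compare average annual spend per product category in each segment
    print(data.groupby('Cluster').mean().round(1))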

End Notes

Hierarchical clustering is a very effective technique for grouping observations. One significant advantage is that it does not require the number of clusters to be specified in advance, which gives it an edge over k-means.

If you are still relatively new to data science, I strongly recommend our Applied Machine Learning course. It is one of the most comprehensive end-to-end machine learning courses available, and hierarchical clustering is just one of the many topics it covers.

What do you think about hierarchical clustering? Would you say there's a more efficient method to form clusters while using fewer computational resources? Feel free to connect with me in the comments section below for a discussion.

