Clustering, or grouping data based on similarities, is a powerful way to categorize information, making it easier to uncover hidden patterns in large datasets. A clustering algorithm organizes data points into groups, or “clusters,” so that similar points end up in the same cluster while dissimilar points fall into different ones. This approach is part of what’s known as unsupervised machine learning, meaning it works with unlabeled datasets, finding relationships without prior guidance.
Clustering techniques aren’t just theoretical: they’re widely used in real-world applications across diverse fields such as artificial intelligence, biology, customer relationship management, marketing, image processing, and psychology. Huang (1998) delved into extending the k-means algorithm to handle large datasets with categorical values, enhancing its utility in areas like data mining. Similarly, other authors (2010) discuss improvements to the k-means algorithm that make it more effective for large-scale data processing in various industries.
Ups and downs of clustering
Clustering offers several advantages that make it a popular choice in data analysis and machine learning. First, it is relatively simple to implement, requiring fewer complex steps compared to many other algorithms. Additionally, clustering algorithms, especially k-means, are known for their speed and efficiency, making them suitable for handling large datasets without extensive computational resources. Another benefit is that clustering can adapt smoothly to new examples, allowing it to remain relevant and effective even as datasets evolve.
Despite these strengths, clustering also has some limitations. One drawback is the need to manually choose the number of clusters, or K, which can impact the accuracy and quality of the clustering results if not selected carefully. Moreover, clustering algorithms can be highly sensitive to outliers, which may skew the results by misclassifying data points or causing clusters to appear where they don’t accurately represent the underlying structure. This sensitivity to outliers means that data preprocessing, such as cleaning or normalizing, is often necessary to achieve optimal results.
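Because of this, it usually pays to rescale the features before clustering. Below is a minimal sketch of that preprocessing step using scikit-learn’s StandardScaler; the DataFrame and its columns here are hypothetical, just for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical numeric data; in practice this would be your own dataset
df = pd.DataFrame({"Age": [23, 45, 31, 62], "Income": [28000, 91000, 54000, 40000]})

# rescale every column to zero mean and unit variance so that features
# with large ranges (like income) don't dominate the distance computations
scaled = StandardScaler().fit_transform(df)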
When to use clustering
Clustering techniques are unsupervised learning algorithms, so they tend to shine when you have a set of unlabeled data. They are relatively easy to implement and can, without instruction, quickly organize large datasets into something more usable, which makes them great for exploratory analysis. Clustering is also a powerful tool when we don’t know in advance how many classes the data contains.
Clustering techniques are commonly applied to problems such as resource allocation and market segmentation. We will use an example of the latter to illustrate how a clustering analysis is done.
Example
In this example we run a supermarket mall and, through membership cards, have some basic data about our customers, such as customer ID, age, gender, annual income, and spending score. We want to understand who our target customers are so that these insights can be handed to the marketing team to plan their strategy accordingly. To do that, we choose to perform a customer segmentation analysis using the k-means clustering technique. The analysis starts below.
For more information on what each column of the dataset means and what values it can contain, check out the dataset on Kaggle.
Import libraries and load the data
The first thing we have to do is import the libraries, load the data, and define the helper functions we’ll need.
import os
from warnings import filterwarnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE

filterwarnings("ignore")
cwd = os.getcwd()
def elbow_method_plot(k_values: list, inertias: list, title: str) -> None:
    """
    Plots the elbow curve used to find a good value of k
    for the K-Means clustering algorithm.
    """
    sns.set_palette("viridis")
    plt.plot(k_values, inertias, 'bx-')
    plt.xlabel('Number of Clusters (K)')
    plt.ylabel('Sum of Squared Distances (Inertia)')
    plt.title(title, fontsize=12, y=1.05)
    plt.show()
def tsne_plot(df, target_column):
    """Projects the features to two dimensions with t-SNE and plots them colored by cluster."""
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    tsne = TSNE(n_components=2, random_state=seed)
    X_tsne = tsne.fit_transform(X)
    unique_targets = np.unique(y)
    num_classes = len(unique_targets)
    # one color and one "Profile" label per cluster found in the data
    colors = plt.cm.viridis(np.linspace(0, 1, num_classes))
    legend_labels = [f"Profile {i + 1}" for i in range(num_classes)]
    sns.set_palette("viridis")
    plt.figure(figsize=(8, 6))
    for i, target in enumerate(unique_targets):
        plt.scatter(X_tsne[y == target, 0], X_tsne[y == target, 1],
                    color=colors[i], label=legend_labels[i])
    plt.title("Customer Segments with t-SNE", fontsize=12, y=1.05)
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    plt.legend(title="")
    plt.show()
dataset = pd.read_csv(os.path.join(cwd,'kaggle', 'input','segmentation data.csv'))
dataset.head()
Here we can see that the dataset has 8 columns. Seven of them carry important information about the clients, and one holds only the ID that identifies them. The ID contributes no information about the client for cluster analysis purposes, so we drop this column.
Then it’s important to see what values we can expect in each of the fields and how those values are distributed. To do that, we’ll print all the unique values from each column and plot their distributions below.
dataset = dataset.drop("ID", axis=1)
print('Unique values in each column: ')
for x in list(dataset.columns):
    print(x, ': ', dataset[x].unique())
dataset.hist(figsize=(10, 10))
The Elbow Method
The elbow method is used to determine an appropriate number of clusters for the K-Means algorithm. It is so named because the curve obtained by plotting the number of clusters against the sum of squared distances resembles an arm bent at the elbow.
Interpreting the elbow graph comes down to finding the bend in the curve. This point is called the “elbow” and represents a balance between the model’s ability to explain the data and the complexity of the solution.
When analyzing the graph, we look for the point where the decrease in the sum of squared distances begins to slow down significantly. This point indicates an adequate number of clusters, since adding more clusters beyond it would bring no substantial gain in explaining the data.
# clustering parameters
k_values_range = range(2, 20)
squared_distances_error = []
silhouette_scores = []
seed = 300

# applying clustering
for k in k_values_range:
    # fitting k-means
    kmeans = KMeans(n_clusters=k, n_init="auto", random_state=seed)
    kmeans.fit(dataset)
    # calculating silhouette scores
    cluster_labels = kmeans.predict(dataset)
    silhouette_avg = silhouette_score(dataset, cluster_labels)
    # collecting fit metrics
    squared_distances_error.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_avg)

# optimal number of clusters according to the silhouette analysis
# (the range starts at k=2, so the list index is mapped back to a k value)
silhouette_optimal_clusters = k_values_range[silhouette_scores.index(max(silhouette_scores))]

# elbow plot
elbow_method_plot(
    k_values=k_values_range,
    inertias=squared_distances_error,
    title="Elbow Method"
)
Analyzing the graph above, we can easily see that the elbow method indicates 4 as the ideal number of clusters for grouping the customer database, which translates into a natural separation into 4 different customer profiles.
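Since the loop above also stored a silhouette score for each candidate k, a quick complementary check is to plot those scores as well; the k with the highest average silhouette should agree with, or at least sit close to, the elbow. A minimal sketch reusing the lists computed above:
# silhouette scores collected in the loop above, one per candidate k
plt.plot(k_values_range, silhouette_scores, 'bx-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Analysis', fontsize=12, y=1.05)
plt.show()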
Data Retraining and Labeling
We then train the K-Means algorithm with the best number of clusters determined above and use the trained model to attach the appropriate label to each row in the database so that the groups can be analyzed later.
# training the final cluster model
kmeans = KMeans(n_clusters=silhouette_optimal_clusters, n_init="auto", random_state=seed)
kmeans.fit(dataset)
centroids = kmeans.cluster_centers_

# generating cluster labels
cluster_labels = kmeans.predict(dataset)

# labeling original data
dataset["Cluster"] = cluster_labels

# summarizing each cluster by the most frequent value of every column
clusters_summary = dataset.groupby("Cluster").agg(lambda x: x.mode().iloc[0])
clusters_summary
The next step is to plot the variables divided by clusters and see the results.
sns.pairplot(dataset, hue="Cluster", height=2.5, palette='colorblind')
One of these plots stands out: the Income vs. Age plot, which shows a very clear pattern in the clustering, so let’s zoom in.
sns.scatterplot(data=dataset, x='Age', y='Income', hue='Cluster', palette='colorblind')
The data show that income is a very important distinguishing feature. Clusters 0 and 2 have higher incomes, and clusters 1 and 3 have lower incomes. This seems to be one of the most important features defining each cluster.
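We can back this reading up numerically; a quick sketch using the labeled dataset from above averages age and income per cluster:
# mean age and income per cluster, ordered by income
dataset.groupby("Cluster")[["Age", "Income"]].mean().sort_values("Income")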
Conclusions and Next Steps
In summary, clustering is a powerful unsupervised machine learning technique that enables the classification of data into distinct groups based on common characteristics. Its ability to reveal hidden patterns makes it invaluable in fields ranging from marketing to medicine.
The k-means algorithm, for instance, is a widely used clustering method celebrated for its simplicity and efficiency, though it requires careful selection of the number of clusters and is sensitive to outliers. By segmenting data effectively, such as grouping customers in a supermarket scenario, clustering not only enhances exploratory data analysis but also provides actionable insights that inform strategic decisions. Clustering is thus a key tool for transforming unlabeled data into meaningful and valuable information.
To dive deeper into clustering techniques or discuss how they can be applied to your business, click the banner below and connect with one of our specialists!
This article was written by Caio Bastos. If you enjoyed it, feel free to send me an email via [email protected].