K-Means Clustering
Author: Pattarai
K-Means Clustering is an Unsupervised Machine Learning algorithm, which groups the unlabeled dataset into different clusters. It aims at grouping similar objects into clusters, thereby revealing inherent structures within the data for exploratory analysis or serving as a preprocessing step for subsequent algorithms.
Our primary objective is to partition a set of n data points into k clusters. Let S denote the set of all possible cluster assignments. It is important to note that S is finite, which guarantees that the algorithm cannot cycle through distinct assignments forever.
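The finiteness of S follows from a simple counting argument: each of the n points can be assigned to one of the k clusters, so

$$|S| \le k^{n}$$

Since each iteration of k-means never increases the objective and S is finite, the algorithm must terminate.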
K-means Clustering (Lloyd’s Algorithm)
Lloyd’s Algorithm, popularly known as the k-means algorithm, is a widely used and straightforward clustering method that partitions a dataset into K predetermined clusters by alternating between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points.
Algorithm Steps:
- Start by randomly selecting k points, which will act as the initial centers for our clusters. These points are known as means or centroids.
- Assign each data point to the cluster whose centroid is closest to it. Then, update the centroids by taking the average of all the data points assigned to each cluster.
- Repeat this process for a set number of iterations, refining the clusters each time until they stabilize and the centroids no longer change significantly. At this point, we have our final clusters.
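The steps above can be sketched in plain NumPy. This is a minimal illustration of Lloyd’s algorithm, not scikit-learn’s implementation; for simplicity it assumes no cluster ever ends up empty.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (assumes no cluster becomes empty)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On well-separated data this converges in a handful of iterations; in practice scikit-learn’s `KMeans` should be preferred, since it adds smarter initialization (k-means++) and multiple restarts.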
Objective Function:
The primary goal is to minimize the sum of the squared distances between each data point and its assigned centroid.
$$\sum_{i=1}^{n} \| x_i - \mu_{z_i} \|^2$$

Where $x_i$ denotes the i-th data point, $z_i$ denotes the cluster label of $x_i$, and $\mu_{z_i}$ denotes the mean of the cluster with label $z_i$.
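This objective can be evaluated directly for any given assignment. Below is a small NumPy sketch; the points, labels, and means are made-up values chosen so the sum is easy to check by hand.

```python
import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared distances between each point and its cluster's mean
    return float(np.sum((X - centroids[labels]) ** 2))

# Tiny made-up example: two points per cluster, each at distance 1 from its mean
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])
print(wcss(X, labels, centroids))  # 1 + 1 + 1 + 1 = 4.0
```

This is the same quantity scikit-learn exposes as `inertia_` after fitting a `KMeans` model.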
Elbow Method
Determining the optimal number of clusters is crucial in unsupervised learning, as it directly affects the quality of clustering. Without a predefined number of clusters, we turn to methods like the Elbow Method to find the most suitable value for k, the number of centroids in k-means clustering.
The Elbow Method involves iterating through a range of k values, typically from 1 to a chosen maximum value. For each value of k, we calculate the within-cluster sum of squares (WCSS), which represents the sum of the squared distances between each data point and its assigned centroid.
To determine the optimal k value, we plot a graph of k against WCSS. The characteristic “elbow” suggests a balance between maximizing the number of clusters for finer distinctions and minimizing redundancy in clustering.
Implementation of Elbow Method
- First, import the necessary libraries.
- Load the dataset. In this case, we will use a diabetes dataset of patients.
- The clustering will be based on the data points in the columns Age and BMI.
- Separate the two columns using the iloc method to obtain all the rows for Age and BMI, and convert them into an array.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Example code: dataset is the pandas DataFrame loaded earlier;
# age_column and bmi_column are the integer positions of the Age and BMI columns
X = dataset.iloc[:, [age_column, bmi_column]].values
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
# Plot the Elbow graph
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Since the elbow in the plot appears at k=4, we choose four clusters for the K-Means Clustering algorithm.
Formation of Clusters
Fit the model to the data X with k=4. The fit_predict() method fits the model and returns the cluster labels for each data point. These labels are stored in the variable y_kmeans.
kmeans = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Plotting clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
# Plotting centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='yellow', label='Centroids')
plt.title('Clusters of patients')
plt.xlabel('Age')
plt.ylabel('BMI')
plt.legend()
plt.show()
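Once fitted, the model can also assign previously unseen patients to a cluster via `predict()`. The diabetes data itself is not reproduced here, so the Age/BMI values below are synthetic stand-ins for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the Age/BMI array (values are made up)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal([30, 22], 2, size=(50, 2)),   # younger patients, lower BMI
    rng.normal([60, 33], 2, size=(50, 2)),   # older patients, higher BMI
])

kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
kmeans.fit(X)

# Assign a new patient (Age 58, BMI 32) to the nearest centroid
new_patient = np.array([[58, 32]])
print(kmeans.predict(new_patient))  # prints the index of the nearest cluster
```

Note that cluster indices are arbitrary: which blob gets label 0 depends on initialization, so downstream code should key off the centroids, not the raw label values.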