K-means clustering from scratch

K-means clustering belongs to the class of unsupervised learning algorithms. The main idea is to partition a group of objects in a pre-defined number of clusters in a way that objects in the same cluster are more similar to each other than to other objects from other clusters.

The algorithm to detemine the final set of clusters can be divided in the following steps:

1. choose k – the number of clusters

2. select k random points as the initial centroids

3. assign each data point to the nearest cluster based on the distance of the data point to the centroid (use Euclidean distance)

4. calculate the new coordinates of all clusters centroids by averaging coordinates over all clusters members

5. repeat steps 3. and 4. until the clusters members do not change anymore (no transitions between clusters)

Data

Let us first generate a set of data points from multivariate normal distribution:

import numpy as np
import matplotlib.pyplot as plt
import random 
import os

#generate population
number_of_centres=3
number_of_points_per_centre=500
np.random.seed(1)
colors = ["red", "blue" , "green", "orange"]
img_counter=0
for i in range(number_of_centres): 
    means=[15*np.random.random(),10*np.random.random()]
    a=np.random.multivariate_normal(means,[[1,0],[0,1]],number_of_points_per_centre)    
    if i==0:
        X=a
    else:    
        X=np.vstack((X,a))
plt.figure(figsize=(10,8))
plt.scatter(X[:,0],X[:,1])

We need some functions to perform main steps of the algorithm:

def recalculate_clusters(X,centroids,k):
    clusters=dict()
    for i in range(k):
        clusters[i]=[]
    for data in X:
        e_distance=[]
        for j in range(k):
            e_distance.append(np.linalg.norm(data - centroids[j]))
        clusters[e_distance.index(min(e_distance))].append(data)
    return clusters    

def recalculate_centroids(centroids,clusters,k):
    for i in range(k):
        centroids[i]=np.average(clusters[i],axis=0)
    return centroids

Iterating through the algorithm:

k=3
centroids={}
for i in range(k):
    centroids[i]=X[i]
    
for i in range(10):        
    clusters=recalculate_clusters(X,centroids,k)  
    centroids=recalculate_centroids(centroids,clusters,k)
    plot_clusters(centroids,clusters,k)

and plotting the clusters with:

def plot_clusters(centroids,clusters,k):
    plt.figure(figsize=(10,8))  
    area = (20)**2
    for i in range(k):
        for cluster in clusters[i]:
            plt.scatter(cluster[0],cluster[1],c=colors[i % 3])          
        plt.scatter(centroids[i][0],centroids[i][1],s=area,marker='^', edgecolors='white',c=colors[i % 3])

gives the following evolution to final cluster composition: