## What is Clustering?

Clustering is an unsupervised machine learning technique used in statistical data analysis, image processing, and pattern recognition. A clustering algorithm assigns each data point to a group so that points within a group are similar to one another and dissimilar to points in other groups. For example, data points describing income, education, profession, age, number of children, etc., fall into different clusters, and each cluster contains people with similar socioeconomic characteristics.

### Interpreting Clusters

After computing the optimal clusters, an aggregate measure such as the mean is calculated for every variable within each cluster, and the resulting values are compared across clusters to interpret what distinguishes them.
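This per-cluster aggregation can be sketched with pandas (a minimal illustration; the DataFrame, its columns, and the `cluster` labels are made-up assumptions, not data from the article):

```python
import pandas as pd

# Hypothetical socioeconomic data with a cluster label already assigned.
df = pd.DataFrame({
    "income":  [30, 32, 90, 95, 60, 62],
    "age":     [25, 27, 50, 52, 40, 41],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every variable within each cluster -- comparing these
# per-cluster profiles is how the clusters are interpreted.
profile = df.groupby("cluster").mean()
print(profile)
```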

## K-Means Clustering

The K-means algorithm is iterative. First, choose K, the number of clusters, and randomly select K centroids (centre points). The algorithm then partitions the data set by assigning each point to its nearest centroid, recomputes each centroid as the mean of its assigned points, and repeats until the clusters are stable and the points in each cluster are close to their centroid. The algorithm tries to maintain enough separation between the clusters. Because the method is unsupervised, the resulting clusters have no labels.
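The iterative loop described above can be sketched from scratch with NumPy (a simplified illustration, not the scikit-learn implementation; the data and `k` are made up):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: assignments are stable
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs -> k-means should recover them.
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = kmeans(X, k=2)
print(labels)
```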

```python
# import KMeans from sklearn.cluster
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)  # we want 2 clusters
kmeans.fit(X)
```


## Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis, initially treats each data point as a separate cluster. It comes in two types: agglomerative and divisive. **Agglomerative clustering** uses a bottom-up approach: it starts with single data points as clusters, then repeatedly merges the closest clusters until all data points end up in one cluster. **Divisive clustering** uses a top-down approach and works in reverse: all data points start in one single cluster, which is then repeatedly split.

### Dendrogram

In hierarchical clustering, the number of clusters is decided only after looking at the dendrogram.

A dendrogram is a tree diagram used to visualise the clusters and to decide where to split the single merged cluster into multiple clusters.
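Building a dendrogram can be sketched with SciPy (a minimal illustration; the small array `X` is made-up data, and calling `dendrogram(Z)` inside a matplotlib figure draws the tree):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Made-up 1-D points: two tight pairs and one outlier.
X = np.array([[1.0], [1.1], [5.0], [5.2], [9.0]])

# The merge tree: each row records one merge (cluster ids, distance, size).
Z = linkage(X, method="complete")

# dendrogram(Z) plots the tree when matplotlib is available;
# no_plot=True returns the layout data instead of drawing.
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf order along the bottom of the dendrogram
```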

### Linkage

Linkage is the criterion used to compute the distance between two clusters: single, complete, or average.

**Single linkage** – the distance between two clusters is defined as the shortest distance between any two points, one from each cluster. **Complete linkage** – the distance between two clusters is defined as the longest distance between any two points, one from each cluster. **Average linkage** – the distance between two clusters is defined as the average distance between every point in one cluster and every point in the other.
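The three criteria can be computed directly from the pairwise distance matrix (a small sketch; the clusters `A` and `B` are made-up points):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of 2-D points.
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])

D = cdist(A, B)        # all pairwise distances between the two clusters

single   = D.min()     # shortest pair of points
complete = D.max()     # farthest pair of points
average  = D.mean()    # mean over all pairs
print(single, complete, average)
```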

```python
# import AgglomerativeClustering from sklearn.cluster
from sklearn.cluster import AgglomerativeClustering

# build an agglomerative hierarchical clustering model
h_complete = AgglomerativeClustering(
    n_clusters=3, linkage='complete', affinity="euclidean"
).fit(X)
```


## Conclusion

Scalability matters for clustering algorithms: a highly scalable algorithm is needed in data mining when the data set is enormous. If the data set is noisy or has missing values, impute them first; otherwise the clustering will be poor. A good algorithm should also handle all kinds of attributes, such as binary, categorical, and numerical (interval-based) data.
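The impute-then-cluster step mentioned above might look like this with scikit-learn's `SimpleImputer` (an illustrative sketch; the array `X` with missing entries is made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

# Made-up data with missing entries (np.nan).
X = np.array([[1.0, 2.0],
              [np.nan, 2.2],
              [8.0, 9.0],
              [8.1, np.nan]])

# Replace each missing value with its column mean; KMeans cannot
# handle NaNs directly, so imputation must come first.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_filled)
print(labels)
```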