January 10, 2020

Machine Learning Clustering Algorithms

What is clustering?

Clustering is a machine learning technique. It is the task of dividing data points into a number of groups such that points in the same group are more similar to each other than to points in other groups. It essentially groups objects based on the similarity and dissimilarity between them. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields.

Why do we use clustering?
Clustering is used in machine learning to make sense of, and extract value from, large sets of structured and unstructured data. If you are working with a large set of unstructured data, it makes sense to partition the data into some sort of logical groupings before attempting to analyze it.


Examples of clustering applications:
Marketing: It can be used to characterize and discover customer segments for marketing campaigns.
Biology: It can be used to classify different species of plants and animals.
Libraries: It is used to cluster books on the basis of topic and information.
Insurance: It is used to group customers and their policies and to identify fraud.
Earthquake studies: By clustering earthquake-affected areas, we can determine the dangerous zones.

Clustering methods:
There are 4 types of clustering methods:
1. Density-based methods
2. Hierarchical methods
3. Partitioning methods
4. Grid-based methods

Density-based methods: These methods consider a cluster to be a dense region of the space that differs from the surrounding, lower-density regions. They have good accuracy and the ability to merge two clusters.
Examples of density-based methods are DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
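To illustrate the density-based idea, here is a minimal DBSCAN-style sketch in plain NumPy. It is a simplification, not the optimized published algorithm; `eps` (neighborhood radius) and `min_pts` (density threshold) are hand-chosen parameters.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Naive DBSCAN sketch: label dense regions, mark sparse points as noise (-1)."""
    n = len(X)
    # Pairwise Euclidean distances, then each point's eps-neighborhood (includes itself).
    d = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbors = [np.where(d[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)  # -1 = noise / unassigned
    cluster = 0
    for i in range(n):
        # Skip points already assigned, and non-core (sparse) points.
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue
        # Grow a new cluster outward from core point i.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels
```

For example, two tight groups plus one far-away point yield two clusters and one noise label.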

Hierarchical methods: In hierarchical clustering, the clusters form a tree-like structure based on the hierarchy, with new clusters formed from previously formed ones. It is divided into 2 categories: agglomerative and divisive. Agglomerative clustering uses a bottom-up approach and divisive clustering uses a top-down approach.
Examples of hierarchical methods are CURE (Clustering Using REpresentatives), BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), etc.

Partitioning methods: These methods partition the objects into k clusters, where each partition forms one cluster. They optimize an objective criterion, such as a distance-based similarity function.
Examples of partitioning methods are k-means and CLARANS (Clustering Large Applications based upon RANdomized Search).

Grid-based methods: In these methods, the data space is divided into a finite number of cells that form a grid-like structure. The clustering operations performed on these grids are fast and largely independent of the number of data objects.
Examples of grid-based methods are STING (STatistical INformation Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
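To make the grid-based idea concrete, here is a minimal 2-D sketch in NumPy: points are binned into grid cells, sparse cells are discarded, and adjacent dense cells are merged by flood fill. The `cell_size` and `density_threshold` parameters are hypothetical knobs, and this is a toy simplification of what STING or CLIQUE actually do.

```python
import numpy as np

def grid_cluster(X, cell_size, density_threshold):
    """Toy grid clustering for 2-D points; points in sparse cells stay -1."""
    # Bin each point into its grid cell.
    cells = {}
    for i, p in enumerate(X):
        key = tuple((p // cell_size).astype(int))
        cells.setdefault(key, []).append(i)
    # Keep only cells dense enough to matter.
    dense = {k: v for k, v in cells.items() if len(v) >= density_threshold}
    labels = np.full(len(X), -1)
    cluster = 0
    visited = set()
    for start in dense:
        if start in visited:
            continue
        # Flood fill over the 8-connected neighborhood of dense cells.
        visited.add(start)
        stack = [start]
        while stack:
            cell = stack.pop()
            for i in dense[cell]:
                labels[i] = cluster
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cell[0] + dx, cell[1] + dy)
                    if nb in dense and nb not in visited:
                        visited.add(nb)
                        stack.append(nb)
        cluster += 1
    return labels
```

Note that the cost of the merge step depends on the number of occupied cells, not on the number of data points, which is the main attraction of grid-based methods.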

Clustering algorithms: Two widely used clustering algorithms are described below:

1. K-means clustering algorithm: K-means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. It follows a simple and easy way to assign a given dataset to a certain number of clusters. The main aim is to define k centers, one for each cluster. These centers should be placed carefully, because different locations produce different results.
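The assign-then-update loop behind k-means can be sketched in a few lines of NumPy. This is a simplified illustration with random initialization, not a production implementation (it does not handle empty clusters or multiple restarts):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Naive k-means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct data points at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest center.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return labels, centers
```

On two well-separated blobs, the loop converges to one label per blob after a handful of iterations.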


Advantages of the k-means algorithm:
Fast, robust, and easy to understand.
Gives the best results when the clusters in the data set are well defined and well separated from each other.

Disadvantages of the k-means algorithm:
The algorithm fails for non-linearly separable data.
It is applicable only when the mean is defined, i.e., it fails for categorical data.
It is unable to handle noisy data and outliers.

2. Hierarchical clustering algorithm:
The hierarchical clustering algorithm is of 2 types:
Agglomerative hierarchical clustering algorithm
Divisive hierarchical clustering algorithm

Agglomerative hierarchical clustering algorithm: This algorithm builds clusters by merging data points one by one on the basis of the nearest distance among all pairwise distances between data points. The distances between clusters are then recalculated, and the choice of distance determines which groups are formed. The common choices are:
Single (nearest) distance, or single linkage
Complete (farthest) distance, or complete linkage
Average distance, or average linkage
Centroid distance.
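The bottom-up merging described above can be sketched with single linkage as follows. This is a deliberately naive illustration in NumPy; real implementations such as SciPy's `scipy.cluster.hierarchy.linkage` are far more efficient.

```python
import numpy as np

def single_linkage(points, n_clusters):
    """Naive agglomerative clustering: repeatedly merge the two nearest clusters."""
    # Start with each point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    while len(clusters) > n_clusters:
        best = (None, None, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                dist = min(d[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best[2]:
                    best = (a, b, dist)
        a, b, _ = best
        # Merge the closest pair of clusters.
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

Swapping `min` for `max` gives complete linkage, and for the mean gives average linkage, which is exactly the choice of distance measure listed above.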

Divisive hierarchical clustering algorithm: It is the reverse process of the agglomerative hierarchical clustering algorithm, splitting one all-inclusive cluster top-down.

Advantages of the hierarchical clustering algorithm:
Easy to implement and gives the best results in some cases.
No prior information about the number of clusters is required.

Disadvantages of the hierarchical clustering algorithm:
The algorithm can never undo a merge or split once it is done.
No objective function is directly minimized.
Sometimes it is difficult to identify the correct number of clusters.

In this article, I have explained clustering in machine learning and the different clustering algorithms. I hope this article has given you a better idea about clustering.