Cluster Analysis

Introduction

Cluster analysis (or clustering) is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters or classes), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. The computational task of classifying the data set into k clusters is often referred to as k-clustering.

Types of clustering

Data clustering algorithms can be hierarchical. Hierarchical algorithms find successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative or divisive. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters. Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering.

Distance measures

An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another.

Particularly important distance measures are the Euclidean distance which leads to a spherical shape of the clusters, and the Mahalanobis distance, which leads to arbitrary elliptic shapes, reflecting the different scales and correlations in the data.

BEAM cluster analysis tools

BEAM currently offers two clustering tools, both employing partional algorithms:
K-means clustering
This algorithm assigns each pixel to the cluster whose center is nearest. The center is the arithmetic mean of all pixels belonging to the cluster. The main advantages of this algorithm are its simplicity and speed, which allows it to run on large datasets. Its disadvantage is that it does not take into account different scales and correlations in the data. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. This algorithm should be your primary choice for performing a cluster analysis. For the anylysis of large scenes, this algorithm is strongly remmended.
EM clustering
This algorithm is a generalization of the k-means algorithm where the clusters are ellipsoids defined by a center and a covariance matrix. The main advantage of this algorithm is that it is not affected by different scales of the data dimensions and correlations between them. Its disadvantage is considerably less speed, which practically limits the applicability to smaller datasets only. It minimizes the intra-cluster variances but does not ensure that the result has a global minimum of variance. Use this algorithm when you want to perform a cluster analysis of a small scene or region-of-interest and are not satisfied with the results obtained from the k-means algorithm.

Further information

A good starting point for obtaining further information on cluster analysis terms and algorithms is the Wikipedia entry on data clustering.