


Very fast version of kmeans clustering.
Clusters the n-by-p matrix X into k clusters using the kmeans algorithm. It returns the
cluster membership of each data point in the n-by-1 vector IDX and the k-by-p matrix of
cluster means in C.
Custom implementation of the kmeans algorithm. In some ways it is less general (for
example, it only uses Euclidean distance), but it has some options that the MATLAB
version does not (for example, it has a notion of outliers and a minimum cluster size).
It is also many times faster than MATLAB's kmeans. General kmeans help can be found in
the help for the MATLAB implementation of kmeans. Note that although the names and
conventions for this algorithm are taken from MATLAB's implementation, there are slight
alterations (for example, IDX==-1 is used to indicate outliers).
-------------------------------------------------------------------------
INPUTS
X
n-by-p data matrix of n p-dimensional vectors. That is, X(i,:) is the ith point in X.
k
Integer indicating the maximum number of clusters for kmeans to find. The actual number
may be smaller (for example, if clusters shrink and are eliminated).
-------------------------------------------------------------------------
ADDITIONAL INPUTS
[...] = kmeans2(...,'param1',val1,'param2',val2,...) enables you to
specify optional parameter name-value pairs to control the iterative
algorithm used by kmeans. Valid parameters are the following:
'replicates' - Number of times to repeat the clustering, each with a
new set of initial cluster centroid positions. kmeans
returns the solution with the lowest value for sumd.
'maxiter' - Maximum number of iterations. Default is 100.
'display' - Whether or not to display algorithm status (default==0)
'randstate' - Seed with which to initialize kmeans. Useful for
reproducibility of the algorithm.
'outlierfrac' - maximum fraction of points that can be treated as
outliers
'minCsize' - minimum size for a cluster (smaller clusters get
eliminated)
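
A minimal sketch of a call that uses the optional parameters listed above. The data
and every option value are illustrative assumptions, not recommended settings, and the
output order [IDX, C, sumd] is assumed from the order of the OUTPUTS section:

```matlab
% Synthetic data (made up for illustration): two well-separated 2-D blobs
% plus a few scattered points that may end up labeled as outliers.
X = [randn(100, 2); randn(100, 2) + 8; 40 * rand(5, 2) - 20];

% Ask for at most 4 clusters. The option values below are arbitrary examples.
[IDX, C, sumd] = kmeans2(X, 4, 'replicates', 5, 'maxiter', 100, ...
    'outlierfrac', 0.05, 'minCsize', 10);

nClusters = size(C, 1);       % may be smaller than the requested k
nOutliers = sum(IDX == -1);   % points not assigned to any cluster
```

Because k is only an upper bound, code that consumes the result should read the number
of clusters from the outputs rather than assume it equals the value passed in.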
-------------------------------------------------------------------------
OUTPUTS
IDX
n-by-1 vector indicating cluster membership. Each element of IDX is an integer giving
the cluster of the corresponding point in X. That is, IDX(i)=c indicates that the ith
point in X belongs to cluster c. Cluster labels range from 1 to k, and thus k=max(IDX)
is typically the number of clusters IDX divides X into. The cluster label "-1" is
reserved for outliers. That is, IDX(i)==-1 indicates that the ith point does not belong
to any of the discovered clusters. Note that MATLAB's version of kmeans does not have
outliers.
C
k-by-p matrix of centroid locations. That is, C(j,:) is the centroid of the points
belonging to cluster j. In kmeans, given X and IDX, a cluster centroid is simply the
mean of the points belonging to the given cluster, i.e. C(j,:) = mean( X(IDX==j,:), 1 ).
sumd
1-by-k vector of within-cluster sums of point-to-centroid distances. That is, sumd(j)
is the sum of the distances from X(IDX==j,:) to C(j,:). The total sum, sum(sumd), is a
typical error measure of the quality of a clustering.
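
The IDX, C and sumd conventions above can be illustrated on a hand-labeled toy example.
This is only a sketch: the data and labels are made up rather than produced by kmeans2,
and plain (not squared) Euclidean distance is assumed for sumd, which this help text
does not specify:

```matlab
% Toy data: 6 points in 2-D with a hand-written labeling (not kmeans2 output).
X   = [0 0; 0 2; 4 4; 4 6; 10 10; -9 7];
IDX = [1; 1; 2; 2; -1; -1];          % -1 marks outliers

inliers = IDX > 0;                   % logical mask of clustered points
k = max(IDX);                        % number of clusters in this labeling
p = size(X, 2);

C    = zeros(k, p);
sumd = zeros(1, k);
for j = 1:k
    Xj = X(IDX == j, :);             % points assigned to cluster j
    C(j, :) = mean(Xj, 1);           % centroid = mean of the members
    d = Xj - repmat(C(j, :), size(Xj, 1), 1);
    sumd(j) = sum(sqrt(sum(d.^2, 2)));   % assumption: plain Euclidean distance
end

fprintf('%d of %d points are outliers\n', sum(~inliers), numel(IDX));
```

Outliers never match IDX==j for any j>=1, so they contribute to neither C nor sumd.
Passing the dimension to mean keeps the result a 1-by-p row even when a cluster
contains a single point.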
-------------------------------------------------------------------------
DATESTAMP
29-Sep-2005 2:00pm
See also DEMOCLUSTER