kmeans2

PURPOSE

Very fast version of kmeans clustering.

SYNOPSIS

function [IDX,C,sumd] = kmeans2( X,k,varargin )

DESCRIPTION

 Very fast version of kmeans clustering.

 Cluster the n-by-p matrix X into k clusters using the kmeans algorithm.  Returns the
 cluster membership of each data point in the n-by-1 vector IDX and the k-by-p matrix of
 cluster means in C.
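
 For orientation, the standard kmeans (Lloyd) iteration can be sketched as below.
 This is a minimal illustration on toy data with hand-picked initial centroids; it is
 not the kmeans2 implementation and has no outliers or minimum cluster size.

```matlab
% Minimal sketch of the standard k-means (Lloyd) iteration, for illustration only.
% NOT the kmeans2 implementation: no outliers, no minimum cluster size.
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];    % n-by-p toy data (two obvious groups)
k = 2;
n = size(X,1);
C = X([1 4],:);                               % initial centroids (hand-picked here)
for iter = 1:100
  % squared Euclidean distance from every point to every centroid (n-by-k)
  D = zeros(n, k);
  for j = 1:k
    D(:,j) = sum((X - repmat(C(j,:), n, 1)).^2, 2);
  end
  [dmin, IDX] = min(D, [], 2);                % assign each point to its nearest centroid
  Cold = C;
  for j = 1:k
    if any(IDX == j)
      C(j,:) = mean(X(IDX == j,:), 1);        % recompute centroid as the cluster mean
    end
  end
  if isequal(C, Cold), break; end             % converged: centroids unchanged
end
```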

 Custom implementation of the kmeans algorithm.  In some ways it is less general (for
 example, it only uses Euclidean distance), but it has some options that the MATLAB version
 does not (for example, it has a notion of outliers and a minimum cluster size).  It is also
 many times faster than MATLAB's kmeans.  General kmeans help can be found in the help for
 the MATLAB implementation of kmeans.  Note that although the names and conventions
 for this algorithm are taken from MATLAB's implementation, there are slight
 alterations (for example, IDX==-1 is used to indicate outliers).

 
 -------------------------------------------------------------------------
 INPUTS
 
  X
 n-by-p data matrix of n p-dimensional vectors.  That is, X(i,:) is the ith point in X.

  k
 Integer indicating the maximum number of clusters for kmeans to find.  The actual number
 may be smaller (for example, if clusters shrink and are eliminated).

 -------------------------------------------------------------------------
 ADDITIONAL INPUTS

 [...] = kmeans2(...,'param1',val1,'param2',val2,...) enables you to
 specify optional parameter name-value pairs to control the iterative
 algorithm used by kmeans. Valid parameters are the following:
   'replicates'  - Number of times to repeat the clustering, each with a
                   new set of initial cluster centroid positions. kmeans
                   returns the solution with the lowest value for sumd.
   'maxiter'     - Maximum number of iterations. Default is 100.
   'display'     - Whether or not to display algorithm status. Default is 0.
   'randstate'   - Seed with which to initialize kmeans. Useful for
                   reproducibility of the algorithm.
   'outlierfrac' - Maximum fraction of points that can be treated as
                   outliers.
   'minCsize'    - Minimum size for a cluster (smaller clusters are
                   eliminated).
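
 A usage sketch with the optional parameters, assuming kmeans2.m is on the MATLAB
 path.  The data and all parameter values are illustrative choices, not recommended
 settings.

```matlab
% Usage sketch; assumes kmeans2.m is on the MATLAB path.
% All parameter values below are illustrative, not recommended defaults.
X = [randn(100,2); randn(100,2) + 5];         % 200-by-2 toy data, two blobs
k = 4;                                         % upper bound on the number of clusters
[IDX, C, sumd] = kmeans2(X, k, 'replicates', 5, 'maxiter', 50, ...
    'display', 0, 'randstate', 1, 'outlierfrac', 0.05, 'minCsize', 10);
nOutliers = sum(IDX == -1);                    % points assigned to no cluster
nFound    = numel(unique(IDX(IDX > 0)));       % clusters actually used; may be < k
fprintf('%d clusters, %d outliers, total error %g\n', nFound, nOutliers, sum(sumd));
```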

 -------------------------------------------------------------------------
 OUTPUTS

  IDX
 n-by-1 vector indicating cluster membership.  Let X be a set of n points.  Then IDX
 (the IDs of X) is a column vector of length n, where each element is an integer
 indicating the cluster membership of the corresponding point in X.  That is, IDX(i)=c
 indicates that the ith point in X belongs to cluster c.  Cluster labels range from 1 to
 k, and thus max(IDX) is typically the number of clusters IDX divides X into.  The
 cluster label -1 is reserved for outliers.  That is, IDX(i)==-1 indicates that the
 ith point does not belong to any of the discovered clusters.  Note that MATLAB's
 version of kmeans does not have outliers.
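
 Because -1 is not a valid array index, outliers usually need to be masked out before
 IDX is used for indexing or counting.  A small sketch with a hypothetical IDX vector:

```matlab
IDX = [1; 2; -1; 1; 2; 2];                     % hypothetical labels; -1 marks an outlier
isOutlier    = (IDX == -1);
nOutliers    = sum(isOutlier);                 % 1
clusterSizes = accumarray(IDX(~isOutlier), 1); % points per cluster: [2; 3]
```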

  C        
 k-by-p matrix of centroid locations.  That is, C(j,:) is the centroid of the points
 belonging to cluster j.  In kmeans, given X and IDX, a cluster centroid is simply the
 mean of the points belonging to the given cluster, i.e., C(j,:) = mean( X(IDX==j,:) ).
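
 The centroid relation can be checked by hand on toy data.  The X and IDX below are
 made up for illustration; the outlier (label -1) contributes to no centroid.

```matlab
X   = [0 0; 4 0; 10 10; 10 12; 50 50];        % toy data, last point is an outlier
IDX = [1; 1; 2; 2; -1];                        % hypothetical labels
k   = max(IDX);
C   = zeros(k, size(X,2));
for j = 1:k
  C(j,:) = mean(X(IDX == j,:), 1);             % mean along rows, even for a single point
end
% C = [2 0; 10 11]
```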

  sumd
 1-by-k vector of within-cluster sums of point-to-centroid distances.  That is, sumd(j) is
 the sum of the distances from X(IDX==j,:) to C(j,:).  The total sum, sum(sumd), is a
 typical error measure of the quality of a clustering.
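
 A sketch of how sumd relates to X, IDX and C, on made-up data.  It sums plain
 Euclidean distances, as the description reads; this is an assumption about kmeans2.
 MATLAB's own kmeans sums squared Euclidean distances by default, which would
 correspond to dropping the sqrt below.

```matlab
X   = [0 0; 4 0; 10 10; 10 12; 50 50];        % toy data, last point is an outlier
IDX = [1; 1; 2; 2; -1];                        % hypothetical labels
C   = [2 0; 10 11];                            % centroids of clusters 1 and 2
k   = size(C,1);
sumd = zeros(1,k);
for j = 1:k
  Xj    = X(IDX == j,:);                       % points of cluster j (outliers excluded)
  diffs = Xj - repmat(C(j,:), size(Xj,1), 1);
  sumd(j) = sum(sqrt(sum(diffs.^2, 2)));       % assumed: plain Euclidean distances
end
% sumd = [4 2], total error sum(sumd) = 6
```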

 -------------------------------------------------------------------------

 DATESTAMP
   29-Sep-2005  2:00pm

 See also DEMOCLUSTER

Generated on Wed 03-May-2006 23:48:50 by m2html © 2003