Very fast version of kmeans clustering.

Cluster the N x p matrix X into k clusters using the kmeans algorithm.
It returns the cluster memberships for each data point in the N x 1
vector IDX and the k x p matrix of cluster means in C.

This is a custom implementation of the kmeans algorithm. In some ways it
is less general (for example, it only uses Euclidean distance), but it
has some options that the Matlab version does not (for example, it has a
notion of outliers and a minimum cluster size). It is also many times
faster than Matlab's kmeans. General kmeans help can be found in the
help for the Matlab implementation of kmeans. Note that although the
names and conventions for this algorithm are taken from Matlab's
implementation, there are slight alterations (for example, IDX==-1 is
used to indicate outliers).

-------------------------------------------------------------------------
INPUTS
 X     n-by-p data matrix of n p-dimensional vectors. That is, X(i,:)
       is the ith point in X.
 k     Integer indicating the maximum number of clusters for kmeans to
       find. The actual number may be smaller (for example, if clusters
       shrink and are eliminated).

-------------------------------------------------------------------------
ADDITIONAL INPUTS
 [...] = kmeans2(...,'param1',val1,'param2',val2,...) enables you to
 specify optional parameter name-value pairs to control the iterative
 algorithm used by kmeans. Valid parameters are the following:
   'replicates'  - Number of times to repeat the clustering, each with
                   a new set of initial cluster centroid positions.
                   kmeans returns the solution with the lowest value
                   for sumd.
   'maxiter'     - Maximum number of iterations. Default is 100.
   'display'     - Whether or not to display algorithm status
                   (default==0).
   'randstate'   - Seed with which to initialize kmeans. Useful for
                   replicability of the algorithm.
   'outlierfrac' - Maximum fraction of points that can be treated as
                   outliers.
   'minCsize'    - Minimum size for a cluster (smaller clusters get
                   eliminated).

-------------------------------------------------------------------------
OUTPUTS
 IDX   n-by-1 vector used to indicate cluster membership. Let X be a
       set of n points. Then the ID of X, or IDX, is a column vector of
       length n, where each element is an integer indicating the
       cluster membership of the corresponding point in X. That is,
       IDX(i)=c indicates that the ith point in X belongs to cluster c.
       Cluster labels range from 1 to k, and thus k=max(IDX) is
       typically the number of clusters IDX divides X into. The cluster
       label "-1" is reserved for outliers. That is, IDX(i)==-1
       indicates that the given point does not belong to any of the
       discovered clusters. Note that Matlab's version of kmeans does
       not have outliers.
 C     k-by-p matrix of centroid locations. That is, C(j,:) is the
       cluster centroid of points belonging to cluster j. In kmeans,
       given X and IDX, a cluster centroid is simply the mean of the
       points belonging to the given cluster, i.e.:
         C(j,:) = mean( X(IDX==j,:) ).
 sumd  1-by-k vector of within-cluster sums of point-to-centroid
       distances. That is, sumd(j) is the sum of the distances from
       X(IDX==j,:) to C(j,:). The total sum, sum(sumd), is a typical
       error measure of the quality of a clustering.

-------------------------------------------------------------------------
DATESTAMP
 29-Sep-2005 2:00pm

See also DEMOCLUSTER
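
-------------------------------------------------------------------------
EXAMPLE
 The relationship between X, IDX, C and sumd described under OUTPUTS can
 be sketched as follows. This is only a minimal illustration and does
 not call kmeans2: the toy data and the variable names members, d and
 totalErr are made up for the example, and it assumes that the
 "distance" in sumd is the plain (unsquared) Euclidean distance. Points
 labelled -1 (outliers) contribute to no centroid and to no entry of
 sumd.

```matlab
% Toy data (hypothetical): five 2-D points, the last one marked as an outlier.
X   = [0 0; 0 2; 4 0; 4 2; 10 10];
IDX = [1; 1; 2; 2; -1];

k    = max(IDX);                 % number of clusters; the outlier label -1 is ignored
p    = size(X, 2);               % dimensionality of the points
C    = zeros(k, p);              % k-by-p matrix of centroids
sumd = zeros(1, k);              % 1-by-k within-cluster sums of distances

for j = 1:k
    members = X(IDX == j, :);    % points assigned to cluster j
    C(j, :) = mean(members, 1);  % centroid = mean of the members (dim 1 also handles a single member)
    % Assumption: unsquared Euclidean distance from each member to its centroid.
    d = sqrt(sum((members - repmat(C(j, :), size(members, 1), 1)).^2, 2));
    sumd(j) = sum(d);
end

totalErr = sum(sumd);            % overall error measure of the clustering
```

 Passing the dimension explicitly to mean keeps the centroid a 1-by-p
 row even when a cluster holds a single point, which the shorter form
 mean( X(IDX==j,:) ) would not guarantee.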