justClusters {ClassDiscovery}                                R Documentation
Description

Unsupervised clustering algorithms, such as partitioning around medoids
(pam), K-means (kmeans), or hierarchical clustering (hclust) after cutting
the tree, produce a list of class assignments along with other structure.
To simplify the interface for the BootstrapClusterTest and
PerturbationClusterTest functions, we have written these routines, which
simply extract the cluster assignments.
Usage

cutHclust(data, k, method = "average", metric = "pearson")
cutPam(data, k)
cutKmeans(data, k)
cutRepeatedKmeans(data, k, nTimes)
repeatedKmeans(data, k, nTimes)
Arguments

data:    A numerical data matrix.

k:       The number of classes desired from the algorithm.

method:  Any valid linkage method that can be passed to the hclust
         function.

metric:  Any valid distance metric that can be passed to the
         distanceMatrix function.

nTimes:  An integer; the number of times to repeat the K-means algorithm
         with a different random starting point.
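As a hedged illustration (not part of the package examples) of how the
optional arguments are typically supplied, the sketch below uses a made-up
data matrix laid out as in the Examples section, with samples in columns;
"complete" is simply one valid hclust linkage and "pearson" is the default
metric shown in the Usage section:

# hypothetical data matrix for illustration only (samples in columns)
dat <- matrix(rnorm(100 * 30), nrow=100, ncol=30)
# 'method' must be a linkage accepted by hclust; 'metric' must be a
# distance accepted by distanceMatrix
labels <- cutHclust(dat, k=3, method="complete", metric="pearson")
table(labels)   # how many samples fall in each of the k clusters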
Details

Each of the clustering routines used here has a different structure for
storing cluster assignments. The kmeans function stores the assignments in
a ‘cluster’ component. The pam function uses a ‘clustering’ component. For
hclust, the assignments are produced by a call to the cutree function.
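The following sketch (for illustration; not taken from the package source)
shows where each underlying function keeps its labels. The matrix obj is
hypothetical, and the base functions here cluster its rows:

# hypothetical matrix: the objects to be clustered are in the rows
obj <- matrix(rnorm(30 * 5), nrow=30, ncol=5)

km <- kmeans(obj, centers=3)
km$cluster                        # kmeans: 'cluster' component

library(cluster)                  # provides pam
pm <- pam(obj, k=3)
pm$clustering                     # pam: 'clustering' component

hc <- hclust(dist(obj), method="average")
cutree(hc, k=3)                   # hclust: assignments come from cutree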
It has been observed that the K-means algorithm can converge to different
solutions depending on the starting values of the group centers. We
therefore also include a routine (repeatedKmeans) that runs the K-means
algorithm repeatedly, using different randomly generated starting points
each time, and saves the best result.
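A minimal sketch of this idea (not the package's implementation), reusing
the hypothetical matrix obj from the previous sketch:

best <- NULL
for (i in 1:10) {
  km <- kmeans(obj, centers=3)    # random starting centers each time
  if (is.null(best) || sum(km$withinss) < sum(best$withinss)) {
    best <- km                    # keep the fit with the smallest total
  }                               # within-group sum of squares
}
sum(best$withinss)

Base R offers a similar shortcut through the nstart argument, e.g.
kmeans(obj, centers=3, nstart=10).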
Value

Each of the cut... functions returns a vector of integer values
representing the cluster assignments found by the algorithm.

The repeatedKmeans function returns a list x with three components. The
component x$kmeans is the result of the call to the kmeans function that
produced the best fit to the data. The component x$centers is a matrix
containing the group centers that were used in the best call to kmeans.
The component x$withinss contains the sum of the within-group sums of
squares, which is used as the measure of fitness.
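A brief usage sketch of these components (hedged; dd refers to the
simulated matrix constructed in the Examples section below):

rk <- repeatedKmeans(dd, k=3, nTimes=10)
rk$kmeans$cluster    # assignments from the best kmeans fit
rk$centers           # group centers used in the best call to kmeans
rk$withinss          # sum of within-group sums of squares (fitness)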
Author(s)

Kevin R. Coombes <kcoombes@mdanderson.org>
Examples

# simulate data from three different groups
d1 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d2 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d3 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
dd <- cbind(d1, d2, d3)

cutKmeans(dd, k=3)
cutKmeans(dd, k=4)

cutHclust(dd, k=3)
cutHclust(dd, k=4)

cutPam(dd, k=3)
cutPam(dd, k=4)

cutRepeatedKmeans(dd, k=3, nTimes=10)
cutRepeatedKmeans(dd, k=4, nTimes=10)

# cleanup
rm(d1, d2, d3, dd)