The kmed package is designed for k-medoids-based clustering; its features include the following.
There are four distance options for numerical variables, i.e. Manhattan weighted by rank (mrw), squared Euclidean weighted by rank (ser), squared Euclidean weighted by variance (sev), and unweighted squared Euclidean (se). The distNumeric function computes the distance matrix with the desired option via its method argument.
library(kmed)
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6,1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00000000 0.2638889 0.2530603 0.3225047 0.06944444 0.3841808
## [2,] 0.26388889 0.0000000 0.1558380 0.1419492 0.27777778 0.6480697
## [3,] 0.25306026 0.1558380 0.0000000 0.1033427 0.26694915 0.6372411
## [4,] 0.32250471 0.1419492 0.1033427 0.0000000 0.33639360 0.6727872
## [5,] 0.06944444 0.2777778 0.2669492 0.3363936 0.00000000 0.3702919
## [6,] 0.38418079 0.6480697 0.6372411 0.6727872 0.37029190 0.0000000
Two distance options are available for binary or categorical variables, namely the simple matching (matching) and co-occurrence (cooccur) distances.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000000 0.3333333 0.6666667 0.6666667 0.6666667 0.3333333 0.6666667
## [2,] 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000
## [3,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [4,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [5,] 0.6666667 0.3333333 0.6666667 0.6666667 0.0000000 1.0000000 0.6666667
## [6,] 0.3333333 0.6666667 0.3333333 0.3333333 1.0000000 0.0000000 0.3333333
## [7,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000000 0.4500000 0.7916667 0.7916667 0.7000000 0.5416667 0.7916667
## [2,] 0.4500000 0.0000000 1.2416667 1.2416667 0.2500000 0.9916667 1.2416667
## [3,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [4,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [5,] 0.7000000 0.2500000 0.9916667 0.9916667 0.0000000 1.2416667 0.9916667
## [6,] 0.5416667 0.9916667 0.2500000 0.2500000 1.2416667 0.0000000 0.2500000
## [7,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
There are five distances for mixed-variable data, i.e. Gower (gower), Wishart (wishart), Podani (podani), Huang (huang), and Harikumar-PV (harikumar).
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)   # 7 x 3 binary matrix
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)  # 7 x 3 categorical matrix
mixdata <- cbind(iris[1:7, 1:3], a, a1)
colnames(mixdata) <- c(paste0("num", 1:3), paste0("bin", 1:3), paste0("cat", 1:3))
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
## 1 2 3 4 5 6 7
## 1 0.0000000 0.4228395 0.5648148 0.4799383 0.5817901 0.3966049 0.5262346
## 2 0.4228395 0.0000000 0.6358025 0.5262346 0.3101852 0.7083333 0.6466049
## 3 0.5648148 0.6358025 0.0000000 0.1929012 0.5632716 0.6280864 0.3996914
## 4 0.4799383 0.5262346 0.1929012 0.0000000 0.5895062 0.4876543 0.3981481
## 5 0.5817901 0.3101852 0.5632716 0.5895062 0.0000000 0.8425926 0.4135802
## 6 0.3966049 0.7083333 0.6280864 0.4876543 0.8425926 0.0000000 0.7006173
## 7 0.5262346 0.6466049 0.3996914 0.3981481 0.4135802 0.7006173 0.0000000
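The second distance matrix printed next comes from a different method; the generating call is not shown in the text, so the choice of the Wishart distance here is an assumption. It would be produced by a call along these lines:

```r
# assumed call: same variable indices as the gower example above,
# with the Wishart distance for mixed variables
distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)
```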
## 1 2 3 4 5 6 7
## 1 0.0000000 0.6956580 0.8514500 0.9512342 0.5817152 1.212672 0.8039374
## 2 0.6956580 0.0000000 0.7345406 0.6560616 0.6778587 2.454716 0.8768878
## 3 0.8514500 0.7345406 0.0000000 0.4346564 0.8401231 2.804699 0.4706517
## 4 0.9512342 0.6560616 0.4346564 0.0000000 1.0477822 2.193826 0.5181166
## 5 0.5817152 0.6778587 0.8401231 1.0477822 0.0000000 1.668443 0.6046386
## 6 1.2126721 2.4547164 2.8046989 2.1938258 1.6684434 0.000000 2.3092201
## 7 0.8039374 0.8768878 0.4706517 0.5181166 0.6046386 2.309220 0.0000000
Several k-medoids algorithms exist; partitioning around medoids (PAM), for example, is available in the cluster package. At the moment, the algorithm available in the kmed package is the simple and fast k-medoids of Park and Jun.
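The cross-tabulation below can be reproduced by a call like the following; the exact call is not shown in the text, so the iterate value is an assumption. It clusters the iris data with fastkmed on the mrw distance matrix computed earlier and tabulates the memberships against the species labels.

```r
# assumed call: Park and Jun fast k-medoids on the mrw distance matrix
result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
table(result$cluster, iris[, 5])
```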
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 39 3
## 3 0 11 47
To evaluate the clustering algorithm, a bootstrap evaluation can be run by writing a wrapper function. This function's only input arguments must be a distance matrix and the number of clusters, and its only output must be the cluster membership.
k <- 3
parkboot <- function(x, nclust) {
  res <- fastkmed(x, nclust, iterate = 50)
  return(res$cluster)
}
irisboot <- clustboot(mrwdist, nclust=k, parkboot, nboot=50)
irisboot[1:5,c(1:5,46:50)]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 1 0 1 3 0 3 1 1
## [2,] 3 0 1 1 1 3 0 0 1 1
## [3,] 0 0 0 0 1 3 3 0 0 1
## [4,] 3 0 1 0 0 3 3 0 0 1
## [5,] 0 3 0 1 1 0 3 3 1 1
We can change the algorithm, for example to kmeans from the stats package.
kmfunc <- function(x, nclust) {
  res <- kmeans(x, nclust)
  return(res$cluster)
}
kmboot <- clustboot(num, nclust = k, kmfunc, nboot = 50, diss = FALSE)
kmboot[1:5,c(1:5,46:50)]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 2 0 3 3 0 1 2 2 0
## [2,] 1 0 3 3 3 2 1 2 2 2
## [3,] 1 2 3 3 3 2 1 2 2 2
## [4,] 0 2 0 3 0 2 1 2 0 2
## [5,] 1 0 3 0 0 1 1 2 0 0
A consensus matrix (n x n) can be produced from the bootstrap replicate matrix. The reorder input is a function that reorders the objects in the consensus matrix; like the clustering function, its only input arguments must be a distance matrix and the number of clusters, and its only output must be the cluster membership. The resulting matrix can be visualized directly as a heatmap.
wardorder <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "ward.D2")  # hclust requires a "dist" object
  member <- cutree(res, nclust)
  return(member)
}
consensusiris <- consensusmatrix(irisboot, nclust = k, wardorder)
consensusiris[c(1:5,51:55,101:105),c(1:5,51:55,101:105)]
## 1 1 1 1 1 2 2 2 2 2 2
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0 0 0 0 0 1.0000000 0.8750000 0.9310345 0.6296296 0.9047619 0.7333333
## 2 0 0 0 0 0 0.8750000 1.0000000 0.8095238 0.9500000 1.0000000 0.6000000
## 2 0 0 0 0 0 0.9310345 0.8095238 1.0000000 0.5454545 0.8500000 0.7586207
## 2 0 0 0 0 0 0.6296296 0.9500000 0.5454545 1.0000000 0.8333333 0.3200000
## 2 0 0 0 0 0 0.9047619 1.0000000 0.8500000 0.8333333 1.0000000 0.6666667
## 2 0 0 0 0 0 0.7333333 0.6000000 0.7586207 0.3200000 0.6666667 1.0000000
## 2 0 0 0 0 0 0.7272727 0.6000000 0.6875000 0.3684211 0.7500000 1.0000000
## 2 0 0 0 0 0 0.9354839 0.8333333 0.9200000 0.5238095 0.8800000 0.8064516
## 2 0 0 0 0 0 0.7407407 0.6190476 0.7000000 0.3500000 0.6842105 1.0000000
## 2 0 0 0 0 0 0.7083333 0.6000000 0.7272727 0.3000000 0.6500000 1.0000000
## 2 2 2 2
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.7272727 0.9354839 0.7407407 0.7083333
## 2 0.6000000 0.8333333 0.6190476 0.6000000
## 2 0.6875000 0.9200000 0.7000000 0.7272727
## 2 0.3684211 0.5238095 0.3500000 0.3000000
## 2 0.7500000 0.8800000 0.6842105 0.6500000
## 2 1.0000000 0.8064516 1.0000000 1.0000000
## 2 1.0000000 0.8421053 1.0000000 1.0000000
## 2 0.8421053 1.0000000 0.8214286 0.8000000
## 2 1.0000000 0.8214286 1.0000000 1.0000000
## 2 1.0000000 0.8000000 1.0000000 1.0000000
To produce a heatmap of the consensus matrix, clustheatmap can be applied to it. Here, the consensus matrix heatmap of the iris data clustered by the Park and Jun algorithm is produced.
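The call producing that heatmap is not shown in the text; a call along these lines, using the consensusiris matrix computed above, would generate it (the title string is an assumption):

```r
# assumed call: heatmap of the Park and Jun consensus matrix
clustheatmap(consensusiris, "Iris Data via Park and Jun")
```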
We can also create a heatmap of the kmeans algorithm for the iris data.
consensusiris2 <- consensusmatrix(kmboot, nclust = k, wardorder)
clustheatmap(consensusiris2, "Iris Data via kmeans")