The kmed package is designed for k-medoids based clustering analysis; its features include:
For numeric variables, there are four distance options: Manhattan weighted by rank (mrw), squared Euclidean weighted by rank (ser), squared Euclidean weighted by variance (sev), and unweighted squared Euclidean (se). The distNumeric function computes the selected distance via its method argument.
library(kmed)
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6, 1:6]
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0.00000000 0.2638889 0.2530603 0.3225047 0.06944444 0.3841808
## [2,] 0.26388889 0.0000000 0.1558380 0.1419492 0.27777778 0.6480697
## [3,] 0.25306026 0.1558380 0.0000000 0.1033427 0.26694915 0.6372411
## [4,] 0.32250471 0.1419492 0.1033427 0.0000000 0.33639360 0.6727872
## [5,] 0.06944444 0.2777778 0.2669492 0.3363936 0.00000000 0.3702919
## [6,] 0.38418079 0.6480697 0.6372411 0.6727872 0.37029190 0.0000000
Two distance options are available for binary or categorical variables, namely the simple matching (matching) and co-occurrence (cooccur) distances.
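The two matrices below appear without the chunk that generated them. A minimal sketch of how they could be produced, assuming a seeded 7 x 3 binary matrix `a` and kmed's matching and cooccur functions; the seed and example data are assumptions, so the exact values may differ from those shown:

```r
library(kmed)

# Hypothetical example data: the seed and dimensions are assumptions
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)

matching(a, a)  # simple matching distance between all pairs of rows
cooccur(a)      # co-occurrence distance
```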
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000000 0.3333333 0.6666667 0.6666667 0.6666667 0.3333333 0.6666667
## [2,] 0.3333333 0.0000000 1.0000000 1.0000000 0.3333333 0.6666667 1.0000000
## [3,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [4,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [5,] 0.6666667 0.3333333 0.6666667 0.6666667 0.0000000 1.0000000 0.6666667
## [6,] 0.3333333 0.6666667 0.3333333 0.3333333 1.0000000 0.0000000 0.3333333
## [7,] 0.6666667 1.0000000 0.0000000 0.0000000 0.6666667 0.3333333 0.0000000
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.0000000 0.4500000 0.7916667 0.7916667 0.7000000 0.5416667 0.7916667
## [2,] 0.4500000 0.0000000 1.2416667 1.2416667 0.2500000 0.9916667 1.2416667
## [3,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [4,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
## [5,] 0.7000000 0.2500000 0.9916667 0.9916667 0.0000000 1.2416667 0.9916667
## [6,] 0.5416667 0.9916667 0.2500000 0.2500000 1.2416667 0.0000000 0.2500000
## [7,] 0.7916667 1.2416667 0.0000000 0.0000000 0.9916667 0.2500000 0.0000000
There are five distances for mixed-variable data, i.e., Gower (gower), Wishart (wishart), Podani (podani), Huang (huang), and Harikumar-PV (harikumar).
set.seed(1)  # seed assumed for reproducibility
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)   # binary variables (`a` was previously undefined)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)  # categorical variables
mixdata <- cbind(iris[1:7, 1:3], a, a1)
colnames(mixdata) <- c(paste0("num", 1:3), paste0("bin", 1:3), paste0("cat", 1:3))
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
## 1 2 3 4 5 6 7
## 1 0.0000000 0.4228395 0.5648148 0.4799383 0.5817901 0.3966049 0.5262346
## 2 0.4228395 0.0000000 0.6358025 0.5262346 0.3101852 0.7083333 0.6466049
## 3 0.5648148 0.6358025 0.0000000 0.1929012 0.5632716 0.6280864 0.3996914
## 4 0.4799383 0.5262346 0.1929012 0.0000000 0.5895062 0.4876543 0.3981481
## 5 0.5817901 0.3101852 0.5632716 0.5895062 0.0000000 0.8425926 0.4135802
## 6 0.3966049 0.7083333 0.6280864 0.4876543 0.8425926 0.0000000 0.7006173
## 7 0.5262346 0.6466049 0.3996914 0.3981481 0.4135802 0.7006173 0.0000000
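The second distance matrix, shown next, also lacks its generating call. Given the list of methods above, a sketch reusing mixdata from the previous chunk and assuming the Wishart distance (the method choice is an assumption, so the values may not match exactly):

```r
library(kmed)

# Hypothetical reconstruction of mixdata from the previous chunk
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
mixdata <- cbind(iris[1:7, 1:3], a, a1)

# Assumed call: Wishart distance on the same mixed data
distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)
```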
## 1 2 3 4 5 6 7
## 1 0.0000000 0.6956580 0.8514500 0.9512342 0.5817152 1.212672 0.8039374
## 2 0.6956580 0.0000000 0.7345406 0.6560616 0.6778587 2.454716 0.8768878
## 3 0.8514500 0.7345406 0.0000000 0.4346564 0.8401231 2.804699 0.4706517
## 4 0.9512342 0.6560616 0.4346564 0.0000000 1.0477822 2.193826 0.5181166
## 5 0.5817152 0.6778587 0.8401231 1.0477822 0.0000000 1.668443 0.6046386
## 6 1.2126721 2.4547164 2.8046989 2.1938258 1.6684434 0.000000 2.3092201
## 7 0.8039374 0.8768878 0.4706517 0.5181166 0.6046386 2.309220 0.0000000
There are several k-medoids algorithms; partitioning around medoids (PAM), for example, is available in the cluster package. At the moment, the algorithm available in the kmed package is the simple and fast k-medoids of Park and Jun, implemented in fastkmed.
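The cross-tabulation below is shown without its generating code. A sketch that would produce a table of this form, assuming fastkmed is run on the rank-weighted Manhattan distance from the first example; the ncluster and iterate values are assumptions:

```r
library(kmed)

num <- as.matrix(iris[, 1:4])
mrwdist <- distNumeric(num, num, method = "mrw")

# Park and Jun's fast k-medoids; iterate = 50 is an assumed setting
result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
table(result$cluster, iris[, 5])
```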
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 39 3
## 3 0 11 47
To evaluate the clustering algorithm, a bootstrap evaluation can be run by creating a function. This function must take only a distance matrix and a number of clusters as input arguments, and must return only the cluster membership.
k <- 3
parkboot <- function(x, nclust) {
  res <- fastkmed(x, nclust, iterate = 50)
  return(res$cluster)
}
irisboot <- clustboot(mrwdist, nclust = k, parkboot, nboot = 50)
irisboot[1:5, c(1:5, 46:50)]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 0 1 0 1 3 0 3 1 1
## [2,] 3 0 1 1 1 3 0 0 1 1
## [3,] 0 0 0 0 1 3 3 0 0 1
## [4,] 3 0 1 0 0 3 3 0 0 1
## [5,] 0 3 0 1 1 0 3 3 1 1
We can change the algorithm, for example, to kmeans from the stats package; since kmeans works on the data matrix rather than on a distance matrix, diss = FALSE is set.
kmboot <- function(x, nclust) {
  res <- kmeans(x, nclust)
  return(res$cluster)
}
kmboot <- clustboot(num, nclust = k, kmboot, nboot = 50, diss = FALSE)
kmboot[1:5, c(1:5, 46:50)]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 0 2 0 3 3 0 1 2 2 0
## [2,] 1 0 3 3 3 2 1 2 2 2
## [3,] 1 2 3 3 3 2 1 2 2 2
## [4,] 0 2 0 3 0 2 1 2 0 2
## [5,] 1 0 3 0 0 1 1 2 0 0
An n x n consensus matrix can be produced from the bootstrap replicate matrix. The reorder argument takes a function that reorders the objects in the consensus matrix; like the bootstrap function, it must take only a distance matrix and a number of clusters as input arguments and return only the cluster membership. The consensus matrix can then be visualized directly as a heatmap.
wardorder <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "ward.D2")  # hclust requires a dist object
  member <- cutree(res, nclust)
  return(member)
}
consensusiris <- consensusmatrix(irisboot, nclust = k, wardorder)
consensusiris[c(1:5, 51:55, 101:105), c(1:5, 51:55, 101:105)]
## 1 1 1 1 1 2 2 2 2 2 2
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 1 1 1 1 1 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0 0 0 0 0 1.0000000 0.8750000 0.9310345 0.6296296 0.9047619 0.7333333
## 2 0 0 0 0 0 0.8750000 1.0000000 0.8095238 0.9500000 1.0000000 0.6000000
## 2 0 0 0 0 0 0.9310345 0.8095238 1.0000000 0.5454545 0.8500000 0.7586207
## 2 0 0 0 0 0 0.6296296 0.9500000 0.5454545 1.0000000 0.8333333 0.3200000
## 2 0 0 0 0 0 0.9047619 1.0000000 0.8500000 0.8333333 1.0000000 0.6666667
## 2 0 0 0 0 0 0.7333333 0.6000000 0.7586207 0.3200000 0.6666667 1.0000000
## 2 0 0 0 0 0 0.7272727 0.6000000 0.6875000 0.3684211 0.7500000 1.0000000
## 2 0 0 0 0 0 0.9354839 0.8333333 0.9200000 0.5238095 0.8800000 0.8064516
## 2 0 0 0 0 0 0.7407407 0.6190476 0.7000000 0.3500000 0.6842105 1.0000000
## 2 0 0 0 0 0 0.7083333 0.6000000 0.7272727 0.3000000 0.6500000 1.0000000
## 2 2 2 2
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 1 0.0000000 0.0000000 0.0000000 0.0000000
## 2 0.7272727 0.9354839 0.7407407 0.7083333
## 2 0.6000000 0.8333333 0.6190476 0.6000000
## 2 0.6875000 0.9200000 0.7000000 0.7272727
## 2 0.3684211 0.5238095 0.3500000 0.3000000
## 2 0.7500000 0.8800000 0.6842105 0.6500000
## 2 1.0000000 0.8064516 1.0000000 1.0000000
## 2 1.0000000 0.8421053 1.0000000 1.0000000
## 2 0.8421053 1.0000000 0.8214286 0.8000000
## 2 1.0000000 0.8214286 1.0000000 1.0000000
## 2 1.0000000 0.8000000 1.0000000 1.0000000
To produce a heatmap of the consensus matrix, clustheatmap can be applied to it. Here the consensus matrix heatmap of the iris data clustered by the Park and Jun algorithm is produced.
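The call itself is missing from the text; assuming clustheatmap takes the consensus matrix and a plot title (the title string here is an assumption), it would be:

```r
# Assumed call, reusing consensusiris from the previous chunk
clustheatmap(consensusiris, "Iris Data via Park and Jun")
```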
We can also create a heatmap from the kmeans bootstrap of the iris data.
consensusiris2 <- consensusmatrix(kmboot, nclust = k, wardorder)
clustheatmap(consensusiris2, "Iris Data via kmeans")