ClusTorus is a package for clustering angular data, especially for protein structure data. ClusTorus provides various clustering algorithms designed with conformal prediction framework, which can deal with the outliers. The package suggests various methods for fitting the algorithms, and some of them will be introduced soon. The package also provides some simple tools for handling angluar data, such as angular subtraction, computing angular distance, etc. Now, check how to use ClusTorus briefly.
ClusTorus provides a protein structure data of SARS-CoV-2, which provoked the recent pandemic situation with COVID-19. With this data, we will introduce how to load the data and how to use data handling tools of ClusTorus. Since SARS-CoV-2 data is originally extracted from protein data bank, it contains various information for describing protein structure. For our purpose, we will only use SARS_CoV_2$tbl
, which only contains angluar information of the data. Moreover, for simplicity, we will only use backbone-chain angles, that is, \(\phi\) and \(\psi\) columns.
library(ClusTorus)
library(tidyverse)
data <- SARS_CoV_2$tbl
data <- data.frame(data, id = rownames(data)) %>%
separate(id,into = c(',','position','type','rest'))
data <- data %>% select(-",")
data <- data[(data$type == "B"), ]
data <- na.omit(data[, 1:2])
head(data)
#> phi psi
#> 27.B.ALA -52.38603 146.62861
#> 28.B.TYR -134.44609 152.58700
#> 29.B.THR -134.15780 175.49552
#> 30.B.ASN -89.74427 117.60396
#> 31.B.SER -71.24783 -15.14771
#> 32.B.PHE 51.81731 -117.41821
To use the functions of ClusTorus, we need to convert the data to be on radian scale, especially on \([0, 2\pi)\).
data <- data / 180 * pi
data <- on.torus(data)
ggplot(data = as.data.frame(data), mapping = aes(x = phi, y = psi)) + geom_point()
on.torus
converts the radian scale data to be on \([0, 2\pi)^d\), where \(d\) means the number of dimension. In this case, \(d = 2\) and thus the data is converted to be on \([0, 2\pi)^2\).
Now, we are ready to implement clustering algorithms to the data. ClusTorus provides various options for clustering, but we will provide only several cases: “kde”, “mixture - axis aligned”, and “kmeans - general”. On the other hand, we need to choose hyperparameters: the number of modes or ellipsoids \(J\) and the significance level \(\alpha\). Before choosing the hyperparameter, we will implement the model fitting function icp.torus.score
with various hyperparameter options, first.
set.seed(2021)
Jvec <- 3:35
l <- list()
for (j in Jvec){
l[[j]] <- icp.torus.score(as.matrix(data),
method = "all",
mixturefitmethod = "axis-aligned",
kmeansfitmethod = "general",
param = list(J = j, concentration = 25),
verbose = FALSE)
}
The list l
contains the icp.torus
objects, which consist of fitted parameters for generating clusters, by varying the hypterparameter \(J\). That is, These objects are optimally fitted ingredients for generating clusters for given \(J\). By specifying the significance level, we can get the clusters. But, how to generate optimal clusters? One may think that the hyperparameter which generates the cluster of the minimum volume/area will be the optimum for given significance level. For dealing with such volume/area based hyperparameter searching, ClusTorus provides icp.torus.eval
function, which generates boolean matrices for indicating the inclusion of pre-specified evaluation points for given significance level(s). You need to know that the result also changes by varying \(\alpha\). Thus, we need to carefully choose \(\alpha\) for better performance under pre-specified hyperparameter-choosing criterion. But, for simplicity, we will assume that the significance level \(\alpha\) is given with \(0.1\). Then, we can simply get the results by using pre-obtained icp.torus
objects as below:
alpha <- 0.1
out <- data.frame()
for (j in Jvec){
a<-icp.torus.eval(l[[j]], level = alpha, eval.point = grid.torus())
mu_kde <- sum(a$Chat_kde[,1])
mu_e <- sum(a$Chat_e[,1])
mu_kmeans <- sum(a$Chat_kmeans[,1])
out <- rbind(out, data.frame(J = j, mu_kde = mu_kde, mu_e = mu_e, mu_kmeans = mu_kmeans))
}
head(out)
#> J mu_kde mu_e mu_kmeans
#> 1 3 1619 2316 2239
#> 2 4 1372 2209 2024
#> 3 5 1601 2221 1645
#> 4 6 1483 1896 2253
#> 5 7 1799 1952 1853
#> 6 8 1578 1871 1703
We count the number of points included in the generated clusters. In this 2-dimensional case, grid.torus()
generates the 10,000 grid points in default. That is, icp.torus.eval
generates the matrices for indicating whether each grid point is included in the clusters. By counting the included points, we can approximately estimate the volume of clusters. mu_kde
shows the volume of cluster generated by option kde
, mu_e
by option mixture
with axis-aligned
, and mu_kmeans
by option kmeans
with general
. Now, with “minimum volume” criterion, choose optimal \(J\) for each method, and generate the clusters.
J_kde <- out$J[which.min(out$mu_kde)]
J_e <- out$J[which.min(out$mu_e)]
J_kmeans <- out$J[which.min(out$mu_kmeans)]
icp.torus.kde <- l[[J_kde]]
icp.torus.mix <- l[[J_e]]
icp.torus.kmeans <- l[[J_kmeans]]
c_mix <- cluster.assign.torus(data, icp.torus.mix, level = alpha)
c_kmeans <- cluster.assign.torus(data, icp.torus.kmeans, level = alpha)
If there are some difficulties for evaluating such volume or for implementing your hyperparameter-choosing criterion, ClusTorus alternatively suggests the function hyperparam.torus
, which suggests two hyperparameters \(J\) and \(\alpha\). For the choice of \(J\), there are several options such as “elbow”, “risk”, “AIC” and “BIC”. The option “elbow” is actually the volume based criterion, which is already described above. “risk” is the criterion based on the sum of the conformity scores which are evaluated under the inductive conformal prediction framework, that is, evaluated with the function icp.torus.score
. “AIC” and “BIC” are well-known information criteria, proposed by Akaike (1974) and Schwarz (1978) respectively. These options can be designated with the argument option
. The function hyperparam.torus
generates data.frame
objects for \(J\) versus the evaluated criterion and for \(\alpha\) versus the number of resultant clusters. It also generates icp.torus
object based on the selected \(J\).
output <- hyperparam.torus(data, icp.torus.objects = l, option = "risk")
output$IC.results
#> J criterion
#> 1 3 1730.001
#> 2 4 1682.979
#> 3 5 1743.655
#> 4 6 1547.127
#> 5 7 1644.920
#> 6 8 1538.558
#> 7 9 1571.729
#> 8 10 1497.837
#> 9 11 1473.851
#> 10 12 1441.767
#> 11 13 1420.352
#> 12 14 1534.647
#> 13 15 1465.857
#> 14 16 1450.854
#> 15 17 1465.153
#> 16 18 1536.920
#> 17 19 1578.592
#> 18 20 1520.175
#> 19 21 1596.795
#> 20 22 1516.367
#> 21 23 1483.848
#> 22 24 1528.707
#> 23 25 1608.701
#> 24 26 1392.156
#> 25 27 1352.146
#> 26 28 1463.455
#> 27 29 1786.932
#> 28 30 1409.073
#> 29 31 1678.029
#> 30 32 1958.643
#> 31 33 1637.330
#> 32 34 1551.388
#> 33 35 1603.736
output$alpha.results
#> alpha ncluster
#> 1 0.002427184 1
#> 2 0.004854369 1
#> 3 0.007281553 1
#> 4 0.009708738 1
#> 5 0.012135922 1
#> 6 0.014563107 1
#> 7 0.016990291 1
#> 8 0.019417476 2
#> 9 0.021844660 2
#> 10 0.024271845 2
#> 11 0.026699029 2
#> 12 0.029126214 2
#> 13 0.031553398 2
#> 14 0.033980583 2
#> 15 0.036407767 2
#> 16 0.038834951 2
#> 17 0.041262136 2
#> 18 0.043689320 2
#> 19 0.046116505 3
#> 20 0.048543689 3
#> 21 0.050970874 3
#> 22 0.053398058 3
#> 23 0.055825243 3
#> 24 0.058252427 3
#> 25 0.060679612 3
#> 26 0.063106796 3
#> 27 0.065533981 3
#> 28 0.067961165 3
#> 29 0.070388350 3
#> 30 0.072815534 3
#> 31 0.075242718 3
#> 32 0.077669903 5
#> 33 0.080097087 5
#> 34 0.082524272 5
#> 35 0.084951456 5
#> 36 0.087378641 6
#> 37 0.089805825 6
#> 38 0.092233010 6
#> 39 0.094660194 6
#> 40 0.097087379 7
#> 41 0.099514563 7
#> 42 0.101941748 7
#> 43 0.104368932 7
#> 44 0.106796117 7
#> 45 0.109223301 7
#> 46 0.111650485 7
#> 47 0.114077670 7
#> 48 0.116504854 7
#> 49 0.118932039 7
#> 50 0.121359223 7
#> 51 0.123786408 9
#> 52 0.126213592 9
#> 53 0.128640777 9
#> 54 0.131067961 9
#> 55 0.133495146 9
#> 56 0.135922330 9
#> 57 0.138349515 9
#> 58 0.140776699 9
#> 59 0.143203883 10
#> 60 0.145631068 10
#> 61 0.148058252 10
output$optim$hyperparam
#> J alpha
#> 27.00000000 0.06067961
With cluster.assign.torus
, we can generate the cluster for option mixture
and kmeans
, and for each data point, the label of cluster is assigned. Moreover, for the cases of mixture
and kmeans
, we can check how the cluster is generated directly as below:
c_mix$mixture$plot
#> [[1]]
c_kmeans$kmeans$plot
#> [[1]]
Also, with the assigned cluster labels, we can visualize the clusters with colored data points;
library(cowplot)
result.dat.mix <- data.frame(data, membership = as.factor(c_mix$mixture$cluster.id.outlier)) %>%
mutate(membership = ifelse(membership == max(c_mix$mixture$cluster.id.outlier), "out", membership))
result.dat.kmeans <- data.frame(data, membership = as.factor(c_kmeans$kmeans$cluster.id.outlier)) %>%
mutate(membership = ifelse(membership == max(c_kmeans$kmeans$cluster.id.outlier), "out", membership))
g1 <- ggplot(data = result.dat.mix, aes(phi,psi, color = membership)) + geom_point() +
scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))
g2 <- ggplot(data = result.dat.kmeans, aes(phi,psi, color = membership)) + geom_point() +
scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))
plot_grid(g1, g2)
On the other hand, to check the clustering result of method kde
, we can visualize the results as below:
ia <- icp.torus.eval(icp.torus.kde, level = alpha, eval.point = grid.torus())
b <- data.frame(ia$eval.point, ia$Chat_kde == 1)
colnames(b) <- c("phi","psi","C_kde")
g0 <- ggplot() + geom_contour(aes(phi, psi, z = ifelse(C_kde,1,0)), data = b, size = 1,lineend = "round" ) +
geom_point(mapping = aes(x,y), data = data.frame(x = data[,1],y =data[,2])) +
scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))
g0
If you clearly follow the examples, you may need the detailed manual. Check the package manual and see the details.