# Description

The NCutYX package includes functions for clustering genomic data using graph theory. Each function in this package is a variation on the NCut measure used to cluster vertices in a graph. The running theme is to use data sets from different sources and types to improve the clustering results.

• The ncut function clusters the columns of a data set using the classical normalized cut measure from graph theory.
• The ancut function clusters one type of data, say gene expressions, with the help of a second type of data, like copy number aberrations.
• The muncut function clusters a three-layered graph into K different clusters of 3 different data types, say gene expression, copy number aberrations and proteins.
• The pwncut function clusters the columns of X into K clusters by giving a weight for each cluster while penalizing them to be similar to each other.
• The mlbncut function works similarly to muncut but it also clusters samples into R clusters.
• The awncut function builds similarity matrices for the rows of X and an assisted data set Z, and clusters them into K groups while conducting feature selection based on the AWNCut method.

To install the latest development version:

1. Install and load the devtools package.
2. Run install_github("Seborinos/NCutYX").
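Assuming a standard R session with internet access, the two steps above can be run as:

```r
# Install the development version of NCutYX from GitHub.
# install.packages("devtools")  # uncomment if devtools is not yet installed
library(devtools)
install_github("Seborinos/NCutYX")
```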

# NCut

The ncut function clusters the columns of Y into K groups using the classical normalized cut (NCut) graph measure. It builds a similarity matrix for the columns of Y, using correlation, Euclidean, or Gaussian distances, and then minimizes the NCut measure with the cross-entropy method, a Monte Carlo optimization technique.

## Method

Consider a dataset with $$n$$ iid samples. For the $$i$$th sample, assume that measurements are available on $$p$$ variables, denoted as $$Y_i = (Y_{i1}, Y_{i2}, \cdots, Y_{ip})'$$. Define the weight matrix $$\mathbf{W}=(w_{jl})_{p\times p}$$, where the non-negative element $$w_{jl}$$ measures the similarity between columns $$j$$ and $$l$$ of $$\mathbf{Y}$$. We can define $$w_{jl}$$ to equal the Gaussian kernel, the absolute value of the sample correlation, or the inverse of the Euclidean distance between columns $$j$$ and $$l$$.

Denote $$A=\{A_1, \ldots, A_K\}$$ as a partition of $$\{1, \ldots, p\}$$ into $$K$$ disjoint clusters, and for a set $$A_k$$, denote $$A_k^c$$ as its complement. The NCut measure is then defined as

$$\text{NCut}(A)=\sum \limits_{k=1}^{K}\frac{\text{cut}(A_k,A_k^c;\mathbf{W})}{\text{cutvol}(A_k;\mathbf{W})},$$

where

$$\text{cut}(A_k,A_k^c;\mathbf{W})=\sum \limits_{j\in A_k,\, l \in A_k^c} w_{jl} \quad \text{and} \quad \text{cutvol}(A_k; \mathbf{W})=\sum \limits_{j,l \in A_k} w_{jl}.$$

With a fixed $$K$$, the optimal clustering minimizes the NCut measure. The minimization is carried out with the cross-entropy method, a Monte Carlo optimization approach.
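To make the definition concrete, here is a minimal sketch of the NCut measure itself, written directly from the formulas above (this is an illustration, not the package's internal implementation; the function name `ncut_measure` is ours):

```r
# NCut(A) = sum_k cut(A_k, A_k^c; W) / cutvol(A_k; W)
# W:      a p x p non-negative similarity matrix
# labels: a length-p vector of cluster assignments
ncut_measure <- function(W, labels) {
  total <- 0
  for (k in unique(labels)) {
    in_k     <- labels == k
    cut_k    <- sum(W[in_k, !in_k]) # similarity leaving A_k
    cutvol_k <- sum(W[in_k, in_k])  # similarity within A_k
    total    <- total + cut_k / cutvol_k
  }
  total
}

# Toy example: two well-separated blocks give a small NCut value.
W <- matrix(0.05, 6, 6)
W[1:3, 1:3] <- 0.9
W[4:6, 4:6] <- 0.9
labels <- c(1, 1, 1, 2, 2, 2)
ncut_measure(W, labels) # 1/9, since cut = 0.45 and cutvol = 8.1 per cluster
```

A partition that matches the blocks minimizes the measure; mixing the blocks (e.g. alternating labels) produces a much larger value.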

## Example

First, we set up the simulation parameters.

```r
library(MASS)

n <- 100 # Sample size.
B <- 30  # Number of iterations of the cross-entropy method.
p <- 50  # Number of columns of Y.
```
We define the covariance matrix, the true incidence function and sample the data.

```r
S <- matrix(0.2, p, p)
S[1:(p/2), (p/2+1):p] <- 0       # The two blocks of columns are uncorrelated.
S[(p/2+1):p, 1:(p/2)] <- 0
S <- S - diag(diag(S)) + diag(p) # Set the diagonal to 1.
mu <- rep(0, p)

W0 <- matrix(1, p, p)            # True incidence: 1 between columns in different clusters.
W0[1:(p/2), 1:(p/2)] <- 0
W0[(p/2+1):p, (p/2+1):p] <- 0
Denum <- sum(W0)

Y <- mvrnorm(n, mu, S)
```
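As a quick sanity check on this construction (restated here so it runs on its own), the first p/2 columns share covariance 0.2 within their block, are uncorrelated with the second block, and have unit variance:

```r
p <- 50
S <- matrix(0.2, p, p)
S[1:(p/2), (p/2+1):p] <- 0
S[(p/2+1):p, 1:(p/2)] <- 0
S <- S - diag(diag(S)) + diag(p)

all(S[1:(p/2), (p/2+1):p] == 0) # TRUE: cross-block covariances are 0
all(diag(S) == 1)               # TRUE: unit variances
S[1, 2] == 0.2                  # TRUE: within-block covariance
```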

Apply ncut to the data Y and calculate the estimation error of the clusters.

```r
Res <- ncut(Y,
            K     = 2,
            B     = B,
            N     = 1000,
            dist  = 'correlation',
            scale = TRUE,
            q     = 0.2,
            sigma = 0.1)

Cx  <- Res[[2]] # Matrix of cluster memberships.
f11 <- matrix(Cx[, 1], p, 1)
f12 <- matrix(Cx[, 2], p, 1)

# Proportion of pairs of columns from different true clusters that are
# assigned to the same estimated cluster; 0 means perfect recovery.
errorL <- sum((f11 %*% t(f11)) * W0)/Denum + sum((f12 %*% t(f12)) * W0)/Denum
errorL
```
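To see what this error measure captures, consider a perfect assignment in which each estimated cluster equals one of the two true blocks. Since W0 places weight only between columns from different true clusters, the error is exactly 0. This is a self-contained check of the error formula, not package code:

```r
p <- 50
W0 <- matrix(1, p, p)
W0[1:(p/2), 1:(p/2)] <- 0
W0[(p/2+1):p, (p/2+1):p] <- 0
Denum <- sum(W0)

f11 <- matrix(c(rep(1, p/2), rep(0, p/2)), p, 1) # indicator of true cluster 1
f12 <- 1 - f11                                   # indicator of true cluster 2

errorL <- sum((f11 %*% t(f11)) * W0)/Denum + sum((f12 %*% t(f12)) * W0)/Denum
errorL # 0 for a perfect assignment
```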