---
title: "Running SIMLR"
author: 
  - Daniele Ramazzotti
  - Bo Wang
  - Luca De Sano
  - Serafim Batzoglou
date: "`r format(Sys.time(), '%B %d, %Y')`"
graphics: yes
package: SIMLR
output:
  BiocStyle::html_document:
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Running SIMLR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{SIMLR,BiocStyle}
---

The external R package igraph is required for the computation of the normalized mutual information to 
assess the results of the clustering. 

```{r}
library(igraph)
library(SIMLR)
```

We now run SIMLR as an example on an input dataset from Buettner, Florian, et al. "Computational analysis of 
cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells." Nature 
biotechnology 33.2 (2015): 155-160. For this dataset we have a ground true of 3 cell populations, i.e., clusters. 

```{r}
set.seed(11111)
data(BuettnerFlorian)
example = SIMLR(X = BuettnerFlorian$in_X, c = BuettnerFlorian$n_clust, cores.ratio = 0)
```

We now compute the normalized mutual information between the inferred clusters by SIMLR and the true ones. 
This measure with values in [0,1], allows us to assess the performance of the clustering with higher values 
reflecting better performance. 

```{r}
nmi_1 = compare(BuettnerFlorian$true_labs[,1], example$y$cluster, method="nmi")
print(nmi_1)
```

As a further understanding of the results, we now visualize the cell populations in a plot. 

```{r fig.width=12, fig.height=8, warning=FALSE, fig.cap="Visualization of the 3 cell populations retrieved by SIMLR on the dataset by Florian, et al."}
plot(example$ydata, 
    col = c(topo.colors(BuettnerFlorian$n_clust))[BuettnerFlorian$true_labs[,1]], 
    xlab = "SIMLR component 1",
    ylab = "SIMLR component 2",
    pch = 20,
    main="SIMILR 2D visualization for BuettnerFlorian")
```

We also run SIMLR feature ranking on the same inputs to get a rank of the key genes with the related pvalues. 

```{r}
set.seed(11111)
ranks = SIMLR_Feature_Ranking(A=BuettnerFlorian$results$S,X=BuettnerFlorian$in_X)
head(ranks$pval)
head(ranks$aggR)
```

Similarly, we show an example for SIMLR large scale on an input dataset being a reduced version of the dataset 
provided in Buettner, Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell 
RNA-seq." Science 347.6226 (2015): 1138-1142. For this dataset we have a ground true of 9 cell populations, i.e., clusters. 
But we should notice that here we use for computational reasons a reduced version of the data, so please refer to the 
original publication for the full data. 

```{r}
set.seed(11111)
data(ZeiselAmit)
example_large_scale = SIMLR_Large_Scale(X = ZeiselAmit$in_X, c = ZeiselAmit$n_clust, kk = 10)
```

We compute the normalized mutual information between the inferred clusters by SIMLR large scale and the true ones. 

```{r}
nmi_2 = compare(ZeiselAmit$true_labs[,1], example_large_scale$y$cluster, method="nmi")
print(nmi_2)
```

As a further understanding of the results, also in this case we visualize the cell populations in a plot. 

```{r fig.width=12, fig.height=8, warning=FALSE, fig.cap="Visualization of the 9 cell populations retrieved by SIMLR large scale on the dataset by Zeisel, Amit, et al."}
plot(example_large_scale$ydata, 
    col = c(topo.colors(ZeiselAmit$n_clust))[ZeiselAmit$true_labs[,1]], 
    xlab = "SIMLR component 1",
    ylab = "SIMLR component 2",
    pch = 20,
    main="SIMILR 2D visualization for ZeiselAmit")
```

Now, as a final example, we also provide the results of two heuristics (see the original SIMLR paper) to 
estimate the number of clusters from data. 

```{r}
set.seed(53900)
NUMC = 2:5
res_example = SIMLR_Estimate_Number_of_Clusters(BuettnerFlorian$in_X,
    NUMC = NUMC,
    cores.ratio = 0)
```

Best number of clusters, K1 heuristic: 
```{r}
NUMC[which.min(res_example$K1)]
```

K2 heuristic: 
```{r}
NUMC[which.min(res_example$K2)]
```

Results of the two heuristics: 
```{r}
res_example
```