---
title: "An introduction to mbkmeans"
author: "Yuwei Ni and Davide Risso"
date: "Last modified: March 28, 2019; Compiled: `r format(Sys.time(), '%B %d, %Y')`"
bibliography: biblio.bib
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteEncoding{UTF-8}
---

<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{mbkmeans vignette}
-->

# Installation

To install the package, please use the following.

```{r, eval=FALSE}
if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("mbkmeans")
```

# Introduction

This vignette provides an introductory example on how
to work with the `mbkmeans` package, which contains an 
implementation of the mini-batch k-means algorithm 
proposed in [@sculley2010web] for large single-cell
sequencing data.

The main function to be used by the users is `mbkmeans`.
This is implemented as an S4 generic and methods are
implemented for `matrix`, `Matrix`, `HDF5Matrix`, 
`DelayedMatrix`, `SummarizedExperiment`, and 
`SingleCellExperiment`.

Most of this work was inspired by the 
`MiniBatchKmeans()` function implemented in the 
`ClusterR` R package and we re-use many of the
C++ functions implemented there.

Our main contribution here is to provide an 
interface to the `DelayedArray` and `HDF5Array`
packages, allowing the user to run the mini-batch 
_k_-means algorithm on data that do not fit 
entirely in memory.

The motivation for this work is the clustering of 
large single-cell RNA-sequencing (scRNA-seq) datasets, 
and hence the main focus is on Bioconductor's 
`SingleCellExperiment`and `SummarizedExperiment` 
data container. For this reason, `mbkmeans` assumes a 
data representation typical of genomic data, in which 
genes (variables) are in the rows and cells 
(observations) are in the column. This is contrary to 
most other statistical applications, and notably to 
the `stats::kmeans()` and `ClusterR::MiniBatchKmeans()`
functions that assume observations in rows.

We provide a lower-level `mini_batch()` function that 
expects observations in rows and is expected to be a direct
a replacement of `ClusterR::MiniBatchKmeans()` for on-disk 
data representations such as `HDF5` files. The rest of 
this document shows the typical use case through the 
`mbkmeans()` interface, users interested in the
`mini_batch()` function should refer to its manual page.

## Example dataset

```{r options, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(cache=FALSE, error=FALSE, message=FALSE, warning=FALSE)
library(TENxPBMCData)
library(scater)
library(SingleCellExperiment)
library(mbkmeans)
library(DelayedMatrixStats)
library(ClusterR)
```

To illustrate a typical use case, we use the 
`pbmc4k` dataset of the 
[`TENxPBMCData` package](https://bioconductor.org/packages/release/data/experiment/html/TENxPBMCData.html). 
This dataset contains a set of about 4,000 cells from
peripheral blood from a healthy donor and is expected 
to contain many types or clusters of cell.

Note that in this vignette, we do not aim at identifying 
biologically meaningful clusters here (that would 
entail a more sophisticated normalization of data and 
dimensionality reduction), but insted we only aim to show
how to run mini-batch _k_-means on a large HDF5-backed
matrix.

We normalize the data simply by scaling for the total 
number of counts using `scater` and select the 1,000 
most variable genes and a random set of 100 cells to 
speed-up computations.

```{r}
tenx_pbmc4k <- TENxPBMCData(dataset = "pbmc4k")

set.seed(1034)
idx <- sample(seq_len(NCOL(tenx_pbmc4k)), 100)
sce <- tenx_pbmc4k[, idx]

#normalization
sce <- normalize(sce)

vars <- rowVars(logcounts(sce))
names(vars) <- rownames(sce)
vars <- sort(vars, decreasing = TRUE)

sce1000 <- sce[names(vars)[1:1000],]

sce1000
```

# `mbkmeans`

The main function, `mbkmeans()`, returns a 
list object including `centroids`, 
`WCSS_per_cluster` (where WCSS stands for 
within-cluster-sum-of-squares), 
`best_initialization`, `iters_per_initiazation`
and `Clusters`.

It takes any matrix-like object as input, such 
as `SummarizedExperiment`, `SingleCellExperiment`,
`matrix`, `DelayedMatrix` and `HDF5Matrix`. 

In this example, the input is a `SingleCellExperiment` 
object.

```{r}
res <- mbkmeans(sce1000, clusters = 5,
                reduceMethod = NA,
                whichAssay = "logcounts")
```

The number of clusters (such as _k_ in the 
_k_-means algorithm) is set through the `clusters` 
argument. In this case, we set `clusters = 5` for 
no particular reason. For `SingleCellExperiment` 
objects, the function provides the `reduceMethod` 
and `whichAssay` arguments. The `reduceMethod` argument
should specify the dimensionality reduction slot 
to use for the clustering, and the default is 
"PCA". Note that this *does not perform* PCA but 
only looks at a slot called "PCA" already stored 
in the object. Alternatively, one can specify 
`whichAssay` as the assay to use as input to 
mini-batch _k_-means. This is used only when 
`reduceMethod` option is `NA`. See `?mbkmeans` 
for more details.

## Choice of arguments

There are additional arguements in `mbkmeans()` 
function that make the function more flexible 
and suitable for more situations. 

### Batch size

The size of the mini batches is set through the 
`batch_size` argument. The default value uses the 
`blocksize()` function. The `blocksize()` function 
considers both the number of columns in the dataset
and the amount of RAM on the current matchine to 
calculate as large of a batch size as reasonable for 
the RAM available to the session. The calculation uses 
`get_ram()` function in `benchmarkme` package. See 
the `benchmarkme` vignette for more details.

```{r}
batchsize <- blocksize(sce1000)
batchsize
```

In this case, as the whole data fits in memory,
the default batch size would be a single 
batch of size `r ncol(sce1000)`.

### Initialization

The performance of mini-batch _k_-means greatly 
depends on the process of initialization. We 
implemented two different initialization methods:

1. Random initialization, as in regular _k_-means; 
2. `kmeans++`, as proposed in [@arthur2007k]. The default is "kmeans++". 

The percentage of data to use for the initialization
centroids is set through the `init_fraction` argument, 
which should be larger than 0 and less than 1, with
default value of 0.25.

```{r}
res_random <- mbkmeans(sce1000, clusters = 5, 
                reduceMethod = NA,
                whichAssay = "logcounts",
                initializer = "random")
table(res$Clusters, res_random$Clusters)
```

# Comparison with _k_-means

Note that if we set `init_fraction=1`, 
`initializer = "random"`, and `batch_size=ncol(x)`, 
we recover the classic _k_-means algorithm.

```{r}
res_full <- mbkmeans(sce1000, clusters = 5,
                     reduceMethod = NA,
                     whichAssay = "logcounts",
                     initializer = "random",
                     batch_size = ncol(sce1000))
res_classic <- kmeans(t(logcounts(sce1000)), centers = 5)
table(res_full$Clusters, res_classic$cluster)
```

# References