---
title: Using the `ScaledMatrix` class
author:
- name: Aaron Lun
email: infinite.monkeys.with.keyboards@gmail.com
date: "Revised: 12 December 2020"
output:
BiocStyle::html_document:
toc_float: true
package: ScaledMatrix
vignette: >
%\VignetteIndexEntry{Using the ScaledMatrix}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo=FALSE, results="hide", message=FALSE}
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
```
# Overview
The `ScaledMatrix` provides yet another method of running `scale()` on a matrix.
In other words, these three operations are equivalent:
```{r}
mat <- matrix(rnorm(10000), ncol=10)
smat1 <- scale(mat)
head(smat1)
library(DelayedArray)
smat2 <- scale(DelayedArray(mat))
head(smat2)
library(ScaledMatrix)
smat3 <- ScaledMatrix(mat, center=TRUE, scale=TRUE)
head(smat3)
```
The biggest difference lies in how they behave in downstream matrix operations.
- `smat1` is an ordinary matrix, with the scaled and centered values fully realized in memory.
Nothing too unusual here.
- `smat2` is a `DelayedMatrix` and undergoes block processing whereby chunks are realized and operated on, one at a time.
This sacrifices speed for greater memory efficiency by avoiding a copy of the entire matrix.
In particular, it preserves the structure of the original `mat`, e.g., from a sparse or file-backed representation.
- `smat3` is a `ScaledMatrix` that refactors certain operations so that they can be applied to the original `mat` without any scaling or centering.
This takes advantage of the original data structure to speed up matrix multiplication and row/column sums,
albeit at the cost of numerical precision.
# Matrix multiplication
Given an original matrix $\mathbf{X}$ with $n$ columns, a vector of column centers $\mathbf{c}$ and a vector of column scaling values $\mathbf{s}$,
our scaled matrix can be written as:
$$
\mathbf{Y} = (\mathbf{X} - \mathbf{c} \cdot \mathbf{1}_n^T) \mathbf{S}
$$
where $\mathbf{S} = \text{diag}(s_1^{-1}, ..., s_n^{-1})$.
If we wanted to right-multiply it with another matrix $\mathbf{A}$, we would have:
$$
\mathbf{YA} = \mathbf{X}\mathbf{S}\mathbf{A} - \mathbf{c} \cdot \mathbf{1}_n^T \mathbf{S}\mathbf{A}
$$
The right-most expression is simply the outer product of $\mathbf{c}$ with the column sums of $\mathbf{SA}$.
More important is the fact that we can use the matrix multiplication operator for $\mathbf{X}$ with $\mathbf{SA}$,
as this allows us to use highly efficient algorithms for certain data representations, e.g., sparse matrices.
```{r}
library(Matrix)
mat <- rsparsematrix(20000, 10000, density=0.01)
smat <- ScaledMatrix(mat, center=TRUE, scale=TRUE)
blob <- matrix(runif(ncol(mat) * 5), ncol=5)
system.time(out <- smat %*% blob)
# The slower way with block processing.
da <- scale(DelayedArray(mat))
system.time(out2 <- da %*% blob)
```
The same logic applies for left-multiplication and cross-products.
This allows us to easily speed up high-level operations involving matrix multiplication by just switching to a `ScaledMatrix`,
e.g., in approximate PCA algorithms from the `r Biocpkg("BiocSingular")` package.
```{r}
library(BiocSingular)
set.seed(1000)
system.time(pcs <- runSVD(smat, k=10, BSPARAM=IrlbaParam()))
```
# Other utilities
Row and column sums are special cases of matrix multiplication and can be computed quickly:
```{r}
system.time(rowSums(smat))
system.time(rowSums(da))
```
Subsetting, transposition and renaming of the dimensions are all supported without loss of the `ScaledMatrix` representation:
```{r}
smat[,1:5]
t(smat)
rownames(smat) <- paste0("GENE_", 1:20000)
smat
```
Other operations will cause the `ScaledMatrix` to collapse to the general `DelayedMatrix` representation, after which point block processing will be used.
```{r}
smat + 1
```
# Caveats
For most part, the implementation of the multiplication assumes that the $\mathbf{A}$ matrix and the matrix product are small compared to $\mathbf{X}$.
It is also possible to multiply two `ScaledMatrix`es together if the underlying matrices have efficient operators for their product.
However, if this is not the case, the `ScaledMatrix` offers little benefit for increased overhead.
It is also worth noting that this speed-up is not entirely free.
The expression above involves subtracting two matrix with potentially large values, which runs the risk of catastrophic cancellation.
The example below demonstrates how `ScaledMatrix` is more susceptible to loss of precision than a normal `DelayedArray`:
```{r}
set.seed(1000)
mat <- matrix(rnorm(1000000), ncol=100000)
big.mat <- mat + 1e12
# The 'correct' value, unaffected by numerical precision.
ref <- rowMeans(scale(mat))
head(ref)
# The value from scale'ing a DelayedArray.
library(DelayedArray)
smat2 <- scale(DelayedArray(big.mat))
head(rowMeans(smat2))
# The value from a ScaledMatrix.
library(ScaledMatrix)
smat3 <- ScaledMatrix(big.mat, center=TRUE, scale=TRUE)
head(rowMeans(smat3))
```
In most practical applications, though, this does not seem to be a major concern,
especially as most values (e.g., log-normalized expression matrices) lie close to zero anyway.
# Session information {-}
```{r}
sessionInfo()
```