---
title: "sparseMatrixStats"
author: Constantin Ahlmann-Eltze
date: "`r Sys.Date()`"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{sparseMatrixStats}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Installation
You can install the release version of `r BiocStyle::Biocpkg("sparseMatrixStats")` from BioConductor:
``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("sparseMatrixStats")
```
# Introduction
The sparseMatrixStats package implements a number of summary functions for sparse matrices from the `r BiocStyle::CRANpkg("Matrix")` package.
Let us load the package and create a synthetic sparse matrix.
```{r}
library(sparseMatrixStats)
# Matrix defines the sparse Matrix class
# dgCMatrix that we will use
library(Matrix)
# For reproducibility
set.seed(1)
```
Create a synthetic table with customers, items, and how often they bought that item.
```{r}
customer_ids <- seq_len(100)
item_ids <- seq_len(30)
n_transactions <- 1000
customer <- sample(customer_ids, size = n_transactions, replace = TRUE,
prob = runif(100))
item <- sample(item_ids, size = n_transactions, replace = TRUE,
prob = runif(30))
tmp <- table(paste0(customer, "-", item))
tmp2 <- strsplit(names(tmp), "-")
purchase_table <- data.frame(
customer = as.numeric(sapply(tmp2, function(x) x[1])),
item = as.numeric(sapply(tmp2, function(x) x[2])),
n = as.numeric(tmp)
)
head(purchase_table, n = 10)
```
Let us turn the table into a matrix to simplify the analysis:
```{r}
purchase_matrix <- sparseMatrix(purchase_table$customer, purchase_table$item,
x = purchase_table$n,
dims = c(100, 30),
dimnames = list(customer = paste0("Customer_", customer_ids),
item = paste0("Item_", item_ids)))
purchase_matrix[1:10, 1:15]
```
We can see that some customers did not buy anything, where as
some bought a lot.
`sparseMatrixStats` can help us to identify interesting patterns in this data:
```{r}
# How often was each item bough in total?
colSums2(purchase_matrix)
# What is the range of number of items each
# customer bought?
head(rowRanges(purchase_matrix))
# What is the variance in the number of items
# each customer bought?
head(rowVars(purchase_matrix))
# How many items did a customer not buy at all, one time, 2 times,
# or exactly 4 times?
head(rowTabulates(purchase_matrix, values = c(0, 1, 2, 4)))
```
## Alternative Matrix Creation
In the previous section, I demonstrated how to create a sparse matrix from scratch using the `sparseMatrix()` function.
However, often you already have an existing matrix and want to convert it to a sparse representation.
```{r}
mat <- matrix(0, nrow=10, ncol=6)
mat[sample(seq_len(60), 4)] <- 1:4
# Convert dense matrix to sparse matrix
sparse_mat <- as(mat, "dgCMatrix")
sparse_mat
```
The *sparseMatrixStats* package is a derivative of the `r BiocStyle::CRANpkg("matrixStats")` package and implements it's API for
sparse matrices. For example, to calculate the variance for each column of `mat` you can do
```{r}
apply(mat, 2, var)
```
However, this is quite inefficient and *matrixStats* provides the direct function
```{r}
matrixStats::colVars(mat)
```
Now for sparse matrices, you can also just call
```{r}
sparseMatrixStats::colVars(sparse_mat)
```
# Benchmark
If you have a large matrix with many exact zeros, working on the sparse representation can considerably speed up the computations.
I generate a dataset with 10,000 rows and 50 columns that is 99% empty
```{r}
big_mat <- matrix(0, nrow=1e4, ncol=50)
big_mat[sample(seq_len(1e4 * 50), 5000)] <- rnorm(5000)
# Convert dense matrix to sparse matrix
big_sparse_mat <- as(big_mat, "dgCMatrix")
```
I use the bench package to benchmark the performance difference:
```{r}
bench::mark(
sparseMatrixStats=sparseMatrixStats::colVars(big_sparse_mat),
matrixStats=matrixStats::colVars(big_mat),
apply=apply(big_mat, 2, var)
)
```
As you can see `sparseMatrixStats` is ca. 50 times fast than `matrixStats`, which in turn is 7 times faster than the `apply()` version.
# Session Info
```{r}
sessionInfo()
```