---
title: "Scry Methods For Larger Datasets"
author: "Will Townes"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  BiocStyle::html_document:
    toc: false
vignette: >
  %\VignetteEngine{knitr::knitr}
  %\VignetteIndexEntry{Scry Methods For Larger Datasets}
  %\usepackage[UTF-8]{inputenc}
---

```{r}
suppressPackageStartupMessages(library(TENxPBMCData))
require(scry)
```

We illustrate the application of scry methods to disk-based data from the 
TENxPBMCData package. Each dataset in this package is stored in an HDF5 file
that is accessed through a DelayedArray interface. This avoids the need to 
load the entire dataset into memory for analysis.

## Feature Selection with Deviance

```{r}
sce<-TENxPBMCData(dataset="pbmc3k")
h5counts<-counts(sce)
seed(h5counts) #print information about object
h5counts<-h5counts[rowSums(h5counts)>0,]
system.time(h5devs<-devianceFeatureSelection(h5counts)) # 26 sec
```

We now compare the computation speed when the same data is converted to an 
ordinary array in-memory. Note this would not be possible with larger 
HDF5Array objects.

```{r}
denseCounts<-as.matrix(h5counts)
system.time(denseDevs<-devianceFeatureSelection(denseCounts)) # 5 sec
max(abs(denseDevs-h5devs)) #should be close to zero
```

Finally we compare the speed when the counts data are stored in a sparse 
in-memory Matrix format

```{r}
mean(denseCounts>0) #shows that the data are mostly zeros so sparsity useful
sparseCounts<-Matrix::Matrix(denseCounts,sparse=TRUE)
system.time(sparseDevs<-devianceFeatureSelection(sparseCounts)) #1.6 sec
max(abs(sparseDevs-h5devs)) #should be close to zero
```

Using disk-based data saves memory but slows computation time. When the data
contain mostly zeros, and are not too large, the sparse in-memory Matrix 
object achieves fastest computation times. The resulting deviance statistics 
are the same for all of the different data formats.

## Null residuals

One can run `nullResiduals` on `HDF5Matrix`, `DelayedArray` matrices, and sparse 
matrices from the `Matrix` package with the same syntax used for the base 
matrix case.

We illustrate this with the same dataset from the `TENxPBMCData` package.

```{r, eval=FALSE}
sce <- nullResiduals(sce, assay="counts", type="deviance")
str(sce)
```