---
title: "VariantExperiment: A RangedSummarizedExperiment Container for VCF/GDS Data with GDS Backend"
valauthor: 
- name: Qian Liu
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
- name: Martin Morgan
  affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY
date: "last edit: 08/14/2019"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
package: VariantExperiment
vignette: |
    %\VignetteIndexEntry{VariantExperiment-class}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r options, eval=TRUE, echo=FALSE}
options(showHeadLines=3)
options(showTailLines=3)
```

# User instruction

This package includes 2 vignettes. This One is about the class
definitions, basic methods and operations of `VariantExperiment` that
are defined or directly inherited from `SummarizedExperiment`. For
users who want to apply `VariantExperiment` directly to their analysis
high-throughput genotyping or DNA-seq data sets, please refer to the
other vignette.

# Introduction 

As the sequencing data (DNA-seq, RNA-seq, single cell RNA-seq...) gets
increasingly larger-profile, the memory space in R has been an
obstable for fast and efficient data processing, because most
available _R_ or _Bioconductor_ packages are developed based on
in-memory data manipulation. [SingleCellExperiment][] has achieved
efficient on-disk saving/reading of the large-scale count data as
[HDF5Array][] objects. However, there was still no such light-weight
containers available for high-throughput DNA-seq / genotyping data.

We have developed [VariantExperiment][], a _Bioconductor_ package for
saving the `VCF/GDS` format data into `RangedSummarizedExperiment`
object. The high-throughput genetic/genomic data are saved in
[GDSArray][] objects, a direct extension of [DelayedArray][] with GDS
back-end. In addition to the light-weight `Assay` data, We have also
realized the on-disk saving of annotation data for both
features/samples (corresponding to `rowData/colData`) by implementing
the [DelayedDataFrame][] data structure.  The on-disk representation
of both assay data and annotation data realizes on-disk reading and
processing and saves _R_ memory space significantly. The interface of
`RangedSummarizedExperiment` data format enables easy and common
manipulations for high-throughput genetic/genomic data with common
[SummarizedExperiment][] metaphor in _R_ and _Bioconductor_.

[VariantExperiment]: https://bioconductor.org/packages/VariantExperiment 
[SummarizedExperiment]: https://bioconductor.org/packages/SummarizedExperiment
[SingleCellExperiment]: https://bioconductor.org/packages/SingleCellExperiment
[DelayedArray]: https://bioconductor.org/packages/DelayedArray
[HDF5Array]: https://bioconductor.org/packages/HDF5Array
[GDSArray]: https://bioconductor.org/packages/GDSArray
[DelayedDataFrame]: http://bioconductor.org/packages/DelayedDataFrame

# Installation

1. Download the package from Bioconductor: 

```{r getPackage, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("VariantExperiment")
```
Or install the development version of the package from Github.
```{r, eval = FALSE}
BiocManager::install(â€œBioconductor/VariantExperimentâ€) 
``` 

2. Load the package into R session.
```{r Load, message=FALSE}
library(VariantExperiment)
```

# Background

## GDSArray

[GDSArray][] is a _Bioconductor_ package that represents `GDS` files as
objects derived from the [DelayedArray][] package and `DelayedArray`
class. It converts `GDS` nodes into a `DelayedArray`-derived data
structure. The rich common methods and data operations defined on
`GDSArray` makes it more _R_-user-friendly than working with the GDS
file directly. The array data from GDS files are always returned with
the first dimension being features (genes/variants/snps) and the
second dimension being `samples`. This feature is consistent with the
assay data saved in `SummarizedExperiment`, and makes the `GDSArray`
package more interoperable with other established _Bioconductor_ data
infrastructure.   

The `GDSArray()` constructor takes 2 arguments: the file path and the
GDS node name inside the GDS file. 

```{r, GDSArray}
file <- SeqArray::seqExampleFileName("gds")
GDSArray(file, "genotype")
GDSArray(file, "sample.id")
```
More details about `GDS` or `GDSArray` format can be found in the
vignettes of the [gdsfmt][], [SNPRelate][], [SeqArray][], [GDSArray][]
and [DelayedArray][] packages.

[gdsfmt]: https://bioconductor.org/packages/gdsfmt
[SNPRelate]: https://bioconductor.org/packages/SNPRelate
[SeqArray]: https://bioconductor.org/packages/SeqArray

## DelayedDataFrame

[DelayedDataFrame][] is a _Bioconductor_ package that implements
delayed operations on `DataFrame` objects using standard `DataFrame`
metaphor. Each column of data inside `DelayedDataFrame` is represented
as 1-dimensional `GDSArray` with on-disk GDS file. Methods like
`show`,`validity check`, `[`, `[[` subsetting, `rbind`, `cbind` are implemented for
`DelayedDataFrame`.  The `DelayedDataFrame` stays lazy until an
explicit realization call like `DataFrame()` constructor or
`as.list()` incurred. More details about [DelayedDataFrame][] data
structure could be found in the vignette of [DelayedDataFrame][]
package.

# `VariantExperiment` class

## `VariantExperiment` class
`VariantExperiment` class is defined to extend
`RangedSummarizedExperiment`. The difference would be that the assay
data are saved as `GDSArray`, and the annotation data are saved by
default as `DelayedDataFrame` (with option to save as ordinary
`DataFrame`), both of which are representing the data on-disk with `GDS`
back-end. There are coercion methods defined for `VCF` and `GDS` files into
`VariantExperiment` objects (Check the method vignette for more details). 
Here we take an example for illustration.  

```{r makeVariantExperimentFromGDS}
ve <- makeVariantExperimentFromGDS(file)
ve
```

## slot accessors

assay data are in `GDSArray` format, and could be retrieve by the
`assays()/assay()` function.

```{r makeVariantExperimentFromGDS2}
assays(ve)
assay(ve, 1)
```

The `rowData()` of the `VariantExperiment` is saved in
`DelayedDataFrame` format. We can use `rowRanges()` / `rowData()` to
retrieve the feature-related annotation file, with/without a
GenomicRange format.

```{r rrrd}
rowRanges(ve)
rowData(ve)
```

sample-related annotation is in `DelayedDataFrame` format, and could
be retrieved by `colData()`.

```{r colData}
colData(ve)
``` 

The `gdsfile()` will retrieve the gds file path associated with the
`VariantExperiment` object.
```{r gdsfile}
gdsfile(ve)
```

Some other getter function like `metadata()` will return any metadata
that we have saved inside the `VariantExperiment` object.
```{r metaData}
metadata(ve)
``` 

# Subsetting methods

The `VariantExperiment` object supports subsetting methods with `[`,
`$` and `subsetByOverlap` etc. as in `SummarizedExperiment`.

## two-dimensional subsetting
```{r, 2d}
ve[1:10, 1:5]
```

## `$` subsetting
The `$` subsetting could be operated directly on `colData()` columns,
for easy sample extraction.  Note that the `colData/rowData` are in the
`DelayedDataFrame` format, with each column saved as `GDSArray`. So
when doing subsetting, we need to use `as.logical()` to convert the
1-dimensional `GDSArray` into ordinary vector.

```{r colDataExtraction}
colData(ve)
ve[, as.logical(ve$family == "1328")]
```

subsetting by `rowData()` columns.

```{r rowDataExtraction}
rowData(ve)
ve[as.logical(rowData(ve)$REF == "T"),]
```

## Range-based operations

`VariantExperiment` objects support all of the `findOverlaps()`
methods and associated functions.  This includes `subsetByOverlaps()`,
which makes it easy to subset a `VariantExperiment` object by an
interval.

```{r overlap}
ve1 <- subsetByOverlaps(ve, GRanges("22:1-48958933"))
ve1
```
# Save / load `VariantExperiment` object

Note that after the subsetting by `[`, `$` or `Ranged-based
operations`, and you feel satisfied with the data for downstream
analysis, you need to save that `VariantExperiment` object to
synchronize the gds file (on-disk) associated with the subset of data
(in-memory representation) before any statistical analysis. Otherwise,
an error will be returned.

For example, after we subset the `ve` by `GRanges("22:1-48958933")`,
and we want to calculate the hwe based on the 23 variants, an error
will be generated indicating that we need to sync the on-disk and
in-memory representations.

```{r saveLoad, eval=FALSE}
hwe(ve1)
## Error in .saveGDSMaybe(gdsfile) : use
##   'saveVariantExperiment()' to synchronize on-disk and
##   in-memory representations
```

## save `VariantExperiment` object

Use the function `saveVariantExperiment` to synchronize the on-disk
and in-memory representation. This function writes the processed data
as `ve.gds`, and save the _R_ object (which lazily represent the
backend data set) as `ve.rds` under the specified directory. It
finally returns a new `VariantExperiment` object into current R
session generated from the newly saved data.

```{r saveVE}
a <- tempfile()
ve2 <- saveVariantExperiment(ve1, dir=a, replace=TRUE)
```

## load `VariantExperiment` object

You can alternatively use `loadVariantExperiment` to load the
synchronized data into R session, by providing only the file
directory. It reads the `VariantExperiment` object saved as `ve.rds`, as lazy
representation of the backend `ve.gds` file under the specific
directory.

```{r loadVE}
ve3 <- loadVariantExperiment(dir=a)
gdsfile(ve3)
all.equal(ve2, ve3)
```

Now we are all set for any downstream analysis as needed. 
```{r newVEstats}
head(hwe(ve2))
```

# Session Info

```{r sessionInfo}
sessionInfo()
```