--- title: "GDSArray: Representing GDS files as array-like objects" author: - name: Qian Liu affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY - name: Hervé Pagès affiliation: Fred Hutchinson Cancer Research Center, Seattle, WA - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center, Buffalo, NY date: "last edit: 3/16/2018" output: BiocStyle::html_document: toc: true toc_float: true package: GDSArray vignette: | %\VignetteIndexEntry{GDSArray: Representing GDS files as array-like objects} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r options, eval=TRUE, echo=FALSE} options(showHeadLines=3) options(showTailLines=3) ``` # Introduction [GDSArray][] is a _Bioconductor_ package that represents GDS files as objects derived from the [DelayedArray][] package and `DelayedArray` class. It converts a GDS node in the file to a `DelayedArray`-derived data structure. The rich common methods and data operations defined on `GDSArray` makes it more _R_-user-friendly than working with the GDS file directly. [GDSArray]: https://bioconductor.org/packages/GDSArray [DelayedArray]: https://bioconductor.org/packages/DelayedArray # Package installation 1. Download the package from Bioconductor. ```{r getPackage, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GDSArray") ``` 2. Load the package into R session. ```{r Load, message=FALSE} library(GDSArray) ``` # GDS format introduction ## Genomic Data Structure (GDS) The _Bioconductor_ package [gdsfmt][] has provided a high-level R interface to CoreArray Genomic Data Structure (GDS) data files, which is designed for large-scale datasets, especially for data which are much larger than the available random-access memory. More details about GDS format can be found in the vignettes of the [gdsfmt][], [SNPRelate][], and [SeqArray][] packages. [gdsfmt]: https://bioconductor.org/packages/gdsfmt [SNPRelate]: https://bioconductor.org/packages/SNPRelate [SeqArray]: https://bioconductor.org/packages/SeqArray # `GDSArray`, `GDSMatrix`, and `GDSFile` `GDSArray` represents GDS files as `DelayedArray` instances. It has methods like `dim`, `dimnames` defined, and it inherits array-like operations and methods from `DelayedArray`, e.g., the subsetting method of `[`. ## GDSArray, GDSMatrix, and DelayedArray The `GDSArray()` constructor takes as arguments the file path and the GDS node inside the GDS file. The `GDSArray()` constructor always returns the object with rows being features (genes / variants / snps) and the columns being "samples". This is consistent with the assay data inside `SummarizedExperiment`. **FIXME: should GDSArray() return that dim?** ```{r, GDSArray} file <- gdsExampleFileName("seqgds") GDSArray(file, "genotype/data") ``` A `GDSMatrix` is a 2-dimensional `GDSArray`, and will be returned from the `GDSArray()` constructor automatically if the input GDS node is 2-dimensional. ```{r, GDSMatrix} GDSArray(file, "annotation/format/DP/data") ``` ## `GDSFile` The `GDSFile` is a light-weight class to represent GDS files. It has the `$` completion method to complete any possible gds nodes. It could be used as a convenient `GDSArray` constructor if the slot of `current_path` in `GDSFile` object represents a valid gds node. Otherwise, it will return the `GDSFile` object with an updated `current_path`. ```{r, GDSFile} gf <- GDSFile(file) gf$annotation$info gf$annotation$info$AC ``` Try typing in `gf$ann` and pressing `tab` key for the completion. ## `GDSArray` methods ### slot accessors. - `seed` returns the `GDSArraySeed` of the `GDSArray` object. ```{r, seedAccessor} gt <- GDSArray(file, "genotype/data") seed(gt) ``` - `gdsfile` returns the file path of the corresponding GDS file. ```{r, gdsfileAccessor} gdsfile(gt) ``` ### Available GDS nodes `gdsnodes()` takes the GDS file path or `GDSFile` object as input, and returns all nodes that can be converted to `GDSArray` instances. The returned GDS node names can be used as input for the `GDSArray(name=)` constructor. ```{r, gdsnodes} gdsnodes(file) identical(gdsnodes(file), gdsnodes(gf)) varname <- gdsnodes(file)[2] GDSArray(file, varname) ``` ### `dim()`, `dimnames()` The `dimnames(GDSArray)` returns an unnamed list, with the value of NULL or dimension names with length being the same as return from `dim(GDSArray)`. ```{r, dims} dp <- GDSArray(file, "annotation/format/DP/data") dim(dp) class(dimnames(dp)) lengths(dimnames(dp)) ``` ### `[` subsetting `GDSArray` instances can be subset, following the usual _R_ conventions, with numeric or logical vectors; logical vectors are recycled to the appropriate length. ```{r, methods} dp[1:3, 10:15] dp[c(TRUE, FALSE), ] ``` ### some numeric calculation ```{r, numeric} log(dp) dp[rowMeans(dp) < 60, ] ``` ## Internals: `GDSArraySeed` The `GDSArraySeed` class represents the 'seed' for the `GDSArray` object. It is not exported from the [GDSArray][] package. Seed objects should contain the gds file of `gds.class`, GDS file path, GDS file node name, and are expected to satisfy the [seed contract](https://bioconductor.org/packages/release/bioc/vignettes/DelayedArray/inst/doc/02-Implementing_a_backend.html#the-seed-contract) for implementing a `DelayedArray` backend, i.e. to support dim() and dimnames(). ```{r, GDSArraySeed} gds <- openfn.gds(file) seed <- GDSArray:::GDSArraySeed(gds, "genotype/data") seed closefn.gds(gds) ``` The seed can be used to construct a `GDSArray` instance. ```{r, GDSArray-from-GDSArraySeed} GDSArray(seed) ``` The `DelayedArray()` constructor with `GDSArraySeed` object as argument will return the same content as the `GDSArray()` constructor over the same `GDSArraySeed`. ```{r, da} class(DelayedArray(seed)) ``` # sessionInfo ```{r, sessionInfo} sessionInfo() ```