Dependencies
library(MultiAssayExperiment)
library(HDF5Array)
library(SummarizedExperiment)
The HDF5Array package provides an on-disk representation of large datasets
without the need to load them into memory. Convenient lazy evaluation
operations allow the user to manipulate such large data files based on
metadata. The DelayedMatrix class in the DelayedArray package provides a
way to connect to a large matrix that is stored on disk.
First, we create a small matrix from which to construct a DelayedMatrix object.
smallMatrix <- matrix(rnorm(10e5), ncol = 20)
We add row names and column names to the matrix object for compatibility with
the MultiAssayExperiment representation.
rownames(smallMatrix) <- paste0("GENE", seq_len(nrow(smallMatrix)))
colnames(smallMatrix) <- paste0("SampleID", seq_len(ncol(smallMatrix)))
Here we use the DelayedArray constructor function to create a
DelayedMatrix object.
smallMatrix <- DelayedArray(smallMatrix)
class(smallMatrix)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "DelayedArray"
head(smallMatrix)
## <6 x 20> DelayedMatrix object of type "double":
## SampleID1 SampleID2 SampleID3 ... SampleID19 SampleID20
## GENE1 0.3654773 1.8547129 0.7394085 . 0.3164200 0.5157957
## GENE2 -0.1338232 0.8091005 0.4047900 . -0.2334411 1.2078164
## GENE3 1.1701883 -1.6972475 -2.2474354 . 2.1269241 -0.3236314
## GENE4 -0.2406371 1.3914202 -1.4340720 . -1.1372351 -0.4498882
## GENE5 0.3611254 -0.2978319 0.8371806 . 1.8871056 0.8887826
## GENE6 -1.5324085 -1.7743000 -0.1205001 . -0.1264347 0.2880376
dim(smallMatrix)
## [1] 50000 20
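To illustrate the lazy evaluation mentioned earlier, here is a minimal sketch on a toy matrix. The object names (toy, delayed, rescaled, realized) are illustrative and not part of the example above. Arithmetic and subsetting on a DelayedMatrix are recorded as delayed operations, and the values are only computed when the result is realized, for instance with as.matrix.

```r
library(DelayedArray)

set.seed(42)
# Toy data; dimensions and names are assumptions chosen for illustration
toy <- matrix(rnorm(200), ncol = 20,
    dimnames = list(paste0("GENE", 1:10), paste0("SampleID", 1:20)))
delayed <- DelayedArray(toy)

# Arithmetic and subsetting are recorded, not carried out immediately
rescaled <- (delayed * 2 + 1)[1:5, c("SampleID1", "SampleID2")]
class(rescaled)

# Realization computes the values and returns an ordinary matrix
realized <- as.matrix(rescaled)
dim(realized)
```

With an on-disk backend the same pattern applies, so only the requested block needs to be read when the result is realized.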
Note that a large matrix stored in an HDF5 file can also be loaded with the
HDF5Array package, here by creating an HDF5ArraySeed and wrapping it with the
DelayedArray constructor.
For example:
dataLocation <- system.file("extdata", "exMatrix.h5",
    package = "MultiAssayExperiment", mustWork = TRUE)
h5ls(dataLocation)
## group name otype dclass dim
## 0 / exMatrix H5I_DATASET FLOAT 5000 x 20
hdf5Data <- HDF5ArraySeed(file = dataLocation, name = "exMatrix")
newDelayedMatrix <- DelayedArray(hdf5Data)
class(newDelayedMatrix)
## [1] "HDF5Matrix"
## attr(,"package")
## [1] "HDF5Array"
head(newDelayedMatrix)
## <6 x 20> DelayedMatrix object of type "double":
## [,1] [,2] [,3] ... [,19] [,20]
## [1,] 0.3261516 0.4149151 0.8154378 . -0.1876063 0.4156044
## [2,] 0.7243018 -0.9416687 -1.1290878 . -1.2820178 -0.3591841
## [3,] 1.5073255 0.7597899 -0.2756298 . -1.5666680 -0.1523462
## [4,] 0.1668286 1.2684049 0.9082990 . 0.3486139 1.8019041
## [5,] 0.5640491 -2.0222537 0.2881079 . 0.1210501 -1.4873598
## [6,] -0.3504778 -0.4149494 0.9145470 . 0.4291890 -0.4986399
Currently, the rhdf5 package does not store dimnames in the h5 file by
default, which is why the matrix loaded above has no row or column names.
A request for this feature has been sent to the maintainer of the rhdf5
package, and any further development of the HDF5Array package is contingent
on such lower-level dimension name storage.
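As a workaround sketch, dimnames can be attached in R after connecting to the on-disk data. The example below writes a toy matrix to a temporary HDF5 file and reads it back; the file path, the dataset name "toyMatrix", and the object names are assumptions made for illustration. Arguments are passed by position because argument names have differed between HDF5Array versions.

```r
library(HDF5Array)

set.seed(1)
toy <- matrix(rnorm(100), ncol = 5)
h5path <- tempfile(fileext = ".h5")

# Write to a temporary HDF5 file (file and dataset names are illustrative)
writeHDF5Array(toy, h5path, "toyMatrix")

# Reconnect to the on-disk dataset without loading it into memory
onDisk <- HDF5Array(h5path, "toyMatrix")

# Assign dimnames; they are kept in the R object and the file is not modified
dimnames(onDisk) <- list(
    paste0("GENE", seq_len(nrow(onDisk))),
    paste0("SampleID", seq_len(ncol(onDisk))))
head(colnames(onDisk))
```

Setting the names this way gives the object the column names that the MultiAssayExperiment representation relies on.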
DelayedMatrix with MultiAssayExperiment
A DelayedMatrix alone conforms to the MultiAssayExperiment API requirements.
Shown below, the DelayedMatrix can be put into a named list and passed into
the MultiAssayExperiment constructor function.
HDF5MAE <- MultiAssayExperiment(experiments = list(smallMatrix = smallMatrix))
sampleMap(HDF5MAE)
## DataFrame with 20 rows and 3 columns
## assay primary colname
## <factor> <character> <character>
## 1 smallMatrix SampleID1 SampleID1
## 2 smallMatrix SampleID2 SampleID2
## 3 smallMatrix SampleID3 SampleID3
## 4 smallMatrix SampleID4 SampleID4
## 5 smallMatrix SampleID5 SampleID5
## ... ... ... ...
## 16 smallMatrix SampleID16 SampleID16
## 17 smallMatrix SampleID17 SampleID17
## 18 smallMatrix SampleID18 SampleID18
## 19 smallMatrix SampleID19 SampleID19
## 20 smallMatrix SampleID20 SampleID20
colData(HDF5MAE)
## DataFrame with 20 rows and 0 columns
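The colData above is empty because no sample annotation was supplied. A minimal sketch of supplying it is shown below, assuming a hypothetical group variable and toy data. The row names of colData are matched against the column names of the DelayedMatrix, and the object can then be subset by that metadata.

```r
library(MultiAssayExperiment)
library(DelayedArray)
library(S4Vectors)

set.seed(1)
# Toy data; sizes, names, and the "group" variable are illustrative assumptions
toy <- matrix(rnorm(60), ncol = 6,
    dimnames = list(paste0("GENE", 1:10), paste0("SampleID", 1:6)))
delayed <- DelayedArray(toy)

# One row of annotation per sample, keyed by the assay column names
pheno <- DataFrame(group = rep(c("control", "treated"), each = 3),
    row.names = colnames(delayed))

mae <- MultiAssayExperiment(
    experiments = list(smallMatrix = delayed),
    colData = pheno)

# Keep only the samples annotated as treated
treated <- mae[, mae$group == "treated"]
colData(treated)
```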
SummarizedExperiment with DelayedMatrix backend
A more information-rich representation can be created when the DelayedMatrix is
used in conjunction with the SummarizedExperiment class, which can even include
rowRanges.
The flexibility of the MultiAssayExperiment API supports classes with
minimal requirements. Additionally, this SummarizedExperiment with the
DelayedMatrix backend can be part of a bigger MultiAssayExperiment object.
Below is a minimal example of how this would work:
HDF5SE <- SummarizedExperiment(assays = smallMatrix)
assay(HDF5SE)
## <50000 x 20> DelayedMatrix object of type "double":
## SampleID1 SampleID2 SampleID3 ... SampleID19 SampleID20
## GENE1 0.3654773 1.8547129 0.7394085 . 0.3164200 0.5157957
## GENE2 -0.1338232 0.8091005 0.4047900 . -0.2334411 1.2078164
## GENE3 1.1701883 -1.6972475 -2.2474354 . 2.1269241 -0.3236314
## GENE4 -0.2406371 1.3914202 -1.4340720 . -1.1372351 -0.4498882
## GENE5 0.3611254 -0.2978319 0.8371806 . 1.8871056 0.8887826
## ... . . . . . .
## GENE49996 0.7984625 -2.3415584 -1.0812426 . 1.4327904 0.3700236
## GENE49997 0.8274643 2.2435007 0.7095783 . -1.2301209 -1.5926131
## GENE49998 -0.9446164 0.8485480 0.9379644 . 0.9443763 -0.7018718
## GENE49999 0.4499102 -0.6809208 -1.4620443 . -0.5380283 0.5260980
## GENE50000 0.3950852 -2.6593138 1.5633960 . 1.0369161 0.5897473
MultiAssayExperiment(list(HDF5SE = HDF5SE))
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] HDF5SE: SummarizedExperiment with 50000 rows and 20 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
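The rowRanges option mentioned above can be sketched as follows. The genomic coordinates, the assay name "exprs", and the object names are made up for illustration. The names of the ranges are set to the row names of the DelayedMatrix so that features and ranges line up.

```r
library(SummarizedExperiment)
library(GenomicRanges)
library(DelayedArray)

set.seed(1)
# Toy data; sizes and names are illustrative assumptions
toy <- matrix(rnorm(100), ncol = 5,
    dimnames = list(paste0("GENE", 1:20), paste0("SampleID", 1:5)))
delayed <- DelayedArray(toy)

# Hypothetical coordinates, one range per feature
ranges <- GRanges(seqnames = "chr1",
    ranges = IRanges(start = seq(1, by = 1000, length.out = 20), width = 100))
names(ranges) <- rownames(delayed)

rangedSE <- SummarizedExperiment(assays = list(exprs = delayed),
    rowRanges = ranges)
class(assay(rangedSE))
rowRanges(rangedSE)
```

The assay is still a DelayedMatrix, so the ranges add feature-level annotation without realizing the data in memory.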
Additional scenarios are currently in development where an HDF5Matrix is
hosted remotely. Many opportunities exist when considering on-disk and off-disk
representations of data with MultiAssayExperiment.