---
title: "_SummarizedExperiment_ for Coordinating Experimental Assays, Samples, and Regions of Interest"
author: "Martin Morgan, Valerie Obenchain, Jim Hester, Hervé Pagès"
date: "Revised: 5 Jan, 2023"
output:
  BiocStyle::html_document:
    toc: true
vignette: >
  %\VignetteIndexEntry{1. SummarizedExperiment for Coordinating Experimental Assays, Samples, and Regions of Interest}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```
# Introduction
The `SummarizedExperiment` class is used to store rectangular matrices of
experimental results, which are commonly produced by sequencing and microarray
experiments. A `SummarizedExperiment` can simultaneously manage several
experimental results, or `assays`, as long as they all have the same dimensions.
Each object stores observations of one or more samples, along
with additional meta-data describing both the observations (features) and
samples (phenotypes).
A key aspect of the `SummarizedExperiment` class is the coordination of the
meta-data and assays when subsetting. For example, if you want to exclude a
given sample, you can do so for both the meta-data and the assays in one
operation, which ensures the meta-data and observed data remain in sync.
Improperly accounting for meta-data and observational data has resulted in a
number of incorrect results and retractions, so this is a very desirable
property.
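As a minimal sketch of this coordination, the toy object below drops one sample and checks that the assay columns and the sample meta-data still line up. The names `toy_counts`, `toy_samples`, and `toy_se` and all values are made up for illustration; they are not part of the package or of any dataset used later.

```{r sketch_sync_subset}
library(SummarizedExperiment)

# Illustrative toy data (hypothetical): 4 features x 3 samples
toy_counts <- matrix(1:12, nrow=4,
                     dimnames=list(paste0("gene", 1:4), c("s1", "s2", "s3")))
toy_samples <- DataFrame(condition=c("ctrl", "trt", "ctrl"),
                         row.names=c("s1", "s2", "s3"))
toy_se <- SummarizedExperiment(assays=list(counts=toy_counts),
                               colData=toy_samples)

# Dropping sample "s2" removes it from the assay and from colData together
toy_sub <- toy_se[, colnames(toy_se) != "s2"]
colnames(assay(toy_sub))
rownames(colData(toy_sub))
```

Because a single `[` call acts on the whole object, there is no separate step in which the sample table could fall out of step with the matrix.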
`SummarizedExperiment` is in many ways similar to the historical
`ExpressionSet`, the main distinction being that `SummarizedExperiment` is more
flexible in its row information, allowing rows to be described either by
`GRanges` or by arbitrary `DataFrame`s. This makes it well suited to a variety
of experiments, particularly sequencing-based experiments such as RNA-Seq and
ChIP-Seq.
# Anatomy of a `SummarizedExperiment`
The _SummarizedExperiment_ package contains two classes:
`SummarizedExperiment` and `RangedSummarizedExperiment`.
`SummarizedExperiment` is a matrix-like container where rows represent features
of interest (e.g. genes, transcripts, exons, etc.) and columns represent
samples. The objects contain one or more assays, each represented by a
matrix-like object of numeric or other mode. Information about the features
is stored in a `DataFrame` object, accessible using the function `rowData()`.
Each row of the `DataFrame` provides information on the feature in the
corresponding row of the `SummarizedExperiment` object. Columns of the
`DataFrame` represent different attributes of the features of interest,
e.g., gene or transcript IDs.
`RangedSummarizedExperiment` is a child of the `SummarizedExperiment` class,
which means that all the methods on `SummarizedExperiment` also work on a
`RangedSummarizedExperiment`.
The fundamental difference between the two classes is that the rows of a
`RangedSummarizedExperiment` object represent genomic ranges of interest
rather than features described only by a `DataFrame`. The
`RangedSummarizedExperiment` ranges are described by a `GRanges` or a
`GRangesList` object, accessible using the `rowRanges()` function.
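The relationship between the two classes can be sketched with a few made-up ranges. The object names `toy_ranges` and `toy_rse`, the coordinates, and the `gene_id` values are all hypothetical. Supplying `rowRanges` to the constructor yields a `RangedSummarizedExperiment`, which still responds to `rowData()` like its parent class.

```{r sketch_class_relationship}
library(SummarizedExperiment)

# Hypothetical genomic ranges with one metadata column
toy_ranges <- GRanges("chr1", IRanges(start=c(100, 500, 900), width=50),
                      gene_id=c("g1", "g2", "g3"))
toy_rse <- SummarizedExperiment(assays=list(counts=matrix(0L, 3, 2)),
                                rowRanges=toy_ranges)

class(toy_rse)                       # the ranged subclass
is(toy_rse, "SummarizedExperiment")  # methods of the parent class apply
rowRanges(toy_rse)                   # genomic coordinates of each row
rowData(toy_rse)                     # metadata columns attached to the ranges
```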
The following graphic displays the class geometry and highlights the
vertical (column) and horizontal (row) relationships.

## Assays
The `airway` package contains an example dataset from an RNA-Seq experiment of
read counts per gene for airway smooth muscles. These data are stored
in a `RangedSummarizedExperiment` object which contains 8 different
experimental samples and assays 64,102 gene transcripts.
```{r, echo=FALSE}
suppressPackageStartupMessages(library(SummarizedExperiment))
suppressPackageStartupMessages(data(airway, package="airway"))
```
```{r}
library(SummarizedExperiment)
data(airway, package="airway")
se <- airway
se
```
To retrieve the experiment data from a `SummarizedExperiment` object one can
use the `assays()` accessor. An object can have multiple assay datasets
each of which can be accessed using the `$` operator.
The `airway` dataset contains only one assay (`counts`). Here each row
represents a gene transcript and each column one of the samples.
```{r assays, eval = FALSE}
assays(se)$counts
```
```{r assays_table, echo = FALSE}
knitr::kable(assays(se)$counts[1:10,])
```
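To illustrate an object holding more than one assay, the following sketch adds a log-transformed assay next to the raw counts on a small toy object. The object `multi_se`, the assay name `logcounts`, and the pseudocount of 1 are arbitrary choices made for this example, not conventions required by the package.

```{r sketch_multiple_assays}
library(SummarizedExperiment)

# Hypothetical counts: 3 transcripts x 2 samples
multi_counts <- matrix(c(0, 3, 7, 15, 1, 31), nrow=3,
                       dimnames=list(paste0("tx", 1:3), c("A", "B")))
multi_se <- SummarizedExperiment(assays=list(counts=multi_counts))

# Add a second assay; it must have the same dimensions as the first
assays(multi_se)$logcounts <- log2(assays(multi_se)$counts + 1)

assayNames(multi_se)
assays(multi_se)$logcounts
```

Working on a separate toy object leaves `se` untouched, so the outputs in the rest of this document are unaffected.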
## 'Row' (regions-of-interest) data
The `rowRanges()` accessor is used to view the range information for a
`RangedSummarizedExperiment`. (Note that if this were the parent
`SummarizedExperiment` class, we'd use `rowData()`.) The data are stored in a
`GRangesList` object, where each list element corresponds to one gene
transcript and the ranges in each `GRanges` correspond to the exons in the
transcript.
```{r rowRanges}
rowRanges(se)
```
## 'Column' (sample) data
Sample meta-data can be accessed using `colData()`, which returns a
`DataFrame` that can store any number of descriptive columns for each
sample row.
```{r colData}
colData(se)
```
Individual columns of this sample metadata can be accessed using the `$`
accessor, which makes it easy to subset the entire object by a given phenotype.
```{r columnSubset}
# subset for only those samples treated with dexamethasone
se[, se$dex == "trt"]
```
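The `$` shorthand can also be used for assignment, and several sample-level conditions can be combined in one subset. The sketch below uses a hypothetical object `pheno_se` with an invented `batch` column; only the `dex` column name is borrowed from the example above.

```{r sketch_coldata_dollar}
library(SummarizedExperiment)

# Hypothetical object: 2 features x 4 samples
pheno_se <- SummarizedExperiment(
    assays=list(counts=matrix(1:8, nrow=2)),
    colData=DataFrame(dex=c("trt", "untrt", "trt", "untrt"),
                      row.names=paste0("sample", 1:4)))

# `$` reads a colData column ...
pheno_se$dex
# ... and can also add one
pheno_se$batch <- c("b1", "b1", "b2", "b2")

# Keep only treated samples from batch "b2"
pheno_sub <- pheno_se[, pheno_se$dex == "trt" & pheno_se$batch == "b2"]
colnames(pheno_sub)
```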
## Experiment-wide metadata
Meta-data describing the experimental methods and publication references can be
accessed using `metadata()`.
```{r metadata}
metadata(se)
```
Note that `metadata()` is just a simple list, so it is appropriate for _any_
experiment wide metadata the user wishes to save, such as storing model
formulas.
```{r metadata-formula}
metadata(se)$formula <- counts ~ dex + albut
metadata(se)
```
# Constructing a `SummarizedExperiment`
Often, `SummarizedExperiment` or `RangedSummarizedExperiment` objects are
returned by functions from other packages. However, it is possible to
create them by hand with a call to the `SummarizedExperiment()` constructor.
Constructing a `RangedSummarizedExperiment` with a `GRanges` as the
_rowRanges_ argument:
```{r constructRSE}
nrows <- 200
ncols <- 6
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
rowRanges <- GRanges(rep(c("chr1", "chr2"), c(50, 150)),
                     IRanges(floor(runif(200, 1e5, 1e6)), width=100),
                     strand=sample(c("+", "-"), 200, TRUE),
                     feature_id=sprintf("ID%03d", 1:200))
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
                     row.names=LETTERS[1:6])
SummarizedExperiment(assays=list(counts=counts),
                     rowRanges=rowRanges, colData=colData)
```
A `SummarizedExperiment` can be constructed with or without supplying
a `DataFrame` for the _rowData_ argument:
```{r constructSE}
SummarizedExperiment(assays=list(counts=counts), colData=colData)
```
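For the complementary case, here is a short sketch of a construction that does supply a `DataFrame` as _rowData_. The feature annotations (`symbol`, `biotype`) and the object names `feat_counts`, `feat_info`, and `feat_se` are invented for illustration.

```{r sketch_construct_with_rowdata}
library(SummarizedExperiment)

# Hypothetical counts: 4 features x 3 samples
feat_counts <- matrix(rpois(12, lambda=10), nrow=4)

# One row of annotation per feature, in the same order as the assay rows
feat_info <- DataFrame(symbol=c("GeneA", "GeneB", "GeneC", "GeneD"),
                       biotype=c("coding", "coding", "lncRNA", "coding"))

feat_se <- SummarizedExperiment(assays=list(counts=feat_counts),
                                rowData=feat_info)
rowData(feat_se)
```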
# Top-level dimnames vs assay-level dimnames
In addition to the dimnames that are set on a `SummarizedExperiment` object
itself, the individual assays that are stored in the object can have their
own dimnames or not:
```{r construct_se3}
a1 <- matrix(runif(24), ncol=6, dimnames=list(letters[1:4], LETTERS[1:6]))
a2 <- matrix(rpois(24, 0.8), ncol=6)
a3 <- matrix(101:124, ncol=6, dimnames=list(NULL, LETTERS[1:6]))
se3 <- SummarizedExperiment(SimpleList(a1, a2, a3))
```
The dimnames of the `SummarizedExperiment` object (top-level dimnames):
```{r top_level_dimnames}
dimnames(se3)
```
When extracting assays from the object, the top-level dimnames are put on
them by default:
```{r top_level_dimnames_are_propagated}
assay(se3, 2) # this is 'a2', but with the top-level dimnames on it
assay(se3, 3) # this is 'a3', but with the top-level dimnames on it
```
However, if `withDimnames=FALSE` is used, the assays are returned
_as-is_, i.e. with their original dimnames (this is how they are stored
in the `SummarizedExperiment` object):
```{r assay_level_dimnames}
assay(se3, 2, withDimnames=FALSE) # identical to 'a2'
assay(se3, 3, withDimnames=FALSE) # identical to 'a3'
rownames(se3) <- strrep(letters[1:4], 3)
dimnames(se3)
assay(se3, 1) # this is 'a1', but with the top-level dimnames on it
assay(se3, 1, withDimnames=FALSE) # identical to 'a1'
```
# Common operations on `SummarizedExperiment`
## Subsetting
- `[` performs two-dimensional subsetting, just like subsetting a matrix
or data frame.
```{r 2d}
# subset the first five transcripts and first three samples
se[1:5, 1:3]
```
- `$` operates on `colData()` columns, for easy sample extraction.
```{r colDataExtraction}
se[, se$cell == "N61311"]
```
## Getters and setters
- `rowRanges()` / (`rowData()`), `colData()`, `metadata()`
```{r getSet}
counts <- matrix(1:15, 5, 3, dimnames=list(LETTERS[1:5], LETTERS[1:3]))
dates <- SummarizedExperiment(assays=list(counts=counts),
                              rowData=DataFrame(month=month.name[1:5], day=1:5))
# Subset all January assays
dates[rowData(dates)$month == "January", ]
```
- `assay()` versus `assays()`
There are two accessor functions for extracting the assay data from a
`SummarizedExperiment` object. `assays()` operates on the entire list of assay
data as a whole, while `assay()` operates on only one assay at a time.
`assay(x, i)` is simply a convenience function which is equivalent to
`assays(x)[[i]]`.
```{r assay_assays}
assays(se)
assays(se)[[1]][1:5, 1:5]
# assay defaults to the first assay if no i is given
assay(se)[1:5, 1:5]
assay(se, 1)[1:5, 1:5]
```
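The equivalence between `assay(x, i)` and `assays(x)[[i]]` can be checked directly. The sketch below uses a hypothetical two-assay object `eq_se` (the assay names `counts` and `scaled` are made up) and compares both access paths, by position and by name.

```{r sketch_assay_equivalence}
library(SummarizedExperiment)

# Hypothetical object with two assays of identical dimensions
eq_se <- SummarizedExperiment(
    assays=list(counts=matrix(1:6, nrow=2),
                scaled=matrix((1:6) / 10, nrow=2)))

# Compare the two access paths, with i given as a position and as a name
same_by_position <- identical(assay(eq_se, 2), assays(eq_se)[[2]])
same_by_name <- identical(assay(eq_se, "scaled"), assays(eq_se)[["scaled"]])
c(same_by_position, same_by_name)
```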
## Range-based operations
- `subsetByOverlaps()`
`SummarizedExperiment` objects support all of the `findOverlaps()` methods and
associated functions. This includes `subsetByOverlaps()`, which makes it easy
to subset a `SummarizedExperiment` object by an interval.
```{r overlap}
# Subset for only rows which are in the interval 100,000 to 1,100,000 of
# chromosome 1
roi <- GRanges(seqnames="1", ranges=IRanges(start=100000, end=1100000))
subsetByOverlaps(se, roi)
```
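The same operation is easier to follow on a small object where the expected result is obvious. In the sketch below, the ranges, the `region_id` labels, and the object names prefixed with `ov_` are all hypothetical. Only one of the three rows overlaps the region of interest.

```{r sketch_subset_by_overlaps}
library(SummarizedExperiment)

# Three hypothetical rows: two on chr1, one on chr2
ov_ranges <- GRanges(c("chr1", "chr1", "chr2"),
                     IRanges(start=c(100, 5000, 100), width=200),
                     region_id=c("r1", "r2", "r3"))
ov_se <- SummarizedExperiment(assays=list(counts=matrix(1:6, nrow=3)),
                              rowRanges=ov_ranges)

# Hypothetical region of interest: positions 1 to 1,000 of chr1
ov_roi <- GRanges("chr1", IRanges(start=1, end=1000))

# Only rows whose range overlaps the region are kept
ov_hit <- subsetByOverlaps(ov_se, ov_roi)
rowData(ov_hit)$region_id
```

Row `r2` is on the right chromosome but outside the interval, and `r3` is in the right interval but on another chromosome, so both are dropped.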
# Interactive visualization
The `r BiocStyle::Biocpkg("iSEE")` package provides functions for creating an interactive user interface based on the `r BiocStyle::CRANpkg("shiny")` package for exploring data stored in `SummarizedExperiment` objects.
Information stored in standard components of `SummarizedExperiment` objects -- including assay data, and row and column metadata -- is automatically detected and used to populate the interactive multi-panel user interface.
Particular attention is given to the `r BiocStyle::Biocpkg("SingleCellExperiment")` extension of the `SummarizedExperiment` class, with visualization of dimensionality reduction results.
Extensions to the `r BiocStyle::Biocpkg("iSEE")` package provide support for more context-dependent functionality:
- `r BiocStyle::Biocpkg("iSEEde")` provides additional panels that facilitate the interactive visualization of differential expression results, including the `DESeqDataSet` extension of `SummarizedExperiment` implemented in `r BiocStyle::Biocpkg("DESeq2")`.
- `r BiocStyle::Biocpkg("iSEEpathways")` provides additional panels for the interactive visualization of pathway analysis results.
- `r BiocStyle::Biocpkg("iSEEhub")` provides functionality to import data sets stored in the Bioconductor `r BiocStyle::Biocpkg("ExperimentHub")`.
- `r BiocStyle::Biocpkg("iSEEindex")` provides functionality to import data sets from custom sources (local and remote).
# Session information
```{r}
sessionInfo()
```