---
title: "Transitioning from SCESet to SingleCellExperiment"
author:
- name: "Davis McCarthy"
  affiliation: 
  - EMBL-EBI
package: scater
output:
    BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Transition from SCESet to SingleCellExperiment}
  %\VignetteEngine{knitr::rmarkdown}
  %VignetteEncoding{UTF-8}
---


```{r knitr-options, echo=FALSE, message=FALSE, warning=FALSE}
## To render an HTML version that works nicely with github and web pages, do:
## rmarkdown::render("vignettes/vignette.Rmd", "all")
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))
```

This document provides advice for users of early versions of `scater` who will 
need to transition from the use of the `SCESet` class to the `SingleCellExperiment` class.

As of July 2017, `scater` has switched from the `SCESet` class previously 
defined within the package to the more widely applicable `SingleCellExperiment`
class. From Bioconductor 3.6 (October 2017), the release version of `scater` 
will use `SingleCellExperiment`. 

`SingleCellExperiment` is a more modern and robust class that provides a common 
data structure used by many single-cell Bioconductor packages. Advantages 
include support for sparse data matrices and the capability for on-disk storage 
of data to minimise memory usage for large single-cell datasets.

It should be straight-forward to convert existing scripts based on `SCESet`
objects to `SingleCellExperiment` objects, with key changes outlined immediately
below.


# Executive summary

* The functions `toSingleCellExperiment` and `updateSCESet` (for backwards
compatibility) can be used to convert an old `SCESet` object to a 
`SingleCellExperiment` object;
* Create a new `SingleCellExperiment` object with the function 
`SingleCellExperiment` (actually less fiddly than creating a new `SCESet`);
* `scater` functions have been refactored to take `SingleCellExperiment` 
objects, so once data is in a `SingleCellExperiment` object, the user experience
is almost identical to that with the `SCESet` class.

Potential "gotchas":

* Cell names can now be accessed/assigned with the `colnames` function (instead
of `sampleNames` or `cellNames` for an `SCESet` object);
* Feature (gene/transcript) names should now be accessed/assigned with the 
`rownames` function (instead of `featureNames`);
* Cell metadata, stored as `phenoData` in an `SCESet`, corresponds to `colData` 
in a `SingleCellExperiment` object and is accessed/assigned with the `colData` 
function (this replaces the `pData` function);
* Individual cell-level variables can still be accessed with the `$` operator 
(e.g. `sce$total_counts`);
* Feature metadata, stored as `featureData` in an `SCESet`, corresponds to 
`rowData` in a `SingleCellExperiment` object and is accessed/assigned with the
`rowData` function (this replaces the `fData` function);
* `plotScater`, which produces a cumulative expression, overview plot, replaces 
the generic `plot` function for `SCESet` objects.


#  A note on terminology

In Bioconductor terminology we assay numerous "features" for a number of
"samples". Features, in the context of `scater`, correspond most commonly to 
genes or transcripts, but could be any general genomic or transcriptomic regions 
(e.g. exon) of interest for which we take measurements. Samples correspond to 
cells.

With the switch to using the `SingleCellExperiment` class, the terminology has
become more general again. Now we have "rows" representing features and "cols" 
representing samples (cells). Thus, applying the `rownames` function returns the
names of the features defined for a `SingleCellExperiment` object, which in 
typical `scater` usage would correspond to gene IDs. In much of
what follows, it may be more intuitive to mentally replace "feature" with "gene"
or "transcript" (depending on the context of the study) wherever "feature"
appears.

In the `scater` context, "samples" refer to individual cells that we have
assayed. This differs from common usage of "sample" in other contexts, where
we might usually use "sample" to refer to an individual subject, a biological
replicate or similar. A "sample" in this sense in `scater` may be referred to as
a "block" in the more classical statistical sense. Within a "block" (e.g.
individual) we may have assayed numerous cells. Thus, the function `colnames`,
when applied to a `SingleCellExperiment` object returns the cell IDs.


#  The `SingleCellExperiment` class and methods

In `scater` we organise single-cell expression data in objects of the 
`SingleCellExperiment` class. The class inherits the Bioconductor 
`SummarizedExperiment` class, which provides a common interface across many
Bioconductor packages. For more details about other features inherited from
Bioconductor's `SummarizedExperiment` class, type `?SummarizedExperiment` at the
R prompt.

The class only requires some "assay data" (i.e. expression values of some sort) 
as input. Most commonly, these will be "counts" (e.g. molecule or read counts) 
and/or log2-scale transformed counts.

Cell metadata can be supplied as a `DataFrame` object, where rows are cells, and
columns are cell attributes (such as cell type, culture condition, day captured,
etc.). Feature metadata can be supplied as a `DataFrame` object, where rows are features (e.g. genes), and columns are feature attributes, such as Ensembl ID, 
biotype, gc content, etc.

We can create a minimal `SingleCellExperiment` object as follows:

```{r sceset-make-sce-minimal}
data("sc_example_counts")
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts))
example_sce
```


The requirements for the `SingleCellExperiment` class (as with other S4 classes 
in R and Bioconductor) are strict. The idea is that strictness with generating a valid
class object ensures that downstream methods applied to the class will work
reliably. 

Thus, if we supply `colData` and/or `rowData` when building an obejct, the 
expression value matrix *must* have the same number of columns as the `colData` 
`DataFrame` has rows, and it must have the same number of rows as the `rowData`
`DataFrame` has rows. Row names of the `colData` object need to match the column
names of the expression matrix and row names of the `rowData` object need to 
match row names of the expression matrix.

We can create a new `SingleCellExperiment` object with count data, cell metadata
and gene metadata as follows.

```{r sceset-make-sceset-counts-only}
data("sc_example_cell_info")
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts), 
                                    colData = sc_example_cell_info, 
                                    rowData = gene_df)
example_sce
```

Frequently (typically), we will want both raw counts and log2-scale counts in 
our `SingleCellExperiment` object. It is straight-forward to add 
log2-counts-per-million to an object containing counts.

We can use the `normalise` (or, if you prefer, `normalize`) function:

```{r normalise, eval=TRUE}
example_sce <- normalise(example_sce)
```

(This gives a warning to let us know that as size factors for normalisation
have not yet been defined, library sizes (total counts) are used instead. This
function can also be used for more sophisticated size-factor normalisation once
size factors have been calculated.)

Or, we use `calculateCPM` directly (with equivalent results): 

```{r cpm, eval=TRUE}
logcounts(example_sce) <- log2(calculateCPM(example_sce, 
                                            use.size.factors = FALSE) + 1)
```

The log-scale count data is stored in the `logcounts` assay slot of a 
`SingleCellExperiment` object. The `exprs` getter/setter function also accesses
this `logcounts` slot, to enable equivalent usage as in previous versions of 
`scater`.


# Subsetting, accessing and assigning data in a `SingleCellExperiment` object

We have accessor functions to access elements of the `SingleCellExperiment` 
object. Furthermore, subsetting `SingleCellExperiment` objects is 
straightforward and reliable, using the usual R `[]` notation, with rows 
representing features and columns representing cells.

* `counts(object)`: returns the matrix of read counts. As you can see above, if
no counts are defined for the object, then the counts matrix slot is simpy
`NULL`.

```{r counts-accessor, eval=TRUE}
counts(example_sce)[1:3, 1:6]
```
* `exprs(object)`: returns the matrix of (log-counts) expression values, in fact
accessing the `logcounts` slot of the object (synonym for `logcounts`). Typically these
should be log2(counts-per-million) values or
log2(reads-per-kilobase-per-million-mapped), appropriately normalised of course.
The package will generally assume that these are the values to use for 
expression.

```{r exprs-accessor, eval=TRUE}
exprs(example_sce)[1:3, 1:6]
```
* Generically, we can access any assay data from the object with the `assay` 
function. We simply supply the function with the `SingleCellExperiment` object 
and the name of the desired expression matrix:
```{r assay, eval=FALSE}
assay(example_sce, "counts")[1:3, 1:6]
```

Similarly we can assign a new (say, transformed) expression matrix to an 
`SingleCellExperiment` object using `assay` as follows:
```{r assay-set, eval=TRUE}
assay(example_sce, "counts") <- counts(example_sce)
```

For convenience (and backwards compatibility) getters and setters are provided
as follows:  `exprs`, `tpm`, `cpm`, `fpkm` and versions of these with the prefix "norm_"):

Handily, it is also easy to replace other data in slots of the `SCESet` object
using generic accessor and replacement functions.

```{r sce-demo-replacement, eval=TRUE}
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
## replace rowData (previously featureData)
rowData(example_sce) <- gene_df
## replace colData (previously phenotype data)
colData(example_sce) <- DataFrame(sc_example_cell_info)
```

After gaining familiarity with creating and manipulating `SingleCellExperiment`
objects, see the other `scater` vignettes for guidance on using `scater` for 
quality control, data visualisation and more.