---
title: "Fluent genomics with plyranges and tximeta"
author: "Stuart Lee, Michael Lawrence, Michael Love"
abstract: "We construct a simple workflow for fluent genomics data analysis using the R/Bioconductor ecosystem. This involves three core steps: **import** the data into an appropriate abstraction, **model** the data with respect to the biological questions of interest, and **integrate** the results with respect to their underlying genomic coordinates. Here we show how to implement these steps to integrate published RNA-seq and ATAC-seq experiments on macrophage cell lines. Using *tximeta*, we **import** RNA-seq transcript quantifications into an analysis-ready data structure, called the *SummarizedExperiment*, that contains the ranges of the reference transcripts and metadata on their provenance. Using *SummarizedExperiment*s to represent the ATAC-seq and RNA-seq data, we **model**  differentially accessible (DA) chromatin peaks and differentially expressed (DE) genes with existing Bioconductor packages. Using *plyranges* we then **integrate** the results to see if there is an enrichment of DA peaks near DE genes by finding overlaps and aggregating over log-fold change thresholds. The combination of these packages and their integration with the Bioconductor ecosystem provide a coherent framework for analysts to iteratively and reproducibly explore their biological data."
output:
  bookdown::html_document2:
    base_format: rmarkdown::html_vignette
    toc: true
    toc_depth: 3
    fig_width: 5
bibliography: library.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{fluentGenomics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding[utf8]{inputenc}
---

```{r setup, include = FALSE}
library(fluentGenomics)
dir <- system.file("extdata", package="macrophage")

library(tximeta)
makeLinkedTxome(
  indexDir=file.path(dir, "gencode.v29_salmon_0.12.0"),
  source="Gencode",
  organism="Homo sapiens",
  release="29",
  genome="GRCh38",
  fasta="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz",
  gtf=file.path(dir, "gencode.v29.annotation.gtf.gz"), # local version
  write=FALSE
)

knitr::opts_chunk$set(
  fig.align = "center"
)
```

(ref:workflow) An overview of the fluent genomics workflow. First, we *import* data as a *SummarizedExperiment* object, which enables interoperability with downstream analysis packages. Then we *model* our assay data, using the existing Bioconductor packages *DESeq2* and *limma*. We take the results of our models for each assay with respect to their genomic coordinates, and *integrate* them. First, we compute the overlap between the results of each assay, then aggregate over the combined genomic regions, and finally summarize to compare enrichment for differentially expressed genes to non differentially expressed genes. The final output can be used for downstream visualization or further transformation.

(ref:maplot) Visualization of *DESeq2* results as an "MA plot". Genes that have an adjusted *p-value* below 0.01 are colored red.

(ref:boxplot) A boxplot of maximum LFCs for DA peaks for DE genes compared to non-DE genes where genes have at least one DA peak.

(ref:linechart) A line chart displaying how relative enrichment of DA peaks change between DE genes compared to non-DE genes as the absolute DA LFC threshold increases.

(ref:linechart2) A line chart displaying how gene and peak counts change as the absolute DA LFC threshold increases. Lines are colored according to whether they represent a gene that is DE or not. Note the x-axis is on a $\log_{10}$ scale.

# Introduction

In this workflow, we examine a subset of the RNA-seq and ATAC-seq data from
@alasoo, a study that involved treatment of macrophage cell lines from a number
of human donors with interferon gamma (IFNg), *Salmonella* infection, or both
treatments combined. @alasoo examined gene expression and chromatin
accessibility in a subset of 86 successfully differentiated induced pluripotent
stem cells (iPSC) lines, and compared baseline and response with respect to
chromatin accessibility and gene expression at specific quantitative trait loci
(QTL). The authors found that many of the stimulus-specific expression QTL were
already detectable as chromatin QTL in naive cells, and further hypothesize
about the nature and role of transcription factors implicated in the response
to stimulus.

We will perform a much simpler analysis than the one found in @alasoo, using
their publicly available RNA-seq and ATAC-seq data (ignoring the genotypes). We
will examine the effect of IFNg stimulation on gene expression and chromatin
accessibility, and look to see if there is an enrichment of differentially
accessible (DA) ATAC-seq peaks in the vicinity of differentially expressed (DE)
genes. This is plausible, as the transcriptomic response to IFNg stimulation
may be mediated through binding of regulatory proteins to accessible regions,
and this binding may increase the accessibility of those regions such that it
can be detected by ATAC-seq.

```{r workflow, fig.cap = "(ref:workflow)", fig.align='center', out.width="\\textwidth", echo = FALSE}
knitr::include_graphics("workflow.png")
```

Throughout the workflow (Figure \@ref(fig:workflow)), we will use existing
Bioconductor infrastructure to understand these datasets. In particular, we
will emphasize the use of the Bioconductor packages *plyranges* and *tximeta*.
The *plyranges* package fluently transforms data tied to genomic ranges using
operations like shifting, window construction, overlap detection, etc. It is
described by @Lee2019 and leverages underlying core Bioconductor infrastructure
[@granges; @bioc] and the *tidyverse* design principles @tidyverse.

The *tximeta* package described by @Love2019-tximeta is used to read RNA-seq
quantification data into R/Bioconductor, such that the transcript ranges and
their provenance are automatically attached to the object containing expression
values and differential expression results.

## Experimental Data

The data used in this workflow is available from two packages: the *macrophage*
Bioconductor ExperimentData package and from the workflow package
*fluentGenomics*.

The *macrophage* package contains RNA-seq quantification from 24 RNA-seq
samples, a subset of the RNA-seq samples generated and analyzed by @alasoo. The
paired-end reads were quantified using *Salmon* [@salmon], using the Gencode 29
human reference transcripts [@gencode]. For more details on quantification, and
the exact code used, consult the vignette of the
[macrophage](http://bioconductor.org/packages/macrophage) package. The package
also contains the `Snakemake` file that was used to distribute the *Salmon*
quantification jobs on a cluster [@snakemake].

The *fluentGenomics* package contains functionality to download and generate a
cached *SummarizedExperiment* object from the normalized ATAC-seq data provided
by @alasooZenodo. This object contains all 145 ATAC-seq samples across all
experimental conditions as analyzed by @alasoo. The data can be also be
downloaded directly from the
[Zenodo](https://zenodo.org/record/1188300#.XIAhXlNKjOQ) deposition.

The following code loads the path to the cached data file, or if it is not
present, will create the cache and generate a *SummarizedExperiment* using the
the *BiocFileCache* package [@bcfilecache].

We can then read the cached file and assign it to an object called `atac`.

```{r read-cache, eval = FALSE}
library(fluentGenomics)
atac <- readRDS(cache_atac_se())
```

A precise description of how we obtained this *SummarizedExperiment* object can
be found in section \@ref(atac).

# Import Data as a *SummarizedExperiment* {#se}

## Using *tximeta* to import RNA-seq quantification data

First, we specify a directory `dir`, where the quantification files are stored.
You could simply specify this directory with:

```{r dir, eval=FALSE}
dir <- "/path/to/quant/files"
```

where the path is relative to your current R session. However, in this case we
have distributed the files in the *macrophage* package. The relevant directory
and associated files can be located using `system.file`.

```{r setdir}
dir <- system.file("extdata", package="macrophage")
```

Information about the experiment is contained in the `coldata.csv` file. We
leverage the *dplyr* and *readr* packages (as part of the *tidyverse*) to read
this file into R [@tidyverse]. We will see later that *plyranges* extends these
packages to accommodate genomic ranges.

```{r coldata-rna}
library(dplyr)
library(readr)
colfile <- file.path(dir, "coldata.csv")
coldata <- read_csv(colfile) %>%
  dplyr::select(
    names,
    id = sample_id,
    line = line_id,
    condition = condition_name
  ) %>%
  dplyr::mutate(
    files = file.path(dir, "quants", names, "quant.sf.gz"),
    line = factor(line),
    condition = relevel(factor(condition), "naive")
  )
coldata
```

After we have read the `coldata.csv` file, we select relevant columns from this
table, create a new column called `files`, and transform the existing `line`
and `condition` columns into factors.  In the case of `condition`, we specify
the "naive" cell line as the reference level.  The `files` column points to the
quantifications for each observation -- these files have been gzipped, but
would typically not have the 'gz' ending if used from *Salmon* directly. One
other thing to note is the use of the pipe operator,`%>%`, which can be read as
"then", i.e. first read the data, *then* select columns, *then* mutate them.

Now we have a table summarizing the experimental design and the locations of
the quantifications. The following lines of code do a lot of work for the
analyst: importing the RNA-seq quantification (dropping *inferential
replicates* in this case), locating the relevant reference transcriptome,
attaching the transcript ranges to the data, and fetching genome information.
Inferential replicates are especially useful for performing transcript-level
analysis, but here we will use a point estimate for the per-gene counts and
perform gene-level analysis.

The result is a *SummarizedExperiment* object.

```{r tximeta-run}
suppressPackageStartupMessages(library(SummarizedExperiment))
library(tximeta)
se <- tximeta(coldata, dropInfReps=TRUE)
se
```
<!-- should we describe the data structure in more detail, or use a figure -->

On a machine with a working internet connection, the above command works
without any extra steps, as the `tximeta` function obtains any necessary
metadata via FTP, unless it is already cached locally. The *tximeta* package
can also be used without an internet connection, in this case the linked
transcriptome can be created directly from a *Salmon* index and gtf.

```{r linkedtxome-ex, eval = FALSE}
makeLinkedTxome(
  indexDir=file.path(dir, "gencode.v29_salmon_0.12.0"),
  source="Gencode",
  organism="Homo sapiens",
  release="29",
  genome="GRCh38",
  fasta="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.transcripts.fa.gz",
  gtf=file.path(dir, "gencode.v29.annotation.gtf.gz"), # local version
  write=FALSE
)
```

Because *tximeta* knows the correct reference transcriptome, we can ask
*tximeta* to summarize the transcript-level data to the gene level using the
methods of @Soneson2015.

```{r gse}
gse <- summarizeToGene(se)
```

One final note is that the `start` of positive strand genes and the `end` of
negative strand genes is now dictated by the genomic extent of the isoforms of
the gene (so the `start` and `end` of the reduced *GRanges*). Another
alternative would be to either operate on transcript abundance, and perform
differential analysis on transcript (and so avoid defining the TSS of a set of
isoforms), or to use gene-level summarized expression but to pick the most
representative TSS based on isoform expression.

## Importing ATAC-seq data as a *SummarizedExperiment* object {#atac}

The *SummarizedExperiment* object containing ATAC-seq peaks can be created from
the following tab-delimited files from @alasooZenodo:

* The sample metadata: `ATAC_sample_metadata.txt.gz` (<1M)
* The matrix of normalized read counts: `ATAC_cqn_matrix.txt.gz` (109M)
* The annotated peaks: `ATAC_peak_metadata.txt.gz` (5.6M)

To begin, we read in the sample metadata, following similar steps to those we
used to generate the `coldata` table for the RNA-seq experiment:

```{r coldata-atac, eval=FALSE}
atac_coldata <- read_tsv("ATAC_sample_metadata.txt.gz") %>%
  select(
    sample_id,
    donor,
    condition = condition_name
  ) %>%
  mutate(condition = relevel(factor(condition), "naive"))
```

The ATAC-seq counts have already been normalized with *cqn* [@Hansen2012] and
log2 transformed. Loading the *cqn*-normalized matrix of log2 transformed read
counts takes ~30 seconds and loads an object of ~370 Mb. We set the column
names so that the first column contains the rownames of the matrix, and the
remaining columns are the sample identities from the `atac_coldata` object.

```{r mat-atac, eval=FALSE}
atac_mat <- read_tsv("ATAC_cqn_matrix.txt.gz",
                     skip = 1,
                     col_names =c("rownames", atac_coldata[["sample_id"]]))
rownames <- atac_mat[["rownames"]]
atac_mat <- as.matrix(atac_mat[,-1])
rownames(atac_mat) <- rownames
```

We read in the peak metadata (locations in the genome), and convert it to a
*GRanges* object. The `as_granges()` function automatically converts the
*data.frame* into a *GRanges* object. From that result, we extract the peak_id
column and set the genome information to the build "GRCh38". We know this from
the [Zenodo entry](https://zenodo.org/record/1188300#.XJOFSlNKiL5).

```{r peaks-atac, eval=FALSE}
library(plyranges)
peaks_df <- read_tsv("ATAC_peak_metadata.txt.gz",
                     col_types = c("cidciicdc")
)

peaks_gr <- peaks_df %>%
  as_granges(seqnames = chr) %>%
  select(peak_id=gene_id) %>%
  set_genome_info(genome = "GRCh38")
```

Finally, we construct a *SummarizedExperiment* object.  We place the matrix
into the assays slot as a named list, the annotated peaks into the row-wise
ranges slot, and the sample metadata into the column-wise data slot:

```{r atac-se, eval=FALSE}
atac <- SummarizedExperiment(assays = list(cqndata=atac_mat),
                             rowRanges=peaks_gr,
                             colData=atac_coldata)
```

# Model assays

## RNA-seq differential gene expression analysis

We can easily run a differential expression analysis with *DESeq2* using the
following code chunks [@Love2014]. The design formula indicates that we want to
control for the donor baselines (`line`) and test for differences in gene
expression on the condition. For a more comprehensive discussion of DE
workflows in Bioconductor see @Love2016-f1000 and @Law2018-f1000.

```{r setup-deseq}
library(DESeq2)
dds <- DESeqDataSet(gse, ~line + condition)
# filter out lowly expressed genes
# at least 10 counts in at least 6 samples
keep <- rowSums(counts(dds) >= 10) >= 6
dds <- dds[keep,]
```

The model is fit with the following line of code:

```{r fit-model}
dds <- DESeq(dds)
```

Below we set the contrast on the condition variable, indicating we are
estimating the $\log_2$ fold change (LFC) of IFNg stimulated cell lines against
naive cell lines. We are interested in LFC greater than 1 at a nominal false
discovery rate (FDR) of 1%.

```{r results-DFrame}
res <- results(dds,
               contrast=c("condition","IFNg","naive"),
               lfcThreshold=1, alpha=0.01)
```

To see the results of the expression analysis, we can generate a summary table
and an MA plot:

```{r ma-plot, fig.cap="(ref:maplot)" }
summary(res)
DESeq2::plotMA(res, ylim=c(-10,10))
```

We now output the results as a *GRanges* object, and due to the conventions of
*plyranges*, we construct a new column called `gene_id` from the row names of
the results. Each row now contains the genomic region (`seqnames`, `start`,
`end`, `strand`) along with corresponding metadata columns (the `gene_id` and
the results of the test). Note that *tximeta* has correctly identified the
reference genome as "hg38", and this has also been added to the *GRanges* along
the results columns. This kind of book-keeping is vital once overlap operations
are performed to ensure that *plyranges* is not comparing across incompatible
genomes.

```{r results-GRanges}
suppressPackageStartupMessages(library(plyranges))
de_genes <- results(dds,
                    contrast=c("condition","IFNg","naive"),
                    lfcThreshold=1,
                    format="GRanges") %>%
  names_to_column("gene_id")
de_genes
```

From this, we can restrict the results to those that meet our FDR threshold and
select (and rename) the metadata columns we're interested in:

```{r de-genes}
de_genes <- de_genes %>%
  filter(padj < 0.01) %>%
  select(gene_id, de_log2FC = log2FoldChange, de_padj = padj)
```

We now wish to extract genes for which there is evidence that the LFC is *not*
large. We perform this test by specifying an LFC threshold and an alternative
hypothesis (`altHypothesis`) that the LFC is less than the threshold in
absolute value. To visualize the result of this test, you can run `results`
without `format="GRanges"`, and pass this object to `plotMA` as before. We
label these genes as `other_genes` and later as "non-DE genes", for comparison
with our `de_genes` set.

```{r not-de-genes}
other_genes <- results(dds,
                       contrast=c("condition","IFNg","naive"),
                       lfcThreshold=1,
                       altHypothesis="lessAbs",
                       format="GRanges") %>%
  filter(padj < 0.01) %>%
  names_to_column("gene_id") %>%
  dplyr::select(gene_id,
                de_log2FC = log2FoldChange,
                de_padj = padj)
```

## ATAC-seq peak differential abundance analysis

The following section describes the process we have used for generating a
*GRanges* object of differential peaks from the ATAC-seq data in @alasoo.

The code chunks for the remainder of this section are not run.

<!-- We can check the standard deviation over mean plot, to assess for any -->
<!-- systematic trends: -->

<!-- ```{r, eval = FALSE} --> <!-- library(ggplot2) --> <!-- mv <-
data.frame(rmu = rowMeans(assay(atac)), --> <!-- rvar = rowVars(assay(atac)))
--> <!-- ggplot(data =mv , aes(x = rmu, y = sqrt(rvar))) + --> <!-- geom_hex()
--> <!-- ``` -->

For assessing differential accessibility, we run *limma* [@Smyth2004], and
generate the a summary of LFCs and adjusted p-values for the peaks:

```{r limma, eval = FALSE}
library(limma)
design <- model.matrix(~donor + condition, colData(atac))
fit <- lmFit(assay(atac), design)
fit <- eBayes(fit)
idx <- which(colnames(fit$coefficients) == "conditionIFNg")
tt <- topTable(fit, coef=idx, sort.by="none", n=nrow(atac))
```

We now take the `rowRanges` of the *SummarizedExperiment* and attach the LFCs
and adjusted p-values from *limma*, so that we can consider the overlap with
differential expression. Note that we set the genome build to "hg38" and
restyle the chromosome information to use the "UCSC" style (e.g. "chr1",
"chr2", etc.). Again, we know the genome build from the Zenodo entry for the
ATAC-seq data.

```{r peaks-tidy, eval = FALSE}
atac_peaks <- rowRanges(atac) %>%
  remove_names() %>%
  mutate(
    da_log2FC = tt$logFC,
    da_padj = tt$adj.P.Val
  ) %>%
  set_genome_info(genome = "hg38")

seqlevelsStyle(atac_peaks) <- "UCSC"
```

The final *GRanges* object containing the DA peaks is included in the workflow
package and can be loaded as follows:

```{r load-peaks}
library(fluentGenomics)
peaks
```

# Integrate ranges

## Finding overlaps with *plyranges*

We have already used *plyranges* a number of times above, to `filter`,
`mutate`, and `select` on *GRanges* objects, as well as ensuring the correct
genome annotation and style has been used. The *plyranges* package provides a
grammar for performing transformations of genomic data [@Lee2019]. Computations
resulting from compositions of *plyranges* "verbs" are performed using
underlying, highly optimized range operations in the *GenomicRanges* package
[@granges].

For the overlap analysis, we filter the annotated peaks to have a nominal FDR
bound of 1%.

```{r filter-peaks}
da_peaks <- peaks %>%
  filter(da_padj < 0.01)
```

We now have *GRanges* objects that contain DE genes, genes without strong
signal of DE, and DA peaks. We are ready to answer the question: is there an
enrichment of DA ATAC-seq peaks in the vicinity of DE genes compared to genes
without sufficient DE signal?

## Down sampling non-differentially expressed genes

As *plyranges* is built on top of *dplyr*, it implements methods for many of
its verbs for *GRanges* objects. Here we can use `slice` to randomly sample the
rows of the `other_genes`. The `sample.int` function will generate random
samples of size equal to the number of DE-genes from the number of rows in
`other_genes`:

```{r slice-example}
size <- length(de_genes)
slice(other_genes, sample.int(n(), size))
```

We can repeat this many times to create many samples via `replicate`. By
replicating the sub-sampling multiple times, we minimize the variance on the
enrichment statistics induced by the sampling process.

```{r boot-set-01}
# set a seed for the results
set.seed(2019-08-02)
boot_genes <- replicate(10,
                        slice(other_genes, sample.int(n(), size)),
                        simplify = FALSE)
```

This creates a list of *GRanges* objects as a list, and we can bind these
together using the `bind_ranges` function. This function creates a new column
called "resample" on the result that identifies each of the input *GRanges*
objects:

```{r boot-set-02}
boot_genes <- bind_ranges(boot_genes, .id = "resample")
```

Similarly, we can then combine the `boot_genes` *GRanges*, with the DE
*GRanges* object. As the resample column was not present on the DE *GRanges*
object, this is given a missing value which we recode to a 0 using `mutate()`

```{r combine-results}
all_genes <- bind_ranges(
  de=de_genes,
  not_de = boot_genes,
  .id="origin"
) %>%
  mutate(
    origin = factor(origin, c("not_de", "de")),
    resample = ifelse(is.na(resample), 0L, as.integer(resample))
  )
all_genes
```

## Expanding genomic coordinates around the transcription start site

Now we would like to modify our gene ranges so they contain the 10 kilobases on
either side of their transcription start site (TSS). There are many ways one
could do this, but we prefer an approach via the anchoring methods in
*plyranges*. Because there is a mutual dependence between the start, end,
width, and strand of a *GRanges* object, we define anchors to fix one of
`start` and `end`, while modifying the `width`. As an example, to extract just
the TSS, we can anchor by the 5' end of the range and modify the width of the
range to equal 1.

```{r resize-01}
all_genes <- all_genes %>%
  anchor_5p() %>%
  mutate(width = 1)
```

Anchoring by the 5' end of a range will fix the `end` of negatively stranded
ranges, and fix the `start` of positively stranded ranges.

We can then repeat the same pattern but this time using `anchor_center()` to
tell *plyranges* that we are making the TSS the midpoint of a range that has
total width of 20kb, or 10kb both upstream and downstream of the TSS.

```{r resize-02}
all_genes <- all_genes %>%
  anchor_center() %>%
  mutate(width=2*1e4)
```

## Use overlap joins to find relative enrichment

We are now ready to compute overlaps between RNA-seq genes (our DE set and
bootstrap sets) and the ATAC-seq peaks. In *plyranges*, overlaps are defined as
joins between two *GRanges* objects: a _left_ and a _right_ *GRanges* object.
In an overlap join, a match is any range on the _left_ *GRanges* that is
overlapped by the _right_ *GRanges*. One powerful aspect of the overlap joins
is that the result maintains all (metadata) columns from each of the _left_ and
_right_ ranges which makes downstream summaries easy to compute.

To combine the DE genes with the DA peaks, we perform a left overlap join. This
returns to us the `all_genes` ranges (potentially with duplication), but with
the metadata columns from those overlapping DA peaks.  For any gene that has no
overlaps, the DA peak columns will have `NA`'s.

```{r olap-join}
genes_olap_peaks <- all_genes %>%
  join_overlap_left(da_peaks)
genes_olap_peaks
```

Now we can ask, how many DA peaks are near DE genes relative to "other" non-DE
genes?  A gene may appear more than once in `genes_olap_peaks`, because
multiple peaks may overlap a single gene, or because we have re-sampled the
same gene more than once, or a combination of these two cases.

For each gene (that is the combination of chromosome, the start, end, and
strand), and the "origin" (DE vs not-DE) we can compute the distinct number of
peaks for each gene and the maximum peak based on LFC. This is achieved via
`reduce_ranges_directed`, which allows an aggregation to result in a *GRanges*
object via merging neighboring genomic regions. The use of the directed suffix
indicates we're maintaining strand information. In this case, we are simply
merging ranges (genes) via the groups we mentioned above. We also have to
account for the number of resamples we have performed when counting if there
are any peaks, to ensure we do not double count the same peak:

```{r reduce-ex01}
gene_peak_max_lfc <- genes_olap_peaks %>%
  group_by(gene_id, origin)  %>%
  reduce_ranges_directed(
    peak_count = sum(!is.na(da_padj)) / n_distinct(resample),
    peak_max_lfc = max(abs(da_log2FC))
  )
```

We can then filter genes if they have any peaks and compare the peak fold
changes between non-DE and DE genes using a boxplot:

```{r boxplot, fig.cap = "(ref:boxplot)"}
library(ggplot2)
gene_peak_max_lfc %>%
  filter(peak_count > 0) %>%
  as.data.frame() %>%
  ggplot(aes(origin, peak_max_lfc)) +
  geom_boxplot()
```

In general, the DE genes have larger maximum DA fold changes relative to the
non-DE genes.

Next we examine how thresholds on the DA LFC modify the enrichment we observe
of DA peaks near DE or non-DE genes. First, we want to know how the number of
peaks within DE genes and non-DE genes change as we change threshold values on
the peak LFC. As an example, we could compute this by arbitrarily chosen LFC
thresholds of 1 or 2 as follows:

```{r summarize-ex01}
origin_peak_lfc <- genes_olap_peaks %>%
  group_by(origin) %>%
  summarize(
    peak_count = sum(!is.na(da_padj)) / n_distinct(resample),
    lfc1_peak_count =sum(abs(da_log2FC) > 1, na.rm=TRUE)/ n_distinct(resample),
    lfc2_peak_count = sum(abs(da_log2FC) > 2, na.rm=TRUE)/ n_distinct(resample)
  )
origin_peak_lfc
```

Here we see that DE genes tend to have more DA peaks near them, and that the
number of DA peaks decreases as we increase the DA LFC threshold (as expected).
We now show how to compute the ratio of peak counts from DE compared to non-DE
genes, so we can see how this ratio changes for various DA LFC thresholds.

For all variables except for the `origin` column we divide the first row's
values by the second row, which will be the enrichment of peaks in DE genes
compared to other genes. This requires us to reshape the summary table from
long form back to wide form using the *tidyr* package. First we pivot the
results of the `peak_count` columns into name-value pairs, then pivot again to
place values into the `origin` column. Then we create a new column with the
relative enrichment:

```{r pivot-enrich}
origin_peak_lfc %>%
  as.data.frame() %>%
  tidyr::pivot_longer(cols = -origin) %>%
  tidyr::pivot_wider(names_from = origin, values_from = value) %>%
  mutate(enrichment = de / not_de)
```

The above table shows that relative enrichment increases for a larger LFC
threshold.

Due to the one-to-many mappings of genes to peaks, it is unknown if we have the
same number of DE genes participating or less, as we increase the threshold on
the DA LFC. We can examine the number of genes with overlapping DA peaks at
various thresholds by grouping and aggregating twice. First, the number of
peaks that meet the thresholds are computed within each gene, origin, and
resample group. Second, within the origin column, we compute the total number
of peaks that meet the DA LFC threshold and the number of genes that have more
than zero peaks (again averaging over the number of resamples).

```{r reduce-summarize}
genes_olap_peaks %>%
  group_by(gene_id, origin, resample) %>%
  reduce_ranges_directed(
    lfc1 = sum(abs(da_log2FC) > 1, na.rm=TRUE),
    lfc2 = sum(abs(da_log2FC) > 2, na.rm=TRUE)
  ) %>%
  group_by(origin) %>%
  summarize(
    lfc1_gene_count = sum(lfc1 > 0) / n_distinct(resample),
    lfc1_peak_count = sum(lfc1) / n_distinct(resample),
    lfc2_gene_count = sum(lfc2 > 0) / n_distinct(resample),
    lfc2_peak_count = sum(lfc2) / n_distinct(resample)
  )
```

To do this for many thresholds is cumbersome and would create a lot of
duplicate code. Instead we create a single function called
`count_above_threshold` that accepts a variable and a vector of thresholds, and
computes the sum of the absolute value of the variable for each element in the
`thresholds` vector.

```{r count-fn}
count_if_above_threshold <- function(var, thresholds) {
  lapply(thresholds, function(.) sum(abs(var) > ., na.rm = TRUE))
}
```

The above function will compute the counts for any arbitrary threshold, so we
can apply it over possible LFC thresholds of interest. We choose a grid of one
hundred thresholds based on the range of absolute LFC values in the `da_peaks`
*GRanges* object:

```{r thresholds}
thresholds <- da_peaks %>%
  mutate(abs_lfc = abs(da_log2FC)) %>%
  with(
    seq(min(abs_lfc), max(abs_lfc), length.out = 100)
  )
```

The peak counts for each threshold are computed as a new list-column called
`value`. First, the *GRanges* object has been grouped by the gene, origin, and
the number of resamples columns. Then we aggregate over those columns, so each
row will contain the peak counts for all of the thresholds for a gene, origin,
and resample. We also maintain another list-column that contains the threshold
values.

```{r reduce-ex02}
genes_peak_all_thresholds <- genes_olap_peaks %>%
  group_by(gene_id, origin, resample) %>%
  reduce_ranges_directed(
    value = count_if_above_threshold(da_log2FC, thresholds),
    threshold = list(thresholds)
  )
genes_peak_all_thresholds
```

Now we can expand these list-columns into a long *GRanges* object using the
`expand_ranges()` function. This function will unlist the `value` and
`threshold` columns and lengthen the resulting *GRanges* object.  To compute
the peak and gene counts for each threshold, we apply the same summarization as
before:

```{r expand-summarize}
origin_peak_all_thresholds <- genes_peak_all_thresholds %>%
  expand_ranges() %>%
  group_by(origin, threshold) %>%
  summarize(
    gene_count = sum(value > 0) / n_distinct(resample),
    peak_count = sum(value) / n_distinct(resample)
  )
origin_peak_all_thresholds
```

Again we can compute the relative enrichment in LFCs in the same manner as
before, by pivoting the results to long form then back to wide form to compute
the enrichment. We visualize the peak enrichment changes of DE genes relative
to other genes as a line chart:

```{r line-chart, fig.cap = "(ref:linechart)"}
origin_threshold_counts <- origin_peak_all_thresholds %>%
  as.data.frame() %>%
  tidyr::pivot_longer(cols = -c(origin, threshold),
                      names_to = c("type", "var"),
                      names_sep = "_",
                      values_to = "count") %>%
  select(-var)

origin_threshold_counts %>%
  filter(type == "peak") %>%
  tidyr::pivot_wider(names_from = origin, values_from = count) %>%
  mutate(enrichment =  de / not_de) %>%
  ggplot(aes(x = threshold, y = enrichment)) +
  geom_line() +
  labs(x = "logFC threshold", y = "Relative Enrichment")
```

We computed the sum of DA peaks near the DE genes, for increasing LFC
thresholds on the accessibility change. As we increased the threshold, the
number of total peaks went down (likewise the mean number of DA peaks per
gene). It is also likely the number of DE genes with a DA peak nearby with such
a large change went down. We can investigate this with a plot that summarizes
many of the aspects underlying the enrichment plot above.

```{r line-chart2, fig.cap = "(ref:linechart2)"}
origin_threshold_counts %>%
  ggplot(aes(x = threshold,
             y = count + 1,
             color = origin,
             linetype = type)) +
  geom_line() +
  scale_y_log10()
```

# Discussion

We have shown that by using *plyranges* and *tximeta* (with support of
Bioconductor and *tidyverse* ecosystems) we can fluently iterate through the
biological data science workflow: from import, through to modeling, and data
integration.

There are several further steps that would be interesting to perform in this
analysis; for example, we could modify window size around the TSS to see how it
affects enrichment, and vary the FDR cut-offs for both the DE gene and DA peak
sets. We could also have computed variance in addition to the mean of the
bootstrap set, and so drawn an interval around the enrichment line.

Finally, our workflow illustrates the benefits of using appropriate data
abstractions provided by Bioconductor such as the *SummarizedExperiment* and
*GRanges*. These abstractions provide users with a mental model of their
experimental data and are the building blocks for constructing the modular and
iterative analyses we have shown here. Consequently, we have been able to
interoperate many decoupled R packages (from both Bioconductor and the
tidyverse) to construct a seamless end-to-end workflow that is far too
specialized for a single monolithic tool.


# Software Availability

The workflow materials, including this article can be fully reproduced
following the instructions found at the Github repository
[sa-lee/fluentGenomics](https://github.com/sa-lee/fluentGenomics). Moreover,
the development version of the workflow and all downstream dependencies can be
installed using the `BiocManager` package by running:

```{r, eval = FALSE}
# development version from Github
BiocManager::install("sa-lee/fluentGenomics")
# version available from Bioconductor
BiocManager::install("fluentGenomics")
```

This article and the analyses were performed with R [@baser] using the
*rmarkdown* [@rmarkdown], and *knitr* [@knitr; @xie2015] packages.

## Session Info

```{r}
sessionInfo()
```

## Author Contributions

All authors contributed to the writing and development of the workflow.

## Competing interests

The authors declare that they have no competing interests.

## Funding

SL is supported by an Australian Government Research Training Program (RTP)
scholarship with a top up scholarship from CSL Limited.

MIL's contribution is supported by NIH grant R01 HG009937.

__I confirm that the funders had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.__

## Acknowledgements

The authors would like to thank all participants of the Bioconductor 2019 and
BiocAsia 2019 conferences who attended and provided feedback on early versions
of this workflow paper.

# References