---
title: "Introduction to nullranges"
output:
  rmarkdown::html_document
bibliography: library.bib
vignette: |
  %\VignetteIndexEntry{0. Introduction to nullranges}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The *nullranges* package contains functions for generation of sets of
genomic ranges, represented as *GRanges* objects, for exploring
various null hypotheses. For example, one may want to assess the
significance of the overlap of two sets of ranges, by generating
statistics of what would be expected
*under the null distribution of no relationship between these sets*.
We note that many other test statistics are supported via
the flexible framework described here, combining *nullranges* with
*plyranges*.
The *nullranges* package contains a number of vignettes describing
different functionality, from basic to more advanced usage.  For a
listing of all the vignettes in the package, one can type:

```{r eval=FALSE}
vignette(package="nullranges")
```

## Choice of methods

The *nullranges* package has two distinct branches of functionality: 
*matching* or *bootstrapping* to generate null sets of genomic ranges. 

In the vignette sections below we describe these two branches and give
more formal definitions, where each branch has associated vignettes
and man page sections (see *Reference* tab from the package website).
To give a high level overview, we first provide a **decision tree**
that helps to indicate the considerations when choosing between these
branches of functionality.

Define **features** as a set of genomic locations of interest, these
are minimally represented with a chromosome, start and width,
and optionally strand and metadata (genomic ranges).
Given a set of features, suppose that we wish to create an alternate
set that represents a null feature set, sometimes also referred to as
"background" or a "control set". For example, we may see that our
primary features are close to transcription start sites, but are they
closer than we would expect compared to a reasonable choice of
null features?

Additionally, define **pool** as a much larger set of genomic
locations compared to the primary features, and **covariates** as
pieces of metadata attached to all of the considered features (stored
as `mcols(<GRanges>)`, which may be continuous, integer, factor,
etc.).

Our choice of methods is informed by the following:

* Are the features defined with respect to a pool, e.g. open chromatin
  sites bound by a protein, defined with respect to all open
  chromatin in a particular cell type?
* Are there important, potentially confounding covariates, e.g. does
  GC content, distance to particular landmarks, expression level, LD
  score, etc. potentially explain distribution of features in a way
  that should be controlled for in downstream analysis?
* Do the primary feature set and pool have different distribution of
  these potential confounding covariates?
* Does the local genomic context matter (*local* defined in genomic
  coordinate space)? Are we asking if primary features are 
  *near in the genome* to the gene start sites or some other genomic
  feature set?
* Do the features tend to be distributed in genomic coordinate space
  such that they are *clustered*, e.g. open chromatin sites tend to
  occur near each other?  Or do features that are near each other in
  the genome tend to have similar metadata of interest to the
  biological question, e.g. CpG's near each other having similar
  levels of methylation, which will be used in downstream computations
  of average methylation near gene start site? Or is there variable
  feature density that must be accounted for via genome segmentation?

The following decision tree then informs what methods to choose:

```{r nullranges-diagram, echo=FALSE}
DiagrammeR::grViz("digraph {
  graph [layout = dot, rankdir = TB]
  
  node [shape = rectangle]        
  source [label = 'Features defined with respect to a pool']
  poolyes [label = 'Yes']
  poolno [label =  'No']
  cov [label = 'Potential confounding covariates']
  covyes [label = 'Yes']
  covno [label = 'No']
  diffcov [label = 'Different covariate distribution']
  diffcovyes [label = 'Yes']
  diffcovno [label =  'No']
  match [label = 'matchRanges']
  random [label = 'Random sample']
  context [label = 'Local genomic context matters']
  contextyes [label = 'Yes']
  contextno [label = 'No']
  cluster [label = 'Features cluster in genome']
  random2 [label = 'Random sample']
  clusteryes [label = 'Yes']
  clusterno [label = 'No']
  boot [label = 'bootRanges']
  shuffle [label = 'Shuffle + exclusion']

  source -> poolyes
  source -> poolno
  poolyes -> cov
  cov -> covyes
  cov -> covno
  covyes -> diffcov
  covno -> random
  diffcov -> diffcovyes
  diffcov -> diffcovno
  diffcovyes -> match
  diffcovno -> random
  poolno -> context
  context -> contextyes
  context -> contextno
  contextyes -> cluster
  contextno -> random2
  cluster -> clusteryes
  cluster -> clusterno
  clusteryes -> boot
  clusterno -> shuffle
}", height = 500)

```

In summary, while *matchRanges* does not control for genomic
distribution and clustering of features, *bootRanges* does not
directly control for feature covariates independent of proximity and
local context.

## Related work

For general considerations of generation of null feature sets or
segmentation for enrichment or colocalization analysis, consider the
papers of
@de_2014,
@haiminen_2007,
@huen_2010,
@ferkingstad2015,
@dozmorov2017,
@kanduri_2019 
(with links in references below).
Other Bioconductor packages that offer randomization techniques for 
enrichment analysis include 
[LOLA](https://bioconductor.org/packages/LOLA) [@LOLA] and 
[regioneR](https://bioconductor.org/packages/regioneR) [@regioneR]. 
Methods implemented outside of Bioconductor include
[GAT](https://github.com/AndreasHeger/gat) [@GAT],
[GSC](https://github.com/ParkerLab/encodegsc) [@bickel_2010],
[GREAT](http://bejerano.stanford.edu/great/public/html/) [@GREAT],
[GenometriCorr](https://github.com/favorov/GenometriCorr) [@GenometriCorr],
[ChIP-Enrich](http://chip-enrich.med.umich.edu/) [@ChIP-Enrich],
and [OLOGRAM](https://dputhier.github.io/pygtftk/ologram.html) [@ologram].

We note that our block bootstrapping approach closely follows that of 
[GSC](https://github.com/ParkerLab/encodegsc), while offering
additional features/visualizations, and is re-implemented within
R/Bioconductor with efficient vectorized code for working with
*GRanges* objects [@granges]. In addition, *regioneR* and *OLOGRAM*
also offer randomization schemes that can preserve e.g. inter-feature
distances.

In the context of these related works, a design choice of *nullranges*
is its modularity: it solely generates null or background feature sets
as Bioconductor objects, such that these can be used in workflows that
compute and perform inference on arbitrary statistics within the
*plyranges* [@Lee2019] framework. We have referred to such workflows
as "fluent genomics" [@fluent], or "tidy genomics", as these workflows
build up complex operations from commonly recognized verbs (filter,
mutate, join, group by, summarize, etc.) strung together with pipes
(`%>%`).

## Further description of matching and bootstrapping

Suppose we want to examine the significance of overlaps
of genomic sets of features $x$ and $y$. To test the significance of
this overlap, we calculate the overlap expected under the null by
generating a null feature set $y'$ (potentially many times). The null
features in $y'$ may be characterized by:

1. Drawing from a larger pool $z$ ($y' \subset z$), such that $y$ and
   $y'$ have a similar distribution over one or more covariates. This
   is the "matching" case. Note that the features in $y'$ are original
   features, just drawn from a different pool than y. The
   *matchRanges* method is described in @matchRanges
   [doi: 10.1101/2022.08.05.502985](https://doi.org/10.1101/2022.08.05.502985).
2. Generating a new set of genomic features $y'$, constructing them
   from the original set $y$ by selecting blocks of the genome with
   replacement, i.e. such that features can be sampled more than once.
   This is the "bootstrapping" case. Note that, in this case, $y'$ is an
   artificial feature set, although the re-sampled features can retain
   covariates such as score from the original feature set $y$.
   The *bootRanges* method is described in @bootRanges
   [doi: 10.1101/2022.09.02.506382](https://doi.org/10.1101/2022.09.02.506382).

## In other words

1. Matching -- drawing from a pool of features but controlling for 
   certain characteristics, or *covariates*
2. Bootstrapping -- placing a number of artificial features in the 
   genome but controlling for the local dependence among features
   (e.g. features clustering in the genome and/or having correlated
   metadata)

## Options and features

We provide a number of vignettes to describe the different matching
and bootstrapping use cases. In the matching case, we have implemented
a number of options, including nearest neighbor matching or
rejection sampling based matching. In the bootstrapping case, we have
implemented options for bootstrapping across or within chromosomes, and
bootstrapping only within states of a segmented genome. We also
provide a function to segment the genome by density of features. For
example, supposing that $x$ is a subset of genes, we may want to
generate $y'$ from $y$ such that features are re-sampled in blocks
from segments across the genome with similar gene density.
In both cases, we provide a number of functions for performing quality
control via visual inspection of diagnostic plots.

## Consideration of excluded regions

Finally, we recommend to incorporate list of regions where artificial
features should *not* be placed, including the ENCODE Exclusion List
[@encode_exclude]. This and other excluded ranges are made available
in the [excluderanges](https://dozmorovlab.github.io/excluderanges/)
Bioconductor package by Mikhail Dozmorov *et al.* [@excluderanges].
Use of excluded ranges is demonstrated in the segmented block bootstrap
vignette.

# References