---
title: "geneAttribution: Identification of Candidate Genes Associated with Noncoding Genetic Variation"
author: "Arthur Wuster"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

`geneAttribution` is an R package for identifying the most likely gene or genes
through which variation at a given genomic locus in the human genome acts. A
typical use case is the annotation of results from genome-wide association
studies (GWAS). The majority of variants identified by GWAS are located in
noncoding regions and likely act by affecting gene expression
([Maurano et al. 2012](http://doi.org/10.1126/science.1222794)). Variants
typically contain multiple genes in the region of linkage disequilibrium and
identifying the causative one is challenging.

The most basic functionality of `geneAttribution` assumes that the closer gene
is to the input locus, the more likely the gene is to be causative.
Additionally, any empirical data that links genomic regions to genes (e.g.
expression quantitative trait loci (eQTL) or genome conformation data) can be
used if it is supplied in the
[UCSC .BED file](https://genome.ucsc.edu/FAQ/FAQformat.html) format.

## Basic functionality

```r
# Basic workflow, default parameters
library(geneAttribution)
geneLocs <- geneModels() # Define gene models
geneAttribution("chr2", 127156000, geneLocs) # Get candidate gene probabilities
```

The most basic functionality assumes that the closer a gene is to the input
locus, the more likely it is to be a candidate gene. The relationship between
distance to locus and candidate gene likelihood is modeled as an exponential
distribution. The likelihoods for each gene are then normalized by dividing by
the sum of likelihoods. As a result, the presence of multiple genes in the
vicinity of a locus decreases the posterior probabilities of the individual
genes.

### geneModels()

To calculate the distance between the input locus and genes, gene models are
required. For this, `geneAttribution` provides the `geneModels()` function.
`geneModels()` takes a TxDb object containing genomic coordinates of genes as
input and returns gene models in GenomicRanges format, with gene names in the
`symbol` column. Loading the gene models may take a few minutes. The default
for the TxDb input is `TxDb.Hsapiens.UCSC.hg38.knownGene`, which contains gene
models for genome build GRCh38. An alternative input would be
`TxDb.Hsapiens.UCSC.hg19.knownGene`, which contains gene models for build hg19.
The geneModels function has additional optional input:

* `maxGeneLength`. Gene models that are longer than this are excluded
* `genesToInclude` and `genesToExclude`. A character vector of gene symbols of
  genes to include (e.g. only protein coding genes) or exclude

### geneAttribution()

The minimum required input for `geneAttribution()` is a chromosome identifier
in the same format than in the gene models and a chromosomal position in the
same build than the gene models. The `geneAttribution()` function has
additional optional input:

* `lambda`, the $\lambda$ parameter for the exponential distribution modeling
  the candidate gene likelihood based on an exponential distribution. The
  default, based on empirical eQTL data from the Genome-Tissue Expression
  Project ([GTEx Consortium 2015](http://doi.org/10.1126/science.1262110)), is
  7.61e-06. Decreasing `lambda` gives genes that are close to the input locus
  a higher probability and decreasing it gives genes that are further away a
  higher probability
* `maxDist`, the maximum distance from the input locus at which a gene will be
  considered. Genes that are more than `maxDist` bases away from the input
  locus will be ignored. The default is 1,000,000 bases
* `minPP`, the minimum posterior probability at which a gene will be reported.
  Genes whose posterior probability is less than `minPP` will be summarized as
  "Other". Set this to 0 to report all genes

## Using empirical data

```r
# Typical workflow, using sample data supplied with geneAttribution
geneLocs <- geneModels()
fileName1 <- system.file("extdata", "hiCRegions.b38.bed", package="geneAttribution")
fileName2 <- system.file("extdata", "eqtlHaplotypeBlocks.b38.bed", package="geneAttribution")
empirical <- loadBed(c(fileName1, fileName2), c(2, 5))
geneAttribution("chr2", 127156000, geneLocs, empirical)

# As above, but with user-supplied empirical data file in UCSC .BED format
geneLocs <- geneModels()
empirical <- loadBed("INPUT_FILE.bed", weights=1.5) # INPUT_FILE.bed is a correctly formatted .BED file
geneAttribution("chr2", 127156000, geneLocs, empirical)
```

In addition, `geneAttribution` can make use of empirical data that links
genomic loci to genes. eQTLs, which link genetic variants to the expression of
specific genes, are an example of this. If the input locus is located inside
the regions defined in the empirical data, the likelihood of the associated
gene is multiplied by the associated weighting. User-provided empirical data
can be loaded by using the `loadBed()` function.

The loadBed funtion reads user-provided files in UCSC .BED format. .BED files
must be tab-separated and the columns must be in the following order:
chromosome, start, end, gene symbol. An optional fifth score column may also
be provided. Gene symbols used in the empirical data must match the symbols
used for the gene models and the genome build (e.g. GRCh38) must also match the
genome build of the gene models.

Together with the .BED files, weightings for the data may be provided. The
default for this is 2, which doubles the likelihoods for genes if the input
locus is located in a region defined by the empirical data. A weighting of 1
would not change the likelihood, and a weighting of less than 1 would decrease
the likelihood. Alternatively to reading .BED files, users may construct
empirical data themselves as a list of GenomicRanges objects with a score
column containing the weights.

The `extdata` directory provides two .BED files in genome build GRCh38:

* `hiCRegions.b38.bed`, which contains Capture Hi-C genome conformation data
  linking promoters to other genomic regions in the GM12878 CD34 cell line
  ([Mifsud et al. 2015](http://doi.org/10.1038/ng.3286))
* ` eqtlHaplotypeBlocks.b38.bed`, which defines haplotype blocks with eQTLs in
  at least two different tissues from the Genome-Tissue Expression Project
  ([GTEx Consortium 2015](http://doi.org/10.1126/science.1262110))

Because both files are supplied as examples, they are limited to a 10 MB region
(120,000,000-130,000,000) on chromosome 2.

## Obtaining the coordinates of candidate genes
The output of the `geneAttribution` function is a named numerical vector of
candidate gene probabilities. In some instances, it may be useful to also know
the coordinates of the candidate genes, as this annotation can help with
further work with package results. It can easily be obtained by subsetting the
gene models object.

```r
geneLocs <- geneModels()
geneLocs <- geneModels() # Define gene models
pp <- geneAttribution("chr2", 127156000, geneLocs, minPP=0) # Posterior prob.
geneLocs[match(names(pp), geneLocs$symbol)] # Subset gene models
```