---
title: "Building an annotation database for metaseqR2"
author: "Panagiotis Moulos"
date: "`r BiocStyle::doc_date()`"
output: 
  BiocStyle::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Building an annotation database for metaseqR2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Simple, flexible and reusable annotation for metaseqR2 pipeline
================================================================================

When using the first version of metaseqR, one had to either embed the annotation
to the gene/exon/3' UTR counts, or to download and construct on-the-fly the
required annotation when starting from BAM files. Although the counting and gene
model construction results could (anf still can) be saved and re-used with other
analysis parameters changed (e.g. statistical algorithms), one could not easily
add for example new data to an existing dataset without re-running the whole
pipeline and re-downloading annotation. On top of that, many times, the main
Ensembl servers (when using Ensembl annotations) do not respond well to biomaRt 
calls, so the whole pipeline may stall until the servers are back.

Another main issue with the annotation used by metaseqR was that there was no 
straightforward way, provided by metaseqR, to archive and version the annotation
used by a specific analysis and was up to the user to take care of 
reproducibility at this level. Furthermore, there was no straightforward way for
a user to plugin own annotation elements (e.g. in the form of a GTF file) and 
use it in the same manner as standard annotations supported by metaseqR, e.g. 
when analyzing data from not-so-often studied organisms such as insects. 
Plugging-in own annotation was possible but usually a painful procedure, which
has become now very easy.

The annotation database builder for metaseqR2 remedies the above situations. The
`buildAnnotationDatabase` function should be run once with the organisms one
requires to have locally to work with and then that's it! Of course you can
manage your database by adding and removing specific annotations (and you even
can play with an SQLite browser, although not advised, as the database structure
is rather simple). Furthermore, you can use the metaseqR2 annotation database
and management mechanism for any other type of analysis where you require to 
have a simple tab-delimited annotation file, acquired with very little effort.

# Supported organisms

The following organisms (essentially genome versions) are supported for 
automatic database builds:

* Human (*Homo sapiens*) genome version **hg38**
* Human (*Homo sapiens*) genome version **hg19**
* Human (*Homo sapiens*) genome version **hg18**
* Mouse (*Mus musculus*) genome version **mm10**
* Mouse (*Mus musculus*) genome version **mm9**
* Rat (*Rattus norvegicus*) genome version **rn6**
* Rat (*Rattus norvegicus*) genome version **rn5**
* Fruitfly (*Drosophila melanogaster*) genome version **dm6**
* Fruitfly (*Drosophila melanogaster*) genome version **dm3**
* Zebrafish (*Danio rerio*) genome version **danRer7**
* Zebrafish (*Danio rerio*) genome version **danRer10**
* Zebrafish (*Danio rerio*) genome version **danRer11**
* Chimpanzee (*Pan troglodytes*) genome version **panTro4**
* Chimpanzee (*Pan troglodytes*) genome version **panTro5**
* Pig (*Sus scrofa*) genome version **susScr3**
* Pig (*Sus scrofa*) genome version **susScr11**
* Horse (*Equus cabalus*) genome version **equCab2**
* Arabidopsis (*Arabidobsis thaliana*) genome version **TAIR10**

# Using the local database

## Installation of metaseqR2

To install the metaseqR2 package, start R and enter:

```{r install-0, eval=FALSE, echo=TRUE}
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("metaseqR2")
```

## Setup the database

By default, the database file will be written in the
`system.file(package="metaseqR2")` directory. You can specify another prefered
destination for it using the `db` argument in the function call, but if you do 
that, you will have to supply the `localDb` argument pointing to the SQLite 
database file you created to every metaseqr2 call you perform, otherwise, the
pipeline will download and use annotations on-the-fly.

In this vignette, we will build a minimal database comprising only the mouse
*mm10* genome version from Ensembl. The database will be build in a temporary
directory inside session `tempdir()`.

**Important note**: As the annotation build function makes use of 
[Kent](http://hgdownload.soe.ucsc.edu/admin/exe/) utilities for creating 3'UTR
annotations from RefSeq and UCSC, the latter cannot be built in Windows. 
Therefore it is advised to either build the annotation database in a Linux 
system or use our pre-built databases.

```{r load-0, eval=TRUE, echo=FALSE, tidy=FALSE, message=FALSE, warning=FALSE}
library(metaseqR2)
```

```{r example-1, eval=TRUE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
library(metaseqR2)

buildDir <- file.path(tempdir(),"test_anndb")
dir.create(buildDir)

# The location of the custom database
myDb <- file.path(buildDir,"testann.sqlite")

# Since we are using Ensembl, we can also ask for a version
organisms <- list(mm10=100)
sources <- ifelse(.Platform$OS.type=="unix",c("ensembl","refseq"),"ensembl")

# If the example is not running in a multicore system, rc is ignored
buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,db=myDb,rc=0.5)
```

## Use the database

Now, that a small database is in place, let's retrieve some data. Remember that
since the built database is not in the default location, we need to pass the
database file in each data retrieval function. The annotation is retrieved as
a `GRanges` object by default.

```{r example-2, eval=TRUE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="gene",
    db=myDb)
genes

# Load standard annotation based on 3' UTR coordinates
utrs <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="utr",
    db=myDb)
utrs

# Load summarized exon annotation based used with RNA-Seq analysis
sumEx <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="exon",
    summarized=TRUE,db=myDb)
sumEx

# Load standard annotation based on gene body coordinates from RefSeq
if (.Platform$OS.type=="unix") {
    refGenes <- loadAnnotation(genome="mm10",refdb="refseq",level="gene",
        type="gene",db=myDb)
    refGenes
}
```

Or as a data frame if you prefer using `asdf=TRUE`. The data frame however does 
not contain metadata like `Seqinfo` to be used for any susequent validations:

```{r example-3, eval=TRUE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
# Load standard annotation based on gene body coordinates
genes <- loadAnnotation(genome="mm10",refdb="ensembl",level="gene",type="gene",
    db=myDb,asdf=TRUE)
head(genes)
```

## Add a custom annotation

Apart from the supported organisms and databases, you can add a custom 
annotation. Such an annotation can be: 

* A non-supported organism (e.g. an insect or another mammal e.g. dog)
* A modification or further curation you have done to existing/supported
annotations
* A supported organism but from a different source
* Any other case where the provided annotations are not adequate

This can be achieved through the usage of
[GTF](https://www.ensembl.org/info/website/upload/gff.html) files, along with
some simple metadata that you have to provide for proper import to the
annotation database. This can be achieved through the usage of the
`buildCustomAnnotation` function. Details on required metadata can be found
in the function's help page.

**Important note:** Please note that importing a custom genome annotation 
directly from UCSC (UCSC SQL database dumps) is not supported in Windows as the
process involves using the `genePredToGtf` which is not available for Windows.

Let's try a couple of exammples. The first one is a custom annotation for the
Ebola virus from UCSC:

```{r example-4, eval=TRUE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
# Setup a temporary directory to download files etc.
customDir <- file.path(tempdir(),"test_custom")
dir.create(customDir)

# Convert from GenePred to GTF - Unix/Linux only!
if (.Platform$OS.type == "unix" && !grepl("^darwin",R.version$os)) {
    # Download data from UCSC
    goldenPath="http://hgdownload.cse.ucsc.edu/goldenPath/"
    # Gene annotation dump
    download.file(paste0(goldenPath,"eboVir3/database/ncbiGene.txt.gz"),
        file.path(customDir,"eboVir3_ncbiGene.txt.gz"))
    # Chromosome information
    download.file(paste0(goldenPath,"eboVir3/database/chromInfo.txt.gz"),
        file.path(customDir,"eboVir3_chromInfo.txt.gz"))

    # Prepare the build
    chromInfo <- read.delim(file.path(customDir,"eboVir3_chromInfo.txt.gz"),
        header=FALSE)
    chromInfo <- chromInfo[,1:2]
    rownames(chromInfo) <- as.character(chromInfo[,1])
    chromInfo <- chromInfo[,2,drop=FALSE]
    
    # Coversion from genePred to GTF
    genePredToGtf <- file.path(customDir,"genePredToGtf")
    if (!file.exists(genePredToGtf)) {
        download.file(
        "http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
            genePredToGtf
        )
        system(paste("chmod 775",genePredToGtf))
    }
    gtfFile <- file.path(customDir,"eboVir3.gtf")
    tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
        "tgtf",sep="."))
    command <- paste0(
        "zcat ",file.path(customDir,"eboVir3_ncbiGene.txt.gz"),
        " | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
        " -source=eboVir3"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
        gtfFile
    )
    system(command)

    # Build with the metadata list filled (you can also provide a version)
    buildCustomAnnotation(
        gtfFile=gtfFile,
        metadata=list(
            organism="eboVir3_test",
            source="ucsc_test",
            chromInfo=chromInfo
        ),
        db=myDb
    )

    # Try to retrieve some data
    eboGenes <- loadAnnotation(genome="eboVir3_test",refdb="ucsc_test",
        level="gene",type="gene",db=myDb)
    eboGenes
}
```

Another example, the Atlantic cod from UCSC. The same things apply for the
operating system.

```{r example-5, eval=FALSE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
if (.Platform$OS.type == "unix") {
    # Gene annotation dump
    download.file(paste0(goldenPath,"gadMor1/database/augustusGene.txt.gz"),
        file.path(customDir,"gadMori1_augustusGene.txt.gz"))
    # Chromosome information
    download.file(paste(goldenPath,"gadMor1/database/chromInfo.txt.gz",sep=""),
        file.path(customDir,"gadMori1_chromInfo.txt.gz"))

    # Prepare the build
    chromInfo <- read.delim(file.path(customDir,"gadMori1_chromInfo.txt.gz"),
        header=FALSE)
    chromInfo <- chromInfo[,1:2]
    rownames(chromInfo) <- as.character(chromInfo[,1])
    chromInfo <- chromInfo[,2,drop=FALSE]
    
    # Coversion from genePred to GTF
    genePredToGtf <- file.path(customDir,"genePredToGtf")
    if (!file.exists(genePredToGtf)) {
        download.file(
        "http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/genePredToGtf",
            genePredToGtf
        )
        system(paste("chmod 775",genePredToGtf))
    }
    gtfFile <- file.path(customDir,"gadMori1.gtf")
    tmpName <- file.path(customDir,paste(format(Sys.time(),"%Y%m%d%H%M%S"),
        "tgtf",sep="."))
    command <- paste0(
        "zcat ",file.path(customDir,"gadMori1_augustusGene.txt.gz"),
        " | ","cut -f2- | ",genePredToGtf," file stdin ",tmpName,
        " -source=gadMori1"," -utr && grep -vP '\t\\.\t\\.\t' ",tmpName," > ",
        gtfFile
    )
    system(command)

    # Build with the metadata list filled (you can also provide a version)
    buildCustomAnnotation(
        gtfFile=gtfFile,
        metadata=list(
            organism="gadMor1_test",
            source="ucsc_test",
            chromInfo=chromInfo
        ),
        db=myDb
    )

    # Try to retrieve some data
    gadGenes <- loadAnnotation(genome="gadMor1_test",refdb="ucsc_test",
        level="gene",type="gene",db=myDb)
    gadGenes
}
```

Another example, Armadillo from Ensembl. This should work irrespectively of 
operating system. We are downloading chromosomal information from UCSC.

```{r example-6, eval=FALSE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
# Gene annotation dump from Ensembl
download.file(paste0("ftp://ftp.ensembl.org/pub/release-98/gtf/",
    "dasypus_novemcinctus/Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
    file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"))

# Chromosome information will be provided from the following BAM file
# available from Ensembl. We have noticed that when using Windows as the OS,
# a remote BAM files cannot be opened by scanBamParam, so for this example,
# chromosome length information will not be available when running in Windows.
bamForInfo <- NULL
if (.Platform$OS.type == "unix")
    bamForInfo <- paste0("ftp://ftp.ensembl.org/pub/release-98/bamcov/",
        "dasypus_novemcinctus/genebuild/Dasnov3.broad.Ascending_Colon_5.1.bam")

# Build with the metadata list filled (you can also provide a version)
buildCustomAnnotation(
    gtfFile=file.path(customDir,"Dasypus_novemcinctus.Dasnov3.0.98.gtf.gz"),
    metadata=list(
        organism="dasNov3_test",
        source="ensembl_test",
        chromInfo=bamForInfo
    ),
    db=myDb
)

# Try to retrieve some data
dasGenes <- loadAnnotation(genome="dasNov3_test",refdb="ensembl_test",
    level="gene",type="gene",db=myDb)
dasGenes
```

## A complete build

A quite complete build (with latest versions of Ensembl annotations) would look
like (supposing the default annotation database location):

```{r example-7, eval=FALSE, echo=TRUE, tidy=FALSE, message=TRUE, warning=FALSE}
organisms <- list(
    hg18=67,
    hg19=75,
    hg38=97:98,
    mm9=67,
    mm10=97:98,
    rn5=79,
    rn6=97:98,
    dm3=78,
    dm6=97:98,
    danrer7=79,
    danrer10=91,
    danrer11=97:98,
    pantro4=90,
    pantro5=97:98,
    susscr3=89,
    susscr11=97:98,
    equcab2=97:98
)

sources <- c("ensembl","ucsc","refseq")

buildAnnotationDatabase(organisms,sources,forceDownload=FALSE,rc=0.5)
```

The aforementioned complete built can be found
[here](https://tinyurl.com/ybycpr6b)
Complete builts will become available from time to time (e.g. with every new
Ensembl relrase) for users who do not wish to create annotation databases on
their own. Root access may be required (depending on the metaseqR2 library
location) to place it in the default location where it can be found 
automatically.

# Annotations on-the-fly

If for some reason you do not want to build and use an annotation database for
metaseqR2 analyses (not recommended) or you wish to perform an analysis with an
organism that does not yet exist in the database, the `loadAnnotation` function
will perform all required actions (download and create a `GRanges` object) 
on-the-fly as long as there is an internet connection.

However, the above function does not handle custom annotations in GTF files.
In a scenario where you want to use a custom annotation only once, you should
supply the `annotation` argument to the `metaseqr2` function, which is almost
the same as the `metadata` argument used in `buildCustomAnnotation`, actually 
augmented by a list member for the GTF
file, that is:

```{r pseudo-1, eval=TRUE, echo=TRUE, message=TRUE, warning=FALSE}
annotation <- list(
    gtf="PATH_TO_GTF",
    organism="ORGANISM_NAME",
    source="SOURCE_NAME",
    chromInfo="CHROM_INFO"
)
```

The above argument can be passed to the metaseqr2 call in the respective 
position.

For further details about custom annotations on the fly, please check
`buildCustomAnnotation` and `importCustomAnnotation` functions.

# Session Info

```{r si-1, eval=TRUE, echo=TRUE}
sessionInfo()
```