---
title: 'SpliceWiz: the cookbook'
author: "Alex CH Wong"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
output:
    rmarkdown::html_document:
        highlight: pygments
        toc: true
        toc_float: true
abstract:
    This vignette is a guide containing example code for performing real-life
    tasks. Importantly, it covers some functionality that were not covered in
    the Quick-Start vignette (because they are too computationally intensive
    to be reproducible in a vignette).
    
    Version `r packageVersion("SpliceWiz")`
vignette: >
    %\VignetteIndexEntry{SpliceWiz: the cookbook}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```
\

# Loading SpliceWiz

For instructions on installing and configuring SpliceWiz, please see the
Quick-Start vignette.

```{r}
library(SpliceWiz)
```
\

# Reference Generation

First, define the path to the directory in which the reference should be stored.
This directory will be made by SpliceWiz, but its parent directory must exist,
otherwise an error will be returned.

```{r eval = FALSE}
ref_path <- "./Reference"
```
\


### Create a SpliceWiz reference from user-defined FASTA and GTF files locally

Note that setting `genome_path = "hg38"` will prompt SpliceWiz to use the 
default files for nonPolyA and Mappability exclusion references in the 
generation of its reference. Valid options for `genome_path` are "hg38", "hg19",
"mm10" and "mm9".

```{r eval=FALSE}
buildRef(
    reference_path = ref_path,
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "hg38"
)
```
\

### Prepare genome resources and building the reference as separate steps

`buildRef()` first fetches the genome and gene annotations and makes a 
compressed local copy in the `resources` subdirectory of the given reference
path. This can sometimes be a long process (especially when downloading the
genome from the internet). Also, one may choose to generate the mappability
exclusion regions manually (see `Reference Generation` - 
`Mappability exclusion generation using Rsubread` section). This needs to occur
prior to generation of the SpliceWiz reference.

To separately prepare the annotation data and build the SpliceWiz reference,
use the `getResources()` function to specify the FASTA and GTF files. Once
`getResources()` has completed successfully, call `buildRef()` and leave
the `fasta` and `gtf` arguments blank, as in the example below:

```{r eval=FALSE}
getResources(
    reference_path = ref_path,
    fasta = "genome.fa",
    gtf = "transcripts.gtf"
)

buildRef(
    reference_path = ref_path,
    genome_type = "hg38"
)
```
\

### Overwriting an existing reference, but using the same annotations

To re-build and overwrite an existing reference, using the same resource 
annotations:

```{r eval=FALSE}
# Assuming hg38 genome:

buildRef(
    reference_path = ref_path,
    genome_type = "hg38",
    overwrite = TRUE
)
```
\

### Create a SpliceWiz reference using web resources from Ensembl's FTP

The following will first download the genome and gene annotation files from the
online resource and store a local copy of it in a file cache, facilitated by
BiocFileCache. Then, it uses the downloaded resource to create the SpliceWiz
reference.

```{r eval=FALSE}
FTP <- "ftp://ftp.ensembl.org/pub/release-94/"

buildRef(
    reference_path = ref_path,
    fasta = paste0(FTP, "fasta/homo_sapiens/dna/",
        "Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz"), 
    gtf = paste0(FTP, "gtf/homo_sapiens/",
        "Homo_sapiens.GRCh38.94.chr.gtf.gz"),
    genome_type = "hg38"
)
```
\

### Create a SpliceWiz reference using AnnotationHub resources

AnnotationHub contains Ensembl references for many genomes. To browse what is
available:

```{r}
require(AnnotationHub)

ah <- AnnotationHub()
query(ah, "Ensembl")
```

For a more specific query:

```{r}
query(ah, c("Homo Sapiens", "release-94"))
```

We wish to fetch "AH65745" and "AH64631" which contains the desired FASTA
and GTF files, respectively. To build a reference using these resources:

```{r eval=FALSE}
buildRef(
    reference_path = ref_path,
    fasta = "AH65745",
    gtf = "AH64631",
    genome_type = "hg38"
)
```

`Build-Reference-methods` will recognise the inputs of `fasta` and `gtf` as
AnnotationHub resources as they begin with "AH".
\
\

### Create a SpliceWiz reference from species other than human or mouse

For human and mouse genomes, we highly recommend specifying `genome_type` as
the default mappability file is used to exclude intronic regions with repeat
sequences from intron retention analysis. For other species, one could
generate a SpliceWiz reference without this reference:

```{r eval=FALSE}
buildRef(
    reference_path = ref_path,
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = ""
)
```
\

# STAR reference generation

### Checking if STAR is installed

To use `STAR` to align FASTQ files, one must be using a system with `STAR` 
installed.
This software is not available in Windows. To check if `STAR` is available:

```{r}
STAR_version()
```
\

### Building a STAR reference

```{r eval = FALSE}
ref_path = "./Reference"

# Ensure genome resources are prepared from genome FASTA and GTF file:

if(!dir.exists(file.path(ref_path, "resource"))) {
    getResources(
        reference_path = ref_path,
        fasta = "genome.fa",
        gtf = "transcripts.gtf"
    )
}

# Generate a STAR genome reference:
STAR_BuildRef(
    reference_path = ref_path,
    n_threads = 8
)

```
\

### Building a STAR reference alongside the SpliceWiz reference

If `STAR` is available on the same computer or server where R/RStudio
is being run, we can use the one-line function `buildFullRef`. This
function will:
* Prepare the resources from the given FASTA and GTF files
* Generate a STAR genome
* Use the STAR genome and the FASTA file to *de-novo* calculate and define low
  mappability regions
* Build the SpliceWiz reference using the genome resources and mappability file

```{r eval=FALSE}
buildFullRef(
    reference_path = ref_path,
    fasta = "genome.fa", gtf = "transcripts.gtf",
    genome_type = "",
    use_STAR_mappability = TRUE,
    n_threads = 4
)
```

`n_threads` specify how many threads should be used to build the STAR reference
and to calculate the low mappability regions
\
\

### Mappability exclusion generation using Rsubread

Finally, if `STAR` is not available, `Rsubread` is available on Bioconductor to
perform mappability calculations. The example code in the manual is displayed
here for convenience, to demonstrate how this would be done:

```{r eval = FALSE}
# (1a) Creates genome resource files 

ref_path <- file.path(tempdir(), "Reference")

getResources(
    reference_path = ref_path,
    fasta = chrZ_genome(),
    gtf = chrZ_gtf()
)

# (1b) Systematically generate reads based on the SpliceWiz example genome:

generateSyntheticReads(
    reference_path = ref_path
)

# (2) Align the generated reads using Rsubread:

# (2a) Build the Rsubread genome index:

setwd(ref_path)
Rsubread::buildindex(basename = "./reference_index", 
    reference = chrZ_genome())

# (2b) Align the synthetic reads using Rsubread::subjunc()

Rsubread::subjunc(
    index = "./reference_index", 
    readfile1 = file.path(ref_path, "Mappability", "Reads.fa"), 
    output_file = file.path(ref_path, "Mappability", "AlignedReads.bam"), 
    useAnnotation = TRUE, 
    annot.ext = chrZ_gtf(), 
    isGTF = TRUE
)

# (3) Analyse the aligned reads in the BAM file for low-mappability regions:

calculateMappability(
    reference_path = ref_path,
    aligned_bam = file.path(ref_path, "Mappability", "AlignedReads.bam")
)

# (4) Build the SpliceWiz reference using the calculated Mappability Exclusions

buildRef(ref_path)      

```
\

# Aligning Raw RNA-seq data using SpliceWiz's STAR wrappers
\
First, remember to check that STAR is available via command line:

```{r}
STAR_version()
```

### Aligning a single sample using STAR

```{r eval = FALSE}
STAR_alignReads(
    STAR_ref_path = file.path(ref_path, "STAR"),
    BAM_output_path = "./bams/sample1",
    fastq_1 = "sample1_1.fastq", fastq_2 = "sample1_2.fastq",
    n_threads = 8
)
```
\

### Aligning multiple samples using STAR

```{r eval = FALSE}
Experiment <- data.frame(
    sample = c("sample_A", "sample_B"),
    forward = file.path("raw_data", c("sample_A", "sample_B"),
        c("sample_A_1.fastq", "sample_B_1.fastq")),
    reverse = file.path("raw_data", c("sample_A", "sample_B"),
        c("sample_A_2.fastq", "sample_B_2.fastq"))
)

STAR_alignExperiment(
    Experiment = Experiment,
    STAR_ref_path = file.path("Reference_FTP", "STAR"),
    BAM_output_path = "./bams",
    n_threads = 8,
    two_pass = FALSE
)
```

To use two-pass mapping, set `two_pass = TRUE`. We recommend disabling this
feature, as one-pass mapping is adequate in typical-use cases.
\
\

### Finding FASTQ files recursively from a given directory

SpliceWiz can identify sequencing FASTQ files recursively from a given 
directory. It assumes that forward and reverse reads are suffixed as `_1` and
`_2`, respectively. Users can choose to identify such files using a specified
file extension. For example, to recursively identify FASTQ files of the format
`{sample}_1.fq.gz` and `{sample}_2.fq.gz`, use the following:

```{r eval = FALSE}
# Assuming sequencing files are named by their respective sample names
fastq_files <- findFASTQ("./sequencing_files", paired = TRUE,
    fastq_suffix = ".fq.gz", level = 0)
```

Then, these can be aligned as follows:

```{r eval = FALSE}
STAR_alignExperiment(
    Experiment = fastq_files,
    STAR_ref_path = file.path("Reference_FTP", "STAR"),
    BAM_output_path = "./bams",
    n_threads = 8,
    two_pass = FALSE
)
```

# Running processBAM on BAM files

To conveniently find all BAM files recursively in a given path:

```{r eval=FALSE}
bams <- findBAMS("./bams", level = 1)
```

This convenience function returns the putative sample names, either from BAM
file names themselves (`level = 0`), or from the names of their parent 
directories (`level = 1`).

To run `processBAM()` using 4 OpenMP threads:

```{r eval=FALSE}
# assume SpliceWiz reference has been generated in `ref_path`

processBAM(
    bamfiles = bams$path,
    sample_names = bams$sample,
    reference_path = ref_path,
    output_path = "./pb_output",
    n_threads = 4,
    useOpenMP = TRUE
)
```
\

### Creating COV files from BAM files without running processBAM

Sometimes one may wish to create a COV file from a BAM file without running
`processBAM()`. One reason might be because a SpliceWiz reference
is not available.

To convert a list of BAM files, run `BAM2COV()`. This is a function structurally
similar to `processBAM()` but without the need to give the path to the SpliceWiz
reference:

```{r eval=FALSE}
BAM2COV(
    bamfiles = bams$path,
    sample_names = bams$sample,
    output_path = "./pb_output",
    n_threads = 4,
    useOpenMP = TRUE
)
```
\

# Collating the experiment

Assuming the SpliceWiz reference is in `ref_path`, after running `processBAM()`
as shown in the previous section, use the convenience function 
`findSpliceWizOutput()` to tabulate a list of samples and their corresponding 
`processBAM()` outputs:

```{r eval=FALSE}
expr <- findSpliceWizOutput("./pb_output")
```

This data.frame can be directly used to run `collateData`:

```{r eval = FALSE}
collateData(
    Experiment = expr,
    reference_path = ref_path,
    output_path = "./NxtSE_output"
)
```

* NB: Novel splicing detection can be enabled by setting `novelSplicing = TRUE`.
See the SpliceWiz Quick-Start vignette for more details.

Then, the collated data can be imported as a `NxtSE` object, which is an object
that inherits `SummarizedExperiment` and has specialized containers to hold
additional data required by SpliceWiz.

```{r eval = FALSE}
se <- makeSE("./NxtSE_output")
```
\

# Downstream analysis using SpliceWiz

Please refer to SpliceWiz: Quick-Start vignette for worked examples using the 
example dataset.
\
\

# SessionInfo

```{r}
sessionInfo()
```