%\VignetteIndexEntry{Rsubread Vignette}
%\VignetteDepends{}
%\VignetteKeywords{Read mapping}
%\VignettePackage{Rsubread}
\documentclass[12pt]{article}

\usepackage{hyperref}

\newcommand{\R}[1]{{\texttt{#1}}}
\newcommand{\C}[1]{{\texttt{#1}}}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=0.2in
\evensidemargin=0.2in
\headheight=0in
\headsep=0in

\begin{document}
\title{Rsubread package: high-performance read alignment, quantification and mutation discovery}
\author{Wei Shi}
\date{21 November 2022}
\maketitle

\section{Introduction}
This vignette provides a brief description to the Rsubread package.
For more details, please refer to the Users Guide which can brought up in your R session via the following commands: \\

\noindent\R{> library(Rsubread)}\\
\R{> RsubreadUsersGuide()}\\

The Rsubread package provides facilities for processing short and long reads generated from the sequencing technologies.
These facilities include quality assessment, read alignment, read summarization, exon-exon junction detection, absolute expression calling and SNP discovery.

The Subread aligner (\R{align} function) is a highly efficient and accurate aligner for mapping genomic DNA and RNA sequencing reads.
It adopts a novel mapping paradigm called ``seed-and-vote".
Under this paradigm, a number of 16mers (called seeds or subreads) are extracted from each read and they were mapped to the reference genome to vote for the mapping location of the read.
Read mapping performed under this paradigm has been found to be more efficient and accurate than that carried out under the conventional ``seed-and-extend" paradigm (Liao et al. 2013).

Another aligner included in Rsubread is the Subjunc aligner (\R{subjunc} function).
It was also developed under the ``seed-and-vote" paradigm, but different from the Subread aligner it performs full alignment for exon spanning reads and reports exon-exon junctions in addition to the read mapping results.
The Subread aligner is recommended for gene-level expression analysis.
For other types of RNA-seq analyses such as alternative splicing analysis, the Subjunc aligner should be used.

An important step in processing next-gen sequencing data is to assign mapped reads to genomic features such as genes, exons and genomic windows.
This package includes a general-purpose read summarization function \R{featureCounts} that takes mapped reads as input and assigns them to genomic features.
In-built annotations are provided for users convenience.

Different from microarray technologies, the next-gen sequencing technologies do not provide Present/Absent calls for genomic features such as genes.
We have developed an algorithm to use the background noise measured from the RNA-seq data to call absolutely expressed genes.
The function \R{detectionCall} returns a detection p value for each gene from the read mapping results.

We have also developed a new SNP calling algorithm which is being implemented in function \R{exactSNPs}.
Our results showed that it compared favorably to competing methods, but was an order of magnitude faster.

This package also includes some other useful functions such as quality assessment (\R{qualityScores}, \R{atgcContent}), duplicate read removal (\R{removeDupReads}) and mapping percentage calculation (\R{propmapped}).

\section{Read alignment}
An index needs to be built first and then alignments can be carried out.
Building the index is an one-off operation.
The generated index can be re-used in subsequent read alignments.\\

{\noindent\bf Step 1: Index building \\}

The Rsubread package includes a dummy reference sequence that was generated by concatenating 900 100bp reads that were taken from a pilot dataset generated from the SEquencing Quality Control (SEQC) project.
We further extracted 100 reads from the same dataset and combine them with the 900 reads to make a read dataset for mapping.
Below is the command for building an index for the reference sequence:

\begin{scriptsize}
<<>>=
library(Rsubread)
ref <- system.file("extdata","reference.fa",package="Rsubread")
buildindex(basename="reference_index",reference=ref)
@
\end{scriptsize}

The generated index files were saved to the current working directory.
Rsubread creates a hash table for indexing the reference genome. 
Keys in the hash table are the 16bp sequences and hash values are their corresponding chromosomal locations. 
Color space index can be built by setting the \R{colorsapce} argument to \R{TRUE}.\\

A unique feature of Rsubread is that it allows users to control the computer memory usage in read mapping process.
Users can do this by tuning the amount of memory (in MB) to be used in read mapping.\\

{\noindent\bf Step 2: read mapping\\}

After the index was successfully built, we map the read dataset (including 1,000 reads) to the reference sequence:

\begin{scriptsize}
<<>>=
reads <- system.file("extdata","reads.txt.gz",package="Rsubread")
align.stat <- align(index="reference_index",readfile1=reads,output_file="alignResults.BAM",phredOffset=64)
@
\end{scriptsize}

Map paired-end reads:

\begin{scriptsize}
<<>>=
reads1 <- system.file("extdata","reads1.txt.gz",package="Rsubread")
reads2 <- system.file("extdata","reads2.txt.gz",package="Rsubread")
align.stat2 <- align(index="reference_index",readfile1=reads1,readfile2=reads2,
output_file="alignResultsPE.BAM",phredOffset=64)
@
\end{scriptsize}

\section{Counting mapped reads for genomic features}

The \R{featureCounts} function is a general-purpose read summarization function that assigns mapped reads (RNA-seq or gDNA-seq reads) to genomic features such as genes, exons, promoters, gene bodies and genomic windows.

This function takes as input a set of files that contain read mapping results and an annotation file that includes genomic features.
It automatically detects the format of input read files (supported formats include SAM and BAM).
Input reads can be name-sorted or location-sorted.
Users do not need to resort the reads before feeding them to \R{featureCounts}.

In-built NCBI RefSeq gene annotations for genomes mm9, mm10, hg19 and hg38 are provided for convenience.
These annotations include chromosomal coordinates of exons of each gene.
When these annotations are used for summarization, only reads overlapping with exons will be counted by \R{featureCounts}.
Users can use {\texttt{getInBuiltAnnotation}} function to retrieve these annotations.

Below gives the example code of assigning reads and fragments, generated in the last section, to two artificial genes.
Assign single end reads to genes:

\begin{scriptsize}
<<>>=
ann <- data.frame(
GeneID=c("gene1","gene1","gene2","gene2"),
Chr="chr_dummy",
Start=c(100,1000,3000,5000),
End=c(500,1800,4000,5500),
Strand=c("+","+","-","-"),
stringsAsFactors=FALSE)
ann
fc_SE <- featureCounts("alignResults.BAM",annot.ext=ann)
fc_SE
@
\end{scriptsize}

Assign fragments (read pairs) to the two genes:

\begin{scriptsize}
<<>>=
fc_PE <- featureCounts("alignResultsPE.BAM",annot.ext=ann,isPairedEnd=TRUE)
fc_PE
@
\end{scriptsize}
\section{Quantifying 10x scRNA-seq data}

The \R{cellCounts} function can be used to quantify the scRNA-seq data generated by the 10x Genomics Chromium platform. It employs the seed-and-vote strategy to align reads to a reference genome, collapses reads to UMIs (Unique Molecular Identifiers) and then assigns UMIs to genes based on the \R{featureCounts} program. \R{cellCounts} starts with processing raw reads and finishes with outputting a UMI count matrix and also other information. It is able to take both BCL and FASTQ format reads as input. When input format is BCL, \R{cellCounts} directly processes reads from the raw data files instead of converting them into FASTQ reads before processing. \R{cellCounts} supports barcode correction and whitelisting. It utilizes a barcode whitelist, a list of barcode sequences included in the Chromium assay kit that can be freely downloaded from 10x Genomics website, to detect valid barcodes from experimental data. When matching cell barcodes observed in a 10x dataset against the barcode whitelist, \R{cellCounts} allows for one base mismatch to account for sequencing errors so as to detect more valid cell barcodes. After obtaining UMI counts for each gene in each cell, \R{cellCounts} uses the EmptyDrops algorithm to call valid cell barcodes. \R{cellCounts} reports both high-confidence and rescued cells. The rescued cells have low total UMI count but are found to have a gene expression profile that is distinct from that of ambient RNAs. Below is an example of using \R{cellCounts} to quantify a very small 10x scRNA-seq dataset.

\begin{scriptsize}
<<>>=
if(grepl("linux", R.version$os) && grepl("x86_64", R.version$arch)) {
  md5.zip <- "ffd5036b36e25e9b61efc412e71820dd"
  URL <- "https://shilab-bioinformatics.github.io/cellCounts-Example/cellCounts-Example.zip"
  temp.file <- tempfile()
  temp.dir <- tempdir()
  downloaded <- tryCatch({
      download.file(URL, destfile = temp.file)
      tools::md5sum(temp.file) %in% md5.zip
    },
    error = function(cond){
      return(FALSE)
    }
  )
  if(!downloaded) cat("Unable to download the file.\n") 
} else downloaded <- FALSE

if(downloaded){
  unzip(temp.file, exdir=paste0(temp.dir,"/cellCounts-Example"))
  library(Rsubread)
  buildindex(paste0(temp.dir,"/chr1"), 
    paste0(temp.dir,"/cellCounts-Example/hg38_chr1.fa.gz"))
}
@
<<>>=
if(downloaded){
  sample.sheet <- data.frame(
    BarcodeUMIFile = paste0(temp.dir,"/cellCounts-Example/reads_R1.fastq.gz"), 
    ReadFile = paste0(temp.dir,"/cellCounts-Example/reads_R2.fastq.gz"),
    SampleName="Example", stringsAsFactors=FALSE
  )
  counts <- cellCounts(paste0(temp.dir,"/chr1"), sample.sheet,  nthreads=1,
    input.mode="FASTQ", annot.inbuilt="hg38")
}
@
<<>>=
if(downloaded) print(counts$sample.info)
@
<<>>=
if(downloaded) print(dim(counts$counts$Example))
@
\end{scriptsize}

\section{Finding exon junctions}

The RNA-seq technology provides a unique opportunity to identify the alternative splicing events that occur during the gene transcription process.
The \R{subjunc} function can be used to detect exon-exon junctions.
It first extracts a number of subreads (16mers) from each read, maps them to the reference genome and identifies the two best mapping locations for each read (representing potential locations of exons spanned by the read).
Then, it builds a junction table including all putative junctions.
Finally, it carries out a verification step to remove false positives in junction detection by realigning all the reads. 
The donor (`GT') and receptor sites(`AG'), are required to be present when calling exon-exon junctions. 
Output of this function includes the discovered exon-exon junctions and also read mapping results.

\section{Base quality scores}

Quality scores give the probabilities of read bases being incorrectly called, which is useful for examining the quality of sequencing data.
The \R{qualityScores} function can be used to quickly retrieve and display the quality score data extracted from a read file.

\begin{scriptsize}
<<>>=
x <- qualityScores(filename=reads,offset=64,nreads=1000)
x[1:10,1:10]
@
\end{scriptsize}

\section{GC content}

The \R{atgcContent} function returns fractions of A, T, G and C bases at each base location of reads or in the entire dataset.

%\begin{scriptsize}
%<<>>=
%## Fraction of A,T,G and C bases in all the reads
%x <- atgcContent(filename=reads)
%x
%## Fraction of A,T,G and C bases at each read location
%xb <- atgcContent(filename=reads,basewise=TRUE)
%xb[,1:5]
%@
%\end{scriptsize}

\section{Mapping percentage}

Function \R{propmapped} returns the proportion of mapped reads included in a SAM/BAM file.
For paired end reads, it can return the proportion of mapped fragments (ie. read pairs).

\begin{scriptsize}
<<>>=
propmapped("alignResults.BAM")
@
\end{scriptsize}

\section{Citation}
Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research, 41(10):e108.\\

\noindent{Yang Liao, Gordon K Smyth and Wei Shi (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30}

\section{Authors}
Wei Shi and Yang Liao \\
Bioinformatics Division \\
The Walter and Eliza Hall Institute of Medical Research \\
1G Royal Parade, Parkville, Victoria 3052 \\
Australia \\

\section{Contact}

Please post to the Bioconductor Support site (\url{https://support.bioconductor.org/}) if you have any questions or suggestions.

\end{document}