---
title: "InPAS Vignette"
author: "Jianhong Ou, Lihua Julie Zhu"
date: "`r doc_date()`"
package: "`r pkg_ver('InPAS')`"
abstract: >
   Alternative polyadenylation (APA) is one of the important 
   post-transcriptional regulation mechanisms which occurs in 
   most human genes. InPAS facilitates the discovery of novel 
   APA sites from RNAseq data. It leverages cleanUpdTSeq to fine 
   tune identified APA sites.
vignette: >
  %\VignetteIndexEntry{InPAS Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document
---
```{r preloadLibrary, echo=FALSE, results="hide", warning=FALSE}
suppressPackageStartupMessages({
    library(InPAS)
    library(BSgenome.Mmusculus.UCSC.mm10)
    library(TxDb.Mmusculus.UCSC.mm10.knownGene)
    library(org.Hs.eg.db)
    library(cleanUpdTSeq)
})
```
# Introduction

Alternative polyadenylation (APA) is one of the most important 
post-transcriptional regulation mechanisms which occurs in most human 
genes. APA in a gene can result in altered expression of the gene, 
which can lead pathological effect to the cells, such as uncontrolled cell 
cycle and growth. However, there are only limited ways to identify 
and quantify APA in genes, and most of them suffers from complicated 
process for library construction and requires large amount of RNAs. 

RNA-seq has become one of the most commonly used methods for quantifying 
genome-wide gene expression. There are massive RNA-seq datasets available 
publicly such as GEO and TCGA. For this reason, we develop the InPAS 
algorithm for identifying APA from conventional RNA-seq data.

The workflow for InPAS is:

- Calculate coverage from BEDGraph files
- Predict cleavage sites
- Estimate 3UTR usage

# How to

To use InPAS, BSgenome and TxDb object have to be loaded before run.

```{r loadLibrary}
library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
path <- file.path(find.package("InPAS"), "extdata")
```
Users can prepare annotaiton by **utr3Annotation** with a TxDb and 
org annotation. The 3UTR annotation prepared by **utr3Annotation** 
includes the gaps to next annotation region.

```{r prepareAnno}
library(org.Hs.eg.db)
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
            package="GenomicFeatures")
txdb <- loadDb(samplefile)
utr3.sample.anno <- utr3Annotation(txdb=txdb, 
                   orgDbSYMBOL="org.Hs.egSYMBOL")
utr3.sample.anno
```

Users can load mm10 and hg19 annotation from pre-prepared data. 
Here we loaded the prepared mm10 3UTR annotation file.

```{r loadAnno}
##step1 annotation
# load from dataset
data(utr3.mm10)
```

The coverage is calculated from BEDGraph file. The RNA-seq BAM files 
could be converted to BEDGraph files by bedtools genomecov tool 
(parameter: -bg -split). PWM and a classifier of polyA signal can be 
used for adjusting CP sites prediction. 

```{r step2HugeDataF}
load(file.path(path, "polyA.rds"))
library(cleanUpdTSeq)
data(classifier)
bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), 
               file.path(path, "UM15.extract.bedgraph"))
hugeData <- FALSE
##step1 Calculate coverage
coverage <- coverageFromBedGraph(bedgraphs, 
                                 tags=c("Baf3", "UM15"),
                                 genome=BSgenome.Mmusculus.UCSC.mm10,
                                 hugeData=hugeData)

## we hope the coverage rate of should be greater than 0.75 in the expressed gene.
## which is used because the demo data is a subset of genome.
coverageRate(coverage=coverage,
             txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
             genome=BSgenome.Mmusculus.UCSC.mm10,
             which = GRanges("chr6", ranges=IRanges(98013000, 140678000)))

##step2 Predict cleavage sites
CPs <- CPsites(coverage=coverage, 
               genome=BSgenome.Mmusculus.UCSC.mm10,
               utr3=utr3.mm10, 
               search_point_START=200, 
               cutEnd=.2, 
               long_coverage_threshold=3,
               background="10K",
               txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
               PolyA_PWM=pwm, 
               classifier=classifier,
               shift_range=50,
               step=10)
head(CPs)
##step3 Estimate 3UTR usage
res <- testUsage(CPsites=CPs, 
                 coverage=coverage, 
                 genome=BSgenome.Mmusculus.UCSC.mm10,
                 utr3=utr3.mm10, 
                 method="fisher.exact",
                 gp1="Baf3", gp2="UM15")

##step4 view the results
as(res, "GRanges") 

filterRes(res, gp1="Baf3", gp2="UM15",
          background_coverage_threshold=3,
          adj.P.Val_cutoff=0.05, 
          dPDUI_cutoff=0.3, 
          PDUI_logFC_cutoff=0.59)
```

The process described above can be done in one step.

```{r OneStep}
if(interactive()){
    res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), 
              genome=BSgenome.Mmusculus.UCSC.mm10, 
              utr3=utr3.mm10, gp1="Baf3", gp2="UM15",
              txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
              search_point_START=200,
              short_coverage_threshold=15,
              long_coverage_threshold=3, 
              cutStart=0, cutEnd=.2, 
              hugeData=FALSE,
              shift_range=50,
              PolyA_PWM=pwm, classifier=classifier,
              method="fisher.exact",
              adj.P.Val_cutoff=0.05,
              dPDUI_cutoff=0.3, 
              PDUI_logFC_cutoff=0.59)
}
```

InPAS can handle single group data.

```{r SingleGroupData}
inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), 
      genome=BSgenome.Mmusculus.UCSC.mm10, 
      utr3=utr3.mm10, gp1="Baf3", gp2=NULL,
      txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
      search_point_START=200,
      short_coverage_threshold=15,
      long_coverage_threshold=3, 
      cutStart=0, cutEnd=.2, 
      hugeData=FALSE, 
      PolyA_PWM=pwm, classifier=classifier,
      method="singleSample",
      adj.P.Val_cutoff=0.05,
      step=10)
```

# Session Info
```{r sessionInfo, results='asis'}
sessionInfo()
```