%\VignetteIndexEntry{InPAS Vignette}
%\VignetteDepends{InPAS}
%\VignetteKeywords{sequence 3UTR usage}
%\VignettePackage{InPAS}
\documentclass[12pt]{article}

<<style, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\usepackage{hyperref}
\usepackage{url}
\usepackage[numbers]{natbib}
\usepackage{graphicx}
\bibliographystyle{plainnat}

\author{Jianhong Ou, Sung Mi Park, Michael R. Green and Lihua Julie Zhu}
\begin{document}
\SweaveOpts{concordance=TRUE}
\title{InPAS guide}
\maketitle
\tableofcontents
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Introduction}
    Alternative polyadenylation (APA) is one of the most important 
    post-transcriptional regulation mechanisms which occurs in most human 
    genes. APA in a gene can result in altered expression of the gene, 
    which can lead pathological effect to the cells, such as uncontrolled cell 
    cycle and growth. However, there are only limited ways to identify 
    and quantify APA in genes, and most of them suffers from complicated 
    process for library construction and requires large amount of RNAs. 

    RNA-seq has become one of the most commonly used methods for quantifying 
    genome-wide gene expression. There are massive RNA-seq datasets available 
    publicly such as GEO and TCGA. For this reason, we develop the InPAS 
    algorithm for identifying APA from conventional RNA-seq data. 

%     Recently, two groups reported the methods how to mine APA usage from 
%     RNA-seq.
% 
%     Lu et al. developed the algorithm based on a Poisson hidden Markov 
%     model (PHMM) to identify potential APA in human liver and brain 
%     cortex tissues. This method is applicable to both single sample and 
%     pooled samples. However, if there are two or more groups of samples 
%     to compare, each group should be run separately. Moreover, as the 
%     limitation of the algorithm, each run of different group can result 
%     different APA prediction in a same gene, so it is hardly possible 
%     to compare the results from different groups. Furthermore, this 
%     method hardly can find the novel distal cleavage site.
% 
%     Recently, Xu et al. developed a novel algorithm, DaPars, which 
%     is possible for \textit{de novo} identification of dynamic APAs 
%     from RNA-seq data. This method can be used to find not only the 
%     novel proximal site but also the novel distal site. The limitation 
%     of this method is it can only handle two group dataset, but not 
%     one group dataset. Moreover, as DaPars algorithm regards the 
%     last exon as 3'UTR, it can lead biased detection of more proximal 
%     site than distal site. 
% 
%     Here, to overcome this limitations, we developed InPAS (
%     \underline{I}dentification of \underline{N}ovel alternative
%     \underline{P}oly\underline{A}denylation \underline{S}ites), a 
%     bioconductor tool for \textit{de novo} cleavage site discovery. 
%     Though InPAS was developed form DaPars algorithm, there are several 
%     improvements compared to DaPars.  It uses the power of 
%     \Rpackage{cleanUpdTSeq} to adjust cleavage site. Moreover, 
%     it extracts annotation from \Rpackage{GenomicFeatures} to make 
%     sure the correct 3'UTR ranges to prevent biased proximal 
%     site detection. 

    The workflow for InPAS is:

    \begin{enumerate}
        \item Calculate coverage from BEDGraph files
        \item Predict cleavage sites
        \item Estimate 3UTR usage
    \end{enumerate}

\section{How to}

    To use InPAS, BSgenome and TxDb object have to be loaded before run.
<<loadLibrary>>=
library(InPAS)
library(BSgenome.Mmusculus.UCSC.mm10)
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
path <- file.path(find.package("InPAS"), "extdata")
@

    Users can prepare annotaiton by \Rfunction{utr3Annotation} with a TxDb and 
    org annotation. The 3UTR annotation prepared by \Rfunction{utr3Annotation} 
    includes the gaps to next annotation region.

<<prepareAnno>>=
library(org.Hs.eg.db)
samplefile <- system.file("extdata", "hg19_knownGene_sample.sqlite",
            package="GenomicFeatures")
txdb <- loadDb(samplefile)
utr3.sample.anno <- utr3Annotation(txdb=txdb, 
                   orgDbSYMBOL="org.Hs.egSYMBOL")
utr3.sample.anno
@

    Users can load mm10 and hg19 annotation from pre-prepared data. 
    Here we loaded the prepared mm10 3UTR annotation file.
<<loadAnno>>=
##step1 annotation
# load from dataset
data(utr3.mm10)
@

    The coverage is calculated from BEDGraph file. The RNA-seq BAM files 
    could be converted to BEDGraph files by bedtools genomecov tool 
    (parameter: -bg -split).PWM and a classifier of polyA signal can be 
    used for adjusting CP sites prediction. 
<<step2HugeDataF>>=
load(file.path(path, "polyA.rds"))
library(cleanUpdTSeq)
data(classifier)
bedgraphs <- c(file.path(path, "Baf3.extract.bedgraph"), 
               file.path(path, "UM15.extract.bedgraph"))
hugeData <- FALSE
##step1 Calculate coverage
coverage <- coverageFromBedGraph(bedgraphs, 
                                 tags=c("Baf3", "UM15"),
                                 genome=BSgenome.Mmusculus.UCSC.mm10,
                                 hugeData=hugeData)

##step2 Predict cleavage sites
CPsites <- CPsites(coverage=coverage, 
                   gp1="Baf3", gp2="UM15", 
                   genome=BSgenome.Mmusculus.UCSC.mm10,
                   utr3=utr3.mm10, 
                   search_point_START=200, 
                   cutEnd=.2, 
                   long_coverage_threshold=3,
                   PolyA_PWM=pwm, 
                   classifier=classifier,
                   shift_range=100)

##step3 Estimate 3UTR usage
res <- utr3UsageEstimation(CPsites=CPsites, 
                           coverage=coverage, 
                           utr3=utr3.mm10,
                           genome=BSgenome.Mmusculus.UCSC.mm10,
                           gp1="Baf3", gp2="UM15", 
                           short_coverage_threshold=15, 
                           long_coverage_threshold=3, 
                           adjusted.P_val.cutoff=0.05, 
                           dPDUI_cutoff=0.3, 
                           PDUI_logFC_cutoff=0.59)
res[1]
@

    The process described above can be done in one step.
<<OneStep>>=
if(interactive()){
    res <- inPAS(bedgraphs=bedgraphs, tags=c("Baf3", "UM15"), 
              genome=BSgenome.Mmusculus.UCSC.mm10, 
              utr3=utr3.mm10, gp1="Baf3", gp2="UM15",
              txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
              search_point_START=200,
              short_coverage_threshold=15,
              long_coverage_threshold=3, 
              cutStart=0, cutEnd=.2, 
              hugeData=FALSE,
              shift_range=50,
              PolyA_PWM=pwm, classifier=classifier)
}
@

    InPAS can handle single group data.
<<SingleGroupData>>=
res <- inPAS(bedgraphs=bedgraphs[1], tags=c("Baf3"), 
              genome=BSgenome.Mmusculus.UCSC.mm10, 
              utr3=utr3.mm10, gp1="Baf3", gp2=NULL,
              txdb=TxDb.Mmusculus.UCSC.mm10.knownGene,
              search_point_START=200,
              short_coverage_threshold=15,
              long_coverage_threshold=3, 
              cutStart=0, cutEnd=.2, 
              hugeData=FALSE, 
              PolyA_PWM=pwm, classifier=classifier)
res[1]
@

% \section{References}
%     Lu J, Bushel PR. Dynamic expression of 3' UTRs revealed by 
%     Poisson hidden Markov modeling of RNA-Seq: implications in 
%     gene expression profiling. Gene. 2013 Sep 25;527(2):616-23. 
%     doi: 10.1016/j.gene.2013.06.052. Epub 2013 Jul 9. 
%     PubMed PMID: 23845781; PubMed Central PMCID: PMC3902974.
% 
%     Xia Z, Donehower LA, Cooper TA, Neilson JR, Wheeler DA, 
%     Wagner EJ, Li W. Dynamic analyses of alternative 
%     polyadenylation from RNA-seq reveal a 3'-UTRlandscape 
%     across seven tumour types. Nat Commun. 2014 Nov 20;5:5274. 
%     doi:10.1038/ncomms6274. PubMed PMID: 25409906.
% 
%     Sheppard S, Lawson ND, Zhu LJ. Accurate identification of 
%     polyadenylation sites from 3' end deep sequencing using a 
%     naive Bayes classifier. Bioinformatics. 2013 Oct 15;29(20):2564-71. 
%     doi: 10.1093/bioinformatics/btt446. Epub 2013 Aug 20. 
%     PubMed PMID: 23962617; PubMed Central PMCID: PMC3789547.

\section{Session Info}
<<sessionInfo, results=tex, print=TRUE>>=
toLatex(sessionInfo())
@
\end{document}