---
title: "cliProfiler Vignettes"
author:
- name: "You Zhou"
  affiliation: 
  - Buchmann Institute for Molecular Life Sciences, Frankfurt am Main, Germany
- name: "Kathi Zarnack"
  affiliation: 
  - Buchmann Institute for Molecular Life Sciences, Frankfurt am Main, Germany
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
    BiocStyle::html_document:
        toc_float: true
    BiocStyle::pdf_document: default
package: cliProfiler
vignette: |
    %\VignetteIndexEntry{cliProfiler Vignettes}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```

# Introduction

Cross-linking immunoprecipitation (CLIP) is a technique that combines UV 
cross-linking and immunoprecipitation to analyse protein-RNA interactions or to 
pinpoint RNA modifications (e.g. m6A). CLIP-based methods, such as iCLIP and 
eCLIP, allow precise mapping of RNA modification sites or RNA-binding protein 
(RBP) binding sites on a genome-wide scale. These techniques help us to unravel 
post-transcriptional regulatory networks. In order to make the visualization of 
CLIP data easier, we develop `r Biocpkg("cliProfiler")` package. The 
`r Biocpkg("cliProfiler")` includes seven functions which allow users easily 
make different profile plots.

The `r Biocpkg("cliProfiler")` package is available at
[https://bioconductor.org](https://bioconductor.org) and can be
installed via `BiocManager::install`:

```{r BiocManager, eval=FALSE}
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("cliProfiler")
```

A package only needs to be installed once. Load the package into an
R session with

```{r initialize, results="hide", warning=FALSE, message=FALSE}
library(cliProfiler)
```

# The Requirement of Data and Annotation file 
The input data for using all the functions in `r Biocpkg("cliProfiler")` should 
be the peak calling result or other similar object that represents the RBP 
binding sites or RNA modification position. Moreover, these `peaks/signals` be 
stored in the **GRanges** object. The **GRanges** is an S4 class which defined 
by `r Biocpkg("GenomicRanges")`. The GRanges class is a container for the 
genomic locations and their associated annotations. For more information about 
GRanges objects please check `r Biocpkg("GenomicRanges")` package. An example 
of GRanges object is shown below:

```{r}
testpath <- system.file("extdata", package = "cliProfiler")
## loading the test GRanges object
test <- readRDS(file.path(testpath, "test.rds"))
## Show an example of GRanges object
test
```

The annotation file that required by functions `exonProfile`, 
`geneTypeProfile`, `intronProfile`, `spliceSiteProfile` and `metaGeneProfile` 
should be in the `gff3` format and download from 
[https://www.gencodegenes.org/](https://www.gencodegenes.org/). In the 
`r Biocpkg("cliProfiler")` package, we include a test `gff3` file.

```{r}
## the path for the test gff3 file
test_gff3 <- file.path(testpath, "annotation_test.gff3")
## the gff3 file can be loaded by import.gff3 function in rtracklayer package
shown_gff3 <- rtracklayer::import.gff3(test_gff3)
## show the test gff3 file
shown_gff3
```

The function `windowProfile` allows users to find out the enrichment of peaks 
against the customized annotation file. This customized annotation file should 
be stored in the **GRanges** object. 

# metaGeneProfile
`metaGeneProfile()` outputs a meta profile, which shows the location of binding 
sites or modification sites `( peaks/signals)` along transcript regions 
(5â€™UTR, CDS and 3â€™UTR). The input of this function should be a `GRanges` 
object. 

Besides the `GRanges` object, a path to the `gff3` annotation file which 
download from [Gencode](https://www.gencodegenes.org/) is required by 
`metaGeneProfile`.

The output of `metaGeneProfile` is a `List` objects. The `List` one contains 
the GRanges objects with the calculation result which can be used in different 
ways later.

```{r}
meta <- metaGeneProfile(object = test, annotation = test_gff3)
meta[[1]]
```

Here is an explanation of the metaData columns of the output GRanges objects:  
    
* __center__ The center position of each peaks. This center position is used 
for calculating the position of peaks within the assigned genomic regions.  
* __location__ The genomic region to which this `peak/signal` belongs to.  
* __Gene ID__ The gene to which this `peak/signal` belongs. 
* __Position__ The relative position of each `peak/signal` within the genomic 
region. This value close to 0 means this peak located close to the 5' end of 
the genomic feature. The position value close to 1 means the peak close to the 
3' end of the genomic feature. Value 5 means this peaks can not be mapped to 
any annotation.

The `List` two is the meta plot which in the `ggplot` class. The user can use 
all the functions from `ggplot2` to change the detail of this plot.

```{r}
library(ggplot2)
## For example if user want to have a new name for the plot
meta[[2]] + ggtitle("Meta Profile 2")
```

For the advance usage, the `metaGeneProfile` provides two methods to calculate 
the relative position. The first method return a relative position of the 
`peaks/signals` in the genomic feature without the introns. The second method 
return a relative position value of the peak in the genomic feature with the 
introns. With the parameter `include_intron` we can easily shift between these 
two methods. If the data is a polyA plus data, we will recommend you to set 
`include_intron = FALSE`.

```{r}
meta <- metaGeneProfile(object = test, annotation = test_gff3, 
                        include_intron = TRUE)
meta[[2]]
```

The `group` option allows user to make a meta plot with multiple conditions. 
Here is an example:

```{r}
test$Treat <- c(rep("Treatment 1",50), rep("Treatment 2", 50))
meta <- metaGeneProfile(object = test, annotation = test_gff3, 
                        group = "Treat")
meta[[2]]
```

Besides, we provide an annotation filtering option for making the meta plot. 
The `exlevel` and `extranscript_support_level` could be used for specifying 
which _level_ or _transcript support level_ should be excluded. For excluding 
the  _transcript support level_ NA, user can use _6_ instead of NA. About more 
information of _level_ and _transcript support level_ you can check the 
[Gencode data format](https://www.gencodegenes.org/pages/data_format.html).

```{r eval=FALSE}
metaGeneProfile(object = test, annotation = test_gff3, exlevel = 3, 
                extranscript_support_level = c(4,5,6))
```

The `split` option could help to make the meta profile for the `peaks/signals` 
in 5'UTR, CDS and 3'UTR separately. The grey dotted line represents the peaks's 
distribution across all region.

```{r}
meta <- metaGeneProfile(object = test, annotation = test_gff3, split = TRUE)
meta[[2]]
```

# intronProfile
The function `intronProfile` generates the profile of `peaks/signals` in the 
intronic region. Here is an example for a quick use of `intronProfile`.

```{r}
intron <-  intronProfile(test, test_gff3)
```

Similar to metaGeneProfile, the output of `intronProfile` is a `List` object 
which contains two `Lists`. `List` one is the input GRanges objects with the 
calculation result. 

```{r}
intron[[1]]
```

The explanation of meta data in the output of `intronProfile` list one is 
pasted down below:  

* __center__ The center position of each peaks. This center position is used 
for calculating the position of peaks within the genomic regions.  
* **Intron_S and Intron_E** The start and end coordinates of the intron 
to which the peak is assigned.
* __Intron_length__ The length of the intron to which the peak is assigned.  
* __Intron_transcript_id__ The transcript ID for the intron.
* **Intron_map** The relative position of each peak within the assigned intron.
This value close to 0 means this peak located close to the 5' SS. The position 
value close to one means the peak close to the 3' SS. Value 3 means this peaks 
can not map to any intron.

The `List` two includes a _ggplot_ object.

```{r}
intron[[2]]
```

Similar to metaGeneProfile, in intronProfile, we provide options , such as 
`group`, `exlevel` and `extranscript_support_level`. The `group` function could 
be used to generate the comparison plot.

```{r}
intron <-  intronProfile(test, test_gff3, group = "Treat")
intron[[2]]
```

The parameter `exlevel` and `extranscript_support_level` could be used for 
specifying which _level_ or _transcript support level_ should be excluded. 
For excluding the  _transcript support level_ NA, you can use _6_. About more 
information of _level_ and _transcript support level_ you can check the 
[Gencode data format](https://www.gencodegenes.org/pages/data_format.html).

```{r}
intronProfile(test, test_gff3, group = "Treat", exlevel = 3, 
    extranscript_support_level = c(4,5,6))
```

Moreover, in the intronProfile we provide parameters `maxLength` and 
`minLength` to filter the maximum and minimum length of the intron.

```{r}
intronProfile(test, test_gff3, group = "Treat", maxLength = 10000,
    minLength = 50)
```

# exonProfile
The `exonProfile` could help to generate a profile of `peaks/signals` in the 
exons which **surrounded by introns**. The output of exonProfile is a `List` 
object. The `List` one is the `GRanges` objects of input data with the 
calculation result. 

```{r}
## Quick use
exon <- exonProfile(test, test_gff3)
exon[[1]]
```

Here is the explanation of the meta data column that output from 
`exonProfile`:  
    
* __center__ The center position of each peaks. This center position is used 
for calculating the position of peaks within the genomic regions.  
* **exon_S and exon_E** The start and end coordinates of the exon to which the 
peak is assigned.
* __exon_length__ The length of the exon to which the peak is assigned. 
* __exon_transcript_id__ The transcript ID for the assigned exon.
* **exon_map** The relative position of each peak within the assigned exon. 
This value close to 0 means this peak located close to the 3' SS. The position 
value close to one means the peak close to the 5' SS. Value 3 means this peaks 
can not be assigned to any annotation.
    
The `List` two is a _ggplot_ object which contains the exon profile.

```{r}
exon[[2]]
```

The usage of all parameters of `exonProfile` is similar to `intronProfile`. For 
more detail of parameter usage please check the `intronProfile` section.

# windowProfile
Since the `metaGeneProfile`, `intronProfile` and `exonProfile` require a 
annotation file in `gff3` format and downloaded from 
[https://www.gencodegenes.org/](https://www.gencodegenes.org/). These functions 
could only be used for _human_ and _mouse_. For the user who works on other 
species or has a customized annotation file (not in gff3 format), we develop 
the function `windowProfile`.  
    
`windowProfile` requires GRanges object as input and annotation. And 
`windowProfile` output the relative position of each peak within the given 
annotation GRanges. For example, if user would like to make a profile against 
all the exons with `windowProfile`, you just need to first extract all the 
exonic region as a GRanges object then run the `windowProfile`. Here is an 
example about using `windowProfile` to generate the profile.

```{r, results='hide', warning=FALSE, message=FALSE}
library(rtracklayer)
## Extract all the exon annotation
test_anno <- rtracklayer::import.gff3(test_gff3)
test_anno <- test_anno[test_anno$type == "exon"]
## Run the windowProfile
window_profile <- windowProfile(test, test_anno)
```

The output of `windowProfile` is a `List` objects. In the `List` one, you will 
find the GRanges object of input peaks and calculation result. And the `List` 
two contains the _ggplot_ of `windowProfile`.

```{r}
window_profile[[1]]
```

Here is an explanation of the output of `windowProfile`:  
    
* __center__ The center position of each peaks. This center position is used 
for calculating the position of peaks within the genomic regions.  
* **window_S and window_E** The boundary of the annotation to which the peak is 
assigned.  
* **window_length** The length of the annotation feature.  
* **window_map** The relative position of each peak. This value close to 0 
means this peak located close to the 5' end of the annotation. The position 
value close to one means the peak close to the 3' end. Value 3 means this 
peaks can not map to any annotation.

```{r}
window_profile[[2]]
```

# motifProfile
`motifProfile` generates the motif enrichment profile around the center of the 
input peaks. The [IUPAC code](https://www.bioinformatics.org/sms/iupac.html) 
could be used for the `motif` parameter. The `IUPAC` code includes: A, T, C, G, 
R, Y, S, W, K, M, B, D, H, V, N. The parameter `flanking` represents to the 
size of window that user would like to check around the center of peaks. The 
parameter `fraction` could be used to change the y-axis from _frequency_ to 
_fraction_.  
    
For using the `motifProtile`, the `BSgenome` with the sequence information of 
the species that you are working with is required.

```{R}
## Example for running the motifProfile
## The working species is mouse with mm10 annotation.
## Therefore the package 'BSgenome.Mmusculus.UCSC.mm10' need to be installed in 
## advance.
motif <- motifProfile(test, motif = "DRACH",
    genome = "BSgenome.Mmusculus.UCSC.mm10",
    fraction = TRUE, title = "Motif Profile",
    flanking = 10)
```

The output of `motifProfile` is a `List` object. `List` 1 contains the 
`data.frame` of the motif enrichment information for each position around the 
center of the input `peaks/signals`. `List` 2 is the _ggplot_ object of the 
plot of motif enrichment.  

```{r}
motif[[1]]
```

Each bar in the plot of `motifProfile` represents for the start site of the 
motif that is defined by the user.

```{r}
motif[[2]]
```

# geneTypeProfile
The `geneTypeProfile` could give users the information of the gene type of the 
input peaks. The input peaks for `geneTypeProfile` should be stored in the 
GRanges objects. The annotation file should be a `gff3` file and downloaded 
from [https://www.gencodegenes.org/](https://www.gencodegenes.org/).

```{r}
## Quick use of geneTypeProfile
geneTP <- geneTypeProfile(test, test_gff3)
```

The output of `geneTypeProfile` is an `List` object. `List` one includes a 
GRanges object with the input peaks and the assignment information. 

```{r}
geneTP[[1]]
```

Here is an explanation of the output GRanges object from the 
`geneTypeProfile`.  
    
* __center__ The center position of each peaks. This center position is used 
for calculating the position of peaks within the genomic regions.  
* **geneType** The gene type of the gene to which the peak is assigned.
* **Gene_ID** The gene ID to which the peak is assigned.

```{r}
geneTP[[2]]
```

# spliceSiteProfile
The `spliceSiteProfile` gives users the information of the enrichment of peaks  
around the 5' and 3' splice site (SS) in the absolute distance.

```{r}
SSprofile <- spliceSiteProfile(test, test_gff3, flanking=200, bin=40)
```

The output of `spliceSiteProfile` is a `List` object. The `List` one includes 
the GRanges object of input peaks and the position information of this peak 
around the SS.

```{r}
SSprofile[[1]]
```

Here is an explanation of output of `spliceSiteProfile`.  
    
* __center__ The center position of each peaks. This center position is used 
for calculating the position of peaks to the splice site.  
* **Position5SS** The absolute distance between peak and 5'SS. Negative value 
means the peak locates in the exonic region. Positive value means the peaks 
located in the intron.
* **Position3SS** The absolute distance between peak and 3'SS. Negative value 
means the peak locates in the intronic region. Positive value means the peaks 
located in the exon.

```{r}
SSprofile[[2]]
```

Similar to `metaProfile`, The parameter `exlevel` and 
`extranscript_support_level` could be used for 
specifying which _level_ or _transcript support level_ should be excluded. 
For excluding the  _transcript support level_ NA, you can use _6_. About more 
information of _level_ and _transcript support level_ you can check the 
[Gencode data format](https://www.gencodegenes.org/pages/data_format.html).

```{r eval=FALSE}
spliceSiteProfile(test, test_gff3, flanking=200, bin=40, exlevel=3,
                        extranscript_support_level = 6,
                        title = "Splice Site Profile")
```

# SessionInfo
The following is the session info that generated this vignette:

```{r}
sessionInfo()
```