---
title: "cliProfiler Vignettes"
author:
- name: "You Zhou"
affiliation:
- Buchmann Institute for Molecular Life Sciences, Frankfurt am Main, Germany
- name: "Kathi Zarnack"
affiliation:
- Buchmann Institute for Molecular Life Sciences, Frankfurt am Main, Germany
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
BiocStyle::html_document:
toc_float: true
BiocStyle::pdf_document: default
package: cliProfiler
vignette: |
%\VignetteIndexEntry{cliProfiler Vignettes}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r style, echo=FALSE, results='asis'}
BiocStyle::markdown()
```
# Introduction
Cross-linking immunoprecipitation (CLIP) is a technique that combines UV
cross-linking and immunoprecipitation to analyse protein-RNA interactions or to
pinpoint RNA modifications (e.g. m6A). CLIP-based methods, such as iCLIP and
eCLIP, allow precise mapping of RNA modification sites or RNA-binding protein
(RBP) binding sites on a genome-wide scale. These techniques help us to unravel
post-transcriptional regulatory networks. In order to make the visualization of
CLIP data easier, we develop `r Biocpkg("cliProfiler")` package. The
`r Biocpkg("cliProfiler")` includes seven functions which allow users easily
make different profile plots.
The `r Biocpkg("cliProfiler")` package is available at
[https://bioconductor.org](https://bioconductor.org) and can be
installed via `BiocManager::install`:
```{r BiocManager, eval=FALSE}
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("cliProfiler")
```
A package only needs to be installed once. Load the package into an
R session with
```{r initialize, results="hide", warning=FALSE, message=FALSE}
library(cliProfiler)
```
# The Requirement of Data and Annotation file
The input data for using all the functions in `r Biocpkg("cliProfiler")` should
be the peak calling result or other similar object that represents the RBP
binding sites or RNA modification position. Moreover, these `peaks/signals` be
stored in the **GRanges** object. The **GRanges** is an S4 class which defined
by `r Biocpkg("GenomicRanges")`. The GRanges class is a container for the
genomic locations and their associated annotations. For more information about
GRanges objects please check `r Biocpkg("GenomicRanges")` package. An example
of GRanges object is shown below:
```{r}
testpath <- system.file("extdata", package = "cliProfiler")
## loading the test GRanges object
test <- readRDS(file.path(testpath, "test.rds"))
## Show an example of GRanges object
test
```
The annotation file that required by functions `exonProfile`,
`geneTypeProfile`, `intronProfile`, `spliceSiteProfile` and `metaGeneProfile`
should be in the `gff3` format and download from
[https://www.gencodegenes.org/](https://www.gencodegenes.org/). In the
`r Biocpkg("cliProfiler")` package, we include a test `gff3` file.
```{r}
## the path for the test gff3 file
test_gff3 <- file.path(testpath, "annotation_test.gff3")
## the gff3 file can be loaded by import.gff3 function in rtracklayer package
shown_gff3 <- rtracklayer::import.gff3(test_gff3)
## show the test gff3 file
shown_gff3
```
The function `windowProfile` allows users to find out the enrichment of peaks
against the customized annotation file. This customized annotation file should
be stored in the **GRanges** object.
# metaGeneProfile
`metaGeneProfile()` outputs a meta profile, which shows the location of binding
sites or modification sites `( peaks/signals)` along transcript regions
(5’UTR, CDS and 3’UTR). The input of this function should be a `GRanges`
object.
Besides the `GRanges` object, a path to the `gff3` annotation file which
download from [Gencode](https://www.gencodegenes.org/) is required by
`metaGeneProfile`.
The output of `metaGeneProfile` is a `List` objects. The `List` one contains
the GRanges objects with the calculation result which can be used in different
ways later.
```{r}
meta <- metaGeneProfile(object = test, annotation = test_gff3)
meta[[1]]
```
Here is an explanation of the metaData columns of the output GRanges objects:
* __center__ The center position of each peaks. This center position is used
for calculating the position of peaks within the assigned genomic regions.
* __location__ The genomic region to which this `peak/signal` belongs to.
* __Gene ID__ The gene to which this `peak/signal` belongs.
* __Position__ The relative position of each `peak/signal` within the genomic
region. This value close to 0 means this peak located close to the 5' end of
the genomic feature. The position value close to 1 means the peak close to the
3' end of the genomic feature. Value 5 means this peaks can not be mapped to
any annotation.
The `List` two is the meta plot which in the `ggplot` class. The user can use
all the functions from `ggplot2` to change the detail of this plot.
```{r}
library(ggplot2)
## For example if user want to have a new name for the plot
meta[[2]] + ggtitle("Meta Profile 2")
```
For the advance usage, the `metaGeneProfile` provides two methods to calculate
the relative position. The first method return a relative position of the
`peaks/signals` in the genomic feature without the introns. The second method
return a relative position value of the peak in the genomic feature with the
introns. With the parameter `include_intron` we can easily shift between these
two methods. If the data is a polyA plus data, we will recommend you to set
`include_intron = FALSE`.
```{r}
meta <- metaGeneProfile(object = test, annotation = test_gff3,
include_intron = TRUE)
meta[[2]]
```
The `group` option allows user to make a meta plot with multiple conditions.
Here is an example:
```{r}
test$Treat <- c(rep("Treatment 1",50), rep("Treatment 2", 50))
meta <- metaGeneProfile(object = test, annotation = test_gff3,
group = "Treat")
meta[[2]]
```
Besides, we provide an annotation filtering option for making the meta plot.
The `exlevel` and `extranscript_support_level` could be used for specifying
which _level_ or _transcript support level_ should be excluded. For excluding
the _transcript support level_ NA, user can use _6_ instead of NA. About more
information of _level_ and _transcript support level_ you can check the
[Gencode data format](https://www.gencodegenes.org/pages/data_format.html).
```{r eval=FALSE}
metaGeneProfile(object = test, annotation = test_gff3, exlevel = 3,
extranscript_support_level = c(4,5,6))
```
The `split` option could help to make the meta profile for the `peaks/signals`
in 5'UTR, CDS and 3'UTR separately. The grey dotted line represents the peaks's
distribution across all region.
```{r}
meta <- metaGeneProfile(object = test, annotation = test_gff3, split = TRUE)
meta[[2]]
```
# intronProfile
The function `intronProfile` generates the profile of `peaks/signals` in the
intronic region. Here is an example for a quick use of `intronProfile`.
```{r}
intron <- intronProfile(test, test_gff3)
```
Similar to metaGeneProfile, the output of `intronProfile` is a `List` object
which contains two `Lists`. `List` one is the input GRanges objects with the
calculation result.
```{r}
intron[[1]]
```
The explanation of meta data in the output of `intronProfile` list one is
pasted down below:
* __center__ The center position of each peaks. This center position is used
for calculating the position of peaks within the genomic regions.
* **Intron_S and Intron_E** The start and end coordinates of the intron
to which the peak is assigned.
* __Intron_length__ The length of the intron to which the peak is assigned.
* __Intron_transcript_id__ The transcript ID for the intron.
* **Intron_map** The relative position of each peak within the assigned intron.
This value close to 0 means this peak located close to the 5' SS. The position
value close to one means the peak close to the 3' SS. Value 3 means this peaks
can not map to any intron.
The `List` two includes a _ggplot_ object.
```{r}
intron[[2]]
```
Similar to metaGeneProfile, in intronProfile, we provide options , such as
`group`, `exlevel` and `extranscript_support_level`. The `group` function could
be used to generate the comparison plot.
```{r}
intron <- intronProfile(test, test_gff3, group = "Treat")
intron[[2]]
```
The parameter `exlevel` and `extranscript_support_level` could be used for
specifying which _level_ or _transcript support level_ should be excluded.
For excluding the _transcript support level_ NA, you can use _6_. About more
information of _level_ and _transcript support level_ you can check the
[Gencode data format](https://www.gencodegenes.org/pages/data_format.html).
```{r}
intronProfile(test, test_gff3, group = "Treat", exlevel = 3,
extranscript_support_level = c(4,5,6))
```
Moreover, in the intronProfile we provide parameters `maxLength` and
`minLength` to filter the maximum and minimum length of the intron.
```{r}
intronProfile(test, test_gff3, group = "Treat", maxLength = 10000,
minLength = 50)
```
# exonProfile
The `exonProfile` could help to generate a profile of `peaks/signals` in the
exons which **surrounded by introns**. The output of exonProfile is a `List`
object. The `List` one is the `GRanges` objects of input data with the
calculation result.
```{r}
## Quick use
exon <- exonProfile(test, test_gff3)
exon[[1]]
```
Here is the explanation of the meta data column that output from
`exonProfile`:
* __center__ The center position of each peaks. This center position is used
for calculating the position of peaks within the genomic regions.
* **exon_S and exon_E** The start and end coordinates of the exon to which the
peak is assigned.
* __exon_length__ The length of the exon to which the peak is assigned.
* __exon_transcript_id__ The transcript ID for the assigned exon.
* **exon_map** The relative position of each peak within the assigned exon.
This value close to 0 means this peak located close to the 3' SS. The position
value close to one means the peak close to the 5' SS. Value 3 means this peaks
can not be assigned to any annotation.
The `List` two is a _ggplot_ object which contains the exon profile.
```{r}
exon[[2]]
```
The usage of all parameters of `exonProfile` is similar to `intronProfile`. For
more detail of parameter usage please check the `intronProfile` section.
# windowProfile
Since the `metaGeneProfile`, `intronProfile` and `exonProfile` require a
annotation file in `gff3` format and downloaded from
[https://www.gencodegenes.org/](https://www.gencodegenes.org/). These functions
could only be used for _human_ and _mouse_. For the user who works on other
species or has a customized annotation file (not in gff3 format), we develop
the function `windowProfile`.
`windowProfile` requires GRanges object as input and annotation. And
`windowProfile` output the relative position of each peak within the given
annotation GRanges. For example, if user would like to make a profile against
all the exons with `windowProfile`, you just need to first extract all the
exonic region as a GRanges object then run the `windowProfile`. Here is an
example about using `windowProfile` to generate the profile.
```{r, results='hide', warning=FALSE, message=FALSE}
library(rtracklayer)
## Extract all the exon annotation
test_anno <- rtracklayer::import.gff3(test_gff3)
test_anno <- test_anno[test_anno$type == "exon"]
## Run the windowProfile
window_profile <- windowProfile(test, test_anno)
```
The output of `windowProfile` is a `List` objects. In the `List` one, you will
find the GRanges object of input peaks and calculation result. And the `List`
two contains the _ggplot_ of `windowProfile`.
```{r}
window_profile[[1]]
```
Here is an explanation of the output of `windowProfile`:
* __center__ The center position of each peaks. This center position is used
for calculating the position of peaks within the genomic regions.
* **window_S and window_E** The boundary of the annotation to which the peak is
assigned.
* **window_length** The length of the annotation feature.
* **window_map** The relative position of each peak. This value close to 0
means this peak located close to the 5' end of the annotation. The position
value close to one means the peak close to the 3' end. Value 3 means this
peaks can not map to any annotation.
```{r}
window_profile[[2]]
```
# motifProfile
`motifProfile` generates the motif enrichment profile around the center of the
input peaks. The [IUPAC code](https://www.bioinformatics.org/sms/iupac.html)
could be used for the `motif` parameter. The `IUPAC` code includes: A, T, C, G,
R, Y, S, W, K, M, B, D, H, V, N. The parameter `flanking` represents to the
size of window that user would like to check around the center of peaks. The
parameter `fraction` could be used to change the y-axis from _frequency_ to
_fraction_.
For using the `motifProtile`, the `BSgenome` with the sequence information of
the species that you are working with is required.
```{R}
## Example for running the motifProfile
## The working species is mouse with mm10 annotation.
## Therefore the package 'BSgenome.Mmusculus.UCSC.mm10' need to be installed in
## advance.
motif <- motifProfile(test, motif = "DRACH",
genome = "BSgenome.Mmusculus.UCSC.mm10",
fraction = TRUE, title = "Motif Profile",
flanking = 10)
```
The output of `motifProfile` is a `List` object. `List` 1 contains the
`data.frame` of the motif enrichment information for each position around the
center of the input `peaks/signals`. `List` 2 is the _ggplot_ object of the
plot of motif enrichment.
```{r}
motif[[1]]
```
Each bar in the plot of `motifProfile` represents for the start site of the
motif that is defined by the user.
```{r}
motif[[2]]
```
# geneTypeProfile
The `geneTypeProfile` could give users the information of the gene type of the
input peaks. The input peaks for `geneTypeProfile` should be stored in the
GRanges objects. The annotation file should be a `gff3` file and downloaded
from [https://www.gencodegenes.org/](https://www.gencodegenes.org/).
```{r}
## Quick use of geneTypeProfile
geneTP <- geneTypeProfile(test, test_gff3)
```
The output of `geneTypeProfile` is an `List` object. `List` one includes a
GRanges object with the input peaks and the assignment information.
```{r}
geneTP[[1]]
```
Here is an explanation of the output GRanges object from the
`geneTypeProfile`.
* __center__ The center position of each peaks. This center position is used
for calculating the position of peaks within the genomic regions.
* **geneType** The gene type of the gene to which the peak is assigned.
* **Gene_ID** The gene ID to which the peak is assigned.
```{r}
geneTP[[2]]
```
# spliceSiteProfile
The `spliceSiteProfile` gives users the information of the enrichment of peaks
around the 5' and 3' splice site (SS) in the absolute distance.
```{r}
SSprofile <- spliceSiteProfile(test, test_gff3, flanking=200, bin=40)
```
The output of `spliceSiteProfile` is a `List` object. The `List` one includes
the GRanges object of input peaks and the position information of this peak
around the SS.
```{r}
SSprofile[[1]]
```
Here is an explanation of output of `spliceSiteProfile`.
* __center__ The center position of each peaks. This center position is used
for calculating the position of peaks to the splice site.
* **Position5SS** The absolute distance between peak and 5'SS. Negative value
means the peak locates in the exonic region. Positive value means the peaks
located in the intron.
* **Position3SS** The absolute distance between peak and 3'SS. Negative value
means the peak locates in the intronic region. Positive value means the peaks
located in the exon.
```{r}
SSprofile[[2]]
```
Similar to `metaProfile`, The parameter `exlevel` and
`extranscript_support_level` could be used for
specifying which _level_ or _transcript support level_ should be excluded.
For excluding the _transcript support level_ NA, you can use _6_. About more
information of _level_ and _transcript support level_ you can check the
[Gencode data format](https://www.gencodegenes.org/pages/data_format.html).
```{r eval=FALSE}
spliceSiteProfile(test, test_gff3, flanking=200, bin=40, exlevel=3,
extranscript_support_level = 6,
title = "Splice Site Profile")
```
# SessionInfo
The following is the session info that generated this vignette:
```{r}
sessionInfo()
```