--- title: "Cogito: Compare annotated genomic intervals tool" author: "Annika Buerger" package: Cogito date: "`r Sys.Date()`" output: BiocStyle::html_document vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{Cogito: Compare annotated genomic intervals tool} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction Biological studies often consist of multiple conditions which are examined with different laboratory set ups like RNA-sequencing or ChIP-sequencing. To get an overview about the whole resulting data set, **Cogito** provides an automated and complete report about all samples and basic comparisons between all different samples. This overview can be used as documentation about the data set or as starting point for further custom analysis. Cogito is not meant to do detailed genetic or epigenetic analysis. But provides a reproducible and clear report about data sets consisting of samples of different conditions and mesuared with different laboratory base technologies. # Installation The package can be installed via `Bioconductor` and loaded into R by typing ```{r installation of package Cogito, eval=FALSE} BiocManager::install("Cogito") library("Cogito") ``` ```{r load package, echo = FALSE, message = FALSE} library("Cogito", quietly = TRUE, verbose = FALSE) ``` into the R console. # Workflow The workflow of Cogito consists of two main functions: 1. `aggregateRanges(ranges, configfile, organism,` `referenceRanges, name, verbose)` 2. `summarizeRanges(aggregated.ranges, outputFormat, verbose)` The process is demonstrated using a small murine example. The data set from King et al.^[\@ref(References)] was downloaded from the NCBI GEO database under accession number [GSE77004](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77004) and preprocessed^[for details on preprocessing of the data have a look at the manual pages of the single data sets] to `GRanges`/`GRangesList` objects and restricted to chromosome 5 to save space and processing time. This example data set is loaded with the package and consists of three data objects. The first `GRanges` object contains gene expression values from RNA-sequencing with `r length(MurEpi.RNA.small)` ranges and `r ncol(mcols(MurEpi.RNA.small))` columns with the corresponding expression values. Each column represents an experimental condition. ```{r murine example data set RNA} head(MurEpi.RNA.small[, 1:3]) ``` The second `GRanges` object contains information about the methylation status from RRBS. The object has `r length(MurEpi.RRBS.small)` ranges and `r ncol(mcols(MurEpi.RRBS.small))` columns with the corresponding methylation status. Where there is not information about the methylation status the value is `NA`. Each column represents an experimental condition. ```{r murine example data set RRBS} head(MurEpi.RRBS.small[, 1:3]) ``` The third object is a `GRangesList` object containing `r length(MurEpi.ChIP.small)` `GRanges` object. Each of this objects represents an experimental condition and contains between `r min(unlist(lapply(MurEpi.ChIP.small, length)))` and `r max(unlist(lapply(MurEpi.ChIP.small, length)))` ranges with one column of attached value each. This value is the peak score resulting from the peak calling with *homer findPeaks*^[\@ref(References)]. ```{r murine example data set ChIP} head(MurEpi.ChIP.small[[1]]) ``` The data set consists of the following samples: | | RNA-seq | RRBS | ChIP-seq | |:------ |:-------:|:----:|:--------:| |**J1** | 1 | 2 | 5 | |**TKO** | 1 | 2 | 5 | |**TKO3a1** | 2 | 2 | 4 | |**TKO3a2** | 2 | 2 | 5 | |**TKO3b1** | 1 | 2 | 4 | |**DKO** | 1 | 1 | 4 | |**DKO3a1** | 1 | 1 | 3 | |**DKO3a2** | 1 | 1 | 3 | |**DKO3b1** | 1 | 1 | 3 | : (\#tab:table) Overview of murine example data set. There are several samples of different conditions (*J1*, *TKO*, ...) and base technologies (RNA-seq, RRBS and ChIP-seq). The function `aggregateRanges` is called with the example data set and the corresponding genomic information. ```{r workflow aggregateRanges with small murine data set} mm9 <- TxDb.Mmusculus.UCSC.mm9.knownGene::TxDb.Mmusculus.UCSC.mm9.knownGene example.dataset <- list(ChIP = MurEpi.ChIP.small, RNA = MurEpi.RNA.small, RRBS = MurEpi.RRBS.small) aggregated.ranges <- aggregateRanges(ranges = example.dataset, organism = mm9, name = "murine.example") ``` The function aggregates all provided data to the genes of the genome, given in the parameter `organism`: all single `IRanges` are assigned to the closest gene within a predefined default maximal distance of 100.000bp. This parameter can be changed in the configuration file if necessary. Consequently, the columns of attached values of each provided `IRanges` object are assigned to the corresponding gene. The result is one single `GRanges` object, each row representing a gene and each column a sample of the provided input data. The function `aggregateRanges` returns a `list` object containing this single `GRanges` objects, i.e. a `GRanges` object with rows representing genes with columns of attached values `mcols` where each column contains values from a specific experimental condition (such as *wildtype*) and a specific underlying base technology (such as RNA-seq expression). ```{r workflow result genes of aggregateRanges with small murine data set} head(aggregated.ranges$genes[, c(1, 2:3, 13:14, 27:28)]) ``` The return value of the function `aggregateRanges` as well contains information about the supposed underlying base technologies and conditions. ```{r workflow result 1 genes of aggregateRanges with small murine data set} lapply(aggregated.ranges$config$technologies, head, 3) head(lapply(aggregated.ranges$config$conditions, head, 3), 3) ``` To advance in the workflow, these two `list`s should be carefully checked and corrected if necessary. Also it is possible to add user defined groups of conditions to include them in the following analysis. ```{r workflow summarizeRanges with small murine data set, eval=FALSE} summarizeRanges(aggregated.ranges = aggregated.ranges) ``` The function `summarizeRanges` does not has a return value but produces three output files: The main result is the report about the given data in a pdf (or html). The report is divided into four chapters. The introduction holds general information about the underlying data set as the numbers of conditions (*wildtype* etc.) and used base technologies (ChIP-seq, RNA-seq, etc.). The first chapter contains summaries about each single column of attached values, consisting of a location and a dispersion parameter as well as a plot for a visual impression. There are `r ncol(mcols(aggregated.ranges$genes))-1` samples in total included in the presented murine example data set. The second chapter summarizes groups of columns of attached values which have the same condition and the same base technology, for example if there are two RNA-sequencing replicates of the condition *wildtpye* present. In the given murine example data set there are among others 2 samples of methylation status examined with RRBS in the condition *J1*, 5 samples of condition *J1* of ChIP-seq experiments and 2 RNA-seq samples of condition *TKO3a1*, see table \@ref(tab:table). Consequentely, this results in 16 plots in total. The third chapter summarizes groups of columns of attached values which are examined by the same base technology but do not necessarily have the same underlying condition. This is visualized in an appropriate plot. The presented murine example has three involved base technologies: RNA-seq, ChIP-seq and RRBS. Finally the fourth chapter compares every column of attached value to every other column of attached value. The comparison is visualized with an appropriate plot and a correlation test is made. Because this section may be long, the report concentrates on comparisons where a significant correlation is found (corrected p-value < 0.01). In the exemplary murine data set there are `r ncol(mcols(aggregated.ranges$genes))-1` columns of attached values, so this results in `(61*60)/2=1830` comparisons, but of not all of them show a significant correlation. Besides the pdf report the function `summarizeRanges` also produces a rmd file with which the user may customize the report or take it as a starting point for further analysis, as well as a RData file which holds all used processed data. # References {#References} King AD, Huang K, Rubbi L, Liu S, Wang CY, Wang Y, Pellegrini M, Fan G. **Reversible Regulation of Promoter and Enhancer Histone Landscape by DNA Methylation in Mouse Embryonic Stem Cells.** Cell Rep. 2016 Sep 27;17(1):289-302. doi: 10.1016/j.celrep.2016.08.083 Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. **Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities.** Mol Cell. 2010 May 28;38(4):576-89. doi: 10.1016/j.molcel.2010.05.004 The color scheme used by Cogito was generated with help of: [iWantHue](https://medialab.github.io/iwanthue/) by Mathieu Jacomy at the at the [Sciences-Po Medialab](http://medialab.sciences-po.fr/) # Session Information ```{r session info, echo = FALSE} sessionInfo() ```