--- title: "Analyzing R-loop data with RLSeq" author: - name: "Henry Miller" affiliation: - Alex Bishop Laboratory, UT Health San Antonio - Bioinformatics Research Network package: RLSeq abstract: | This vignette covers basic usage of RLSeq for evaluating data quality and analyzing R-loop locations. RLSeq is part of [RLSuite](https://gccri.bishop-lab.uthscsa.edu/rlsuite/){target="_blank"}, an R-loop analysis toolchain. RLSuite also includes [RLHub](https://github.com/Bishop-Laboratory/RLHub){target="_blank"}, [RLBase](https://gccri.bishop-lab.uthscsa.edu/rlbase/){target="_blank"}, and [RLPipes](https://github.com/Bishop-Laboratory/RLPipes){target="_blank"}. output: BiocStyle::html_document: toc_float: true toc_depth: 3 BiocStyle::pdf_document: default vignette: > %\VignetteIndexEntry{Analyzing R-loop Data with RLSeq} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r, include = FALSE} knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, dev = "png", message = FALSE, error = FALSE, warning = FALSE) BiocStyle::markdown() ``` # Introduction

**RLSeq** is a package for analyzing R-loop mapping data sets, and it is a core component of the *RLSuite* toolchain. It serves two primary purposes: (1) to facilitate the evaluation of data quality, and (2) to enable R-loop data analysis in the context of genomic annotations and the public data sets in [RLBase](https://gccri.bishop-lab.uthscsa.edu/rlbase/){target="_blank"}. The main analysis steps can be conveniently run using the `RLSeq()` function. Then, an HTML report can be generated using the `report()` function. Individual steps of this pipeline are also accessible through separate functions which provide custom analysis capabilities. This vignette will showcase the primary functionality of RLSeq with data from a publicly-available R-loop data mapping study in Ewing sarcoma cell lines, [GSE68845](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE68845){target="_blank"}. We have selected two DNA-RNA Immunoprecipitation sequencing (DRIP-seq) samples for demonstration purposes: (1) [SRX1025890](https://www.ncbi.nlm.nih.gov/sra/SRX1025890){target="_blank"}, a positive R-loop mapping sample ("POS"; condition: S9.6 -RNaseH1), and (2) [SRX1025892](https://www.ncbi.nlm.nih.gov/sra/SRX1025892){target="_blank"}, a negative control ("NEG"; condition S9.6 +RNaseH1). We will begin by showing a quick-start analysis on *SRX1025890*, and then we will proceed to discuss, in detail, the specific steps of this analysis with both samples. # Quick-start Here, we demonstrate a simple analysis workflow which utilizes a publicly-available data set stored in [*RLBase*](https://gccri.bishop-lab.uthscsa.edu/rlbase/){target="_blank"} (a database of R-loop consensus regions and R-loop-mapping experiments, also part of [*RLSuite*](https://gccri.bishop-lab.uthscsa.edu/rlsuite/){target="_blank"}). The commands below download these data, run `RLSeq()`, and generate the HTML report. ``` r # Peaks and coverage can be found in RLBase rlbase <- "https://rlbase-data.s3.amazonaws.com" pks <- file.path(rlbase, "peaks", "SRX1025890_hg38.broadPeak") cvg <- file.path(rlbase, "coverage", "SRX1025890_hg38.bw") # Initialize data in the RLRanges object. # Metadata is optional, but improves the interpretability of results rlr <- RLRanges( peaks = pks, coverage = cvg, genome = "hg38", mode = "DRIP", label = "POS", sampleName = "TC32 DRIP-Seq" ) # The RLSeq command performs all analyses rlr <- RLSeq(rlr) # Generate an html report report(rlr, reportPath = "rlseq_report_example.html") ``` The report generated by this code is found [here](https://rlbase-data.s3.amazonaws.com/misc/rlseq_report_example.html){target="_blank"}. # Preliminary ## Installation RLSeq should be installed alongside [RLHub](https://github.com/Bishop-Laboratory/RLHub){target="_blank"} to facilitate access to the data required for annotation and analysis. When downloading RLSeq from bioconductor, RLHub is already included. ``` r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("RLSeq") ``` Both packages can also be installed from github. ``` r library(remotes) install_github("Bishop-Laboratory/RLHub") install_github("Bishop-Laboratory/RLSeq") ``` ## Obtaining data RLSeq is compatible with R-loop data generated from a variety of pipelines and tools. *However*, it is strongly recommended that you use [RLPipes](https://github.com/Bishop-Laboratory/RLPipes){target="_blank"}, a snakemake-based CLI pipeline tool built specifically for upstream processing of R-loop datasets. RLPipes can be installed using [mamba](https://anaconda.org/conda-forge/mamba){target="_blank"} or [conda](https://docs.conda.io/en/latest/miniconda.html){target="_blank"} (slower). ``` shell # conda install -c conda-forge mamba mamba create -n rlpipes -c bioconda -c conda-forge rlpipes conda activate rlpipes ``` A typical config file (`CSV`) should be written as such: | experiment | |:-----------| | SRX1025890 | | SRX1025892 | And then the pipeline can be run. ``` shell RLPipes build -m DRIP rseq_out/ tests/test_data/samples.csv RLPipes run rseq_out/ ``` The resulting directory will contain `peaks/`, `coverage/`, `bam/`, and other processed data sets which are used in downstream analysis. *Note*: If you choose to use a different pipeline, use [macs2/macs3](https://github.com/macs3-project/MACS){target="_blank"} for peak calling to ensure compatibility with RLBase. # End-to-end RLSeq Here, we describe each step of the analysis pipeline which is run as part of the `RLSeq()` command. ```{r library, message=FALSE, warning=FALSE} library(RLSeq) library(dplyr) set.seed(1) ``` ## Data sets For this example, we will be using DRIP-Seq data from a 2018 *Nature* paper on R-loops in Ewing sarcoma [@Gorthi2018a]. The sample has been IP'd for R-loops (S9.6 -RNaseH1; label: **"POS"**). The data was processed using RLPipes and uploaded to RLBase. Peaks are converted to `GRanges` objects using a helper function from `regioneR`. URLs and file paths for peak files can also be supplied directly to without this step. ```{r pull_public_data} rlbase <- "https://rlbase-data.s3.amazonaws.com" # Get peaks and coverage s96Pks <- regioneR::toGRanges(file.path(rlbase, "peaks", "SRX1025890_hg38.broadPeak")) s96Cvg <- file.path(rlbase, "coverage", "SRX1025890_hg38.bw") ``` For demonstration purposes, only 10000 ranges are analyzed here. ```{r downsample_data} # For expediency, peaks we filter and down-sampled to the top 10000 by padj (V9) # This is not necessary as part of the typical workflow, however s96Pks <- s96Pks[s96Pks$V9 > 2,] s96Pks <- s96Pks[sample(names(s96Pks), 10000)] ``` Finally, `RLRanges` objects were constructed. These are the primary objects used in all `RLSeq` functions. `RLRanges` are an extension of `GRanges` which provide additional metadata and validation functions. ```{r build_rlranges} ## Build RLRanges ## # S9.6 -RNaseH1 rlr <- RLRanges( peaks = s96Pks, coverage = s96Cvg, genome = "hg38", mode = "DRIP", label = "POS", sampleName = "TC32 DRIP-Seq" ) ``` ## Sample quality Sample quality is assessed by analyzing the association of peaks with R-loop-forming sequences (RLFS). RLFS are genomic sequences that favor the formation of R-loops [@Jenjaroenpun2015]. While R-loops can form outside RLFS, there is a strong relationship between them, which provides an unbiased test of whether a set of peaks actually represents successful R-loop mapping. ### Permutation tests RLSeq first implements a permutation test to evaluate the enrichment of peaks within RLFS and build a Z-score distribution around RLFS sites. ```{r analyzeRLFS} # Analyze RLFS for positive sample rlr <- analyzeRLFS(rlr, quiet = TRUE) ``` The resulting objects now contain the permutation test results. These results can be easily visualized with the `plotRLFSRes` function. ```{r plot-perm, fig.cap="Plot of permutation test results (S9.6 -RNaseH1)."} plt <- plotRLFSRes(rlr) # try() is used to prevent errors on latest MacOS during CI/CD testing. # You don't need try() when running this normally. try(plt, silent = TRUE) ``` ### Quality classification The quality classifier is an ensemble model based on an online-learning scheme. It predicts "POS" for samples which are predicted to show robust R-loop mapping and "NEG" for samples which are not. The latest version can be accessed via [RLHub](https://github.com/Bishop-Laboratory/RLHub){target="_blank"}. For greater detail, please see the [`RLHub::modes` reference](https://bishop-laboratory.github.io/RLHub/reference/models.html#details){target="_blank"}. To apply the model and predict sample quality, use the `predictCondition()` function. ```{r predictCondition} # Predict rlr <- predictCondition(rlr) ``` The results from testing our example samples: ```{r rlresultPrediction} # Access results s96_pred <- rlresult(rlr, "predictRes") cat("Prediction: ", s96_pred$prediction) ``` ## Noise analysis **Note**: The code in this section does not work on Windows OS machines. The next step of the RLSeq quality workflow is to analyze the noisiness of the coverage signal for the user-supplied sample (requires that `coverage` is provided when creating the `RLRanges` object). The approach used in this analysis step was derived from the work of *Diaz et al., 2012* [@Diaz2012]. The method is run using the following function: ```{r noiseAnalyze} if (.Platform$OS.type != "windows") { rlr <- noiseAnalyze(rlr) } ``` The results can then be visualized in two ways. A **Fingerprint plot** and a **noiseComparisonPlot**. ### Fingerprint plot To visualize the results of `noiseAnalyze` we can use a "fingerprint plot" (named after the [deepTools implementation](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) by the same name [@Ramírez2016]). ```{r fingerprint} if (.Platform$OS.type != "windows") { plotFingerprint(rlr) } ``` This plot shows the proportion of signal contained in the corresponding proportion of coverage bins. In the plot above, we can observe that relatively few bins contain nearly all the signal. This is exactly what we would expect to see when our sample has good signal-to-noise ratio, a sign of good quality in R-loop mapping datasets. ### Noise comparison plot While a fingerprint plot is useful for getting a quick view of the dataset, it is also useful to compare the analyzed sample to publicly-available the datasets provided by RLBase. The `noiseComparisonPlot` enables this comparison. ```{r noiseComparisonPlot} if (.Platform$OS.type != "windows") { noiseComparisonPlot(rlr) } ``` This plot displays the average standardized signal across bins (noise index) from all publicly-available datasets from the same modality ("DRIP-Seq") and genome ("hg38") as the user-supplied sample. The data are also divided by prediction/label combination. Finally, the user-supplied sample is indicated as a diamond (\u25C7). This plot enables one to get a high-level overview of the noise indices of similar samples along with the rest of the relevant data in RLBase. ## Feature enrichment The feature enrichment test assesses the enrichment of genomic features within a supplied R-loop dataset. The function queries the RLHub annotation database to retrieve genomic features, and then it performs fisher's exact test and the relative distance test to assess feature enrichment [@Favorov2012]. For a description of all features, see [RLHub::annotations](https://bishop-laboratory.github.io/RLHub/reference/annotations.html){target="_blank"}. ```{r feature_enrich} # Perform test rlr <- featureEnrich( object = rlr, quiet = TRUE ) ``` The results: ```{r} # View Top Results annoResS96 <- rlresult(rlr, "featureEnrichment") annoResS96 %>% relocate(contains("fisher"), .after = type) %>% arrange(desc(stat_fisher_rl)) ``` From the results, we see that there is high enrichment within genic features, such as exons and introns. ### Visualization of enrichment results RLSeq provides a helper function, `plotEnrichment`, to facilitate the visualization of enrichment results. ```{r, warning=FALSE, figures-side, fig.show="hold", out.width="50%"} pltlst <- plotEnrichment(rlr) ``` This returns a list of plots named according to the corresponding annotation database. For example, Encode cis-regulatory elements (CREs): ```{r warning=FALSE} pltlst$Encode_CREs ```

**Note**: Caveat on data range

A limitation of this approach is that Fisher's exact test sometimes returns `Inf` or `-Inf` for the statistic (odds ratio). While these results are useful in demonstrating robust enrichment or non-enrichment, they are difficult to plot in a meaningful way. As a compromise, `plotEnrichment` sets a limited data range of -10 through 15. These values were chosen because they encompass every finite value that can be returned from the implementation of Fisher's test which RLSeq uses. In the above plots `Inf` results are shown on the y-axis at value `15` and, likewise, `-Inf` is shown at `-10`.

## Transcript Feature Overlap Analysis Transcript feature overlap analysis is a method for determining the proportion of peaks overlapping various transcript features (e.g., "Exon"). ```{r txfeatoverlap} rlr <- txFeatureOverlap( object = rlr, quiet = TRUE ) ``` The results (downsampled): ```{r} # View a random sample of results txResS96 <- rlresult(rlr, "txFeatureOverlap") txResS96 %>% slice_sample(n = 10) %>% mutate(name = gsub(name, pattern = ".+__", replacement = "")) ``` From the results, we can see the peaks from our peakset and the feature which they overlap with. **Of note**: this type of analysis uses a priority order such that when a peak overlaps two features, the one with the top priority is selected. The order is "TSS", "TTS", "5'UTR", "3'UTR", "Exon", "Intron", "Intergenic". ### Visualization of transcript overlap results RLSeq provides `plotTxFeatureOverlap` for plotting the results from the transcript feature overlap results. A powerful feature of RLSeq is the capability to visualize user-supplied data in the context of hundreds of publicly-available datasets. This capability is utilized in the visualization of transcript feature overlap results by plotting the results alongside the average for all high quality peaksets from the same R-loop mapping modality. **Of note**: "high quality" peaksets are those which have at least 5000 peaks, a positive label, and a positive prediction. ```{r} plotTxFeatureOverlap(rlr) ``` From the above result, we observe that R-loops in TC32 cells tend to localize more often in gene bodies compared to the average for high-quality DRIP-Seq datasets. ## Correlation analysis Correlation analysis finds inter-sample correlation coefficients of bin-level R-loop signal around gold-standard R-loop sites (sites profiled using ultra-long-read R-loop mapping -- "SMRF-Seq") [@Chédin2021]. This analysis helps to answer the question "how well does my data agree with previous results in public datasets?" For greater detail, please refer to the documentation for `corrAnalyze()`. **Note**: `corrAnalyze()` does not work on Windows OS systems. ```{r corrAnalyze} # corrAnalyze does not work on Windows OS if (.Platform$OS.type != "windows") { rlr <- corrAnalyze(rlr) } ``` The results of this analysis are visualized using `corrHeatmap`. ```{r} # corrAnalyze does not work on Windows OS if (.Platform$OS.type != "windows") { corrHeatmap(rlr) } ``` These results demonstrate that our sample correlates well with similar DRIP-Seq data sets. ## Gene Annotation Gene annotations are automatically downloaded using `AnnotationHub()` and then intersected with RLRanges. ```{r geneAnnotation} rlr <- geneAnnotation(rlr) ``` ## R-Loop Region Test R-loop regions are consensus R-loop-forming sites discovered from analyzing all high-confidence R-loop mapping samples in RLBase. The `rlRegionTest()` analyzes the enrichment of the ranges in our `RLRanges` object with these consensus R-loop sites, which, like correlation analysis, also helps answer the question "how well does my data agree with previous results?" For greater detail, please refer to the documentation for `rlRegionTest()`. ```{r rlRegionTest} rlr <- rlRegionTest(rlr) ``` The test results can be easily visualized in the following manner. ```{r plotRLRegionOverlap} plotRLRegionOverlap( object = rlr, # Arguments for VennDiagram::venn.diagram() fill = c("#9ad9ab", "#9aa0d9"), main.cex = 2, cat.pos = c(-40, 40), cat.dist=.05, margin = .05 ) ``` # Accessing RLBase data For convenience, we also provide pre-analyzed `RLRanges` objects for every sample in RLBase. To access them, you need only provide the ID of the sample which you want to obtain data from. These IDs, along with other metadata, are listed in `RLHub::rlbase_samples()`. ```{r RLRangesFromRLBase} rlr <- RLRangesFromRLBase(acc = "SRX1025890") rlr ``` # Session

Session info

```{r} sessionInfo() ```

# References