--- title: "Working with RangedSummarizedExperiments Big DNA Methylation Data" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{working_with_meth_rses} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 12, fig.height = 6.75, fig.show='hide' ) ``` ```{r setup, message = F} library(methodical) library(TumourMethData) library(BSgenome.Hsapiens.UCSC.hg19) ``` ## Introduction Most functions from `methodical` take as input RangedSummarizedExperiment objects with methylation data. If there are many samples, there can be many millions or, in the case of WGBS data, even billions of data points within a DNA methylation dataset. It can be unfeasible to load all this data into memory at once. This problem can be overcome by using DelayedArrays backed by HDF5 files, enabling data to be read into memory only as needed. Methodical provides a suite of functions for working with such DNA methylation RangedSummarizedExperiments, including functions to extract methylation values for sites overlapping genomic regions of interest `GRanges`, to liftover the methylation sites from one genome build to another and to mask methylation sites overlapping certain genomic regions, e.g. repeats. Here we demonstrate this different functionality using a dataset downloaded from TumourMethData. ```{r, eval=TRUE} # Download RangedSummarizedExperiment with methylation data for prostate metastases from TumourMethData mcrpc_wgbs_hg38_chr11 = TumourMethData::download_meth_dataset(dataset = "mcrpc_wgbs_hg38_chr11") ``` ## Extracting methylation data from methylation RangedSummarizedExperiments We'll first demonstrate how to extract methylation data for individual CpG sites in mcrpc_wgbs_hg38_chr11 overlapping a supplied GRanges object with `extractGRangesMethSiteValues`, using the gene body as an example. ```{r, eval=TRUE} # Create a GRanges with the hg38 genomic coordinates for the GSTP1, including # 2 KB upstream of its designated start in Ensembl gstp1_start_site_region <- GRanges("chr11:67581742-67586656:+") # Extract methylation values for CpG sites overlapping GSTP1 gene body gstp1_cpg_methylation <- extractGRangesMethSiteValues( meth_rse = mcrpc_wgbs_hg38_chr11, genomic_regions = gstp1_start_site_region) # View the first few rows and columns of the result. # extractGRangesMethSiteValues returns a row for each methylation site and a # separate column for each sample where row names give the coordinates of the # methylation sites in character format. gstp1_cpg_methylation[1:6, 1:6] ``` Next we'll show how to summarize the methylation of CpGs over regions of interest with `summarizeRegionMethylation`, using CpG islands as an example. `summarizeRegionMethylation` processes the the supplied genomic regions in chunks so that the methylation data for CpG sites overlapping the regions of interest is not all loaded into memory at once. The parameter `max_sites_per_chunk` controls the approximate number of CpG sites maximally read into memory at once and defaults to `floor(62500000/ncol(meth_rse)`. Several chunks can be processed in parallel using BiocParallel via the `BPPARAM` argument which takes a BiocParallelParam object. The number of workers indicated by BiocParallelParam determines the number of chunks that will be processed in parallel. Some experimentation may be needed to find the optimal choices for `max_sites_per_chunk` and the number of workers in terms of speed and memory usage. ```{r, eval=TRUE} # Load CpG islands annotation for hg38 cpg_island_annotation <- annotatr::build_annotations(genome = "hg38", annotation = "hg38_cpgs") names(cpg_island_annotation) <- cpg_island_annotation$id # Filter for annotation for chr11 cpg_island_annotation = cpg_island_annotation[seqnames(cpg_island_annotation) == "chr11"] # Convert into a GRangesList with separate GRanges for islands, shores, shelves and inter island regions cpg_island_annotation <- GRangesList(split(cpg_island_annotation, cpg_island_annotation$type)) # Create a BPPARAM class BPPARAM = BiocParallel::bpparam() # Summarize methylation levels for CpG islands cpg_island_methylation <- summarizeRegionMethylation( meth_rse = mcrpc_wgbs_hg38_chr11, genomic_regions = cpg_island_annotation$hg38_cpg_islands, BPPARAM = BPPARAM, summary_function = colMeans) # Print a few rows for the first few samples of the result cpg_island_methylation[1000:1006, 1:6] ``` ## Plotting methylation values in genomic regions and annotating genomic regions in plots We can plot methylation values extracted from a genomic region for a single sample using the `plotMethylationValues()` function. We'll demonstrate this using the values we extracted in the region surrounding the 5' end of GSTP1 for the DTB_003 prostate metastasis sample. We indicate the sample we want to plot with the `sample_name` parameter. ```{r, eval=TRUE} # Plot the methylation values along the GSTP1 gene body for one prostate metastasis sample. gstp1_methylation_plot = plotMethylationValues(gstp1_cpg_methylation, sample_name = "DTB_003") print(gstp1_methylation_plot) ``` Additionally, we can also annotate our plots using the `annotatePlot()` function. It uses a GRangeList provided with the `annotation_grl` parameter to create an annotation plot showing the regions in the GRangesList which overlap the genomic region displayed in the main plot. Each of the GRanges objects making up the GRangesList is given a different colour in the annotation plot and the names of these component GRanges are indicated. We can control the colours used with the `grl_colours` parameter if we don't want to use the default colours. If we provide a GRanges object with the location of a transcription start site to the `reference_tss`, parameter, we can show the distance of methylation sites upstream and downstream of this. By default, the main plot and annotation plot are combined into a single plot and returned. The `annotation_plot_proportion` parameter sets the proportion of the total plot height dedicated to the annotation plot. We can instead return the annotation plot by itself by setting the `annotation_plot_only` parameter to TRUE. We'll annotate the location of CpG islands, CpG shores, CpG shores and inter CpG island regions for `gstp1_methylation_plot` using the `cpg_island_annotation` GRangesList we created. ```{r, eval=TRUE} # Annotate gstp1_methylation_plot with cpg_island_annotation annotatePlot(meth_site_plot = gstp1_methylation_plot, annotation_grl = cpg_island_annotation, annotation_plot_proportion = 0.3, grl_colours = c("#DEEBF7", "#9ECAE1", "#4292C6", "#08519C")) # Create same plot, except showing the distance to the GSTP1 start site on the x-axis annotatePlot(meth_site_plot = gstp1_methylation_plot, annotation_grl = cpg_island_annotation, reference_tss = GRanges("chr11:67583742"),annotation_plot_proportion = 0.3, grl_colours = c("#DEEBF7", "#9ECAE1", "#4292C6", "#08519C")) # Return the annotation plot by itself annotatePlot(meth_site_plot = gstp1_methylation_plot, annotation_grl = cpg_island_annotation, annotation_plot_proportion = 0.3, grl_colours = c("#DEEBF7", "#9ECAE1", "#4292C6", "#08519C"), annotation_plot_only = TRUE) ``` ## Masking regions in a methylation RangedSummarizedExperiments We may want to mask certain regions in a methylation RangedSummarizedExperiment. With the `maskRangesInRSE()` function, we can mask regions across all samples (which could be useful for repetitive regions or polymorphic regions) or on a sample by sample basis (which could be appropriate for different regions that are known to be mutated in different tumour samples). All methylation values within the masked regions will be set to `NA`. The `mask_ranges` argument takes either a GRanges or GRangesList with the regions that should be masked. If a GRanges object is provided, all methylation sites overlapping these regions will be masked across all samples. If a GRangesList is provided, the names of the component GRanges should match sample names in the RangedSummarizedExperiment and in each sample, the methylation sites overlapping the regions in its corresponding GRangesList entry will be masked. We will demonstrate how to mask LINE repetitive regions across all samples. ```{r, eval=TRUE} # Download repetitive sequences from AnnotationHub and filter for LINE elements repeat_annotation_hg38 <- AnnotationHub::AnnotationHub()[["AH99003"]] line_elements_hg38 <- repeat_annotation_hg38[repeat_annotation_hg38$repClass == "SINE"] # Mask LINE elements in mcrpc_wgbs_hg38_chr11 mcrpc_wgbs_hg38_chr11_lines_masked <- maskRangesInRSE(rse = mcrpc_wgbs_hg38_chr11, mask_ranges = line_elements_hg38) # Extract the methylation values for one of the LINE elements in the # unmasked and masked RSE extractGRangesMethSiteValues(meth_rse = mcrpc_wgbs_hg38_chr11, genomic_regions = line_elements_hg38[1000])[, 1:6] extractGRangesMethSiteValues(meth_rse = mcrpc_wgbs_hg38_chr11_lines_masked, genomic_regions = line_elements_hg38[1000])[, 1:6] ``` ## Lifting over methylation sites in a RangedSummarizedExperiment from one genome build to another Sometimes we may want to work with a different genome build to that used to construct a methylation RangedSummarizedExperiment. We can easily liftover the genomic coordinates of the methylation sites using `liftoverMethRSE()` function. To do this, we need a liftover chain file for the appropriate source and target genome builds and which we provide to the `chain` argument. All methylation sites which cannot be mapped to the target genome build and those which result in many-to-one mappings are removed. We also need to decide whether we want to remove methylation sites in the source genome build which map to multiple sites in the target genome build. We do this using the `remove_one_to_many_mapping` argument, which has a default value of TRUE. We can also remove any regions which do not map to desired regions in the target genome, for example CpG sites, by providing a GRanges object to the argument `permitted_target_regions`. We will demonstrate how to liftover mcrpc_wgbs_hg38_chr11 to hg19. ```{r, eval=TRUE} # Create a DNAStringSet for chromosome11 chr11_dss = setNames(DNAStringSet(BSgenome.Hsapiens.UCSC.hg19[["chr11"]]), "chr11") # Get CpG sites for hg19 for chromsome 11 hg19_cpgs <- methodical::extractMethSitesFromGenome(genome = chr11_dss) # Download hg38 to hg19 liftover chain from AnnotationHub hg38tohg19Chain <- AnnotationHub::AnnotationHub()[["AH14108"]] # Liftover mcrpc_wgbs_hg38_chr11 to mcrpc_wgbs_hg19_chr11 mcrpc_wgbs_hg19_chr11 <- liftoverMethRSE(meth_rse = mcrpc_wgbs_hg38_chr11, chain = hg38tohg19Chain, remove_one_to_many_mapping = TRUE, permitted_target_regions = hg19_cpgs) # Compare the dimensions of mcrpc_wgbs_hg38_chr11 and mcrpc_wgbs_hg19_chr11. # 1,423,050 methylation sites could not be lifted over from hg38 to hg19. dim(mcrpc_wgbs_hg38_chr11) dim(mcrpc_wgbs_hg19_chr11) # chr1:921635 should be lifted over to chr1:857015 so confirm that they have # the same methylation values in hg38 and hg19 rtracklayer::liftOver(GRanges("chr11:67581759"), hg38tohg19Chain) extractGRangesMethSiteValues(mcrpc_wgbs_hg38_chr11, GRanges("chr11:67581759"))[, 1:8] extractGRangesMethSiteValues(mcrpc_wgbs_hg19_chr11, GRanges("chr11:67349230"))[, 1:8] ``` ## SessionInfo ```{r} sessionInfo() ```