--- title: "Calculating Methylation Transcription Correlations" output: BiocStyle::html_document: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{calculating_methylation_transcription_correlations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, fig.width = 12, fig.height = 6.75 ) ``` ```{r setup, message = F} library(methodical) library(DESeq2) library(TumourMethData) ``` ## Introduction `methodical` facilitates the rapid analysis of the association between DNA methylation and expression of associated transcripts. It can calculate correlation values between individual methylation sites (e.g. CpG sites) or the methylation of wider genomic regions (e.g. expected promoter regions, gene bodies) and the expression of associated transcripts. We will demonstrate both cases in this vignette. ```{r, eval=FALSE} # Installing Methodical if (!require("BiocManager")) install.packages("BiocManager") BiocManager::install("methodical") ``` ## Import of RangedSummarizedExperiment for tumour WGBS methylation data We now first download a RangedSummarizedExperiment with WGBS data for prostate tumour metastases and associated transcript counts from TumourMethData. ```{r} # Download mcrpc_wgbs_hg38 from TumourMethDatasets. mcrpc_wgbs_hg38_chr11 <- TumourMethData::download_meth_dataset(dataset = "mcrpc_wgbs_hg38_chr11") # Download transcript counts mcrpc_transcript_counts <- TumourMethData::download_rnaseq_dataset(dataset = "mcrpc_wgbs_hg38_chr11") ``` ## Preparation of TSS annotation We now load annotation for our transcripts of interest, in this case all protein-coding transcripts annotated by Gencode. ```{r} # Download Gencode annotation library(AnnotationHub) transcript_annotation <- AnnotationHub()[["AH75119"]] # Import the Gencode annotation and subset for transcript annotation for non-mitochondrial protein-coding genes transcript_annotation <- transcript_annotation[transcript_annotation$type == "transcript"] transcript_annotation <- transcript_annotation[transcript_annotation$transcript_type == "protein_coding"] # Remove transcript version from ID transcript_annotation$ID <- gsub("\\.[0-9]*", "", transcript_annotation$ID) # Filter for transcripts on chromosome 11 transcript_annotation_chr11 <- transcript_annotation[seqnames(transcript_annotation) == "chr11"] # Filter for transcripts in the counts table transcript_annotation_chr11 <- transcript_annotation_chr11[transcript_annotation_chr11$ID %in% row.names(mcrpc_transcript_counts)] # Set transcript ID as names of ranges names(transcript_annotation_chr11) <- transcript_annotation_chr11$ID # Get the TSS for each transcript set ID as names transcript_tss_chr11 <- resize(transcript_annotation_chr11, 1, fix = "start") rm(transcript_annotation) ``` ## Preparation of count data Next we will normalize counts data for these transcripts from prostate cancer metastases samples using DESeq2. ```{r} # Subset mcrpc_transcript_counts for protein-coding transcripts mcrpc_transcript_counts <- mcrpc_transcript_counts[transcript_tss_chr11$ID, ] # Create a DESeqDataSet from Kallisto counts. mcrpc_transcript_counts_dds <- DESeqDataSetFromMatrix(countData = mcrpc_transcript_counts, colData = data.frame(sample = names(mcrpc_transcript_counts)), design = ~ 1) mcrpc_transcript_counts_dds <- estimateSizeFactors(mcrpc_transcript_counts_dds) mcrpc_transcript_counts_normalized <- data.frame(counts(mcrpc_transcript_counts_dds, normalized = TRUE)) ``` ## Calculating CpG methylation-transcript expression correlation We'll first demonstrate calculating the correlation values between all transcripts on chromosome 11 and the methylation of all CpG sites within 5 KB of their TSS and also the significance of these correlation values. We do this with the `calculateMethSiteTranscriptCors` function, which takes methylation RangedSummarizedExperiment and a table with transcript counts as input. It also takes a GRanges object with the TSS of interest. The arguments `expand_upstream` and `expand_downstream` define the regions around the TSS where we want to examine CpG sites. We set these to 5KB upstream and downstream. We can also provide a subset of samples that we want to use to calculate the correlation values with the `samples_subset` parameter. The default behaviour is to use try to use all samples, but we use `common_mcprc_samples` since not all samples in `mcrpc_wgbs_hg38_chr11` are also present in `mcrpc_transcript_counts_normalized`. Finally, we can control the memory usage and parallelization with the `max_sites_per_chunk` and `BPPARAM` parameters. `calculateMethSiteTranscriptCors` aims to keep the memory footprint low by splitting the provided `tss_gr`into chunks of closely-situated TSS and processing these chunks one at a time. The chunks are limited to that they contain an approximate maximum number of CpGs within `expand_upstream` and `expand_downstream` of the constituent TSS. `max_sites_per_chunk` controls this upper limit of CpG that are read into memory at once and has a default value of `floor(62500000/ncol(meth_rse))` using approximately 500 MB of RAM. `BPPARAM` controls the number of TSS within a chunk processed in parallel. As the TSS belong to the same chunk, several cores can be used simultaneously without drastically increasing RAM consumption. The following example should just take a few minutes. ```{r} # Find samples with both WGBS and RNA-seq count data common_mcprc_samples <- intersect(names(mcrpc_transcript_counts), colnames(mcrpc_wgbs_hg38_chr11)) # Create a BPPARAM class BPPARAM = BiocParallel::bpparam() # Calculate CpG methylation-transcription correlations for all TSS on chromosome 11 system.time({transcript_cpg_meth_cors_mcrpc_samples_5kb <- calculateMethSiteTranscriptCors( meth_rse = mcrpc_wgbs_hg38_chr11, transcript_expression_table = mcrpc_transcript_counts_normalized, samples_subset = common_mcprc_samples, tss_gr = transcript_tss_chr11, cor_method = "pearson", max_sites_per_chunk = NULL, expand_upstream = 5000, expand_downstream = 5000, BPPARAM = BPPARAM)}) ``` As an example, we'll examine the CpG methylation transcription correlations for a TSS of GSTP1, a gene known to be important to prostate cancer. We'll use the TSS for the ENST00000398606 transcript, which is the ENSEMBL canonical and MANE select transcript for GSTP1. We see that we have five columns in the results table: the name of the CpG site, the name of the associated transcript, the correlation value, the p-value of the correlation and the distance of the CpG site to the TSS of the transcript. The location of the TSS associated with the results is stored as a GRanges as an attribute called `tss_range`, making it easy to retrieve this information. ```{r} # Extract correlation results for ENST00000398606 transcript gstp1_cpg_meth_transcript_cors <- transcript_cpg_meth_cors_mcrpc_samples_5kb[["ENST00000398606"]] # Examine first few rows of the results head(gstp1_cpg_meth_transcript_cors) # Extract TSS for the transcript attributes(gstp1_cpg_meth_transcript_cors)$tss_range ``` ## Plotting correlation values Plotting the methylation-transcription correlation values around a TSS enable us to visualize how the relationship between DNA methylation and transcription changes around the TSS. We can use the `plotMethSiteCorCoefs` function to plot values for methylation sites, such as their correlation values or methylation levels. It takes a data.frame as input which should have a column called "meth_site" which gives the location of CpG sites as a character string which can be coerced to a GRanges (e.g. "seqname:start"). We also provide the name a numeric column which contains the values we want to plot, which is "cor" here. Values are displayed on the y-axis and the genomic location on the x-axis. This location can be either the actual genomic location on the relevant sequence or else the distance relative to the TSS, depending on whether `reference_tss` is set to TRUE or FALSE. ```{r, fig.show='hide'} # Plot methylation-transcription correlation values for GSTP1 showing # chromosome 11 position on the x-axis meth_cor_plot_absolute_distance <- plotMethSiteCorCoefs( meth_site_cor_values = gstp1_cpg_meth_transcript_cors, reference_tss = FALSE, ylabel = "Correlation Value", title = "Correlation Between GSTP1 Transcription and CpG Methylation") print(meth_cor_plot_absolute_distance) # Plot methylation-transcription correlation values for GSTP1 showing # distance to the GSTP1 on the x-axis meth_cor_plot_relative_distance <- plotMethSiteCorCoefs( meth_site_cor_values = gstp1_cpg_meth_transcript_cors, reference_tss = TRUE, ylabel = "Correlation Value", title = "Correlation Between GSTP1 Transcription and CpG Methylation") print(meth_cor_plot_relative_distance) ``` We see that there is a group of negatively correlated CpGs immediately surrounding the TSS and positively correlated CpGs further away, especially between 2,500 and 1,250 base pairs upstream. The returned plots are ggplot objects and can be manipulated in any way that a regular ggplot object can. ```{r, fig.show='hide'} # Add a dashed line to meth_cor_plot_relative_distance where the correlation is 0 meth_cor_plot_relative_distance + ggplot2::geom_hline(yintercept = 0, linetype = "dashed") ``` ## Calculating region methylation-transcript correlations We'll now demonstrate calculating the correlation values between transcript expression and methylation of wider regions associated them instead of just single CpG sites, e.g. promoters, gene bodies or enhancers. We use the `calculateRegionMethylationTranscriptCors` function for this, which takes a RangedSummarizedExperiment with methylation values and a table with transcript expression values, like `calculateMethSiteTranscriptCors`. However, `calculateRegionMethylationTranscriptCors` takes a GRanges object with wider regions which are associated with the transcripts. More than one region may be associated with a given transcript, for example we could analyse the correlation between transcript expression and methylation of its exons separately for each exon. Here we'll investigate the relationship between gene body methylation and transcription for all transcripts on chromosome 11. ```{r, fig.show='hide'} # Calculate gene body methylation-transcription correlations for all TSS on chromosome 11 system.time({transcript_body_meth_cors_mcrpc_samples <- calculateRegionMethylationTranscriptCors( meth_rse = mcrpc_wgbs_hg38_chr11, transcript_expression_table = mcrpc_transcript_counts_normalized, samples_subset = common_mcprc_samples, genomic_regions = transcript_annotation_chr11, genomic_region_transcripts = transcript_annotation_chr11$ID, region_methylation_summary_function = colMeans, cor_method = "pearson", BPPARAM = BPPARAM)}) # Show the top of the results table head(transcript_body_meth_cors_mcrpc_samples) # We'll filter for significant correlation values and plot their distribution transcript_body_meth_cors_significant <- dplyr::filter(transcript_body_meth_cors_mcrpc_samples, q_val < 0.05) ggplot(transcript_body_meth_cors_significant, aes(x = cor)) + geom_histogram() + theme_bw() + scale_y_continuous(expand = expansion(mult = c(0, 0.2), add = 0)) + labs(x = "Correlation Values", y = "Number of Significant Correlation Values") ``` ## Conclusion Within minutes, we calculated all the correlation values for the methylation of CpG sites within 5KB of TSS on chromosome 11 and expression of the associated transcripts as well as the p-values for the correlations. We then analysed the relationship between mean gene body methylation and transcript expression, first calculating the mean methylation of CpGs overlapping gene bodies on chromosome 11 and then computing the correlation values, again in short time using just a few cores. Thus, in under an hour we could easily analyse the correlations for all TSS in the genome on a desktop computer or even a laptop using just a few cores. ## SessionInfo ```{r} sessionInfo() ```