--- title: "sSNAPPY: Singel Sample directioNAl Pathway Perturbation analYsis" author: - name: Wenjun Nora Liu affiliation: Dame Roma Mitchell Cancer Research Laboratories, Adelaide Medical School, University of Adelaide email: wenjun.liu@adelaide.edu.au - name: Stephen Pederson email: stephen.pederson.au@gmail.com pacakge: sSNAPPY output: BiocStyle::html_document: toc: yes vignette: > %\VignetteIndexEntry{Single Sample Directional Pathway Perturbation Analysis} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: inline --- ```{r, echo=FALSE} knitr::opts_chunk$set(message = FALSE, warning = FALSE, crop = NULL) ``` # Introduction This vignette demonstrates how to use the package `sSNAPPY` to compute directional single sample pathway perturbation scores by incorporating pathway topologies and changes in gene expression, utilizing sample permutation to test the significance of individual scores and comparing average pathway activities across treatments. The package also provides many powerful and easy-to-use visualisation functions that helps visualising significantly perturbed pathways as networks, detecting community structures in pathway networks, and revealing pathway genes' involvement in the perturbation. # To get ready ## Installation The package `sSNAPPY` can be installed using the package `BiocManager` ```{r install, eval = FALSE} if (!"BiocManager" %in% rownames(installed.packages())) install.packages("BiocManager") # Other packages required in this vignette pkg <- c("tidyverse", "magrittr", "ggplot2", "cowplot", "DT") BiocManager::install(pkg) BiocManager::install("sSNAPPY") install.packages("htmltools") ``` ## Load packages ```{r setup, results="hide", warning=FALSE} library(sSNAPPY) library(tidyverse) library(magrittr) library(ggplot2) library(cowplot) library(DT) library(htmltools) library(patchwork) ``` ## Load data The example dataset used for this tutorial can be loaded with `data()` as shown below. It's a subset of data retrieved from [Singhal H et al. 2016](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4928895/), where ER-positive primary breast cancer tumour tissues collected from 12 patients were split into tissue fragments of equal sizes for different treatments. For this tutorial, we are only looking at the RNA-seq data from samples that were treated with vehicle, R5020(progesterone) OR E2(estrogen) + R5020 for 48 hrs. Tumour specimens were collected from 5 patients, giving rise to a total of 15 samples. To cut down the computation time, only half the expressed genes were randomly sampled to derive the example logCPM matrix. A more detailed description of the dataset can be assessed through the help page (`?logCPM_example` and `?metadata_example`). ```{r data} data(logCPM_example) data(metadata_example) # check if samples included in the logCPM matrix and metadata dataframe are identical setequal(colnames(logCPM_example), metadata_example$sample) # View sample metadata datatable(metadata_example, filter = "top") ``` # `sSNAPPY` workflow ## Compute weighted single-sample logFCs (ssLogFCs) It is expected that the logCPM matrix will be filtered to remove undetectable genes and normalised to correct for library sizes or other systematic artefacts, such as gene length or GC contents, prior to applying the `sSNAPPY` workflow. Filtration and normalisation have been performed on the example dataset. Before single-sample logFC (ssFC) can be computed, row names of the logCPM matrix need to be converted to `entrez ID`. This is because all the pathway topology information retrieved will be in `entrez ID`. The conversion can be achieved through bioconductor packages `AnnotationHub` and `ensembldb`. ```{r logCPM_example, eval=FALSE} head(logCPM_example) ``` To compute the ssFC, samples must be in matching pairs. In our example data, treated samples were matched to the corresponding control samples derived from the same patients. Therefore the `groupBy` parameter of the `weight_ss_fc()` functions will be set to be "patient". `weight_ss_fc()` requires both the logCPM matrix and sample metadata as input. The column names of the logCPM matrix should be sample names, matching a column in the metadata. Name of the sample name column will be provided as the `sampleColumn` parameter. The function also requires the name of the metadata column that contains treatment information to be specified. The column with treatment information must be a factor with the control treatment set to be the reference level. ```{r ssFC} # Check that the baseline level of the treatment column is the control levels(metadata_example$treatment)[1] #compute weighted single sample logFCs weightedFC <- weight_ss_fc(logCPM_example, metadata = metadata_example, groupBy = "patient", sampleColumn = "sample", treatColumn = "treatment") ``` The `weight_ss_fc()` function firstly computes raw ssFC for each gene by subtracting logCPM values of the control sample from the logCPM values of treated samples within each patient. It has been demonstrated previously that in RNA-seq data, lowly expressed genes turn to have a larger variance (Law et al. 2014), which is also demonstrated by the plots below. To reduce the impact of this artefact, `weight_ss_fc` also weights each ssFCs by estimating the relationship between the gene-level variance and mean logCPM, and defining the gene-wise weight to be the inverse of the predicted variance of that gene's mean logCPM value. ```{r lowess, fig.width=6,fig.height=5, fig.cap="*Gene-wise standard deviations are plotted against the mean logCPM values with mean-variance trend modelled by a loess fit. Genes with low expression values tend to have a larger variance.*"} logCPM_example %>% as.data.frame() %>% mutate( sd = apply(., 1, sd), mean = apply(., 1, mean) ) %>% ggplot( aes(mean, sd) ) + geom_point() + geom_smooth( method = "loess") + labs( x = expression(Gene-wise~average~expression~(bar(mu[g]))), y = expression(Gene-wise~standard~deviation~(sigma[g])) ) + theme_bw() + theme( panel.grid = element_blank(), axis.title = element_text(size = 14) ) ``` The output of the `weight_ss_fc()` function is a list with two element, where one is the weighted ssFC matrix (`weighted_logFC`) and the other is a vector of gene-wise weights (`weight`). ## Retrieve pathway topologies in the required format *sSNAPPY* adopts the pathway perturbation scoring algorithm proposed in `r Biocpkg("SPIA")`, which makes use of gene-set topologies and gene-gene interaction to propagate pathway genes' logFCs down the topologies to compute pathway perturbation scores, where signs of scores reflect pathways' potential directions of changes. Therefore, pathway topology information needs to be firstly retrieved from a chosen database and converted to weight adjacency matrices, the format required to apply the scoring algorithm. This step is achieved through a chain of functions that are part of the `r Biocpkg("grapghite")`, which have been nested into one simple function in *sSNAPPY* called `retrieve_topology()`. The `retrieve_topology` function now supports various species and databases. Databases that are currently supported for human are the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database(Ogata et al. 1999), the Reactome(Gillespie et al. 2021) database, and WikiPathways(Martens et al. 2021). The retrieved topology information will be a list where each element corresponds a pathway. It's recommended to save the list as a file so this step only needs to be performed once for each database. This vignette uses *KEGG* pathways in human as an example. ```{r retrieve_topology} gsTopology <- retrieve_topology(database = "kegg", species = "hsapiens") head(names(gsTopology)) ``` If only selected biological processes are of interest to your research, it's possible to only retrieve the topologies of those pathways by specifying keywords. For example, to retrieve all metabolism-related *KEGG* pathways: ```{r gsTopology_sub} gsTopology_sub <- retrieve_topology( database = "kegg", species = "hsapiens", keyword = "metabolism") head(names(gsTopology_sub)) ``` It is also possible to provide multiple databases' names and/or multiple keywords for a focused analysis. ```{r gsTopology_mult, echo=FALSE} gsTopology_mult <- retrieve_topology( database = c("kegg", "reactome"), species = "hsapiens", keyword = c("metabolism", "estrogen")) names(gsTopology_mult) ``` ## Score single sample pathway perturbation Once the expression matrix, sample metadata and pathway topologies are all ready, gene-wise single-sample perturbation scores can be computed within each sample: ```{r} genePertScore <- raw_gene_pert(weightedFC$weighted_logFC, gsTopology) ``` and summed to derive pathway perturbation scores for each pathway in each treated samples. ```{r} pathwayPertScore <- pathway_pert(genePertScore, weightedFC$weighted_logFC) head(pathwayPertScore) ``` ## Generate null distributions of perturbation scores To derive the empirical p-values for each single-sample perturbation scores and normalize the raw scores for comparing overall treatment effects, null distributions of scores for each pathway are generated through a sample-label permutation approach. In the permutation, all sample labels will be randomly shuffled and put into permuted pairs. Permuted single-sample logFCs will be calculated for each permuted pair of samples, while the reminding pathway perturbation scoring algorithm remains unchanged. Unless otherwise specified through the `NB` parameter, all possible permuted pairs will be used to construct the null distributions of perturbation scores. The output of the `generate_permuted_scores()` function is a list where each element is a vector of permuted perturbation scores for a specific pathway. ```{r permutedScore} set.seed(123) permutedScore <- generate_permuted_scores( expreMatrix = logCPM_example, gsTopology = gsTopology, weight = weightedFC$weight ) ``` ## Test significant perturbation on ### single-sample level After the empirical null distributions are generated, the median and mad of each distribution will be calculated for each pathway to convert the test single-sample perturbation scores derived from the `compute_perturbation_score()` function to robust z-scores: $(Score - Median)/MAD$. Two-sided p-values associated with each perturbation scores are also computed by the proportion of permuted scores that are as or more extreme than the test perturbation score within each pathway. Raw p-values will be corrected for multiple-testing using a user-defined approach. The default is `fdr`. In a data with N samples, the total number of possible permuted pairs of samples is $N \times (N-1)$. When the sample size is small, small p-values cannot be accurately estimated so the p-values returned by the `normalise_by_permu()` function should be interpreted with caution. The `normalise_by_permu()` function requires the test perturbation scores and permuted perturbation scores as input. Users can choose to sort the output by p-values, gene-set names or sample names. ```{r normalisedScores} normalisedScores <- normalise_by_permu(permutedScore, pathwayPertScore, sortBy = "pvalue") ``` In this example dataset, none of the pathway was considered to be significantly perturbed within individual samples using a FDR cut-off of 0.05. ```{r DT_indi} normalisedScores %>% dplyr::filter(adjPvalue < 0.05) ``` ### treatment-level In addition to testing pathway perturbations at single-sample level, normalised perturbation scores can also be used to model mean treatment effects within a group, such as within each treatment group. An advantage of this method is that it has a high level of flexibility that allows us to incorporate confounding factors as co-factors or co-variates to offset their effects. In the example data-set, the key question is how tumour tissues responded to the activation of PR alone or both ER and AR. We can test for the treatment-level pathway perturbation using a linear regression model of the form `~ 0 + treatment`. ```{r fit} fit <- normalisedScores %>% left_join(metadata_example) %>% split(f = .$gs_name) %>% lapply(function(x)lm(robustZ ~ 0 + treatment, data = x)) %>% lapply(summary) treat_sig <- lapply( names(fit), function(x){ fit[[x]]$coefficients %>% as.data.frame() %>% .[seq_len(2),] %>% dplyr::select(Estimate, pvalue = `Pr(>|t|)` ) %>% rownames_to_column("Treatment") %>% mutate( gs_name = x, FDR = p.adjust(pvalue, "fdr"), Treatment = str_remove_all(Treatment, "treatment") ) }) %>% bind_rows() ``` Results from the linear modelling revealed pathways that were on average perturbed due to each treatment: ```{r treat_sig_DT} treat_sig %>% dplyr::filter(FDR < 0.05) %>% mutate_at(vars(c("Treatment", "gs_name")), as.factor) %>% mutate_if(is.numeric, sprintf, fmt = '%#.4f') %>% mutate(Direction = ifelse(Estimate < 0, "Inhibited", "Activation")) %>% dplyr::select( Treatment, `Perturbation Score` = Estimate, Direction, `Gene-set name` = gs_name, `P-value` = pvalue, FDR ) %>% datatable( filter = "top", options = list( columnDefs = list(list(targets = "Direction", visible = FALSE)) ), caption = htmltools::tags$caption( htmltools::em( "Pathways that were significant perturbed within each treatment group.") ) ) %>% formatStyle( 'Perturbation Score', 'Direction', color = styleEqual(c("Inhibited", "Activation"), c("blue", "red")) ) ``` #### Visualise genes' contributions to pathway perturbation If there's a specific pathway that we would like to dig deeper into and explore which pathway genes potentially played a key role in its perturbation, for example, activation of the "Proteoglycans in cancer" in progesterone-treated samples, we can plot the gene-level perturbation scores of genes that are constantly highly perturbed or highly variable in that pathway as a heatmap using the `plot_gene_contribution()` function. From the heatmap below that we can see that the activation of this pathway was consistently driven by two genes: ENTREZID:1277 and ENTREZID:3688 in all R5020-treated samples while the other genes show some inter-patient heterogeneity. ```{r fig.height= 7, fig.width=8, fig.cap="*Gene-level perturbation scores of genes with top 10 highest absolute mean gene-wise perturbation scores in the kegg.Proteoglycans in cancer gene-set. Only samples treated with R52020 are included.*" } plot_gene_contribution( genePertMatr = genePertScore$`kegg.Proteoglycans in cancer` %>% .[, str_detect(colnames(.), "E2", negate = TRUE)], filterBy = "mean", topGene = 10, color = rev(colorspace::divergex_hcl(100, palette = "RdBu")), breaks = seq(-0.001, 0.001, length.out = 100) ) ``` By default, genes' entrez IDs are used and plotted as row names, which may not be very informative. So the row names could be overwritten by providing a `data.frame` mapping entrez IDs to other identifiers through the `mapRownameTo` parameter. Mapping between different gene identifiers could be achieved through the `mapIDs()` function from the Bioconductor package [`AnnotationDbi`](https://bioconductor.org/packages/release/bioc/html/AnnotationDbi.html). But to reduce the compiling time of this vignette, mappings between entrez IDs and gene names as available in Ensembl Release 101 have been provided as a `data.frame` called `entrez2name`. Note that if the mapping information was provided and the mapping was only successful for some genes but not the others, only genes that have been mapped successfully will be plotted. Since `plot_gene_contribution()` is built on `pheatmap`, which provides a practical column annotation feature, the `plot_gene_contribution()` function also allow a `data.frame` storing annotation information to be provided to utilise that feature. We can annotate each column (ie. each sample) by the pathway-level perturbation score or any other sample metadata, such as progesterone receptor (PR) status. In this example, we first create a `data.frame` storing the pathway-level perturbation scores of the "Proteoglycans in cancer" pathway in each sample and their PR status. ```{r annotation_df} annotation_df <- normalisedScores %>% dplyr::filter(gs_name == "kegg.Proteoglycans in cancer") %>% mutate( `Z Range` = cut( robustZ, breaks = seq(-2, 2, length.out = 6), include.lowest = TRUE ) ) %>% dplyr::select(sample, `Z Range`) %>% left_join( ., metadata_example %>% dplyr::select(sample, `PR Status` = PR), unmatched = "drop" ) ``` The annotation `data.frame` was provided to the `plot_gene_contribution()` function through the `annotation_df` parameter. Colors of the annotation could be customised through `pheatmap::pheatmap()`'s `annotation_colors` parameter. From the heatmap below, we can see that gene EIF3B and MTOR played a strong role in promoting the activation of this pathway in the two PR-negative samples, but those two genes were not as highly involved in the PR-positive samples. The genes consistently promoting the activation of this pathway among all R5020-treated samples are MMP2, COL1A1 and ITGB1. ```{r fig.height= 7, fig.width=10, fig.cap="*Gene-level perturbation scores of genes with top 10 absolute mean gene-wise perturbation scores in the Proteoglycans in cancer gene-set among R502-treated samples.*"} load(system.file("extdata", "entrez2name.rda", package = "sSNAPPY")) z_levels <- levels(annotation_df$`Z Range`) sample_order <- metadata_example %>% dplyr::filter(treatment == "R5020") %>% .[order(.$treatment),] %>% pull(sample) plot_gene_contribution( genePertMatr = genePertScore$`kegg.Proteoglycans in cancer` %>% .[, match(sample_order, colnames(.))], annotation_df = annotation_df, filterBy = "mean", topGene = 10, mapEntrezID = entrez2name, cluster_cols = FALSE, color = rev(colorspace::divergex_hcl(100, palette = "RdBu")), breaks = seq(-0.001, 0.001, length.out = 100), annotation_colors = list( `PR Status` = c(Positive = "darkgreen", Negative = "orange"), `Z Range` = setNames( colorRampPalette(c("navyblue", "white", "darkred"))(length(z_levels)), z_levels )) ) ``` #### Visualise overlap between gene-sets as networks Visualising significantly perturbed biological pathways as a network, where edges between gene-sets reflect how much overlap they share, is an useful approach for demonstrating the connections between biological processes. The `plot_gs_network()` function in this package allows an easy construction of such network by taking the `normalise_by_permu()` function's output as direct input and allowing flexible customisation. Nodes in the network plots could be colored by the predicted direction of perturbation (i.e. sign of robust z-score) or p-values. Results of group-level perturbation can also be visualised using the `plot_gs_network()` function. The function allows you to customize the layout, colour, edge transparency and other aesthetics of the graph. More information can be found on the help page (`?plot_gs_network`). The output of the graph is a `ggplot` object and the theme of it can be changed just as any other `ggplot` figures. Taking the pathways that were among the top 20 ranked in the R5020 group as an example: ```{r} pathway2plot <- treat_sig %>% dplyr::filter(Treatment == "R5020") %>% arrange(FDR) %>% .[1:20,] %>% mutate( status = ifelse(Estimate > 0, "Activated", "Inhibited"), status = ifelse(FDR < 0.05, status, "Unchanged")) ``` ```{r fig.width=12, fig.height=5, fig.cap="*Networks of pathways that were perturbed by the R5020 treatment, where colors of nodes reflect (A) pathways' predicted directions of changes. and (B) -log10(p-values). In panel A, pathways that were significantly perturbed were predicted to be either inhibited or activated while pathways that did not pass the significance threshold were deemed unchanged.*"} set.seed(123) p1 <- pathway2plot %>% plot_gs_network( gsTopology = gsTopology, colorBy = "status" ) + scale_color_manual( values = c( "Activated" = "red", "Unchanged" = "gray" ) ) + theme( panel.grid = element_blank(), panel.background = element_blank() ) set.seed(123) p2 <- pathway2plot %>% mutate(`-log10(p)` = -log10(pvalue)) %>% plot_gs_network( gsTopology = gsTopology, colorBy = "-log10(p)" ) + theme( panel.grid = element_blank(), panel.background = element_blank() ) (p1 | p2) + plot_annotation(tag_levels = "A") ``` #### Visualise community structure in the gene-set network When a large number of pathways were perturbed, it is hard to answer the question "What key biological processes were perturbed?" solely by looking at all the pathway names. To solve that, we can use the `plot_community()` function to apply a community detection algorithm to the network structure we constructed above, and annotate each community by the biological process that most pathways assigned to that community belong to. Using the default Louvain community detection algorithm, two main communities were formed and annotated to be related to cancer and endocrine and sensory system, aligning with the expected changes in hormone-treated cancer samples. ```{r, fig.height=9, fig.width=12, fig.cap="*The top 20 ranked pathways in the R5020 treatment group, annotated by the main biological processes each pathways belong to and coloured by pathways' predicted change in direction. The status of pathways that did not pass the significance threshold to be considered as significantly perturbed were deemed as unchanged.*" } set.seed(123) pathway2plot %>% plot_community( gsTopology = gsTopology, colorBy = "status", color_lg_title = "Community" ) + scale_color_manual( values = c( "Activated" = "red", "Unchanged" = "gray" ) ) + theme(panel.background = element_blank()) ``` The `plot_community` function was built in with categorizations of *KEGG* pathways so annotation of *KEGG* communities can be automatically completed without the need to specify the `gsAnnotation` parameter. We also retrieved and curated the categorisation of *Reactome* pathways, which can be loaded using the following code: ```{r eval=TRUE} load(system.file("extdata", "gsAnnotation_df_reactome.rda", package = "sSNAPPY")) ``` Analyses involving other pathway databases may require user-provided pathway categories. #### Visualise genes included in perturbed pathways networks If we want to not only know if two pathways are connected but also the genes connecting those pathways, we can use the `plot_gs2gene()` function instead: ```{r, fig.height=7, fig.width=12, fig.cap="*Pathways significantly perturbed by the R5020 treatment and genes implicated in at least 3 of those pathways.*" } treat_sig %>% dplyr::filter(FDR < 0.05,) %>% plot_gs2gene( gsTopology = gsTopology, colorGsBy = "Estimate", labelGene = FALSE, geneNodeSize = 1, edgeAlpha = 0.1, gsNameSize = 2, filterGeneBy = 3 ) + scale_fill_gradient2() + theme(panel.background = element_blank()) ``` However, since there are a large number of genes in each pathway, the plot above was quite messy and it was unrealistic to plot all genes' names. So it is recommend to filter genes by their perturbation scores or logFC. For example, we can rank genes by the absolute values of their mean single-sample logFCs and only focus on genes that were ranked in the top 500 of the list. ```{r top500_FC} meanFC <- weightedFC$weighted_logFC %>% .[, str_detect(colnames(.), "E2", negate = TRUE)] %>% apply(1, mean ) top500_gene <- meanFC %>% abs() %>% sort(decreasing = TRUE, ) %>% .[1:500] %>% names() top500_FC <- meanFC %>% .[names(.) %in% top500_gene] top500_FC <- ifelse(top500_FC > 0, "Up-Regulated", "Down-Regulated") ``` When we provide genes' logFC as a named vector through the `geneFC` parameter, only pathway genes with logFC provided will be plotted and gene nodes will be colored by genes' directions of changes. The names of the logFC vector must be entrez IDs in the format of "ENTREZID:XXXX", as pathway topology matrices retrieved through `retrieve_topology()` always use entrez IDs as identifiers. However, it is not going to be informative to label genes with their entrez IDs. So just as in the `plot_gene_contribution()` function, we can provide a `data.frame` to match genes' entrez IDs to our chosen identifiers through the `mapEntrezID` parameter in the `plot_gs2gene()` function too. ```{r, fig.height=8, fig.width=10, fig.cap="*Pathways significantly perturbed by the R5020 treatment, and pathway genes with top 500 magnitudes of changes among all R5020-treated samples. Both pathways and genes were colored by their predicted directions of changes.*" } treat_sig %>% dplyr::filter(FDR < 0.05, Treatment == "R5020") %>% mutate(status = ifelse(Estimate > 0, "Activated", "Inhibited")) %>% plot_gs2gene( gsTopology = gsTopology, colorGsBy = "status", geneFC = top500_FC, mapEntrezID = entrez2name, gsNameSize = 3, filterGeneBy = 0 ) + scale_fill_manual(values = c("darkred", "lightskyblue")) + scale_colour_manual(values = c("red", "blue")) + theme(panel.background = element_blank()) ``` # References - Charity W Law, Yunshun Chen, Wei Shi, and Gordon K Smyth. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol, 15(2):R29, 2014. doi: 10.1186/gb-2014-15-2-r29. - Gabriele Sales, Enrica Calura, Duccio Cavalieri, and Chiara Romualdi. graphite - a Bioconductor package to convert pathway topology to gene network. BMC Bioinformatics, 13(1):20, December 2012. doi: 10.1186/1471-2105-13-20. - Adi Laurentiu Tarca, Sorin Draghici, Purvesh Khatri, Sonia S. Hassan, Pooja Mittal, Jung-sun Kim, Chong Jai Kim, Juan Pedro Kusanovic, and Roberto Romero. A novel signaling pathway impact analysis. Bioinformatics, 25(1):75–82, January 2009. doi: 10.1093/bioinformatics/btn577. - Minoru Kanehisa and Susumu Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28.1 (2000), pp. 27–30. - Marc Gillespie, Bijay Jassal, Ralf Stephan, Marija Milacic, Karen Rothfels, Andrea Senff-Ribeiro, Johannes Griss, Cristoffer Sevilla, Lisa Matthews, Chuqiao Gong, Chuan Deng, Thawfeek Varusai, Eliot Ragueneau, Yusra Haider, Bruce May, Veronica Shamovsky, Joel Weiser, Timothy Brunson, Nasim Sanati, Liam Beckman, Xiang Shao, Antonio Fabregat, Konstantinos Sidiropoulos, Julieth Murillo, Guilherme Viteri, Justin Cook, Solomon Shorser, Gary Bader, Emek Demir, Chris Sander, Robin Haw, Guanming Wu, Lincoln Stein, Henning Hermjakob, and Peter D’Eustachio. The reactome pathway knowledgebase 2022. Nucleic Acids Research, 50(D1):D687–D692, November 2021. doi: 10.1093/nar/gkab1028. - Marvin Martens et al. WikiPathways: connecting communities. Nucleic Acids Research 49.D1 (2021), pp. D613–D621. doi: 10.1093/nar/gkaa1024. # Session Info ```{r sessionInfo} sessionInfo() ```