---
title: "compSPOT-Vignette"
author:
- name: Sydney Grant
affiliation: Roswell Park Comprehensive Cancer Center
email: sydney.grant@roswellpark.org
- name: Ella Sampson
affiliation: Roswell Park Comprehensive Cancer Center
email: ellasamp@buffalo.edu
- name: Rhea Rodrigues
affiliation: Roswell Park Comprehensive Cancer Center
email: RheaCarmelGlen.Rodrigues@roswellpark.org
- name: Gyorgy Paragh
affiliation: Roswell Park Comprehensive Cancer Center
email: Gyorgy.Paragh@roswellpark.org
package: compSPOT
output:
BiocStyle::html_document:
toc: true
theme: cerulean
vignette: |
%\VignetteIndexEntry{compSPOT-Vignette}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# Introduction
Sydney R. Grant^1,2^, Ella Sampson^1^, Rhea Rodrigues^1,2^, Gyorgy Paragh^1,2^
^1^Department of Dermatology, Roswell Park Comprehensive Cancer Center, Buffalo, NY
^2^Department of Cell Stress Biology, Roswell Park Comprehensive Cancer Center, Buffalo, NY
Clonal cell groups share common mutations within cancer, precancer, and even
clinically normal appearing tissues. The frequency and location of these
mutations may predict prognosis and cancer risk. It has also been well
established that certain genomic regions have increased sensitivity to acquiring
mutations. Mutation-sensitive genomic regions may therefore serve as markers
for predicting cancer risk. This package contains multiple functions to
establish significantly mutated hotspots, compare hotspot mutation burden
between samples, and perform exploratory data analysis of the correlation
between hotspot mutation burden and personal risk factors for cancer, such as
age, gender, and history of carcinogen exposure. This package allows users to
identify robust genomic markers to help establish cancer risk.
Currently, minimal resources exist which enable researchers to design their own
targeted sequencing panels based on specific biological questions and tissues
of interest. `compSPOT` has been designed to work sequentially with Bioconductor
package `seq.hotSPOT`. Highly mutated genomic regions identified by `seq.hotSPOT`
may be used for discovery of significant mutation hotspots with `compSPOT`.
`compSPOT` may also be used to discover differences in hotspot mutation burden
between different groups of interest, and the association of mutation burden with
clinical features. `compSPOT` may be used in combination with the Bioconductor
package `RTCGA.mutations`, which can be used to pull mutation datasets from the
TCGA database to be used as input data in various cancer types. Additionally,
the package `RTCGA.clinical` may be also used to identify highly mutated regions
in subsets of patients with specific clinical features of interest.
# Installation & Setup
```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("compSPOT")
```
Load [*compSPOT*][]
```{r}
library(compSPOT)
```
# Example Input Data
## Mutation Data
The mutation dataset should include the following columns:
"Chromosome" <-- Chromosome number where the mutation is located
"Position" <-- Genomic position number where the mutation is located
"Sample" <-- Unique ID for each sample in dataset
"Gene" <-- Name of the gene which mutation is located in (optional)
"Group" <-- Group classification ID (for compare_groups only)
Clinical Parameters <-- (for compare_features only)
Loading example mutations:
``` {r load mutations}
data("compSPOT_example_mutations")
```
## Regions Data
The regions dataset should include the following columns:
"Chromosome" <-- Chromosome number where the region is located
"Lowerbound" <-- Genomic position number where the region begins
"Upperbound" <-- Genomic position number where the region ends
"Gene" <-- Name of the gene which mutation is located in (optional)
"Count" <-- Number of mutations in mutation dataset which are found within the
region (optional)
Loading example regions:
``` {r load regions}
data("compSPOT_example_regions")
```
# Example Workflow
The compSPOT package contains three main functions for (1) selection of
mutation hotspots (2) comparison of hotspot mutation burden between groups,
and (3) comparison of mutation hotspot enrichment based on clinical and personal
risk factors. All functions return both numerical outputs based on analysis
summary and data visualization components for quick and easy interpretation of
results.
## Identifying Mutation Hotspots with find_hotspots
Our previously published Bioconductor package `seq.hotSPOT`
(doi: 10.3390/cancers15051612) identifies highly mutated genomic regions based
on SNV datasets. While this tool can identify long lists of mutated regions,
we sought to establish a method for identifying which of these genomic regions
have significantly higher mutation frequency compared to others and may be used
as markers of carcinogenic progression.
Methods: This function begins by measuring the mutation frequency for each
unique sample for each provided genomic region. Beginning with the top-ranked
hotspot, a Kolmogorov-Smirnov test is performed on the mutation frequency of
the top genomic region compared to the normalized mutation frequency of all the
lower-ranked regions. This continues, then running the Kolmogorov-Smirnov test
for the normalized mutation frequency of the top 2 genomic regions compared to
the normalized mutation frequency of all lower-ranked regions. This process
repeats itself, continuously adding an additional genomic regions each time
until either the set p-value or empirical distribution threshold is not met.
Once this cutoff has been reached, an established list of mutation hotspots is
provided.
```{r sig.spots}
significant_spots <- find_hotspots(data = compSPOT_example_mutations,
regions = compSPOT_example_regions,
pvalue = 0.05, threshold = 0.2,
include_genes = TRUE,
rank = TRUE)
```
Table 1. Example output table from find_hotspots function.
This table is stored in the first position of the output list.
```{r table 1}
head(significant_spots[[1]])
```
## Comparison Mutation Hotspot Burden with compare_groups
Previously, we have shown mutation hotspots identified using seq.hotSPOT may be
used to differentiate between samples with history of frequent vs infrequent
carcinogen exposure (doi: 10.3390/cancers15051612, doi: 10.3390/ijms24097852).
compare_groups provides an automated approach for statistical and visual
comparison between mutation enrichment of different groups of interest.
Methods: This function creates a list of mutation frequency per unique sample
for each genomic region separated based on specified sub-groups. The regions
with significant differences in mutation distribution are calculated using a
Kolmogorov-Smirnov test. The difference in mutation frequency is output in a
violin plot.
For this example dataset, the sig.spot function identified 6 hotspots. We will
use these 6 hotspots to compared the mutation burden between Lung Cancer
patients with high- and low-risk of disease progression.
```{r group.spot}
hotspots <- subset(significant_spots[[1]], type == "Hotspot")
group_comp <- compare_groups(data = compSPOT_example_mutations,
regions = hotspots, pval = 0.05,
threshold = 0.2,
name1 = "High-Risk",
name2 = "Low-Risk",
include_genes = TRUE)
```
Table 2. Example output table from compare_groups function.
This table is stored in the first position of the output list.
```{r table 2}
group_comp[[1]]
```
## EDA of Mutation Hotspot Burden and Personal Risk Factors with compare_features
Mutation enrichment in cancer mutation hotspots has been shown to relate to
personal cancer risk factors such as age, gender, and carcinogen exposure
history and may be used in combination to create predictive models of cancer
risk (doi: 10.3390/ijms24097852). feature.spot provides a baseline analysis of
any set of clinical features to identify trends in the enrichment of mutations
and personal risk factors.
Methods: This function first classifies the features into sequential or
categorical features. Sequential features are compared to the mutation count
using Pearson Correlation. Similarly, in categorical features Wilcox Rank Sum
and Kruska-Wallis Tests are used to compare groups within the features based on
their mutational count.
```{r feature.spot}
features <- c("AGE", "SEX", "SMOKING_HISTORY", "TUMOR_VOLUME", "KI_67")
feature_example <- compare_features(data = compSPOT_example_mutations,
regions = compSPOT_example_regions,
feature = features)
```
```{r sessionInfo, echo=FALSE}
sessionInfo()
```