---
title: "compSPOT-Vignette"
author:
- name: Sydney Grant
  affiliation: Roswell Park Comprehensive Cancer Center
  email: sydney.grant@roswellpark.org
- name: Ella Sampson
  affiliation: Roswell Park Comprehensive Cancer Center
  email: ellasamp@buffalo.edu
- name: Rhea Rodrigues
  affiliation: Roswell Park Comprehensive Cancer Center
  email: RheaCarmelGlen.Rodrigues@roswellpark.org
- name: Gyorgy Paragh
  affiliation: Roswell Park Comprehensive Cancer Center
  email: Gyorgy.Paragh@roswellpark.org
package: compSPOT
output:
  BiocStyle::html_document:
  toc: true
  theme: cerulean
vignette: |
  %\VignetteIndexEntry{compSPOT-Vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```



# Introduction

Sydney R. Grant^1,2^, Ella Sampson^1^, Rhea Rodrigues^1,2^, Gyorgy Paragh^1,2^

^1^Department of Dermatology, Roswell Park Comprehensive Cancer Center, Buffalo, NY

^2^Department of Cell Stress Biology, Roswell Park Comprehensive Cancer Center, Buffalo, NY

Clonal cell groups share common mutations within cancer, precancer, and even 
clinically normal-appearing tissues. The frequency and location of these 
mutations may predict prognosis and cancer risk. It has also been well 
established that certain genomic regions have increased sensitivity to acquiring 
mutations. Mutation-sensitive genomic regions may therefore serve as markers 
for predicting cancer risk. This package contains multiple functions to 
establish significantly mutated hotspots, compare hotspot mutation burden 
between samples, and perform exploratory data analysis of the correlation 
between hotspot mutation burden and personal risk factors for cancer, such as 
age, gender, and history of carcinogen exposure. This package allows users to 
identify robust genomic markers to help establish cancer risk.

Currently, few resources exist that enable researchers to design their own 
targeted sequencing panels based on specific biological questions and tissues 
of interest. `compSPOT` has been designed to work sequentially with the 
Bioconductor package `seq.hotSPOT`. Highly mutated genomic regions identified 
by `seq.hotSPOT` may be used for discovery of significant mutation hotspots 
with `compSPOT`. `compSPOT` may also be used to discover differences in hotspot 
mutation burden between groups of interest, and the association of mutation 
burden with clinical features. `compSPOT` may be used in combination with the 
Bioconductor package `RTCGA.mutations`, which can pull mutation datasets for 
various cancer types from the TCGA database to be used as input data. 
Additionally, the package `RTCGA.clinical` may also be used to identify highly 
mutated regions in subsets of patients with specific clinical features of 
interest.

# Installation & Setup

```{r, eval = FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("compSPOT")
```

Load `compSPOT`:

```{r}
library(compSPOT)
```

# Example Input Data

## Mutation Data

The mutation dataset should include the following columns:

- "Chromosome": chromosome number where the mutation is located
- "Position": genomic position number where the mutation is located
- "Sample": unique ID for each sample in the dataset
- "Gene": name of the gene in which the mutation is located (optional)
- "Group": group classification ID (for `compare_groups` only)
- Clinical parameters (for `compare_features` only)
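
As a rough illustration of this layout, the sketch below builds a small 
mutation table by hand and checks that the required columns are present. All 
values (positions, sample IDs, gene names, group labels, and the `AGE` column) 
are made up for illustration and are not taken from the example dataset; the 
column types used by the packaged data may differ.

```r
# Toy mutation table; every value here is hypothetical.
mutations <- data.frame(
  Chromosome = c("1", "1", "2", "2"),
  Position   = c(100, 105, 400, 410),
  Sample     = c("S1", "S2", "S1", "S3"),
  Gene       = c("GENE_A", "GENE_A", "GENE_B", "GENE_B"),  # optional
  Group      = c("High-Risk", "Low-Risk", "High-Risk", "Low-Risk"),
  AGE        = c(61, 54, 61, 70)  # example clinical parameter
)

# Columns that every mutation table needs, per the list above.
required <- c("Chromosome", "Position", "Sample")
missing_cols <- setdiff(required, colnames(mutations))

if (length(missing_cols) > 0) {
  stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
}
```

A check like this can be run on your own data before passing it to the 
functions described in the workflow.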

Loading example mutations:

``` {r load mutations}
data("compSPOT_example_mutations")
```


## Regions Data

The regions dataset should include the following columns:

- "Chromosome": chromosome number where the region is located
- "Lowerbound": genomic position number where the region begins
- "Upperbound": genomic position number where the region ends
- "Gene": name of the gene in which the region is located (optional)
- "Count": number of mutations in the mutation dataset which are found within 
  the region (optional)
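
The optional "Count" column can be derived from a mutation table. The sketch 
below shows one way to do this in base R, assuming that both bounds are 
inclusive (an assumption, not something stated by the package). The toy 
`mutations` and `regions` tables are hypothetical.

```r
# Toy inputs; all coordinates are made up for illustration.
mutations <- data.frame(
  Chromosome = c("1", "1", "1", "2", "2"),
  Position   = c(100, 105, 250, 100, 400)
)
regions <- data.frame(
  Chromosome = c("1", "1", "2"),
  Lowerbound = c(95, 240, 390),
  Upperbound = c(110, 260, 410)
)

# Count mutations on the same chromosome that fall inside each region
# (bounds treated as inclusive; this is an assumption).
regions$Count <- vapply(seq_len(nrow(regions)), function(i) {
  sum(mutations$Chromosome == regions$Chromosome[i] &
        mutations$Position >= regions$Lowerbound[i] &
        mutations$Position <= regions$Upperbound[i])
}, integer(1))

regions
```

The mutation on chromosome 2 at position 100 lies outside every region and is 
therefore not counted.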

Loading example regions:

``` {r load regions}
data("compSPOT_example_regions")
```

# Example Workflow

The `compSPOT` package contains three main functions for (1) selection of 
mutation hotspots, (2) comparison of hotspot mutation burden between groups, 
and (3) comparison of mutation hotspot enrichment based on clinical and personal 
risk factors. All functions return both a numerical summary of the analysis 
and data visualization components for quick and easy interpretation of 
results.

## Identifying Mutation Hotspots with find_hotspots

Our previously published Bioconductor package `seq.hotSPOT` 
(doi: 10.3390/cancers15051612) identifies highly mutated genomic regions based 
on SNV datasets. While this tool can identify long lists of mutated regions, 
we sought to establish a method for identifying which of these genomic regions 
have significantly higher mutation frequency compared to others and may be used 
as markers of carcinogenic progression.


Methods: This function begins by measuring the mutation frequency of each 
unique sample in each provided genomic region. Beginning with the top-ranked 
region, a Kolmogorov-Smirnov test compares the mutation frequency of the top 
genomic region with the normalized mutation frequency of all lower-ranked 
regions. The test is then repeated for the normalized mutation frequency of 
the top 2 genomic regions compared with that of all lower-ranked regions. This 
process repeats, adding one more genomic region each time, until either the 
set p-value or the empirical distribution threshold is no longer met. The 
regions included before this cutoff are reported as the established list of 
mutation hotspots.
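
The iterative procedure described above can be sketched in base R with 
`stats::ks.test`. This is only an illustration on simulated data, not the 
implementation used by `find_hotspots`: the simulated frequencies, the pooling 
of per-sample values without further normalization, and the reading of 
`threshold` as a minimum Kolmogorov-Smirnov D statistic are all assumptions.

```r
set.seed(1)

# Simulated per-sample mutation frequencies: rows are regions, columns are
# samples. Three regions are given a higher mean than the other seven.
n_samples <- 30
region_means <- c(8, 7, 6, rep(1, 7))
freq <- t(sapply(region_means, function(m) {
  rgamma(n_samples, shape = 4 * m, rate = 4)
}))
rownames(freq) <- paste0("region_", seq_along(region_means))

# Rank regions from highest to lowest mean frequency.
freq <- freq[order(rowMeans(freq), decreasing = TRUE), ]

pvalue <- 0.05     # significance cutoff
threshold <- 0.2   # assumed to be a minimum KS D statistic
n_regions <- nrow(freq)

# Step k compares the pooled top k regions with all lower-ranked regions.
steps <- do.call(rbind, lapply(seq_len(n_regions - 1), function(k) {
  top <- as.vector(freq[seq_len(k), , drop = FALSE])
  rest <- as.vector(freq[(k + 1):n_regions, , drop = FALSE])
  res <- ks.test(top, rest)
  data.frame(k = k, D = unname(res$statistic), p_value = res$p.value)
}))

# Regions are kept up to the last step before either cutoff fails.
failed <- which(steps$p_value >= pvalue | steps$D < threshold)
n_hotspots <- if (length(failed) > 0) failed[1] - 1 else nrow(steps)
hotspots <- rownames(freq)[seq_len(n_hotspots)]

steps
```

Inspecting `steps` shows how the D statistic and p-value change as regions are 
added. The number of regions retained depends on both cutoffs and on how the 
frequencies are normalized, which this sketch does not attempt to reproduce.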

```{r sig.spots}
significant_spots <- find_hotspots(data = compSPOT_example_mutations, 
                                   regions = compSPOT_example_regions, 
                                   pvalue = 0.05, threshold = 0.2, 
                                   include_genes = TRUE, 
                                   rank = TRUE)
```

Table 1. Example output table from find_hotspots function. 
This table is stored in the first position of the output list.

```{r table 1}
head(significant_spots[[1]])
```


## Comparing Mutation Hotspot Burden with compare_groups

Previously, we have shown mutation hotspots identified using seq.hotSPOT may be 
used to differentiate between samples with history of frequent vs infrequent 
carcinogen exposure (doi: 10.3390/cancers15051612, doi: 10.3390/ijms24097852). 
compare_groups provides an automated approach for statistical and visual 
comparison between mutation enrichment of different groups of interest.


Methods: This function creates a list of mutation frequency per unique sample 
for each genomic region separated based on specified sub-groups. The regions 
with significant differences in mutation distribution are calculated using a 
Kolmogorov-Smirnov test. The difference in mutation frequency is output in a 
violin plot.
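
A minimal sketch of the per-region comparison, again using simulated data and 
base R only. The group labels, the simulated frequencies, and the use of 
`threshold` as a minimum D statistic are assumptions made for illustration; 
the violin plot produced by `compare_groups` is not reproduced here.

```r
set.seed(7)

# Simulated mutation frequency per sample for four regions and two groups.
# Only the first two regions are given a group difference.
n_per_group <- 25
region_names <- paste0("region_", 1:4)
shift <- c(3, 2, 0, 0)

burden <- do.call(rbind, lapply(seq_along(region_names), function(i) {
  data.frame(
    Region = region_names[i],
    Group = rep(c("High-Risk", "Low-Risk"), each = n_per_group),
    Frequency = c(rgamma(n_per_group, shape = 2 + shift[i], rate = 1),
                  rgamma(n_per_group, shape = 2, rate = 1))
  )
}))

# Two-sample Kolmogorov-Smirnov test between the groups, one test per region.
comparison <- do.call(rbind, lapply(split(burden, burden$Region), function(d) {
  res <- ks.test(d$Frequency[d$Group == "High-Risk"],
                 d$Frequency[d$Group == "Low-Risk"])
  data.frame(Region = d$Region[1],
             D = unname(res$statistic),
             p_value = res$p.value)
}))

# Flag regions that pass both cutoffs (cutoff semantics are assumed).
comparison$significant <- comparison$p_value < 0.05 & comparison$D >= 0.2
comparison
```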

For this example dataset, the `find_hotspots` function identified 6 hotspots. 
We will use these 6 hotspots to compare the mutation burden between lung cancer 
patients at high and low risk of disease progression.


```{r group.spot}
hotspots <- subset(significant_spots[[1]], type == "Hotspot")

group_comp <- compare_groups(data = compSPOT_example_mutations, 
                             regions = hotspots, pval = 0.05, 
                             threshold = 0.2, 
                             name1 = "High-Risk", 
                             name2 = "Low-Risk", 
                             include_genes = TRUE)
```

Table 2. Example output table from compare_groups function.
This table is stored in the first position of the output list.

```{r table 2}
group_comp[[1]]
```


## EDA of Mutation Hotspot Burden and Personal Risk Factors with compare_features 

Mutation enrichment in cancer mutation hotspots has been shown to relate to 
personal cancer risk factors such as age, gender, and carcinogen exposure 
history, and these factors may be used in combination to create predictive 
models of cancer risk (doi: 10.3390/ijms24097852). `compare_features` provides 
a baseline analysis of any set of clinical features to identify trends between 
the enrichment of mutations and personal risk factors.

Methods: This function first classifies the features as sequential or 
categorical. Sequential features are compared to the mutation count using 
Pearson correlation. For categorical features, Wilcoxon rank-sum and 
Kruskal-Wallis tests are used to compare the mutation counts of the groups 
within each feature.
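
The classification step can be sketched as follows. This is an illustration on 
simulated data, not the code behind `compare_features`. It assumes that 
numeric columns are treated as sequential, that two-level features use the 
Wilcoxon rank-sum test, and that features with more than two levels use the 
Kruskal-Wallis test. The helper `test_feature` and the `mutation_count` column 
are hypothetical names.

```r
set.seed(42)

# Simulated clinical table with one mutation count per sample.
n <- 60
clinical <- data.frame(
  AGE = round(runif(n, 40, 85)),
  SEX = sample(c("F", "M"), n, replace = TRUE),
  SMOKING_HISTORY = sample(c("never", "former", "current"), n, replace = TRUE)
)
clinical$mutation_count <- rpois(n, lambda = 2 + clinical$AGE / 20)

# Hypothetical helper: pick a test based on the type of the feature.
test_feature <- function(values, counts) {
  if (is.numeric(values)) {
    res <- cor.test(values, counts, method = "pearson")
    test <- "Pearson correlation"
  } else {
    groups <- factor(values)
    if (nlevels(groups) == 2) {
      # exact = FALSE avoids warnings caused by tied counts
      res <- wilcox.test(counts ~ groups, exact = FALSE)
      test <- "Wilcoxon rank-sum"
    } else {
      res <- kruskal.test(counts ~ groups)
      test <- "Kruskal-Wallis"
    }
  }
  data.frame(test = test, p_value = res$p.value)
}

features <- c("AGE", "SEX", "SMOKING_HISTORY")
results <- do.call(rbind, lapply(features, function(f) {
  cbind(feature = f, test_feature(clinical[[f]], clinical$mutation_count))
}))
results
```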


```{r feature.spot}
features <- c("AGE", "SEX", "SMOKING_HISTORY", "TUMOR_VOLUME", "KI_67")
feature_example <- compare_features(data = compSPOT_example_mutations, 
                                    regions = compSPOT_example_regions, 
                                    feature = features)
```



```{r sessionInfo, echo=FALSE}
sessionInfo()
```