---
title: "MetaPhOR"
author: "Emily Isenhart"
output: 
    BiocStyle::html_document:
        toc: FALSE
bibliography: "`r system.file('extdata/REFERENCES.bib', package = 'MetaPhOR')`"
vignette: >
    %\VignetteIndexEntry{MetaPhOR}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r style, include = FALSE, results = 'asis'}
BiocStyle::html_document()
```

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    dpi=200
)

library(kableExtra)
```

## Introduction  
MetaPhOR was developed to enable users to assess metabolic dysregulation using 
transcriptomic-level data (RNA-sequencing and Microarray data) and produce 
publication-quality figures. A list of differentially expressed genes (DEGs), 
which includes fold change and p value, from DESeq2 (@DESeq2) or limma (@limma),
can be used as input, with sample size for MetaPhOR, and will produce a data 
frame of scores for each KEGG pathway. These scores represent the magnitude and 
direction of transcriptional change within the pathway, along with estimated 
p-values (@MP). MetaPhOR then uses these scores to visualize 
metabolic profiles within and between samples through a variety of mechanisms, 
including: bubble plots, heatmaps, and pathway models.

## Installation  
This command line can be used to install and load MetaPhOR.  

if (!require("BiocManager", quietly = TRUE))  
    install.packages("BiocManager")
BiocManager::install("MetaPhOR")

```{r, message=FALSE}
library(MetaPhOR)
```

## Data Preparation  
Minimal data preparation is required to run MetaPhOR. DEGs may be loaded into R 
in the form of .csv or .tsv for use with this package. The DEG file must contain
columns for log fold change, adjusted p-value, and HUGO gene names. By default, 
MetaPhOR assumes DESeq2 header stylings: “log2FoldChange” and “padj”. In any 
function that assumes these headers, however, the user can define column names
for these values. Below is a sample DEG file resulting from limma: 

```{r}
exdegs <- read.csv(system.file("extdata/exampledegs.csv",
                                package = "MetaPhOR"),
                                header = TRUE)
```

```{r, echo=FALSE}
kable(head(exdegs), format="html")    %>%
    kable_material()
```

## Pathway Analysis
“pathwayAnalysis” first assigns scores and their absolute values using log fold 
change and p value to each gene (@MP). These transcript-level 
scores, along with sample size, are then utilized to calculate both scores 
(directional change) and absolute value scores (magnitude of change) 
(@MP) for each KEGG Pathway (@KEGG). We then utilize a 
bootstrapping method, to  randomly calculate 100,000 scores per pathway, based 
on the number of genes in that pathway and model the distribution. This 
distribution can then be used to evaluate where the actual score for that 
pathway sits in relation to the distribution, and can assign a p-value to the 
achieved score.  

For example, if the polyamine biosynthetic pathway contains 13 genes, we can get
a score for the sum of the 13 genes that exist within that pathway. Using 
bootstrapping, we can then randomly sample, with replacement, 13 genes to create
scores, 100,000 times.  We use these random samples (100,000) to generate a 
distribution, and we can calculate a p-value dependent on where the score that 
consists of the 13 genes that actually exist within the pathway falls within the
distribution.  

Taken together, the scores and p-values resulting from “pathwayAnalysis” provide
a measure for both the biological and statistical significance of metabolic 
dysregulation.  

**Note: A seed MUST be set before utilizing this function to ensure reproducible
results by bootstrapping. It is NECESSARY that the seed remain the same
throughout an analysis.**

**Note: Bootrapping is performed, by default, with 100,000 iterations of 
resampling for optimal power. The number of iterations can be decreased for 
improved speed. We use 50,000 iterations for improved speed of examples.***

pathwayAnalysis() requires:  

* The file path to the DEG list of interest 
* The name of the column containing HUGO gene names 
* The sample size of the DEG analysis
* The number of iterations of resampling to be performed during bootstrapping
* Correct headers for fold change and p value columns (as indicated above)

A partial output of the pathway analysis function can be seen as follows: 

```{r, include=FALSE}
# BRCA, OVCA, PRAD 
# sampsize <- c(1095, 378, 497)
```

```{r}
set.seed(1234)

brca <- pathwayAnalysis(DEGpath = system.file("extdata/BRCA_DEGS.csv",
                        package = "MetaPhOR"),
                        genename = "X",
                        sampsize = 1095,
                        iters = 50000,
                        headers = c("logFC", "adj.P.Val"))
```

```{r, echo = FALSE}
kable(head(brca), format="html", booktabs = TRUE)    %>%
    kable_material()
```

## bubblePlot  
The metabolic profile determined by pathway analysis can be easily visualized 
using “bubblePlot.” Scores are plotted on the x-axis, while absolute value 
scores are plotted on the y-axis. Each point represents a KEGG pathway, where 
point size represents p-value (the smaller the p value, the larger the point) 
and point color is dictated by scores. Negative scores, which indicate 
transcriptional downregulation, are blue, and positive scores, which indicate 
transcriptional upregulation, are red. The top ten points, either by smallest 
p value or greatest dysregulation by score, are labeled with their pathway 
names. The plot demonstrates which pathways are the most statistically and 
biologically relevant. 

bubblePlot() requires:  

* The output of pathwayAnalysis(), as a data frame
* An indication which values to use, in order to label points: either “Pval” or 
“LogFC”
* Optional: Numeric value for point label text size (default = .25)

\newpage 
**Bubble Plot Labeled by P Value**  
```{r}
pval <- bubblePlot(scorelist = brca,
                    labeltext = "Pval",
                    labelsize = .85)
plot(pval)
```

**Bubble Plot Labeled by LogFC**  
```{r}
logfc <- bubblePlot(scorelist = brca,
                    labeltext = "LogFC",
                    labelsize = .85)
plot(logfc)
```
\newpage 

## metaHeatmap  
“metaHeatmap” provides a useful visualization for comparing metabolic profiles 
between groups, including only significantly dysregulated pathways, and 
highlighting those which are most highly changed. This function should be used 
when you have multiple groups/DEGs being compared, e.g. if you have 4 conditions
all being compared to each other. This will not be useful if you have a single 
DEG list. This function can be used only when multiple DEG comparisons are 
scored by “pathwayAnalysis.” The absolute pathway scores are scaled across 
outputs and plotted via pheatmap, selecting only those which have absolute score
p values below the level of significance. 

Note: A heatmap cannot be produced if there are no pathways significantly 
dysregulated below the p value cut off. 

metaHeatmap() requires:  

* A list of outputs from pathway analysis, as data frames
* A character vector of names for labeling each output
* Optional: The p value cut off to be used (default = 0.05)

```{r}
##read in two additional sets of scores,
##run in the same manner as brca for comparison

ovca <- read.csv(system.file("extdata/OVCA_Scores.csv", package = "MetaPhOR"),
                header = TRUE,
                row.names = 1)
prad <- read.csv(system.file("extdata/PRAD_Scores.csv", package = "MetaPhOR"),
                header = TRUE,
                row.names = 1)

all.scores <- list(brca, ovca, prad)
names <- c("BRCA", "OVCA", "PRAD")

metaHeatmap(scorelist = all.scores,
            samplenames = names,
            pvalcut = 0.05)
```

## cytoPath  
“cytoPath” models metabolic pathways, sourced from WikiPathway (@Wikipathways), 
using the transcriptional change of individual genes in the pathway. The 
resulting figure can be used to identify candidate genes for further study by 
providing a detailed look at transcriptional dysregulation within and between 
“pathwayAnalysis” outputs. Pathways of interest can be readily identified from 
“bubblePlot” or “metaHeatmap.” This function utilizes Cytoscape (@Cytoscape); 
the package *RCy3* and a local instance of Cytoscape are required to render this
plot. For details regarding legends, see the Cytoscape Legend Creator Tutorial
(https://cytoscape.org/cytoscape-tutorials/protocols/legend-creator/#/
introduction).

cytoPath() requires:  

* The name of the pathway to be plotted
* The file path to the DEG list of interest 
* The file path to which the figure will be saved
* The name of the column containing HUGE gene names 
* Correct headers for fold change and p value columns (as indicated above)

```{r, eval = FALSE}
cytoPath(pathway = "Tryptophan Metabolism",
            DEGpath = "BRCA_DEGS.csv",
            figpath = paste(getwd(), "BRCA_Tryptophan_Pathway", sep = "/"),
            genename = "X",
            headers = c("logFC", "adj.P.Val"))
```


```{r, echo = F, fig.wide = TRUE}
knitr::include_graphics(c(system.file("extdata", "BRCA_Tryptophan_Pathway.png", 
                                        package = "MetaPhOR")))
```

## pathwayList  
The WikiPathways available within MetaPhOR have been restricted to metabolic 
pathways. The pathwayList function will provide a complete list of WikiPathways
for mapping with “cytoPath.” 

```{r, eval = FALSE}
pathwayList()
```

```{r, echo = F}
kable(head(pathwayList()), format = "html", booktabs = TRUE)    %>%
    kable_material()
```

## Conclusion  
MetaPhOR contains five functions for the analysis of metabolic dysregulation 
using transcriptomic-level data and the production of publication-quality 
figures. This version of MetaPhOR, 0.99.0, is available at
https://github.com/emilyise/MetaPhOR. 

## SessionInfo
```{r}
sessionInfo()
```

## References