---
title: "CoGAPS - Coordinated Gene Association in Pattern Sets"
author: "Jeanette Johnson, Ashley Tsang, Thomas Sherman, Genevieve Stein-O'Brien, Hyejune Limb, Elana Fertig"
date: "`r BiocStyle::doc_date()`"
bibliography: References.bib
vignette: >
    %\VignetteIndexEntry{CoGAPS}
    %\VignetteEncoding{UTF-8}
    %\VignetteEngine{knitr::rmarkdown}
output: 
    BiocStyle::html_document
editor_options: 
  markdown: 
    wrap: 72
---

```{r include=FALSE, cache=FALSE}
library(CoGAPS)
library(BiocParallel)
```

# Vignette Version

This vignette was built using CoGAPS version:

```{r current version}
packageVersion("CoGAPS")
```

# Introduction

Coordinated Gene Association in Pattern Sets (CoGAPS) is a technique for
latent space learning in gene expression data. CoGAPS is a member of the
Nonnegative Matrix Factorization (NMF) class of algorithms. NMFs
factorize a data matrix into two related matrices containing gene
weights, the Amplitude (A) matrix, and sample weights, the Pattern (P)
Matrix. Each column of A or row of P defines a feature and together this
set of features defines the latent space among genes and samples,
respectively. In NMF, the values of the elements in the A and P matrices
are constrained to be greater than or equal to zero. This constraint
simultaneously reflects the non-negative nature of gene expression data
and enforces the additive nature of the resulting feature dimensions,
generating solutions that are biologically intuitive to interpret
(@SEUNG_1999).

# Software Setup

*CoGAPS* can be installed directly from the FertigLab Github Repository
using R devtools or from Bioconductor

```{r eval=FALSE}
devtools::install_github("FertigLab/CoGAPS")

# To install via BioConductor:
install.packages("BiocManager")
BiocManager::install("FertigLab/CoGAPS")
```

When CoGAPS has installed correctly, you will see this message:

\*\* installing vignettes

\*\* testing if installed package can be loaded from temporary location

\*\* checking absolute paths in shared objects and dynamic libraries

\*\* testing if installed package can be loaded from final location

\*\* testing if installed package keeps a record of temporary
installation path

\* DONE (CoGAPS)

# Running CoGAPS on Simulated Toy Data

We first give a walkthrough of the package features using a simple,
simulated data set. In later sections we provide two example workflows
on real data sets.

Import the CoGAPS library with the following command:

```{r library}
library(CoGAPS)
```

To ensure CoGAPS is working properly, we will first load in the
simulated toy data for a test run.

Single-cell data will be loaded later in this file.

```{r modsim}
data('modsimdata')
# input to CoGAPS
modsimdata
```

Next, we will set the parameters to be used by CoGAPS. First, we will
create a CogapsParams object, then set parameters with the setParam
function.

```{r modsim params}
# create new parameters object
params <- new("CogapsParams", nPatterns=6)

# view all parameters
params

# get the value for a specific parameter
getParam(params, "nPatterns")

# set the value for a specific parameter
params <- setParam(params, "nPatterns", 3)
getParam(params, "nPatterns")
```

Run `CoGAPS` on the ModSim data.

Since this is a small dataset, the expected runtime is only about 5-10
seconds.

The only required argument to `CoGAPS` is the data set. This can be a
`matrix`, `data.frame`, `SummarizedExperiment`, `SingleCellExperiment`
or the path of a file (`tsv`, `csv`, `mtx`, `gct`) containing the data.

```{r cogaps on modsim}
# run CoGAPS with specified parameters
cogapsresult <- CoGAPS(modsimdata, params, outputFrequency = 10000)
```

Verify that a similar output appears:
```{eval=FALSE}
This is CoGAPS version 3.19.1 

Running Standard CoGAPS on modsimdata (25 genes and 20 samples) with
parameters:

-- Standard Parameters --
nPatterns            3 
nIterations          50000 
seed                 622 
sparseOptimization   FALSE 

-- Sparsity Parameters --
alpha          0.01 
maxGibbsMass   100 
Data Model: Dense, Normal
Sampler Type: Sequential
Loading Data\...Done! (00:00:00)

-- Equilibration Phase --
10000 of 50000, Atoms: 59(A), 49(P), ChiSq: 245, Time: 00:00:00 00:00:00
20000 of 50000, Atoms: 68(A), 46(P), ChiSq: 188, Time: 00:00:00 00:00:00
30000 of 50000, Atoms: 80(A), 47(P), ChiSq: 134, Time: 00:00:00 00:00:00
40000 of 50000, Atoms: 69(A), 46(P), ChiSq: 101, Time: 00:00:00 00:00:00
50000 of 50000, Atoms: 76(A), 53(P), ChiSq: 132, Time: 00:00:00 00:00:00

-- Sampling Phase --

10000 of 50000, Atoms: 82(A), 52(P), ChiSq: 94, Time: 00:00:00 00:00:00
20000 of 50000, Atoms: 74(A), 54(P), ChiSq: 144, Time: 00:00:01 00:00:01
30000 of 50000, Atoms: 79(A), 47(P), ChiSq: 116, Time: 00:00:01 00:00:01
40000 of 50000, Atoms: 79(A), 46(P), ChiSq: 132, Time: 00:00:01 00:00:01
50000 of 50000, Atoms: 76(A), 48(P), ChiSq: 124, Time: 00:00:01 00:00:01
```

This means that the underlying C++ library has run correctly, and
everything is installed how it should be.

We now examine the result object.

## Analyzing the Toy Data CoGAPS result

CoGAPS returns a object of the class `CogapsResult` which inherits from
`LinearEmbeddingMatrix` (defined in the `SingleCellExperiment` package).
CoGAPS stores the lower dimensional representation of the samples (P
matrix) in the `sampleFactors` slot and the weight of the features (A
matrix) in the `featureLoadings` slot. `CogapsResult` also adds two of
its own slots - `factorStdDev` and `loadingStdDev` which contain the
standard deviation across sample points for each matrix.

There is also some information in the `metadata` slot such as the
original parameters and value for the Chi-Sq statistic. In general, the
metadata will vary depending on how `CoGAPS` was called in the first
place. The package provides these functions for querying the metadata in
a safe manner:

```{r modsim result}
cogapsresult
cogapsresult@sampleFactors
cogapsresult@featureLoadings

# check reference result:
data('modsimresult')
```

If both matrices--sampleFactors and featureLoadings--have reasonable
values (small, nonnegative, somewhat random-seeming), it is an
indication that CoGAPS is working as expected.

We now continue with single-cell analysis.

# Single-cell CoGAPS

```{r load single cell data from zenodo}
# OPTION: download data object from Zenodo
options(timeout=50000) # adjust this if you're getting timeout downloading the file
bfc <- BiocFileCache::BiocFileCache()
pdac_url <- "https://zenodo.org/record/7709664/files/inputdata.Rds"
pdac_data <- BiocFileCache::bfcrpath(bfc, pdac_url)
pdac_data <- readRDS(pdac_data)
pdac_data
```

We also want to extract the counts matrix to provide directly to CoGAPS

```{r extract counts matrix, eval=FALSE}
pdac_epi_counts <- as.matrix(pdac_data@assays$originalexp@counts)
norm_pdac_epi_counts <- log1p(pdac_epi_counts)

head(pdac_epi_counts, n=c(5L, 2L))
head(norm_pdac_epi_counts, n=c(5L, 2L))
```

Most of the time we will set some parameters before running CoGAPS.
Parameters are managed with a CogapsParams object. This object will
store all parameters needed to run CoGAPS and provides a simple
interface for viewing and setting the parameter values.

```{r set params}
library(CoGAPS)
pdac_params <- CogapsParams(nIterations=50000, # 50000 iterations
               	seed=42, # for consistency across stochastic runs
               	nPatterns=8, # each thread will learn 8 patterns
                sparseOptimization=TRUE, # optimize for sparse data
                distributed="genome-wide") # parallelize across sets

pdac_params
```

If you wish to run distributed CoGAPS, which is recommended to improve
the computational efficiency for most large datasets, you must also call
the setDistributedParams function. For a complete description of the
parallelization strategy used in distributed CoGAPS, please refer to our
preprint: <https://www.biorxiv.org/content/10.1101/2022.07.09.499398v1>

```{r distributed params}
pdac_params <- setDistributedParams(pdac_params, nSets=7)
pdac_params
```

With all parameters set, we are now ready to run CoGAPS. Please note
that this is the most time-consuming step of the procedure. Timing can
take several hours and scales nlog(n) based on dataset size, as well as
the parameter values set for 'nPatterns' and 'nIterations'. Time is
increased when learning more patterns, when running more iterations, and
when running a larger dataset, with iterations having the largest
variable impact on the runtime of the NMF function.

```{r run CoGAPS on single-cell data, eval=FALSE}
startTime <- Sys.time()
  
pdac_epi_result <- CoGAPS(pdac_epi_counts, pdac_params)
endTime <- Sys.time()

saveRDS(pdac_epi_result, "../data/pdac_epi_cogaps_result.Rds")

# To save as a .csv file, use the following line:
toCSV(pdac_epi_result, "../data")
```

# Analyzing the CoGAPS result

Now that the CoGAPS run is complete, learned patterns can be
investigated. Due to the stochastic nature of the MCMC sampling in
CoGAPS and long run time, it is generally a good idea to immediately
save your CoGAPS result object to a file to have (Box 17), then read it
in for downstream analysis.

If you wish to load and examine a precomputed result object, please do
so by:

```{r load precomputed from zenodo}
# OPTION: download precomputed CoGAPS result object from Zenodo
#caches download of the hosted file
cogapsresult_url <- "https://zenodo.org/record/7709664/files/cogapsresult.Rds"
cogapsresult <- BiocFileCache::bfcrpath(bfc, cogapsresult_url)
cogapsresult <- readRDS(cogapsresult)

```

To load your own result, simply edit the file path:

```{r load saved, eval=FALSE}
library(CoGAPS)
cogapsresult <- readRDS("../data/pdac_epi_cogaps_result.Rds")
```

It is recommended to immediately visualize pattern weights on a UMAP
because you will immediately see whether they are showing strong signal
and make common sense.

Since pattern weights are all continuous and nonnegative, they can be
used to color a UMAP in the same way as one would color by gene
expression. The sampleFactors matrix is essentially just nPatterns
different annotations for each cell, and featureLoadings is likewise
just nPatterns annotations for each gene. This makes it very simple to
incorporate pattern data into any data structure and workflow.

## Load CoGAPS pattern information into Seurat object

To store CoGAPS patterns as an Assay within a Seurat object
(recommended):

```{r pattern assay, eval=FALSE}
library(Seurat)
# make sure pattern matrix is in same order as the input data
patterns_in_order <-t(cogapsresult@sampleFactors[colnames(pdac_data),])

# add CoGAPS patterns as an assay
pdac_data[["CoGAPS"]] <- CreateAssayObject(counts = patterns_in_order)
```

## Plot patterns on an embedding

With the help of Seurat's FeaturePlot function, we generate a UMAP
embedding of the cells colored by the intensity of each pattern.

```{r pattern UMAP, eval=FALSE}
DefaultAssay(pdac_data) <- "CoGAPS"
pattern_names = rownames(pdac_data@assays$CoGAPS)

library(viridis)
color_palette <- viridis(n=10)

FeaturePlot(pdac_data, pattern_names, cols=color_palette, reduction = "umap") & NoLegend()
```

# Pattern Markers
We provide a `patternMarkers()` CoGAPS function to find markers (i.e. genes or 
samples) most associated with each pattern. It returns a per-pattern
dictionary of markers, their ranking, and their distance "score" for each
pattern.

## Marker assignment methods
`patternMarkers()` can run in two marker selection modes, controlled with
the "threshold" parameter, which can be set to "all" or "cut". In both cases
the ranking of markers is essentially sorting Euclidian distances between a
candidate marker and a synthetic one-row matrix that represents an "ideal" 
pattern.

If `threshold="all"` (default), each gene is treated as a marker of exactly one pattern,
whichever it is most strongly associated with, based on the distance metric. 

For the `threshold="cut"`, the candidates are sorted by best intra-pattern rank
and are added to markers list unitil a certain rank is reached. This rank 
corresponds to the first occurence of intra-pattern rank being worse than the
inter-pattern rank. This mode usually yields much less markers per pattern, and 
the same marker can appear in multiple patterns.

## Axis selection
The `axis` parameter can be used to select the axis along which the markers are
selected. By default, `axis=1`, which means that the feature (i.e. gene) markers
are estimated. If `axis=2`, the sample markers are estimated. 

The three components of the returned dictionary pm are:

## `patternMarkers()` Output
Output is a list of three components:

PatternMarkers:

* A list of marker genes for each pattern

PatternRanks:

*   Each marker gene ranked by association for each pattern
*   Whole natural numbers, assigning each marker a place in the
    rank for each pattern.
*   Lower rank indicates higher association and vice versa.

PatternScores:

* Scores describe how strongly a marker gene is associated with a
  pattern.
* A higher score value indicates the marker gene is less associated
  with the pattern (as score is a distance) metric, and vice versa.
* Scores have nonnegative values mostly falling between 0 and 2

## Example
```{r pattern markers}
pm <- patternMarkers(cogapsresult)
```

# Pattern GSEA
One way to explore and use CoGAPS patterns is to conduct gene set enrichment analysis by functionally annotating the genes which are significant for each pattern. The getPatternGeneSet function provides a wrapper around the gsea and fora methods from fgsea for gene set enrichment in CoGAPS pattern amplitudes or gene set overrepresentation in pattern markers, respectively. Gene sets for testing are provided as a list to allow testing of gene sets from many sources, such as hallmark gene sets from [msigDB](https://www.gsea-msigdb.org/gsea/msigdb/) or gene ontology sets from [org.Hs.eg.db](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html).

To perform gene set overrepresentation on pattern markers, please run:

```{r get hallmarks}
hallmark_ls <- list(
  "HALLMARK_ALLOGRAFT_REJECTION" = c(
    "AARS1","ABCE1","ABI1","ACHE","ACVR2A","AKT1","APBB1","B2M","BCAT1","BCL10","BCL3","BRCA1","C2","CAPG","CARTPT","CCL11","CCL13","CCL19","CCL2","CCL22","CCL4","CCL5","CCL7","CCND2","CCND3","CCR1","CCR2","CCR5","CD1D","CD2","CD247","CD28","CD3D","CD3E","CD3G","CD4","CD40","CD40LG","CD47","CD7","CD74","CD79A","CD80","CD86","CD8A","CD8B","CD96","CDKN2A","CFP","CRTAM","CSF1","CSK","CTSS","CXCL13","CXCL9","CXCR3","DARS1","DEGS1","DYRK3","EGFR","EIF3A","EIF3D","EIF3J","EIF4G3","EIF5A","ELANE","ELF4","EREG","ETS1","F2","F2R","FAS","FASLG","FCGR2B","FGR","FLNA","FYB1","GALNT1","GBP2","GCNT1","GLMN","GPR65","GZMA","GZMB","HCLS1","HDAC9","HIF1A","HLA-A","HLA-DMA","HLA-DMB","HLA-DOA","HLA-DOB","HLA-DQA1","HLA-DRA","HLA-E","HLA-G","ICAM1","ICOSLG","IFNAR2","IFNG","IFNGR1","IFNGR2","IGSF6","IKBKB","IL10","IL11","IL12A","IL12B","IL12RB1","IL13","IL15","IL16","IL18","IL18RAP","IL1B","IL2","IL27RA","IL2RA","IL2RB","IL2RG","IL4","IL4R","IL6","IL7","IL9","INHBA","INHBB","IRF4","IRF7","IRF8","ITGAL","ITGB2","ITK","JAK2","KLRD1","KRT1","LCK","LCP2","LIF","LTB","LY75","LY86","LYN","MAP3K7","MAP4K1","MBL2","MMP9","MRPL3","MTIF2","NCF4","NCK1","NCR1","NLRP3","NME1","NOS2","NPM1","PF4","PRF1","PRKCB","PRKCG","PSMB10","PTPN6","PTPRC","RARS1","RIPK2","RPL39","RPL3L","RPL9","RPS19","RPS3A","RPS9","SIT1","SOCS1","SOCS5","SPI1","SRGN","ST8SIA4","STAB1","STAT1","STAT4","TAP1","TAP2","TAPBP","TGFB1","TGFB2","THY1","TIMP1","TLR1","TLR2","TLR3","TLR6","TNF","TPD52","TRAF2","TRAT1","UBE2D1","UBE2N","WARS1","WAS","ZAP70"
    ),
  "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION" = c(
    "ABI3BP","ACTA2","ADAM12","ANPEP","APLP1","AREG","BASP1","BDNF","BGN","BMP1","CADM1","CALD1","CALU","CAP2","CAPG","CD44","CD59","CDH11","CDH2","CDH6","COL11A1","COL12A1","COL16A1","COL1A1","COL1A2","COL3A1","COL4A1","COL4A2","COL5A1","COL5A2","COL5A3","COL6A2","COL6A3","COL7A1","COL8A2","COMP","COPA","CRLF1","CCN2","CTHRC1","CXCL1","CXCL12","CXCL6","CCN1","DAB2","DCN","DKK1","DPYSL3","DST","ECM1","ECM2","EDIL3","EFEMP2","ELN","EMP3","ENO2","FAP","FAS","FBLN1","FBLN2","FBLN5","FBN1","FBN2","FERMT2","FGF2","FLNA","FMOD","FN1","FOXC2","FSTL1","FSTL3","FUCA1","FZD8","GADD45A","GADD45B","GAS1","GEM","GJA1","GLIPR1","COLGALT1","GPC1","GPX7","GREM1","HTRA1","ID2","IGFBP2","IGFBP3","IGFBP4","IL15","IL32","IL6","CXCL8","INHBA","ITGA2","ITGA5","ITGAV","ITGB1","ITGB3","ITGB5","JUN","LAMA1","LAMA2","LAMA3","LAMC1","LAMC2","P3H1","LGALS1","LOX","LOXL1","LOXL2","LRP1","LRRC15","LUM","MAGEE1","MATN2","MATN3","MCM7","MEST","MFAP5","MGP","MMP1","MMP14","MMP2","MMP3","MSX1","MXRA5","MYL9","MYLK","NID2","NNMT","NOTCH2","NT5E","NTM","OXTR","PCOLCE","PCOLCE2","PDGFRB","PDLIM4","PFN2","PLAUR","PLOD1","PLOD2","PLOD3","PMEPA1","PMP22","POSTN","PPIB","PRRX1","PRSS2","PTHLH","PTX3","PVR","QSOX1","RGS4","RHOB","SAT1","SCG2","SDC1","SDC4","SERPINE1","SERPINE2","SERPINH1","SFRP1","SFRP4","SGCB","SGCD","SGCG","SLC6A8","SLIT2","SLIT3","SNAI2","SNTB1","SPARC","SPOCK1","SPP1","TAGLN","TFPI2","TGFB1","TGFBI","TGFBR3","TGM2","THBS1","THBS2","THY1","TIMP1","TIMP3","TNC","TNFAIP3","TNFRSF11B","TNFRSF12A","TPM1","TPM2","TPM4","VCAM1","VCAN","VEGFA","VEGFC","VIM","WIPF1","WNT5A"
  )
)

hallmarks_ora <- getPatternGeneSet(cogapsresult,
                                   gene.sets = hallmark_ls,
                                   method = "overrepresentation")
```

hallmarks is a list of data frames, each containing hallmark overrepresentation statistics corresponding to one pattern. 

To generate a barchart of the most significant hallmarks for any given pattern, please run:

```{r plot hallmarks}
pl_pattern7 <- plotPatternGeneSet(
  patterngeneset = hallmarks_ora, whichpattern = 7, padj_threshold = 0.05
)
pl_pattern7
```
To generate statistics on the association between certain sample groups and patterns, we provide a wrapper function, called MANOVA.This will allow us to explore if the patterns we have discovered lend to statistically significant differences in the sample groups. We will first load in the original data (if not already done earlier).

Then, create a new matrix called “interestedVariables” consisting of the metadata variables of interest in conducting analysis on. Lastly, call the wrapper function, passing in the result object as well. 

```{r MANOVA variables}
# create dataframe of interested variables 
interestedVariables <- cbind(pdac_data@meta.data[["nCount_RNA"]], pdac_data@meta.data[["nFeature_RNA"]])
# call cogaps manova wrapper
manova_results <- MANOVA(interestedVariables, cogapsresult)
```
The function will print out the MANOVA results for each pattern learned based on the variables of interest. From the output, we can observe that all p-values have a value of 0.0, indicating that differences observed in the sample groups based on the patterns are statistically significant. 

# Citing CoGAPS

If you use the CoGAPS package for your analysis, please cite
@FERTIG_2010

If you use the gene set statistic, please cite @OCHS_2009

# References