---
title: "multistateQTL: Orchestrating multi-state QTL analysis in R"
author:
  - "Christina B Azodi"
  - "Davis McCarthy"
  - "Amelia Dunstone"
package: multistateQTL
date: "`r Sys.Date()`"
output:
    BiocStyle::html_document:
        toc: true
        toc_float: true
vignette: >
  \usepackage[utf8]{inputenc}
  %\VignetteIndexEntry{multistateQTL: Orchestrating multi-state QTL analysis in R}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r knitr-options, echo=FALSE}
set.seed(42)
knitr::opts_chunk$set(
    warning=FALSE, error=FALSE, message=FALSE, 
    fig.height=6, fig.width=8)
```

# Introduction

`multistateQTL` is a Bioconductor package for applying basic statistical tests (e.g., feature-wise FDR correction, calculating pairwise sharing), summarizing, and visualizing QTL summary statistics from multiple states (e.g., tissues, celltypes, environmental conditions). It works on the `QTLExperiment` (`QTLE`) object class, where rows represent features (e.g., genes, transcripts, genomic regions), columns represent states, and assays are the various summary statistics. It also provides wrapper implementations of a number of multi-test correction methods (e.g., [mashr](https://github.com/stephenslab/mashr), [meta-soft](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3146723/), etc), which result in a set of multi-test corrected summary statistics. 

## Installation

QTLExperiment and multistateQTL can be installed from GitHub:

```
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")

BiocManager::install(c("QTLExperiment", "multistateQTL"), version="devel")
```

They are also available on GitHub:
```
devtools::install_git("https://github.com/dunstone-a/QTLExperiment", build_vignettes=TRUE)
devtools::install_git(https://github.com/dunstone-a/multistateQTL", build_vignettes=TRUE)
```

```{r}
library(QTLExperiment)
library(multistateQTL)
```

# Simulating data

## Estimate parameters from GTEx

Provided with real QTL summary statistics as either a `QTLE` object or a named list with betas and error (and optionally pval or lfsr), key parameters are estimated that are used to simulate realistic multi-state QTL data. We demonstrate the parameter estimation function on publicly available summary statistics from GTEx (v8). Note that this data only contains tests that were called as significant by GTEx and vroom only loads the first chunk of results as it does not read .gz compressed objects well. This truncated dataset is used in the vignette for convenience, however, to estimate the default parameters in `qtleParams()` we downloaded QTL summary statistics for all associations tested for the 10 GTEx tissues with the largest sample sizes from [Google Cloud](https://console.cloud.google.com/storage/browser/gtex-resources/GTEx_Analysis_v8_QTLs/GTEx_Analysis_v8_eQTL_all_associations?pageState=(%22StorageObjectListTable%22%3A(%22f%22%3A%22%255B%255D%22))&prefix&forceOnObjectsSortingFiltering=false&cloudshell=true)). To speed up calculations we filtered this to include only associations on chromosome 1 and considered significant tests with pval < 0.05 and null tests with a pval > 0.1. 

See [QTLExperiment](https://github.com/dunstone-a/QTLExperiment) for more info on the `sumstats2qtle` function and for other approaches for reading in QTL summary statistics.

The parameters estimated here include:

- **Significant betas shape and rate:** Define the gamma distribution used to sample *mean effect sizes* for each QTL that is significant in at least one state. These simulated mean effect sizes are used as the mean parameter in `rnorm` to sample an effect size for each QTL for each state. The variance parameter for `rnorm` is user defined (default = 0.1).  
- **Significant coefficient of variation (cv) shape and rate:** Define the gamma distribution used to sample the cv for each QTL in each state where that QTL is significant. The cv is multiplied by the simulated significant effect size for that test/state to get the simulated standard error values. 
- **Null beta shape and rate:** Define the gamma distribution used to sample the effect sizes for each QTL in each state where the effect is not significant.
- **Null beta cv shape and rate:** Define the gamma distribution used to sample the cv for each QTL in each state where that QTL is not significant. This cv is then multiplied by the simulated null effect size for that test/state to get the simulated stand error values.


```{r demo-sumstats2qtle}
input_path <- system.file("extdata", package="multistateQTL")
state <- c("lung", "thyroid", "spleen", "blood")

input <- data.frame(
    state=state, 
    path=paste0(input_path, "/GTEx_tx_", state, ".tsv"))

gtex <- sumstats2qtle(
    input, 
    feature_id="molecular_trait_id",
    variant_id="rsid", 
    betas="beta", 
    errors="se",
    pvalues="pvalue", 
    verbose=TRUE)
gtex
head(betas(gtex))
```

Estimating parameters: 

```{r estimate-parameters}
params <- qtleEstimate(gtex, threshSig=0.05, threshNull=0.5)
params
```

Looking at the distributions defined by these estimated parameters, the simulated effect sizes for significant QTL will tend to be larger, while the simulated coefficient of variation values will be smaller than for the non-significant QTL.

```{r plot-estimated params, fig.height=3, fig.width=6, fig.cap="Gamma distributions defined by the parameters estimated by qtleEstimate."}
plotSimulationParams(params=params)
```

The default parameters available through `qtleParams()` were estimated from the GTEx v8 tissue-level eQTL summary statistics from chromosome 1 using the 10 tissues with the largest sample sizes. From these data, significant QTL parameters were estimated from tests in the lowest p-value quantile, while null parameters were estimated from tests in the highest p-value quantile. Data for tests on chromosome 1 were included in all four tissues (n=32613).


## Simulate multi-state QTL data

The simulation tool allows for the simulation of four types of associations: (1) Global, where the simulated effect size is approximately equal across all states; (2) Unique, where the association is only significant in one state; (3) Multi-state, where the association is significant in a subset of states (i.e., state-groups), and (4) Null, where the association has no significant effects in any state. First each test is randomly assigned as one of the above types according to the proportions specified by the user. For multi-state QTL, each state is assigned to a state-group, either randomly or according to user defined groups, then each multi-state QTL is assigned randomly to one of the state-groups. For unique QTL, the QTL is randomly assigned to a single state. 

Simulated mean effect sizes for all non-null QTL are sampled from gamma(beta.sig.shape, beta.sig.rate) and are randomly assigned a positive or negative effect direction. Then for each state where that QTL is significant, an effect size is sampled from N(mean effect size, σ), where σ is user defined (default=0.1). Effect sizes for null QTL are sampled from gamma(beta.null.shape, beta.null.rate) and are randomly assigned a positive or negative effect direction. Standard errors for each QTL for each state are simulated by sampling from gamma(cv.sig.shape, cv.sig.rate) or gamma(cv.null.shape, cv.null.rate) for significant and null QTL, respectively, and multiplying the sampled cv by the absolute value of the simulated beta for that QTL in that state. 

Here is an example of a simple simulation with half of the simulated QTL tests having globally significant effects. This example uses the default parameters.

```{r basic-simulation}
sim <- qtleSimulate(nTests=1000, nStates=6, global=0.5)
sim
```

```{r basic-simulation-key}
head(rowData(sim))
```

We can also generate more complex simulations, for example this simulation has 20% global, 40% multi-state, 20% unique, and 20% null QTL effects, where multi-state effects are assigned to one of two state-groups.

```{r complex-simulation}
sim <- qtleSimulate(
    nStates=10, nFeatures=100, nTests=1000,
    global=0.2, multi=0.4, unique=0.2, k=2)
```


Here is a snapshot of the simulation key for QTL simulated as unique to a single state: 

```{r snapshot-unique}
head(rowData(subset(sim, QTL == "unique")))
```

Here is a snapshot of the simulation key for QTL simulated as multi-state: 

```{r snapshot-multistate}
head(rowData(subset(sim, QTL == "multistate")))

message("Number of QTL specific to each state-group:")
table(rowData(subset(sim, QTL == "multistate"))$multistateGroup)
```


# Dealing with missing data

The multistateQTL toolkit provides two functions to help deal with missing data, `getComplete` and `replaceNAs`. The `getComplete` function is a smart subsetting function that remove QTL associations (rows) with more than an allowed amount of missing data. The `replaceNAs` function allows for NAs in each assay to be replaced with a constant or with the row mean or row median. For example, here is a snapshot of our simulated data from above with added NAs:  

```{r add-NAs}
na_pattern <- sample(seq(1, ncol(sim)*nrow(sim)), 1000)

sim_na <- sim
assay(sim_na, "betas")[na_pattern] <- NA
assay(sim_na, "errors")[na_pattern] <- NA
assay(sim_na, "lfsrs")[na_pattern] <- NA

message("Number of simulated tests: ", nrow(sim_na))
head(betas(sim_na))
```

First we can use `getComplete` to keep only the tests that have data for at least half of the states: 

```{r remove-50p-missing}
sim_na <- getComplete(sim_na, n=0.5, verbose=TRUE)
```

Then for the remaining QTL, we can fill in the missing values using the following scheme

```{r fill-in-missing}
sim_na <- replaceNAs(sim_na, verbose=TRUE)

head(betas(sim_na))
```


# Calling significance

The multistateQTL toolkit also provides the `callSignificance` function, which calls QTL tests significant in each state using either a single or two-step threshold approach. For example, we can set a single lfsr threshold of 0.1 to call significance of our simulate QTL: 

```{r simple-significance-calling}
sim <- callSignificance(sim, assay="lfsrs", thresh=0.05)

message("Median number of significant tests per state: ", 
    median(colData(sim)$nSignificant))
```

Because we have the simulated ground-truth, we can compare these significance calls to what was simulated using the `simPerformance` function, which provides the following global (i.e. across all state) performance metrics:

```{r simple-sig-calling-performance}
sim <- callSignificance(sim, assay="lfsrs", thresh=0.001)
perf_metrics <- simPerformance(sim)
lapply(perf_metrics, FUN=function(x) {round(x, 2)})
```

As you can see the recall of TRUE significant QTL is quite low. However if we change our significance calling approach to be more flexible. 

```{r state-by-state-double-thresholds}
sim <- callSignificance(
    sim, mode="simple", assay="lfsrs",
    thresh=0.0001, secondThresh=0.0002)
simPerformance(sim)$recall
```


# Plotting global patterns of sharing

The `multistateQTL` package contains five functions for visualising multi-state eQTL data. 
These functions are based on the `r BiocStyle::CRANpkg("ggplot2")`  and `r BiocStyle::Biocpkg("ComplexHeatmap")` R packages.

- `plotPairwiseSharing`: based on `ComplexHeatmap::Heatmap`
- `plotQTLClusters`: based on `ComplexHeatmap::Heatmap`
- `plotUpSet`: based on `ComplexHeatmap::UpSet`
- `plotCompareStates`: based on `ggplot2`
- `plotSimulationParams`: based on `ggplot2`

The functions built on `ggplot2` are compatible with `ggplot2` syntax such as the `+` operator.

## Pairwise sharing

The function `plotPairwiseSharing` shows the degree of pairwise sharing of significant hits for each combination of two states. 
Column annotations can be added by specifying a valid column name from the `colData` of the object. 

In the plot below, columns are ordered by the broad cell type (`multistateGroup`) of the states. 
We expect states belonging to the same multi-state group to be more related and have a greater degree of sharing of significant eQTLs. 
Column annotations are used to show the number of significant eQTLs for each state.


```{r pairwise-sharing}
sim_sig <- getSignificant(sim)
sim_top <- getTopHits(sim_sig, assay="lfsrs", mode="state")
sim_top <- runPairwiseSharing(sim_top)
p1 <- plotPairwiseSharing(sim_top, annotate_cols=c("nSignificant", "multistateGroup"))
```


## Upset plots

These plots show the set of tests that are significant, but not necessarily shared, by groups of states. 

```{r upset-plot}
plotUpSet(sim_top, annotateColsBy=c("nSignificant", "multistateGroup"))
```


# Characterizing multi-state QTL patterns

## Categorizing multi-state QTL tests

Once multi-state test correction is performed, you will want to identify global, multi-state, and unique QTL. 

Note that `plotCompareStates` returns a list with a `ggplot2` object as the first element and a `table` as the second element. 
These can be accessed using the names "plot" or "counts".

```{r get-significant-simple}
sim_top <- runTestMetrics(sim_top)

plotCompareStates(sim_top, x="S01", y="S02")
    
table(rowData(sim_top)$qtl_type)

hist(rowData(sim_top)$nSignificant)
```

## Visualizing multi-state QTL

The function `plotQTLClusters` can be used to produce a heatmap of the eQTL betas values for each state. 
Each row is a SNP-gene pair, and columns are states. 
Row and column annotations can be added by naming column names from the `rowData` or `colData` of the input `QTLExperiment` object.

```{r plot-multistate-QTL}
sim_top_ms <- subset(sim_top, qtl_type_simple == "multistate")

plotQTLClusters(
    sim_top_ms, 
    annotateColsBy=c("multistateGroup"),
    annotateRowsBy=c("qtl_type", "mean_beta", "QTL"))
```


# Session Info

```{r session-info}
sessionInfo()
```