---
title: "Data-Mining Biomarkers and High-Dimensional Testing"
author: "Wilson Cai and [Nima Hejazi](https://nimahejazi.org)"
date: "`r Sys.Date()`"
bibliography: vignette-refs.bib
output:
  BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Data-Mining Biomarkers and High-Dimensional Testing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Introduction

The `adaptest` R package can be used to perform data-mining and high-dimensional
statistical tests that is common in differential expression studies. The package
utilizes a two-stage procedure:

1. The _data-mining_ stage: reduce the dimension of biomarkers based on the
   associations of biomarkers with an exposure variable.

2. The _multiple testing_ stage: adjust for multiple testing to control false
   positives.

In this vignette, we illustrate how to use `adaptest` to perform such analysis,
using a data set containing microarray expression measures.

---

## Data-mining stage

First, we load the `adaptest` package and the (included) `simpleArray` data set:

```{r setup_data, message=FALSE}
library(adaptest)
data(simpleArray)
set.seed(1234)
Y <- simulated_array
A <- simulated_treatment
```

In order to construct targeted minimum loss-based estimates for the purpose of
hypothesis testing with data-adaptive parameters, we need three separate data
structures:

1. _W_, baseline covariates that could potentially confound the association of
   biomarkers with the exposure of interest;
2. _A_, the point exposure of interest; and
3. _Y_, the biomarkers of interest.

All values in _A_ ought to be binarized, in order to avoid practical violations
of the assumption of positivity. To invoke the data-adaptive testing function
(`adaptest`), we also need to specify the number of top biomarkers `n_top` to
the data-mining algorithm, and the number of folds `n_fold` for cross-
validation. The smaller `n_top` is, the more selective data-mining algorithm we
have. The larger `n_fold` is, more folds are carried our in cross-validation.

The TMLE-based biomarker discovery process can be invoked using the `adaptest`
function. The procedure is quite resource-intensive because it evaluates the
association of each individual potential biomarker (of which there are $1000$ in
the included data set) with an exposure of interest, while accounting for
potential confounding based on all other covariates included in the design
matrix. We demonstrate the necessary syntax for calling `adaptest` below:

```{r adaptest_eval, message=FALSE}
adaptest_out <- adaptest(Y = Y,
                         A = A,
                         W = NULL,
                         n_top = 35,
                         n_fold = 5,
                         learning_library = c("SL.glm", "SL.mean"),
                         parameter_wrapper = adaptest::rank_DE,
                         absolute = FALSE,
                         negative = FALSE)
```

The output of `adaptest` is an object of class `data_adapt`, containing the
following major components:

1. `top_index`: (integer vector) - indices for the biomarkers selected by way of
   data-mining.
2. `top_colname`: (character vector) - names for the biomarkers selected by way
   of data-mining.
3. `top_colname_significant_q`: (character vector) - names for the biomarkers
   selected by way of data-mining, which are significant after multiple testing
   stage.
4. `DE`: (numeric vector) - differential expression effect sizes for the
   biomarkers in `top_colname`.
5. `p_value`: (numeric vector) - p-values for the biomarkers in `top_colname`.
6. `q_value`: (numeric vector) - q-values for the biomarkers in `top_colname`.
7. `significant_q`: (integer vector) - indices of `top_colname` which is
   significant after multiple testing stage.
8. `mean_rank_top`: (numeric vector) - average ranking across cross-validation
   folds for the biomarkers in `top_colname`.
9. `folds`: (`origami::folds class`) - object of cross-validation folds.

After invoking `adaptest`, the resultant `data_adapt` object will have the slots
described above completely populated. Note that simply calling `objects()` on an
object of class `data_adapt` will return more slots than those described above
-- many of these are auxiliary slots containing information that is likely not
of interest to the user. The $9$ slots given above contain information that
summarizes the findings of the data-adaptive hypothesis testing procedure.
What's more, even individually accessing these slots is inconvenient; thus, we
provide a `summary` method to allow the statistical results of this procedure to
be extracted easily. We demonstrate this below

```{r adaptest_summary, message=FALSE}
summary(adaptest_out)
```

In the table above, each column provides information about the results of the
data-adaptive hypothesis testing procedure. In particular, columns 2-4 provide
effect sizes, p-values, and q-values (after multiple testing correction) that
may be of interest in scientifically interpreting the findings of the procedure.

As our goal in this vignette is to describe the properties of the `adaptest`
software package and its operation, we omit a discussion of the statistical
methodology implemented in this R package. A fully detailed technical account of
the data-adaptive multiple testing procedure is available in @cai2018data,
presently available on the arxiv. For an introduction to statistical inference
procedures using data-adaptive target parameters, the interested reader is
directed to @hubbard2016statistical. For background on the Targeted Learning
methodology, as well as recent advances, the canonical references are
@vdl2011targeted and @vdl2018targeted.

---

## Interpret and Visualize Results

This package provides several interpretation methods that can be used to tabular
and visualize the results of the data-adaptive tests.

The `get_composition` method for a `adaptest` object will produce a table of
composition of each data-adaptive parameters that is significant after multiple
testing stage:

```{r get_comp_small}
get_composition(object = adaptest_out, type = "small")
```

Setting the argument `type = "big"` will instead produce a table of composition
of each data-adaptive parameters before multiple testing stage, so that there
are more columns. We omit running the code below since the output is large.

```{r get_comp_big, eval=FALSE}
# NOT RUN
get_composition(object = adaptest_out, type = "big")
```

The `plot` method for a `adaptest` object will produce two plots that help user
interpret the results. The first plot is a plot of sorted average CV-rank for
all the biomarkers in the original data set (`Y`). The second plot is a plot of
sorted q-values with labels corresponding to the indices of the data-adaptive
parameter (as returned in `get_composition`)

```{r plot}
plot(adaptest_out)
```

---

## Bioconductor Integration with `SummarizedExperiment`

Now, let's try to acquire a taste for how we would use these algorithmic tools
with objects common in computational biology -- we'll do this by performing the
same analysis we did above, but using the core `SummarizedExperiment` container
object and the popular `airway` data set as an example.

To start, let's load the required packages:

```{r sum_exp, message=FALSE}
library(SummarizedExperiment)
library(airway)
data(airway)
```

For simplicity, we'll restrict ourselves to just a random subset of the genes or
transcripts available from the _airway_ data set.

```{r augment_se}
genes_sub <- order(sample(seq_len(1000)))
air_reduced <- airway[genes_sub, ]
```

Generally, finding data-adaptive target parameters is a computationally and
data-intensive procedure, requiring a fairly __large sample size__. To work with
the relatively small `airway` data set, we'll simply augment the data by
_artificially_ doubling it in size in a naive manner:

```{r augment_airway}
simple_air <- cbind(air_reduced, air_reduced)
```

We'll now extract the variable of interest -- a `factor` label for whether the
unit received treatment or not -- and coerce it to a binary `numeric` vector:

```{r make_tx_var}
# use a binary treatment variable (must be 0/1 only)
dex_var = as.numeric(as.matrix(colData(simple_air))[, 3] - 1)
```

Now, we can perform the same data-adaptive analysis we discussed above on this
simplified version of the _airway_ data set, simply by calling the `bioadaptest`
wrapper function:

```{r bioadaptest}
airway_out <- bioadaptest(data_in = simple_air,
                          var_int = dex_var,
                          cntrl_set = NULL,
                          n_top = 5,
                          n_fold = 2,
                          parameter_wrapper = rank_DE)
```

---

## Session Information

```{r session_info, echo=FALSE}
sessionInfo()
```

---

## References