---
title: 'Alternatives to data metrics object'
author:
- name: Lindsay Rutter
date: '`r Sys.Date()`'
package: bigPint
bibliography: bigPint.bib
output:
  BiocStyle::html_document:
    toc_float: true
    tidy: TRUE
vignette: >
  \usepackage[utf8]{inputenc}
  %\VignetteIndexEntry{"Alternatives to data metrics object"}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignettePackage{bigPint}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo=TRUE)
```

## Options for superimposing data subsets

The `bigPint` package allows users to superimpose a subset of the data onto the full dataset. When `bigPint` is applied to RNA-seq data, this subset often consists of differentially expressed genes (DEGs). Except for the `plotSMApp()` function, all functions in the package offer users three options for superimposing data subsets. We briefly discuss these options below.

____________________________________________________________________________________

## Option 1: Data metrics object

The `dataMetrics` input is NULL by default in `bigPint` package functions. However, a user can create a `dataMetrics` object as explained in the article [Creating data metrics object](https://lindsayrutter.github.io/bigPint/articles/createDataMetrics.html). If a user does supply a `dataMetrics` object, then two other input parameters come into play, `threshVar` and `threshVal`. These two inputs are used to create the data subset from the `dataMetrics` input. Below are their definitions in the help files of `bigPint` package functions:

- threshVar `CHARACTER STRING` Name of column in dataMetrics object used to threshold significance; default "FDR"
- threshVal `INTEGER` Maximum value to threshold significance from threshVar object; default 0.05

Below is an example of superimposing genes that have an FDR value less than 1e-10 [@soybeanIR].

```{r, eval=TRUE, include=TRUE, message=FALSE}
library(bigPint)
data("soybean_ir_sub")
data("soybean_ir_sub_metrics")
soybean_ir_sub[,-1] <- log(soybean_ir_sub[,-1] + 1)
ret <- plotSM(data=soybean_ir_sub, dataMetrics=soybean_ir_sub_metrics,
  threshVar = "FDR", threshVal = 1e-10, pointSize = 0.1, saveFile = FALSE)
ret[["N_P"]]
```
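
Conceptually, the subset superimposed in the example above consists of the IDs whose `threshVar` value falls below `threshVal`. The chunk below is only a rough sketch of that idea: the `toyMetrics` data frame and its values are invented for illustration, and the strict `<` comparison is an assumption rather than a statement about the `bigPint` internals.

```{r, eval=TRUE, include=TRUE}
# Hypothetical metrics table for one treatment pair (values are invented)
toyMetrics <- data.frame(
  ID = c("gene1", "gene2", "gene3", "gene4"),
  FDR = c(1e-12, 0.03, 0.20, 1e-11),
  stringsAsFactors = FALSE
)
threshVar <- "FDR"
threshVal <- 1e-10
# Keep the IDs whose threshVar value is below threshVal (strict "<" assumed)
subsetIDs <- toyMetrics$ID[toyMetrics[[threshVar]] < threshVal]
subsetIDs
```

Here only `gene1` and `gene4` pass the threshold, so they would be the points drawn on top of the full dataset.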

____________________________________________________________________________________

## Option 2: Gene List object

Alternatively, we can use the `geneList` input object to superimpose a subset of the data onto the full data frame. The `geneList` object is NULL by default. However, the user can set it equal to the list of IDs that should be superimposed. For example, we can produce the same plot as above with the following code.

```{r, eval=TRUE, include=TRUE, message=FALSE, warning=FALSE}
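library(dplyr)
# Keep the IDs in the N_P element that have an FDR below 1e-10
sigGenes <- soybean_ir_sub_metrics[["N_P"]] %>% filter(FDR < 1e-10) %>% select(ID)
# Extract the ID column as a vector to pass as geneList
sigGenes <- sigGenes[,1]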
ret <- plotSM(data=soybean_ir_sub, geneList = sigGenes, pointSize = 0.1,
  saveFile = FALSE)
ret[["N_P"]]
```

We note that the `geneList` object is more flexible than the `dataMetrics` object. This is because the `dataMetrics` object can only create the subset of data by thresholding one quantitative variable, whereas the `geneList` object can be created in many more ways. For example, below we examine genes that have an FDR value less than 1e-10 and an absolute log fold change greater than 6.

```{r, eval=TRUE, include=TRUE}
library(dplyr)
# Require both a small FDR and a large absolute log fold change
sigGenes <- soybean_ir_sub_metrics[["N_P"]] %>% filter(FDR < 1e-10) %>%
  filter(abs(logFC) > 6) %>% select(ID)
sigGenes <- sigGenes[,1]
ret <- plotSM(data=soybean_ir_sub, geneList = sigGenes, pointSize = 0.5,
  pointColor = "magenta", saveFile = FALSE)
ret[["N_P"]]
```
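
The same flexibility extends to criteria that are not simple thresholds on one variable. As a hedged illustration, the sketch below builds two ID vectors from a made-up `toyMetrics` table (its values are invented, not taken from the soybean data): the three IDs with the smallest FDR, and the IDs that are both significant and upregulated.

```{r, eval=TRUE, include=TRUE}
# Hypothetical metrics table (values are invented for illustration)
toyMetrics <- data.frame(
  ID = paste0("gene", 1:6),
  FDR = c(0.04, 1e-12, 0.5, 1e-8, 1e-15, 0.2),
  logFC = c(1.2, -7.1, 0.3, 6.5, 8.0, -0.9),
  stringsAsFactors = FALSE
)
# Rank rows by FDR and keep the IDs of the three smallest values
topGenes <- head(toyMetrics$ID[order(toyMetrics$FDR)], 3)
topGenes
# Keep IDs that pass the FDR cutoff and have a positive log fold change
upGenes <- toyMetrics$ID[toyMetrics$FDR < 1e-10 & toyMetrics$logFC > 0]
upGenes
```

Either vector could be supplied as the `geneList` argument in the same way as `sigGenes` above.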

Because `geneList` is the more flexible option, it takes priority: if both `dataMetrics` and `geneList` are set to non-NULL values, then `geneList` will be used and `dataMetrics` will be ignored.
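
The precedence between the two inputs can be summarized with a small sketch. The helper `pickSubsetSource()` below is hypothetical and not part of `bigPint`; it only mirrors the rule stated above and makes no claim about how the package implements it.

```{r, eval=TRUE, include=TRUE}
# Hypothetical helper (not part of bigPint) mirroring the stated precedence
pickSubsetSource <- function(dataMetrics = NULL, geneList = NULL) {
  if (!is.null(geneList)) {
    "geneList"      # geneList wins whenever it is supplied
  } else if (!is.null(dataMetrics)) {
    "dataMetrics"   # used only when geneList is NULL
  } else {
    "none"          # both NULL: no subset is superimposed
  }
}
# Both inputs supplied: the gene list is the one that is used
pickSubsetSource(dataMetrics = list(), geneList = c("gene1"))
```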

____________________________________________________________________________________

## Option 3: No superimposing

The last possibility is to leave both `dataMetrics` and `geneList` at their default NULL values. This allows a user to examine the distribution of the full dataset without superimposing any subset of data. We end with an example of this approach.

```{r, eval=TRUE, include=TRUE}
ret <- plotSM(data=soybean_ir_sub, saveFile = FALSE)
ret[["N_P"]]
```

____________________________________________________________________________________

## References