---
title: "biobroom Vignette"
author: "Andrew J. Bass, Emily Nelson, David Robinson and John D. Storey"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Vignette Title}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---
  
# About biobroom

The `biobroom` package contains methods for converting standard objects in
Bioconductor into a "tidy format". It serves as a complement to the popular [broom](https://github.com/dgrtwo/broom)
package, and follows the same division (`tidy`/`augment`/`glance`) of tidying
methods.

"Tidy data" is a data analysis paradigm that focuses on keeping data formatted as a single observation per row of a data table. For further information, please see [Hadley Wickham's seminal paper](http://vita.had.co.nz/papers/tidy-data.pdf) on the subject.  "Tidy" is not a normative statement about the quality of an object's structure. Rather, it is a technical specification about the choice of rows and columns. A tidied data frame is not "better" than an S4 object; it simply allows analysis with a different set of tools.

Tidying data makes it easy to recombine, reshape and visualize bioinformatics
analyses. Objects that can be tidied include

* `ExpressionSet` object, 
* `GRanges` and `GRangesList` objects,
* `RangedSummarizedExperiment` object,
* `MSnSet` object,
* per-gene differential expression tests from `limma`, `edgeR`, and `DESeq2`,
* `qvalue` object for multiple hypothesis testing.


We are currently working on adding more methods to existing Bioconductor
objects. If any bugs are found please contact the authors or visit our
[github page](https://github.com/StoreyLab/biobroom). Otherwise, any questions
can be answered on the [Bioconductor support site](https://support.bioconductor.org/).

# Installation

```{r global_options, include=FALSE, echo=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=6, out.width='700in', out.height='350in', 
                      echo=TRUE, warning=FALSE, message=FALSE, cache=FALSE, dev='png')
```

The `biobroom` package can be installed by typing in a R terminal:

```{r eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("biobroom")
```

To find out more about the provided objects:

```{r eval=FALSE}
library(biobroom)
?edgeR_tidiers
?DESeq2_tidiers
?limma_tidiers
?ExpressionSet_tidiers
?MSnSet_tidiers
?qvalue_tidiers
```

# Examples

## qvalue object

The [qvalue package](https://www.bioconductor.org/packages/release/bioc/html/qvalue.html)
is a popular package to estimate q-values and local false discovery rates. To get started,
we can load the `hedenfalk` dataset included in the `qvalue` package:
```{r}
library(qvalue)
data(hedenfalk)

qobj <- qvalue(hedenfalk$p)
names(qobj)
```

`qobj` is a `qvalue` object, generated from the p-values contained in the `hedenfalk` dataset. If we wanted to use a package such as `dplyr` or `ggplot`,
we would need to convert the results into a data frame. The `biobroom` package
makes this conversion easy by using the `tidy`, `augment` and `glance` functions:

- `tidy` returns one row for each choice of the tuning parameter lambda.
- `augment` returns one row for each provided p-value, including the computed q-value and local false discovery rate.
- `glance` returns a single row containing the estimated `pi0`.

Applying these functions to `qobj`:  

```{r}
library(biobroom)
# use tidy/augment/glance to restructure the qvalue object
head(tidy(qobj))
head(augment(qobj))
head(glance(qobj))
```

The original data, or in this example the gene names, can be inputted into `augment` using the `data` argument:
```{r}
# create sample names
df <- data.frame(gene = 1:length(hedenfalk$p))
head(augment(qobj, data = df))
```
The tidied data can be used to easily create plots:

```{r}
library(ggplot2)
# use augmented data to compare p-values to q-values
ggplot(augment(qobj), aes(p.value, q.value)) + geom_point() +
  ggtitle("Simulated P-values versus Computed Q-values") + theme_bw()
```

Additionally, we can extract out important information such as significant
genes under a false discovery rate threshold:
```{r}
library(dplyr)

# Find significant genes under 0.05 threshold
sig.genes <- augment(qobj) %>% filter(q.value < 0.05)
head(sig.genes)
```

## DESeq2 objects

To demonstrate tidying on `DESeq2` objects we have used the published `airway`
RNA-Seq experiment, available as a package from *Bioconductor*:

```{r eval=FALSE}
source("https://bioconductor.org/biocLite.R")
biocLite("airway")
```

Import the `airway` dataset:

```{r}
library(DESeq2)
library(airway)

data(airway)
airway_se <- airway
```

`airway_se` is a `SummarizedExperiment` object, which is a type of object used by the `DESeq2` package. Next, we create a `DESeqDataSet` object
and show the output of tidying this object:

```{r}
airway_dds <- DESeqDataSet(airway_se, design = ~cell + dex)

head(tidy(airway_dds))
```

Only the gene counts are outputted since there has been no analysis performed.
We perform an analysis on the data and then `tidy` the resulting object:
```{r}
# differential expression analysis
deseq <- DESeq(airway_dds)
results <- results(deseq)
# tidy results
tidy_results <- tidy(results)
head(tidy_results)
```

As an example to show how easy it is to manipulate the resulting object,
`tidy_results`, we can use `ggplot2` to create a volcano plot of the p-values: 

```{r}
ggplot(tidy_results, aes(x=estimate, y=log(p.value),
                         color=log(baseMean))) + geom_point(alpha=0.5) +
  ggtitle("Volcano Plot For Airway Data via DESeq2") + theme_bw()
```

##edgeR objects

Here we use the  `hammer` dataset included in `biobroom` package. `edgeR` can be used to perform differential
expression analysis as follows:

```{r}
library(edgeR)
data(hammer)

hammer.counts <- Biobase::exprs(hammer)[, 1:4]
hammer.treatment <- Biobase::phenoData(hammer)$protocol[1:4]

y <- DGEList(counts=hammer.counts,group=hammer.treatment)
y <- calcNormFactors(y)
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)
et <- exactTest(y)
```

The results of the analysis are stored in `et`, which is an `DGEExact` object. We can `tidy` this object using `biobroom`:

```{r}
head(tidy(et))
```

`glance` shows a summary of the experiment: the number of genes found significant (at a specified `alpha`), and which contrasts were compared to get the results.

```{r}
glance(et, alpha = 0.05)
```

Additionally, we can can easily manipulate the resulting object and create
a volcano plot of the p-values using `ggplot2`:
```{r}
ggplot(tidy(et), aes(x=estimate, y=log(p.value), color=logCPM)) +
  geom_point(alpha=0.5) + ggtitle("Volcano Plot for Hammer Data via EdgeR") +
  theme_bw()
```

##limma objects

To demonstrate how `biobroom` works with `limma` objects, we generate some simulated data to test the tidier for `limma` objects. 

```{r}
# create random data and design
dat <- matrix(rnorm(1000), ncol=4)
dat[, 1:2] <- dat[, 1:2] + .5  # add an effect
rownames(dat) <- paste0("g", 1:nrow(dat))
des <- data.frame(treatment = c("a", "a", "b", "b"),
                  confounding = rnorm(4))
```

We then use `lmFit` and `eBayes` (functions included in `limma`) to fit a linear model and use `tidy` to convert the
resulting object into tidy format:

```{r}
lfit <- lmFit(dat, model.matrix(~ treatment + confounding, des))
eb <- eBayes(lfit)

head(tidy(lfit))
head(tidy(eb))
```

Analysis can easily be performed from the tidied data. The package `ggplot2` can
be used to make a volcano plot of the p-values:

```{r}
ggplot(tidy(eb), aes(x=estimate, y=log(p.value), color=statistic)) + 
  geom_point() + ggtitle("Nested Volcano Plots for Simulated Data Processed with limma") +
  theme_bw()
  
```


##ExpressionSet objects

`tidy` can also be run directly on `ExpressionSet` objects, as described in another popular `Bioconductor` package `Biobase.` The
`hammer` dataset we used above (which is included in the `biobroom` package) is an `ExpressionSet` object, so we'll use that to demonstrate.

```{r}
library(Biobase)

head(tidy(hammer))
```

We can add the phenotype information by using the argument `addPheno`:
```{r}
head(tidy(hammer, addPheno = TRUE))
```

Now we can easily visualize the distribution of counts for each protocol by using
`ggplot2`:

```{r}
ggplot(tidy(hammer, addPheno=TRUE), aes(x=protocol, y=log(value))) +
  geom_boxplot() + ggtitle("Boxplot Showing Effect of Protocol on Expression")
```

##MSnSet Objects

`tidy` can also be run directly on `MSnSet` objects from `MSnbase`, which as specialised containers for quantitative proteomics data.

```{r}
library(MSnbase)
data(msnset)

head(tidy(msnset))

head(tidy(msnset, addPheno = TRUE))
```

# Note on returned values

All *biobroom* `tidy` and `augment` methods return a `tbl_df` by default (this
prevents them from printing many rows at once, while still acting like a
traditional `data.frame`). To change this to a `data.frame` or `data.table`,
you can set the `biobroom.return` option:

```{r eval=FALSE}
options(biobroom.return = "data.frame")
options(biobroom.return = "data.table")
```