%\VignetteIndexEntry{erccdashboard examples}

\documentclass{article}
\usepackage{fullpage}
\begin{document}
\SweaveOpts{concordance=TRUE}
\SweaveOpts{keep.source=TRUE, split=FALSE}
\SweaveOpts{width=7, height=7}
\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=1em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=1em}
\DeclareGraphicsExtensions{.png,.pdf,.jpg}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
<<echo=false>>=
options(width=60, continue = "  ")
#options(SweaveHooks=list(fig=function()
#                par(mar=c(5.1,4.1,1.1,2.1))))
library( "erccdashboard" )
@    
\title{{\tt erccdashboard} Package Vignette}
\author{Sarah A. Munro}
\maketitle

This vignette describes the use of the erccdashboard R package to analyze 
External RNA Controls Consortium (ERCC) spike-in control ratio mixtures in gene
expression experiments. If you use this package for method validation of your
gene expression experiments please cite our manuscript that describes this R 
package using citation("erccdashboard").

In this vignette we demonstrate analysis of two types of gene expression 
experiments from the SEQC project that used ERCC control ratio mixture 
spike-ins:
\begin{itemize}
  \item Rat toxicogenomics methimazole-treated and control samples
  \item Human reference RNA samples from the MAQC I project, Universal Human 
  Reference RNA (UHRR) and Human Brain Reference RNA (HBRR)
\end{itemize}

A subset of the large data set produced in the SEQC study are provided here as 
examples. The three sets of example data are: 

\begin{enumerate}
  \item Rat toxicogenomics RNA-Seq gene expression count data
  \item UHRR/HBRR RNA-Seq gene expression count data
  \item UHRR/HBRR Microarray gene expression fluorescent intensity data
\end{enumerate}


\section{Rat Toxicogenomics Example: MET (methimazole treatment) and CTL 
(control) Experiment}
\subsection{Load data and define input parameters} 

Load the package gene expression data.
<<loadExampleData, keep.source=TRUE>>=
data(SEQC.Example)
@

The R workspace should now contain 5 objects
Three of these objects are gene expression experiment expression measures:
\begin{itemize}
  \item UHRR.HBRR.arrayDat - Fluorescent signal data from an Illumina 
        beadarray microarray experiment with UHRR and HBRR in the SEQC 
        interlaboratory study
  \item MET.CTL.countDat - RNA-Seq count data from a
        rat toxicogenomics experiment
  \item UHRR.HBRR.countDat - RNA-Seq count data from Lab 5 in the SEQC
        interlaboratory study with UHRR and HBRR
\end{itemize}
The other two objects are vectors of total reads for the 2 sequencing 
experiments
\begin{itemize}
  \item MET.CTL.totalReads - total sequenced reads factors for 
        each column in the corresponding rat experiment count table
  \item UHRR.HBRR.totalReads - total sequenced reads factors for 
        each column in the corresponding UHRR/HBRR count table
\end{itemize}


\subsection{Quick analysis: runDashboard}
To run the default analysis function runDashboard on the MET-CTL rat 
toxicogenomics RNA-Seq experiment, the following input arguments are required:

<<defineInputData,keep.source=TRUE>>=
datType = "count" # "count" for RNA-Seq data, "array" for microarray data
isNorm = FALSE # flag to indicate if input expression measures are already
               # normalized, default is FALSE 
exTable = MET.CTL.countDat # the expression measure table
filenameRoot = "RatTox" # user defined filename prefix for results files
sample1Name = "MET" # name for sample 1 in the experiment
sample2Name = "CTL" # name for sample 2 in the experiment
erccmix = "RatioPair" # name of ERCC mixture design, "RatioPair" is default
erccdilution = 1/100 # dilution factor used for Ambion spike-in mixtures
spikeVol = 1 # volume (in microliters) of diluted spike-in mixture added to 
             #   total RNA mass
totalRNAmass = 0.500 # mass (in micrograms) of total RNA 
choseFDR = 0.05 # user defined false discovery rate (FDR), default is 0.05
@

The first input argument, datType, indicates whether that data is integer count 
data from an RNA-Seq experiment ("count") or data from a microarray experiment 
("array"). The isNorm argument indicates if the input expression measures are 
already normalized, the default value is FALSE. If you want to use normalized
RNA-Seq or microarray data in the analysis, the isNorm
argument must be set to TRUE. If the data is normalized, then limma will be used
for array data DE testing, but for RNA-Seq data, DE testing results must 
be available in the working directory in a file named 
"filenameRoot ERCC Pvals.csv"

The third argument, exTable, is the expression measure table.

Take a look at the RatTox experiment count table.
<<inspectRatCount,keep.source=TRUE>>=
head(MET.CTL.countDat)
@
The first column of the expression measure table, Feature, contains unique 
names for all the transcripts that were quantified in this experiment. 
The remaining columns represent replicates of the pair of samples, in this 
expression measure table the control sample is labeled CTL and the treatment 
sample is labeled MET. An underscore is included to separate the sample names 
from the replicate numbers during analysis. This column name format 
Sample\textunderscore{}Rep is required for the columns of any input expression 
measure table. Only one underscore (\textunderscore{}) should be used in the 
column names.

The default differential expression testing of
RNA-Seq experiments in the erccdashboard is done with the QuasiSeq package, 
which requires the use of integer count data. The default normalization of the 
data is 75th percentile (also known as upper quartile) normalization. 
It is optional to provide a vector of per replicate normalization factors 
through the input argument repNormFactor, such as a vector of total reads for 
each replicate. The example total reads vectors we provide here were derived 
from the FASTQ files associated with each column in the RNA-Seq experiment 
count tables.
Any repNormFactor vector will be used as a library size normalization factor for
each column of exTable. This will be adjusted to be a per million reads factor.

For any experiment the sample spiked with ERCC Mix 1 is sample1Name and the 
sample spiked with ERCC Mix 2 is sample2Name. In this experiment 
sample1Name = MET and sample2Name = CTL. For a more robust experimental design 
the reverse spike-in design could be created using additional replicates of the
treatment and control samples. ERCC Mix 2 would be spiked into MET samples and 
ERCC Mix 1 would be spiked into CTL control replicates. 

The dilution factor of the pure Ambion ERCC mixes prior to spiking into total 
RNA samples is erccdilution. The amount of diluted ERCC mix spiked into
the total RNA sample is spikeVol (units are $\mu$L). The mass of total RNA 
spiked with the diluted ERCC mix is totalRNAmass (units are $\mu$g).

The final required input parameter, choseFDR, is the False Discovery Rate (FDR) 
for differential expression testing. A typical choice would be 0.05 (5\% FDR), 
so this is the default choseFDR value. For the rat data since most genes are not
differentially expressed a less conservative FDR is chosen (FDR = 0.1) and for 
the UHRR and HBRR reference RNA samples FDR = 0.01 is chosen, because there is
a large number of differentially expressed genes for this pair of samples. 

The function runDashboard.R is provided for convenient default erccdashboard 
analysis. Execution of the runDashboard function calls the default functions for
erccdashboard analysis and reports parameters and progress to the R console. The
functions called within runDashboard.R are also available to the user (details 
provided in Section 4). 

All data and analysis results are stored in the list object exDat. For 
convenience the main diagnostic figures are saved to a pdf file and the exDat 
object is saved to an .RData object named using the filenameRoot provided by the
user.

Use the following command to run the default runDashboard script:

<<runDashboardRatcount, keep.source=TRUE >>=
exDat <- runDashboard(datType="count", isNorm = FALSE,
                       exTable=MET.CTL.countDat,
                       filenameRoot="RatTox", sample1Name="MET",
                       sample2Name="CTL", erccmix="RatioPair",
                       erccdilution=1/100, spikeVol=1,
                       totalRNAmass=0.500, choseFDR=0.1)
@

\subsection{Results of dashboard analysis}
The summary function will give a top level view
of the exDat list structure. The str function will give more detail. It is a 
good idea to set the max.level argument in the str function, because by the end
of the analysis the exDat structure is quite large.
<<initializeData,keep.source=TRUE>>=
summary(exDat)
@
The figures from the analysis are stored in \verb|exDat$Figures|. The four main
diagnostic figures that are saved to pdf are the dynRangePlot, rocPlot, 
lodrERCCPlot, and maPlot.
\clearpage
\begin{center}
<<ratPlotA,fig=TRUE>>=
exDat$Figures$dynRangePlot
@
\end{center}
For this particular experiment the relationship between abundance and signal for
the ERCC controls show that the measurement results span a 2$^{15}$ dynamic 
range. These ERCC mixtures were designed to span a 2$^{20}$ dynamic range, but
there was insufficient evidence to reliably quantify ERCC transcripts at low 
abundances.
\clearpage
\begin{center}
<<ratPlotB,fig=TRUE>>=
exDat$Figures$rocPlot
@
\end{center}
The receiver operator characteristic (ROC) curve and the Area Under the Curve 
(AUC) statistic provide evidence of the diagnostic power for detecting 
differential expression in this rat toxicogenomics experiment. As expected with
increased fold change, diagnostic power increases. The AUC summary statistic 
for different experiments can be used to compare diagnostic performance.
\clearpage
\begin{center}
<<ratPlotC,fig=TRUE>>=
exDat$Figures$lodrERCCPlot
@
\end{center}
By modeling the relationship between average signal and p-values we can obtain 
Limit of Detection of Ratios (LODR) estimates for each differential fold change 
(or Ratio,indicated by color) and a threshold p-value, p.thresh, indicated by 
the dotted black line. LODR values can be compared between
experiments to evaluate the ability to detect differences between samples as a 
function of transcript abundance.
\clearpage
\begin{center}
<<ratPlotD,fig=TRUE,pdf=FALSE,png=TRUE >>=
exDat$Figures$maPlot
@
\end{center}
An MA plot (Ratio of Signals vs Average Signals) shows the ratio measurements of
transcripts in the pair of samples as a function of abundance. The ERCC control 
ratios measurements are coded to indicate which controls are above a given LODR 
(solid circles) or below the LODR (open circles). This plot also shows the 
variability in ratio measurements as a function of dynamic range and the bias in 
control ratio measurements (r$_m$), which is influenced by the mRNA fraction 
difference between the pair of samples.

\clearpage

\section{Comparison of Performance Between Experiments}
The performance metrics provided here derived from measurements of ERCC ratios 
in gene expression experiments (AUC, LODR, r$_m$, and the standard deviations of 
the ERCC ratio measurements) can be used to assess performance between 
experiments within the same laboratory, or between different laboratories or 
technology platforms.

To illustrate the difference between technology platforms example data are also
provided from the SEQC interlaboratory study. The interlaboratory experiments 
were performed with aliquots from single large batches of commercially available 
reference RNA samples, Universal Human Reference RNA (UHRR) and Human Brain 
Reference RNA (HBRR).To prepare these large batches of reference RNA samples, 
50 $\mu$L of undiluted Ambion ERCC Mix 1 was spiked into 2500 $\mu$g of the 
UHRR total RNA and 50 $\mu$L of undiluted Ambion ERCC Mix 2 was spiked into 
2500 $\mu$g the HBRR sample before these samples were aliquoted and shared 
amongst laboratories and across platforms. 

\subsection{Analysis of an UHRR vs. HBRR Microarray experiment}
To analyze the example reference RNA experiment microarray data use the 
following command:
<<SEQCMainArray,results=hide>>=
exDat <- runDashboard(datType="array", isNorm = FALSE,
                      exTable=UHRR.HBRR.arrayDat,
                      filenameRoot = "Lab13.array",
                      sample1Name = "UHRR", sample2Name="HBRR",
                      erccmix = "RatioPair", erccdilution = 1, 
                      spikeVol = 50, totalRNAmass = 2.5*10^(3), choseFDR=0.01)
@

Note that unnormalized fluorescent signals are expected for erccdashboard 
analysis of microarray data (with the argument isNorm = FALSE) and input data 
should not be log transformed.

As with RNA-Seq experiments for microarray experiments it is optional 
to provide a vector of per replicate normalization factors 
through the input argument repNormFactor.

Normalized microarray data may be analyzed with the erccdashboard if 
isNorm = TRUE.

\subsection{Analysis of an UHRR vs. HBRR RNA-Seq experiment}

To analyze the RNA-Seq reference RNA experiment data simply repeat the same command
that you used in the previous section with the microarray data, but change the 
datType to "count", use UHRR.HBRR.countDat as your exTable, and change your 
filenameRoot to a different character string, so that your microarray data is 
not overwritten.

\section{Analysis Details: Advanced Use of erccdashboard Functions}
The analysis functions contained in the convenience wrapper function 
\verb|runDashboard| can also be used directly by the user.
Comments are provided above each analysis step included in \verb|runDashboard| 
to describe the purpose and ordering constraints.
View the runDashboard script to see comments describing the analysis functions 
and the ordering constraints:
<<viewDashboardOrder, keep.source=TRUE>>=
runDashboard
@

\subsection{Flexibility in Differential Expression Testing}
The \verb|geneExprTest| function wraps the QuasiSeq differential expression 
testing package for datType = "count" or uses the limma package for differential
expression testing if datType = "array". The function uses the DE testing 
p-value results and choseFDR parameter to select a threshold p-value for LODR 
estimation.

For count data if DE testing has already been done and 
a correctly formatted filenameRoot.All.Pvals.csv file is provided with the
necessary DE test results, then geneExprTest will have 
reduced runtime. The function will look for a csv file with the name 
"filenameRoot.All.Pvals.csv" and column names "Feature", "MnSignal", Pval", and 
"Fold" must be in the file. "Feature" is a column of transcript names including 
ERCC controls and endogenous transcripts, "MnSignal" is the mean signal across 
sample replicates, "Pval" are the unadjusted P-values from differential 
expression testing (Q-values will be automatically
estimated for the P-values, and take into account multiple hypothesis testing), 
and "Fold" are the numeric fold changes for the ERCC controls (0.5, 4, 1, and 
0.667) and for endogenous transcripts the "Fold" is NA.

To use results from another DE testing tool (instead of QuasiSeq) the csv file 
with the name "filenameRoot.All.Pvals.csv" and correct column headers should be 
in the working directory.
This file is required if the input data (exTable) is RNA-Seq data (datType = 
"count") and if the data is already normalized (isNorm = TRUE).
If isNorm is TRUE, then the software asks for user input about length 
normalization. Type Y at the command line in the R console if the data is 
length normalized (e.g. FPKM or RPKM data) otherwise type N.

For array data the current option for differential expression testing is limited
to limma.

\subsection{Options for LODR Estimation}
The default behavior of runDashboard is to use the \verb|estLODR| function to
obtain an LODR estimate using empirical data from the ERCCs and a model-based 
simulation using the endogenous genes in the sample. The type of LODR estimation
is selected using the argument kind in the \verb|estLODR| function. The other 
parameter that may be adjusted is the probability for the LODR estimate, in the 
default analysis prob = 0.9 is selected.

\subsection{Options for Printing Plots to File}
The function \verb|saveERCCPlots| will save selected figures to a pdf file. The default 
is to print the 4 main erccdashboard figures to a single page (plotsPerPg = "main"). 
If plotsPerPg = "single" then each plot is placed on an 
individual page in one pdf file. If plotlist is not defined (plotlist = NULL)
then all plots in \verb|exDat$Figures| are printed to the pdf file. 

\subsection{Analysis of Alternative Spike-in Designs}
By default the package is configured to analyze the ERCC ratio mixtures produced
by Ambion (ERCC ExFold RNA Spike-In Mixes, Catalog Number 4456739). This pair of
control ratio mixtures were designed to have 1:1, 4:1, 1:1.5, and 1:2 ratios of 
92 distinct RNA transcripts (23 different RNA control sequences are in each of 
these four ratio subpools). Alternative ERCC RNA control ratio mixture designs 
can be produced using the NIST DNA Plasmid Library for External Spike-in 
Controls (NIST Standard Reference Material 2374, 
https://www-s.nist.gov/srmors/certificates/2374.pdf). For example, a pair of RNA
control mixtures could be created with a ternary ratio design, three subpools of 
RNA controls with either no change (1:1) or 2-fold increased (2:1) and 2-fold
decreased (1:2) relative abundances between the pair of mixtures (Mix 1/Mix 2).
To use alternative spike-in mixture designs with the dashboard a csv file must
be provided to the package with the argument userMixFile for the initDat 
function. 

If all samples from both conditions were only spiked with a single ERCC mixture 
(e.g. Ambion Catalog Number 4456740, ERCC RNA Spike-In Mix) a limited subset of 
the package functions can be used (\verb|initDat|, \verb|est_r_m|, and 
\verb|dynRangePlot|. For \verb|initDat| use ERCCMixes="Single" and 
\verb|est_r_m| and \verb|dynRangePlot| functions can then be used to examine 
the mRNA fraction differences for the pair of samples and evaluate the dynamic 
range of the experiment. 

\section{Notes on R version and session information}
The results shown in this R vignette are the same as the results shown in our
manuscript and were obtained with the following R session 
information.
<<sessionData>>=
sessionInfo()
@
 
\end{document}