% \VignetteIndexEntry{XDE Vignette}
% \VignetteKeywords{microarray, differential expression, meta-analysis}
% \VignettePackage{XDE}
\documentclass{article}
\usepackage{amsmath}
\usepackage{inconsolata}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage{hyperref}
\usepackage[mathcal]{euscript}

\parindent 0in  % Left justify
\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rfunction}[1]{\texttt{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\stan}{\texttt{Stanford}}
\newcommand{\xde}{\textit{XDE}}
\newcommand{\eset}{\Robject{ExpressionSet}}
\newcommand{\esets}{\Robject{ExpressionSet}s}
\newcommand{\esetList}{\Robject{ExpressionSetList}}
\newcommand{\esetLists}{\Robject{ExpressionSetList}s}
\newcommand{\xparam}{\Robject{XdeParameter}}
\newcommand{\xmcmc}{\Robject{XdeMcmc}}
\newcommand{\R}{\textsf{R}}
\newcommand{\bioc}{http://www.bioconductor.org}

\textwidth=6.2in
\textheight=8.5in
\parskip=.3cm
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\title{XDE: A Bayesian hierarchical model for analysis of differential
  expression in multiple studies}

\author{Robert Scharpf, Andrew Nobel, Giovanni Parmigiani, H\aa kon
  Tjelmeland}

\begin{document}

\maketitle

<<echo=FALSE>>=
options(width=60)
@

\section{Introduction}

There are many publicly available high throughput gene expression
studies that address comparable biological questions with similar
patient populations.  For economical and practical reasons many of
these studies have a relatively small number of biological replicates.
To improve the statistical power it is of interest to combine observed
data from several microarray studies, potentially measured with
different technologies.  However, variation in the measured gene
expression levels is caused not only by the biological conditions of
interest and natural variation in gene expression in different samples
of the same type, but also to a great extent by technological
differences between studies.  To successfully combine observed data
from different studies, it is therefore essential to filter out the
technological effects.  The \R{} package \Rpackage{XDE} fits a
Bayesian Hierarchical model for estimation of differential expression
in 2 or more studies.  Alternative methods for assessing differential
expression are provided.  Other \R{} packages developed for the
cross-study analysis of differential expression are
\Rpackage{GeneMeta} \cite{Lusa2007}, \Rpackage{metaArray}
\cite{Choi2007}, and \Rpackage{rankProduct} \cite{Hong2006}.


After reading this vignette, one should be able to
\begin{itemize}

\item create an instance of the class \esetList{}

\item create an instance of the class \xparam{} that contains default
  options for the MCMC

\item fit the Bayesian hierarchical model to two or more studies
  stored as an \esetList{} class

\item plot MCMC chains to assess convergence

\item compute posterior averages for concordant and discordant
  differential expression

\item generate alternative cross-study summaries of differential
  expression

\end{itemize}

See \cite{Scharpf2009} for a more detailed discussion of the Bayesian
hierarchical model

\section{\label{sec:xlist} The \esetList{} Class}

Presently the software for fitting the Bayesian hierarchical model
requires that each study be represented as an \Robject{ExpressionSet}
\cite{Gentleman2004} and that the same features be present in each
study (the \Rfunction{featureNames} of the studies must be identical).
Therefore, for the meta-analysis of several platforms,
platform-specific annotations must be mapped to a common reference
identifier.  Mapping identifiers in each platform is a non-trivial
stop in any cross-study analysis.  The \R{} packages
\Rpackage{funcBox} (not yet publicly available) and
\Rpackage{MergeMaid} have been developed to facilitate the annotation
process and the merging of multiple studies, respectively
\cite{Cope2004}.

For the purposes of this vignette, \xde{} provides an example dataset
of a single study that was split into three artificial datasets.

<<data, eval=TRUE>>=
library(XDE)
data(expressionSetList)
xlist <- expressionSetList
class(xlist)
@

The original study is described in \cite{garb:etal:2001}.  The
processed data was mapped to unigene identifiers and made available in
the experimental data package \Rpackage{lungExpression} (\bioc) to
facilitate the reproducibility of the analyses described in
\cite{Parmigiani2004a}.

The function \Rfunction{validObject} checks that the each element in
\esetList{} is a valid \eset{} and that the \Robject{featureNames} are
the same in each element.

<<validObject, eval=TRUE>>=
validObject(xlist)
@

In order to assess differential expression across multiple studies,
one must define a dichotomous covariate in the \Robject{phenoData} of
each \eset{} element in the \esetList{} object.  This covariate must
have an identical name in each study.  In this vignette, we will use
\xde{} to quantify differential expression between adenocarcinomas and
squamous carcinomas.  The binary covariate ``adenoVsquamous'' has been
defined in each of the \eset{} elements for this purpose.  The
following statement must evaluate to \texttt{TRUE}:

<<covariate>>=
stopifnot(all(sapply(xlist, function(x, label){ label %in% varLabels(x)}, label="adenoVsquamous")))
@



\section{\label{sec:XdeParam} The \xparam{} Class}

There are many features in our implementation of the hierarchical
Bayesian model that can be modified:

\begin{itemize}

\item hyperparameters for the Bayesian hierarchical model

\item the seed for generating random values in the MCMC

\item the starting values for the MCMC chains

\item tuning parameters for Metropolis-Hastings proposals

\item the number of updates per MCMC iteration (each parameter can
  have a different number of updates)

\item selection of MCMC chains that are to be written to log files

\end{itemize}

All of the above features are provided in the \xparam{} class.  The
attributes provided when initializing an instance of class \xparam{}
work well in most instances.  The \xparam{} vignette provides a more
detailed description of how the attributes can be modified, as well as
a brief description of the Bayesian model and the algorithm for the
MCMC.  See \cite{Scharpf2009} for a more detailed description of the
Bayesian model.

The default values for our \Robject{xlist} object are obtained by
initializing an instance of the \xparam{} class:

<<initializeList, eval=TRUE>>=
params <- new("XdeParameter", esetList=xlist, phenotypeLabel="adenoVsquamous")
params
@

This parameterization, presently the default, assumes that a gene is
differentially expressed in all studies or in none.  An alternative
parametrization that allows genes to be differentially expressed in a
subset of studies can be obtained by setting the argument
\Robject{one.delta} to \Robject{FALSE}.

<<initializeList2, eval=FALSE>>=
params <- new("XdeParameter", esetList=xlist,
	      phenotypeLabel="adenoVsquamous", one.delta=FALSE)
@

Whether one fits the ``$\delta_g$'' (\Robject{one.delta=TRUE}) or the
``$\delta_{gp}$'' model (\Robject{one.delta=FALSE}), the chain written
to file is of dimension G x P x I, where G is the number of genes, P
is the number of studies, and S is the number of samples (in the case
of the $\delta_g$ model, the value written to file for a single gene
will be the same for each study.  The same is true for the $\xi_p$
parameter.)

\section{\label{sec:fit} Fitting the Bayesian hierarchical model}

\subsection{Starting values}

\paragraph{Randomly simulated starting values}
By default, the first iteration of the MCMC chain stored in the slot
\texttt{firstMcmc} of the \Robject{params} are simulated randomly from
the priors.  When the value of \texttt{burnin} is true, the output
from the MCMC are not saved to file and the chain can not be monitored
for convergence. By default the value of \Robject{burnin} is TRUE and
only the last iteration from the chain will be available (the
parameters are not written to log files).





\paragraph{Empirical starting values}
One can use empirical values for starting the chain (or specify your
own starting values) by initializing an object of class XdeParameter
and then specifying your own values for the first MCMC:

<<empiricalStart, eval=FALSE>>=
params <- new("XdeParameter", esetList=xlist, phenotypeLabel="adenoVsquamous", one.delta=FALSE)
empirical <- empiricalStart(xlist, phenotypeLabel="adenoVsquamous")
firstMcmc(params) <- empirical
@

To run a burnin of 3 iterations starting from the empircal values:

<<burnin, eval=FALSE>>=
iterations(params) <- 3
burnin(params) <- TRUE
fit <- xde(params, xlist)
@

Only the first and last iterations of the MCMC are available when
\texttt{burnin} is \texttt{TRUE}.  The output of the \Rfunction{xde}
is an object of class \xmcmc{}.  The object \texttt{fit} contains the
last iteration from the MCMC, as well as a different seed that can be
used for initiating the next chain.  For instance, to run two
additional iterations starting at the last iteration stored in the
\texttt{fit} object, one should provide the \Robject{params},
\Robject{xlist}, and \Robject{fit} objects to the \Rfunction{xde}
function:

<<moreIterations, eval=FALSE>>=
iterations(params) <- 2
fit2 <- xde(params, xlist, fit)
@

When an object of class \xmcmc{} is supplied as an argument to the
\Rfunction{xde} function, the seed and the last iteration from the
\Robject{fit} object are used to begin the next chain.  Note that
the results from the previous call to the \Rfunction{xde} would be
identical to the following sequence of commands:

<<firstMcmc, eval=FALSE>>=
firstMcmc(params) <- lastMcmc(fit)
seed(params) <- seed(fit)
fit2 <- xde(params, xlist)
@

One should run several thousand iterations (saving all parameters to
file) to monitor convergence.  In the following code chunk (not run)
we save only the chains for the parameters that are not indexed by
genes and/or study by setting the \texttt{output} for these parameters
to zero -- this step was taken to keep this package from becoming
unnecessarily large.  By setting the \texttt{thin} to 2, we only write
every other MCMC iteration to file.  In total, 1000 iterations are
saved.

<<runMoreIterations, eval=FALSE, echo=TRUE>>=
burnin(params) <- FALSE
iterations(params) <- 2000
output(params)[c("potential", "acceptance",
		 "diffExpressed",
		 "nu", ##"DDelta",
                 ##"delta",
		 "probDelta",
		 ##"sigma2",
		 "phi")] <- 0
thin(params) <- 2
directory(params) <- "logFiles"
xmcmc <- xde(params, xlist)
@

<<savexmcmc, eval=FALSE, echo=FALSE, results=hide>>=
##this function should work
postAvg <- calculatePosteriorAvg(xmcmc, NCONC=2, NDIFF=1)
save(postAvg, file="~/madman/Rpacks/XDE/inst/logFiles/postAvg.rda")
##put browser in .standardizedDelta to fix tau2R, ...
BES <- calculateBayesianEffectSize(xmcmc)
save(BES, file="~/madman/Rpacks/XDE/inst/logFiles/BES.rda")
save(xmcmc, file="~/madman/Rpacks/XDE/data/xmcmc.RData")
q("no")
@

See the \xparam{} vignette for a more detailed discussion of the
output, burnin, and thin methods.  The \Robject{xmcmc} object can be
loaded by:

<<loadObject, echo=FALSE, eval=TRUE>>=
data(xmcmc, package="XDE")
@

\section{\label{sec:mcmc} MCMC diagnostics}

In this section we briefly describe how to access the chains for
assessing convergence of the MCMC. We refer the \R user to the package
\Rpackage{coda} for more detailed discussion of MCMC diagnostics
\cite{Plummer2007}.

The following is a list of the chains that were saved in our run with
1000 saved iterations:
<<savedParams, eval=TRUE>>=
output(xmcmc)[2:22][output(xmcmc)[2:22] == 1]
@

The \$ operator can be used to read in log files.  First, we need to
update the \texttt{directory} slot in the \Robject{xmcmc} with a
character string indicating the path to the log files.  (Note that an
assignment method for directory is only available for \R{} object of
class \xmcmc{} -- typically one would not need to change the directory
in the \xmcmc{} object).  In the following code chunk, we extract the
chain for the $c^2$ parameter:

<<extractc2, eval=TRUE>>=
pathToLogFiles <- system.file("logFiles", package="XDE")
xmcmc@directory <- pathToLogFiles
c2 <- xmcmc$c2
@

<<c2, fig=TRUE, eval=TRUE>>=
par(las=1)
plot.ts(c2, ylab="c2", xlab="iterations", plot.type="single")
@

Here we extract only the parameters that are not indexed by gene and
platform (log files for parameters indexed by gene and platform are
typically very large):

<<retrieveLogs, results=hide>>=
getLogs <- function(object){
	params <- output(object)[output(object) == 1]
	params <- params[!(names(params) %in% c("nu", "phi", "DDelta", "delta", "sigma2", "diffExpressed"))]
	names(params)
}
param.names <- getLogs(xmcmc)
params <- lapply(lapply(as.list(param.names), function(name, object) eval(substitute(object$NAME_ARG, list(NAME_ARG=name))), object=xmcmc), as.ts)
names(params) <- param.names
tracefxn <- function(x, name)  plot(x, plot.type="single", col=1:ncol(x), ylab=name)
mapply(tracefxn, params, name=names(params))
@

\section{\label{sec:posterior} Posterior probabilities of differential expression}

We refer to the posterior mean of the standardized offsets in the
hierarchical model as the Bayesian effect size (BES).  The BES is
calculated as $\frac{\delta_g\Delta_{gp}}{c \tau \sigma^{b_p}}$ and
obtained by

<<bayesianEffectSize, eval=FALSE>>=
bayesianEffectSize(xmcmc) <- calculateBayesianEffectSize(xmcmc)
@

As the function \Rfunction{calculateBayesianEffectSize} requires that
the $\delta, \Delta$, and $\sigma^2$ chains are saved, the above code
chunk is not evaluated in this vignette.

<<bes, eval=TRUE>>=
load(file.path(pathToLogFiles, "BES.rda"))
load(file.path(pathToLogFiles, "postAvg.rda"))
@

Posterior averages for the probability of differential expression,
concordant differential expression, and discordant differential
expression are stored in the \Robject{postAvg} object.


See \cite{Scharpf2009} for a discussion of how the above posterior
average probabilities are computed.

<<postAvgFig_conc, echo=FALSE, fig=TRUE, eval=TRUE>>=
par(las=1)
hist(postAvg[, "concordant"], breaks=50, xlim=c(0, 1), main="",
     xlab="posterior probability of concordant differential expression")
@

<<postAvgFig_disc, echo=FALSE, fig=TRUE, eval=TRUE>>=
par(las=1)
hist(postAvg[, "discordant"], breaks=50, xlim=c(0, 1), main="",
     xlab="posterior probability of discordant differential expression")
@

We may wish to differentially label the study-specific statistics of
effect size that have high probabilities of concordant differential
expression.  To do this, we order the matrix of the BES so that the
genes with the highest posterior probabilities are plotted last.  The
function \Rfunction{symbolsInteresting} returns a list of graphical
options to \Rfunction{pairs}.

<<setPar, eval=TRUE>>=
op.conc <- symbolsInteresting(rankingStatistic=postAvg[, "concordant"])
op.disc <- symbolsInteresting(rankingStatistic=postAvg[, "discordant"])
@

<<conc, fig=TRUE, eval=TRUE>>=
par(las=1)
graphics:::pairs(BES[op.conc$order, ], pch=op.conc$pch, col=op.conc$col,
                 bg=op.conc$bg, upper.panel=NULL, cex=op.conc$cex)
@

or high probabilities of discordant differential expression

<<disc, fig=TRUE, eval=TRUE>>=
graphics:::pairs(BES[op.disc$order, ], pch=op.disc$pch, col=op.disc$col, bg=op.disc$bg,
		 upper.panel=NULL, cex=op.disc$cex)
@

\section{\label{sec:alternatives} Alternative cross-study summaries of differential expression}

Study-specific estimates of effect size, such as SAM or t-statistics,
can be useful to check the overall reproducibility between studies.
Again using pairwise scatterplots we plot t- and SAM-statistics using
different colors and plotting symbols for the genes that show high
posterior probabilities of concordant differential expression.

<<ssAlternatives, eval=FALSE>>=
##t <- ssStatistic(statistic="t", phenotypeLabel="adenoVsquamous", esetList=xlist)
tt <- rowttests(xlist, "adenoVsquamous", tstatOnly=TRUE)
if(require(siggenes)){
  sam <- ssStatistic(statistic="sam", phenotypeLabel="adenoVsquamous", esetList=xlist)
}
if(require(GeneMeta)){
  z <- ssStatistic(statistic="z", phenotypeLabel="adenoVsquamous", esetList=xlist)
}
@

<<tstatConc, eval=FALSE>>=
graphics:::pairs(tt[op.conc$order, ], pch=op.conc$pch, col=op.conc$col,
                 bg=op.conc$bg, upper.panel=NULL, cex=op.conc$cex)

@

<<zConc, eval=FALSE>>=
graphics:::pairs(tt[op.conc$order, ], pch=op.conc$pch, col=op.conc$col,
                 bg=op.conc$bg, upper.panel=NULL, cex=op.conc$cex)

@

<<tstatDisc, eval=FALSE>>=
graphics:::pairs(tt[op.disc$order, ], pch=op.disc$pch, col=op.disc$col,
                 bg=op.disc$bg, upper.panel=NULL, cex=op.disc$cex)
@

Uncorrelated $t$ and $SAM$ statistics suggest a low level of
reproducibility that may be attributable to technological differences
in the platforms, probes that align to different transcripts of the
same gene, or differences in the study populations. Low or
non-existing reproducibility may induce a wrong borrowing of strength
in the Bayesian model, whereby concordant differential expression is
seen as noise and shrunk to zero.  Hence, study-specific estimates of
t- and SAM-statistics may be helpful in deciding whether the Bayesian
model is likely to be beneficial.

If the correlation across studies is low, we suggest an unsupervised
approach to gene filtering, such as integrative correlation, to select
for genes that show some level of reproducibility across studies.  The
\R{} packages \Rpackage{MergeMaid} and \Rpackage{genefilter} may be
helpful.

For evaluating the overall differential expression, we follow the
discussion of Garrett-Mayer \citep{Garrett-Mayer2007}, and combine the
elements of single study statistics in a linear fashion to obtain a
statistic suitable for assessing differential expression.  For a more
detailed discussion of how these cross-study summaries were generated
for evaluating concordant and discordant differential expression, see
\cite{Scharpf2009}.

<<pca, eval=FALSE>>=
tScores <- xsScores(tt, N=nSamples(xlist))
samScores <- xsScores(sam, nSamples(xlist))
zScores <- xsScores(z[, match(names(xlist), colnames(z))], N=nSamples(xlist))
##Concordant differential expression, we use the combined score from the random effects model directly
zScores[, "concordant"] <- z[, "zSco"]
@

%\section{Assessing Goodness of Fit}


\section{Session Information}

The version number of \R{} and packages loaded for generating the
vignette were:

<<sessionInfo, echo=FALSE, results=tex>>=
toLatex(sessionInfo())
@

\bibliography{genomics}
\bibliographystyle{plain}

\end{document}