%\VignetteIndexEntry{RPA}
%The above line is needed to remove a warning in R CMD check
\documentclass[12pt]{article}

\usepackage{amsmath,amssymb,amsfonts}
\usepackage{hyperref}
\usepackage[authoryear,round]{natbib}
\usepackage{Sweave}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in
\evensidemargin=.1in
\headheight=-.3in

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}

\def\muj{\mathbf{\mu}_j}
\def\mujs{\{\mathbf{\mu}_j\}}

\def\s{\{\mathbf{s}\}}
\def\st{\mathbf{s}^t}
\def\sj{\mathbf{s}_j}
\def\sij{s_{ij}}
\def\sjs{\mathbf{s}_{j=1}^J}
\def\sjt{\mathbf{s}_j^t}
\def\stj{s_{tj}}
\def\scj{s_{cj}}
\def\stjs{\{s_{tj}\}_t}
\def\sjI{\mathbf{s}_j^1}
\def\sjT{\mathbf{s}_j^T}

\def\m{\mathbf{m}}
\def\mj{\mathbf{m}_j}
\def\mjs{\{\mathbf{m}_j\}}
\def\mtj{m_{tj}}

\def\gj{\{\mathbf{g}_j\}}
\def\gi{g_i}
\def\gt{g_t}
\def\gc{g_c}
\def\g{\mathbf{g}}

\def\d{\mathbf{d}}
\def\dt{d_t}
\def\djt{\mathbf{d}_j^t}
\def\djr{\mathbf{s}_{j}}

\def\TauSq{\boldsymbol{\tau}^2}
\def\tauj{\mathbf{\tau}_j}
\def\taus{\{\tau_j^2\}}
\def\taujSq{\mathbf{\tau}_j^2}
\def\tauOneSq{\mathbf{\tau}_1^2}
\def\tauPSq{\mathbf{\tau}_J^2}

\def\epsilonj{\varepsilon_j}
\def\epsilonij{\mathbf{\varepsilon}_{ij}}
\def\epsiloncj{\varepsilon_{cj}}
\def\epsilontj{\varepsilon_{tj}}
\def\epsilontjs{\{\varepsilon_{tj}\}_t}
\def\epsiloncjs{\{\varepsilon_{cj}\}_j}
\def\Epsilonj{\boldsymbol{\varepsilon}_j}
\def\Epsiloncj{\boldsymbol{\varepsilon}_{cj}}

\def\alphaj{\alpha_j}
\def\betaj{\beta_j}
\def\alphahatj{\hat{\alpha}_j}
\def\betahatj{\hat{\beta}_j}

\def\zi{z_i}
\def\j{\mathbf{j}}
\def\sigmaj{\sigma_j}
\def\lambdaj{\lambda_j}

\author{Leo Lahti\\Helsinki Institute for Information Technology HIIT\\
  and Adaptive Informatics Research Centre,\\Department of Information
  and Computer Science,\\Aalto University School of Science and
  Technology,\\P.O. Box 15400, FI-00076 Aalto, Finland\\{\tt
    leo.lahti@iki.fi}}

\begin{document}
\title{RPA: analysis of probe reliability and differential gene
  expression on short oligonucleotide arrays.}

\maketitle 

\section{Introduction}

\Rpackage{RPA} (Robust Probabilistic Averaging)
package\footnote{\url{http://www.cis.hut.fi/projects/mi/software/RPA/}}
provides tools for analyzing probe reliability and differential gene
expression on Affymetrix short oligonucleotide arrays.

RPA provides estimates of probe reliability, and a probeset-level
estimate of differential gene expression between a user-specified
reference array and the other arrays. Probabilistic formulation allows
incorporation of prior information concerning probe reliability into
gene expression analysis within a principled framework.  The
underlying probabilistic model for probe-level observations is
described in \cite{Lahti09tcbb}.

Your comments and contributions are welcome.

\subsection{Relation to other probe-level models}

Probe-level calculation of differential expression helps to avoid the
modeling of unidentifiable probe affinities, which are the key
probe-specific parameter in many preprocessing methods. In contrast,
RPA estimates the overall reliability of each probe in terms of a
probe-specific variance. This distinguishes RPA from probe-level
preprocessing methods such as dChip's MBEI \cite{Li01pnas}, RMA
\cite{Irizarry03rma}, or FARMS \cite{Hochreiter06}, that provide
probeset-level summaries but have not been used to investigate probe
performance. However, in addition to probe reliability analysis RPA
can be used for preprocessing in differential gene expression studies,
where it has been shown to outperform other popular preprocessing
methods. For details, see \cite{Lahti09tcbb}.

\subsection{Summary of RPA model}

RPA assumes a Gaussian model for probe effects.  Consider a probe set
targeted at measuring the expression level of target transcript
\(g\). Probe-level observation \(\sij\) of probe \(j\) on array \(i\)
is modeled as a sum of the true expression signal (common for all
probes in the probeset), and probe-specific Gaussian noise: \(\sij =
\gi + \muj + \epsilonij\). Importantly, the stochastic noise component
is probe-specific in RPA, and distributed as \(\epsilonij \sim
N(0,\taujSq)\). The variance parameters \(\taus\) are of interest in
probe reliability analysis; the inverse variance \(1/\taujSq\) can be
used to measure of probe reliability (see get.probe.noise.estimates
function).

The mean parameter \(\muj\) of the noise model describes systematic
probe affinity effect but it is unidentifiable. In RPA, these
parameters cancel out when the signal log-ratio between a
user-specified 'reference' array and the remaining arrays is computed
for each probe: the differential expression signal between arrays \(t
= \{1, \dots, T\} \) and the reference array \(c\) for probe \(j\) is
given by \(\mtj = \stj - \scj = \gt - \gc + \epsilontj - \epsiloncj =
\dt + \epsilontj - \epsiloncj\). In vector notation the differential
gene expression profile of probe \(j\) across the arrays can be
written as \(\mj = \d + \Epsilonj\). In practice, \(\d\) and the
probe-specific variances \(\{\tau_j\}_{j=1}^P\) for the \(P\) probes
within the probeset are estimated simultaneously based on the
probabilistic model. With large sample sizes the solution will
converge to estimating the mean of the probe-level observations
weighted by probe reliability. 

The probe-level data is background corrected, normalized, and
log2-transformed before the analysis. By default, RPA uses the
background correction model of RMA \cite{Irizarry03} and quantile
normalization \cite{Bolstad03}. Our implementation utilizes the
\Rpackage{affy} package \cite{Gautier04a} to handle probe-level data.
For details about short oligonucleotide arrays and the design of the
Affymetrix GeneChip arrays, see the Affymetrix MAS manual
\cite{affy5}.

\section{Probe reliability analysis}

RPA operates on affybatch objects. An affybatch can be created from
Affymetrix CEL files using the ReadAffy function of the BioConductor
\Rpackage{affy} package \cite{Gautier04a}. An affybatch contains the
probe-level data of Affymetrix arrays. Our toy examples use the
\Robject{Dilution} dataset provided by \Rpackage{affydata}
package. Load example data (the 'Dilution' affybatch):

<<startchunk,results=hide>>=
require(affy)
require(affydata)
data(Dilution)
@

{\it RPA.pointestimate} is the main function. Let us perform the
analysis for particular probesets in the Dilution data using the first
array (\(cind = 1\)) as the reference for calculating differential
expression values of the other arrays.

<<RPA.pointestimate, results = hide>>=
require(RPA)
sets <- geneNames(Dilution)[1:2]
rpa.results <- RPA.pointestimate(Dilution, sets, cind = 1)
@

Probe reliability and differential gene expression analysis is
performed on the whole data set as follows (potentially slow).

<<rpa.res, results = hide, eval = FALSE>>=
rpa.results <- RPA.pointestimate(Dilution, cind = 1)
@

The 'rpa2eset' function coerces the probeset-level differential
expression estimates (d) into an ExpressionSet object that can be
analyzed using standard R/BioC tools for gene expression. The results
for a particular probeset are visualized with

<<visu, fig = FALSE, echo = TRUE, print = FALSE, results = hide, eval = FALSE>>=
rpa.plot("1000_at", rpa.results)
@

The output is shown in Figure \ref{fig:illustration}. See
help('rpa.plot') for details.

\begin{figure}[htbp]
\begin{center}
<<barplots, fig = TRUE, echo = FALSE, print = FALSE>>=
rpa.plot("1000_at", rpa.results)
@
\caption{Estimated probe-specific variances and differential gene expression for an example probe set.}
\label{fig:illustration}
\end{center}
\end{figure}

\subsection{Estimating probe-specific noise and probe reliability}

RPA estimates the noise level of each individual probe through the
probe-specific variance parameter (\(\taujSq\)). These can be obtained with

<<noise,results=hide>>=
noise <- get.probe.noise.estimates(rpa.results)
@ 

The higher the variance, the more noisy the probe. Inverse of the
variance, \(\frac{1}{\taujSq}\), can be used to quantitate probe
reliability. Note that the relative weight of a probe within probeset
is determined by the relative noise of the probe with respect to the
other probes in the same probeset. Comparison of probe-specific
variances across probesets may benefit from normalization of this
effect.  The get.probe.noise.estimates function can optionally provide
normalized versions of the noise estimates.


\subsection{Manual analysis of an individual probe set}

Preprocess data before the analysis:

<<Smat,results=hide, eval = FALSE>>=
Smat <- RPA.preprocess(Dilution, cind = 1)
@

Pick probe-level data for a probe set (arrays x probes matrix):

<<pick, results=hide, eval = FALSE>>=
S <- t(Smat$fcmat[Smat$set.inds[["1000_at"]], ])
@

Estimate probeset-level signal and probe-specific variances:

<<iter, results=hide, eval = FALSE>>=
res <- RPA.iteration(S)
@

\subsection{Setting probe-specific priors}

Prior information of probe reliability can be set by tuning the shape
(alpha) and scale (beta) parameters of the model. These are inverse
Gamma distribution parameters, which is the conjugate prior for the
variances. Set priors for a particular probeset. If the 'priors'
parameter is not given, non-informative priors will be given for the
other probesets:

<<setpriors, results = hide>>=
alpha <- beta <- rep(1, 16)
probe.index <- 5
alpha[[probe.index]] <- 3
beta[[probe.index]] <- 1
priors <- set.priors(Dilution, set = "1000_at", alpha, beta)
@

Run RPA with priors:

<<pe, results=hide, eval = FALSE>>=
rpa.results <- RPA.pointestimate(Dilution, sets, priors = priors)
@                                  

\section{Differential gene expression analysis}

RPA provides a wrapper ('rpa') for robust preprocessing of gene
expression data in differential gene expression studies. One of the
arrays is used as the reference against which the differential
expression values are calculated; RPA expression values are
log\(_2\)-ratios related to the reference array. Choice of the control
array does not affect probe reliability estimates. RPA supports also
alternative CDF environments (see function documentation for details).

<<rpa, results = hide, eval = FALSE>>=
eset <- rpa(Dilution, cind = 1)
@

The output is an ExpressionSet object, which allows downstream
analysis of the results using standard R/BioC tools for gene
expression data.

\section{Citing RPA}

Please cite \cite{Lahti09tcbb} when using the package.

\section{Details}

This document was written using:

<<details>>=
sessionInfo()
@

\bibliographystyle{plainnat}
\bibliography{RPA}

\end{document}