%\VignetteIndexEntry{siggenes Manual}
%\VignetteKeywords{Expression Analysis}
%\VignetteDepends{siggenes}
%\VignettePackage{siggenes}


\documentclass[a4paper]{article}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}


\setlength{\parindent}{0cm} \setlength{\parskip}{12pt}

\renewcommand{\baselinestretch}{1.2}


\begin{document}
\title{Identifying differentially expressed genes with
\texttt{siggenes}}

\author{Holger Schwender\\  holger.schw@gmx.de}
\date{}

\maketitle

%\setlength{\parskip}{0pt}

\vspace{12pt}

\begin{abstract}
\noindent In this vignette, we show how the functions contained in
the \texttt{R} package \texttt{siggenes} can be used to perform
both the Significance Analysis of Microarrays (SAM) proposed by
Tusher et al.\ (2001) and the Empirical Bayes Analysis of
Microarrays (EBAM) suggested by Efron et al.\ (2001).
\end{abstract}


\section{Introduction}

Both the Significance Analysis of Microarrays (SAM) proposed by
Tusher et al.\ (2001) and the Empirical Bayes Analysis of
Microarrays (EBAM) suggested by Efron et al.\ (2001) can be used
to identify differentially expressed genes and to estimate the
False Discovery Rate (FDR). They are, however, not restricted to
gene expression data, but can also be applied in any other multiple testing
si\-tuation. Since SAM and EBAM have been developed for high-dimensional data,
it, however, might be better to use more classical multiple testing approaches
such as Bonferroni correction if there are only a few variables that should be tested. 

In this vignette, it is described how the
functions implemented in the \texttt{R} package
\texttt{siggenes} can be used to perform a SAM or an EBAM analysis. 
For details on the algorithms behind these functions, see Schwender (2003), Schwender
et al.\ (2003), and Schwender (2007).


As usual, it is necessary to load the package.

<<echo=F,results=hide>>=
library(siggenes)
@

<<>>=
library(siggenes)
@



In the following, we use the Golub et al.\ (1999) data set as it
is provided by the \texttt{multtest} package to illustrate how the
SAM and the EBAM analyses can be performed.

<<>>=
data(golub)
@


\texttt{data(golub)} consists of a $3,051 \times 38$ matrix \texttt{golub}
containing the expression levels of 3,051 genes and 38 samples, a
vector \texttt{golub.cl} containing the class labels of the 38
samples, and a $3,051\times 3$ matrix \texttt{golub.gnames} whose third
column consists of the names of the genes.

Another example of how SAM can be applied to an \texttt{ExpressionSet}
object can be found in Schwender et al.\ (2006). This article is part of
a special issue of RNews that can be obtained by available

\begin{verbatim}
> browseURL("http://cran.r-project.org/doc/Rnews/Rnews_2006-5.pdf")
\end{verbatim}

This article is also available in the doc-folder of the \texttt{siggenes} package 
and can be opened by

\begin{verbatim}
> openPDF(file.path(find.package("siggenes"),"doc","siggenesRnews.pdf"))
\end{verbatim}

\section{General Information}\label{info}

While a SAM analysis starts by calling the function \texttt{sam}, an EBAM
analysis starts -- depending on whether an optimal choice for the fudge
factor $a_0$ should be specified or not -- either with \texttt{find.a0} or 
\texttt{ebam}. This is due to the fact that in SAM the fudge factor -- if required --
is computed during the actual analysis, whereas in EBAM it has to be determined prior to
the actual analysis.

In \texttt{siggenes}, functions for performing a SAM or an EBAM analysis with
mo\-de\-ra\-ted $t$-statistics (one-class or two-class, paired or unpaired, assuming
equal or unequal group-variances), a moderated $F$-statistic, Wilcoxon rank statistics (one-class
or two-class, paired or unpaired), and
Pearson's $\chi^2$-statistic as well as the Cochran-Armitage trend test statistic $C$
(for analyzing categorical data such as SNP data) are
available. \texttt{siggenes} also provides the possibility to write your own function
for using your own test score in \texttt{sam}, \texttt{find.a0} and \texttt{ebam} (see
Chapter \ref{own} for how to write such functions).

The different test scores can be selected by specifying the argument
\texttt{method} of \texttt{sam} and \texttt{ebam}. In Table \ref{tab:method}, the 
specifications of \texttt{method} for the different analyses are summarized.

\begin{table}[!hb]
\centering \caption{Specification of the argument \texttt{method} for the different
analyses.}\label{tab:method}
\vspace*{8pt}
\begin{tabular}{l l l l}
\hline
Test Score & \texttt{sam} &\texttt{ebam} & \texttt{find.a0}\\
\hline
$t$&\texttt{d.stat}&\texttt{z.ebam}&\texttt{z.find}\\
$F$&\texttt{d.stat}&\texttt{z.ebam}&\texttt{z.find}\\
Wilcoxon&\texttt{wilc.stat}&\texttt{wilc.ebam}&\hspace*{2ex}--\\
$\chi^2$&\texttt{chisq.stat}&\texttt{chisq.ebam}&\hspace*{2ex}--\\
$C$&\texttt{trend.stat}&\texttt{trend.ebam}&\hspace*{2ex}--\\
\hline
\end{tabular}
\end{table}

Note that for \texttt{find.a0} only a function for the moderated
$t$- and $F$-scores is available, since the fudge factor is not required -- neither in EBAM nor in SAM --
if Wilcoxon rank sums or Pearson's $\chi^2$-statistic are employed.

To get the general arguments of \texttt{sam}, \texttt{ebam} and \texttt{find.a0},
\texttt{args} or \texttt{?} can be used, whereas method-specific arguments can be 
obtained by \texttt{args(foo)} or \texttt{?foo}, where \texttt{foo} is one
of the functions/methods summarized in Table \ref{tab:method}. So, e.g.,

<<>>=
args(sam)
@

gives you the general arguments for a SAM analysis, whereas

<<>>=
args(d.stat)
@

returns the additional arguments for a SAM analysis with a moderated $t$- or $F$-statistic. These
arguments can also be specified in \texttt{sam} because of the \texttt{...} in \texttt{sam}.

Note that one of the arguments of \texttt{sam} (and of \texttt{ebam} and \texttt{find.a0}) is
\texttt{control}, which is essentially a list of further arguments for controlling the respective
analysis. These further arguments can typically be ignored, as they are only provided for very special purposes 
(e.g., for trying to reproduce the results of an application of the Excel SAM version), and their defaults
typically work well. All these arguments should only be changed if their meaning is completely understood,
as their defaults have been chosen sensibly. 

For \texttt{d.stat}, the most important arguments are \texttt{data} and \texttt{cl} (see Section \ref{required}). 
Additionally, \texttt{B}, \texttt{var.equal}, and \texttt{rand} might also be of importance. 
All the other arguments should only be changed if their meaning is completely understood, 
as their defaults have been chosen sensibly.

To obtain, e.g., the general help file for an EBAM analysis, use

\begin{verbatim}
> ?ebam
\end{verbatim}

and to get the \texttt{z.ebam}-specific help files, use

\begin{verbatim}
> ?z.ebam
\end{verbatim}

The arguments or the help files for the class-specific methods \texttt{plot}, \texttt{print},
\texttt{summary}, ... can be received by calling the functions \texttt{args.sam}, \texttt{args.ebam}, and
\texttt{args.finda0}, or \texttt{help.sam}, \texttt{help.ebam}, and \texttt{help.finda0}. So, e.g.,

<<>>=
args.sam(summary)
@

returns all arguments of the \texttt{SAM}-class specific method \texttt{summary}, and

\begin{verbatim}
> help.finda0(plot)
\end{verbatim}

opens an html-file containing the help for the \texttt{FindA0}-class specific method \texttt{plot}.

In the analysis with \texttt{sam} and \texttt{ebam}, all required statistics for a SAM
or an EBAM analysis, respectively, are computed. Additionally, the number of differentially expressed
genes and the estimated FDR is determined for an initial (set of) value(s) for the threshold \texttt{delta}.

Note that even though the threshold for calling a gene differentially expressed is called \texttt{delta}
in both \texttt{sam} and \texttt{ebam}, its meaning is totally different in both analyses: In SAM,
\texttt{delta} is the distance between the observed and the expected (ordered) test scores, whereas 
in EBAM \texttt{delta} is the (posterior) probability that a gene with a specific test score is 
differentially expressed.

After the analysis with \texttt{sam} and \texttt{ebam}, the following functions can be employed for
obtaining general and gene-specific information:
\begin{itemize}
\item \texttt{print}: Obtain the number of identified genes and the estimated FDR for other values
of \texttt{delta}.
\item \texttt{summary}: Get specific information on the genes called differentially expressed for a
specified value of \texttt{delta}.
\item \texttt{plot}: Generate a SAM plot or a plot of the posterior probabilities.
\item \texttt{findDelta}: For a given value of the FDR, find the value of \texttt{delta} --
and thus the number of identified variables -- that provides the control of the FDR at the specified level.
Alternatively, find the value of \texttt{delta} -- and thus the level at which the FDR is controlled -- 
for a specified number of variables.
\item \texttt{sam2excel}/\texttt{ebam2excel}: Generate a csv-file containing the information returned
by \texttt{summary} for the use in Excel.
\item \texttt{sam2html}/\texttt{ebam2html}: Generate an html-file containing the information on the
genes called differentially expressed with links to public repositories.
\item \texttt{list.siggenes}: Get the names (as a character vector) of the genes called differentially
expressed when using a specific value of \texttt{delta}.
\item \texttt{link.siggenes}: Generate an html-file containing the output of \texttt{list.siggenes}
with links to public repositories.
\end{itemize}

More details on these functions are given in the Chapters \ref{sam} and \ref{ebam}.
 

\section{Required Arguments: \texttt{data} and \texttt{cl}}\label{required}

In the first step of both a SAM and an EBAM analysis,
two arguments are required: \texttt{data} and \texttt{cl}. 

The first required argument, \texttt{data}, is the matrix (or the
data frame) containing the gene expression data that should be
analyzed. Each row of this matrix must correspond to a gene, and
each column must correspond to a sample. \texttt{data} can also 
be an \texttt{ExpressionSet} object (e.g., the output of \texttt{rma} or
\texttt{gcrma}).

The second required argument, \texttt{cl}, is the vector of length \texttt{ncol(data)}
containing the class labels of the samples. In an analysis of two
class paired data, \texttt{cl} can also be a matrix. If \texttt{data}
is an \texttt{ExpressionSet} object, \texttt{cl} can also be the name of the column
of \texttt{pData(data)} containing the class labels.

The correct specification of the class labels depends on the type of data that
should be analyzed. On the basis of this specification, the
functions identify the type of data automatically.

\textbf{One class data.} In the one class case, \texttt{cl} is expected to
be a vector of length $n$ containing only 1's, where $n$ denotes the number of
samples but another value than 1 is also accepted. In the latter case,
this value is automatically set to 1. So for $n = 10$, the vector \texttt{cl} is given by

<<>>=
n <- 10
rep(1, n)
@


\textbf{Two class, unpaired data.} In
this case, the functions expect a vector \texttt{cl} consisting
only of 0's and 1's, where all the samples with class label '0'
belong to one group (e.g., the control group), and the samples
with class label '1' belong to the other group (e.g., the
case group). So if, for example, the first \texttt{n1} $=5$ columns of the
data matrix correspond to controls and the next \texttt{n2} $=5$
columns correspond to cases, then the vector \texttt{cl} is given
by

<<>>=
n1 <- n2 <- 5
rep(c(0, 1), c(n1, n2))
@


The functions also accept other values than 0 and 1. In this case,
the smaller value is automatically set to 0, and the larger value
to 1. So if, e.g., 1 is used as the label for group 1, and 2 for
the label of group 2, then the functions will automatically set the
class label '1' to 0, and the class label '2' to 1.


\textbf{Two class, paired data.} Denoting
the number of samples by $n$, we here have $K=n/2$ paired
observations. Each of the $K$ samples belonging to the first group
(e.g., the after treatment group) is labelled by one of the
integers between 1 and $K$, and each of the $K$ samples belonging
to the other group (e.g., the before treatment group) is labelled
by one of the integers between -1 and $-K$, where the sample with
class label '$k$' and the sample with label $'-k'$ build an
observation pair, $k=1,\ldots,K$. So if, e.g., the first $K=5$
columns of the data matrix contain samples from the before
treatment group, and the next $K=5$ columns contain samples from
the after treatment group, where the samples 1 and 6, 2 and 7,
..., respectively, build a pair, then the vector \texttt{cl} is
given by

<<>>=
K <- 5
c((-1:-5), 1:5)
@


Another example: If the first column contains the before treatment
measurements of an observation, the second column the after
treatment measurements of the same observation, the third column
the before treatment measurements of the second observation, the
fourth column the after treatment measurements of the second
observation, and so on, then a possible way to generate the vector
\texttt{cl} for $K=5$ paired observations is

<<>>=
K <- 5
rep(1:K, e = 2) * rep(c(-1 ,1), K)
@


There is another way to specify the class labels in the two class paired
case: They can be specified by a matrix with $n$ rows and two columns. One of
the column should contain -1's and 1's specifying the before and after
treatment samples. The other column should consist of the integers between 1
and $K$ indicating the observation pairs. So if we consider the previous example,
\texttt{cl} can also be specified by

<<>>=
K <- 5
cbind(rep(c(-1, 1), 5), rep(1:5, e = 2))
@

While \texttt{cl} must be specify as described above if \texttt{cl} is a vector,
other values will be accepted if \texttt{cl} is a matrix. In the latter case, the
smaller value of the column of \texttt{cl} containing two different values will be
set to -1, and the larger value to 1. The $K$ different values in the other column
are sorted and set to the integers between 1 and $K$.

\textbf{Multi-class and categorical cases.} In these cases, \texttt{cl} should
be a vector containing the integers between 1 and $g$, where $g$ is the number of different
classes/levels. Other labels are also accepted, but will automatically be set to the integers between
1 and $g$.


\section{Significance Analysis of Microarrays}\label{sam}

In this section, we show how the Significance Analysis of
Microarrays (SAM) proposed by Tusher et al.\ (2001) can be applied
to the data set of Golub et al.\ (1999). Another example of how SAM
can be applied to gene expression data can be found in Schwender et
al.\ (2006). This article can be opened by calling

\begin{verbatim}
> openPDF(file.path(find.package("siggenes"),"doc","siggenesRnews.pdf"))
\end{verbatim}

The matrix \texttt{golub} contains the expression values of
3,051 genes and 38 samples, while the vector \texttt{golub.cl}
consists of the class labels that are either 0 and 1. Additionally, the gene names
are provided by the third column of \texttt{golub.gnames}. (Note that if
the row names of the data matrix already comprise the gene names, \texttt{gene.names}
need not to be specified in \texttt{sam}.)

A SAM analysis of the Golub et al.\ (1999) data set (i.e.\ a SAM analysis for
two class unpaired data) can be performed by

<<>>=
sam.out <- sam(golub, golub.cl, rand = 123, gene.names = golub.gnames[,3])
sam.out
@

The argument \texttt{rand} is set to 123 to make the results of \texttt{sam} reproducible.

The output of \texttt{sam.out} consists of
\begin{itemize}
\item \texttt{Delta}: the different values of $\Delta$ for which
the numbers of genes and the estimated FDRs are computed,
\item \texttt{p0}: the estimated prior probability that a gene is not 
differentially expressed,
\item \texttt{False}: the number of falsely called genes (which is \emph{not} the expected
number of false positives, see below and Tusher et al., 2001),
\item \texttt{Called}: the number of genes called differentially expressed,
\item \texttt{FDR}: the estimated FDR computed by \texttt{p0} $\cdot$ \texttt{False} / \texttt{FDR}.
\end{itemize}

Note that the expected number of false positives is given by \texttt{p0} $\times$ \texttt{False}
such that the number of falsely called genes denoted by \texttt{False} is only equal to the expected
number of false positives if \texttt{p0} $=1$.

If one would like to use the Wilcoxon rank sum statistic instead of a moderated
$t$-statistic, the analysis can be done by

<<>>=
sam.out2 <- sam(golub, golub.cl, method = wilc.stat, rand = 123)
@

A little bit more information about the former SAM analysis can be obtained by

<<>>=
summary(sam.out)
@

The last four columns of this table show the values of 
$\text{cut}_\text{up}(\Delta)$ and $\text{cut}_\text{low}(\Delta)$, which are the upper and lower cutoffs
for a gene to be called differentially expressed, and $i_1$ and $i_2$, which are the indices of the (ordered)
genes specifying the cutoffs (for details, see Schwender, 2007).

The output of this analysis with \texttt{sam} contains a table of statistics for a set of initial values
of $\Delta$. If other values of $\Delta$, let's say $1.5, 1.6, 1.7, \ldots,
2.4$, are of interest, one can use \texttt{print} or \texttt{summary} to get the number
of significant genes and the estimated FDR for these values of \texttt{delta}.

<<>>=
print(sam.out, seq(1.5, 2.4, 0.1))
@

\begin{figure}[!t]
\centerline{
\includegraphics[width=8.5cm]{samplot}}\vspace{-15pt}
\caption{SAM plot for $\Delta=2.4$.}\label{samplot}
\vspace*{12pt}
\end{figure}

The function \texttt{plot} can be used to generate a SAM plot for a specific value of
\texttt{delta}.

\begin{verbatim}
> plot(sam.out, 2.4)
\end{verbatim}

Note that the SAM plot is only generated if \texttt{delta} is a numeric value. If \texttt{delta} is
a vector or \texttt{NULL}, a graphical representation of the table produced by \texttt{print} is
plotted. 


The function \texttt{identify} makes it possible to obtain information about the genes by
clicking on the SAM plot.

\begin{verbatim}
> identify(sam.out)
\end{verbatim}

If \texttt{chip}, i.e.\ the chip name (e.g., "hgu133plus2"), is specified and \texttt{ll = TRUE}
in \texttt{identify}, then the locus link and the
symbol of the gene corresponding to the identified point are added to the output. For example,
clicking on the point nearest to the upper right corner, i.e.\ the point corresponding to the
gene with the largest positive expression score $d$, produces 

\begin{verbatim}
          d.value  stdev p.value q.value R.fold
M27891_at  8.1652 0.2958       0       0 7.2772
\end{verbatim}

which does not contain the locus links since the chip type has not been specified.

If the chip name has been specified either by \texttt{chip} or by setting \texttt{data} to an
\texttt{ExpressionSet} object, one can set \texttt{browse = TRUE} in \texttt{identify}. This opens the NCBI
webpage corresponding to the Entrez / locus link of the gene identified by clicking on the SAM plot.

Gene-specific information about the genes called differentially expressed using a specific
value of $\Delta$ (here $\Delta=3.3$) can be obtained by

<<>>=
sum.sam.out <- summary(sam.out, 3.3)
sum.sam.out
@

The generated table contains the row numbers of the identified genes in the data matrix used
(\texttt{Row}), the values of the test statistic (\texttt{d.value}) and the corresponding standard
deviations, i.e.\ the values of the denominator of this statistic (\texttt{stdev}), 
the unadjusted p-values (\texttt{rawp}),
the q-values (\texttt{q.value}), the fold changes (\texttt{R.fold}), and the names of the identified genes
as specified in \texttt{sam} by \texttt{gene.names} (\texttt{Name}).

By default, in the output of \texttt{summary} the identified variables are called genes (or SNPs if 
\texttt{method = cat.stat} is used). To change this, use, e.g.,

<<>>=
print(sum.sam.out, varNames = "Proteins")
@

if protein data is analyzed.  

The rows of \texttt{golub} that contain the values of the differentially expressed genes
can also be obtained by

<<>>=
sum.sam.out@row.sig.genes
@

the general information about the set of significant genes by

<<>>=
sum.sam.out@mat.fdr
@

and the gene-specific information by

<<>>=
sum.sam.out@mat.sig
@

To obtain just the names of the genes called significant using $\Delta=3.3$,

<<>>=
list.siggenes(sam.out, 3.3)
@

If the value of $\Delta$ and thus the number
of differentially expressed genes should be identified 
for which the FDR is controlled at a level of, say, 0.05, call

<<>>=
findDelta(sam.out, fdr = 0.05)
@

If there is one value of \texttt{Delta} for which \texttt{FDR} is exactly equal
to 0.05, only this value of $\Delta$ and the corresponding number of identified
genes is returned. If there is no such $\Delta$, then a lower and upper bound
for $\Delta$ is shown. Here, we would use $\Delta = 0.926010$, as our goal is
to find a value of $\Delta$ for which FDR is less than or equal to 0.05.

Similarly, the estimated FDR can be obtained for a specific number of identified
genes. E.g., calling 200 genes differentially expressed leads to an estimated FDR
of 0.00145.

<<>>=
findDelta(sam.out, genes = 200)
@

\vspace*{18pt}

\section{Empirical Bayes Analysis of Microarrays}\label{ebam}

If a moderated test score such as a moderated $t$- or $F$-statistic
should be used in the Empirical Bayes Analysis of Microarray, the
optimal choice of the fudge factor $a_0$ has to be specified prior to 
the actual EBAM analysis by a
stan\-dar\-dized version of the EBAM analysis (for details, see Efron et
al., 2001). This standardized analysis can be performed using the
function \texttt{find.a0}.

Considering the Golub et al.\ (1999) data as provided by the
package \texttt{multtest}, this analysis is thus done by

<<>>=
find.out <- find.a0(golub, golub.cl, rand = 123)
@

where \texttt{rand} sets the random number generator in an reproducible
state.

<<>>=
find.out
@

To obtain the number of identified genes and the estimated FDR for
another than the default value of \texttt{delta}, i.e.\ the minimum posterior
probability for a gene to be called differentially expressed, say
\texttt{delta} $=0.95$, call

<<>>=
print(find.out, 0.95)
@

Note that a gene is called differentially expressed if
\begin{itemize}
\item its posterior probability is larger than or equal to \texttt{delta},
\item there is no gene with a more extreme test score (i.e\ a larger value
if the score is positive, or a smaller value if the score is negative) than
the considered gene that is not called differentially expressed.
\end{itemize}

The output of \texttt{find.a0} suggest a value for the fudge factor $a_0$. This
suggestion is based -- as proposed by Efron et al.\ (2001) -- on the number of 
differentially expressed genes. (The value of $a_0$ is chosen that leads to the
largest number of differentially expressed genes.)

One, however, should also consider the estimated FDR and take a look on the 
plot of the logit-transformed posterior probabilities for the different values
of $a_0$. The plot displayed in Figure \ref{fig:finda0} can be generated by

\begin{figure}[!t]
\centerline{
\includegraphics[width=8.5cm]{finda0}}\vspace{-15pt}
\caption{Logit-transformed posterior probabilities for standardized EBAM analyses
using different values of the fudge factor $a_0$.}\label{fig:finda0}
\end{figure}

\begin{verbatim}
> plot(find.out)
\end{verbatim}

After having determined the optimal choice for $a_0$, the actual EBAM analysis
with the choice suggested by \texttt{find.a0} can be done by

<<>>=
ebam(find.out)
@

If one would like to use another value of $a_0$, let's say the minimum, i.e. the
0\% quantile, of the standard deviations of the genes, which is comprised by the
second row of \texttt{find.out}, then the EBAM analysis can be performed by

<<>>=
ebam(find.out, which.a0 = 2)
@

Since in \texttt{find.a0} a fast, crude estimate for the number of falsely called
genes (i.e.\ the number of genes expected under the null hypothesis to have a posterior
probability larger than or equal to \texttt{delta}) is used instead of the exact number. If the
analysis should directly start with calling \texttt{ebam}, it is thus necessary
to set \texttt{fast = TRUE} to employ this crude estimate, and therefore, obtain the
same results as with a combination of \texttt{find.a0} and \texttt{ebam}.

Thus,

<<>>=
ebam(golub, golub.cl, a0 = 0, fast = TRUE, rand = 123)
@

leads to the same results as

\begin{verbatim}
ebam(find.out)
\end{verbatim}

since $a_0 = 0$ is suggested by \texttt{find.out} as the optimal choice for the fudge factor. 

The exact mean number of falsely called genes is used when \texttt{fast = FALSE}, which is the
default setting for \texttt{fast}. Thus,

<<>>=
ebam.out <- ebam(golub, golub.cl, a0 = 0, rand = 123)
@

leads to the results of the EBAM analysis employing the exact mean number of falsely called
genes and the value for the fudge factor identified in the analysis with \texttt{find.a0}.

By default, the number of differentially expressed genes and the estimated FDR is computed
for \texttt{delta} $=0.9$. To get these statistics for other values of \texttt{delta}, say
0.91, 0.92, ..., 0.99, call

<<>>=
print(ebam.out, seq(0.91, 0.99, 0.01))
@

The posterior probabilities can be plotted by

\begin{verbatim}
> plot(ebam.out, 0.9)
\end{verbatim}

\begin{figure}[!t]
\centerline{
\includegraphics[width=8.5cm]{ebamplot}}\vspace{-15pt}
\caption{Plot of the posterior probabilities from the actual EBAM analysis using
\texttt{delta} $=0.9$ as threshold for a gene to be called differentially expressed.}\label{ebamplot}
\vspace*{12pt}
\end{figure}

where the threshold \texttt{delta} for a gene to be called differentially expressed
has to be specified (see Figure \ref{ebamplot}).

Having selected a value for \texttt{delta}, say 0.99997, gene-specific information can be
obtained by

<<>>=
summary(ebam.out, 0.99997)
@

The generated table contains the row numbers of the identified genes in the data matrix (\texttt{Row}),
the values of the test statistic (\texttt{z.value}), the posterior probabilities (\texttt{posterior}),
and an adhoc estimate of the local FDR (\texttt{local.fdr}, see Efron et al., 2001).

Instead of employing \texttt{find.a0} to identify an optimal value of $a_0$ and using
this value to compute a moderated $t$-statistic, it is also possible to use the
ordinary $t$-statistic by setting $a_0=0$ in \texttt{find.a0}. (Since for the Golub
et al.\ (1999) $a_0=0$ is the optimal choice for the fudge factor, this analysis has
already been done with output \texttt{ebam.out}).

Note that actually Welch's $t$-statistic (assuming unequal group variances) is computed 
by default. If you would like to use the (moderated) $t$-statistic assuming equal
group variances, call

<<>>=
ebam(golub, golub.cl, a0 = 0, var.equal = TRUE, rand = 123)
@

If you do not want to find the optimal value for the fudge factor, but would like to use
a reasonable value for $a_0$, say the median, i.e.\ the 50\% quantile, of the standard 
deviations of the genes, call

<<>>=
ebam(golub, golub.cl, quan.a0 = 0.5, rand = 123)
@

Instead of using a moderated $t$-statistic, it is also possible to employ Wilcoxon
rank sum statistic. Since in this case it is not necessary to specify a value for
$a_0$, an EBAM analysis with Wilcoxon rank sums is performed by

<<>>=
ebam(golub, golub.cl, method = wilc.ebam, rand =123)
@

\section{Writing your own Score Function}\label{own}

It is also possible to write your own function for computing other test scores as the one
provided by \texttt{siggenes} (see Chapter \ref{info}).

For all three wrappers \texttt{sam}, \texttt{find.a0} and \texttt{ebam}, this function must have
as input the two required arguments 
\begin{description}
\item[\texttt{data}:] A matrix or data frame containing the data. Each row of this data set should
correspond to one of the $m$ variables (e.g., genes), and each column to one of the $n$ observations.
\item[\texttt{cl}:] A vector consisting of the class labels of the observations.
\end{description}

The function can also have additional optional arguments, i.e.\ arguments for which a default is specified,
which can be called in \texttt{sam}, \texttt{find.a0} or \texttt{ebam}.

\subsection{\texttt{sam}}

The output of a function that should be used in \texttt{sam} must be a list con\-si\-sting of the following objects:
\begin{description}
\item[\texttt{d}:] A numeric vector containing the test scores of the genes.
\item[\texttt{d.bar}:] A numeric vector of length \texttt{na.exclude(d)} consisting of the sorted test scores
expected under the null hypothesis.
\item[\texttt{p.value}:] A numeric vector of the same length and order as \texttt{d} containing the $p$-values
of the genes.
\item[\texttt{vec.false}:] A numeric vector of the same length as \texttt{d} consisting of the one-sided
expected numbers of falsely called genes, i.e.\ the mean numbers of test scores that are larger than the test
scores of the genes if the test scores of the genes are positive, and the mean number of test scores smaller than the test
scores of the genes if the test scores of the genes are negative. Let's, e.g., assume that the observed test scores
are
\begin{verbatim}
c(2.4, -1.2, -0.8, 1.3)
\end{verbatim}

and the test scores obtained by two permutations of the class labels are given by
\begin{verbatim}
c(1.4, 2.5, -1.4, -0.8, 2, 1.1, -0.9, -2.3)
\end{verbatim}

then \texttt{vec.false} is given by
\begin{verbatim}
c(0.5, 1, 2, 1.5)
\end{verbatim}

since, e.g., one value from the two permutations is larger than 2.4 (thus, 1/2 is the one-sided expected number
of falsely called genes), and two values are smaller than -1.2 (thus, 2/2 is the one-sided expected number of
falsely called genes).

\item[\texttt{s}:] A numeric vector containing the standard errors of the expression values. If no standard errors
are available, set \texttt{s = numeric(0)}.
\item[\texttt{s0}:] A numeric value specifying the fudge factor. If the fudge factor is not computed, set 
\texttt{s0 = numeric(0)}.
\item[\texttt{mat.samp}:] A $B \times n$ matrix containing the permuted class labels. Set 
\texttt{mat.samp = matrix(numeric(0))} if the exact null distribution is known.
\item[\texttt{msg}:] A character vector containing messages that are displayed when the \texttt{SAM}
specific \texttt{S4} methods \texttt{print} and \texttt{summary} are called. Should end with
"$\backslash$\texttt{n}$\backslash$\texttt{n}". If no message should be shown, set \texttt{msg=""}
\item[\texttt{fold}:] A numeric vector containing the fold changes of the genes. Should be set to
\texttt{numeric(0)} if another analysis than a two-class analysis is performed.
\end{description}

Assume, e.g., that we would like to perform a SAM analysis with the ordinary $t$-statistic assuming
equal group variances and normality. The code of a function \texttt{t.stat} for such an analysis is given by

<<>>=
t.stat <- function(data, cl){
    require(genefilter) ||
        stop("genefilter required.")
    cl <- as.factor(cl)
    row.out <- rowttests(data, cl)
    d <- row.out$statistic
    m <- length(na.exclude(d))
    d.bar <- qt(((1:m) - 0.5)/m, length(cl) - 2)
    p.value <- row.out$p.value
    vec.false <- m * p.value/2
    s <- row.out$dm/d
    msg <- paste("SAM Two-Class Analysis",
         "Assuming Normality\n\n")
    list(d = -d, d.bar = d.bar, p.value = p.value,
        vec.false = vec.false, s = s, s0 = 0,
        mat.samp = matrix(numeric(0)),
        msg = msg, fold = numeric(0))
}
@

where \texttt{row.out\$dm} is a vector containing the numerators of the $t$-statistics.


Note that in the output of \texttt{t.stat} \texttt{d} is set to \texttt{-d}, since in
\texttt{rowttests} the mean of group 2 is subtracted from the mean of
group 1, whereas in \texttt{sam} with \texttt{method = d.stat} the difference is taken 
the other way around.

Now \texttt{t.stat} can be used in \texttt{sam} by setting \texttt{method = t.stat}:

<<>>=
sam(golub, golub.cl, method = t.stat)
@



\subsection{\texttt{find.a0}}

A function that should be used as \texttt{method} in \texttt{find.a0} must return
a list consisting of
\begin{description}
\item[\texttt{r}:] A numeric vector containing the numerator of the observed test scores
of the genes.
\item[\texttt{s}:] A numeric vector comprising the denominator of the observed test 
scores of the genes.
\item[\texttt{mat.samp}:] A matrix in which each row represents one of the permutations
of the vector \texttt{cl} of the classlabels.
\item[\texttt{z.fun}:] A function that computes the values of the test statistics for
each gene. This function must have the required arguments \texttt{data} and \texttt{cl},
but no further arguments. Its output must be a list consisting of two objects: One called
\texttt{r} that contains the numerator of the test statistic, and the other called \texttt{s}
comprising the denominator of the test score. (This function is used to compute the
test scores when considering the permuted class labels.)
\end{description}

In \texttt{find.a0}, both the observed and the permuted $z$-values are monotonically transformed
such that the observed $z$-values follow a standard normal distribution. If the observed
$z$-values, however, should follow another distribution, the output of the function that should
be used in \texttt{find.a0} must also contain an object called
\begin{description}
\item[\texttt{z.norm}:] A numeric vector of the same length as \texttt{r} containing the 
appropriate quantiles of the distribution to which the observed $z$-values should be tranformed.
\end{description}

It is also possible to specify a header that is shown when \texttt{print}ing the output of 
\texttt{find.a0} by adding the object
\begin{description}
\item[\texttt{msg}:] A character vector containing the message that is displayed when 
the \texttt{FindA0} specific method \texttt{print} is called. Should end with
"$\backslash$\texttt{n}$\backslash$\texttt{n}".
\end{description}

For an example, let's assume that we would like to implement an alternative way for using a moderated
version of the ordinary $t$-statistic in \texttt{find.a0}. This function based on 
\texttt{rowttests} from the packages \texttt{genefilter} could, e.g., be specified by

<<>>=
t.find <- function(data, cl, B = 50){
    require(genefilter)
    z.fun <- function(data, cl){
        cl <- as.factor(cl)
        out <- rowttests(data, cl)
        r<- out$dm
        s<- r / out$statistic
        return(list(r = -r, s = s))
    }
    mat.samp <- matrix(0, B, length(cl))
    for(i in 1:B)
        mat.samp[i, ] <- sample(cl)
    z.out <- z.fun(data, cl)
    msg <- paste("EBAM Analysis with a Moderated t-Statistic\n\n")
    list(r = z.out$r, s = z.out$s,
        mat.samp = mat.samp, z.fun = z.fun, msg = msg)
}
@

Note that in the output of \texttt{z.fun} \texttt{r} is set to \texttt{-r}, since in
\texttt{rowttests} the mean of group 2 is subtracted from the mean of
group 1, whereas in \texttt{find.a0} with \texttt{method = z.find} the difference is taken 
the other way around.
 
Now \texttt{t.find} can be employed in \texttt{find.a0} by setting \texttt{method = t.find}:

<<>>=
t.out <- find.a0(golub, golub.cl, method = t.find, B = 100, rand =123)
t.out
@

which produces the same results as

<<>>=
find.a0(golub, golub.cl, var.equal = TRUE, rand =123)
@

and can also be used in the actual EBAM analysis:

<<>>=
ebam(t.out)
@

\subsection{\texttt{ebam}}

A function that should be employed as \texttt{method} in \texttt{ebam} must output
a list consisting of the objects
\begin{description}
\item[\texttt{z}:] A numeric vector containing the test scores of the genes.
\item[\texttt{ratio}:] A numeric vector of the same length as \texttt{z} comprising
the values of the ratio $f_0(z)/f(z)$, where $f_0$ is the distribution of the test
scores expected under the null hypothesis, and $f$ is the distribution of the 
observed test scores.
\item[\texttt{vec.pos}:] A numeric vector of the same length as \texttt{z} consisting
of the numbers of test scores expected under the null hypothesis to be larger than or
equal to \texttt{abs(z[i])}, \texttt{i} $=1,\ldots$, \texttt{length(z)}.
\item[\texttt{vec.neg}:] A numeric vector of the same length as \texttt{z} consisting
of the numbers of test scores expected under the null hypothesis to be smaller than or
equal to \texttt{-abs(z[i])}, \texttt{i} $=1,\ldots$, \texttt{length(z)}.
\end{description}

It is also possible to specify a header that is shown when \texttt{print}ing or
\texttt{summary}zing the output of \texttt{ebam} by adding the object
\begin{description}
\item[\texttt{msg}:] A character vector containing the message that is displayed when 
the \texttt{EBAM} specific methods \texttt{print} or \texttt{summary} are called. 
Should end with "$\backslash$\texttt{n}$\backslash$\texttt{n}".
\end{description}

As an example, let's assume we would like to use the ordinary $t$-statistic in 
\texttt{ebam} and assume normality. A function for performing this type of analysis
is, e.g., given by

<<>>=
t.ebam<-function(data, cl){
    require(genefilter)
    cl <- as.factor(cl)
    out <- rowttests(data, cl)
    z <- -out$statistic
    z.dens <- denspr(z)$y
    m <- length(z)
    vec.pos <- m * out$p.value / 2
    z.null <- dt(z, length(cl) - 2)
    msg<-paste("EBAM Analysis with t-Statistic Assuming Normality.\n\n")
    list(z = z, ratio = z.null/z.dens, vec.pos = vec.pos,
        vec.neg = vec.pos, msg = msg)
}
@

where \texttt{denspr} is a function for estimating a density based on
a histogram and a Poisson regression (cf.\ Efron and Tibshirani, 1996,
and the help file for \texttt{denspr}),
and \texttt{vec.neg = vec.pos} since the null distribution is symmetric.

Now \texttt{t.ebam} can be used in \texttt{ebam} by setting \texttt{method = t.ebam}:

<<>>=
ebam(golub, golub.cl, method = t.ebam)
@

%\vspace{1cm}

\begin{thebibliography}{}

\item[Efron, B.,] and Tibshirani, R.\ (1996). 
Using Specially Designed Exponential Families for Density Estimation. 
\textit{Annals of Statistics}, 24, 2431--2461. 

\item[Efron, B.,] and Tibshirani, R.\ (2002). Empirical Bayes Methods
and False Discovery Rates for Microarrays. \textit{Genetic
Epidemiology}, 23, 70--86.

\item[Efron, B.,] Tibshirani, R., Storey, J.\ D., and Tusher, V.\ (2001).
Empirical Bayes Analysis of a Microarray Experiment.
\textit{Journal of the American Statistical Association}, 96,
1151--1160.

\item[Golub, T.\ R.,] Slonim, D.\ K., Tamayo, P., Huard, C., Gaasenbeek,
M., Mesirov, J.\ P., Coller, H., Loh, M.\ L., Downing, J.\ R.,
Caliguiri, M.\ A., Bloomfield, C.\ D., and Lander, E.\ S.\ (1999).
Molecular Classification of Cancer: Class Discovery and Class
Prediction by Gene Expression Moni\-to\-ring. \textit{Science},
286, 531--537.

\item[Schwender, H.] (2003). Assessing the False Discovery Rate 
in a Statistical Analysis of Gene Expression Data. \textit{Diploma Thesis}, 
Department of Statistics, University of Dortmund.

\item[Schwender, H.] (2007). Statistical Analysis of Genotype and Gene
Expression Data. \textit{Dissertation}, Department of Statistics, 
University of Dortmund.

\item[Schwender, H.,] Krause, A., and Ickstadt, K.\ (2003). Comparison of
the Empirical Bayes and the Significance Analysis of Microarrays.
\textit{Techical Report}, University of Dortmund, Dortmund, Germany.

\item[Schwender, H.,] Krause, A., and Ickstadt, K. (2006). Identifying Interesting
Genes with siggenes. \textit{RNews}, 6(5), 45--50.

\item[Tusher, V.\ G.,] Tibshirani, R., and Chu, G.\ (2001).
Significance Analysis of Microarrays Applied to the Ionizing
Radiation Response. \textit{Proceedings of the National Academy
of Science}, 98, 5116--5121.

\end{thebibliography}



\end{document}