%% LyX 1.6.9 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[english]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}

\usepackage[unicode=true,pdfusetitle,
 bookmarks=true,bookmarksnumbered=false,bookmarksopen=false,
 breaklinks=true,pdfborder={0 0 0},backref=false,colorlinks=false]
 {hyperref}
\usepackage{breakurl}

\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\VignetteIndexEntry{Introduction to the les package: Identifying Differential Effects in Tiling Microarray Data with the Loci of Enhanced Significance Framework}
%\VignettePackage{les}

%\usepackage{fancyvrb}
%\fvset{listparameters={\setlength{\topsep}{0pt}}}

\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rmethod}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\texttt{#1}}}

\makeatother

\begin{document}

\title{Introduction to the \Rpackage{les} package:\\
Identifying Differential Effects in\\
Tiling Microarray Data with the\emph{}\\
\emph{Loci of Enhanced Significance} Framework}


\author{Julian Gehring, Clemens Kreutz}

\maketitle
\SweaveOpts{width=7,height=5}

<<echo=false>>=
set.seed(1)
options(width=65, SweaveHooks=list(fig=function() par(mar=c(5.1, 5.1, 2.1, 1.1))))
@


\section{Introduction}

Tiling microarrays have become an important platform for the investigation
of regulation in expression, DNA modification, and DNA-protein interaction.
Since unbiased in relation to annotation they provide an powerful
tool for biological research. Beside the analysis of single conditions,
the investigation of regulation between multiple experimental conditions
is an important part of current research. A common approach consists
in assessing the differential effect on the level of individual probes
and thereby computing p-values $p_{i}$ for each probe $i$ with a
suitable statistical test, for example an F-test, independently. Since
the targets exhibiting differential effects cover multiple probes,
a reasonable next step involves combining information from neighboring
probes. While several approaches for this purpose exist, most of them
lack a statistical interpretation of the results or are specific for
certain platforms and statistical tests.

The approach used here utilizes the fact that p-values in regions
without an differential effect are uniformly distributed. In the case
that regulation is present the distribution of $p_{i}$ is shifted
towards zero; such regions are therefore referred to as \emph{Loci
of Enhanced Significance (LES)}. By assessing the uniform component
of the p-value distribution within a sliding window, the local degree
of significant probes along the genome is estimated. The \Rpackage{les}
package implements the method for detecting \emph{LES} in tiling microarray
data sets, independent of the underlying statistical test and hence
is suitable for a wide range of applications.


\section{Data set and statistic testing on probe level}

The data set used in this vignette is part of a quality control study
for tiling microarrays \cite{johnson_systematic_2008}, in which spike-ins
were used to assess the influence of microarray platforms, preparation
procedures, and analysis algorithms on the accuracy and reproducibility
of ChIP-chip experiments. Here, the expression intensities of one
region from the \emph{undiluted} data set investigated with Affymetrix
tiling microarrays is selected, consisting of 452 probes and two conditions
with three replicates each. The data has been normalized using quantile
normalization and probe positions remapped to a common reference.

Normalized expression intensities are stored in the matrix \Robject{exprs},
with rows representing probes and columns arrays. The corresponding
names contain the position of the probes and the condition of the
samples, respectively. The properties of the spike-in DNA, yielding
a region with differential effect, are described in the \Robject{reference}
data frame.

<<loadData>>=
library(les)
data(spikeInData)
head(exprs)
dim(exprs)
pos <- as.integer(rownames(exprs))
condition <- as.integer(colnames(exprs))
reference
region <- as.vector(reference[ ,c("start", "end")])
@ 

In order to assess the differential effect between the two conditions
on the level of individual probes, we use a modified t-test provided
by the \Rpackage{limma} package \cite{smyth_limma_2005}. This offers
an increased statistical power compared to a classical Student's t-test
for small sample sizes, as present in most tiling microarray experiments.

<<estimateProbeLevelStatistics>>=
library(limma)
design <- cbind(offset=1, diff=condition)
fit <- lmFit(exprs, design)
fit <- eBayes(fit)
pval <- fit$p.value[, "diff"]
@

<<plotProbeLevelStatistics, fig=TRUE, include=FALSE, echo=TRUE>>=
plot(pos, pval, pch=20, xlab="Probe position", ylab=expression(p))
abline(v=region)
@

\begin{figure}
\centering
\includegraphics{les-plotProbeLevelStatistics}
\caption{\label{fig:plotProbeLevelStatistics}Probe-level p-values}
\end{figure}


\section{Incorporation of neighboring probes}

The accumulation of small p-values indicates a \emph{LES} at the location
of the spike-in, whereas the response of individual probes may be
not reliable (Figure \ref{fig:plotProbeLevelStatistics}). Therefore,
an incorporation of neighboring probes should be beneficial and yield
improved results in the detection of the differential effect. For
each probe $i$ the surrounding p-values $p_{j}$ get weights assigned,
corresponding to the distance of probe $i$ and $j$. Being free in
the definition of the weighting function, the spatial relationship
of probes in observing the same effect can be accounted for. A weighted
cumulative density is computed and the fraction of significant $p$
is estimated by an iterative regression. The method is based on the
fact that p-values are uniformly distributed under the null hypothesis
$H_{0}$ whereas p-values violating $H_{0}$ are shifted towards smaller
values \cite{bartholom_estimation_2009}, which is closely related
to the estimation of a false discovery rate. This results in $\Lambda_{i}$,
constituting an estimate of the fraction of p-values violating $H_{0}$
and therefore the degree of significance in the local surrounding
of the evaluated position.

For the analysis, we first store all relevant data in an object of
class \Rclass{Les} by calling the \Rmethod{Les} method. The only
data required are the positions of the probes $i$, the corresponding
p-values $p_{i}$ from the statistical test, and optionally their
chromosomal location.

<<constructLes>>=
res <- Les(pos, pval)
@

Then we can compute our estimate $\Lambda_{i}$ for which we have
to specify a weighting window. Here, we use a triangular window taken
as default with a size of 200bp. The power of detecting a region will
be high if the window matches approximately the properties of the
regulated region. In many experiments a rough prior knowledge is available
which can be sufficient for choosing a suitable window. Further, optimal
weighting windows can be estimated from the data directly.

<<estimateLes>>=
res <- estimate(res, win=200)
@

All data, results and parameters are stored in the \Robject{res}
object of class \Rclass{Les}. We can get an overview of the results
with the \Rmethod{print}, \Rmethod{summary}, or \Rmethod{plot}
method (Figure \ref{fig:showPlotLes}).

<<showPlotLes, fig=TRUE, include=FALSE, echo=TRUE>>=
res
summary(res)
plot(res)
abline(v=region)
@

\begin{figure}
\centering
\includegraphics{les-showPlotLes}
\caption{\label{fig:showPlotLes}Results with triangular window.}
\end{figure}

For comparison we analyze and plot the same data with a rectangular
weighting window (Figure \ref{fig:showPlotLes2}). In this example
the rectangular window leads to better results. The \Rpackage{les}
package includes four predefined windows; custom functions can also
be used, as described in Section \ref{sec:Specification-of-own}.

<<showPlotLes2, fig=TRUE, include=FALSE, echo=TRUE>>=
res2 <- estimate(res, win=200, weighting=rectangWeight)
res2
plot(res2)
abline(v=region)
@

\begin{figure}
\centering
\includegraphics{les-showPlotLes2}
\caption{\label{fig:showPlotLes2}Results with rectangular window.}
\end{figure}


\section{Parameter estimation}

To turn the continuous $\Lambda_{i}$ into distinct regions of interest
we define a threshold $\Theta$. It can be derived from the data by
estimating the number of probes with a significant effect on the whole
array.

<<threshold>>=
res2 <- threshold(res2, grenander=TRUE, verbose=TRUE)
@

Given $\Theta$ we can look for regions that have a continuous $\Lambda_{i}\geq\Theta$.
The \Rmethod{regions} method takes by default the estimated $\Theta$
from the previous step. We can also pass our own estimate for $\Theta$
with the \Rfunarg{limit} argument. Further restrictions can be imposed
on the regions such as the minimal length of a region and the maximum
gap allowed between probes in one region. The \Rmethod{[} method
allows to access any data slot of an object of class \Rclass{Les}.
Here, we use it to extract the data frame with the estimated regions.

<<regions>>=
res2 <- regions(res2, verbose=TRUE)
res2
res2["regions"]
@

<<plotRegions, fig=TRUE, include=FALSE, echo=TRUE>>=
plot(res2, region=TRUE)
abline(v=region)
@

\begin{figure}
\centering
\includegraphics{les-plotRegions}
\caption{\label{fig:plotRegions}Results with estimates for regions.}
\end{figure}


\section{Calculation of confidence intervals}

By bootstrapping probes in each window, confidence intervals for the
statistic $\Lambda_{i}$ can be computed. Since confidence intervals
are primarily interesting in regions of interest and bootstrapping
is by its nature computationally demanding, we can restrict the calculation
to a subset of probes. 

<<plotCi, fig=TRUE, include=FALSE, echo=TRUE>>=
subset <- pos >= 5232400 & pos <= 5233100
res2 <- ci(res2, subset, nBoot=50, alpha=0.1)
plot(res2, error="ci", region=TRUE)
@

\begin{figure}
\centering
\includegraphics{les-plotCi}
\caption{\label{fig:plotCi}Results with confidence intervals and estimates for regions.}
\end{figure}


\section{Visualization}

The \Rmethod{plot} method uses a special system in order to customize
the graphical elements of the figure. It allows to refer to all its
components with the name of the additional input argument; its value
is a list containing named graphical parameters for the underlying
plot function. As an example, we plot smaller region of the chromosome
with confidence intervals, estimated region, and the probe density.
Further, we adapt several parameters changing the graphical representation.
For details, please refer to the help of the \Rpackage{les} package.

<<plotOptions, fig=TRUE, include=FALSE, echo=TRUE>>=
plot(res2, error="ci", region=TRUE, rug=TRUE, xlim=c(5232000, 5233000), sigArgs=list(col="firebrick4"), plotArgs=list(main="LES results", yaxp=c(0, 1, 2)), limitArgs=list(lty=2, lwd=3), regionArgs=list(col="black", density=20), probeArgs=list(col="dodgerblue4", type="p"))
@

\begin{figure}
\centering
\includegraphics{les-plotOptions}
\caption{\label{fig:plotOptions}Results with customized graphical parameters.}
\end{figure}


\section{Exporting results to external software}

With the \Rmethod{export} method the estimated regions as well as
$\Lambda_{i}$ can be saved to standard files formats, in order to
facilitate the export of the results to other software and genome
browsers. The region estimates can be exported to the \emph{bed} and
\emph{gff} formats, the test statistic $\Lambda_{i}$ to the \emph{wig}
format.

<<export>>=
bedFile <- paste(tempfile(), "bed", sep=".")
gffFile <- paste(tempfile(), "gff", sep=".")
wigFile <- paste(tempfile(), "wig", sep=".")
export(res2, bedFile)
export(res2, gffFile, format="gff")
export(res2, wigFile, format="wig")
@


\section{Specification of custom weighting windows\label{sec:Specification-of-own}}

With the \Rfunction{triangWeight}, \Rfunction{rectangWeight}, \Rfunction{epWeight},
and \Rfunction{gaussWeight} functions, four weighting windows are
included in the \Rpackage{les} package, providing a triangular, rectangular,
Epanechnikov, and Gaussian window, respectively. We can also specify
custom window functions and pass it as \Rfunarg{weighting} argument
in the \Rmethod{estimate} method. They have to be specified as a
function depending on the distance of the probes (\Rfunarg{distance})
and the window size (\Rfunarg{win}), as illustrated here with a triangular
weighting.

<<customWeightingFunction>>=
weightFoo <- function(distance, win) {
weight <- 1 - distance/win
return(weight)
}
resFoo <- estimate(res, 200, weighting=weightFoo)
@

<<chi2, eval=FALSE, echo=FALSE, results=hide>>=
regions <- res["regions"]
winsize <- seq(100, 300, by=20)
res2 <- chi2(res2, winsize, regions, offset=2500)
plot(winsize, x["chi2"], type="b")
@

\newpage{}

\bibliographystyle{plain}
\bibliography{ref_les}



\section*{Session information}

<<sessionInfo, results=tex, echo=FALSE>>=
toLatex(sessionInfo(), locale=FALSE)
@
\end{document}