% \VignetteIndexEntry{Introduction to OLIN}
% \VignetteDepends{OLIN}
% \VignetteKeywords{Expression Analysis, Preprocessing}
% \VignettePackage{OLIN}

\documentclass[a4paper,11pt]{article}
\usepackage{a4wide}
\title{Introduction to OLIN package}
\author{Matthias E. Futschik\\ SysBioLab,  Universidade do Algarve\\ URL: \textit{http://olin.sysbiolab.eu}}

\begin{document}
\maketitle
\tableofcontents

\section{Overview}
Microarray measurements are affected by a variety of systematic experimental errors 
limiting the accuracy of data produced. 
Two well-known systematic errors for two-colour arrays are    
the so-called intensity-dependent and spatial (location-dependent) dye bias. 
Normalisation aims to correct for systematic errors in microarray data.

The common linear (or global) normalisation method  often fails to correct for dye  bias 
as this bias is usually not linear.  Although non-linear normalisation procedures have been 
able to reduce the systematic errors,
these methods are based on default parameters and leave it to the user to choose ``good'' 
parameters. The optimal adjustment of the normalisation models to the data, however,
can be crucial for the efficiency of the normalisation process \cite{toni}. 

The OLIN (\textit{Optimised Local Intensity-dependent Normalisation})  
R-package includes two normalisation schemes based on iterative  local regression
and model selection. Both  schemes aim to correct for intensity- and location-dependent  dye bias in  microarray data. For model selection (parameter optimisation), generalized cross-validation (GCV) is used. 

Additionally, several procedures to assess the efficiencies of normalisation are implemented
in the package. 

A graphical user interface for OLIN has been implemented in the package OLINgui, that is included in the OLINgui package available at the Bioconductor repository.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%
\section{Installation requirements}
Following software is required to run the OLIN-package:
\begin{itemize} 
\item  R (> 2.0.0). For installation of R, refer to \textit{http://www.r-project.org}.
\item  R-packages: methods, stats, locfit. For installation of these add-on packages, refer to \textit{http://cran.r-project.org}.
\item Bioconductor packages: Biobase, marray. Refer to \textit{http://www.bioconductor.org} for installation. 
\end{itemize}

If all requirements are fulfilled, the OLIN add-on R-package can be installed. To see how to install add-on R-packages on your computer system, start \textit{R} and type in \textit{help(INSTALL)}.
Optionally, you may use the R-function \textit{install.packages()}. 
Once the OLIN package is installed, you can load the package by 

<<>>=
  library(OLIN) 
@



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Inspection for  intensity-dependent and spatial artifacts}
\label{visu}
Microarray data often contains many systematic errors.
 Such errors have to be identified and removed
 before further data analysis is conducted.  The most basic approach for identification 
 is the visual inspection 
of MA- and MXY-plots. MA-plots display the logged fold change 
($M=log_{2}(Ch2) - log_{2}{Ch1}$)
with respect to the average logged spot intensity  ($A=0.5(log_{2}(Ch1) + log_{2}{Ch2})$).
MXY-plots display $M$ with respect to the corresponding spot location. 
More stringent, but also computationally more expensive,  are statistical tests presented in section \ref{stats}. 


For illustration, we examine a cDNA microarray experiment comparing  gene expression in two colon cancer cell lines (SW480/SW620). The SW480 cell line was derived from a primary tumor, whereas 
the SW620 cell line was cultured from a lymph node metastasis of the same patient.
 Sharing the same genetic background, these cell lines serve as a  model
 of cancer progression~\cite{genomeletters}. The comparison was direct i.e. without using a 
reference sample. cDNA derived from SW480 cells was labeled by Cy3; cDNA derived 
from SW620 was labeled by Cy5. The SW480/620 experiment consisted of four 
technical replicates. The data is stored as object of the class \texttt{marrayRaw} (see
the documentation for the package \texttt{marray}). The average logged spot intensity
$A$ and logged fold changes $M$ can be accessed by using   the slot accessor methods \texttt{maA} and \texttt{maM}, respectively.


First we want to load the data and inspect the spatial distribution of foreground and
background intensities in both channels. This can be done using the  function \texttt{fgbg}
that plots the spatial distribution of fore- and background spot intensities for both channels (figure~\ref{fgbg}):

<<eval=FALSE,echo=TRUE>>=
  data(sw)
  fgbg.visu(sw[,3])
@

\begin{figure}
\centering
<<fig=TRUE,echo=FALSE>>=
  data(sw)
  fgbg.visu(sw[,3])
@
\caption{Foreground and background fluorescence intensities for Cy5-channel (top row) and Cy3-channel (bottom) row.}
\label{fgbg}
\end{figure}

%%%%%%%%%%%%%%%
The  quantity that we are interested in are the (logged) fold-change $M$. To inspect
any existing intensity- or location-dependent bias of $M$, MA- and MXY-lots can be employed.
For MA-lots, the basic \texttt{plot} function can be used (figure~\ref{maplot}). 
MXY-lots can be generated 
by \texttt{mxy.plot}(figure~\ref{mxyplot}). 
Note that the function   \texttt{mxy.plot} assumes the standard array
layout as defined for \textit{marrayRaw/marrayNorm} objects.
  
<<eval=FALSE,echo=TRUE>>=
plot(maA(sw[,3]),maM(sw[,3]),xlab="A",ylab="M")
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
<<fig=TRUE,height=3.5,width=3.5,echo=FALSE>>=
par(cex.lab=0.6)
plot(maA(sw[,3]),maM(sw[,3]),xlab="A",ylab="M",pch='.')
@
\caption{MA-plot of SW480/620 array 3}\
\label{maplot}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

<<eval=FALSE,echo=TRUE>>=
mxy.plot(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),
                 Nsc=maNsc(sw),Nsr=maNsr(sw))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
<<fig=TRUE,height=5,width=5.5,echo=FALSE>>=
mxy.plot(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),
                 Nsc=maNsc(sw),Nsr=maNsr(sw))
@
\caption{MXY-plot of SW480/620 array 3 with columns and rows as proxies for spatial location}
\label{mxyplot}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Additionally the distribution of absolute values of $M$ can be displayed. This visualisation
may give an indication whether the span of $M$ is equal across the spatial dimensions of
the array (figure~\ref{mxyabsplot}). 
To plot $abs(M)$, the function \texttt{mxy.abs.plot} can be employed :


<<echo=TRUE,eval=FALSE>>=
data(sw.xy)
mxy.abs.plot(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),Nsc=maNsc(sw),Nsr=maNsr(sw))
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
<<fig=TRUE,height=5,width=5.5,echo=FALSE>>=
data(sw.xy)
mxy.abs.plot(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),Nsc=maNsc(sw),Nsr=maNsr(sw))

@
\caption{MXY-plot of absolute logged fold changes}
\label{mxyabsplot}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

For the two MXY-plots above, the column and row indices were used as proxies for location 
of the spot. We can use, however, the physical spot location as determined by the 
scanner for the MXY plots (if the spots' $X$-and $Y$-coordinates are available.) 
Especially, if the gaps between the blocks printed by distinct pins are large,
this option may give a better physical representation of the array.
Using the function \texttt{mxy2.plot}, such plots can be generated (figure~\ref{mxy2plot}): 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
<<eval=FALSE,echo=TRUE>>=
data(sw.xy)
mxy2.plot(maM(sw)[,3],X=sw.xy$X[,3],Y=sw.xy$Y[,3],
          Ngc=maNgc(sw),Ngr=maNgr(sw),Nsc=maNsc(sw),Nsr=maNsr(sw))

@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
\centering
<<fig=TRUE,height=5,width=5.5,echo=FALSE>>=
data(sw.xy)
mxy2.plot(maM(sw)[,3],X=sw.xy$X[,3],sw.xy$Y[,3],
          Ngc=maNgc(sw),Ngr=maNgr(sw),Nsc=maNsc(sw),Nsr=maNsr(sw))

@
\caption{MXY plot of SW480/620 array 1}
\label{mxy2plot}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Optimised local intensity-dependent normalisation}
Inspection of the MA- and MXY-plots in the previous section indicated a intensity- and location-dependent dye 
bias. Additionally, the plots show clearly the non-linearity of the biases. To correct for these systematic errors,
 we apply first a local regression of \emph{M} with respect to \emph{A} and spot location \emph{X,Y}.
The residuals of the regression are normalised \emph{M}.This procedure is motivated by the hybridisation model introduced in reference~\cite{toni}.

 The assumption here are: i) Most genes arrayed are not differentially expressed or 
up-and down-regulation is balanced over the range of \emph{A},  ii) the spotting procedure did not generate an spatial accumulation of up- or down-regulated genes in localized areas on the array. 
The validity of both assumptions  should be carefully checked for the data to be normalised. 


The local regression is performed using LOCFIT algorithm which is based on the same 
computational ideas as popular lowess method~\cite{loader,cleveland} .  Required parameters for LOCFIT 
are the smoothing parameter $\alpha$  and the scale parameter(s) \emph{s}
for multi-dimensional regression.
The parameter $\alpha$  specifies the fraction of points that are included in the neighborhood for 
local fitting  and can have a  value between 0 and 1. Larger  values lead to smoother fits.  
The setting of scale parameters s is necessary for a local regression with multiple  
predictor variables. The scale parameters $s$  determine the  amount of smoothing in one direction 
compared to the other directions.   


Choosing accurate regression parameters is crucial for the quality of the normalisation. 
Too large smoothing parameters, for example, lead to a poor fit where local data features are missed
and underfitting occurs.  If the smoothing parameter is too small,  overfitting is produced and
the residuals subsequently underestimated. 
To optimise the parameter setting, two procedures were developed: OLIN and OSLIN. Both methods
are based  on iterative local regression and parameter selection by GCV.
 They aim to correct for systematic errors linked with spot intensity and location. A detailed mathematical description can be found in reference~\cite{toni}.


\subsection{OLIN}
The OLIN (and OSLIN) are implemented in R by the function \texttt{olin}:

\begin{quote}
\texttt{olin(object,X,Y,alpha,iter,OSLIN,scale,weights,genepix,bg.corr,...)}
\end{quote}


\noindent The input arguments are  as follows:
           \begin{description}
           \item[object] Object of class \emph{marrayRaw} or \emph{marrayNorm} containing 
                         data of the microarray experiment such as spot intensities and array layout
                         for a batch of microarrays.
                         To generate such an object, see the \emph{marryClasses} documentation.
           \item[X]      matrix with x-coordinates of spots. Each column corresponds to one array in the batch. The location is spots detected by scanner
                         is usually 
                         found in the output file program of  the scanning program. 
                          If X=NA, the spotted columns on the  array are used as proxies for the location in x-direction (\emph{default})
           \item[Y] matrix with y-coordinates of spots. If Y=NA, spotted rows on array are used
            as proxies for the location in y-direction (\emph{default}).
           \item[alpha] Vector of $\alpha$  parameters that are tested in the GCV procedure. The $\alpha$  parameter 
                  defines the smoothness of fit.  The default vector is \texttt{seq(0.05,1,0.1)}.
           \item[iter] Number of iterations in the OLIN or OSLIN procedure. The default value is 3. 
           \item[OSLIN] If OSLIN =TRUE, a subsequent optimised scaling of the range of M across the array
                        is performed.
                         The default value is \emph{FALSE}, i.e. OLIN is performed. 
           \item[scale] Vector of scale parameters \emph{s} that are tested in a GCV procedure. 
                    This  allows  a different amount of smoothing in Y-direction compared 
                    to smoothing in X-direction. The default values of \emph{s} are 
                    \texttt{c(0.05,0.1,0.5,1,2,10,20))}.
           \item[weights] Matrix of prior weights of spots for the regression procedures.
                          Spots can be excluded to be used
                          for the local regression by assigning them to zero.  
                          If the weight matrix include negative values, these will be
	                   set to zero.\\
                          For the weights, the weight matrix stored 
                          in the \texttt{maW} slot of  \emph{marrayRaw} objects can be used 
                           (\texttt{weights=maW(object)}). Defaults is \texttt{NA} resulting
                          in uniform weigths of all spots. 
	\item[genepix]{If \texttt{genepix} is set to 
			\texttt{TRUE},  spot weights equal zero or larger are set
			to one for the local regression whereas negative
			spot with negative weights are not used   for the
			regression. The argument \texttt{genepix} should
			be set to \texttt{TRUE}, if
			\texttt{weights=maW(object)} is set and  spot
			quality weights derived by GenePix are stored in
			\texttt{maW(object)}.}
            \item[bg.corr] backcorrection method (for \emph{marrayRaw} objects)  :
			\emph{none} - no background correction, \emph{sub} 
                          - simple background subtraction (default), 
            \emph{movingmin} -  background intensities are first averaged over 3x3 grids
  of neighbouring spots and subsequently substracte, 
 \emph{minimum} - zero or negative intensities after background
        correction are set equal to half the minimum of positive
        corrected intensities, \emph{edwards}-  background correction based on  log-linear
       interpolation, or \emph{normexp}-  background correction based on fitting
       procedure. For further details and references, please refer to to the  help page
of \texttt{backgroundCorrect2} of the \emph{OLIN} package or \texttt{backgroundCorrect} of the \emph{limma} package.
 

          \item[...] Further arguments passed to the  \texttt{locfit} function.	
      \end{description}


For illustration, we apply the OLIN scheme to normalise the third array of the SW480/620 data set. 
(Note that the function is not restricted to single slide normalisation but normalises all arrays
in the given \emph{marray} object.)

<<>>=
norm.olin <- olin(sw[,3],X=sw.xy$X[,3],Y=sw.xy$Y[,3])
@

Inspection of the MA- and MXY-plot indicated that OLIN  was able to correct for the  
intensity- as well as  the  location-dependent dye bias. The
 residuals are well  balanced around zero in the MA-plot (figure~\ref{maolin}).
Similarly, spatial bias is no longer apparent (figure~\ref{mxyolin}).
Spots with positive and negative log ratio M were evenly distributed across the slide. 
The statistical tests
applied in the next section will confirm  these findings. 

<<echo=TRUE,eval=FALSE>>=
plot(maA(norm.olin),maM(norm.olin),main="OLIN",pch=".")

mxy.plot(maM(norm.olin),Ngc=maNgc(norm.olin),Ngr=maNgr(norm.olin),
                Nsc=maNsc(norm.olin),Nsr=maNsr(norm.olin),main="OLIN")

@


%%%%%%%%%%%%%%%%

\begin{figure}[t]
\centering
<<fig=TRUE,width=3,height=3,echo=FALSE>>=
par(cex.main=0.8)
par(cex.lab =0.8)
plot(maA(norm.olin),maM(norm.olin),main="OLIN",pch=".")
@
\caption{MA-plot of array normalized by  OLIN }
\label{maolin}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{figure}
\centering
<<echo=FALSE,fig=TRUE,height=5,width=5.5>>=
 
 mxy.plot(maM(norm.olin),Ngc=maNgc(norm.olin),Ngr=maNgr(norm.olin),
                Nsc=maNsc(norm.olin),Nsr=maNsr(norm.olin),main="OLIN")
@
\caption{MXY-plot of array normalized by  OLIN }
\label{mxyolin}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


OLIN is based on an iterative procedure. In most cases, we found
that two or three iterations are already sufficient for normalisation (figure~\ref{iter}).   


<<echo=TRUE,eval=FALSE>>=
norm.olin.1 <- olin(sw[,3],X=sw.xy$X[,3],Y=sw.xy$Y[,3],iter=1)
norm.olin.2 <- olin(norm.olin.1,X=sw.xy$X[,3],Y=sw.xy$Y[,3],iter=1)
norm.olin.3 <- olin(norm.olin.2,X=sw.xy$X[,3],Y=sw.xy$Y[,3],iter=1)

M <- cbind(maM(sw)[,3],maM(norm.olin.1),maM(norm.olin.2),maM(norm.olin.3))
pairs(M,labels= c("raw","1.Iter.","2.Iter.","3.Iter."))
@


\begin{figure}[t]
\centering
\resizebox{0.8\textwidth}{!}{\includegraphics{scatter.png}}
\caption{Convergence of iterative normalisation }
\label{iter}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{OSLIN} 
If we can assume that the variability of log ratios M should be equal across the array,
 local scaling of \emph{M} can be performed. As in the previous section, the validity of these assumptions
 has to be carefully checked for each experiment analyzed. The underlying requirement is again random 
spotting of arrayed genes. To apply OSLIN:

<<>>=
norm.oslin <- olin(sw[,3],X=sw.xy$X[,3],Y=sw.xy$Y[,3],alpha=c(0.1,1,0.1),OSLIN=TRUE)
@

The local scaling factors are  derived by optimized local regression of the absolute log ratio M. 
The range of regression  parameters tested by GCV is [0.1,1] for smoothing parameter. 
The resulting MA- and MXY-plots for slide 3 are presented in figures~\ref{maoslin} and \ref{mxyoslin}.. 
The variability of log ratios \emph{M} appears to be even across the array.

%%%%%%%%%%%%%%%%%%%%%%%%%%

<<echo=TRUE,eval=FALSE>>=
plot(maA(norm.oslin),maM(norm.oslin),main="OSLIN",pch=".")

mxy.plot(maM(norm.oslin),Ngc=maNgc(norm.oslin),Ngr=maNgr(norm.oslin),
                Nsc=maNsc(norm.oslin),Nsr=maNsr(norm.oslin),main="OSLIN")

@


%%%%%%%%%%%%%%%%

\begin{figure}[t]
\centering
<<fig=TRUE,width=3,height=3,echo=FALSE>>=
par(cex.main=0.8)
par(cex.lab =0.8)
plot(maA(norm.oslin),maM(norm.oslin),main="OSLIN",pch=".")
@
\caption{MA-plot of array normalized by  OSLIN }
\label{maoslin}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The \emph{weights} argument can be used for two purposes. First, it can be 
to exclude a set of spots (such as control spots)  to be used  for local regression. Second, it can be used 
to base the regression on a selected set of genes assumed to be not differentially expressed (\emph{house-keeping
genes}).  
If the normalisation should be based on such a set, weights can be used for local regression. 
In this case, all weights should be  set to zero except for
the house-keeping genes for which weights are set to one. In order to achieve a reliable regression, it is important, however, that there is a sufficient number of house-keeping genes that  cover the whole expression range
and are spotted accross the whole array.

%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{figure}[t]
\centering
<<echo=FALSE,fig=TRUE,height=5,width=5.5>>=
 
 mxy.plot(maM(norm.oslin),Ngc=maNgc(norm.oslin),Ngr=maNgr(norm.oslin),
                Nsc=maNsc(norm.oslin),Nsr=maNsr(norm.oslin),main="OSLIN")
@
\caption{MXY-plot of array normalized by  OSLIN }
\label{mxyoslin}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Note that OLIN and OSLIN
are sensitive to violations of the assumptions that
 most genes are not differentially expressed (or up- and down-regulation
is balanced) and that genes are randomly spotted across the array. 
If these assumptions are not valid, local
regression can lead to an underestimation of differential expression.  
In this context,  OSLIN is especially sensitive in its performance. However, the
sensitivity can be decreased if the minimal smooting parameter \texttt{alpha} (default value: 0.05)
 is set to larger values. 
 

It is also important to note that OLIN/OSLIN is fairly efficient in removing intensity- and spatial-dependent dye bias, so that normalised  data will look quite "good" after normalisation independently of the true underlying data quality. Normalisation by local regression assumes smoothness of bias. Therefore, localised artifacts such as scratches, edge effects or bubbles should be avoided. Spots of these areas should be flagged (before normalization is applied) to ensure data integrity. To stringently detect artifacts, the OLIN functions \texttt{fdr.int, fdr.int2, fdr.spatial,fdr.spatial2, 
p.int, p.int2, p.spatial} and \texttt{p.spatial2} can be used. 
Flagging of the data spots in regions of intensity- or location-dependent bias can be performed 
by \texttt{sig.mask}. For an example of such an flagging or masking procedure, 
see the help page of \texttt{sig.mask}.


\section{Scaling between arrays of a batch}
OLIN and OSLIN adjust log ratio \emph{M} of an array independently of other arrays.
A further reduction  of variation within experiments may be achieved by
additional scaling of M between arrays \cite{yang}. This procedure is frequently
also termed 'between-array normalisation' in contrast to 'within-array normalisation'
as for example performed by O(S)LIN. The log ratios of \emph{M} will be adjusted in different
arrays to achieve a similar overall distribution of \emph{M}.

However, it assumes that the overall scale of M is the same (or at least similar)
in the different arrays to be scaled. This should be  carefully checked. Differences in the 
overall scale of \emph{M} may indicated e.g. changes in hybridisation conditions or mRNA quality.
Caution should also be taken in the interpretation of results for arrays
hybridised with  biologically divergent samples, if between-array  scaling is applied.


Between-array scaling is implemented in the \emph{OLIN} package through function \texttt{bas}.
Three different scaling procedures are supported: 
\begin{enumerate}
\item  arrays are scaled to have the same variance as calculated by
\texttt{var},
\item  arrays are scaled to have the same median absolute deviation calculated by \texttt{mad}
\item arrays are scaled to have  equal values of quantiles. 
\end{enumerate}

For illustration, we apply the first scaling procedure to  OLIN-adjusted SW480/620 data consisting of four
replicate array (figures~\ref{disolin},ref{disbas}):

 
<<echo=TRUE,eval=FALSE>>=


data(sw.olin)

# DISTRIBUTION OF LOGGED RATIOS BEFORE BETWEEN-ARRAY-SCALING
col <- c("red","blue","green","orange")
M <- maM(sw.olin)


plot(density(M[,4]),col=col[4],xlim=c(-2,2))
for (i in 1:3){
  lines(density(M[,i]),col=col[i])
}


# BETWEEN-ARRAY SCALING   
sw.olin.s <- bas(sw.olin,mode="var")
  

# VISUALISATION 
M <- maM(sw.olin.s)
plot(density(M[,4]),col=col[4],xlim=c(-2,2))
for (i in 1:3){
  lines(density(M[,i]),col=col[i])
}


@


%%%%%%%%%%%%%%%%

\begin{figure}[t]
\centering
<<fig=TRUE,width=3.5,height=3,echo=FALSE>>=
data(sw.olin)
par(cex.main =0.8)
par(cex.lab=0.8)
col <- c("red","blue","green","orange")
M <- maM(sw.olin)
plot(density(M[,4]),col=col[4],xlim=c(-2,2))
for (i in 1:3){
  lines(density(M[,i]),col=col[i])
}
@
\caption{Distribution of M for SW480/620 arrays after OLIN  }
\label{disolin}
\end{figure}


\begin{figure}[t]
\centering
<<fig=TRUE,width=3.5,height=3,echo=FALSE>>=
sw.olin.s <- bas(sw.olin,mode="var")
par(cex.main = 0.8)
par(cex.lab = 0.8)
M <- maM(sw.olin.s)
plot(density(M[,4]),col=col[4],xlim=c(-2,2))
for (i in 1:3){
  lines(density(M[,i]),col=col[i])
}

@
\caption{Distribution of M for SW480/620 arrays after OLIN and between-array scaling}
\label{disbas}
\end{figure}

%%%%%%%



 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%
\section{Statistical assessment of efficiency of normalisation}
\label{stats}
An important criterion for the quality of normalisation is its efficiency in 
removing systematic errors. Although visual inspection might readily reveal prominent
artifacts in the microarray data (as shown in section~\ref{visu}), it does not allow for their stringent 
detection. To overcome this limitation, several methods for statistical detection of 
systematic errors were implemented in the OLIN package. They might be especially valuable
for the comparison of the efficiency of different normalisation methods. They also might 
be helpful for newcomers in the field of microarray data analysis, since they can assist the
detection of artifacts and may help to improve the experimental procedures.

A simple method to detect intensity- or location-dependent  bias is to calculate the 
correlation between the log ratio \emph{M} of a spot and the average \emph{M} in the spot's 
neighbourhood ~\cite{jo}. A neighourhood on the intensity scale can be defined by a symmetrical
window of size (2*$delta$ + 1) around the spot. A correlation of zero can be expected assuming
the log ratios are uncorrelated. In contrast, a large positive correlation indicates intensity-dependent
 bias.

<<>>=
 A <- maA(sw[,3])
 M <- maM(sw[,3])
# Averaging
 Mav <- ma.vector(A,M,av="median",delta=50)
# Correlation 
 cor(Mav,M,use="pairwise.complete.obs")
@

Similarly, the location-dependent bias can be assessed: 

<<>>=
# From Vector to Matrix
 MM <-  v2m(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),Nsc=maNsc(sw),Nsr=maNsr(sw),visu=FALSE)
# Averaging of matrix M 
 MMav <- ma.matrix(MM,av="median",delta= 2,edgeNA=FALSE)
# Backconversion to vector
 Mav <- m2v(MMav,Ngc=maNgc(sw),Ngr=maNgr(sw),Nsc=maNsc(sw),Nsr=maNsr(sw),visu=FALSE)
# Correlation 
 cor(Mav,M,use="pairwise.complete.obs")
@

Although correlation analysis can be readily applied, it  cannot 
deliver localization of experimental bias in the data. 
To detect areas of bias in the microarray data, alternative methods can be applied. 

The OLIN package contains two four models. The first model (\texttt{anovaint}) can be used
to assess intensity-dependent bias. For this task, the A-scale is divided into \emph{N} intervals
containing equal number of spots. The null hypothesis tested is the equality of the means of \emph{M} for the different intervals.  The input argument \texttt{index} indicates which array stored in object \emph{sw}
 should
be examined. 
The function \texttt{anovaint}  is a wrapper around the core function \texttt{lm}. 
The output of  \texttt{anovaint} equals \texttt{summery(lm)}.

<<>>=
 print(anovaint(sw,index=3,N=10))
@

<<>>=
 data(sw.olin)
 print(anovaint(sw.olin,index=3,N=10))
@


Similarly, the location-dependent bias can be examined by an ANOVA model implemented as
\texttt{anovaspatial}. The array is divided into (xN x yN) rectangular blocks.   
The null hypothesis tested is the equality of the means of \emph{M} for the different blocks.  
The function \texttt{anovaspatial} is a wrapper around functipn \texttt{lm}. The output
is the summary of \texttt{lm} (which is suppressed in following examples).
Additionally,  \texttt{anovaspatial} allows for visualisation of the results (see figure
~\ref{anova} and \ref{anovaolin}.) The figures display the log10-transformed p-values
as derived in the block-wise \textit{t}-tests. Note that the differences in scales.   


<<echo=TRUE,eval=FALSE>>=
  anovaspatial(sw,index=3,xN=8,yN=8,visu=TRUE)
@

<<echo=TRUE,eval=FALSE>>=
  anovaspatial(sw.olin,index=3,xN=8,yN=8,visu=TRUE) 
@

\begin{figure}[t]
\centering
<<fig=TRUE,echo=FALSE,height=5,width=7>>=
s <-  anovaspatial(sw,index=3,xN=8,yN=8,visu=TRUE)
@
\caption{ANOVA for raw data}
\label{anova}
\end{figure}

\begin{figure}[t]
\centering
<<echo=FALSE,fig=TRUE,height=5,width=7>>=
s <- anovaspatial(sw.olin,index=3,xN=8,yN=8,visu=TRUE) 
@
\caption{ANOVA for normalised data}
\label{anovaolin}
\end{figure}


Additionally, simple one-factorial ANOVA 
models were implemented to test microarray data for  pin- and plate-dependent
bias. Testing for pin bias by \texttt{anovapin} is similar to testing for spatial 
bias by \texttt{anovaspatial}. The factors in \texttt{anovapin} are the pin indices.
The null hypothesis is the equality of the means of \emph{M} for the different pins.
In the same manner, it can be tested if there is a significant variation of \emph{M}
due to the use of distinct microtiter plate for spotting. An ANOVA model for this
task is implemented in function \texttt{anovaplate}.In this case, the null hypothesis 
is the equality of the means of \emph{M} for the different plates.

<<>>=
 print(anovapin(sw.olin,index=3))
@


<<>>=
 print(anovaplate(sw.olin,index=3))
@


ANOVA methods assume normality of the analysed data. This, however, may not be the
general case for microarray data. To relax this restriction, permutation tests can be
applied. Permutation (or randomization) tests have the advantage that a particular data 
distribution is not assumed. They rely solely on the observed data examples and can be 
applied with a variety of test statistics. A major restriction, however, is that 
permutation tests are computationally very intensive.
Generally, such tests are not used in interactive mode but are performed in batch-mode. 


Four permutation tests procedures were implemented in the OLIN package. 
The functions \texttt{fdr.int}  and \texttt{p.int} assess intensity-dependent
bias. The functions \texttt{fdr.spatial} and \texttt{p.spatial} assess location-dependent
bias. The basic procedure is similar for all four functions.
First, a (intensity or location)  neighbourhood of spots is defined similarly to the procedure
we used for the correlation analysis .  Next, a test statistic is constructed by calculating
 the \emph{median} or \emph{mean} of \emph{M} the  spot's  neighbourhood of chosen size.
 An empirical distribution of the test statistic $\bar{M}$ is 
then produced  based on random permutations of the
data. Comparing  $\bar{M}$ of the original data with the empirical distribution, the significance 
of observing  $\bar{M}$ is derived. (Note that a rather low number of random permutations was chosen
to avoid time-consuming calculations here. Generally, however, a larger number should be chosen.)  

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

The functions \texttt{fdr.int}, \texttt{p.int}, \texttt{fdr.spatial} and  \texttt{p.spatial}
perform two one-sided random permutation tests. The result can be visualised by 
the plotting functions \texttt{sigint.plot} and \texttt{sigxy.plot}. 
The significance of a spot neighbourhood with large positive deviations of $\bar{M}$ is 
displayed in red along the A-scale or across the spatial dimensions of the array. 
Correspondingly, the significance of spot neighbourhood with large negative 
deviations of  $\bar{M}$ are displayed in green.  


<<eval=FALSE,echo=TRUE>>=
FDR <- fdr.int(maA(sw)[,3],maM(sw)[,3],delta=50,N=10,av="median")
sigint.plot(maA(sw)[,3],maM(sw)[,3],FDR$FDRp,FDR$FDRn,c(-5,-5))
@



\begin{figure}[t]
\centering
<<echo=FALSE,fig=TRUE,height=8,width=7>>=
FDR <- fdr.int(maA(sw)[,3],maM(sw)[,3],delta=50,N=10,av="median")
sigint.plot(maA(sw)[,3],maM(sw)[,3],FDR$FDRp,FDR$FDRn,c(-5,-5))
@
\caption{MA-plot and FDRs of  intensity-dependent bias for raw data }
\label{fdrintraw}
\end{figure}



<<echo=TRUE,eval=FALSE>>=
data(sw.olin)
FDR <- fdr.int(maA(sw.olin)[,3],maM(sw.olin)[,3],delta=50,N=10,av="median")
sigint.plot(maA(sw.olin)[,3],maM(sw.olin)[,3],FDR$FDRp,FDR$FDRn,c(-5,-5))
@



\begin{figure}
\centering
<<echo=FALSE,fig=TRUE,height=8,width=7>>=
data(sw.olin)
FDR <- fdr.int(maA(sw.olin)[,3],maM(sw.olin)[,3],delta=50,N=10,av="median")
sigint.plot(maA(sw.olin)[,3],maM(sw.olin)[,3],FDR$FDRp,FDR$FDRn,c(-5,-5))
@
\caption{MA-plot and FDRs of  intensity-dependent bias for data normalised by OLIN }
\label{fdrintolin}
\end{figure}

For slide 3, an significant intensity-dependent bias towards channel 2 (Cy3)
was detected for low spot intensities, whereas high-intensity spots are biased
towards channel 2 (figure~\ref{fdrintraw}). After OLIN normalisation, no significant intensity-dependent
bias is apparent (figure~\ref{fdrintolin}). A similar result can be found for the 
removal of location-dependent bias by OLIN (figure~\ref{fdrspatialraw} and \ref{fdrspatialolin}).

%%%%%%%%%%%%%%%%%%%%%%%%%%

<<eval=FALSE,echo=TRUE>>= 
M <- v2m(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),
                Nsc=maNsc(sw),Nsr=maNsr(sw),main="MXY plot of SW-array 1")

FDR <- fdr.spatial(M,delta=2,N=10,av="median",edgeNA=TRUE)
sigxy.plot(FDR$FDRp,FDR$FDRn,color.lim=c(-5,5),main="FDR")
@




\begin{figure}[t]
\centering
<<echo=FALSE,fig=TRUE,height=5,width=5.5>>= 
M <- v2m(maM(sw)[,3],Ngc=maNgc(sw),Ngr=maNgr(sw),
                Nsc=maNsc(sw),Nsr=maNsr(sw),main="MXY plot of SW-array 1")

FDR <- fdr.spatial(M,delta=2,N=10,av="median",edgeNA=TRUE)
sigxy.plot(FDR$FDRp,FDR$FDRn,color.lim=c(-5,5),main="FDR")
@
\caption{FDR of location-dependent bias for raw data }
\label{fdrspatialraw}
\end{figure}


<<eval=FALSE,echo=TRUE>>=
M<- v2m(maM(sw.olin)[,3],Ngc=maNgc(sw.olin),Ngr=maNgr(sw.olin),
                Nsc=maNsc(sw.olin),Nsr=maNsr(sw.olin),main="MXY plot of SW-array 1")
FDR <- fdr.spatial(M,delta=2,N=10,av="median",edgeNA=TRUE)
sigxy.plot(FDR$FDRp,FDR$FDRn,color.lim=c(-5,5),main="FDR")
@


\begin{figure}
\centering
<<echo=FALSE,fig=TRUE,height=5,width=5.5>>=
M<- v2m(maM(sw.olin)[,3],Ngc=maNgc(sw.olin),Ngr=maNgr(sw.olin),
                Nsc=maNsc(sw.olin),Nsr=maNsr(sw.olin),main="MXY plot of SW-array 1")
FDR <- fdr.spatial(M,delta=2,N=10,av="median",edgeNA=TRUE)
sigxy.plot(FDR$FDRp,FDR$FDRn,color.lim=c(-5,5),main="FDR")
@
\caption{FDR of location-dependent bias for data normalised by OLIN}
\label{fdrspatialolin}
\end{figure}


 
Fold changes should be treated with care if the corresponding spots have significantly biased 
neighborhoods even after normalisation. 


The functions \texttt{fdr.int} and \texttt{fdr.spatial} assess the significance of dye bias using the false discovery
rate whereas the related  functions \texttt{p.int} and \texttt{p.spatial} produce (adjusted) p-values.
Another difference between the functions is the procedure to generate random permutation.
The function \texttt{fdr.int} and \texttt{fdr.spatial} are based on re-sampling without replacement,
whereas  \texttt{p.int} and \texttt{p.spatial} are based on re-sampling with replacement.  
A more detailed description of the permutation tests can be found in the \textit{help}-pages of these functions.

Finally, a second set of functions  (\texttt{fdr.int2, p.int2, fdr.spatial2, p.spatial2, sigint.plot2,sigxy.plot2} has been introduced for the present version of the OLIN package. Their functionality is the same as for the first set (i.e. \texttt{fdr.int, p.int,...}). However, the input arguments  and output objects of the functions  may differ (see the  relevant \textit{help}). Both sets will be merged based on S4-style functions in future versions.



%BIBLIOGRAPHY
\begin{thebibliography}{99}
\bibitem{toni} M.Futschik and T. Crompton (2004) Model selection and efficiency testing for normalisation of cDNA microarray data, \emph{Genome Biology}, 5:R60  

\bibitem{toni2}M.E. Futschik and T. Crompton (2005) OLIN: optimized normalization, visualization and quality testing of two-channel microarray data, Bioinformatics, 21(8):1724-6 
\bibitem{genomeletters} M. Futschik, A.Jeffs, S.Pattison, N.Kasabov, M.Sullivan, A.Merrie and A.Reeve (2002) Gene expression profiling of metastatic and non-metastatic colorectal cancer cell-lines,  \emph{Genome Letters}, 1(1) , 26-34
\bibitem{loader} C. Loader (1999) Local regression and likelihood, Springer, New York
\bibitem{cleveland} WS Cleveland (1979) Robust locally weighted regression and smoothing scatterplots,
\emph{J Am Stat Ass}, 74, 829-836
\bibitem{jo} J. Schuchhardt, D. Beule, A. Malik, E. Wolski, H. Eickhoff, H. Lehrach and H.Herzel (2000)
Normalization strategies for cDNA microarrays. \emph{Nucleic Acid Research} 28:e47
\bibitem{yang} Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T.
     P. Speed (2002). Normalization for cDNA microarray data: a robust
     composite method addressing single and multiple slide systematic
     variation. \emph{Nucleic Acids Research}, Vol. 30, No. 4.
 


\end{thebibliography}

\end{document}