%\VignetteIndexEntry{Oligo Vignette}
%\VignetteKeywords{Expression, SNP, Affymetrix, NimbleGen, Oligonucleotide Arrays}
%\VignettePackage{oligo}
\documentclass{article}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Rmethod}[1]{{\texttt{#1}}}
\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textsf{#1}}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\oligo}{\Rpackage{oligo }}

\usepackage{graphicx}

\begin{document}
\title{An Introduction to the Oligo Package}
\date{March, 2007}
\author{Benilton Carvalho}
\maketitle

<<setup, echo=FALSE, results=hide>>=
options(width=60)
options(continue=" ")
options(prompt="R> ")
@ 

\section{Introduction}

The \oligo package is designed to support all microarray designs
provided by Affymetrix and NimbleGen: expression, tiling, SNP and exon
arrays. As of now, chip-specific packages are built via
\Rpackage{makePlatformDesign} and transitioning to the
\Rpackage{pdInfoBuilder} package, which creates the data packages for
the Affymetrix SNP arrays.

\section{Analyzing Affymetrix SNP Arrays}

Genotyping can be performed using \oligo and you will need:

\begin{itemize}
\item \oligo and its dependencies;
\item Chip specific data package, eg. \Rpackage{pd.mapping50k.xba240}:
  package that contains the array specifications and SNP annotation.
\item CEL files.
\end{itemize}

We will start by loading the \oligo package and importing the CEL
files available on the \Rpackage{hapmap100kxba} package. An output
directory should also be defined and that is the location where the
summary files, including genotype calls and confidences are stored.

<<getRawData>>=
library("oligo")
library("hapmap100kxba")
pathCelFiles <- system.file("celFiles", package="hapmap100kxba")
fullFilenames <- list.celfiles(path=pathCelFiles, full.names=TRUE)
outputDir <- file.path(getwd(), "crlmmTest")
@ 

The density of the SNP arrays increased in a way that it is not
interesting to store the intensity matrix in memory; efficient methods
to handle this situations have been developed and batches of SNPs are
analyzed one at a time.

The genotyping algorithm implemented in \oligo is described in
\cite{Carvalho2006}. The whole process can be described in three steps:

\begin{enumerate}
\item Normalization against a reference distribution;
\item Summarization via SNPRMA;
\item Genotype calling via CRLMM.
\end{enumerate}

The normalization step is done by equalizing the observed intensities
accordingly to a reference distribution, based on the 270 Hapmap
samples. These samples are available to the public at
http://www.hapmap.org. The normalization step will remove systematic
biases, which are not due to biological factors.

The SNPRMA algorithm is responsible for summarizing the data. The design
of the SNP arrays is such that up to the 250K density, there are probes
for both alleles on both strands. On the SNP 6.0 platform, given a SNP,
there are probes only on one strand.

Therefore, for the designs up to 250K, SNPRMA will create summaries at
the SNP-Allele-Strand-level. For each SNP there are four numbers
$(\theta_{A-}, \theta_{B-}, \theta_{A+}, \theta_{B+})$, which are
proportional to the log-intensities in each of these combinations of
allele and strand ($-$: antisense; $+$: sense). They are represented by
four matrices: \Robject{antisenseThetaA}, \Robject{antisenseThetaB},
\Robject{senseThetaA} and \Robject{senseThetaB}, which are the
components of the \Rclass{SnpQSet} object. One can extract these objects
using accessors of the same name.

For the SNP 6.0 array, a similar approach is taken, but the summaries
are given at the SNP-Allele-level and there will be only $(\theta_A,
\theta_B)$ estimates for each SNP. An object of class
\Rclass{SnpCnvQSet} is returned and contains two matrices:
\Robject{thetaA} and \Robject{thetaB}. Accessors with the same name are
provided.

Average intensities and log-ratios are defined as across allele and
within strand, ie:
\begin{eqnarray}
  A_{s} & = & \frac{\theta_{A, s}+\theta_{B, s}}{2} \\
  M_{s} & = & \theta_{A, s} - \theta_{B, s},
\end{eqnarray}
where $s$ defines the strand (antisense or sense). These quantities can
be obtained via \Rmethod{getA()} and \Rmethod{getM} methods, which
return high-dimensional arrays with dimensions corresponding to SNP's,
samples and strands, respectively. These measures are later used for
genotyping.

The CRLMM algorithm can be applied on a \Rclass{SnpQSet} or
\Rclass{SnpCnvQSet} object in order to produce genotype calls. It
involves running a mixture of regressions via EM algorithm to adjust for
average intensity and fragment length in the log-ratio scale. These
adjustments may take long time to run, depending on the combination of
number of samples and computer resources available.

%% \textbf{A word of warning}: the \Rmethod{crlmm()} method searches for a
%% variable \Robject{gender} in the \Rclass{phenoData} slot of the
%% \Rclass{SnpQSet} object. If it fails to find that variable, it will try
%% estimate the gender from the data. If there is not enough discrimination
%% power to estimate the gender, the following error message will be
%% returned:
%% 
%% {\bf empty cluster: try a better set of initial centers}
%% 
%% Increasing the sample size is one of the possible solutions, although
%% the preferred one is to have \Robject{gender} already defined in the
%% \Rclass{phenoData} slot.

%% The \Rclass{phenoData} slot includes covariates about the
%% samples. Genotyping and copy number analyses often make use of gender
%% information in order to provide more precise inferences. The code
%% below exemplifies the creation of the \Rclass{phenoData} object.
%% 
%% <<setPD>>=
%% aboutSamples <- data.frame(gender=c("female", "female", "male"))
%% rownames(aboutSamples) <- basename(fullFilenames)
%% aboutVars <- data.frame(labelDescription="male/female")
%% rownames(aboutVars) <- "gender"
%% pd <- new("AnnotatedDataFrame",
%%           data=aboutSamples,
%%           varMetadata=aboutVars)
%% @ 
%% 
%% <<crlmm, cache=TRUE>>=
%% crlmmOut <- justCRLMM(fullFilenames, phenoData=pd, verbose=FALSE)
%% @ 

<<crlmm>>=
crlmm(fullFilenames, outputDir, verbose=FALSE)
@ 

The \Rfunction{crlmm} method does not return an object to the R session. Instead, it saves the objects to disk, as not all systems are guaranteed to meet the memory requirements that \Rclass{SnpCallSetPlus} (for 100K and 500K arrays) or \Rclass{SnpCnvCallSetPlus} (for SNP 5.0 and SNP 6.0 arrays) objects might need. For convenience, the \Rfunction{getCrlmmSummaries} will read the information from disk and make a \Rclass{SnpCallSetPlus} or \Rclass{SnpCnvCallSetPlus} object available to the user.

%% The \Robject{crlmmOut} object above belongs to the \Rclass{SnpCallSet}
%% class and contains the genotype calls and confidence measures
%% associated to the calls, represented respectively by the
%% \Robject{calls} and \Robject{callsConfidence} matrices. These matrices
%% can be accessed using the methods of the same name as demonstrated
%% below:

<<theCalls>>=
crlmmOut <- getCrlmmSummaries(outputDir)
calls(crlmmOut)[1:5,1:2]
callsConfidence(crlmmOut)[1:5,1:2]
@ 

<<theCalls>>=
crlmmCalls <- readSummaries("calls", outputDir)
crlmmConf <- readSummaries("conf", outputDir)
crlmmCalls[1:5, 1:2]
crlmmConf[1:5, 1:2]
@ 

The genotype calls are represented by 1 (AA), 2 (AB) and 3 (BB). The
confidence is the predicted probability that the algorithm made the
right call.

Summaries generated by the algorithm can also be accessed from the R
session. The options for summaries are ``alleleA'', ``alleleB'',
``alleleA-sense'', ``alleleA-antisense'', ``alleleB-sense'',
``alleleB-antisense''. The options ``alleleA'' and ``alleleB'' must be
used \textbf{only} with SNP 5.0 and SNP 6.0 platforms. The remaining
options must be used with 50K and 250K arrays.

<<alleleA>>=
alleleAsense <- readSummaries("alleleA-sense", outputDir)[1:5,]
alleleBsense <- readSummaries("alleleB-sense", outputDir)[1:5,]
log.ratios.sense <- alleleAsense-alleleBsense
log.ratios.sense[, 1:2]
@ 

<<clean, echo=FALSE>>=
unlink(outputDir, recursive=TRUE)
@ 

\subsection{Exploring the Annotation Package}

The user who is willing to make deeper investigation using the
annotations provided for each SNP array can use SQL queries to access
more other information that might not be directly exposed.

The example below demonstrates how to see the available tables, fields
and extract chromosome, physical location and cytoband for the first
five SNP's (probes querying specific SNP's have names starting with
the string ``SNP'').

<<>>=
conn <- db(pd.mapping50k.xba240)
dbListTables(conn)
dbListFields(conn, "featureSet")
sql <- "SELECT man_fsetid, chrom, physical_pos FROM featureSet WHERE man_fsetid LIKE 'SNP%' LIMIT 5"
dbGetQuery(conn, sql)
@ 

\bibliography{oligoVignette}{}
\bibliographystyle{plain}
 
\section{Details}

This document was written using:

<<>>=
sessionInfo()
@ 


\end{document}