%\VignetteIndexEntry{An introduction to CNAnorm}
\documentclass[a4paper]{article}
\usepackage{natbib}

\title{CNAnorm: A package to detect Copy Number Alterations (CNA) from sequencing data}
\author{Stefano Berri}

\begin{document}

\maketitle

CNAnorm is a package for the analysis of Copy Number Alteration (CNA) of tumour samples
using low coverage (around 0.01 - 0.5X) high throughput sequencing\cite{gusnanto2011a}.
In particular, CNAnorm aims to perform a meaningful \emph{normalisation} of the sample
by estimation of the underlying tumour's ploidy.
CNAnorm allows both a fully automated as well as an interactive approach to the
normalisation step. It provides serveral normalisation methods. If the user has
some external information about the ploidy of the genome, it is possible to
manually inform CNAnorm and then perform the normalisation.
CNAnorm also provides a method for plotting the normalised genome profile.

For more information and the original data, see the authors' website.
\\
\\
\verb@http://www.precancer.leeds.ac.uk/cnanorm/@


\section{Input data}

You can load the example data
<<>>=
library(CNAnorm)
data(LS041)
# show the data
LS041[1:5,]
@

The input data can be produced from sam/bam files using the perl script \verb@bam2windows.pl@
that you can obtain from the authors' website (\verb@http://www.precancer.leeds.ac.uk/cnanorm/@). There you can also find the bam
files used to produce the dataframe LS041.

The first step is to create an object of class CNAnorm with the input data.
The input data consists of a number of reads in the test (tumour) and control (normal) for a variable number of 
\textbf{constant width windows}. Chromosome/contig name and the starting
position of each window must also be provided.
GC content for each window is optional.
If the data is in a dataframe like LS041, it is easiest to use the function
\emph{dataFrame2object} to create a new object.

<<>>=
# set a seed for reproducible comparison
set.seed(31)
CN <- dataFrame2object(LS041)
@

\section{Basic use}
Here, we will demostrate a standard and robust set of instructions for obtaining
copy number profile without correcting for tumour content or ploidy.

\subsection{GC-correction}

GC correction is an optional but recommended step to reduce GC content sequencing bias.

Because of the difficulty in correctly mapping reads to chromosome Y and M, we can
flag them for exclusion from the GC correction or ploidy detection.

<<>>=
toSkip <- c("chrY", "chrM")
CN <- gcNorm(CN, exclude=toSkip)
@

\subsection{From read count to copy number profiles}

CNAnorm will try to detect ``states'' of copy number in order to perform normalisation.
It is strongly recommended to smooth the signal\cite{huang2007a} to decrease noise without
losing resolution.

<<>>=
CN <- addSmooth(CN, lambda=7)
@

It is now possible to search for the most representative peak (state) closest to the median of the ratios. 

<<>>=
CN <- peakPloidy(CN, exclude=toSkip, method='closest')
@

`closest' is a robust normalisation method. It will not attempt to estimate
ploidy and tumour content, but it will work with a wide range of samples,
windows sizes, total number of reads, and so on. This method can cope with
heterogeneous, policlonal tumours. Once the user is more familiar with CNAnorm,
methods `density' or `mixture' can be used to estimate
ploidy and tumour content if the analysed tumour is largely monoclonal. See
section ``Advanced use'' 

We can now visualise the distribution of reads.

\begin{center}
<<fig=TRUE, echo=TRUE>>=
plotPeaks(CN)
@
\end{center}

\subsubsection{Segmentation}

CNAnorm uses an external package, DNAcopy\cite{venkatraman2007a}, to segment the data.
If we want to plot DNAcopy segments, before the
normalisation we need to add the DNAcopy information.

<<>>=
CN <- validation(CN)
CN <- addDNACopy(CN)
CN <- discreteNorm(CN)
@

The above steps can be performed also with function CNAnormWorkflow that
exposes most of the parameters of all functions with robust defaults.

<<>>=
# re-set seed for reproducible results
set.seed(31)
CNauto <- CNAnormWorkflow(LS041, gc.do=TRUE, gc.exclude=toSkip, 
    peak.exclude=toSkip)
identical(CN, CNauto)
@

Finally, we can plot the copy number profile for the whole genome. As there is no
assumption about the species of the sample, we will use show.centromeres = FALSE

\begin{center}
<<fig=TRUE, echo=TRUE, width=7, hight=3>>=
plotGenome(CN, superimpose='DNACopy', show.centromeres=FALSE)
@
\end{center}

We can also specify a set of chromosomes to show, or a specific range.
If the reference genome is the human genome (hg19), we can use built-in information
to plot the location of centromeres.

\begin{center}
<<fig=TRUE, echo=TRUE, width=7, hight=3>>=
toPlot <- c('chr10', 'chr11', 'chr12')
subSet <- chrs(CN) %in% toPlot
plotGenome(CN[subSet], superimpose='DNACopy')
@
\end{center}

Several aspects of the plot can be customised (colours, point and line size,
etc.) by setting the desired value in object gPar and then passing it to
plotGenome

\begin{center}
<<fig=TRUE, echo=TRUE, width=7, hight=3>>=
data(gPar)
gPar$genome$colors$gain.dot <- 'darkorange'
gPar$genome$colors$grid <- NULL
gPar$genome$cex$gain.dot <- .4
gPar$genome$cex$loss.dot <- .4
plotGenome(CN[subSet], superimpose='DNACopy', gPar=gPar, 
    colorful=TRUE)
@
\end{center}

Finally, we can export results in a table-like format.
<<>>=
exportTable(CN, file = "CNAnorm_table.tab", show = 'center')
@

\section{Advanced use}
CNAnorm was designed to be able to detect absolute or integer copy number
states and detect tumour content. It is important to remember, though, that
several solutions could be equally supported by the same
dataset\cite{gusnanto2011a} and that the process will succeed only if the
tumour is largely monoclonal.

We can use a different method to normalise the data, in order to estimate
absolute ploidy and tumour content.


\begin{center}
<<fig=TRUE,echo=TRUE>>=
CN <- peakPloidy(CN, exclude=toSkip, method='density')
plotPeaks(CN)
@
\end{center}

We can now see that each peak has been assigned to an integer state and an estimate of 
tumour content is given. The vertical dotted red line shows the median value
of the ratio. The blue labels (in the form ``BN $\rightarrow$ AN'') on top of each peak shows the ratio Before Normalisation
and the ratio After Normalisation. Ploidy (plotted on the left axis of the genome plot), will be 2 * AN.
The red circle is used to define ratio $=$ 1 on the right axes of the genome plot.

We can also provide extra arguments to the segmentation engine: DNAcopy.weight and 
DNAcopy.segment improve segmentation. These are particularly important if
higher depth data is used in order to minimise false positives (oversegmentation).
As the data is from human tumour samples, we can use
built-in information about centromere location to perform segmentation on a per chromosome arm basis
(independent.arms).


\begin{center}
<<fig=TRUE,echo=TRUE, width=7, hight=3>>=
CN <- validation(CN)
data(hg19_hs_ideogr)
CN <- addDNACopy(CN, DNAcopy.weight='poisson', 
    DNAcopy.segment=list(alpha=0.001), independent.arms=TRUE, 
    ideograms=hg19_hs_ideogr)
CN <- discreteNorm(CN)
plotGenome(CN, superimpose='DNACopy')
@
\end{center}

The genome wide copy number profile is now be mostly aligned to integer states.

\subsection{Manual correction}

The reported normalisation is highly plausible. Let's assume, however, that the
user knows that all chromosomes assigned to copy number 2 are in LOH
(independent evidence) and/or that there are exactly 2 copies of chr5 (by FISH)
and/or that the tumour content is probably lower (pathologist report).  The
user can thus ``shift'' the state up or down

\begin{center}
<<fig=TRUE,echo=TRUE>>=
CN <- validation(CN, ploidy = (sugg.ploidy(CN) - 1))
plotPeaks(CN, show='validated')
@
\end{center}

Ploidy has been shifted and a new tumour content estimation provided.
We need to perform the normalisation once more and then we can plot the new copy number profile 

\begin{center}
<<fig=TRUE,echo=TRUE, width=7, hight=3>>=
CN <- discreteNorm(CN)
plotGenome(CN, superimpose='DNACopy')
@
\end{center}

\bibliographystyle{plain}
\bibliography{ref}

\end{document}