% -*- mode: noweb; noweb-default-code-mode: R-mode; -*-
\documentclass[11pt]{article}
%% Set my margins
\setlength{\oddsidemargin}{0.0truein}
\setlength{\evensidemargin}{0.0truein}
\setlength{\textwidth}{6.5truein}
\setlength{\topmargin}{0.0truein}
\setlength{\textheight}{9.0truein}
\setlength{\headsep}{0.0truein}
\setlength{\headheight}{0.0truein}
\setlength{\topskip}{0pt}
%% End of margins

%%\pagestyle{myheadings}
%%\markboth{$Date$\hfil$Revision$}{\thepage}
\usepackage[pdftex,
bookmarks, 
bookmarksopen,
pdfauthor={Kun Liang},
pdftitle={DBChIP Vignette}]
{hyperref}

\usepackage{natbib}

\newcommand{\DBChIP}{\texttt{DBChIP}}
\newcommand{\dbchip}{\texttt{DBChIP}~} 

\title{DBChIP: Detecting differential binding of transcription factors with ChIP-seq}
\author{Kun Liang$^{1,2}$ and S\"und\"uz Kele\c{s}$^{1,2}$\\
  $^1$Department of Statistics, University of Wisconsin\\ 
  Madison, WI 53706.\\
  $^2$Department of Biostatistics and Medical Informatics, 
  University of Wisconsin\\Madison, WI 53706.}

\date{\today}

\SweaveOpts{engine=R, echo=TRUE, pdf=TRUE}

\begin{document}
%\VignetteIndexEntry{DBChIP}
%\VignetteKeywords{DBChIP}
%\VignettePackage{DBChIP}
\maketitle

\tableofcontents

\section{Introduction}

This document provides an introduction to the differential binding analysis of ChIP-seq data with the \dbchip package 
\citep{Liang12.DBChIP}. We focus on the binding sites that have similar read profiles and well-defined centers 
throughout the genome. These binding sites tend to have read profiles that look like sharp peaks. The examples are 
transcription factor binding and some histone modifications measured by ChIP-seq. 

An increasing number of ChIP experiments are investigating the same type of binding event (protein-DNA binding or 
histone modification) under different conditions (treatments, time points, cell lines, etc.). A natural question is 
which binding sites behave differently across the conditions, and we set out to address this question. 

\section{Overview}

Here are the major steps \dbchip performs:\\

\emph{1. Consensus site}: Binding site predictions from multiple conditions are merged into consensus binding sites. 
This step is necessary because the predictions for the same binding site across different conditions are unlikely to be 
exactly the same.\\ 

\emph{2. Counting reads}: The number of reads contributing to the binding are counted for each consensus site 
from the aligned sequence reads data.\\

\emph{3. Detecting differential binding}: We formally test a null hypothesis of non-differential binding at each 
consensus site. The tests are generally carried out through a generalized linear model with Negative Binomial 
distribution to account for the over-dispersion among replicates.  We report a $p$-value and fold change estimates between 
conditions for each site.  \dbchip can then report significantly differentially bound sites according to a pre-specified 
false discovery rate (FDR) threshold. 

\section{Example} 

The example dataset cames from a study of the transcription factor PHA-4 in \emph{C.elegans} under two developmental 
conditions: embryonic and the first stage of larval development (L1) under starvation conditions \citep{Zhong10.PHA-4}.  
There are two replicates in each condition.  In \DBChIP, replicates stand for biological replicates; technical 
replicates usually can be merged after their consistency is established.  To keep the data size small, only alignment 
data (ChIP and control) and identified binding sites in chromosome I with position < 0.9M bp are included. 

First, we load the \dbchip library and the \texttt{PHA4} dataset.

<<loading>>=
library(DBChIP)
data("PHA4")
@

Here we specify the experiment condition of each ChIP replicate.

<<assign condition>>=
conds <- factor(c("emb","emb","L1", "L1"), levels=c("emb", "L1"))
@

\DBChIP~requires a set of binding site predictions from each experiment condition. The binding site predictions should 
contain the following fields: \texttt{chr}, chromosome; \texttt{pos}, the binding position; (optional) \texttt{weight}, 
a measure of strength of the binding (for example, the number of reads in the peak). Here we read the predictions into 
\texttt{binding.site.list} from local files. 

<<load binding sites>>=
path <- system.file("ext", package="DBChIP")
binding.site.list <- list()
binding.site.list[["emb"]] <- read.table(paste(path, "/emb.binding.txt", sep=""), 
header=TRUE)
head(binding.site.list[["emb"]])
binding.site.list[["L1"]] <- read.table(paste(path, "/L1.binding.txt", sep=""), 
header=TRUE)
bs.list <- read.binding.site.list(binding.site.list)
@

Then the binding sites from different conditions are merged into consensus sites.

<<merge>>=
consensus.site <- site.merge(bs.list)
@

\texttt{PHA4}~data contain the raw ChIP (\texttt{chip.data.list}) and control/input (\texttt{input.data.list}) data.  
ChIP data are organized as biological replicates. 

<<look at ChIP data>>=
names(chip.data.list)
head(chip.data.list[["emb_rep1"]])
@

Here the alignment data are in Minimum ChIP-Seq (MCS) format, which is a data.frame with following fields: \texttt{chr} 
(factor), \texttt{pos} (integer) and \texttt{strand} (factor, ``+'' and ``-''). Note that the \texttt{pos} is the 
$5^\prime$ position of the read. Most commonly used alignment formats are supported in \DBChIP, see Section 
\ref{alignment} for more details.  On the other hand, input data are usually organized per condition with replicates 
within each condition merged. This is because the focus in differential binding analysis is on the biological variation 
among ChIP replicates, while the input data are mainly used to provide estimates of the background noise level. 

<<look at input data>>=
names(input.data.list)
@

To facilitate the data loading, we use \texttt{load.data} function:
<<load data>>=
dat <- load.data(chip.data.list=chip.data.list, conds=conds, consensus.site=
consensus.site, input.data.list=input.data.list, data.type="MCS")
@


Then we count the reads around each consensus binding site.

<<site count>>=
dat <- get.site.count(dat)
@

Differential binding detection
<<differential binding>>=
dat <- test.diff.binding(dat)
rept <- report.peak(dat)
rept
@

The column \texttt{FC.L1} contains the fold change of the L1 condition with respect to the embryonic condition 
(\texttt{emb} is the first condition and is used as the baseline). By default, \texttt{report.peak} returns top 10 most 
differentially bound sites.  The number of sites to return can be specified through parameter \texttt{n}. We can also 
specify a FDR level to return only the sites deemed significant enough. Finally, we can inspect our results by looking 
at their coverage plots. 

\begin{figure}[tbh]
\begin{center}
<<fig-bs, fig=TRUE,height=6,width=6>>=
plotPeak(rept[1,], dat)
@
\caption{\label{fig:bs} Coverage plot for the most significantly differentially bound site (chromosome I location 260346). 
Color index: ChIP sample forward strand (blue), ChIP sample reverse strand (red), control 
sample forward strand (green), control sample reverse strand (orange).}
\end{center}
\end{figure}

In Figure \ref{fig:bs}, each read is extended by the average fragment length (default 200 bp) from its $5^\prime$ end 
towards its $3^\prime$ end.  The coverage at each nucleotide is defined as the number of extended reads covering the 
position and is computed separately for ChIP sample forward strand (blue), ChIP sample reverse strand (red), control 
sample forward strand (green), control sample reverse strand (orange). 

\section{Technical details} 

\subsection{Alignment data}
\label{alignment}

Besides the Minimum ChIP-Seq (MCS) format used in our example, most commonly used alignment formats (Eland, MAQ, 
Bowtie, SOAP, BAM, etc.) are supported through the \texttt{AlignedRead} object from \texttt{Bioconductor} 
\texttt{ShortRead} package.  For example, a BAM file can be read into an \texttt{AlignedRead} object as follows: 

<<ShortRead, eval=FALSE>>=
library(ShortRead)
aln <- readAligned("./", pattern="emb.bam", type="BAM")
chip.data.list[["emb"]] <- aln
@

The last option is through BED files, where we require at least the first 6 fields (chrom, start, end, name, score and 
strand). Here we provide the list of BED file names.

<<BED, eval=FALSE>>=
chip.data.list <- list()
chip.data.list[["emb"]] <- "/path/emb.bed.file"
chip.data.list[["L1"]] <- "/path/L1.bed.file"
@ 

Then we can simply specify \texttt{data.type="BED"} in the  \dbchip or \texttt{load.data} function.

\subsection{Consensus site}

Here we provide more operational details about obtaining consensus sites through the \texttt{site.merge} function. 
Because the predictions for the same binding site across multiple conditions tend to cluster together, we employ 
agglomerative (bottom-up) hierarchical clustering with centroid linkage to group predicted locations into different 
clusters. The centroid is computed as the average of the locations within each cluster. If the distance between 
centroids of two clusters are smaller than \texttt{in.distance} (default 100 bp), the clusters are considered as coming 
from the same binding site and are merged into one cluster. On the other hand, two clusters are considered as coming 
from separate sites if the distance between two respective centroids are larger than \texttt{out.distance} (default 250 
bp). If the distance between the centroids of two clusters is between \texttt{in.distance} and \texttt{out.distance}, 
the cluster with higher weight will be kept. Finally, the consensus position within each cluster is an (weighted) 
average of original positions. 

\section{Session Info}
<<session>>=
sessionInfo()
@

\bibliographystyle{unsrt}
\bibliography{DBChIP}

\end{document}