% -*- mode: noweb; noweb-default-code-mode: R-mode; -*- \documentclass[11pt]{article} %% Set my margins \setlength{\oddsidemargin}{0.0truein} \setlength{\evensidemargin}{0.0truein} \setlength{\textwidth}{6.5truein} \setlength{\topmargin}{0.0truein} \setlength{\textheight}{9.0truein} \setlength{\headsep}{0.0truein} \setlength{\headheight}{0.0truein} \setlength{\topskip}{0pt} %% End of margins %%\pagestyle{myheadings} %%\markboth{$Date$\hfil$Revision$}{\thepage} \usepackage[pdftex, bookmarks, bookmarksopen, pdfauthor={Kun Liang}, pdftitle={DBChIP Vignette}] {hyperref} \usepackage{natbib} \newcommand{\DBChIP}{\texttt{DBChIP}} \newcommand{\dbchip}{\texttt{DBChIP}~} \title{DBChIP: Detecting differential binding of transcription factors with ChIP-seq} \author{Kun Liang$^{1,2}$ and S\"und\"uz Kele\c{s}$^{1,2}$\\ $^1$Department of Statistics, University of Wisconsin\\ Madison, WI 53706.\\ $^2$Department of Biostatistics and Medical Informatics, University of Wisconsin\\Madison, WI 53706.} \date{\today} \SweaveOpts{engine=R, echo=TRUE, pdf=TRUE} \begin{document} %\VignetteIndexEntry{DBChIP} %\VignetteKeywords{DBChIP} %\VignettePackage{DBChIP} \maketitle \tableofcontents \section{Introduction} This document provides an introduction to the differential binding analysis of ChIP-seq data with the \dbchip package \citep{Liang12.DBChIP}. We focus on the binding sites that have similar read profiles and well-defined centers throughout the genome. These binding sites tend to have read profiles that look like sharp peaks. The examples are transcription factor binding and some histone modifications measured by ChIP-seq. An increasing number of ChIP experiments are investigating the same type of binding event (protein-DNA binding or histone modification) under different conditions (treatments, time points, cell lines, etc.). A natural question is which binding sites behave differently across the conditions, and we set out to address this question. \section{Overview} Here are the major steps \dbchip performs:\\ \emph{1. Consensus site}: Binding site predictions from multiple conditions are merged into consensus binding sites. This step is necessary because the predictions for the same binding site across different conditions are unlikely to be exactly the same.\\ \emph{2. Counting reads}: The number of reads contributing to the binding are counted for each consensus site from the aligned sequence reads data.\\ \emph{3. Detecting differential binding}: We formally test a null hypothesis of non-differential binding at each consensus site. The tests are generally carried out through a generalized linear model with Negative Binomial distribution to account for the over-dispersion among replicates. We report a $p$-value and fold change estimates between conditions for each site. \dbchip can then report significantly differentially bound sites according to a pre-specified false discovery rate (FDR) threshold. \section{Example} The example dataset cames from a study of the transcription factor PHA-4 in \emph{C.elegans} under two developmental conditions: embryonic and the first stage of larval development (L1) under starvation conditions \citep{Zhong10.PHA-4}. There are two replicates in each condition. In \DBChIP, replicates stand for biological replicates; technical replicates usually can be merged after their consistency is established. To keep the data size small, only alignment data (ChIP and control) and identified binding sites in chromosome I with position < 0.9M bp are included. First, we load the \dbchip library and the \texttt{PHA4} dataset. <>= library(DBChIP) data("PHA4") @ Here we specify the experiment condition of each ChIP replicate. <>= conds <- factor(c("emb","emb","L1", "L1"), levels=c("emb", "L1")) @ \DBChIP~requires a set of binding site predictions from each experiment condition. The binding site predictions should contain the following fields: \texttt{chr}, chromosome; \texttt{pos}, the binding position; (optional) \texttt{weight}, a measure of strength of the binding (for example, the number of reads in the peak). Here we read the predictions into \texttt{binding.site.list} from local files. <>= path <- system.file("ext", package="DBChIP") binding.site.list <- list() binding.site.list[["emb"]] <- read.table(paste(path, "/emb.binding.txt", sep=""), header=TRUE) head(binding.site.list[["emb"]]) binding.site.list[["L1"]] <- read.table(paste(path, "/L1.binding.txt", sep=""), header=TRUE) bs.list <- read.binding.site.list(binding.site.list) @ Then the binding sites from different conditions are merged into consensus sites. <>= consensus.site <- site.merge(bs.list) @ \texttt{PHA4}~data contain the raw ChIP (\texttt{chip.data.list}) and control/input (\texttt{input.data.list}) data. ChIP data are organized as biological replicates. <>= names(chip.data.list) head(chip.data.list[["emb_rep1"]]) @ Here the alignment data are in Minimum ChIP-Seq (MCS) format, which is a data.frame with following fields: \texttt{chr} (factor), \texttt{pos} (integer) and \texttt{strand} (factor, ``+'' and ``-''). Note that the \texttt{pos} is the $5^\prime$ position of the read. Most commonly used alignment formats are supported in \DBChIP, see Section \ref{alignment} for more details. On the other hand, input data are usually organized per condition with replicates within each condition merged. This is because the focus in differential binding analysis is on the biological variation among ChIP replicates, while the input data are mainly used to provide estimates of the background noise level. <>= names(input.data.list) @ To facilitate the data loading, we use \texttt{load.data} function: <>= dat <- load.data(chip.data.list=chip.data.list, conds=conds, consensus.site= consensus.site, input.data.list=input.data.list, data.type="MCS") @ Then we count the reads around each consensus binding site. <>= dat <- get.site.count(dat) @ Differential binding detection <>= dat <- test.diff.binding(dat) rept <- report.peak(dat) rept @ The column \texttt{FC.L1} contains the fold change of the L1 condition with respect to the embryonic condition (\texttt{emb} is the first condition and is used as the baseline). By default, \texttt{report.peak} returns top 10 most differentially bound sites. The number of sites to return can be specified through parameter \texttt{n}. We can also specify a FDR level to return only the sites deemed significant enough. Finally, we can inspect our results by looking at their coverage plots. \begin{figure}[tbh] \begin{center} <>= plotPeak(rept[1,], dat) @ \caption{\label{fig:bs} Coverage plot for the most significantly differentially bound site (chromosome I location 260346). Color index: ChIP sample forward strand (blue), ChIP sample reverse strand (red), control sample forward strand (green), control sample reverse strand (orange).} \end{center} \end{figure} In Figure \ref{fig:bs}, each read is extended by the average fragment length (default 200 bp) from its $5^\prime$ end towards its $3^\prime$ end. The coverage at each nucleotide is defined as the number of extended reads covering the position and is computed separately for ChIP sample forward strand (blue), ChIP sample reverse strand (red), control sample forward strand (green), control sample reverse strand (orange). \section{Technical details} \subsection{Alignment data} \label{alignment} Besides the Minimum ChIP-Seq (MCS) format used in our example, most commonly used alignment formats (Eland, MAQ, Bowtie, SOAP, BAM, etc.) are supported through the \texttt{AlignedRead} object from \texttt{Bioconductor} \texttt{ShortRead} package. For example, a BAM file can be read into an \texttt{AlignedRead} object as follows: <>= library(ShortRead) aln <- readAligned("./", pattern="emb.bam", type="BAM") chip.data.list[["emb"]] <- aln @ The last option is through BED files, where we require at least the first 6 fields (chrom, start, end, name, score and strand). Here we provide the list of BED file names. <>= chip.data.list <- list() chip.data.list[["emb"]] <- "/path/emb.bed.file" chip.data.list[["L1"]] <- "/path/L1.bed.file" @ Then we can simply specify \texttt{data.type="BED"} in the \dbchip or \texttt{load.data} function. \subsection{Consensus site} Here we provide more operational details about obtaining consensus sites through the \texttt{site.merge} function. Because the predictions for the same binding site across multiple conditions tend to cluster together, we employ agglomerative (bottom-up) hierarchical clustering with centroid linkage to group predicted locations into different clusters. The centroid is computed as the average of the locations within each cluster. If the distance between centroids of two clusters are smaller than \texttt{in.distance} (default 100 bp), the clusters are considered as coming from the same binding site and are merged into one cluster. On the other hand, two clusters are considered as coming from separate sites if the distance between two respective centroids are larger than \texttt{out.distance} (default 250 bp). If the distance between the centroids of two clusters is between \texttt{in.distance} and \texttt{out.distance}, the cluster with higher weight will be kept. Finally, the consensus position within each cluster is an (weighted) average of original positions. \section{Session Info} <>= sessionInfo() @ \bibliographystyle{unsrt} \bibliography{DBChIP} \end{document}