% \VignetteIndexEntry{TCseq Vignette} % \VignetteDepends{TCseq} % \VignetteKeywords{Time course sequencing analysis, Clustering} % \VignettePackage{TCseq} \documentclass[a4paper]{article} \usepackage{a4wide} \usepackage[utf8]{inputenc} \usepackage{float} \title{TCseq: time course sequencing data analysis} \author{Mengjun, Lei Gu} \date{ \today } \begin{document} \SweaveOpts{concordance=TRUE} \maketitle The TCseq package provides a unified suite for analysis of different types of time course sequencing data. It can be applied to transcriptome time course data such as RNA-seq as well as epigenome time course data such as ATAC-seq, ChIP-seq. The main focuses of this package are on differential analysis between different time points and temporal pattern analysis and visualization. Unlike RNA-seq, the genomic regions of interest of sequencing data like ATAC-seq, ChIP-seq are not pre-defined and are specific to each experimental conditions, which limits the subsequential differential analysis between conditions. For those data type, the TCseq package provides functions to combine and merge conditionally specific genomic regions and generate a reference genomic regions for all conditions. This package then uses the negative binomial generalized linear model implemented in edgeR to provide differential analysis \cite{Robinson}. To capture the temporal patterns of the time course data, the package includes several unsupervised clustering methods to identify and a function to visualize the patterns. This vignette uses an example ATAC-seq time course data to illustrate how to use the TCseq package. \section{Input data} The minimal input data for the TCseq are experiment design and reference genomic regions. \subsection{Generate reference genomic regions} For RNA-seq, the reference genomic regions are predefined (genes or exons). While for epigenome sequencing data, genomic regions of interest are usually defined as reads enriched regions which are also called peaks. peaks set for a given condition can be identified by peak callers such as MACs and is specific to that condition. The TCseq package provides a function to read in a set of peak set file in BED format, combines these files in to a single data frame, merges overlapping regions according use defined criteria and takes the largest bound as the reference region for all the overlapping regions. The merge criteria can be either absolute overlapping bases or overlapping ration (absolute overlapping bases divide mininum length of the regions to be merged). If a set of BED files are availble under certain directory, say dir.peaks, the file names of the BED files to be merged have common substring "narrowpeaks", then the reference genomic regions can be generated by: <<>>= library(TCseq) @ <>= dir <- dir.peaks gf <- peakreference(dir = dir, pattern = "narrowpeaks") @ The resulting data frame have four columns as follows: <<>>= data("genomicIntervals") head(genomicIntervals) @ \subsection{Create a TCA object} The TCseq uses an S4 class TCA to store all input data for subsequential analysis. When read counts table is not available, only data frames of experiment design and reference genomic regions are required to create a TCA object, TCseq also provides a function to generate counts table, to use the function, file names of BAM files for each sample/library have to be provided in the data frame of experiment design: <<>>= # Experiment design data("experiment_BAMfile") head(experiment_BAMfile) # create a TCA object tca <- TCA(design = experiment_BAMfile, genomicFeature = genomicIntervals) tca @ The count table then can be created (suppose the BAM files are store in the directory dir.BAM): <>= tca <- countReads(tca, dir = dir.BAM) @ When the counts table is available, BAM file information is not mandatory in the experiment design. Counts table can be provides when creating a TCA object: <<>>= #Experiment design without BAM file information data("experiment") #Counts table data("countsTable") tca <- TCA(design = experiment, genomicFeature = genomicIntervals, counts = countsTable) tca @ The counts table can also be assigned to an existing TCA object: <>= counts(tca) <- countsTable @ In addition, a TCA object can also be created from an existing RangedSummarizedExperiment or SummarizedExperiment. For summarizedExperiment, additional reference genomic regions information must be provided, while for RangedSummarizedExperiment object, the reference genomic regions will be extracted directly from the object. For a SummarizedExperiment object: <<>>= suppressWarnings(library(SummarizedExperiment)) se <- SummarizedExperiment(assays=list(counts = countsTable), colData = experiment) tca <- TCAFromSummarizedExperiment(se = se, genomicFeature = genomicIntervals) @ The TCA object with experiment design, read counts, reference genomic regions can be used for following differential analysis. \section{Differential Analysis} The differetial event is detected by using the generalized linear model (GLM) methods \cite{McCarthy} implemented in edgeR package. <<>>= tca <- DBanalysis(tca) @ Low quality genomic regions (read counts are low for all the time points) can also be filtered out. The following step only keeps genomic regions with two or more more samples that have read counts more than 10. <<>>= tca <- DBanalysis(tca, filter.type = "raw", filter.value = 10, samplePassfilter = 2) @ Differential analysis results between given timepoints can be extracted by: <<>>= DBres <- DBresult(tca, group1 = "0h", group2 = c("24h","40h","72h")) str(DBres, strict.width = "cut") head(DBres$`24hvs0h`) @ Significant differential events (log2-fold > 2 or log2-fold < -2, adjusted p-value < 0.05) can be further extracted by: <<>>= DBres.sig <- DBresult(tca, group1 = "0h", group2 = c("24h","40h","72h"), top.sig = TRUE) str(DBres.sig, strict.width = "cut") @ \section{Temporal pattern analysis} \subsection{Construct time course table} To detect temporal patterns of the time course sequencing data, the TCseq package uses unsupervised clustering methods. First, a time course table is created for clustering analysis. The rows of the time course table are genomic regions, and the columns are time points, the values can be chosen from normalized read counts or logFC of all time points compared to the initial time point. Such table can be created as follows: <<>>= # values are logFC tca <- timecourseTable(tca, value = "FC", norm.method = "rpkm", filter = TRUE) @ or <<>>= # values are normalized read counts tca <- timecourseTable(tca, value = "expression", norm.method = "rpkm", filter = TRUE) @ When the "filter" parameter is set to be TRUE, the time course table will filter out all genomic regions with no significant changes between any two time points. The table can be accessed by: <<>>= t <- tcTable(tca) head(t) @ \subsection{Clustering analysis} Two types of clustering algorithms are included in the package: hard clustering (hierachical, pam, kmeans) and soft clustering (fuzzy cmeans \cite{Futschik}). The temporal patterns are analyzed using the following function: <<>>= tca <- timeclust(tca, algo = "cm", k = 6, standardize = TRUE) @ Instead of absolute value of different time series, one might only focus on the change patterns and expect time series with similar pattern to be cluster in same group. In this case, "standardize" parameter gives an option to perform z-score transformation on the data to be clustered, which reduces the noises introduced by the difference in the absolute values. \subsection{Visualize the clustering results} The clustering results can be visualized as follows: <>= p <- timeclustplot(tca, value = "z-score(PRKM)", cols = 3) @ \begin{figure}[H] \centering \includegraphics[width=\textwidth]{clusterRes.png} \caption{Visualization of clustering results} \end{figure} Individual clusters can also be plotted: <>= #plot cluster 1: print(p[[1]]) @ \begin{figure}[H] \centering \includegraphics[width=0.5\textwidth]{subcluster.png} \caption{Visualization of cluster 1} \end{figure} To plot the cmeans clustering results, the TCseq provides several color schemes to color code the membership values which indicate the degree to which data points belong to a cluster. %BIBLIOGRAPHY \begin{thebibliography}{} \bibitem {Robinson} Robinson, M.D., McCarthy, D.J. and Smyth, G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics, 26, 139-140,2010. \bibitem {McCarthy} McCarthy,D.J.,Chen, Y., Smyth, G. K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic acids research 40, 4288-4297,2012. \bibitem{Futschik} Futschik, M.E. and Carlisle, B. Noise-robust soft clustering of gene expression time-course data, Journal of bioinformatics and computational biology, 3, 965-988, 2005. \bibitem{lokesh} L. Kumar and M. Futschik, Mfuzz: a software package for soft clustering of microarray data, Bioinformation, 2(1),5-7,2007 \end{thebibliography} \end{document}