% \VignetteEngine{knitr}
% \VignetteIndexEntry{An introduction to clusterProfiler}
% \VignettePackage{clusterProfiler}}

% To compile this document, run the commands within R
% require(cacheSweave); Sweave('clusterProfiler.Rnw', driver=cacheSweaveDriver()); tools::texi2dvi("clusterProfiler.tex", pdf=TRUE)

\documentclass[12pt]{article}
\usepackage[a4paper,left=3cm,top=2cm,bottom=3cm,right=3cm,ignoreheadfoot]{geometry}
\pagestyle{empty}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{helvet}
\usepackage[titletoc]{appendix}
\usepackage{tocloft}

\setlength{\parindent}{0em}
\setlength{\parskip}{.5em}

\renewcommand{\familydefault}{\sfdefault}

\newcommand{\Rcode}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\small\texttt{#1}}}
\newcommand{\Rfunction}[1]{\Robject{#1}}
\newcommand{\Rpackage}[1]{\textit{#1}}
\newcommand{\Rparameter}[1]{\textit{#1}}
\newcommand{\Rclass}[1]{\textit{#1}}
\newcommand{\R}{\textsf{R}}

\usepackage[
  bookmarks,%
  colorlinks,%
  linktoc=section,%
  linkcolor=RedViolet,%
  citecolor=RedViolet,%
  urlcolor=RedViolet,%
  linkbordercolor={1 1 1},%
  citebordercolor={1 1 1},%
  urlbordercolor={1 1 1},%
  raiselinks,%
  plainpages,%
  pdftex
  ]{hyperref}


\usepackage{cite}
\renewcommand{\floatpagefraction}{0.9}

\usepackage{sectsty}
\sectionfont{\sffamily\bfseries\color{RoyalBlue}\sectionrule{0pt}{0pt}{-1ex}{1pt}}
\subsectionfont{\sffamily\bfseries\color{RoyalBlue}}
\subsubsectionfont{\sffamily\bfseries\color{RoyalBlue}}
\usepackage{fancyhdr}
\pagestyle{fancy}
\fancyhead{}
\fancyfoot{}
\lfoot{}\cfoot{}\rfoot{}
\renewcommand{\headrule}{}

\usepackage{graphicx}
%\usepackage{xstring}



<<include=FALSE>>=
library(DOSE)
library(GO.db)
library(org.Hs.eg.db)
library(pathview)
library(clusterProfiler)
library(knitr)
opts_chunk$set(tidy=TRUE,tidy.opts=list(keep.blank.line=FALSE, width.cutoff=50),out.truncate=80,out.lines=6,cache=TRUE,dev='pdf',include=TRUE,fig.width=6,fig.height=6,resolution=150)
@

\newcommand{\fixme}[1]{{\textbf{Fixme:} \textit{\textcolor{blue}{#1}}}}


\title{\textsf{\textbf{Using \Rpackage{clusterProfiler} to identify and compare functional profiles of gene lists}}}
\author{Guangchuang Yu \\
School of Biological Sciences \\
The University of Hong Kong, Hong Kong SAR, China \\
email: \texttt{guangchuangyu@gmail.com}}

\begin{document}
\maketitle

<<options,echo=FALSE>>=
options(digits=3, width=80, prompt=" ", continue=" ")
@

\tableofcontents

\section{Introduction}

In recently years, high-throughput experimental techniques such as
microarray, RNA-Seq and mass spectrometry can detect cellular
moleculars at systems-level. These kinds of analyses generate huge
quantitaties of data, which need to be given a biological
interpretation. A commonly used approach is via clustering in the gene
dimension for grouping different genes based on their similarities \cite{yu2010}.

To search for shared functions among genes, a common way is to
incorporate the biological knowledge, such as Gene Ontology (GO) and
Kyoto Encyclopedia of genes and Genomes (KEGG), for identifying
predominant biological themes of a collection of genes.

After clustering analysis, researchers not only want to determine
whether there is a common theme of a particular gene cluster, but also
to compare the biological themes among gene clusters. The manual step
to choose interesting clusters followed by enrichment analysis on each
selected cluster is slow and tedious. To bridge this gap, we designed
\Rpackage{clusterProfiler} \cite{yu2012}, for comparing and visualizing functional
profiles among gene clusters.

\section{Citation}

Please cite the following articles when using
\Rpackage{clusterProfiler}. \\


G Yu, LG Wang, Y Han, QY He. clusterProfiler: an R package for
comparing biological themes among gene clusters. \textit{OMICS: A
  Journal of Integrative Biology}. 2012, 16(5), 284-287. \\

\section{Supported organisms}
At present, \Rpackage{clusterProfiler} about 20 species as shown below:
    
\begin{itemize}
  \item \textit{Arabidopsis}
  \item \textit{Anopheles}
  \item \textit{Bovine}
  \item \textit{Canine}
  \item \textit{Chicken}
  \item \textit{Chimp}
  \item \textit{E coli strain K12}
  \item \textit{E coli strain Sakai}
  \item \textit{Fly}
  \item \textit{Human}
  \item \textit{Malaria}
  \item \textit{Mouse}
  \item \textit{Pig}
  \item \textit{Rat}
  \item \textit{Rhesus}
  \item \textit{Worm}
  \item \textit{Xenopus}
  \item \textit{Yeast}
  \item \textit{Zebrafish}
\end{itemize}

These species are all supported by GO and KEGG analyses.

GO analyses also support \textit{Coelicolor} and \textit{Gondii}.


\section{Gene Ontology Classification}
In \Rpackage{clusterProfiler}, \Rfunction{groupGO} is designed for gene classification based on GO distribution at a specific level.

<<groupGO>>=
require(DOSE)
data(geneList)
gene <- names(geneList)[abs(geneList) > 2]
head(gene)
ggo <- groupGO(gene=gene, organism="human",
				ont="BP", level=3, readable=TRUE)
head(summary(ggo))
@

\section{Enrichment Analysis}
\subsection{Hypergeometric model}
Enrichment analysis \cite{boyle2004} is a widely used approach to identify biological
themes. Here we implement hypergeometric model to assess whether the
number of selected genes associated with disease is larger than
expected. 

To determine whether any terms annotate a specified list of
    genes at frequency greater than that would be expected by chance,
    \Rpackage{clusterProfiler} calculates a p-value using the hypergeometric distribution:

$
p = 1 - \displaystyle\sum_{i = 0}^{k-1}
  \frac{
      {M \choose i}
      {{N-M} \choose {n-i}}
    } {
      {N \choose n}
    }
$

In this equation, \textit{N} is the total number of genes in the
background distribution, \textit{M} is the number of genes within that
distribution that are annotated (either directly or indirectly) to the
node of interest, \textit{n} is the size of the list of genes of
interest and \textit{k} is the number of genes within that list which
are annotated to the node. The background distribution by default is
all the genes that have annotation.

P-values were adjusted for multiple comparison, and q-values were also calculated for FDR control.

\subsection{Gene set enrichment analysis}
A common approach in analyzing gene expression profiles was identifying differential expressed genes that are deemed interesting. The enrichment analysis we demonstrated previous were based on these differential expressed genes. This approach will find genes where the difference is large, but it will not detect a situation where the difference is small, but evidenced in coordinated way in a set of related genes. Gene Set Enrichment Analysis (GSEA) \cite{subramanian_gene_2005} directly addresses this limitation. All genes can be used in GSEA; GSEA aggregates the per gene statistics across genes within a gene set, therefore making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. Since it is likely that many relevant phenotypic differences are manifested by small but consistent changes in a set of genes.
\\
\\
Genes are ranked based on their phenotypes. Given a priori defined set of gens \textit{S} (e.g., genes shareing the same \textit{GO} or \textit{KEGG} category), the goal of GSEA is to determine whether the members of \textit{S} are randomly distributed throughout the ranked gene list (\textit{L}) or primarily found at the top or bottom.
\\
\\
There are three key elements of the GSEA method:
\begin{itemize}
	\item Calculation of an Enrichment Score.\\
The enrichment score (\textit{ES}) represent the degree to which a set \textit{S} is over-represented at the top or bottom of the ranked list \textit{L}. The score is calculated by walking down the list \textit{L}, increasing a running-sum statistic when we encounter a gene in \textit{S} and decreasing when it is not. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The \textit{ES} is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic \cite{subramanian_gene_2005}.
	\item Esimation of Significance Level of \textit{ES}.\\
The \textit{p-value} of the \textit{ES} is calculated using permutation test. Specifically, we permute the gene labels of the gene list \textit{L} and recompute the \textit{ES} of the gene set for the permutated data, which generate a null distribution for the \textit{ES}. The \textit{p-value} of the observed ES is then calculated relative to this null distribution.
	\item Adjustment for Multiple Hypothesis Testing.\\
When the entire \textit{GO} or \textit{KEGG} gene sets is evaluated, \Rpackage{clusterProfiler} adjust the estimated significance level to account for multiple hypothesis testing and also \textit{q-values} were calculated for FDR control.
\end{itemize}

\subsection{GO enrichment analysis}

<<enrichGO>>=
ego <- enrichGO(gene=gene,
                universe = names(geneList),
                organism="human",
                ont="CC",
                pvalueCutoff=0.01,
                readable=TRUE)
head(summary(ego))
@

<<gseGO>>=
ego2 <- gseGO(geneList=geneList,
              organism="human",
              ont="CC",
              nPerm=100,
              minGSSize=120,
              pvalueCutoff=0.01,
              verbose=FALSE)
head(summary(ego2))
@

\subsection{KEGG pathway enrichment analysis}

<<enrichKEGG>>=
kk <- enrichKEGG(gene=gene,
                 organism="human",
                 pvalueCutoff=0.01, 
                 readable=TRUE)
head(summary(kk))
@

<<gseKEGG>>=
kk2 <- gseKEGG(geneList=geneList,
               organism="human",
               nPerm=100,
               minGSSize=120,
               pvalueCutoff=0.01,
               verbose=FALSE)
head(summary(kk2))
@

\subsection{DO enrichment analysis}

Disease Ontology (DO) enrichment analysis is implemented in \Rpackage{DOSE}, please refer to the package vignettes. The \Rfunction{enrichDO} function is very useful for identifying disease association of interesting genes, and function \Rfunction{gseAnalyzer} function is designed for gene set enrichment analysis of \textit{DO}.

\subsection{Reactome pathway enrichment analysis}

With the demise of KEGG (at least without subscription), the KEGG pathway data in Bioconductor will not update and we encourage user to analyze pathway using \Rpackage{ReactomePA} which use Reactome as a source of pathway data. The function call of \Rfunction{enrichPathway} and \Rfunction{gsePathway} in \Rpackage{ReactomePA} is consistent with \Rfunction{enrichKEGG} and \Rfunction{gseKEGG}.

\subsection{Function call}
The function calls of \Rfunction{groupGO}, \Rfunction{enrichGO},
\Rfunction{enrichKEGG}, \Rfunction{enrichDO} and \Rfunction{enrichPathway} are consistent. 
The input parameters of
\textit{gene} is a vector of entrezgene (for human and mouse) or ORF
(for yeast) IDs, and \textit{organism} should be supported species (please refer to the manual of the specific function).

For gene set enrichment analysis, the function of \Rfunction{gseGO}, \Rfunction{gseKEGG}, \Rfunction{gseAnalyzer} and \Rfunction{gsePathway} need extra paramter \textit{nPerm} to specify the permutation number.

For GO analysis, \textit{ont} must be assigned to one of "BP", "MF",
and "CC" for biological process, molecular function and cellular
component, respectively. In \Rfunction{groupGO}, the \textit{level}
specify the GO level for gene projection.

In enrichment analysis, the \textit{pvalueCutoff} is to restrict the
result based on their pvalues and the adjusted p values. \textit{Q-values} were also calculated for controlling
false discovery rate (FDR). 

The \textit{readable} is a logical parameter to indicate the
input gene IDs will map to gene symbols or not.

\subsection{Visualization}
The output of \Rfunction{groupGO}, \Rfunction{enrichGO} and \Rfunction{enrichKEGG} can be visualized by bar plot, enrichment map and category-gene-network plot. It is very common to visualize the enrichment result in bar or pie chart. We believe the pie chart is misleading and only provide bar chart.

\subsubsection{barplot}
<<barplot, fig.height=5, fig.width=6>>=
barplot(ggo, drop=TRUE, showCategory=12)
@



<<barplot-enrich, fig.height=5, fig.width=8>>=
barplot(ego, showCategory=8)
@


\subsubsection{enrichMap}
Enrichment map can be viusalized by \Rfunction{enrichMap}, which support results obtained from hypergeometric test and gene set enrichment analysis.
<<enrichMap, fig.cap="enrichment map of enrichment result", fig.align="center", fig.height=16, fig.width=16, out.width="0.9\\textwidth", fig.pos="h">>=  
enrichMap(ego)
@ 

<<enrichMap2, fig.cap="enrichment map of gsea result", fig.align="center", fig.height=16, fig.width=16, out.width="0.9\\textwidth", fig.pos="h">>=  
enrichMap(ego2)
@ 


\subsubsection{cnetplot}
In order to consider the potentially biological complexities in which a gene may belong to multiple annotation categories and provide information of numeric changes if available, we developed \Rfunction{cnetplot} function to extract the complex association.
<<cnetplot, fig.height=14, fig.width=14>>=
cnetplot(ego, categorySize="pvalue", foldChange=geneList)
@


<<cnetplot-KEGG, fig.height=14, fig.width=14>>=
cnetplot(kk, categorySize="geneNum", foldChange=geneList)
@

\subsubsection{gseaplot}
Running score of gene set enrichment analysis and its association of phenotype can be visualized by \Rfunction{gseaplot}.
<<gseaplot, fig.cap="plotting gsea result", fig.align="center", fig.height=6, fig.width=6, out.width="0.6\\textwidth", fig.pos="h">>=  
gseaplot(kk2, geneSetID = "hsa04145")
@


\subsubsection{pathview from pathview package}
\Rpackage{clusterProfiler} users can also use \Rfunction{pathview} from the \Rpackage{pathview} \cite{luo_pathview} to visualize KEGG pathway.

The following example illustrate how to visualize "hsa04110" pathway, which was enriched in our previous analysis.

<<viewKEGG>>=
require(pathview)
hsa04110 <- pathview(gene.data=geneList, pathway.id="hsa04110", species="hsa", limit=list(gene=max(abs(geneList)), cpd=1))
@

\begin{figure}[ht]
\centering
\includegraphics[width=.9\textwidth,type=png,ext=.png,read=.png]{hsa04110.pathview}
\caption{visualize KEGG pathway using pathview}
\label{viewKEGG}
\end{figure}

For further information, please refer to the vignette of \Rpackage{pathview} \cite{luo_pathview}.


\section{Biological theme comparison}
\Rpackage{clusterProfiler} was developed for biological theme comparison, and it provides a function, \Rfunction{compareCluster}, to automatically calculate enriched functional categories of each gene clusters.

<<compareCluster, fig.height=8, fig.width=8>>=
data(gcSample)
ck <- compareCluster(geneCluster=gcSample, fun="enrichKEGG")
plot(ck)
@


By default, only top 5 (most significant) categories of each cluster
was plotted. User can changes the parameter \textit{showCategory} to
specify how many categories of each cluster to be plotted, and if
\textit{showCategory} was set to \textit{NULL}, the whole result will
be plotted.

The dot sizes were based on their corresponding row percentage by
default, and user can set the parameter \textit{by} to "count" to make
the comparison based on gene counts. The parameter \textit{by} can also set 
to "rowPercentage" to normalize the dot sizes, since some categories may
contain a large number of genes, and make the dot sizes of those small
categories too small to compare. The default parameter \textit{by} is setting 
to "geneRatio", which corresponding to the "GeneRatio" column of the output.
To provide the full information, we
also provide number of identified genes in each category (numbers in
parentheses) when \textit{by} is setting to "rowPercentage" and number of gene clusters in each cluster label (numbers in parentheses) when \textit{by} is setting to "geneRatio", 
as shown in Figure 3. If the dot sizes were based on
"count", the row numbers will not shown.

The p-values indicate that which categories are more likely to have
biological meanings. The dots in the plot are color-coded based on
their corresponding p-values. Color gradient ranging from red to blue
correspond to in order of increasing p-values. That is, red indicate
low p-values (high enrichment), and blue indicate high p-values (low
enrichment). P-values and adjusted p-values were filtered out by the threshold giving by
parameter \textit{pvalueCutoff}, and FDR can be estimated by \textit{qvalue}.

User can refer to the example in \cite{yu2012}; we analyzed the publicly available expression dataset of breast tumour tissues from 200 patients (GSE11121, Gene Expression Omnibus) \cite{schmidt2008}. We
identified 8 gene clusters from differentially expressed genes, and
using \Rfunction{compareCluster} to compare these gene clusters by
their enriched biological process.


Another example was shown in \cite{yu2011}, we calculated functional
similarities among viral miRNAs using method described in
\cite{yu_new_2011}, and compared significant KEGG pathways regulated
by different viruses using \Rfunction{compareCluster}.

The comparison function was designed as a general-package for
comparing gene clusters of any kind of ontology associations, not
only \Rfunction{groupGO}, \Rfunction{enrichGO}, and \Rfunction{enrichKEGG} this package provided, but also other biological and biomedical ontologies, for instance, \Rfunction{enrichDO} from \Rpackage{DOSE} and \Rfunction{enrichPathway} from \Rpackage{ReactomePA} work fine with \Rfunction{compareCluster} for comparing biological themes in disease and reactome pathway perspective. More details can be
  found in the vignettes of \Rpackage{DOSE} and \Rpackage{ReactomePA}.

\section{Session Information}

The version number of R and packages loaded for generating the vignette were:

<<sessInfo, results='asis', echo=FALSE>>=
toLatex(sessionInfo())
@

\bibliographystyle{unsrt}
\bibliography{clusterProfiler}
\end{document}