%\VignetteIndexEntry{tigre User Guide}
%\VignetteKeywords{TimeCourse, GeneExpression, Transcription}
%\VignetteDepends{Biobase, annotate, puma, BiocStyle}
%\VignettePackage{tigre}
\documentclass[a4paper]{article}
\usepackage{url}

<<style, eval=TRUE, echo=FALSE, results=tex>>=
BiocStyle::latex()
@

\title{tigre User Guide}
\author{Antti Honkela, Pei Gao,\\
  Jonatan Ropponen, Miika-Petteri Matikainen,\\
  Magnus Rattray, and Neil D. Lawrence}

\newcommand{\tigre}{\Biocpkg{tigre}}

\begin{document}
\maketitle
\SweaveOpts{keep.source=TRUE}

\section{Abstract}

The \tigre{} package implements our methodology of Gaussian process
differential equation models for analysis of gene expression time
series from single input motif networks.  The package can be used for
inferring unobserved transcription factor (TF) protein concentrations
from expression measurements of known target genes, or for ranking
candidate targets of a TF.

\section{Citing \tigre{}}

The \tigre{} package is based on a body of methodological research.
Citing \tigre{} in publications will usually involve citing one or
more of the methodology papers
\cite{Honkela2010PNAS,Gao2008,Lawrence2007} that the software is
based on as well as citing the software package itself
\cite{Honkela2011}.

<<echo=FALSE, eval=TRUE>>=
options(width = 60)
@

\section{Introductory example analysis - Drosophila development}
\label{section:Introductory example}

In this section we introduce the main functions of the \Rpackage{puma}
package by repeating some of the analysis from the PNAS
paper~\cite{Honkela2010PNAS}\footnote{Note that the results reported
  in the paper were run using an earlier version of this package for
  MATLAB, so there can be minor differences.}.

\subsection{Installing the \tigre{} package}

The recommended way to install \tigre{} is to use the
\Rfunction{BiocManager::install} function. Installing in this way
should ensure that all appropriate dependencies are met.

<< eval=FALSE >>==
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("tigre")
@

% To install the tigre software, unpack the software and run
% \begin{verbatim}
% R CMD INSTALL tigre
% \end{verbatim}

To load the package start R and run
<<>>=
library(tigre)
@ 

\subsection{Loading the data}

To get started, you need some preprocessed time series expression
data.  If the data originates from Affymetrix arrays, we highly
recommend processing it with \Rfunction{mmgmos} from the
\Rpackage{puma} package.  This processing extracts error bars on the
expression measurements directly from the array data to allow judging
the reliability of individual measurements.  This information is
directly utilised by all the models in this package.

To start from scratch on Affymetrix data, the .CEL files from
\url{ftp://ftp.fruitfly.org/pub/embryo_tc_array_data/} may be
processed using:
<<eval=FALSE>>=
# Names of CEL files
expfiles <- c(paste("embryo_tc_4_", 1:12, ".CEL", sep=""),
              paste("embryo_tc_6_", 1:12, ".CEL", sep=""),
              paste("embryo_tc_8_", 1:12, ".CEL", sep=""))
# Load the CEL files
expdata <- ReadAffy(filenames=expfiles,
                    celfile.path="embryo_tc_array_data")
# Setup experimental data (observation times)
pData(expdata) <- data.frame("time.h" = rep(1:12, 3),
                             row.names=rownames(pData(expdata)))
# Run mmgMOS processing (requires several minutes to complete)
drosophila_mmgmos_exprs <- mmgmos(expdata)
drosophila_mmgmos_fragment <- drosophila_mmgmos_exprs
@ 

This data needs to be further processed to make it suitable for our
models.  This can be done using
<<eval=FALSE>>=
drosophila_gpsim_fragment <-
  processData(drosophila_mmgmos_fragment,
              experiments=rep(1:3, each=12))
@ 

Here the last argument specifies that we have three independent time
series of measurements.

In order to save time with the demos, a part of the result of this is
included in this package and can be loaded using
<<>>=
data(drosophila_gpsim_fragment)
@ 

\subsection{Learning individual models}

Let us now recreate some the models shown in the plots of the PNAS
paper~\cite{Honkela2010PNAS}:
<<>>=
# FBgn names of target genes
targets <- c('FBgn0003486', 'FBgn0033188', 'FBgn0035257')

# Load gene annotations
library(annotate)
aliasMapping <- getAnnMap("ALIAS2PROBE",
                  annotation(drosophila_gpsim_fragment))
# Get the probe identifier for TF 'twi'
twi <- get('twi', env=aliasMapping)
# Load alternative gene annotations
fbgnMapping <- getAnnMap("FLYBASE2PROBE",
                 annotation(drosophila_gpsim_fragment))
# Get the probe identifiers for target genes
targetProbes <- mget(targets, env=fbgnMapping)

st_models <- list()
# Learn single-target models for each gene individually
for (i in seq(along=targetProbes)) {
  st_models[[i]] <- GPLearn(drosophila_gpsim_fragment,
                            TF=twi, targets=targetProbes[i],
                            quiet=TRUE)
}
# Learn a joint model for all targets
mt_model <- GPLearn(drosophila_gpsim_fragment, TF=twi,
                    targets=targetProbes,
                    quiet=TRUE)
# Display the joint model parameters
show(mt_model)
# Learn a model without TF mRNA and TF protein translation
nt_model <- GPLearn(drosophila_gpsim_fragment,
                    targets=c(twi, targetProbes[1:2]), quiet=TRUE)
@ 

\subsection{Visualising the models}

The models can be plotted using commands like
<<eval=FALSE>>=
GPPlot(st_models[[1]], nameMapping=getAnnMap("FLYBASE",
                        annotation(drosophila_gpsim_fragment)))
GPPlot(mt_model, nameMapping=getAnnMap("FLYBASE",
                  annotation(drosophila_gpsim_fragment)))
GPPlot(nt_model, nameMapping=getAnnMap("FLYBASE",
                  annotation(drosophila_gpsim_fragment)))
@ 

\begin{figure}
  \begin{center}
<<fig=TRUE, echo=FALSE>>=
GPPlot(st_models[[1]], nameMapping=getAnnMap("FLYBASE",
                         annotation(drosophila_gpsim_fragment)))
@
\end{center}
\caption{Single target models for the gene FBgn0003486. The models for
  each repeated time series are shown in different columns.}
\end{figure}

\begin{figure}
  \begin{center}
<<fig=TRUE, echo=FALSE>>=
GPPlot(mt_model, nameMapping=getAnnMap("FLYBASE",
                   annotation(drosophila_gpsim_fragment)))
@
\end{center}
\caption{Multiple-target model for all the example genes.  The
  call creates independent figures for each repeated time series.}
\end{figure}

\begin{figure}
  \begin{center}
<<fig=TRUE, echo=FALSE>>=
GPPlot(nt_model, nameMapping=getAnnMap("FLYBASE",
                   annotation(drosophila_gpsim_fragment)))
@
\end{center}
\caption{Multiple-target model without TF protein translation for
  selected example genes without.  The
  call creates independent figures for each repeated time series.}
\end{figure}

\subsection{Ranking the targets}

Bulk ranking of candidate targets can be accomplished using
<<>>=
## Rank the targets, filtering weakly expressed genes with average
## expression z-score below 1.8
scores <- GPRankTargets(drosophila_gpsim_fragment, TF=twi,
                        testTargets=targetProbes,
                        options=list(quiet=TRUE),
                        filterLimit=1.8)
## Sort the returned list according to log-likelihood
scores <- sort(scores, decreasing=TRUE)
write.scores(scores)
@ 

To save space, \Rfunction{GPRankTargets} does not return the models by
default.  If those are needed later e.g. for plotting, they can be
recreated using the inferred parameters saved together with the
ranking using
<<>>=
topmodel <- generateModels(drosophila_gpsim_fragment,
                           scores[1])
show(topmodel)
@ 

\subsection{Ranking using known targets with multiple-target models}

Ranking using known targets with multiple-target models can be
accomplished simply by adding the \texttt{knownTargets} argument
<<>>=
## Rank the targets, filtering weakly expressed genes with average
## expression z-score below 1.8
scores <- GPRankTargets(drosophila_gpsim_fragment, TF=twi,
                        knownTargets=targetProbes[1],
                        testTargets=targetProbes[2:3],
                        options=list(quiet=TRUE),
                        filterLimit=1.8)
## Sort the returned list according to log-likelihood
scores <- sort(scores, decreasing=TRUE)
write.scores(scores)
@ 

\subsection{Running ranking in a batch environment}

\Rfunction{GPRankTargets} can be easily run in a batch environment
using the argument \texttt{scoreSaveFile}.  This indicates a file to
which scores are saved after processing each gene.  Thus one could,
for example, split the data to, say, 3 separate blocks according to
the reminder after division by 3 and run each of these independently.
The first for loop could then be run in parallel (e.g. as separate
jobs on a cluster), as each step is independent of the others.  After
these have all completed, the latter loop could be used to gather the
results.
<<eval=FALSE>>=
for (i in seq(1, 3)) {
  targetIndices <- seq(i,
    length(featureNames(drosophila_gpsim_fragment)), by=3)
  outfile <- paste('ranking_results_', i, '.Rdata', sep='')
  scores <- GPrankTargets(preprocData, TF=twi,
                          testTargets=targetIndices,
                          scoreSaveFile=outfile)
}

for (i in seq(1, 3)) {
  outfile <- paste('ranking_results_', i, '.Rdata', sep='')
  load(outfile)
  if (i==1)
    scores <- scoreList
  else
    scores <- c(scores, scoreList)
}
show(scores)
@ 

\section{Experimental feature: Using non-Affymetrix data}

Using non-Affymetrix data, or data without associated uncertainty
information for the expression data in general, requires more because
of two reasons
\begin{itemize}
\item noise variances need to be estimated together with other model
  parameters; and
\item weakly expressed genes cannot be easily filtered \emph{a
    priori}.
\end{itemize}

The first of these is automatically taken care of by all the above
functions, but the latter requires some extra steps after fitting the
models.

In order to get started, you need to create an
\Rfunction{ExpressionTimeSeries} object of your data set.  This can be
accomplished with the function
<<eval=FALSE>>=
procData <- processRawData(data, times=c(...),
                           experiments=c(...))
@ 

Filtering of weakly expressed genes requires more care and visualising
the fitted models is highly recommended to avoid mistakes.

Based on initial experiments, it seems possible to perform the
filtering based on the statistic \texttt{loglikelihoods(scores) -
  baseloglikelihoods(scores)}, but selection of suitable threshold is
highly dataset specific.

\section{Exporting results to an SQLite database}

The results of the analysis can be stored to an SQLite database. The
database can then be browsed and queried using the
\href{https://github.com/PROBIC/tigreBrowser}{tigreBrowser}
result browser. The data
is inserted to the database by using \Rfunction{export.scores} function.

An example of the usage of \Rfunction{export.scores} is given below
<<eval=FALSE>>=
export.scores(scores, datasetName='Drosophila',
              experimentSet='GPSIM/GPDISIM',
              database='database.sqlite',
              preprocData=drosophila_gpsim_fragment,
              models=models,
              aliasTypes=c('SYMBOL', 'GENENAME', 'FLYBASE', 'ENTREZID'))
@

In this example, \texttt{scores} is the return value of
\Rfunction{GPRankTargets}, \texttt{'Drosophila'} is the name of a dataset in
database and \texttt{'GPSIM/GPDISIM'} is the name of an experiment set in
database. In general, results with the same dataset name are considered to be
part of same dataset. That is, if no results with a given dataset are already
in the database, a new dataset entry is created.  On the other hand, if the
database already contains results with the same dataset name, new results will
be added to the old dataset.

Also, results from different experiments can be combined into a set of
experiments by giving them the same experiment set name. This is useful as a
result browser may display results depending on the experiment set.

\texttt{database.sqlite} is the filename of a database file. The file will
be created if it does not exist already.

The function will create model figures and add them to the database if
preprocessed data is given. In this example, models are given to the
function as a parameter. This is not necessary, however, as the function can
create models if preprocessed data is supplied. Nevertheless, the function will
finish faster if it does not have to (re-)create models.

In addition to log likelihoods and z-scores, this function will also export
different gene names and aliases to the database. By default, the function will
read GENENAME, SYMBOL and ENTREZID datas from relevant annotations and insert
those into the database. The function takes also \texttt{aliasTypes} argument
which is used to define which annotation information is inserted. In the
example above, FLYBASE gene numbers are also added to the genes in the
database. The insertion of alias annotations and z-scores requires that the
preprocessed data is supplied.

\section{Session Info}

<<sessionInfo>>=
sessionInfo()
@ 

\bibliography{gpsim}

\end{document}