%
% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.
%
%\VignetteDepends{graph,ScISI,ppiData}
%\VignetteKeywords{Protein Interaction, Graphs}
%\VignettePackage{ppiStats}
%\VignetteIndexEntry{ppiStats}

\documentclass[11pt]{article}


\usepackage{amsmath,pstricks}
\usepackage[authoryear,round]{natbib}
\usepackage{hyperref}
\usepackage{times}
\usepackage{comment}
\usepackage{subfigure}

\textwidth=6.2in
\textheight=8.5in
\oddsidemargin=.1in

\evensidemargin=.1in
\headheight=-.3in

\newcommand{\scscst}{\scriptscriptstyle}
\newcommand{\scst}{\scriptstyle}

\newcommand{\Rfunction}[1]{{\texttt{#1}}}
\newcommand{\Robject}[1]{{\texttt{#1}}}
\newcommand{\Rpackage}[1]{{\textit{#1}}}


\bibliographystyle{plainnat}

\title{Exploring protein interaction data using \Rpackage{ppiStats}}
\author{Denise Scholtens and Tony Chiang}
\date{}

\begin{document}
\maketitle

\section*{Introduction}

\Rpackage{ppiStats} is a package that provides tools for the description and
analysis of protein interaction data represented in node-and-edge graph form.
The package is specifically designed to address data generated by bait-prey
technologies (e.g. yeast-two-hybrid (Y2H) and affinity purification-mass
spectrometry (AP-MS)) in which directed edges extend from bait
proteins toward detected prey proteins, with the nature of the detected
interaction depending on the technology.

This vignette demonstrates the main functionality of the \Rpackage{ppiStats} package 
including: 1) summarizing observed data, 2) identifying systematically biased 
proteins in a data set, and 3) estimating stochastic error probabilities.  In order 
to demonstrate \Rpackage{ppiStats}, we will make use of the protein interaction data 
sets available in \Rpackage{ppiData}.

<<loadlibs>>=
library("ppiStats")
library("ppiData")
@ 

In this vignette, we will use to AP-MS data sets reported by \cite{Gavin2002} 
(Gavin02) and \cite{Gavin2006} (Gavin06) for \textit{Saccharomyces cerevisiae}.  
AP-MS technology is a bait-prey system designed to detect complex co-membership on 
behalf of proteins.

Most of the functions available in \Rpackage{ppiStats} require objects that extend 
the class \Robject{graph}, defined in the \Rpackage{graph} package.  For example, 
\Robject{Gavin2002BPGraph} and \Robject{Gavin2006BPGraph} are \Robject{graphNEL} 
objects available in \Rpackage{ppiData}.  For assistance in converting raw data 
files into the appropriate object instance, see the help pages for the functions 
\Rfunction{bpMatrix} and \Rfunction{genBPGraph} in the \Rpackage{ppiStats} package.  
The \Rpackage{graph} and \Rpackage{RpsiXML} packages also contain helpful 
functionality.


<<getGavinData>>=
data(Gavin2002BPGraph)
data(Gavin2006BPGraph)
GavinGraphs = list(Gavin02=Gavin2002BPGraph,Gavin06=Gavin2006BPGraph)
GavinGraphs
@ 

\section*{Summarizing observed data}

While AP-MS data can in concept assay all possible complex co-membership 
interactions in a cell, in fact only a certain portion of those interactions are 
assayed in an experiment.  The sampled portion of the `interactome' depends on which 
proteins are selected as baits, the subset of those baits that are amenable to the 
technology, and the set of proteins in the cell that are available for detection as 
prey under the conditions of the experiment.  \cite{Chiang2007} present the notion 
of `viability' to summarize these ideas, defining `viable baits' (VBs) as baits that 
detect at least one prey and `viable prey' (VP) as proteins that are detected by at 
least one bait.  It is possible for a protein to be both a viable bait and a viable 
prey (VBP).  These definitions are important since only the edges that extend from 
VB nodes to VP nodes are deemed `tested' in the analyses presented here.  All other 
potential interactions (e.g. an edge between two nodes that are VP but not VBP) are 
treated as untested.

The function \Rfunction{createSummaryTables} lists several summary statistics about 
the viable protein population in an experiment.  In addition to describing the 
viable protein population, the total number of directed interactions (TI) is 
reported.

<<summaryTables>>=
summaryTables = createSummaryTables(GavinGraphs)
summaryTables
@ 

When examining the VB and VP populations, it can be helpful to compare the number of 
represented proteins to the total population in the cell, especially if the 
experiment is intended to be genome-wide.  Figure \ref{fig:viabilityCharts} 
demonstrates VB and VP coverage for the Gavin02 and Gavin06 experiments, relative to 
the entire 6000+ genes in yeast.

<<viabilityCharts,echo=TRUE,results=hide,fig=TRUE,width=5,height=5,prefix=FALSE,include=FALSE>>=
viabilityCharts(GavinGraphs)
@ 

\begin{figure}[tp]
\centering
\includegraphics[width=5in]{viabilityCharts}
\caption{\label{fig:viabilityCharts} Enumerations of the VB, VP, and VBP 
populations for the Gavin02 and Gavin06 datasets in comparison with the entire 
set of 6000+ genes in yeast.} 
\end{figure}

Rather than simply enumerating the viable protein population, the function 
\Rfunction{idViableProteins} identifies the specific VB, VP, and VBP proteins 
represented in a data set.

<<findVBVPVBPsets>>=
GavinViableProteins = lapply(GavinGraphs,FUN=idViableProteins)
sapply(GavinViableProteins,FUN=function(x) lapply(x,FUN=length))

Gavin06VBPproteins =  GavinViableProteins[["Gavin06"]]$VBP
Gavin06VBPproteins[1:4]
@ 

\section*{Identifying systematically biased proteins}

In some bait-prey experiments, certain proteins are prone to systematic bias.  
Specifically, in the VBP population, some proteins have a high imblance between in- 
and out-degree.  In-degree in the VBP population represents the number of times a 
VBP node is detected as prey by another VBP protein.  Out-degree represents the 
number of times that VBP node detects other prey.  When in-degree is much higher 
than out-degree, or vice versa, it may be an indication that the node is subject to 
more than simply stochastic measurement error in line with the sensitivity and 
specificity of the technology.  The \Rfunction{inOutScatterCharts} function compares 
in- and out-degree for the VBP nodes and provides boundaries for determining which 
nodes may be particularly prone to systematic error according to the Binomial model 
of \cite{Chiang2007}.

<<inOutScatterCharts,echo=TRUE,results=hide>>=
inOutScatterCharts(GavinGraphs)
@




\begin{figure}[htp]
  \centering
  \subfigure{
  \includegraphics[width=2.5in]{scp-Gavin02-ident.pdf}
  }
  \subfigure{
    \includegraphics[width=2.5in]{scp-Gavin02-sqrt.pdf}
  }
  \subfigure{
    \includegraphics[width=4in]{scp-Gavin02-hist.pdf}
  }
  \caption{Scatter plots corresponding to the in- and out-degree (and the square 
root of each) for the VBP nodes in Gavin02. The points within the diagonal bands 
represent those proteins for which there is not enough evidence to reject the null 
hypothesis that the protein might be affected by some systematic bias while the 
points outside the bands are rejected at the 0.01 p-value. The bottom plot describes 
the distribution of the p-values for each data point. While the majority of proteins 
are not rejected, we can see that a substantial number proteins are. Those proteins 
which are rejected should not be analyzed using the standard statistical methods to 
model stochastic variability nor should they be used to estimate the underlying 
features within the data.}
  \label{splots:Gavin02}
\end{figure}



\begin{figure}[htp]
  \centering
  \subfigure{
  \includegraphics[width=2.5in]{scp-Gavin06-ident.pdf}
  }
  \subfigure{
    \includegraphics[width=2.5in]{scp-Gavin06-sqrt.pdf}
  }
  \subfigure{
    \includegraphics[width=4in]{scp-Gavin06-hist.pdf}
  }
  \caption{The same plots as described in Figure \ref{splots:Gavin02} presented here 
for the Gavin06 data.}
  \label{splots:Gavin06}
\end{figure}

It appears that the Gavin06 dataset contains more proteins that are subject to 
systematic error than the Gavin02 dataset.  We can identify these proteints using 
the function \Rfunction{idSystematic}, remove them from the graph, and then proceed 
with estimation of stochastic error probabilities using the remaining VBP nodes.


<<sepSystematicStochastic>>=

systematicVBPsGavin06 = idSystematic(Gavin2006BPGraph,Gavin06VBPproteins,bpGraph=TRUE)

unbiasedGavin06VBPs = setdiff(Gavin06VBPproteins,systematicVBPsGavin06)

Gavin06VBPsubgraph = subGraph(unbiasedGavin06VBPs,Gavin2006BPGraph)
Gavin06VBPsubgraph
@

\section*{Estimating stochastic error probabilities}

Each edge between the subgraph induced by the VBP nodes is tested twice, once in 
each direction.  With perfectly sensitive and specific technology, each edge would 
either be observed twice or not observed twice.  The presence of unreciprocated 
edges indicates the presence of measurement error - either a false positive (FP) or 
a false negative (FN) observation.

The function \Rfunction{estimateCCMErrorRates} estimates stochastic error 
probabilities, specifically the false positive ($p_{FP}$) and false negative 
($p_{FN}$) probabilities, using the method of moments approach described by 
\cite{Chiang2007}.  \Rfunction{estimateCCMErrorRates} calls 
\Rfunction{estErrProbMethodOfMoments} to compute a solution manifold and then 
selects a single pair of estimates for $p_{FP}$ and $p_{FN}$ using a gold standard 
approach as described in \cite{Scholtens2008}.  In the example below, we use the 
\Robject{ScISIC} dataset from the package \Rpackage{ScISI} as the gold standard.  
This is a set of manurally curated complexes culled from GO and MIPS and serves to 
define a set of edges that should be detected by AP-MS technology.

<<estimateStochasticErrorProbabilities>>=
data(ScISIC)

Gavin06m = as(Gavin06VBPsubgraph,"matrix")
errorProbabilities = estimateCCMErrorRates(Gavin06m,ScISIC)

pTPestimate = errorProbabilities$globalpTP
pFPestimate =errorProbabilities$globalpFP
solutionManifold = errorProbabilities$probPairs

pTPestimate
pFPestimate
@

Figure \ref{fig:stochErr} plots the calculated solution manifold and highlights the 
gold standard estimates.

<<plotErrorEstimates,height=6,width=6,fig=TRUE,include=FALSE,echo=TRUE,results=hide,prefix=FALSE>>=
plot(solutionManifold[,"pFPs"],solutionManifold[,"pFNs"],type="l",col="violet",xlab="pFP",ylab="1-pTP")
points(pFPestimate,1-pTPestimate,cex=3,pch=8,col="turquoise")
@

\begin{figure}[tp]
\centering
\includegraphics[width=4in]{plotErrorEstimates}
\caption{\label{fig:stochErr} A plot of the solution manifold for estimates of 
$p_{FP}$ and $p_{FN}$ with the star denoting the gold standard point estimate.}
\end{figure}

Estimates of $p_{FP}$ and $p_{FN}$ are helpful in using statistical models to 
estimate features and parameters for that dataset.  One popular statististic is node 
degree, def. the number of interacting partners for a protein.  Nodes in an 
experimental AP-MS graph have an unknown true degree.  An AP-MS experiment yields an 
observed number of reciprocated edges, as well as unreciprocated in- and 
out-edges.  A few examples are listed below.

<<inoutStats>>=
Gavin06VBPobserved = calcInOutDegStats(Gavin06VBPsubgraph)
Gavin06VBPobserved$unrecipInDegree[5:10]
Gavin06VBPobserved$unrecipOutDegree[5:10]
Gavin06VBPobserved$recipIn[5:10]
@

The function \Rfunction{degreeEstimates} estimates degree for VBP nodes according to 
the parametric model described by \cite{Scholtens2008} using the observed data and 
the stochastic error probability estimates.

<<estimateDegree>>=
Gavin06VBPdegree = degreeEstimates(Gavin06m,pTPestimate,pFPestimate)
Gavin06VBPdegree[5:10]

@

\section*{Conclusion}

The \Rpackage{ppiStats} package is useful for describing data in graphs and is aimed 
at bait-prey systems.  It is useful for summarizing graph data, identifying 
systematically biased baits and estimating stochastic error probabilities in 
preparation for estimation of relevant graph features.  

<<sessionInfo>>=
sessionInfo()
@ 


\begin{thebibliography}{9}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
  \providecommand{\doi}[1]{doi: #1}\else
  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi

\bibitem[Gavin et~al. (2002)]{Gavin2002}
A.~C. Gavin et~al.
\newblock Functional organization of the yeast proteome by systematic analysis of 
protein complexes.
\newblock \emph{Nature}, 415:\penalty0 141--147, 2002.

\bibitem[Gavin et~al. (2006)]{Gavin2006}
A.~C. Gavin et al.
\newblock Proteome survey reveals modularity of the yeast cell machinery.
\newblock \emph{Nature}, 440:\penalty0 631--636, 2006.

\bibitem[Chiang et~al. (2007)]{Chiang2007}
T. Chiang et~al.
\newblock Coverage and error models of protein-protein interaction data by directed 
graph analysis.
\newblock \emph{Genome Biology}, 8:R186, 2007.

\bibitem[Scholtens et~al. (2008)]{Scholtens2008}
D. Scholtens et~al.
\newblock Estimating node degree in bait-prey graphs.
\newblock \emph{Bioinformatics}, 24:218-224, 2008.

\end{thebibliography}

\end{document}