%  To compile this:
%  1. use R-2.4.1 
%  2. Sweave("READMEXhyb.Rnw")
%  3. pdflatex READMEXhyb; pdflatex READMEXhyb
%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[11pt,a4paper]{article}
%\VignetteIndexEntry{Xhyb}
\usepackage{amsmath,fullpage}
\usepackage{theorem}
\usepackage{color}
\usepackage{url}
\parindent 0in  % Left justify

\newcommand{\Rpackage}[1]{\textsf{#1}}
\newcommand{\Rfunction}[1]{\textit{#1}}
\newcommand{\Robject}[1]{\texttt{#1}}
\newcommand{\Rclass}[1]{{\textit{#1}}}
\newcommand{\Rfunarg}[1]{{\textit{#1}}}

\newcommand{\inclfig}[3]{%
 \begin{figure}[tb] \begin{center}
   \includegraphics[width=#2]{#1}
   \caption{\label{#1}#3}
 \end{center} \end{figure}
}

\theoremstyle{break} \newtheorem{Ex}{Exercise}
\newenvironment{solution}{\color{blue}}{\color{black}}

\author{Tineke Casneuf}
\title{XhybCasneuf package}

\begin{document}

\maketitle

This package contains the data that was generated by and investigated in our study, 
'The Effect of Cross-Hybridisation on Expression Correlations Inferred from Microarrays' 
(manuscript submitted). \\
In the first part, we investigated the relationship between
reporter-to-transcript sequence similarity and correlation of expression signals. 
We assessed the extent to which off-target reporters in probe sets, i.e. 
reporters that (partially) align to another transcript than the one intended, 
influence the expression correlation of the intended and off-target probe set.
For a given probe set X, intented to target gene X and a potential off-target gene Y, 
the variables were calculated as followed:\\
First X's reporters' sequences were aligned to the transcript of Y. 
To quantify the potential off-target affinity of probe set X to Y, the
75th percentile $Q_{XY}^{75}$ was then calculated of these alignment scores 
$\{a_1,\ldots,a_n\}$.
We obtained the expression correlation measure by calculating the Pearson 
correlation coefficient between the intensity values of X and Y in 14 
different plant tissues. These data are a subset of the AtgenExpress 
project data \cite{ATGE, tissue}.
\\
We illustrated our findings with three detailed examples of cross-hybridisation.
We also ran a simulation to demontstrate the effect of individual reporters on summarisation.
\\
Our study shows that numerous probe sets on a widely used commercial array platform 
contain off-target reporters, and many probe sets show a signal pattern that is highly 
similar to that of unintended transcripts. In addition, a positive correlation is 
revealed between off-target alignment score of different reporters and the magnitude of 
their off-target correlation. We evaluated the conventional probe set design, as defined 
by the array manufacturer (Affymetrix CDF), and compared it to a custom-made CDF. 
We demonstrate that careful reporter mapping alleviates cross-hybridisation effects 
to a substantial extent. This analysis was conducted on the ATH1 Affymetrix GeneChip for
\textit{Arabidopsis thaliana}.
\\ 
This package contains, for both the Affymetrix and the custom-made CDF, 
the data of probe set pairs with a off-target sensitivity score of $\ge55$.
\\
Load the library:
<<>>=
 library("XhybCasneuf")
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Probe set off-target sensitivity and expression correlation}


First, we will have a look at the relation between expression correlation and 
the off-target sensitivity of these all pairs of the Affymetrix and our custom-made
CDF. Because a positive trend between (reporter) alignment strength and expression correlation 
is not unexpected for functionally related genes like paralogous genes or genes that 
share protein domains, we omitted gene pairs that aligned to each other 
with BLAST~\cite{BLAST} in at least one direction with an E-value smaller than 
$10^{-10}$.  


<<>>=
## all probe set pairs
data(AffysTissue)
data(CustomsTissue)

## probe set pairs of genes that do not align to each other with BLAST with a E-value smaller than $10^{-10}$
data(AffysTissue.noBl)
data(CustomsTissue.noBl)
@

We now write function that will construct boxplots of these 4 data sets:
<<>>=
 myXs <- c(seq(55,70, length.out =3),seq(75,125,length=5))

 tiltedmyboxF <- function(X,Y, main){
   par(mar = c(7, 4, 4, 2) + 0.1)
   boxplot(Y~X, col = "skyblue2", ylim=c(-1,1), ylab = expression(rho[xy]), varwidth=T,xaxt = "n", cex.lab = 1.4, main = main, xlab ="")
   axis(1, labels = FALSE)
   text(seq_len(nlevels(X)), par("usr")[3] - 0.15, srt = 45, adj = 1, labels = levels(X), xpd = TRUE)
   text(4, par("usr")[3] - 0.6, adj = 1, labels = expression(Q[xy]^75), xpd = TRUE)
   abline(0,0, lty=2)
  }
@

And we plot them:

<<fig=TRUE, echo=TRUE, width = 9, height = 9>>=
# X11(height = 10, width = 10)
 layout(matrix(1:4, nrow = 2, byrow = F))
  # ALL PAIRS of the custom and affymetrix CDF
  tiltedmyboxF(cut(AffysTissue$alSum, myXs, include.lowest = TRUE, right = TRUE), AffysTissue$peCC, main = "A")
  tiltedmyboxF(cut(CustomsTissue$alSum, myXs, include.lowest = TRUE, right = TRUE), CustomsTissue$peCC, main = "B")
  # EXCLUDE gene pairs with BLAST HIT with E-value < 10^-10
  tiltedmyboxF(cut(AffysTissue.noBl$alSum, myXs, include.lowest = TRUE, right = TRUE), AffysTissue.noBl$peCC, main = "C")
  tiltedmyboxF(cut(CustomsTissue.noBl$alSum, myXs, include.lowest = TRUE, right = TRUE), CustomsTissue.noBl$peCC, main = "D")
@

These boxplots reveal a positive
relation between the two variables: a gene whose expression is measured by reporters 
that align well to a different transcript tends to have an expression signal that is 
correlated with that of the other transcript (plots A and B). This positive trend is present even 
between gene pairs that do not share longer stretches of sequence similarity 
and where the reporter to off-target alignment is only based on short near-matches
(plots C versus A and D versus B).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Reporter off-target sensitivity and expression correlation} 

We also studied the behavior of off-target 
sensitivity and signal correlation of different reporters within a probe set. 
For a probe set X and an off-target gene Y, we calculated the 
metacorrelation $\mbox{cor}(\rho_{X_iY}, a_i)$ between the 
alignment scores $a_i$ of X's reporters to Y's transcript sequence 
and the Pearson correlation coefficients of the reporters' signal patterns to     %'
the expression pattern of Y. We reasoned that if cross-hybridisation
occurs, a positive trend between reporter to off-target correlation and 
the alignment score $a_i$ can be detected. Conversely, lack of such a trend
may indicate that cross-hybridisation is negligible.

<<fig=TRUE, echo=TRUE, width = 13, height =5>>=
data(AffysTissueMC)
data(CustomsTissueMC)

myboxplot <- function(X, Y, main){
 boxplot(Y~X, col = "skyblue2", ylim=c(-1,1), ylab = expression(cor(rho[x[i]*Y]*a[i])), varwidth=T,xaxt = "n", cex.lab = 1.4, main = main, xlab =expression(Q[xy]^75))
 axis(1, labels = FALSE)
 text(seq_len(nlevels(X))+0.25, par("usr")[3] - 0.12, srt = 20, adj = 1, labels = levels(X), xpd = TRUE)
 abline(0,0, lty=2)
}

 par(mfrow=c(1,2))
 # ALL PAIRS of the Affymetrix CDF
 X <- cut(AffysTissueMC$alSum, myXs, include.lowest = TRUE, right = TRUE)
 myboxplot(X, AffysTissueMC$Mcor, main = "A")
 # ALL PAIRS of the custom-made CDF
 X <- cut(CustomsTissueMC$alSum, myXs, include.lowest = TRUE, right = TRUE)
 myboxplot(X, CustomsTissueMC$Mcor, main = "B")

@

The distribution of the metacorrelations of most probe set pairs corresponds to 
a random distribution centered around zero.
However for those strata with high off-target sensitivity score the distribution is shifted
upwards. This means that within these probe sets some reporters do not correlate with the 
off-target, while others do, depending on their alignments score.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Examples}

We illustrate our findings with three examples. The data for these
are contained in an 'XhybExamples' -S4- class object.
'plotExample' is a method that takes 'XhybExamples' objects and plots them:
\\
Example 1:
<<fig=TRUE, echo=TRUE>>=
data(ex1)
plotExample(ex1)
@
\\
Example 2:
<<fig=TRUE, echo=TRUE>>=
data(ex2)
plotExample(ex2)
@
\\
Example 3:
<<fig=TRUE, echo=TRUE>>=
data(ex3)
plotExample(ex3)
@

Plot A contains the summarized expression values of a probe set X 
(in blue) and an off-target gene Y (in orange) in the tissue dataset. 
Plot B shows the background corrected, normalized 
signal profiles of X's reporters. In plot C,
for each reporter $\rho_{X_{i}Y}$, the Pearson correlation coefficient 
calculated between its signal profile and that of Y (orange in A-D-G) 
is plotted in function of its alignment score $a_{X_{i}Y}$. 
The colors used to plot the profiles in B and the data in C
correspond to the alignment score of the particular reporter to Y's transcript and 
is explained in the legend. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Simulation}

We also used a simulation to show the effect of individual reporters on summarisation. 

A given gene A has a sinusoidal expression pattern
over the course of 14 time points in an experiment. Plot A 
shows the signal profiles of the 11 reporters of this
gene's probe set, with data simulated using an established error model  %'
for microarray data~\cite{rockedurbin}. The 11 reporters of a probe set
B in plot B show random signals without any underlying
trend. Nine of the reporters of a probe set C have identical signals as nine
reporters of probe set B, while the remaining two reporters
cross-hybridize with the transcript of gene A (plot C). 
The correlation coefficients calculated on the summarised expression values
obtained by different methods (median polish~\cite{rma2} (shown in plot D),
dChip~\cite{dchip1,dchip2} introduced by Li-Wong and one-step Tukey's    
Biweight~\cite{algoraff} used in Affymetrix' MAS~5 software 
are printed to the screen.

<<fig=TRUE, echo=TRUE, width = 8, height = 8>>=
runSimulation()
@



\begin{thebibliography}{10}

\bibitem{ATGE}
Schmid, M. and Davison, T.S. and Henz, S.R. and Pape, UJ and Demar, M. and
Vingron, M. and Scholkopf, B. and Weigel, D. and Lohmann, JU:
\textbf{A gene expression map of Arabidopsis thaliana development}
\emph{Nature Genetics} 2005, \textbf{37}(5): 501-506

\bibitem{tissue}
  AtGenExpress dataset
\url{http://www.weigelworld.org/resources/microarray/AtGenExpress/AtGE_dev_samples.pdf}

\bibitem{BLAST}
Altschul, S.F. and Gish, W. and Miller, W. and Myers, E.W. and Lipman, D.J.:
\textbf{Basic local alignment search tool} \textit{Journal of Molecular Biology},
1990, \textbf{215}: 403-410


\bibitem{rockedurbin}
Rocke, D.M. and Durbin B. :
\textbf{A Model for Measurement Error for Gene Expression Arrays}
\textit{Journal of Computational Biology} 2001, 
\textbf{8}(6): 557-569


\bibitem{rma2}
Irizarry, R.A. and Bolstad, B.M. and Collin, F. and Cope, L.M. and Hobbs, B. and Speed, T.P.
\textbf{{S}ummaries of {A}ffymetrix {G}ene{C}hip probe level data}
\textit{Nucleic Acids Res} 2003, \textbf{31}(4):e15

\bibitem{dchip1}
Li, C. and Wong, W.H.
\textbf{Model-based analysis of oligonucleotide arrays:
model validation, design issues and standard error application}
\textit{Genome Biology} 2001, \textbf{2}(8): 1-11

\bibitem{dchip2}
Li, C. and Wong, W.H.
\textbf{Model-based analysis of oligonucleotide arrays:
Expression index computation and outlier detection}
\textit{Proceedings of the National Academy of Sciences of the United Sates of America}
 2001, \textbf{98}(1): 31-36

\bibitem{algoraff}
Statistical Algorithms Description Document (2002)
 \url{http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf}

\end{thebibliography}

\end{document}