% NOTE -- ONLY EDIT THE .Rnw FILE!!! The .tex file is % likely to be overwritten. % % \VignetteIndexEntry{copa Overview} % \VignetteDepends{copa, Biobase, colonCA} % \VignetteKeywords{Expression Analysis, Postprocessing} % \VignettePackage{copa} \documentclass[11pt]{article} \usepackage[authoryear,round]{natbib} \usepackage{hyperref} \usepackage{times} \usepackage{comment} \parindent 0.5in \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rfunarg}[1]{\textit{#1}} \bibliographystyle{plainnat} \begin{document} \title{\bf Using copa} \author{James W. MacDonald} \maketitle \section{Overview} In certain cancers (lymphoma, sarcoma, leukemia), it is common to have recurrent chromosomal rearrangements that may be a causal factor in the progression of the disease \citet{Rowley:2001}. However, until recently these rearrangements have not been commonly found in other carcinomas \citet{Tomlins:2005}. Tomlins et al. describe a method they call Cancer Outlier Profile Analysis (COPA) that can be used to detect recurrent chromosomal rearrangements using microarray data. Their method however is limited to use with the Oncomine website \url{www.oncomine.org}, and does not at this time either pre-filter the data for likely candidates, nor compute any sort of inferential statistic. \section{Introduction} The idea behind COPA is very simple; it is well known that genetic translocations occur in cancer cells, and that these translocations can result in the up-regulation of oncogenes that may affect the progression of the cancer. This happens when the 5' activation domain of a constitutively expressed (or up-regulated) gene is fused to the 3' portion of a given oncogene, thus increasing the expression of the oncogene. This translocation can happen between the activating gene and multiple oncogenes, as was shown by Tomlins \textit{et al.}, as well as others \citet{Fonseca:2004}. Since a given translocation is only likely to occur once per sample, if there were multiple partners for a given activating gene, we would expect to see certain cancer samples with a high expression of say, gene A, whereas other cancer samples might have high expression of gene B, but these samples would be mutually exclusive. In addition, we would expect that the normal samples would not have high expression for either gene A nor B. We can use this idea to both pre-filter genes as well as finding interesting genes that may be involved in translocations. Common methods for detecting differences between tumor and normal samples will not work for finding these genes (e.g., $t$-tests); we need to find those genes where only a subset of the samples have high expression. To do this, we center and scale the data (on a row-wise basis) using the median and median average difference (MAD). We can then select a common value (default is 5) as a cutoff for 'outlier' status and apply this to all genes. We then simply look for pairs of genes that have a large number of mutually exclusive outlier (cancer) samples, but few or no normal outliers. The candidate gene pairs will be ranked based on the sum of outlier samples for each pair, as this seems to be the most reasonable criterion for ranking. Since there may be several gene pairs with the same number of outliers we need to add an additional criterion to rank the ties. We use a modification of the ranking scheme used by Tomlins \textit{et al.}. They simply ranked the genes using the 75$^{th}$, 95$^{th}$ and 99$^{th}$ percentiles of the centered and scaled expression values. Since we are looking at pairs rather than individual genes, we take the difference between the 75$^{th}$ percentile of the tumor and normal samples, and then compute the sum of these differences for each gene pair. This value quantifies how different the outlier pairs are from their corresponding normals. We chose the 75$^{th}$ percentile for ranking rather than say, the 95$^{th}$ because the values at the higher percentile are what caused the gene pairs to be selected in the first place, so we want to use a less extreme percentile to distinguish between the tied pairs. \section{Using copa} To search for gene pairs that may be involved in translocations is very simple. Just load the \Rpackage{copa} package, and run the \Rfunction{copa} function using your microarray expression values. For this vignette, we will be using the \Rpackage{colonCA} package, which contains an \Robject{ExpressionSet} with normal and tumor colon expression data. <<>>= library(copa) library(colonCA) data(colonCA) head(pData(colonCA), 10) @ We will use the third column of the \Robject{phenoData} object as our classlabel, which tells \Rfunction{copa} which samples are tumor and which are normal. There is no need to pre-filter the expression data; \Rfunction{copa} has an internal pre-filtering step that selects the top \Rfunarg{pct}(percentile) of the data, based on the number of outliers. The default is to use the $95^{th}$ percentile as a cutoff; if this results in more than 1000 genes, \Rfunction{copa} will give a warning and allow you to abort the run (and presumably re-run with a higher value for \Rfunarg{pct}). One thing to keep in mind is that \Rfunction{copa} is going to be computing all pairwise sums of outliers, which can get to be a large number of computations really quickly (hence the warning at n = 1000). Although this portion of the function is written in C, and is actually quite fast, a large number of comparisons will tend to slow things down. <<>>= rslt <- copa(colonCA, as.numeric(pData(colonCA)[,3])) @ We can now look at a plot showing the number of outliers for each gene. <>= plotCopa(rslt, idx = 1, col = c("lightgreen","salmon")) @ \begin{figure} \centering <>= plotCopa(rslt, idx = 1, col = c("lightgreen","salmon")) @ \caption{Plot of 'Top' Gene Pair} \label{fig:copa} \end{figure} Figure~\ref{fig:copa} shows the outlier status of the 'top' gene pair (based on having the most outlier samples). If using an Affymetrix GeneChip for which there is an annotation package, one can label the plots with the corresponding gene symbol by specifying the \Rfunarg{lib} argument. Unfortunately, the colonCA data is based on a Hum6000 Affy chip, for which there is no annotation package. This plot doesn't look that impressive, as there are only a few samples that fulfill the criterion for outlier status. We can look at how many gene pairs there are with a given number of outliers using the \Rfunction{tableCopa} function. <<>>= tableCopa(rslt) @ We might then want to know which genes have \Sexpr{as.numeric(names(tableCopa(rslt)[1]))}. We can list them out using the \Rfunction{summaryCopa} function. <<>>= summaryCopa(rslt, 9) @ We might now want to know how significant this result is. We can test this hypothesis by permuting the class labels of the samples many times and then checking to see how often we see pairs of genes with a certain number of outliers. By permuting the class labels we are mixing up the tumor and normals, so any pair with a large number of outliers by definition has arisen by chance. If we get many gene pairs with say, \Sexpr{as.numeric(names(tableCopa(rslt)[1]))} outliers then it is fairly likely that our observed results could have arisen by chance as well. However, if the opposite is true, then we have some reassurance that the observed results are not a chance event, and these gene pairs may well be undergoing some sort of recombination. <<>>= prm <- copaPerm(colonCA, rslt, 9, 24) sum(prm >= 9) @ In this instance, there are \Sexpr{sum(prm >= 9)} times that the permuted data resulted in a number of outliers as large or larger than what we observed. This indicates that there may well be some recombination going on here, and it might be worthwhile to explore further. A few notes about this function. First, it repeatedly re-runs the \Rfunction{copa} function after permuting the classlabels, so any caveats that I gave above about the number of genes to use above applies a hundred fold here. Note that the percentile cutoff used to create the \Robject{copa} object will be re-used for the permutations, so if the cutoff is too lenient, you may repeatedly be queried because of too many genes. Second, the default for this function is 100 permutations. This is enough to get a basic idea, but is far too few to calculate a $p$-value or false discovery rate (FDR). For that, one should use at least 500 - 1000 permutations. Even at 1000 permutations, the smallest $p$-value will be 0.001 (actually the smallest will be 0, but the second smallest will be 0.001). Running \Rfunction{copaPerm} here on approximately \Sexpr{dim(rslt$mat)[1]} genes takes about 90 seconds. Increasing either the number of genes or the permutations may necessitate an overnight run. \bibliography{copa} \end{document}