%\VignetteIndexEntry{msmsTests: post test filters to improve reproducibility} %\VignetteDepends{msmsTests} %\VignetteKeywords{multivariate, hplot} %\VignettePackage{msmsTests} \documentclass[12pt,a4paper,oneside]{article} \usepackage{fullpage} \usepackage{float} \usepackage[section]{placeins} \usepackage{enumerate} \begin{document} \SweaveOpts{keep.source=TRUE,concordance=TRUE} \title{msmsTests package\\ LC-MS/MS post test filters to improve reproducibility} \author{Josep Gregori, Alex Sanchez, and Josep Villanueva\\ Vall Hebron Institute of Oncology \&\\ Statistics Dept. Barcelona University\\ \texttt{josep.gregori@gmail.com}} \maketitle \tableofcontents \section{Introduction} The \emph{omics} technologies are in general challenged by a high dimensionality problem, whereby a high number of statistical tests are carried out at the same time on very few samples. The high number of tests that are carried out simultaneously requires a p-value adjustment to control the false discovery rate (FDR). The low number of replicates, combined with a high number of statistical tests easily leads to low reproducibility in the results. Shi et al. addressed this issue in a systematic manner using a large dataset elaborated within the studies of the MicroArray Quality Control Consortium (MAQC) \cite{shi2010}. The MAQC studied the possible sources of the limitations of microarray (MA) studies that precluded obtaining reproducible results across laboratories and platforms. Their conclusions underscored the influence of batch effects and the limitation introduced by using only the p-value criteria to get ordered lists of differentially expressed features as the main weaknesses in MA studies \cite{shi2008} \cite{luo2010}. This recommendation makes an intuitive appeal to the use of a post-test filter, excluding features that show low p-values but poor effect size or low signal strength, as a mean to guaranty a minimal level of reproducibility. Label-free differential proteomics is based on comparing the expression of proteins between different biological conditions \cite{mallik2010} \cite{neilson2011}. The expression of a protein by LC-MS/MS may be obtained by MS peak intensity measures or by the number of spectral counts assigned to that protein in a LC-MS/MS run. Here we concentrate on spectral counts. In proteomics reproducibility is also of great concern, and the use of such post-test filters may help notably \cite{gregori2013}. \section{An example LC-MS/MS dataset} The example dataset \cite{gregori2013} is the result of multiple spiking experiments, showing real LC-MS/MS data . Samples of 500 micrograms of a standard yeast lysate are spiked with 100, 200, 400 and 600fm of a complex mix of 48 equimolar human proteins (UPS1, Sigma-Aldrich). The number of technical replicates vary between 3 to 6. \newline The dataset consists in an instance of the \emph{MSnSet} class, defined in the MSnbase package \cite{Gatto2012}, a S4 class \cite{chambers} \cite{genolini}. This \emph{MSnSet} object contains a spectral counts (SpC) matrix in the \emph{assayData} slot, and a factor treatment in the \emph{phenoData} slot. (See also the expressionSet vignette by vignette("ExpressionSetIntroduction",package="Biobase") \cite{gentleman}) <>= options(continue=" ") @ <>= library(msmsTests) data(msms.spk) msms.spk dim(msms.spk) head(pData(msms.spk)) table(pData(msms.spk)$treat) @ Although the mix is equimolar the signal strength of each protein is markedly different, allowing to cover the full range of SpC values, what makes it specially worth in this sort of experiments: <>= msms.spc <- exprs(msms.spk) treat <- pData(msms.spk) idx <- grep("HUMAN",rownames(msms.spc)) mSpC <- t( apply(msms.spc[idx,],1,function(x) tapply(x,treat,mean)) ) apply(mSpC,2,summary) @ \section{Differential expression tests on spectral counts} Spectral counts (SpC) is an integer measure which requires of tests suited to compare counts, as the GML methods based in the Poisson distribution, the negative-binomial, or the quasilikelihood \cite{agresti2002} Generally speaking no test is superior to the other. They are just more or less indicated in some cases. The Poisson regression requires the estimation of just one parameter, and is indicated when the number of replicates is low, two or three. A drawback of the Poisson distribution is that its variance equals its mean, and it is not able to explain extra sources of variability a part of the sampling. Quasi-likelihood is a distribution independent GLM, but requires of the estimation of two parameters and hence needs a higher number of replicates, i.e. not less than four. The negative-binomial requires two parameters too, but we may use the implementation of this GLM in the edgeR package \cite{robinson2010} which uses an empirical Bayes method to share information across features and may be employed with a restricted number of replicates; in the worst case it limits with the Poisson solution. \section{Poisson GLM regression example} When using the Poisson distribution we implicitly accept a model not sensitive to biological variability \cite{agresti2002}. So it is just recommended in cases where we have very few replicates, if any, and we do not expect a significant biological variability between samples. <>= ### Subset to the 200 and 600fm spikings fl <- treat=="U200" | treat=="U600" e <- msms.spk[,fl] table(pData(e)) ### Remove all zero rows e <- pp.msms.data(e) dim(e) ### Null and alternative model null.f <- "y~1" alt.f <- "y~treat" ### Normalizing condition div <- apply(exprs(e),2,sum) ### Poisson GLM pois.res <- msms.glm.pois(e,alt.f,null.f,div=div) str(pois.res) ### DEPs on unadjusted p-values sum(pois.res$p.value<=0.01) ### DEPs on multitest adjusted p-values adjp <- p.adjust(pois.res$p.value,method="BH") sum(adjp<=0.01) ### The top features o <- order(pois.res$p.value) head(pois.res[o,],20) ### How the UPS1 proteins get ordered in the list grep("HUMAN",rownames(pois.res[o,])) ### Truth table nh <- length(grep("HUMAN",featureNames(e))) ny <- length(grep("HUMAN",featureNames(e),invert=TRUE)) tp <- length(grep("HUMAN",rownames(pois.res)[adjp<=0.01])) fp <- sum(adjp<=0.01)-tp (tt.pois1 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp)) @ \section{Quasi-likelihood GLM regression example} The quasi-likelihood is a distribution free model that allows for overdispersion, and could be indicated where an appreciable source of biological variability is expected. In this model, instead of specifying a probability distribution for the data, we just provide a relationship between mean and variance. This relationship takes the form of a function, with a multiplicative factor known as the overdispersion, which has to be estimated from the data \cite{agresti2002}. Its use in proteomics has been documented by Li et al. (2010)\cite{li2010}. <>= ### Quasi-likelihood GLM ql.res <- msms.glm.qlll(e,alt.f,null.f,div=div) str(ql.res) ### DEPs on unadjusted p-values sum(ql.res$p.value<=0.01) ### DEPs on multitest adjusted p-values adjp <- p.adjust(ql.res$p.value,method="BH") sum(adjp<=0.01) ### The top features o <- order(ql.res$p.value) head(ql.res[o,],20) ### How the UPS1 proteins get ordered in the list grep("HUMAN",rownames(ql.res[o,])) ### Truth table tp <- length(grep("HUMAN",rownames(ql.res)[adjp<=0.01])) fp <- sum(adjp<=0.01)-tp (tt.ql1 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp)) @ \section{edgeR: negative binomial GLM regression example} The negative-binomial provides another model that allows for overdispersion. The implementation adopted in this package is entirely based in the solution provided by the package edgeR \cite{robinson2010} which includes empirical Bayes methods to share information among features, and thus may be employed even when the number of replicates is as low as two. The negative-binomial is downward limited, when no overdispersion is observed, by the Poisson distribution. <>= ### Negative-binomial nb.res <- msms.edgeR(e,alt.f,null.f,div=div,fnm="treat") str(nb.res) ### DEPs on unadjusted p-values sum(nb.res$p.value<=0.01) ### DEPs on multitest adjusted p-values adjp <- p.adjust(nb.res$p.value,method="BH") sum(adjp<=0.01) ### The top features o <- order(nb.res$p.value) head(nb.res[o,],20) ### How the UPS1 proteins get ordered in the list grep("HUMAN",rownames(nb.res[o,])) ### Truth table tp <- length(grep("HUMAN",rownames(nb.res)[adjp<=0.01])) fp <- sum(adjp<=0.01)-tp (tt.nb1 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp)) @ \section{Reproducibility} In the \emph{omics} field, reproducibility is of biggest concern. Very low p-values for a protein in an experiment are not enough to declare that protein as of interest as a biomarker. A good biomarker should give as well a reproducible signal, and posses a biologically significant effect size. According to our experience, a protein giving less than three counts in the most abundant condition results of poor reproducibility, to be declared as statistically significant, experiment after experiment. On the other hand most of the false positives in an spiking experiment show log fold changes below 1. These two observations \cite{gregori2013} allow to improve the results obtained in the previous sections. The trick is to flag as relevant those proteins which have low p-values, high enough signal, and good effect size. In performing this relevance filter we may even accept higher adjusted p-values than usual. This flagging is provided by the function \texttt{test.results}. <>= ### Cut-off values for a relevant protein as biomarker alpha.cut <- 0.05 SpC.cut <- 2 lFC.cut <- 1 ### Relevant proteins according to the Poisson GLM pois.tbl <- test.results(pois.res,e,pData(e)$treat,"U600","U200",div, alpha=alpha.cut,minSpC=SpC.cut,minLFC=lFC.cut, method="BH")$tres (pois.nms <- rownames(pois.tbl)[pois.tbl$DEP]) ### Truth table ridx <- grep("HUMAN",pois.nms) tp <- length(ridx) fp <- length(pois.nms)-length(ridx) (tt.pois2 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp)) ### Relevant proteins according to the quasi-likelihood GLM ql.tbl <- test.results(ql.res,e,pData(e)$treat,"U600","U200",div, alpha=0.05,minSpC=2,minLFC=1,method="BH")$tres (ql.nms <- rownames(ql.tbl)[ql.tbl$DEP]) ### Truth table ridx <- grep("HUMAN",ql.nms) tp <- length(ridx) fp <- length(ql.nms)-length(ridx) (tt.ql2 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp)) ### Relevant proteins according to the negative-binomial GLM nb.tbl <- test.results(nb.res,e,pData(e)$treat,"U600","U200",div, alpha=0.05,minSpC=2,minLFC=1,method="BH")$tres (nb.nms <- rownames(nb.tbl)[nb.tbl$DEP]) ### Truth table ridx <- grep("HUMAN",nb.nms) tp <- length(ridx) fp <- length(nb.nms)-length(ridx) (tt.nb2 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp)) @ As you may see, without the post-test filter a relatively low adjusted p-value cut-off is required to keep an acceptable number of false positives. The post-test filter allows to relax the p-value cut-off improving at the same time both the number of true positives and false positives. <>= fl <- pois.tbl$adjp <= 0.05 sig <- sum(fl) tp <- length(grep("HUMAN",rownames(pois.tbl)[fl])) fp <- sig-tp tt.pois0 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp) fl <- ql.tbl$adjp <= 0.05 sig <- sum(fl) tp <- length(grep("HUMAN",rownames(ql.tbl)[fl])) fp <- sig-tp tt.ql0 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp) fl <- nb.tbl$adjp <= 0.05 sig <- sum(fl) tp <- length(grep("HUMAN",rownames(nb.tbl)[fl])) fp <- sig-tp tt.nb0 <- data.frame(TP=tp,FP=fp,TN=ny-fp,FN=nh-tp) library(xtable) df <- data.frame(Test=character(9),Significance=numeric(9), Filtered=character(9), TP=integer(9),FP=integer(9),TN=integer(9),FN=integer(9), stringsAsFactors=FALSE) df$Test[1] <- df$Test[4] <- df$Test[7] <- "Poisson" df$Test[2] <- df$Test[5] <- df$Test[8] <- "Quasi-likelihood" df$Test[3] <- df$Test[6] <- df$Test[9] <- "Negative-binomial" df$Significance[1:3] <- 0.05 df$Significance[4:6] <- 0.01 df$Significance[7:9] <- 0.05 df$Filtered[1:6] <- "No" df$Filtered[7:9] <- "Yes" df[1,4:7] <- tt.pois0 df[2,4:7] <- tt.ql0 df[3,4:7] <- tt.nb0 df[4,4:7] <- tt.pois1 df[5,4:7] <- tt.ql1 df[6,4:7] <- tt.nb1 df[7,4:7] <- tt.pois2 df[8,4:7] <- tt.ql2 df[9,4:7] <- tt.nb2 print(xtable(df,align=c("r","r","c","c","c","c","c","c"), caption=c("Truth tables"), display=c("d","s","f","s","d","d","d","d")), type="latex",hline.after=c(-1,0,3,6,9),include.rownames=FALSE) @ \section{Other functions in the package} A useful tool to visualize the global results of differential expression tests is a table of accumulated frequencies of features by p-values in bins of log fold changes. It may help in finding the most appropriate post-test filter cut-off values in a given experiment. <>= ### All features pval.by.fc(ql.tbl$adjp,ql.tbl$LogFC) ### Filtering by minimal signal fl <- ql.tbl$U600 > 2 pval.by.fc(ql.tbl$adjp[fl],ql.tbl$LogFC[fl]) @ Another usual tool is a volcanoplot with the ability to visualize the effect of different post-test filter cut-off values. \begin{figure}[H] \centering <>= par(mar=c(5,4,0.5,2)+0.1) res.volcanoplot(ql.tbl,max.pval=0.05,min.LFC=1,maxx=3,maxy=NULL, ylbls=3) @ \caption{Volcanoplot}\label{volcano} \end{figure} \newpage \begin{thebibliography}{11} \bibitem{shi2010} Shi, L. et al. \emph{MAQC Consortium: The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models}. Nat Biotech 2010, 28, 827-838. \bibitem{shi2008} Shi, L. et al. \emph{The balance of reproducibility, sensitivity, and specificity of lists of differentially expressed genes in microarray studies}. BMC Bioinformatics 2008, 9 Suppl 9, S10. \bibitem{luo2010} Luo, J. et al. \emph{A comparison of batch effect removal methods for enhancement of prediction performance using MAQC-II microarray gene expression data}. The Pharmacogenomics Journal 2010, 10, 278-291. \bibitem{mallik2010} Mallick P., Kuster B. \emph{Proteomics: a pragmatic perspective.} Nat Biotechnol 2010;28:695-709. \bibitem{neilson2011} Neilson K.A., Ali N.A., Muralidharan S., Mirzaei M., Mariani M., Assadourian G., et al. \emph{Less label, more free: approaches in label-free quantitative mass spectrometry.} Proteomics 2011;11:535-53. \bibitem{gregori2013} Gregori J., Villareal L., Sanchez A., Baselga J., Villanueva J., \emph{An Effect Size Filter Improves the Reproducibility in Spectral Counting-based Comparative Proteomics}. Journal of Proteomics 2013, http://dx.doi.org/10.1016/j.jprot.2013.05.030 \bibitem{Gatto2012} Laurent Gatto and Kathryn S. Lilley, MSnbase - an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation, Bioinformatics 28(2), 288-289 (2012). \bibitem{chambers} Chambers J.M. \emph{Software for data analysis: programming with R}, 2008 Springer \bibitem{genolini} Genolini C. \emph{A (Not So) Short Introduction to S4} (2008) \bibitem{gentleman} Falcon S., Morgan M., Gentleman R. \emph{An Introduction to Bioconductor's ExpressionSet Class} (2007) \bibitem{agresti2002} Agresti A., \emph{Categorical Data Analysis}, Wiley-Interscience, Hoboken NJ, 2002 \bibitem{robinson2010} Robinson MD, McCarthy DJ and Smyth GK (2010). \emph{edgeR: a Bioconductor package for differential expression analysis of digital gene expression data}. Bioinformatics 26, 139-140 \bibitem{li2010} Li, M.; Gray, W.; Zhang, H.; Chung, C. H.; Billheimer, D.; Yarbrough, W. G.; Liebler, D. C.; Shyr, Y.; Slebos, R. J. C. \emph{Comparative shotgun proteomics using spectral count data and quasi-likelihood modeling}. J Proteome Res 2010, 9, 4295-4305 \bibitem{josep2012} Gregori J., Villareal L., Mendez O., Sanchez A., Baselga J., Villanueva J., \emph{Batch effects correction improves the sensitivity of significance tests in spectral counting-based comparative discovery proteomics}, Journal of Proteomics, 2012, 75, 3938-3951 \bibitem{BH} Benjamini, Y., and Hochberg, Y. (1995). \emph{Controlling the false discovery rate: a practical and powerful approach to multiple testing}. Journal of the Royal Statistical Society Series B, 57, 289-300. \end{thebibliography} \end{document}