%\VignetteIndexEntry{Running gene set enrichment analysis with the "npGSEA" package} %\VignettePackage{npGSEA} \documentclass[12pt]{article} <>= BiocStyle::latex() @ \usepackage{cite} \usepackage{Sweave} \SweaveOpts{keep.source=TRUE,eps=FALSE,pdf=FALSE,png=TRUE,include=FALSE,width=4,height=4.5,resolution=150} \setkeys{Gin}{width=0.5\textwidth} % use a vertical rule for R output instead of prompts for R input \usepackage{fancyvrb} \definecolor{darkgray}{gray}{0.2} \DefineVerbatimEnvironment{Sinput}{Verbatim}{xleftmargin=1em,formatcom={\color{darkgray}}} \DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=1em,frame=leftline,framerule=.6pt,rulecolor=\color{darkgray},framesep=1em,formatcom={\color{darkgray}}} \fvset{listparameters={\setlength{\topsep}{0pt}}} \renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}} \author{Jessica L. Larson$^{1*}$ and Art Owen$^{2}$ \\[1em] \small{$^{1}$ Department of Bioinformatics and Computational Biology, Genentech, Inc.} \\ \small{$^{2}$ Department of Statistics, Stanford University} \\ \small{\texttt{$^*$larson.jessica (at) gene.com}}} \title{Moment based gene set enrichment testing -- the npGSEA package} \date{\today} \begin{document} \maketitle \tableofcontents \newpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction } %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Gene set methods are critical to the analysis of gene expression data. The \verb@npGSEA@ package provides methods to run permutation-based gene set enrichment analyses without the typically computationally expensive permutation cost. These methods allow users to adjust for covariates and approximate corresponding permutation distributions. We are currently evaluating the applicability and accuracy of our method for RNA-seq expression data. Our methods find the exact relevant moments of a weighted sum of (squared) test statistics under permutation, taking into account correlations among the test statistics. We find moment-based gene set enrichment $p$-values that closely approximate the permutation method $p$-values. This vignette describes a typical analysis workflow and includes some information about the statistical theory behind \verb@npGSEA@. For more technical details, please see Larson and Owen, 2015 \nocite{larson:2015}. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Example workflow for GSEA} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Preparing our gene sets and our dataset for analysis} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% For our example, we will use the \verb@ALL@ dataset. We begin by loading relevant libraries, subsetting the data, and running \verb@featureFilter@ on this data set. For details on these methods, please see the \verb@limma@ manual. <>= library(ALL) library(hgu95av2.db) library(genefilter) library(limma) library(GSEABase) library(npGSEA) data(ALL) ALL <- ALL[, ALL$mol.biol %in% c('NEG','BCR/ABL') & !is.na(ALL$sex)] ALL$mol.biol <- factor(ALL$mol.biol, levels = c('NEG', 'BCR/ABL')) ALL <- featureFilter(ALL) @ We adjust the feature names of the \verb@ALL@ dataset so that they match the names of our gene sets below. We convert them to entrez ids. <>= featureNames(ALL) <- select(hgu95av2.db, featureNames(ALL), "ENTREZID", "PROBEID")$ENTREZID @ We now make four arbitrary gene sets by randomly selecting from the genes in our universe. <>= xData <- exprs(ALL) geneEids <- rownames(xData) set.seed(12345) set1 <- GeneSet(geneIds=sample(geneEids,15, replace=FALSE), setName="set1", shortDescription="This is set1") set2 <- GeneSet(geneIds=sample(geneEids,50, replace=FALSE), setName="set2", shortDescription="This is set2") set3 <- GeneSet(geneIds=sample(geneEids,100, replace=FALSE), setName="set3", shortDescription="This is set3") set4 <- GeneSet(geneIds=sample(geneEids,500, replace=FALSE), setName="set4", shortDescription="This is set4") @ As a positive control, we also make three gene sets that include our top differentially expressed genes. <>= model <- model.matrix(~mol.biol, ALL) fit <- eBayes(lmFit(ALL, model)) tt <- topTable(fit, coef=2, n=200) ttUp <- tt[which(tt$logFC >0), ] ttDown <- tt[which(tt$logFC <0), ] set5 <- GeneSet(geneIds=rownames(ttUp)[1:20], setName="set5", shortDescription="This is a true set of the top 20 DE genes with a positive fold change") set6 <- GeneSet(geneIds=rownames(ttDown)[1:20], setName="set6", shortDescription="This is a true set of the top 20 DE genes with a negative fold change") set7 <- GeneSet(geneIds=c(rownames(ttUp)[1:10], rownames(ttDown)[1:10]), setName="set7", shortDescription="This is a true set of the top 10 DE genes with a positive and a negative fold change") @ We then collapse all of our gene sets into a \verb@GeneSetCollection@. For more information on \verb@GeneSets@ and \verb@GeneSetCollections@, see the \verb@GSEABase@ manual. <>= gsc <- GeneSetCollection( c(set1, set2, set3, set4, set5, set6, set7) ) gsc @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Running npGSEA} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Now that we have both our gene sets and experiment, we are ready to run \verb@npGSEA@ and determine the level of enrichment in our experiment. We can use \verb@npGSEA@ with our \verb@eset@ or expression data (\verb@xData@) directly. We call \verb@npGSEASummary@ to get a summary of the results. \verb@T_Gw@ is explained in more detail in Section 3.2. <>= yFactor <- ALL$mol.biol res1 <- npGSEA(x = ALL, y = yFactor, set = set1) ##with the eset res1 res2_exprs <- npGSEA(xData, ALL$mol.biol, gsc[[2]]) ##with the expression data res2_exprs @ \verb@npGSEA@ has several built in accessor functions to gather more information about the analysis of your set of interest in your experiment. <>= res3 <- npGSEA(ALL, yFactor, set3) res3 geneSetName(res3) stat(res3) sigmaSq(res3) zStat(res3) pTwoSided(res3) pLeft(res3) pValues(res3) dim(xSet(res3)) @ There is also a \verb@npGSEA@ specific plot function (\verb@npGSEAPlot@) to visualize the results of your analysis. Highlighted in red on the plot is the corresponding \verb@zStat@ of our analysis. <>= npGSEAPlot(res3) @ \begin{figure} \centering \includegraphics{npGSEA-plot3} \caption{ \textbf{Set3 normal approximation results} This plot displays the standard normal curve and our observed zStat for set3 in this analysis. } \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Running npGSEA with the beta and chi-sq approximations} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% There are three types of approximation methods in \verb@npGSEA@: "norm", "beta", and "chiSq". Each method is discussed in brief in Section 3. The "norm" approximation method is the default. Note that each of these methods has the same $\hat\beta_{g}$ (see methods section). <>= res5_norm <- npGSEA(ALL, yFactor, set5, approx= "norm") res5_norm betaHats(res5_norm) npGSEAPlot(res5_norm) @ The beta approximation yields results quite similar to the normal approximation. <>= res5_beta <- npGSEA(ALL, yFactor, set5, approx= "beta") res5_beta betaHats(res5_beta) npGSEAPlot(res5_beta) @ The chi-sq approximation method is only available for the two-sided test. Here we call \verb@npGSEA@ and then show how the \verb@chiSqStat@ is related to \verb@C_Gw@. \verb@C_Gw@ is explained in more detail in Section 3.2. <>= res5_chiSq <- npGSEA(ALL, yFactor, set5, approx= "chiSq") res5_chiSq betaHats(res5_chiSq) chiSqStat(res5_chiSq) stat(res5_chiSq) stat(res5_chiSq)/sigmaSq(res5_chiSq) npGSEAPlot(res5_chiSq) @ Note that, as we expected, \verb@set5@ is a significantly enriched in all three methods. In each of the three corresponding plots, the observed statistic is a very rare event. \begin{figure} \centering \includegraphics[width=.3\textwidth]{npGSEA-runItNorm} \includegraphics[width=.3\textwidth]{npGSEA-runItBeta} \includegraphics[width=.3\textwidth]{npGSEA-runItChiSq} \caption{ \textbf{Set5 normal, beta, and chi-sq approximation results} These plots displays the reference normal, beta, and chi-sq curves, and our observed zStat, betaStat, and chiSqStat for set5 in this analysis. } \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Adding weights to the model} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Sometimes we do not want to weigh each gene in our set equally. We want to assign a larger weight to genes that are of a particular interest, and a lower weight to genes that we know may behave poorly. In this example, we weight the genes in \verb@set7@ by their variance. <>= res7_nowts <- npGSEA(x = ALL, y= yFactor, set = set7) res7_nowts wts <- apply(exprs(ALL)[match(geneIds(set7), featureNames(ALL)), ], 1, var) wts <- 1/wts res7_wts <- npGSEA(x = ALL, y = yFactor, set = set7, w = wts, approx= "norm") res7_wts @ By adding these weights, we get a slightly more significant result. We can add weights for the beta and chi-sq approximations, too. By default, \verb@npGSEA@ assigns a weight of 1 for all genes. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Adding covariates to model} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Often we want to correct for confounders in our model. To do this with \verb@npGSEA@, we provide a vector or matrix in the \verb@covars@ slot of our function. \verb@npGSEA@ then projects both the data (\verb@x@) and the outcome of interest (\verb@y@) against our covariate matrix/vector. The resulting residuals are used for further analysis. In this example, we correct for the age and sex of the subjects in our experiment. For more details on model selection and its relation to inference, please see the \verb@limma@ manual. <>= res3_age <- npGSEA(x = ALL, y = yFactor, set = set3, covars = ALL$age) res3_age res3_agesex <- npGSEA(x = ALL, y = yFactor, set = set3, covars = cbind(ALL$age, ALL$sex)) res3_agesex @ By adjusting for these variables, we get a slight different result than above. Note that we can adjust for covariates in the beta and chi-sq approximation methods, too. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Running npGSEA with multiple gene sets} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To explore multiple gene sets, we let \verb@set@ be a \verb@GeneSetCollection@. This returns a list of \verb@npGSEAResultNorm@ objects, called a \verb@npGSEAResultNormCollection@. We can access statistics for each \verb@GeneSet@ in our analysis through accessors of \verb@npGSEAResultNormCollection@. <>= resgsc_norm <- npGSEA(x = ALL, y = yFactor, set = gsc) unlist( pLeft(resgsc_norm) ) unlist( stat (resgsc_norm) ) unlist( zStat (resgsc_norm) ) @ Note how quick our method is. We get results as accurate as permutation methods in a fraction of the time, even for multiple gene sets. Using the \verb@ReportingTools@ package, we can publish these results to a HTML page for exploration. We first adjust for multiple testing. <>= pvals <- p.adjust( unlist(pTwoSided(resgsc_norm)), method= "BH" ) library(ReportingTools) npgseaReport <- HTMLReport (shortName = "npGSEA", title = "npGSEA Results", reportDirectory = "./reports") publish(gsc, npgseaReport, annotation.db = "org.Hs.eg", setStats = unlist(zStat (resgsc_norm)), setPValues = pvals) finish(npgseaReport) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Methods in brief} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Disadvantages to a permutation approach} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% There are three main disadvantages to permutation-based analyses: cost, randomness, and granularity. Testing many sets of genes becomes computationally expensive for two reasons. First, there are many test statistics to calculate in each permuted version of the data. Second, to allow for multiplicity adjustment, we require small nominal $p$-values to draw inference about our sets, which in turn requires a large number of permutations. Permutations are also subject random inference. Because permutations are based on a random shuffling of the data, there is a chance that we will obtain a different $p$-value for our set of interest each time we run our permutation analysis. Permutations also have a granularity problem. If we do $M$ permutations, then the smallest possible $p$-value we can attain is $1/(M+1)$. When it is necessary to adjust for multiplicity, the permutation approach becomes very computationally expensive. Another aspect of the granularity problem is that permutations give us no basis to distinguish between two gene sets that both have the same $p$-value $1/(M+1)$. There may be many such gene sets, and they have meaningfully different effect sizes. Because of each of these limitations of permutation testing, there is a need to move beyond permutation-based GSEA methods. The methods we present in \verb@npGSEA@ and discuss in brief below are not as computationally expensive, random, or granular than their permutation counterparts. More details on our method can be found in Larson and Owen (2015). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Test statistics} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% We present our notation using the language of gene expression experiments. Let $g$ and $h$ denote individual genes and $G$ be a set of genes. Our experiment has $n$ subjects. The subjects may represent patients, cell cultures, or tissue samples. The expression level for gene $g$ in subject $i$ is $X_{gi}$, and $Y_{i}$ is the target variable on subject $i$. $Y_{i}$ is often a treatment, disease, or genotype. We center the variables so that $\sum_{i=1}^nY_{i} = \sum_{i=1}^n X_{gi} = 0, \forall g$. Our measure of association for gene $g$ on our treatment of interest is $$\hat\beta_{g} = \frac1{n}\sum_{i=1}^nX_{gi}Y_{i}.$$ We consider the linear statistic $$ T_{G,w} = \sum_{g\in G}w_{g}\hat\beta_{g}$$ and the quadratic statistic $$C_{G,w} = \sum_{g\in G}w_{g}\hat\beta_{g}^2,$$ where $w_{g}$ corresponds to the weight given to gene $g$ in set $G$. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \subsection{Moment based reference distributions} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% To avoid the issues discussed above, we approximate the distribution of the permuted test statistics $T_{G,w}$ by Gaussian or by rescaled beta distributions. For the quadratic statistic $C_{G,w}$ we use a distribution of the form $\sigma^2\chi^2_{(\nu)}$. For the Gaussian treatment of $T_{G,w}$ we calculate $\sigma^2 = Var(T_{G,w})$ under permutation, and then report the $p$-value $$ p = Pr( N( 0, \sigma^2 ) \le T_{G,w}). $$ The above is a left tail $p$-value. Two-sided and right tailed $p$-values are analogous. When we want something sharper than the normal distribution, we can use a scaled Beta distribution, of the form $A + (B-A)Beta(\alpha,\beta)$. The $Beta(\alpha,\beta)$ distribution has a continuous density function on $00$. We choose $A$, $B$, $\alpha$ and $\beta$ by matching the upper and lower limits of $T_{G,w}$ under permutation, as well as its mean and variance. The observed left tailed $p$-value is $$ p = Pr\Bigl( Beta(\alpha,\beta) \le \frac{T_{G,w}-A}{B-A}\Bigr). $$ For the quadratic test statistic $C_{G,w}$ we use a $\sigma^2\chi^2_{(\nu)}$ reference distribution reporting the $p$-value $$\Pr( \sigma^2\chi^2_{(\nu)}\ge C_{G,w}),$$ after matching the first and second moments of $\sigma^2\chi^2_{(\nu)}$ to $E(C_{G,w})$ and $E( C_{G,w}^2)$ under permutation, respectively. Additional details on how $\sigma^2$, $A$, $B$, $\alpha$, $\beta$, $E(C_{G,w})$, $E( C_{G,w}^2)$, and $\nu$ are derived can be found in Larson and Owen (2015). %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Session Info} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% <>= toLatex(sessionInfo()) @ <>= options(prompt="> ", continue="+ ") @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{References} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Larson and Owen. (2015). Moment based gene set tests. \emph{BMC Bioinformatics.} \textbf{16}:132. \url{http://www.biomedcentral.com/1471-2105/16/132} \end{document}