--- title: Simulating and cleaning gene expression data using RUVcorr in the context of inferring gene co-expression output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` The R package `RUVcorr` allows the simulation and cleaning of gene expression data using global removal of unwanted variation (RUV) when the aim is the inference of gene co-expression. Besides the RUV procedure, the package offers extensive plotting options related to its application and can simulate realistic gene expression data with a known gene correlation structure. Although the procedures in the `RUVcorr` package have so far only been applied to microarray gene expression data it should be feasible to apply it to RNA-seq data as well, as long as suitable read-count summaries have been generated and the coverage is sufficient, however this remains untested. For more information on the methodology follow this [link](https://link.springer.com/article/10.1186/s12859-015-0745-3). Loading `RUVcorr` is achieved with the command: ```{r} library(RUVcorr) ``` # Simulating gene expression data with a known gene correlation structure The simulation of gene expression data relies on the linear model framework introduced by [Gagnon-Bartsch and Speed](https://academic.oup.com/biostatistics/article-abstract/13/3/539/248166). Briefly, they assume that any gene expression measurement can be expressed as a linear combination of biological signal $X\beta$, systematic noise $W\alpha$, and random noise $\epsilon$ (typically assumed to be iid normally distributed). \begin{equation} Y=X\beta+W\alpha+\epsilon \end{equation} * $Y$ is a $m \times n$ matrix of observed gene expression data * $X$ is a $m \times p$ matrix containing the unobserved factors of interest * $\beta$ is a $p \times n$ matrix of regression coefficients associated with $X$ * $W$ is a $m \times k$ matrix of unobserved covariates introducing systematic noise * $\alpha$ is a $k \times n$ matrix of regression coefficients associated with $W$ * $\epsilon$ is a $m \times n$ matrix of random noise In the context of this model and for the purposes of simulating gene expression data with a known gene correlation structure, the true underlying gene structure is assumed to be $\Sigma=Cor(X\beta)$. The size of the absolute value of the correlations can be somewhat controlled using the dimensionality of $X$ and $\beta$, $p$. When $p$ is increased the size of the absolute value of the correlations in $\Sigma$ is reduced. Note that some genes (negative controls) are unaffected by this, as their correlation with each other as well as other genes is defined to be 0. Negative control genes are genes that are believed to be unrelated to the factor of interest. ## Independence of biological signal and systematic noise The simplest simulation of gene expression data assumes that the biological signal and the systematic noise are uncorrelated with each other. So $X$ is simulated in a fashion that it renders it independent from $W$. After simulating the data, the `print` command allows you to get a useful overview of the simulated data as well as some meta data. ```{r} set.seed(400) Yind <- simulateGEdata(n=3000, m=1000, k=10, size.alpha=2, corr.strength=5, g=NULL, Sigma.eps=0.1, nc=2000, ne=1000, intercept=TRUE, check=TRUE) print(Yind) ``` Note that the parameter `corr.strength` refers to $p$. The parameters `nc` and `ne` refer to the number of negative control genes and truly expressed genes (i.e. with a mean true gene expression greater than 0.) The parameter `intercept` controls whether $W$ contains an offset or not. #Dependence of biological signal and systematic noise It is more realistic to assume that there is some dependence between $X$ and $W$. Using the parameter `g` ($0100$ arrays) gene expression data sets where visualisation of individual arrays becomes impractical. The option displayed here shows the boxplots for the 1st and 3rd quantile of the difference between the gene expression and the study median for all samples. ```{r message=FALSE, warning=FALSE, fig.height=7, fig.width=7} lapply(1:4, function(i) RLEPlot(expr, expr_AllRUV[[4+i]], name=c("Raw", "RUV"), title=paste("nu=", nu[i]), method="IQR.boxplots")) ``` The figure shows RLE plots comparing different options of $\nu$ for $\hat{k}=3$. The boxplots summarize the 25\% and 75\% quantile of all samples. The red boxplots display the raw data, while the black boxplots refer to the RUV applied with $\hat{k}=2$ and $\nu$ as in the title of the panel. A parameter choice of $\nu=500$ seems to offer the best choice. In order to check whether the selected parameter at least removes all the known sources of variation, there is yet another version of the RLE plot. Here we plot the median and the inter-quantile-range (IQR) of the difference between the gene expression and the study median for all samples. Furthermore, it is useful to color these plots according to a known source of unwanted variation, such as batches. ```{r message=FALSE, warning=FALSE, include=FALSE, fig.height=5, fig.width=5} par(mfrow=c(1,1)) RLEPlot(expr, expr_AllRUV[[6]], name=c("Raw", "RUV"), title="Batches", method="IQR.points", anno=expr.meta, Factor="batch", numeric=TRUE) ``` The fugure shows RLE plots for data cleaned with RUV using $\nu=500$ for $\hat{k}=2$. Every sample is represented by the median and inter-quantile-range of the difference between observed gene expressions and study mean. The samples are colored according to their batches. Principal component plots (`PCAPlot`) provide a similar way of assessing parameter choices for RUV. The figure demonstrates that at least most of the systematic noise introduced via the batch effect has been removed. Hence, it is now possible to examine gene-gene correlations, construct gene networks or else using this new dataset. ```{r} CleanData <- expr_AllRUV[[6]] ``` # Gene prioritisation One of the methods that can be applied given a cleaned version of the dataset is gene prioritisation. Gene prioritisation identifies candidate genes that are likely to be involved in the same biological pathways or related pathways than a set of known genes. The gene prioritisation method in this package is very similar to the approach described in the paper by [Oliver et al.](https://ng.neurology.org/content/2/1/e51.short) For demonstration purposes assume that the following genes involved in the synaptic vesicle cycle are in fact candidates: ```{r} cand_genes <- c("CACNA1A", "CACNA1B", "SNAP25", "STX1A") cand_affy <- names(which(unlist(lapply(xx, function(x) is.element(x, cand_genes)[1])))) cand_index <- which(is.element(colnames(CleanData),cand_affy)) ``` ## Finding the correlation threshold of significant co-expression In order to prioritise genes, typically a correlation threshold is determined. The absolute values of correlations between genes that exceed this threshold are considered to be truly co-expressed. Here, we use a threshold that corresponds to a proportion of prioritised random genes of 0.3. However, this requires extensive estimation for all possible thresholds. This can be achieved using the function `calculateThreshold`: ```{r fig.height=5, fig.width=5} Prop <- calculateThreshold(CleanData, exclude=c(nc_index, cand_index), index.ref=na_index, set.size=length(cand_index), Weights=NULL) threshold <- predict(Prop$loess.estimate, 0.3) threshold ``` It is important to exclude genes that could bias the estimation of the proportion of prioritised genes. ## Prioritising candidate genes Having determined the threshold we can use the function `prioritise` in order to establish which candidates are also likely to be involved in the sodium-channel: ```{r} prior<-prioritise(CleanData, na_index, cand_index, Weight=NULL, threshold=threshold) print(prior) xx[which(is.element(names(xx), prior[,1]))] ``` This analysis prioritises SNAP25 and CACNA1A.