%\VignetteIndexEntry{Gene Selection based on a mixture of marginal distributions} %\VignetteDepends{GeneSelectMMD} %\VignetteKeywords{Microarray experiment} %\VignettePackage{GeneSelectMMD} \documentclass[10pt]{article} \usepackage{url} \usepackage{amsmath} \usepackage{natbib} %\usepackage{isorot} \usepackage{fullpage} % standard 1 inch margins \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\textit{#1}}} \newcommand{\code}[1]{\texttt{#1}} \newcommand{\email}[1]{\texttt{#1}} \newcommand{\mb}[1]{{\boldsymbol{#1}}} \newcommand\myC[1]{{\cal #1}\,} \def\lnit{{\rm lnit}\,} \def\ln{{\rm ln}\,} \def\N{{\rm N}\,} \def\E{{\rm E}\,} \def\vec{{\rm vec}\,} \def\vech{{\rm vech}\,} \def\Var{{\rm Var}\,} \def\Cov{{\rm Cov}\,} \def\wCov{\widehat{\rm Cov}\,} \def\Q{\mb{Q}} \def\I{\mb{I}} \def\V{\mb{V}} \def\B{\mb{B}} \def\D{\mb{D}} \def\C{\mb{C}} \def\W{\mb{W}} \def\A{\mb{A}} \def\H{\mb{H}} \def\Y{{\bf Y}} \def\y{{\bf y}} \def\Z{\mb{Z}} \def\thbf{\mb{\theta}} \def\mbS{\mb{\Sigma}} \def\mbmu{\mb{\mu}} \def\mbI{\mb{I}} \def\hmbS{\hat{\mb{\Sigma}}} \def\mbT{\mb{\Theta}} \def\hmbT{\hat{\mb{\Theta}}} \def\mbw{\mb{w}} \newcommand{\oo}{\infty} \def\diam{\,{\rm diam}\,} \def\chs{\,{\rm chs}\,} \def\join{\,{\rm join}\,} \def\x{{\bf x}} \def\X{{\bf X}} \def\mbx{{\bf x}} \def\mbX{{\bf X}} \def\diag{{\rm diag}} %%% Hyperlinks for ``PDF Latex'' : \ifx\pdfoutput\undefined%%--- usual ``latex'' : %% Stuff w/out hyperref \else%%---------------------- `` pdflatex '' : -- still gives funny errors \RequirePackage{hyperref} %% The following is R's share/texmf/hyperref.cfg : %% Stuff __with__ hyperref : \hypersetup{% %default: hyperindex,% colorlinks,% %default: pagebackref,% linktocpage,% %%plainpages=false,% linkcolor=Green,% citecolor=Blue,% urlcolor=Red,% pdfstartview=Fit,% pdfview={XYZ null null null}% } \RequirePackage{color} \definecolor{Blue}{rgb}{0,0,0.8} \definecolor{Green}{rgb}{0.1,0.75,0.1} \definecolor{Red}{rgb}{0.7,0,0} %% ESS JCGS v2 : %%\hypersetup{backref,colorlinks=true,pagebackref=true,hyperindex=true} %%\hypersetup{backref,colorlinks=false,pagebackref=true,hyperindex=true} \fi \usepackage{Sweave} \begin{document} \title{Gene Selection Using \Rpackage{GeneSelectMMD}} \author{ Jarrett Morrow\\ \email{remdj@channing.harvard.edu}, \\ Weilianq Qiu\\ \email{stwxq@channing.harvard.edu}, \\ Wenqing He\\ \email{whe@stats.uwo.ca}, \\ Xiaogang Wang\\ \email{stevenw@mathstat.yorku.ca}, \\ Ross Lazarus\\ \email{ross.lazarus@gmail.com}, } \maketitle %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Introduction} This document demonstrates how to use the \Rpackage{GeneSelectMMD} package to detect significant genes and to estimate false discovery rate (FDR), false non-discovery rate (FNDR), false positive rate (FPR), and false negative rate (FNR), for a real microarray data set based on the method proposed by Qiu et al. (2008).\nocite{QiuEtAl:2008} It also illustrates how to visualize the fit of the model proposed in Qiu et al. (2008) to the real microarray data set when the marginal correlations among subjects are zero or close to zero. The \Rpackage{GeneSelectMMD} package is suitable for the case where there are only two tissue types and for the case where tissue samples within a tissue type are conditionally independent given the gene (c.f. Qiu et al., 2008). Note that starting from version 1.7.1, we reparameterize the model parameters. The input and output are the same as the previous versions. To improve speed, we use numerical optimization, instead of iteration based on explicit formula, to obtain parameter estimates. Since verions 2.0.7, the speed has also been improved by using Fortran 77 code for some of the less efficient operations. Lastly, in version 2.0.8 onward, the likelihood values provided to the user are based on the observed data, while in previous versions the likelihood was an expected value. %\clearpage %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Methods} MMD assumes that the marginal distribution of a gene profile is a mixture of 3-component multivariate normal distributions with special structure for mean vectors and covariance matrices. The 3 component distributions correspond to 3 gene clusters: (1) cluster of genes over-expressed in the case group; (2) cluster of genes non-differentially expressed; and (3) cluster of genes under-expressed in the case group. Specifically, the marginal density of a gene profile is assumed to be \begin{equation}\label{Model: mixture of 3-multivariate Normal} \begin{aligned} &f(\mb{x}|\thbf_1, \thbf_2, \thbf_3)=\pi_1 f_1(\mb{x}|\thbf_1)+\pi_2 f_2(\mb{x}|\thbf_2)+ \pi_3 f_3(\mb{x}|\thbf_3),\\ &\pi_1+\pi_2+\pi_3=1, \pi_i>0, i=1, 2, 3, \end{aligned} \end{equation} where $\pi_1$, $\pi_2$, $\pi_3$ are mixture proportions. The $m\times 1$ vector $\mb{x}$ is a realization of the random vector $\mb{X}$ that represents the transformed gene profile for a randomly selected gene over $m$ tissue samples ($m=m_c+m_n$, where $m_c$ is the number of case tissue samples and $m_n$ control tissue samples); $\thbf_k$, is the parameter set for the $k$-th component distribution $f_k$, $k=1, 2, 3$; and $f_1$, $f_2$, and $f_3$ are the density functions for multivariate Normal distributions with the mean vectors \begin{equation}\label{Formulae: mean vector structure} \mb{\mu}_1=\left(\begin{array}{c} \mu_{c1}\mb{1}_{m_c}\\ \mu_{n1}\mb{1}_{m_n} \end{array}\right),\quad \mb{\mu}_2=\mu_2\mb{1}_{m},\quad \mb{\mu}_3=\left(\begin{array}{c} \mu_{c3}\mb{1}_{m_c}\\ \mu_{n3}\mb{1}_{m_n} \end{array}\right). \end{equation} and covariance matrices \begin{equation}\label{Formulae: covariance matrix structure} \mb{\Sigma}_1=\left( \begin{array}{cc} \sigma^2_{c_1}\mb{R}_{c_1}&\mb{0}\\ \mb{0}& \sigma^2_{n_1}\mb{R}_{n_1} \end{array} \right),\quad \mb{\Sigma}_2=\sigma_2^2\mb{R}_{2},\quad \mb{\Sigma}_3=\left( \begin{array}{cc} \sigma^2_{c_3}\mb{R}_{c_3}&\mb{0}\\ \mb{0}& \sigma^2_{n_3}\mb{R}_{n_3} \end{array} \right), \end{equation} respectively, where correlation matrix \begin{equation}\label{Formula: Rt} \mb{R}_{t}=(1-\rho_{t})\left[\mb{I}_{n_t}+\frac{\rho_{t}}{(1-\rho_{t})}\mb{1}_{n_t}\mb{1}_{n_t}^T\right], \end{equation} $t=c_1$, $n_1$, $2$, $c_3$, or $n_3$. $n_t=m_c$ if $t=c_1$ or $c_3$; $n_t=m$ if $t=2$; $n_t=m_n$ if $t=n_1$, or $n_3$. Here we assume, without loss of generality, that the first $m_c$ elements of the random vector $\X$ are for the case tissue samples and the remaining $m_n$ elements are for the control tissue samples. Let $\thbf_1=(\mu_{c_1}, \sigma^2_{c_1}, \rho_{c_1}, \mu_{n_1}, \sigma^2_{n_1}, \rho_{n_1})^T$, $\thbf_2=(\mu_2, \sigma^2_2, \rho_{2})^T$, $\thbf_3=(\mu_{c_3}, \sigma^2_{c_3}, \rho_{c_3}, \mu_{n_3}, \sigma^2_{n_3}, \rho_{n_3})^T$. Note that $\mu_{c_1}>\mu_{n_1}$ for component $1$ in which genes are over-expressed in case tissue samples, and $\mu_{c_3}<\mu_{n_3}$ for component 3 where genes are under-expressed in case samples. Our prior belief is that the majority of genes are usually non-differentially expressed, so we assume $\pi_2>\pi_1$ and $\pi_2>\pi_3$. The model parameters can be estimated using the EM algorithm (Dempster et al., 1977).\nocite{DempsterEtAl:1977} The cluster membership of a gene is determined by the posterior probability that the gene belongs to a cluster given its gene profile. The posterior probability is a function of the 3 component marginal density functions and the mixing proportions: \begin{equation} \begin{aligned} Pr(\mbox{gene $i \in$ cluster $k$}|\mb{x}_i)=&\frac{\pi_k f_k(\mb{x}_i|\thbf_k)}{\pi_1 f_1(\mb{x}_i|\thbf_1) +\pi_2 f_2(\mb{x}_i|\thbf_2)+\pi_3 f_3(\mb{x}_i|\thbf_3)},\\ k=&1, 2, 3, \end{aligned} \end{equation} Specifically, a gene is assigned to the a gene cluster if the posterior probability that the gene belongs to that gene cluster given its gene profile is larger than the posterior probability that the gene belongs to other gene clusters given its gene profile: \begin{equation}\label{Formula: criterion for gene selection} k_0=\arg\max_{k=1,2,3}Pr(\mbox{gene $i \in$ cluster $k$}|\mb{x}_i). \end{equation} An important task is to assess the performance of a gene selection method so that different methods can be objectively compared. To evaluate the performance of a gene selection method, investigators usually compare the error rates, such as FDR, FNDR, FPR, and FNR, via simulation studies. However, when analyzing a real microarray data set, investigators are more interested in what the values of FDR, FNDR, FPR, and FNR are for this specific real microarray set. It is challenging to estimate FDR, FNDR, FPR, and FNR for real microarray data sets since the true gene cluster membership is unknown for real data sets. However, model-based gene clustering methods, such as Bayesian hierarchical models and MMD, can provide such estimates since these error rates can be expressed as functions of marginal density functions and mixing proportions, where the model parameters and mixing proportions can be estimated from the real microarray data. It is easy to use MMD to estimate the four error rates since MMD describes the distributions of gene expression levels directly via the marginal distributions, while it is usually difficult to derive the marginal density functions for Bayesian hierarchical models. It is of practical importance to evaluate if a model fits a real microarray data set well. If a model does not fit well for the real microarray data set, then it makes no sense to estimate the error rates based on the model. Although it is quite challenging to asses the goodness of fit for multivariate data, especially for non-normal multivariate data, it is possible to do so for some special cases. For example, when tissue samples are marginally independent, we could pool the gene expression levels across tissue samples for each type of tissue samples since they are all independent. We then could impose the theoretical density curve on the histogram of the pooled expression levels for each type of tissue samples. The parameters $\rho_{c1}$, $\rho_{n1}$, $\rho_{2}$, $\rho_{c3}$, and $\rho_{n3}$ indicates the marginal correlations among tissue samples. Assuming that the marginal correlations are ignorable, we then could produce such a plot to evaluate the goodness of fit of MMD to the real microarray data set. \section{Reparametrization} To guarantee the constraints $\mu_{c_1}>\mu_{n_1}$ and $\mu_{c_3}<\mu_{n_3}$ in the iteration of the EM algorithm, we re-parameterize the model parameters: \begin{equation*} \begin{aligned} \mu_{n_1}=&\mu_{c_1}-\exp(\triangle_{n_1}),\\ \mu_{n_3}=&\mu_{c_3}+\exp(\triangle_{n_3}). \end{aligned} \end{equation*} To make sure the marginal covariance matrices are positive definite, we set the constraints: \begin{equation*} \begin{aligned} -\frac{1}{n_c-1}<&\rho_{c_1}<1\\ -\frac{1}{n_n-1}<&\rho_{n_1}<1\\ -\frac{1}{n-1}<&\rho_{2}<1\\ -\frac{1}{n_c-1}<&\rho_{c_3}<1\\ -\frac{1}{n_n-1}<&\rho_{n_3}<1 \end{aligned} \end{equation*} To make sure the above inequalities hold, we reparametrize $\rho$s as follows: \begin{equation*} \begin{aligned} \rho_{c_1}=&\frac{\exp(r_{c_1})-1/(n_c-1)}{1+\exp(r_{c_1})}\\ \rho_{n_1}=&\frac{\exp(r_{n_1})-1/(n_n-1)}{1+\exp(r_{n_1})}\\ \rho_{2}=&\frac{\exp(r_{2})-1/(n-1)}{1+\exp(r_{2})}\\ \rho_{c_3}=&\frac{\exp(r_{c_3})-1/(n_c-3)}{1+\exp(r_{c_3})}\\ \rho_{n_3}=&\frac{\exp(r_{n_3})-1/(n_n-3)}{1+\exp(r_{n_3})}. \end{aligned} \end{equation*} Another set of constraints are: \begin{equation*} \begin{aligned} 0<&\pi_k<1,\quad k=1, 2, 3,\\ \pi_3=&1-\pi_1-\pi_2. \end{aligned} \end{equation*} Then the reparametrized MMD model is \begin{equation} f(\mb{x})=\pi_1 f_1(\mb{x}|\mb{\theta}_1)+\pi_2f_2(\mb{x}|\mb{\theta}_2) +(1-\pi_1-\pi_2) f_3(\mb{x}|\mb{\theta}_3), \end{equation} where \begin{equation} \begin{aligned} \mb{\theta}_1=&(\mu_{c_1}, \sigma_{c_1}^2, r_{c_1}, \triangle_{n_1}, \sigma_{n_1}^2, r_{n_1})^T,\\ \mb{\theta}_2=&(\mu_2, \sigma_2^2, r_{2})^T,\\ \mb{\theta}_3=&(\mu_{c_3}, \sigma_{c_3}^2, r_{c_3}, \triangle_{n_3}, \sigma_{n_3}^2, r_{n_3})^T, \end{aligned} \end{equation} and $f_1$, $f_2$, and $f_3$ are the density functions for multivariate normal distributions with mean vectors \begin{equation} \mb{\mu}_1=\left( \begin{array}{c} \mu_{c_1}\mb{1}_{n_c}\\ \left[\mu_{c_1}-\exp(\triangle_{n_1})\right]\mb{1}_{n_n} \end{array} \right),\; \mb{\mu}_2=\mu_2\mb{1}_{n},\; \mb{\mu}_3=\left( \begin{array}{c} \mu_{c_3}\mb{1}_{n_c}\\ \left[\mu_{c_3}+\exp(\triangle_{n_3})\right]\mb{1}_{n_n} \end{array} \right) \end{equation} and covariance matrices \begin{equation} \begin{aligned} \mb{\Sigma}_1=&\left( \begin{array}{cc} \sigma^2_{c_1}\frac{n_c}{(n_c-1)(1+\exp(r_{c_1}))}\mb{R}_{0,c_1}&\mb{0}\\ \mb{0}&\sigma^2_{n_1}\frac{n_n}{(n_n-1)(1+\exp(r_{n_1}))}\mb{R}_{0,n_1} \end{array} \right),\\ \mb{\Sigma_2}=&\sigma^2_2\frac{n}{(n-1)(1+\exp(r_{2}))}\mb{R}_{0,2},\\ \mb{\Sigma}_3=&\left( \begin{array}{cc} \sigma^2_{c_3}\frac{n_c}{(n_c-1)(1+\exp(r_{c_3}))}\mb{R}_{0,c_3}&\mb{0}\\ \mb{0}&\sigma^2_{n_3}\frac{n_n}{(n_n-1)(1+\exp(r_{n_3}))}\mb{R}_{0,n_3} \end{array} \right). \end{aligned} \end{equation} The matrices $\mb{R}_{0, c_1}$, $\mb{R}_{0, n_1}$, $\mb{R}_{0, 2}$, $\mb{R}_{0, c_3}$, and $\mb{R}_{0, n_3}$ are defined as below: \begin{equation} \begin{aligned} \mb{R}_{0, c_1}=&\mb{I}_{n_c}+\frac{(n_c-1)\exp(r_{c_1})-1}{n_{c_1}}\mb{1}_{n_c}\mb{1}_{n_c}^T,\\ \mb{R}_{0, n_1}=&\mb{I}_{n_n}+\frac{(n_n-1)\exp(r_{n_1})-1}{n_{n_1}}\mb{1}_{n_n}\mb{1}_{n_n}^T,\\ \mb{R}_{0, 2}=&\mb{I}_{n}+\frac{(n-1)\exp(r_{2})-1}{n_{2}}\mb{1}_{n}\mb{1}_{n}^T,\\ \mb{R}_{0, c_3}=&\mb{I}_{n_c}+\frac{(n_c-1)\exp(r_{c_3})-1}{n_{c_3}}\mb{1}_{n_c}\mb{1}_{n_c}^T,\\ \mb{R}_{0, n_3}=&\mb{I}_{n_n}+\frac{(n_n-1)\exp(r_{n_3})-1}{n_{n_3}}\mb{1}_{n_n}\mb{1}_{n_n}^T. \end{aligned} \end{equation} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Gene selection via \Rpackage{GeneSelectMMD}} \label{Section: Gene Selection Functions} The \Rpackage{GeneSelectMMD} includes four functions for gene selection: \code{gsMMD}, \code{gsMMD.default}, \code{gsMMD2}, and \code{gsMMD2.default}. The functions \code{gsMMD} and \code{gsMMD2} accept the object derived from the class of \Rpackage{Bioconductor}'s \code{ExpressionSet} as data input, while the functions \code{gsMMD.default} and \code{gsMMD2.default} accept data matrix as data input. The functions \code{gsMMD} and \code{gsMMD.default} will provide initial 3-cluster gene partitions (cluster of genes over-expressed in case group, cluster of genes non-differentially expressed, and cluster of genes under-expressed in case group) based on the gene-wise two-sample t-test or two-sample Wilcoxon rank-sum test. In situations where the user would like to provide the initial 3-cluster partition other than that provided by the gene-wise two-sample t-test or two sample Wilcoxon rank-sum test, the functions \code{gsMMD2} and \code{gsMMD2.default} could be used. The rows of the input data matrix (or the data matrix derived from the object derived from the class \code{ExpressionSet}) are genes, while the columns are tissue samples. The tissue type of a tissue sample is indicated by the argument \code{memSubjects}, which is a $m\times 1$ vector, $m=m_{c}+m_{n}$, $m_{c}$ is the number of tissue samples in the case group, and $m_{n}$ is the number of tissue samples in the control group. \code{memSubjects[j]=1} indicates the $j$-th tissue sample is from the case group. \code{memSubjects[j]=0} indicates the $j$-th tissue sample is from the control group. The output of gene selection functions include \code{dat}, \code{memGenes}, \code{memGenes2}, and \code{para}. \begin{itemize} \item \code{dat} is a data matrix with the same dimensions as the input data matrix. If no data transformation is performed, \code{dat} is the same as the input data matrix. Otherwise, it will be the transformed input data matrix. \item \code{memGenes} is a $G\times 1$ vector indicating the gene cluster membership, where $G$ is the number of genes (i.e., the number of rows of the input data matrix). \code{memGenes[g]=1} indicates the $g$-th gene is assigned to the cluster of genes over-expressed in case group; \code{memGenes[g]=2} indicates the $g$-th gene is assigned to the cluster of genes non-differentially-expressed; \code{memGenes[g]=3} indicates the $g$-th gene is assigned to the cluster of genes under-expressed in case group. \item \code{memGenes2} is a variant of \code{memGenes}. \code{memGenes2[g]=1} means the $g$-th gene is differentially expressed, while \code{memGenes2[g]=0} means the $g$-th gene is non-differentially expressed, $g=1,\ldots, G$. \item \code{para} is a $18\times 1$ vector of parameters $(\pi_1$, $\pi_2$, $\pi_3$, $\mu_{c1}$, $\sigma^2_{c1}$, $\rho_{c1}$, $\mu_{n1}$, $\sigma^2_{n1}$, $\rho_{n1}$, $\mu_2$, $\sigma^2_2$, $\rho_2$, $\mu_{c3}$, $\sigma^2_{c3}$, $\rho_{c3}$, $\mu_{n3}$, $\sigma^2_{n3}$, $\rho_{n3}$, $)$ for the model described in Equations (\ref{Model: mixture of 3-multivariate Normal})- (\ref{Formula: Rt}), which can be estimated by the EM algorithm. \code{para[1]}, \code{para[2]}, and \code{para[3]} are the cluster proportions for the 3 gene clusters (over-expressed, non-differentially expressed, and under-expressed). \code{para[4]}, \code{para[5]}, and \code{para[6]} are the marginal mean, variance, and correlation of gene expression levels of cluster 1 (over-expressed genes) for case subjects; \code{para[7]}, \code{para[8]}, and \code{para[9]} are the marginal mean, variance, and correlation of gene expression levels of cluster 1 (over-expressed genes) for control subjects; \code{para[10]}, \code{para[11]}, and \code{para[12]} are the marginal mean, variance, and correlation of gene expression levels of cluster 2 (non-differentially expressed genes); \code{para[13]}, \code{para[14]}, and \code{para[15]} are the marginal mean, variance, and correlation of gene expression levels of cluster 3 (under-expressed genes) for case subjects; \code{para[16]}, \code{para[17]}, and \code{para[18]} are the marginal mean, variance, and correlation of gene expression levels of cluster 3 (under-expressed genes) for control subjects. Note that genes in cluster 2 are non-differentially expressed across case and control tissue samples. Hence there are only 3 parameters for cluster 2. \end{itemize} For example, to obtain differentially expressed genes for the ALL data (Chiaretti et al., 2004),\nocite{ChiarettiEtAl:2004} we can run either <>= library(GeneSelectMMD) library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD(eSet1, memSubjects, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = TRUE) para <- obj.gsMMD$para print(round(para, 3)) @ or <>= library(GeneSelectMMD) library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mat <- exprs(eSet1) mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 obj.gsMMD <- gsMMD.default(mat, memSubjects, iniGeneMethod = "Ttest", transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet=TRUE) para <- obj.gsMMD$para print(round(para, 3)) @ If we would like to provide the initial 3-cluster partition via the two sample Wilcoxon rank-sum test, then we can run either <>= library(GeneSelectMMD) library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 myWilcox <- function(x, memSubjects, alpha = 0.05) { xc <- x[memSubjects == 1] xn <- x[memSubjects == 0] m <- sum(memSubjects == 1) res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha) res2 <- c(res$p.value, res$statistic - m * (m + 1) / 2) names(res2) <- c("p.value", "statistic") return(res2) } mat <- exprs(eSet1) tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects)) colnames(tmp) <- c("p.value", "statistic") memIni <- rep(2, nrow(mat)) memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1 memIni[tmp[, 1] < 0.05 & tmp[, 2] < 0] <- 3 print(table(memIni)) obj.gsMMD <- gsMMD2(eSet1, memSubjects, memIni, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet = TRUE) para <- obj.gsMMD$para print(round(para, 3)) @ or <>= library(GeneSelectMMD) library(ALL) data(ALL) eSet1 <- ALL[1:100, ALL$BT == "B3" | ALL$BT == "T2"] mat <- exprs(eSet1) mem.str <- as.character(eSet1$BT) nSubjects <- length(mem.str) memSubjects <- rep(0,nSubjects) # B3 coded as 0, T2 coded as 1 memSubjects[mem.str == "T2"] <- 1 myWilcox <- function(x, memSubjects, alpha = 0.05) { xc <- x[memSubjects == 1] xn <- x[memSubjects == 0] m <- sum(memSubjects == 1) res <- wilcox.test(x = xc, y = xn, conf.level = 1 - alpha) res2 <- c(res$p.value, res$statistic - m * (m + 1) / 2) names(res2) <- c("p.value", "statistic") return(res2) } tmp <- t(apply(mat, 1, myWilcox, memSubjects = memSubjects)) colnames(tmp) <- c("p.value", "statistic") memIni <- rep(2, nrow(mat)) memIni[tmp[, 1] < 0.05 & tmp[, 2] > 0] <- 1 memIni[tmp[, 1] < 0.05 & tmp[, 2] < 0] <- 3 print(table(memIni)) obj.gsMMD <- gsMMD2.default(mat, memSubjects, memIni = memIni, transformFlag = TRUE, transformMethod = "boxcox", scaleFlag = TRUE, quiet=TRUE) para <- obj.gsMMD$para print(round(para, 3)) @ Actually, the two sample Wilcoxon rank-sum test is implemented in the \Rpackage{GeneSelectMMD}. The above code is used as an illustration on how to use the functions \code{gsMMD2} and \code{gsMMD2.default}. Note that the speed of the four functions can be slow for large data sets, however it has been improved since version 2.0.7 by using Fotran code. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Error rates estimated for real microarray data sets} For real microarray data sets, the classical gene selection methods, such as two-sample t-test, two-sample Wilcoxon rank-sum test, do not directly provide the estimates of error rates such as false discovery rate (denoted as FDR; it is the percentage of nondifferentially expressed genes among selected genes), false non-discovery rate (denoted as FNDR; it is the percentage of differentially expressed genes among unselected genes), false positive rate (denoted as FPR; it is the percentage of selected genes among nondifferentially expressed genes), and false negative rate (denoted as FNR; it is the percentage of un-selected genes among differentially expressed genes), since the true gene cluster membership is unknown. However, model-based gene selection methods (e.g., eLNN and eGG proposed by Lo and Gottardo (2007),\nocite{LoGottardo:2007} and the method proposed by Qiu et al.'s (2008)) could easily estimate these error rates, since these error rates are functions of some probabilities which are in turn the functions of model parameters. The function \code{errRates} is used to estimate FDR, FNDR, FPR, and FNR based on the objects returned by the four gene selection functions mentioned in the previous section. This function returns a $4\times 1$ vector with elements FDR, FNDR, FPR, and FNR. For example, <>= print(round(errRates(obj.gsMMD), 3)) @ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Visualizing the fit of the model to a real microarray data set} In general, it is difficult to visualize the fit of the model to a real microarray data set since the data are in high-dimensional space. However, it is possible to do so for the special case where the tissue samples within a tissue type are marginally independent. In this case, we can pool all gene expression levels together since they are all independent, and regard them as one-dimensional data. Next, we can impose the density estimate onto the histogram of this pooled data. The function \code{plotHistDensity} is used for such purpose. The following R code illustrates the usage of \code{plotHistDensity}: <>= plotHistDensity(obj.gsMMD, plotFlag = "case", mytitle = "Histogram of gene expression levels for T2\nimposed with estimated density (case)", plotComponent = TRUE, x.legend = c(0.8, 3), y.legend = c(0.3, 0.4), numPoints = 500, cex.main = 1, cex = 1) @ \begin{figure}[h] \caption{Plot produced using plotHistDensity} \label{Figure:plotHistDensity} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Efficiency improvements} Version 2.0.7 includes efficiency improvements to the EM optimization process. Specifically, Fortran 77 code for version 3.0 of the L-BFGS-B optimization algorithm from Morales and Nocedal (2011)\nocite{MoralesNocedal:2011} has been included. The objective functions for the EM optimization have been translated to Fortran 77 in order to facilitate the use of this optimization code. Additional operations within the EM optimization iterative loop have also been written in Fortran 77 to improve the speed, where the calculation of the weigths has been addressed. Lastly, although the t-test is an optional operation, and therefore, improvements in its speed may not always be of consequence, the two-sample t-test calculations have increased efficiency in version 2.0.7. Analysis involving simulated microarray data for 32,000 genes and a total of 2,000 samples is accomplished in under a minute on a notebook computer (Figure ~\ref{Figure:efficiency}), and therefore, the enhanced performance provided by GeneSelectMMD can be attained on larger datasets using typical hardware. \begin{figure}[ht] \begin{center} \includegraphics[width=0.6\textwidth,angle=0]{GS207runTimesSim1k.pdf} \caption{Plot of running time vs number of genes using an Asus notebook with Intel\textregistered{ }Core\texttrademark{ }i5-2450M CPU and 8GB memory} \label{Figure:efficiency} \end{center} \end{figure} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Discussion} The speeds of the four gene selection functions described in Section \ref{Section: Gene Selection Functions} can be slow, however the speed has been improved by embedding Fortran code in the R code for version 2.0.7 of the \Rpackage{GeneSelectMMD}. \bibliographystyle{plain} \bibliography{GeneSelectMMD} \end{document}