%\VignetteIndexEntry{SDAMS Vignette} %\VignettePackage{SDAMS} %\VignetteKeyword{Semi-parametric Differential Adundance Analysis} \documentclass[12pt]{article} \usepackage{float} \usepackage{Sweave} \usepackage{amsmath} \usepackage{amssymb} \RequirePackage{Bioconductor} \AtBeginDocument{\bibliographystyle{unsrturl}} \renewcommand{\baselinestretch}{1.3} \SweaveOpts{keep.source=TRUE,eps=FALSE,include=TRUE,width=4,height=4} \author{Yuntong Li$^{1}$, Chi Wang$^{2,3}$\footnote{to whom correspondence should be addressed}, Li Chen$^{2,3}$\footnote{to whom correspondence should be addressed}\\[1em] \small{$^{1}$Department of Statistics , University of Kentucky,Lexington, KY;}\\ \small{$^{2}$Markey Cancer Center, University of Kentucky, Lexington, KY;}\\ \small{$^{3}$Division of Cancer Biostatistics, Department of Internal Medicine;}\\ \small{\texttt{yuntong.li@uky.edu}}\\ \small{\texttt{chi.wang@uky.edu}}\\ \small{\texttt{lichenuky@uky.edu}}} \title{\textsf{\textbf{The SDAMS package}}} %\bibliographystyle{abbrv} \begin{document} \SweaveOpts{concordance=TRUE} \maketitle \begin{abstract} This vignette introduces the use of the Bioconductor package {\tt SDAMS}, which is designed for differential abundance analysis for metabolomics and proteomics data from mass spectrometry and differential expression analysis for single-cell RNA sequencing data. These data may contain a large fraction of zero values and the non-zero part may not be normally distributed. {\tt SDAMS} considers a two-part semi-parametric model, a logistic regression for the zero proportion and a semi-parametric log-linear model for the non-zero values. A kernel-smoothed likelihood method is proposed to estimate regression coefficients in the two-part model and a likelihood ratio test is constructed for differential abundant/expression analysis. \end{abstract} \newpage \tableofcontents \newpage \section{Citation} The package {\tt SDAMS} implements statistical methods from the following publication. If you use {\tt SDAMS} in the published research, please cite: \\ Li, Y., Fan, T.W., Lane, A.N. et al. SDA: a semi-parametric differential abundance analysis method for metabolomics and proteomics data. BMC Bioinformatics 20, 501 (2019).\\ Yuntong Li, Chi Wang and Li Chen: SDAMS: an R Package for differential expression analysis of single-cell RNA sequencing data (Manuscript). \section{Quick Start} This section show the most basic {\tt SDAMS} work flow for a differential abundance analysis for metabolomics and proteomics data from mass spectrometry or differential expression analysis for single-cell RNA sequencing data: \begin{enumerate} \item Create a {\tt SummarizedExperiment} object using function {\tt createSEFromMatrix} or {\tt createSEFromCSV}. In this section we use an example {\tt SummarizedExperiment} object directly, which is an object of {\tt SummarizedExperiment} class named {\tt exampleSumExp} contained in this package. \item Perform a differential abundance analysis or differential expression analysis using {\tt SDA}. \end{enumerate} <>= library("SDAMS") data("exampleSumExp") results <- SDA(exampleSumExp) @ Here, the {\tt SummarizedExperiment} class object {\tt exampleSumExp} contained in the package is the proteomics dataset, which a matrix-like container for proteomic features with experimental subject grouping information. There are 560 features for 202 experimental subjects with 49 prostate cancer subjects and 153 healthy subjects (0 for healthy control and 1 for patient in this case). This is a 10\% subsample of the original dataset. The features are stored as a matrix in the assay slot. Each row in this matrix represents a proteomic feature and each column represents a subject. See Reference~ \cite{siwy2011human} for detailed information regarding this dataset. <>= data("exampleSingleCell") results_SC <- SDA(exampleSingleCell) @ The {\tt SDAMS} package also provides an example for single-cell RNA sequencing data in a {\tt SummarizedExperiment} class object {\tt exampleSingleCell}. There are 92 single cells (48 mouse embryonic stem (ES) cells and 44 mouse embryonic fibroblasts (MEF) cells) that were analyzed. This example data in the form of TPM (transcripts per kilobase million) values contains 10\% of genes which are randomly sampled from the original dataset. See Reference~\cite{islam2011characterization} for detailed information regarding this dataset. \section{Data Input} \subsection{Create SummarizedExperiment object from csv files} The proteomics or metabolomics data is stored as a matrix with each row being a feature and each column corresponding to a subject. All data in this matrix are non-negative. Another information required is the phenotype covariates. Here we focus on the binary grouping information, for example, numeric 1 for control group and 0 for case group. But it can also be characters, such as "healthy" and "disease". To utilize {\tt SDAMS} package, we should have two separate csv files (for example 'feature.csv' and 'group.csv') as inputs for {\tt createSEfromCSV} to creat a {\tt SummarizedExperiment} object. Note: \begin{enumerate} \item The $1^{st}$ column in 'feature.csv' represents feature names and the $1^{st}$ row represents subject codes. \item The $1^{st}$ column in 'group.csv' represents subject codes, for example, Subject1, Subject2.... \end{enumerate} The format for "csv files" should look like as Figure~\ref{example feature} and Figure~\ref{example group}: \begin{figure}[h!] \centering \includegraphics{feature.png} \caption{Example of 'feature.csv' pattern} \label{example feature} \end{figure} \begin{figure}[ht] \centering \includegraphics[width=2cm]{group.png} \caption{Example of 'group.csv' pattern} \label{example group} \end{figure} After creating the two csv files, we need the paths for the two csv files: <>= path1 <- "/path/to/your/feature.csv/" path2 <- "/path/to/your/group.csv/" @ Here for demonstration purpose, we use the data stored in {\tt inst/extdata} directory. This is the csv format of the data in exampleSumExp which is a {\tt SummarizedExperiment} object we described before. <>= directory1 <- system.file("extdata", package = "SDAMS", mustWork = TRUE) path1 <- file.path(directory1, "ProstateFeature.csv") directory2 <- system.file("extdata", package = "SDAMS", mustWork = TRUE) path2 <- file.path(directory2, "ProstateGroup.csv") @ then use the function {\tt createSEFromCSV} after loading the {\tt SDAMS} package. <>= library("SDAMS") exampleSE1 <- createSEFromCSV(path1, path2) exampleSE1 @ The feature data and grouping information can be accessed using {\tt SummarizedExperiment} commands: <>= head(assay(exampleSE1)[,1:10]) head(colData(exampleSE1)$grouping) @ \subsection{Create SummarizedExperiment object from seperate matrix} If the two datasets have already been cleaned and loaded into R as matrices, then we can use {\tt createSEFromMatrix} to create a {\tt SummarizedExperiment} object. Note that {\tt groupInfo} is the design matrix. The first colomn of this design matrix is the cell subpopulation, and the following columns could be additional covariates. <>= set.seed(100) featureInfo <- matrix(runif(8000, -2, 5), ncol = 40) featureInfo[featureInfo<0] <- 0 rownames(featureInfo) <- paste("gene", 1:200, sep = '') colnames(featureInfo) <- paste('cell', 1:40, sep = '') groupInfo <- data.frame(grouping=matrix(sample(0:1, 40, replace = TRUE), ncol = 1)) rownames(groupInfo) <- colnames(featureInfo) exampleSE2 <- createSEFromMatrix(feature = featureInfo, colData = groupInfo) exampleSE2 head(assay(exampleSE2)[,1:10]) head(colData(exampleSE2)$grouping) @ \section{Data Analysis} \subsection{Proteomics example data} Finally, we perform differential abundance analyais using {\tt SummarizedExperiment} object created in previous section. This can be done by using function {\tt SDA}. The theory behind {\tt SDA} can be reached at section \ref{theory}. A list with point estimates, p-values, q-values and corresponding feature names is returned. Below are the results generated by using the {\tt SummarizedExperiment} object exampleSE1. <>= results <- SDA(exampleSE1) head(results$gamma[,1]) head(results$pv_gamma[,1]) head(results$qv_gamma[,1]) @ In this example , there is only one group covariate applied to each subject. Here $\textbf{X}_i$ is one dimension. The covariate effect on the fraction of zero values for certain feature is $\gamma$, which is estimated to be 0.11 for the first feature, and 0.86 for the second feature, etc. The corresponding hypothesis is $H_0$: $\gamma=0$ vs. $H_1$: $\gamma \ne 0$. The p-values calculated from likelihood ratio test are returned in {\tt pv\_gamma}. Users can determine their own significance level to make inference, such as 0.05 nominal level. We also provide a FDR adjustment method \cite{storey2003statistical} used in {\tt SDA} for multiple comparison issues. Those results for $\gamma$ are stored in {\tt qv\_gamma}. <>= head(results$beta[,1]) head(results$pv_beta[,1]) head(results$qv_beta[,1]) @ The model parameter $\beta$ is the log fold change in the non-zero abundance comparing different values of the single group covariate for certain feature. The corresponding two-sided hypothesis is $H_0$: $\beta=0$ vs. $H_1$: $\beta \ne 0$. Again, {\tt SDA} will return p-values and adjusted p-values (q-values) for parameter $\beta$, and they are stored in {\tt pv\_beta} and {\tt qv\_beta} respectively. <>= head(results$pv_2part[,1]) head(results$qv_2part[,1]) @ Hypothesis testing on overall effect of group covariate on certain feature is performed by assessing $\gamma$ and $\beta$. The null hypothesis $H_0:$ $\gamma=0$ and $\beta=0$ against alternative hypothesis $H_1:$ at least one of the two parameters is non-zero. The p-values are calculated based on chi-square distribution with 2 degrees of freedom. And the corresponding q-values are calculated using the same procedure as in one-part test. <>= head(results$feat.names) @ A vector of feature names is returned for convenience which corresponds to the results in the other components. \subsection{Single-cell RNA sequencing example data} SDAMS can also perform differential expression analysis for single-cell RNA sequencing data. % <>= % data(exampleSingleCell) % @ In this section, we will use the example data generated in Section 3.2. This toy data set has 200 genes and 40 cells. The first column of the design matrix is the cell subpopulation, and additional covariates can be added into the design matrix. We are interested in identifying genes that are differentially expressed in two different cell subpopulations, which is quantified by $\gamma_{Z}$ and $\beta_{Z}$. The $\gamma_{Z}$ is the log odds ratio comparing rate of expression, and $\beta_{Z}$ is the log fold change comparing the mean gene expression of the expressed cells between the two cell subpopulations. <>= results_SC <- SDA(exampleSE2) head(results_SC$pv_gamma[,1]) head(results_SC$qv_gamma[,1]) head(results_SC$pv_beta[,1]) head(results_SC$qv_beta[,1]) head(results_SC$pv_2part[,1]) head(results_SC$qv_2part[,1]) head(results_SC$feat.names) @ Three types of hypotheses can be tested through function {\tt SDA}: $H_0^1: \beta_{Z}=0$, $H_0^2: \gamma_{Z}=0$, and $H_0^3: \gamma_{Z}=0 \textrm{ and } \beta_{Z} = 0$. Likelihood ratio test is conducted for each test and p-value is returned by {\tt SDA} for each single gene. The corresponding q-values are also calculated to address multiple comparison correction. For Hypothesis $H_0^1$, the p-value and corresponding q-value for each gene are returned in {\tt pv\_beta} and {\tt qv\_beta}. For Hypothesis $H_0^2$, the p-value and corresponding q-value for each gene are returned in {\tt pv\_gamma} and {\tt qv\_gamma}. For example, the p-value for testing the covariate effect on comparing the proportion of drop-outs for Gene 1 is 0.496, which is not significant under 0.05 level. The corresponding q-value is 0.955. For testing whether there is difference in either the rate of expression or mean expression for the expressed cells, $H_0^3$, the p-value and corresponding q-value for each gene are returned in {\tt pv\_2part} and {\tt qv\_2part}. \section{Theory for SDAMS}\label{theory} As mentioned in the abstract, metabolomics and proteomics data from mass spectrometry or single-cell RNA sequencing data maybe a mixture of zero intensity values and possibly non-normally distributed non-zero intensity values. Therefore, the differential abundance/expression analysis needs to be performed to compare both the zero proportion and the mean of non-zero values between groups and also allows adjustment of covariates. SDA is a two-part model which addresses these issues that uses a logistic regression model to characterize the zero proportion and a semiparametric model to characterize non-zero values. \subsection{A two-part semi-parametric model} The differential abundance/expression analysis in SDAMS has the following forms. For binary part: \[\mathrm{log}(\frac{\pi_{i}}{1-\pi_{i}})=\gamma_{0}+\boldsymbol{\gamma} \boldsymbol{X}_{i} , \] For continuous non-zero part: \[\mathrm{log}(Y_{i})=\boldsymbol{\beta} \boldsymbol{X}_i + \varepsilon_{i}, \] where $Y_{i}$ $(i=1, 2, ..., N)$ is a random variable and $\pi_{i}=Pr(Y_{i}=0)$. $\boldsymbol{X}_i$ is a vector of covariates. The corresponding model parameters $\boldsymbol{\gamma}$ quantify the covariates effects on the fraction of zero values and $\gamma_{0}$ is the intercept. $\boldsymbol{\beta}$ are the model parameters quantifying the covariates effects on the non-zero values, and $\varepsilon_{i}$ $(i=1,2,..., N)$ are independent error terms with a common but completely unspecified density function $f$. Importantly, we do not impose any distributional assumption on $f$. Without assuming a specific parametric distribution for $\varepsilon_{i}$, this model is much more flexible to characterize data with unknown and possibly non-normal distribution. We replace $f$ by its kernel density estimator in the likelihood function. The maximum likelihood estimator is obtained through a trust region maximization algorithm. \subsection{Identification of differentially abundant features based on data from mass spectrometry} In this case, $Y_{i}$ represents the abundance of certain feature for subject $i$, $\pi_{i}=Pr(Y_{i}=0)$ is the probability of point mass. $\boldsymbol{X}_i=(X_{i1},X_{i2},...,X_{iQ})^T$ is a $Q$-vector covariates that specifies the treatment conditions applied to subject $i$. The corresponding $Q$-vector of model parameters $\boldsymbol{\gamma}=(\gamma_{1},\gamma_{2}, ...,\gamma_{Q})^T$ and $\boldsymbol{\beta}=(\beta_{1},\beta_{2},...,\beta_{Q})$ quantify the covariates effects for certain feature. For each feature, the likelihood ratio test is performed on the null hypothesis $H_0:$ $\gamma_{q}=0$ and $\beta_{q}=0$ against alternative hypothesis $H_1:$ at least one of the two parameters is non-zero. We also consider the hypotheses for testing $\gamma_{q}=0$ and $\beta_{q}=0$ separately. To adjust for multiple comparisons across features, the false discovery discovery rate (FDR) q-value is calculated based on the {\tt qvalue} function in {\tt qvalue} package in R/Bioconductor (See Reference~ \cite{storey2003statistical} for details). \subsection{Identification of differentially expressed genes based on single-cell RNA sequencing data} In this case, $Y_{i}$ represents the expression (TPM value) of certain gene in the $i$th cell, $1-\pi_{i}=Pr(Y_{i}>0)$ is the rate of expression. $\boldsymbol{X}_i=(Z_{i},\boldsymbol{W}_i)^T$ is a vector of covariates with $Z_{i}$ being a binary indicator of the cell population under comparison and $\boldsymbol{W}_i$ being a vector of other covariates, e.g. batch and cellular detection rate, and $\boldsymbol{\gamma}=(\gamma_Z, \boldsymbol{\gamma}_W)$ and $\boldsymbol{\beta}=(\beta_Z, \boldsymbol{\beta}_W)$ are model parameters. We are interested in identifying genes that are differentially expressed in two different cell subpopulations, as quantified by $\gamma_Z$ and $\beta_Z$. For each gene, three null hypotheses, i.e. $\beta_Z=0$, $\gamma_Z=0$, and $\beta_Z =0 \textrm{ and } \gamma_Z=0$, are examined based on likelihood ratio tests. Multiple comparison correction is addressed by calculating the false discovery discovery rate (q-values) (See Reference~\cite{storey2003statistical} for details). \section{Session Info} <>= toLatex(sessionInfo()) @ \bibliography{reference} \end{document}