--- title: "ANOVA-Like Differential Expression tool for high throughput sequencing data" shorttitle: "ALDEx2" author: - name: Greg Gloor affiliation: Dep't of Biochemistry, University of Western Ontario email: ggloor@uwo.ca bibliography: aldex.bib output: BiocStyle::html_document: toc_float: true BiocStyle::pdf_document: default package: ALDEx2 abstract: | Instructions on installing and using probabilisitic modelling and compositional data analysis using ALDEx2. vignette: | %\VignetteIndexEntry{ANOVA-Like Differential Expression tool for high throughput sequencing data} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- # Introduction to `r Biocpkg("ALDEx2")` This provides an overview of the R package ALDEx version 2 (`r Biocpkg("ALDEx2")`) for differential (relative) abundance analysis of high throughput sequencing count compositional data^[all high throughput sequencing data are compositional [@gloorFrontiers:2017; @GloorAJS2023] because of constraints imposed by the instruments]. However, Nixon et al.^[not accounting for scale leads to both false positive and false negative inference and is also a principled method of accounting for asymmetry in the underlying environment @nixon2023scale] have found that the scale of the data can be modelled and `r Biocpkg("ALDEx2")` now includes scale modelling by default. The package was developed and used initially for multiple-organism RNA-Seq data generated by high-throughput sequencing platforms (meta-RNA-Seq)^[@macklaim:2013], but testing showed that it performed very well with traditional RNA-Seq datasets^[@Quinn:2018aa], 16S rRNA gene variable region sequencing^[@bian:2017] and selective growth-type (SELEX) experiments^[@mcmurrough:2014;@Wolfs:2016aa], and that the underlying principles generalize to single-cell sequencing^[@GloorAJS2023]. The analysis method should be applicable to any type of data that is generated by high-throughput^[@fernandes:2014]. Most recently, Nixon et al.^[@nixon2023scale] showed that the use of the CLR^[and all other normalizations which are based on ratios --- which is to say all of them] made strong assumptions about the scale of the underlying system. ALDEx2 now includes methods to include scale uncertainty and so incorporate both compositional and scale uncertainty together in the analysis. See the scaleSim vignette for more information and a complete walk-through. # Installation There are two ways to download and install the most current of `r Biocpkg("ALDEx2")`. The most recent version of the package will be found at [github.com/ggloor/ALDEx_bioc](https://github.com/ggloor/ALDEx_bioc). The package will run with only the base R packages, the standard tidyverse packages and a minimal Bioconductor install, and is capable of running several functions with the `parallel` package if installed. It is recommended that the package be run on the most up-to-date R and Bioconductor versions. - Install stable version from Bioconductor ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ALDEx2") ``` Here all dependencies should be installed automatically. - Install development version from github: ```{r, eval=FALSE} install.packages("devtools") devtools::install_github("ggloor/ALDEx_bioc") ``` In this case, you may need to install additional packages manually. ```{r basecode, eval=T, echo=F, message=F, error=F, warning=F} # this is all the data generated once # all other data will be displayed but not evaluated ##### RECODE THE SOURCE PRIOR TO PUSHING TO BIOCONDUCTOR # devtools::load_all('~/Documents/0_git/ALDEx_bioc') library(ALDEx2) set.seed(11) #### BASIC ALDEx2 #subset only the last 400 features for efficiency data(selex) selex.sub <- selex[1200:1600,] conds <- c(rep("NS", 7), rep("S", 7)) x.aldex <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE, include.sample.summary=FALSE, denom="all", verbose=FALSE, paired.test=FALSE, gamma=NULL) # using the individual functions # this is the un-scaled version x <- aldex.clr(selex.sub, conds, mc.samples=16, denom="all", verbose=F, gamma=1e-3) x.tt <- aldex.ttest(x, hist.plot=F, paired.test=FALSE, verbose=FALSE) x.effect <- aldex.effect(x, CI=T, verbose=F, include.sample.summary=F, paired.test=FALSE) x.all <- data.frame(x.tt,x.effect) # glm example covariates <- data.frame("A" = sample(0:1, 14, replace = TRUE), "B" = c(rep(0, 7), rep(1, 7))) mm <- model.matrix(~ A + B, covariates) # replace the conditions vector with the mm # data in glm and tt should be identical x.glm <- x x.glm@conds <- mm glm.test <- aldex.glm(x.glm, mm, fdr.method='holm') glm.eff<- aldex.glm.effect(x.glm) # now add scale x.u <- x x.u.e <- x.effect # scaled x.s <- aldex.clr(selex.sub, conds, mc.samples=16, gamma=1, verbose=FALSE) x.s.e <- aldex.effect(x.s, CI=T) x.s.tt <- aldex.ttest(x.s) x.s.all <- cbind(x.s.e, x.s.tt) ``` # What `r Biocpkg("ALDEx2")` does differently `r Biocpkg("ALDEx2")` is a different tool than others used for differential abundance testing, but that does not mean it is complicated. Instead, it is incredibly simple once you understand four concepts. - First, the data you collect is a single point-estimate of the data. That means, if you sequenced the same samples, or even the same library, you would get a slightly different answer. We know this because there are a number of early technical replication papers that show exactly this^[@Marioni:2008]. - Second, that the data collected are probabilistic; that the count data is a representation of the probability of sampling the gene fragment from the mixture of gene fragments in the library. Many other tools assume the data are counts from a multivariate distribution modelled by a Negative Binomial distribution. - Third, that we can pretty accurately estimate this probabilisitic technical variation. For this, we assume the data collected are from a multivariate Poisson distribution constrained to have the total number of counts that the instrument can deliver^[@fernandes:2014;@gloorAJS:2016]. This can be estimated by sampling from the appropriate distribution, in this case a multinomial Dirichlet distribution. - Fourth, that the use of the CLR (or any other normalization) makes strong assumptions about the scale of the sampled environment. Nixon et al.^[@nixon2023scale] demonstrated how to incorporate scale uncertainty so that the effect of scale on the analysis can be modelled and accounted for. More formally, the `r Biocpkg("ALDEx2")` package estimates per-feature technical variation within each sample using Monte-Carlo instances drawn from the Dirichlet distribution. Sampling from this distribution returns a posterior probability distribution of the observed data under a repeated sampling model. This distribution is converted to a log-ratio that linearizes the differences between features. This log-ratio can be modified to include uncertainty regarding the scale of the data when the log-ratio is calculated. Thus, the values returned by `r Biocpkg("ALDEx2")` are posterior estimates of test statistics calculated on log-ratio transformed distributions. `r Biocpkg("ALDEx2")` has been altered to ensure that the elements of the posterior statistical tests agree on their sign with the outcome that some edge cases with low predicted p-values are no longer counted as significant at high false discovery rates. Combining two-sample tests into a posterior estimate is somewhat intricate^[@van1967combination], but in practice the outcomes should be very similar to those seen before, but edge cases with expected false discovery rates greater than 0.05 have become more rigorous. `r Biocpkg("ALDEx2")` uses by default the centred log-ratio (clr) transformation which is scale invariant^[@Aitchison:1986]. While this is a convenient assumption for many cases, this can lead to Type 1 and Type 2 errors^[@nixon2023scale] and we now suggest that modest amounts of scale be built into the analysis by default. The sub-compositional coherence property ensures that the answers obtained are consistent when parts of the dataset are removed (e.g., removal of rRNA reads from RNA-seq studies or rare OTU species from 16S rRNA gene amplicon studies). In the absence of scale, all feature abundance values are expressed relative to the geometric mean abundance of other features in a sample. This is conceptually similar to a quantitative PCR where abundances are expressed relative to a standard: in the case of the clr transformation, the standard is the per-sample geometric mean abundance. See Aitchison (1986) for a complete description. If the question at hand is relative change, then `r Biocpkg("ALDEx2")` can do this. If the question at hand is scale dependent, then `r Biocpkg("ALDEx2")` can do that too! # Quick Start: `aldex` with 2 groups: The `r Biocpkg("ALDEx2")` package in Bioconductor is suitable for the comparison of many different experimental designs. The test dataset is a single gene library with 1600 sequence variants^[@mcmurrough:2014] cloned into an expression vector at approximately equimolar amounts. The wild-type version of the gene conferred resistance to a topoisomerase toxin. Seven independent growths of the gene library were conducted under selective and non-selective conditions and the resulting abundances of each variant was read out by sequencing a pooled, barcoded library on an Illumina MiSeq. The data table is accessed by `data(selex)`. In this data table, there are 1600 features and 14 samples. The analysis takes approximately 2 minutes and memory usage tops out at less than 1Gb of RAM on a mobile i7 class processor when we use 128 Dirichlet Monte-Carlo Instances (DMC). On an M2 Pro chip, this takes approximately 6 seconds. For speed concerns we use only the last 400 features and perform only 16 DMC. The command used for `r Biocpkg("ALDEx2")` is presented below: ```{r, echo=T, eval=F} library(ALDEx2) #subset only the last 400 features for efficiency data(selex) selex.sub <- selex[1200:1600,] conds <- c(rep("NS", 7), rep("S", 7)) x.aldex <- aldex(selex.sub, conds, mc.samples=16, test="t", effect=TRUE, include.sample.summary=FALSE, denom="all", verbose=FALSE, paired.test=FALSE, gamma=NULL) ``` ```{r three-plots, echo=T, message=F, error=F, warning=F, fig.cap="MA, Effect and Volcano plots of ALDEx2 output. The left panel is an Bland-Altman or MA plot that shows the relationship between (relative) Abundance and Difference. The middle panel is an effect plot that shows the relationship between Difference and Dispersion; the lines are equal difference and dispersion. The right hand plot is a volcano plot; the lines represent a posterior predictive p-value of 0.001 and 1.5 fold difference. In all plots features that are not significant are in grey or black. Features that are statistically significant are in red. The log-ratio abundance axis is the clr value for the feature."} # note default is FDR of 0.05 par(mfrow=c(1,3)) aldex.plot(x.aldex, type="MA", test="welch", xlab="Log-ratio abundance", ylab="Difference", main='Bland-Altman plot') aldex.plot(x.aldex, type="MW", test="welch", xlab="Dispersion", ylab="Difference", main='Effect plot') aldex.plot(x.aldex, type="volcano", test="welch", xlab="Difference", ylab="-1(log10(q))", main='Volcano plot') ``` Figure \@ref(fig:three-plots) shows the result of a two sample t-test and effect size calculation using `aldex()`. We set the test to 't', and effect to `TRUE`. The 't' option evaluates the data as a two-factor experiment using both the Welch's t and the Wilcoxon rank tests. The t-test includes an expected value of the false discovery rate correction. The data can be plotted onto Bland-Altmann^[@altman:1983] (MA) or effect (MW)^[@gloor:effect] or volcano^[@Cui:2003aa] plots for for two-way tests using the `aldex.plot()` function. See the end of the modular section for examples of the plots. The simple approach outlined above just calls `aldex.clr, aldex.ttest, aldex.effect` in turn and then merges the data into one dataframe called `x.all`. We will show these modules in turn, and then examine additional modules. There is a lot more information that can be obtained than with these three plots. One of the most powerful aspect of `r Biocpkg("ALDEx2")` is that *everything* is calculated on posterior distributions. The underlying posterior distribution can be viewed if you use the individual modules explained next. # Modular `r Biocpkg("ALDEx2")` is way more informative For many users, the `aldex` function may be sufficient. However, for power users there is a lot of value to the modular approach. There are several built-in tools that allow a much more powerful exploration of the data. In addition, all the intermediate values are exposed for the underlying entre log-ratio transformed Dirichlet Monte-Carlo replicate values. This makes it possible for anyone to add the specific R code for their experimental design --- a guide to these values is outlined below. We will use the same example and generate the same plots but include more information that can be useful. First we load the library and the included selex dataset. Then we set the comparison groups. This must be a vector of conditions in the same order as the samples in the input counts table. The `aldex` command is calling several other functions in the background, and each of them returns diagnostics. These diagnostics are suppressed for this vignette. The code block below shows the full set of command, but are not run. Each is run in an individual code block ```{r, echo=T, eval=F, fig.cap="First load the libraries and the data."} data(selex) #subset only the last 400 features for efficiency selex.sub <- selex[1200:1600,] conds <- c(rep("NS", 7), rep("S", 7)) x <- aldex.clr(selex.sub, conds, mc.samples=16, denom="all", verbose=F, gamma=1e-3) x.tt <- aldex.ttest(x, hist.plot=F, paired.test=FALSE, verbose=FALSE,) x.effect <- aldex.effect(x, CI=T, verbose=F, include.sample.summary=F, paired.test=FALSE, glm.conds=NULL, useMC=F) x.all <- data.frame(x.tt,x.effect) par(mfrow=c(1,2)) aldex.plot(x.all, type="MA", test="welch") aldex.plot(x.all, type="MW", test="welch") ``` ## The `aldex.clr` module The workflow for the modular approach first generates random instances of the centred log-ratio transformed values. There are three inputs: counts table, a vector of conditions, and the number of Monte-Carlo (DMC) instances; and two main parameters: the level of verbosity (TRUE or FALSE) and the scale parameter. We recommend 128 or more mc.samples for the t-test, 1000 for a rigorous effect size calculation, and at least 16 for ANOVA.^[in fact we recommend that the number of samples in the smallest group multiplied by the number of DMC be equal at least 1000 in order to generate a reasonably stable estimate of the posterior distribution] Additionally, while in this example, gamma is set to 1e-3 (i.e., almost nothing) we recommend including gamma in the range of 0.25 to 1 for most datasets. See below, and the the `scale_sim` vignette for more detail. This operation is fast. ```{r,echo=T, eval=F, message=F, error=F, warning=F, results="hide", fig.cap="the output is in the S3 object 'x'."} x <- aldex.clr(selex.sub, conds, mc.samples=16, verbose=F, gamma=1e-3) ``` The output in `x` contains a lot of information, and you should see the ALDEx2 R documentation for a complete explanation. The impatient can see the commands to access all the available data with `methods(class='aldex.clr')` and the individual methods can be accessed via `help(getFeatures)` substituting getFeatures for the appropriate method name. ## The `aldex.ttest` module The next operation performs the Welch's t and Wilcoxon rank test for the situation when there are only two conditions. There is only one input: the aldex object from `aldex.clr`. Several parameters modify this test: whether or not to perform a paired test `paired-test`; whether or not to display the histogram of p-values for the first MC instance `hist.plot`; whether or not to give verbose progress `verbose`. The significance tests are posterior p-values and are calculated as one-tailed tests with the test statistic doubled to approximate the two-tailed test^[@van1967combination]. This gets rid of a small number of false positive p-values that arose because of the sign of the test statistic being unstable in the underlying posterior distribution. This operation is reasonably fast thanks to Thom Quinn and Michelle Nixon. ```{r,echo=TRUE, eval=F, results="hide", fig.cap="the output is a dataframe x.tt"} x.tt <- aldex.ttest(x, hist.plot=F, paired.test=FALSE, verbose=FALSE) ``` ## The `aldex.effect` module We estimate and report effect sizes based on the within and between condition values in the case of two conditions. This step is required for plotting; in our lab we base our conclusions primarily on the output of this function^[@macklaim:2013;@mcmurrough:2014;@bian:2017]. There is one input: the aldex object from aldex.clr; and several useful parameters. It is possible to include the 95% confidence interval information for the effect size estimate with the boolean flag `CI=TRUE`. This can be helpful when deciding whether to include or exclude specific features from consideration. We find that a large effect that is driven by an outlier can be because of false positives. New is the option to calculate effect sizes for paired t-tests or repeated samples with the boolean `paired.test=TRUE` option. In this case the difference between and difference within values are calculated as the median paired difference and the median absolute deviation of that difference. The standardized effect size is their ratio and is usually slightly larger than the unpaired effect size. The supplied selex dataset is actually a paired dataset and can be used to explore this option. Other parameters are `include.sample.summary` which gives the expected clr values for each feature; `useMC` for multicore-testing shows this is usually slower than single core; `glm.conds` when calculating effects for the glm contrasts; and finally `verbose` if you want to more closely follow progress. The function has been heavily optimized for speed and includes a lot of vectorization under the hood. ```{r,echo=T, eval=F, results="hide",fig.cap="the output is a dataframe x.effect"} x.effect <- aldex.effect(x, CI=T, verbose=F, include.sample.summary=F, paired.test=FALSE) ``` ## The `aldex.plot` module We merge the data into one object for plotting. We see that the plotted data in Figure 1 and 2 are essentially the same. Note that plotting for the `aldex.glm` function is different and is shown later. ```{r three2-plots,fig=TRUE,echo=TRUE,width=7,height=4, fig.cap="Output from aldex.plot function. The left panel is the MA plot, the right is the MW (effect) plot. In both plots red represents features called as differentially abundant with q $<0.05$; grey are abundant, but not differentially abundant; black are rare, but not differentially abundant. This function uses the combined output from the aldex.ttest and aldex.effect functions above"} # merge into one output for convenience x.all <- data.frame(x.tt,x.effect) par(mfrow=c(1,3)) aldex.plot(x.all, type="MA", test="welch", main='MA plot') aldex.plot(x.all, type="MW", test="welch", main='effect plot') aldex.plot(x.all, type="volcano", test="welch", main='volcano plot') ``` The `aldex.plot()` function gives you a lot of control. Figure \@ref(fig:three2-plots) shows the three plot types; effect (type='MW'), MA or Bland-Altmann (type='MA') and volcano (type='volcano). They are nearly identical to those in \@ref(fig:three-plots), differences are due to the random Monte-Carlo sampling. You can choose to use statistical tests or only effect sizes (type='welch' or 'effect'). You can control the title, axis labels and scales as well. See `?aldex.plot` for all this can do! `r Biocpkg("ALDEx2")` also allows you to plot and explore the underlying posterior distributions for each part. This is with the `aldex.plotFeature()` function and the output is shown in Figure \@ref(fig:plot-feature). This is a great way to get intuition around the underlying distributions. We see that the posterior distributions are skewed, that there is considerable uncertainty in the difference and effect sizes, and that the tails are very long. ```{r plot-feature,fig=TRUE,echo=TRUE,width=7,height=4, fig.cap="Output from aldex.plotFeature function. This plots the posterior distribution, the difference between and within distributions (dispersion) and the effect size distributions. The values reported are the medians of those distributions, but exploration shows that for all but the most separated parts, there is essentially always some overlap or uncertainty in the posteriors. "} # merge into one output for convenience # we start with the ouput from aldex.clr # we provide the name of the row we wish to explore aldex.plotFeature(x, 'S:E:G:D') ``` ### The effect confidence interval As noted above, the `r Biocpkg("ALDEx2")` package generates a posterior distribution of the probability of observing the count given the data collected. Here we show the importance of this approach by examining the 95% CI of the effect size. Throughout, we use a standardized effect size, similar to the Cohen's d metric, although our effect size is more robust and more conservative (being approximately 0.7 Cohen's d when the data are Normally distributed)^[Fernandes, Gloor unpublished]. Examining the outputs of the effect, MA and the posterior distribution plots shows a very important point: there is a tremendous amount of latent variation in sequencing data \emph{simply from random sampling alone}. We observe that there are a few features that have an expected q value that is statistically significantly different, which are both relatively rare and which have a relatively small difference. Even when identifying features based on the expected effect size can be misleading. We find that the safest approach is to identify those features that where the 95% CI of the effect size does not cross 0. Examining the figures we see that the features that are the rarest are least likely to reproduce the effect size with simple random sampling. The behaviour of the 95% CI metric is exactly in line with our intuition: the precision of estimation of rare features is poor---if you want more confidence in rare features you must spend more money to sequence more deeply. Figure \@ref(fig:CI) shows that using both an effect size (or if you wish a p-value cutoff) along with the effect CI cutoff can remove some features that are not truly separable between groups. With this approach we are accepting the biological variation in the data as received; i.e. we are not inferring any additional biological variation: i.e., the experimental design is always as given. We are identifying those features where simple random sampling of the library would be expected to give the same result every time. This is the approach that was used in [@macklaim:2013], the results of which were independently validated and found to be very robust [@nelson:2015vaginal]. This approach requires that the CI=T parameter be set when calculating effect sizes. ```{r CI,fig=TRUE,echo=FALSE,width=7,height=4, fig.cap="Comparing the mean effect with the 95\\% CI. The left panel is MA plot, the right is the MW (effect) plot. In both plots red points indicate features that have an expected effect value of $>2$; those with a blue circle have the 95\\% CI that does not overlap 0; and grey are features that are not of interest. Both the raw effect size and the CI test exclude different features from being different. This plot uses only the output of the aldex.effect function. The grey lines in the effect plot show the effect=2 isopleth."} sgn <- sign(x.effect$effect.low) == sign(x.effect$effect.high) par(mfrow=c(1,2)) plot(x.effect$rab.all, x.effect$diff.btw, pch=19, cex=0.3, col="grey", xlab="Abundance", ylab="Difference", main="Bland-Altman") points(x.effect$rab.all[abs(x.effect$effect) >=2], x.effect$diff.btw[abs(x.effect$effect) >=2], pch=19, cex=0.5, col="red") points(x.effect$rab.all[sgn], x.effect$diff.btw[sgn], cex=0.7, col="blue") plot(x.effect$diff.win, x.effect$diff.btw, pch=19, cex=0.3, col="grey",xlab="Dispersion", ylab="Difference", main="Effect") points(x.effect$diff.win[abs(x.effect$effect) >=2], x.effect$diff.btw[abs(x.effect$effect) >=2], pch=19, cex=0.5, col="red") points(x.effect$diff.win[sgn], x.effect$diff.btw[sgn], cex=0.7, col="blue") abline(0,2, lty=2, col="grey") abline(0,-2, lty=2, col="grey") ``` ## Complex study designs and the aldex.glm module The aldex.glm module is included so that the probabilistic compositional approach can be used for complex study designs. This module is substantially slower than the two-comparison tests above, but we think it is worth it if you have complex study designs. Essentially, the approach is the modular approach above but using a model matrix and covariates supplied to the glm function in R. The values returned are the expected values of the glm function given the inputs. In the example below, we are measuring the predictive value of variables A and B independently. See the documentation for the R formula function, or http://faculty.chicagobooth.edu/richard.hahn/teaching/formulanotation.pdf for more information. Validation of features that are differential under any of the variables identified by the aldex.glm function should be performed by plotting and for this the aldex.glm.effect function as a post-hoc test. Note that `aldex.clr` will accept the `denom="all"` or a user-defined vector of denominator offsets when a model matrix is supplied. Therefore, when it is intended that the downstream analysis is the `aldex.glm` function only those two denominator options are available. This will be addressed in a future release, and will likely be deprecated because the `aldex.clr` and all downstream functions are fully compatible with the scale modelling which is both more efficient and more easily understood. ```{r, echo=T, eval=F} data(selex) selex.sub <- selex[1200:1600, ] covariates <- data.frame("A" = sample(0:1, 14, replace = TRUE), "B" = c(rep(0, 7), rep(1, 7)), "Z" = sample(c(1,2,3), 14, replace=TRUE)) mm <- model.matrix(~ A + Z + B, covariates) x.glm <- aldex.clr(selex.sub, mm, mc.samples=4, denom="all", verbose=T) glm.test <- aldex.glm(x.glm, mm, fdr.method='holm') glm.eff<- aldex.glm.effect(x.glm) ``` The `aldex.glm.effect` function will calculate the effect size for all models in the matrix that are binary. The effect size calculations for each binary predictor are output to a named list. These effect sizes, and outputs from `aldex.glm` can be plotted and displayed as in Figure \@ref(fig:glm) using the example code below. The `aldex.glm.plot` function, which mirrors the `adlex.plot` function for glm and glm.effect outputs. ```{r glm, fig=TRUE,echo=TRUE,width=7,height=4, message=F, error=F, warning=F, fig.cap="The glm output can be plotted with the aldex.glm.plot function. The values should be nearly identical to the t-test in the same dataset. Here we show an effect (MW) plot of the with significant expected p-values for variables output from the glm function with model B (actual study design groups) highlighted in red. This toy model shows how to use a generalized linear model within theALDEx2 framework. All values are the expected value of the test statistic."} # NEW plot the glm.test and glm.eff data for particular contrasts aldex.glm.plot(glm.test, eff=glm.eff, contrast='B', type='MW', test='fdr') ``` ## The `aldex.kw` module Alternatively, the user can perform the Kruskal Wallace tests for one-way ANOVA of two or more conditions. Here there are only two inputs: the aldex object from aldex.clr, and the vector of conditions. Note that this is s.l.o.w! and the `aldex.glm()` function should be used instead. The user is on their own for plotting as this function should be viewed as unsupported. ```{r,echo=TRUE, eval=FALSE, results="hide"} x.kw <- aldex.kw(x) ``` # Adding scale to ALDEx2 While the actual size or scale of the underlying data is lost during the process of sequencing, we can determine if any of the underlying results are dependent on scale. This theory behind this new functionality is outlined in Nixon et al.^[@nixon2023scale] and McGovern et al.^[@McGovern2023]. Essentially, we can think of the `aldex.clr` function being able to add both measurement error through Dirichlet Monte-Carlos sampling of the actual measurements and scale uncertainty through random sampling of the geometric mean of those measurements. Indeed, as shown below, we can infer the actual scale difference between groups. Scale uncertainty increase the robustness of the results, and can be used to properly centre the data. We recommend using scale instead of using the denominator adjustment approach given in Wu et al.^[@Wu2021]. The example below is only a small sample of the scale functionality, see the `scale_sim` vignette for full description. We will demonstrate the ALDEx2 scale function in two ways; with essentially no scale (`gamma=1e-3`) and by making a full scale model. If necessary scale can be ignored completely by setting `gamma=NULL`, in which case previous default behavior is replicated exactly. The minimally scaled version is simply the outputs we have been examining until now of the in vitro selection dataset. Here we have a standard of truth, and a theoretical grounding that the difference in scale is real. We have a situation where only a minority of the parts change in both relative and absolute abundance. Prior analyses[@mcmurrough:2014] suffered from a false positive bias because of this, and were not amenable to analysis with any other tool. ```{r, echo=TRUE,eval=T, warning=F, message=F, fig.cap="Including scale uncertainty in the posterior model."} set.seed(11) data(selex) selex.sub <- selex[1201:1600,] conds <- c(rep("NS", 7), rep("S", 7)) # unscaled x.u <- aldex.clr(selex.sub, conds, mc.samples=16, gamma=1e-3, verbose=FALSE) x.u.e <- aldex.effect(x.u, CI=T) # scaled # this represents and 8-fold difference in scale between groups mu.mod <- c(rep(1,7), rep(8,7)) scale.model <- aldex.makeScaleMatrix(gamma=0.5, mu=mu.mod, conditions=conds, mc.samples=16, log=FALSE) x.ss <- aldex.clr(selex.sub, conds, mc.samples=16, gamma=scale.model, verbose=FALSE) x.ss.e <- aldex.effect(x.ss, CI=T) x.ss.tt <- aldex.ttest(x.ss) x.ss.all <- cbind(x.ss.e, x.ss.tt) ``` First, an explanation of the scale model that was made. The `aldex.makeScaleMatrix()` function makes a full scale model of the data that depends on the ratios between the scales of the two groups. In the case of the SELEX dataset, we are explicitly modelling an 8-fold difference in scale between the two datasets. Now let's look at the unscaled posterior distributions in Figure \@ref(fig:scale-pre), ```{r scale-pre, fig=TRUE,echo=TRUE,width=7,height=4, message=F, error=F, warning=F, fig.cap="The unscaled posterior distribution plot"} aldex.plotFeature(x.u, 'S:D:S:D', pooledOnly=T) ``` and compare this to Figure \@ref(fig:post-scale). The unscaled posteriors are very different, mainly because the control posterior is very narrow. The scaled posterior distributions also very different but are both more dispersed because of the additional uncertainty introduced by the scale model (note the difference in x-axis scales). In addition, the clr values are different because we are supplying the denominator values directly (log2(1) and log2(8) rather than calculating the geometric means of the samples. We stress here that the actual values of the denominator used is irrelevant, instead it is the ratio between the denominators of the two groups that is key. ```{r post-scale, fig=TRUE,echo=TRUE,width=7,height=4, message=F, error=F, warning=F, fig.cap="The scaled posterior distribution plot. Note the wider dispersion on the x axis"} aldex.plotFeature(x.ss, 'S:D:S:D', pooledOnly=T) ``` We can see the effect of scale if we plot the `aldex.effect` outputs from scaled and unscaled data. We see that the difference within groups is larger, the effect size is smaller but the difference between groups is unchanged when scale is added as shown in Figure \@ref(fig:comp-scale). ```{r comp-scale, fig=T, echo=T, width=5, height=3, fig.cap="Comparing unscaled and scaled difference, dispersion and effect size."} par(mfrow=c(1,3)) plot(x.u.e$diff.btw, x.ss.e$diff.btw, main='diff.btw', xlab='unscaled diff', ylab='scaled diff', pch=19, col='red') abline(0,1) plot(x.u.e$diff.win, x.ss.e$diff.win, main='diff.win', xlab='unscaled dispersion', ylab='scaled dispersion', pch=19, col='red') abline(0,1) plot(x.u.e$effect, x.ss.e$effect, main='effect size', xlab='unscaled effect', ylab='scaled effect', pch=19, col='red') abline(0,1) ``` We can also examine the effect plots, showing those parts where the 95\% CI of the effect does not overlap 0 for the unscaled and scaled outputs as an example as in Figure \@ref(fig:scale-CI). We see that adding scale uncertainty to the posterior model results in an increase in the dispersion of the data, but changes the difference between measurement very little. In this dataset we have a standard of truth in an external growth assay for several of the parts displayed here. The scaled data is more similar to what we expect and the S:E:G:D part is the only one displayed on this plot that has appreciable growth in a single growth assay[@mcmurrough:2014]. Thus, adding scale uncertainty removes a large number of false positive identifications, without affecting the identification of the one true positive. Indeed, the true positive stands out more with scale than with no scale uncertainty added. ```{r scale-CI, fig=TRUE,echo=TRUE,width=8,height=4, message=F, error=F, warning=F, fig.cap="Compare the unscaled and scaled plots. Here we see that adding scale uncertainty increases the dispersion with little effect on the difference. As a result, we increase our confidence in the reproducibility of the S:E:G:D feature, which has been shown to have in vitro enzymatic activity. In the third plot, which shows features with a BH valued less than 0.05, the S:E:G:D feature is the clear outlier."} par(mfrow=c(1,3)) sgn.u <- sign(x.u.e$effect.low) == sign(x.u.e$effect.high) sgn.s <- sign(x.ss.e$effect.low) == sign(x.ss.e$effect.high) plot(x.u.e$diff.win, x.u.e$diff.btw, pch=19, cex=0.3, col="grey",xlab="Dispersion", ylab="Difference", main="unscaled-CI", xlim=c(0.2,5), ylim=c(-2.5,10)) points(x.u.e$diff.win[abs(x.u.e$effect) >=2], x.u.e$diff.btw[abs(x.u.e$effect) >=2], pch=19, cex=0.5, col="red") points(x.u.e$diff.win[sgn.u], x.u.e$diff.btw[sgn.u], cex=0.7, col="blue") text(x.u.e['S:E:G:D', 'diff.win'], x.u.e['S:E:G:D', 'diff.btw']-0.6, labels='S:E:G:D') abline(0,2, lty=2, col="grey") abline(0,-2, lty=2, col="grey") plot(x.ss.e$diff.win, x.ss.e$diff.btw, pch=19, cex=0.3, col="grey",xlab="Dispersion", ylab="Difference", main="scaled-CI", xlim=c(0.2,5), ylim=c(-2.5,10)) points(x.ss.e$diff.win[abs(x.ss.e$effect) >=2], x.ss.e$diff.btw[abs(x.ss.e$effect) >=2], pch=19, cex=0.5, col="red") points(x.ss.e$diff.win[sgn.u], x.ss.e$diff.btw[sgn.u], cex=0.7, col="blue") text(x.ss.e['S:E:G:D', 'diff.win'], x.ss.e['S:E:G:D', 'diff.btw']-0.6, labels='S:E:G:D') abline(0,2, lty=2, col="grey") abline(0,-2, lty=2, col="grey") aldex.plot(x.ss.all, test='welch', main="scaled-BH", xlim=c(0.2,6)) abline(0,2, lty=2, col="grey") abline(0,-2, lty=2, col="grey") text(x.ss.all['S:E:G:D', 'diff.win'], x.ss.all['S:E:G:D', 'diff.btw']-0.6, labels='S:E:G:D') ``` \clearpage # ALDEx2 outputs ## Expected values `r Biocpkg("ALDEx2")` returns expected values for summary statistics. It is important to note that ALDEx uses Bayesian sampling from a Dirichlet distribution to estimate the underlying technical variation. This is controlled by the number of `mc.samples`, in practice we find that setting this to 16 or 128 is sufficient for most cases as `r Biocpkg("ALDEx2")` is estimating the expected value of the distributions^[@fernandes:2013;@fernandes:2014;@gloorAJS:2016]. In practical terms, `r Biocpkg("ALDEx2")` takes the biological observations as given, but infers technical variation (sequencing the same sample again) multiple times using the `aldex.clr` function. Thus, the expected values returned are those that would likely have been observed \emph{if the same samples had been run multiple times}. The user is cautioned that the number of features called as differential will vary somewhat between runs because of this sampling procedure. However, only features with values close to the chosen significance cutoff will vary between runs. The addition of scale allows the model to include potential biological variation that is lost during the sequencing library and data acquisition process. Several papers have suggested that `r Biocpkg("ALDEx2")` is unable to properly control for the false discovery rate since the p-values returned do not follow a random uniform distribution, but rather tend to cluster near a value of 0.5^[@hawinkel2017;@Thorsen:2016aa]. This is expected behaviour. These studies also show that point estimate approaches are very sensitive to particular experimental designs and differences in sparsity and read depth. However, `r Biocpkg("ALDEx2")` is not sensitive to these characteristics of the data, but seem to under-report the true FDR. The criticisms misses the mark on `r Biocpkg("ALDEx2")` because `r Biocpkg("ALDEx2")` reports the \emph{expected} p-value across the Dirichlet Monte-Carlo replicates. Features that are differential simply because of the vagaries of random sampling will indeed have a random uniform p-value as point estimates, but will have an expected p-value after repeated random sampling of 0.5. In contrast, features that are differential because of true biological variation are robust to repeated random sampling. Thus, `r Biocpkg("ALDEx2")` identifies as differential only those features where simple random sampling (the minimal NULL hypothesis) cannot explain the difference. In our experience, we observe that `r Biocpkg("ALDEx2")` returns a set of features that is very similar to the set returned as the intersect of multiple independent tools---a common recommendation when examining HTS datasets^[@Soneson:2013] ## Explaining the outputs ### `aldex.effect(x)` (rowname) | rab.all | rab.win.NS | rab.win.S |diff.btw | diff.win | effect | overlap -|-|-|-|-|-|-|-| A:D:A:D | 1.42494 | 1.30886 | 2.45384| 1.12261 | 1.72910 | 0.471043 | 0.267260701 A:D:A:E | 1.71230 | 1.49767 | 4.23315 | 2.73090 | 2.38134 | 1.034873 | 0.135857781 A:E:A:D | 3.97484 | 1.41163 | 11.02154| 9.64287 | 2.85008 | 3.429068 | 0.000156632 ### `aldex.ttest(x)` (rowname) | we.ep | we.eBH | wi.ep | wi.eBH -|-|-|-|- A:D:A:D | 4.030e-01 | 0.63080 | 0.239383 | 0.43732 A:D:A:E | 1.154e-01 | 0.34744 | 0.040901 | 0.15725 A:E:A:D | 8.987e-05 | 0.00329 | 0.000582 | 0.00820 In the list below, the `aldex.ttest` function returns the values highlighted with $\ast$ and the `aldex.effect` function returns the values highlighted with $\diamondsuit$. Note that when scale is applied that the clr is only used with the default scale model, and the log-ratio for the full scale model is calculated with a user-defined denominator instead of the geometric mean of each sample. In practice, the ratio between the denominators is more important than the actual way the denominator is calculated. 1. $\diamondsuit$ rab.all - median clr value for all samples in the feature 2. $\diamondsuit$ rab.win.NS - median clr value for the NS group of samples 3. $\diamondsuit$ rab.win.S - median clr value for the S group of samples 4. $\diamondsuit$ rab.X1\_BNS.q50 - median expression value of features in sample X1\_BNS if `include.item.summary=TRUE` 5. $\diamondsuit$ dif.btw - median difference in clr values between S and NS groups 6. $\diamondsuit$ dif.win - median of the largest difference in clr values within S and NS groups 7. $\diamondsuit$ effect - median effect size: diff.btw / max(diff.win) for all instances 8. $\diamondsuit$ overlap - proportion of effect size distribution that overlaps 0 (i.e. no effect) 9. $\ast$ we.ep - Expected p-value of Welch's t-test, a posterior predictive p-value 10. $\ast$ we.eBH - Expected Benjamini-Hochberg corrected p-value of Welch's t test 11. $\ast$ wi.ep - Expected p-value of Wilcoxon rank test 12. $\ast$ wi.eBH - Expected Benjamini-Hochberg corrected p-value of Wilcoxon test ### `aldex.glm(x)` The output from the `aldex.glm` function is somewhat different, although it still returns the expected values of the test statistics. For example, using the `selex` dataset and the model matrix from the `aldex.glm` help information: ```{r, eval=F, echo=TRUE,width=7,height=4, message=F, error=F, warning=F} conds <- data.frame("A" = sample(0:1, 14, replace = TRUE), "B" = c(rep(0, 7), rep(1, 7))) mm <- model.matrix(~ A + B, conds) x <- aldex.clr(selex, mm, mc.samples=4) glm.test <- aldex.glm(x) glm.eff <- aldex.glm.effect(x) ``` In this example the column names from `head(glm.test)`: 1. (rowname) - empty header, but rownames displayed 2. Intercept::Est - the expected intercept term 3. Intercept::SE - the expected intercept standard error 4. Intercept::t.val - the expected intersept t statistic 5. Intercept::pval - the expected intersept p-value 6. A:Est - the expected A term 7. A:SE - the expected A standard error 8. A:t.val - the expected A term t statistic 9. A:pval - the expected A term p-value term 10. B:Est - the expected B term 11. B:SE - the expected B standard error 12. B:t.val - the expected B term t statistic 13. B:pval - the expected B term p-value term 14. Intercept::pval.holm - the expected intersept adjusted p-value 15. A:pval.holm - the expected A term adjusted p-value 16. B:pval.holm - the expected B term adjusted p-value Note that we have generated both a random model ``A'' and the correct model ``B''. We anticipate that model A will give no statistically significant outputs, and that B will give similar results to a t-test, and indeed this is the case. The outputs are expected values of the intercept and its coefficients, and the estimates for each model and their coefficients. Also calculated is the Family Wide Error Rate (FWER) for each p-value (by default reported in the `model.x Pr(>|t|).holm` output. This is a Holm-Bonferroni FWER correction that is more powerful than the Bonferroni correction, and is a step-down correction that is more stringent than a false discovery rate correction such as Benjamini-Hochberg. Note that for a generalized linear model, Tukey's HSD is not applicable to control the FWER. This is because Tukey's HSD depends on an ANOVA in the background in R and packages that calculate Tukey's HSD essentially all do an `aov`, thus the FWER is calculated on a different test statistic than is calculated by the glm. The expected value of the Holm-Bonferroni test can be used as the post-hoc standin for Tukey's HSD. Note, that this is true only when there are a relatively small number of contrasts (up to 10) in the model^[@kim2015RWER]. For convenience the user can also select the Benjamini-Hochberg correction. ## A word about effect size and overlap The effect size metric used by `r Biocpkg("ALDEx2")` is a standardized distributional effect size metric developed specifically for this package. The measure is somewhat robust, allowing up to 20% of the samples to be outliers before the value is affected, returns an effect size that is 71% the size of Cohen's d on a Normal distribution, and requires at worst twice the number of samples as fully parametric methods (which are not robust) would to estimate values with the same precision. The metric is equally valid for Normal, random uniform and Cauchy distributions^(@Gloor submitted). We prefer to use the effect size whenever possible rather than statistical significance since an effect size tells the scientist what they want to know---"what is reproducibly different between groups"; this is not something that p-values deliver. We find that using the effect size returns a consistent set of true positive features irregardless of sample size, unlike p-value based methods. Furthermore, over half of the the false positive features that are observed at low sample sizes have and effect size $> 0.5 \times \mathrm{E}$ the chosen effect size cutoff $\mathrm{E}$. This is true regardless of the source of the dataset (@Gloor submitted). We suggest that an effect size cutoff of 1 or greater be used when analyzing HTS datasets. If preferred the user can also set a fold-change cutoff as is commonly done with p-value based methods, and volcano plots are now included as an option to help guide the user. ### Alternative plotting of outputs The built-in aldex.plot function described above will usually be sufficient, but for more user control the example in Figure 10 shows a plot that shows which features are found by the Welchs' or Wilcoxon test individually (blue) or by both (red). ```{r, fig=TRUE,echo=TRUE,fig.small=TRUE, fig.cap="Differential abundance in the selex dataset using the Welch's t-test or Wilcoxon rank test. Features identified by both tests shown in red. Features identified by only one test are shown in blue dots. Non-significant features represent rare features if black and abundant features if grey dots."} # identify which values are significant in both the t-test and glm tests found.by.all <- which(x.all$we.eBH < 0.05 & x.all$wi.eBH < 0.05) # identify which values are significant in fewer than all tests found.by.one <- which(x.all$we.eBH < 0.05 | x.all$wi.eBH < 0.05) # plot the within and between variation of the data plot(x.all$diff.win, x.all$diff.btw, pch=19, cex=0.3, col=rgb(0,0,0,0.3), xlab="Dispersion", ylab="Difference") points(x.all$diff.win[found.by.one], x.all$diff.btw[found.by.one], pch=19, cex=0.7, col=rgb(0,0,1,0.5)) points(x.all$diff.win[found.by.all], x.all$diff.btw[found.by.all], pch=19, cex=0.7, col=rgb(1,0,0,1)) abline(0,1,lty=2) abline(0,-1,lty=2) ``` \clearpage # Troubleshooting datasets In some cases we observe that the data can be asymmetric. This occurs when the data are extremely asymmetric, such as when one group is largely composed of features that are absent in the other group or when the direction of change in the experiment is not symmetric. In this case the geometric mean will not accurately represent the appropriate basis of comparison for each group. An asymmetry can arise for many reasons: in RNA-seq it could arise because samples in one group contain a plasmid and the samples in the other group do not; in metagenomics or 16S rRNA gene sequencing it can arise when the samples in the two groups are taken from different environments; in a selex experiment it can arise because the two groups are under different selective constraints. The asymmetry can manifest either as a difference in sparsity (i.e., one group contains more 0 value features than the other) or as a systematic difference in abundance. When this occurs the geometric mean of each sample and group can be markedly different, and thus an inherent skew in the dataset can occur that leads to false positive and false negative feature calls. Asymmetry generally shows as a the centre of mass of the histogram for the x.all\$diff.btw or x.all\$effect being not centred around zero^[@Wu2021]. We recommend that all datasets be examined for asymmetry. ## Using scale to correct for asymmetric datasets The preferred approach is to build a full scale model of the asymmetry and to use that as an input to `aldex.clr`. Fundamentally, asymmetry will often arise because of a difference in size of the underlying environment. For example, one cell type may grow faster than another, or may have a different total number of mRNA molecules than another, or one environment may have a different microbial carrying capacity than another. This is the situation that is best taken care of by a scale model^[@nixon2023scale;@McGovern2023]. The example dataset below is the simulated synth2 dataset from from Wu et al[@Wu2021]. In this example there are 1000 features with 20 (2\%) of those being asymmetrically sparse; i.e., 20 features are 0 in one group and non-zero in the other. A further 50 have an approximate 2-fold change (FC is normally distributed). The asymmetry has the effect of changing the geometric mean of the two groups. In addition, the data have an unrealistically low dispersion. The three plots show the behaviour of unmodified and scale corrected `ALDEx2`. The base output shows that some features are differentially abundant because of the inappropriate centre of the data. The approximate centre is the median difference of the points and is noted. Adding in uncertainty about the scale of the data has a minimal effect on the median difference, but does remove all false positives. However, now there is an asymmetry in the false negatives, in that those in the 'S' category are more likely to be called negatives than those in the 'N' category. Finally, adding in uncertainty about scale, and information on the fraction of the features that are asymmetric results in the desired output. We recommend the scale approach whenever possible, especially if there is some underlying biology that could drive the asymmetry. See the `scale_sim` vignette for more information on these new capabilities. The default `ALDEx2` scale correction includes only uncertainty and not asymmetry. ```{r scale, message=F, error=F, warning=F, fig.cap="Differential abundance in a synthetic dataset using different scale parameters for the clr calculation. In this data 2\\% of the features are modelled to be sparse in one group but not the other and a further 5\\% of the features are differentially abundant. Features determined to be different between two groups are shown in red. Features that are non-significant are in black. Lines of constant effect are drawn at 0, and $\\pm$ 1." } set.seed(11) data(synth2) # basic ALDEx2 blocks <- c(rep("N", 10),rep("S", 10)) x <- aldex.clr(synth2, blocks, gamma=NULL) x.e <- aldex.effect(x) # Add in asymmetry and scale variance # gamma can be smaller for the same output dispersion # in the full scale model built here # make a base scale for group 1 and 2 mu.vec = c(log2(rep(1,10)), log2(rep(1.02,10))) scale_samples <- aldex.makeScaleMatrix(gamma=0.25, mu=mu.vec, conditions=blocks) xs <- aldex.clr(synth2, blocks, gamma=scale_samples) xs.e <- aldex.effect(xs) # Add in only scale variance with sd=1 xg <- aldex.clr(synth2, blocks, gamma=0.5) xg.e <- aldex.effect(xg) par(mfrow=c(1,3)) aldex.plot(x.e, test='effect', main='base', xlim=c(0.1,3)) text(1, 4, labels=paste('median diff =',round(median(x.e$diff.btw),3))) aldex.plot(xg.e, test='effect', main='ratio=1, sd=0.5', xlim=c(0.1,3)) text(1.6, 4, labels=paste('median diff =',round(median(xg.e$diff.btw),3))) aldex.plot(xs.e, test='effect', main='ratio=1.02:1, sd=0.5', xlim=c(0.1,3)) text(1.6, 4, labels=paste('median diff =',round(median(xs.e$diff.btw),3))) ``` ## Subsetting features to correct for asymmetric datasets The other approach that can be taken by `r Biocpkg("ALDEx2")` is to identify those features that are relatively invariant in all features in the entire dataset even though many features could be asymmetric between the groups. Fundamentally, the log-ratio approach requires that the denominator across all samples be comparable. The output of `aldex.clr` contains the offset of the features used for the denominator in the `@denom` slot. These methods should be considered deprecated and may be removed in future releases since their outcomes can be completely reproduced using scale models which are much more easily explained and which provide insights into the underlying environment from which the data were collected. ### Parameter values for `denom` The `aldex.clr` function incorporates multiple approaches to select the denominator that can best deal with asymmetric datasets: _IMPORTANT: no rows should contain all 0 values as they will be removed by the `aldex.clr` function_ 1. _all_: The default is to calculate the geometric mean of all features using the centred log-ratio of Aitchison^[Aitchison:1986]. This is the default method for the compositional data analysis approach. 2. _iqlr_: The _iqlr_ method identifies those features that exhibit reproducible variance in the entire dataset. This is called the inter-quartile log-ratio or _iqlr_ approach. For this, a uniform prior of 0.5 is applied to the dataset, the clr transformation is applied, and the variance of each feature is calculated. Those features that have variance values that fall between the first and third quartiles of variance for all features in all groups in the dataset are retained. When aldex.clr is called, the geometric mean abundance of only the retained features is calculated and used as the denominator for log-ratio calculations. Modelling shows that this approach is effective in dealing with datasets with up to 25% of the features being asymmetric. The approach has the advantage it has little or no effect on symmetric datasets and so is a safe approach if the user is unsure if the data is mildly asymmetric. 3. _lvha_: This method identifies those features that in the bottom quartile for variance in each group and the top quartile for relative abundance for each sample and across the entire dataset. This method is appropriate when the groups are very asymmetric, but there are some features that are expected to be relatively constant. The basic idea here is to identify those features that are relatively constant across all samples, similar to the features that would be chosen as internal standards in qPCR. Experience suggests that meta-genomic and meta-transcriptomic datasets can benefit from this method of choosing the denominator. This method does not work with the selex dataset, since no features fit the criteria. 4. _zero_: This approach identifies and uses only those features that are non-zero in each group. In this approach the per-group non-zero features are used when `aldex.clr` calculates the geometric mean in the clr transformation. This method is appropriate when the groups are very asymmetric, but the experimentalist must ask whether the comparison is valid in these extreme cases. 5. _user_: The last new approach is to let the user define the set of `invariant' features. In the case of meta-rna-seq, it could be argued that the levels of housekeeping genes should be standard for all samples. In this case the user could define the row indices that correspond to the particular set of housekeeping genes to use as the standard. \emph{It is important that no row in the dataset contains all 0 values for any feature when this method is used.} 6. _iterate_: This method identifies those features that are not statistically significantly different between groups using the statistical test of choice. It may be combined with other methods. ## Very large datasets On occasion `r Biocpkg("ALDEx2")` may overflow the memory available. This is seen when very large datasets containing thousands of samples and tens of thousands of features are used. The workaround is simple. Run the `'aldex` function with `MC.samples=2` multiple times, and then take the average of the values that you want to use for inference. We recommend taking the mean of at least 16 such replicates. Testing shows that this recapitulates the expected result, and is not (too far off) what ALDEx2 is doing under the hood. \clearpage # Contributors I am grateful that `r Biocpkg("ALDEx2")` has taken on a life of its own. * Andrew Fernandes wrote the original ALDEx code, and designed `r Biocpkg("ALDEx2")`. He developed the effect size metric and the concept of reporting posterior outcomes in all cases. * Jean Macklaim found and squished many bugs, performed unit testing, did much of the original validation. Jean also made seminal contributions to simplifying and explaining the output of `r Biocpkg("ALDEx2")`. * Michelle Pistner Nixon and Justin Silverman implemented the scale idea and code and the `scale_sim` vignette. More scale corrections will be forthcoming. * Michelle Pistner Nixon and Justin Silverman implemented the rigorous posterior p-value idea and code * Matt Links incorporated several `r Biocpkg("ALDEx2")` functions into a multicore environment. * Adrianne Albert wrote the correlation and the one-way ANOVA modules in aldex.kw. * Ruth Grace Wong added function definitions and made the parallel code functional with BioConducter. * Jia Rong Wu developed and implemented the alternate denominator method to correct for asymmetric datasets. * Andrew Fernandes, Jean Macklaim and Ruth Grace Wong contributed to the Sum-FunctionsAitchison.R code. * Thom Quinn rewrote the t-test and Wilcoxon functions to make them substantially faster. He also wrote much of the current aldex.glm function. Tom's `r CRANpkg("propr")`^[@Quinn:2017] R package is able to integrate with the output from aldex.clr. * Vladimir Mikryukov and everyone named above have contributed bug fixes * Greg Gloor currently maintains `r Biocpkg("ALDEx2")` and had and has roles in documentation, design, testing and implementation. # Version information Version 1.04 of ALDEx was the version used for the analysis in Macklaim et al.^[@macklaim:2013;@fernandes:2013]. This version was suitable only for two-sample two-group comparisons, and provided only effect size estimates of difference between groups. ALDEx v1.0.4 is available at: \begin{verbatim}https://github.com/ggloor/ALDEx2/blob/master/ALDEx\_1.0.4.tar.gz\end{verbatim}. No further changes are expected for that version since it can be replicated completely within `r Biocpkg("ALDEx2")` by using only the `aldex.clr` and `aldex.effect` commands. Versions 2.0 to 2.05 were development versions that enabled p-value calculations. Version 2.06 of `r Biocpkg("ALDEx2")` was the version used for the analysis in^[@fernandes:2014]. This version enabled large sample comparisons by calculating effect size from a random sample of the data rather than from an exhaustive comparison. Version 2.07 of `r Biocpkg("ALDEx2")` was the initial the modular version that exposed the intermediate calculations so that investigators could write functions to analyze different experimental designs. As an example, this version contains an example one-way ANOVA module. This is identical to the version submitted to Bioconductor as 0.99.1. Future releases of `r Biocpkg("ALDEx2")` now use the Bioconductor versioning numbering. ```{r session} sessionInfo() ```