---
title: "Two-sample tests based on the 2-Wasserstein distance"
output: rmarkdown::html_vignette
vignette: >
    %\VignetteIndexEntry{wasserstein_test}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```


## Testing procedures

The `waddR` package provides two testing procedures using the 2-Wasserstein
distance to test whether two distributions $F_A$ and $F_B$ given in the form of
samples are different by specifically testing the null hypothesis
$H_0: F_A = F_B$ against the alternative
$H_1: F_A \neq F_B$.


The first, semi-parametric (SP), procedure uses a test based on permutations
combined with a generalized Pareto distribution approximation to estimate
small p-values accurately.


The second procedure (ASY) uses a test based on asymptotic theory which is
valid only if the samples can be assumed to come from continuous distributions.

## Examples

To demonstrate the capabilities of the testing procedures, we consider models
based on normal distributions here.
We exemplarily construct three cases in which two distributions (samples)
differ with respect to location, size, and shape, respectively, and one case
without a difference.
For convenience, we focus on the p-value, the value of the (squared)
2-Wasserstein distance, and the fractions of the location, size, and shape
terms (in \%) with respect to the (squared) 2-Wasserstein distance here, while
the functions also provide additional output.


```{r setup}
library(waddR)

spec.output<-c("pval","d.wass^2","perc.loc","perc.size","perc.shape")
```

We start with an example, in which the two distributions (samples) are supposed to only differ
with respect to the location, and show the results for the two testing
procedures (SP and ASY).

Note that the semi-parametric method "SP" with a permutation number of 10000 is used by default in 
`wasserstein.test` if nothing else is specified. The permutation procedure in the "SP" method uses a random group assignment step, and to obtain reproducible results, a seed can be set from the user environment before calling `wasserstein.test`.

```{r diff_loc}
set.seed(24)
ctrl<-rnorm(300,0,1)
dd1<-rnorm(300,1,1)
set.seed(32)
wasserstein.test(ctrl,dd1,method="SP",permnum=10000)[spec.output]
wasserstein.test(ctrl,dd1,method="ASY")[spec.output]
```
We obtain a very low p-value, clearly pointing at the existence of a difference, and
see that differences with respect to location make up by far the major part of
the 2-Wasserstein distance.


Analogously, we look at a case in which the two distributions (samples) are supposed to only
differ with respect to the size.

```{r diff_size}
set.seed(24)
ctrl<-rnorm(300,0,1)
dd2<-rnorm(300,0,2)
set.seed(32)
wasserstein.test(ctrl,dd2)[spec.output]
wasserstein.test(ctrl,dd2,method="ASY")[spec.output]
```


Similarly, we consider an example in which the two distributions (samples) are supposed to only
differ with respect to the shape.
```{r diff_shape}
set.seed(24)
ctrl<-rnorm(300,6.5,sqrt(13.25))
sam1<-rnorm(300,3,1)
sam2<-rnorm(300,10,1)
dd3<-sapply(1:300, 
              function(n) {
                sample(c(sam1[n],sam2[n]),1,prob=c(0.5, 0.5))})
set.seed(32)                
wasserstein.test(ctrl,dd3)[spec.output]
wasserstein.test(ctrl,dd3,method="ASY")[spec.output]
```

Finally, we show an example in which the two distributions (samples) are supposed to be equal. We obtain a high p-value, indicating that the null hypothesis cannot be rejected.
```{r no_diff}
set.seed(24)
ctrl<-rnorm(300,0,1)
nodd<-rnorm(300,0,1)
set.seed(32)
wasserstein.test(ctrl,nodd)[spec.output]
wasserstein.test(ctrl,nodd,method="ASY")[spec.output]
```

## See also

* The [`waddR` package](waddR.html)

* [Calculation of the  Wasserstein distance](wasserstein_metric.html)

* Detection of 
[differential gene expression distributions](wasserstein_singlecell.html)
in scRNAseq data

## Session info

```{r session-info}
sessionInfo()
```