---
title: "Analyzing proportions with the Arrington et al. 2002 example"
bibliography: "../inst/REFERENCES.bib"
csl: "../inst/apa-6th.csl"
output: 
  rmarkdown::html_vignette
description: >
  This vignette describes how a real dataset with 4 factors can be analyzed.
vignette: >
  %\VignetteIndexEntry{Analyzing proportions with the Arrington et al. 2002 example}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE, results = 'hide', warning = FALSE}
cat("this is hidden; general initializations.\n")
```

_DISCLAIMER: This example is not terrific because there are empty cells. If you know 
of a better example of proportions with three factors, do not hesitate to let me know._

@a02 published a data set available from the web.
It presents species of fish and what proportion of them were empty stomached 
when caugth. The dataset contained 36000+ catches, which where identified by
their Location (Africa, North America, rest of America), by their Trophism
(their diet, Detrivore, Invertivore, Omnivore, Piscivore) and by the moment
of feeding (Diel: Diurnal or Nocturnal).

The compiled scores can be consulted with
```{r}
library(ANOPA)

ArringtonEtAl2002
```

One first difficulty with this dataset is that some of the cells are missing
(e.g., African fish that are Detrivore during the night). As is the case for 
other sorts of analyses (e.g., ANOVAs), data with missing cells cannot be 
analyzed because the error terms cannot be computed.

One solution adopted by @wh11 was to impute the missing
value. We are not aware if this is an adequate solution, and if so, what imputation
would be acceptable. Consider the following with adequate care.

Warton imputed the missing cells with a very small proportion. In ANOPA, both the 
proportions and the group sizes are required. We implemented a procedure that 
impute a count of 0.05 (fractional counts are not possible from observations, but
are not forbidden in ANOPA) obtained from a single observation.

Consult the default option with 

```{r}
getOption("ANOPA.zeros")
```

The analysis is obtained with

```{r}
w <- anopa( {s; n} ~  Trophism * Location * Diel, ArringtonEtAl2002)
```

The `fyi` message lets you know that cells are missing; the `Warning` message lets you 
know that these cells were imputed (you can suppress messages with 
`options("ANOPA.feedback"="none")`.

To see the result, use `summary(w)` (which shows the corrected and uncorrected statistics)
or `uncorrected(w)` (as the sample is quite large, the correction will be immaterial...), 

```{r}
uncorrected(w)
```

 
These suggests an interaction Diel : Trophism close to significant.

You can easily make a plot with all the 2 x 3 x 4 cells of the design using

```{r, message=FALSE, warning=FALSE, fig.width=5, fig.height=5, fig.cap="**Figure 1**. The proportions in the Arrington et al. 2002 data. Error bars show difference-adjusted 95% confidence intervals."}
anopaPlot(w)
```


To highlight the interaction, restrict the plot to

```{r, message=FALSE, warning=FALSE, fig.width=5, fig.height=3, fig.cap="**Figure 1**. The proportions as a function of class and Difficulty. Error bars show difference-adjusted 95% confidence intervals."}
anopaPlot(w, ~ Trophism * Location)
```

which shows clearly massive difference between Trophism, and small differences between
Omnivorous and Piscivorous fishes with regards to Location.

This can be confirmed by examining simple effects (a.k.a. expected marginal analyzes):

```{r}
e <- emProportions( w, ~ Location * Trophism | Diel  ) 
uncorrected(e)
```

As seen, we get a table with effects for each levels of Diel.

For the Diurnal fishes, there is a strong effect of Trophism. However, there is
no detectable effect of Location. Finally, there is no interaction.

For the Nocturnal fishes, nothing is significant.

These results don't quite match the proportions illustrated in the simple effect 
plot below. The reason is that the simple effect analyses uses a pooled 
measure of error (the Mean squared error). Sadly, this pooled measured 
incorporate cells for which the imputations gave tiny sample sizes (n=1).
The presence of these three empty cells are very detrimental to the analyses,
because if we except these cells, the sample is astonishinly large.



The missing cells are merged with regular cells in the simple effect plots. 

```{r, message=FALSE, warning=FALSE, fig.width=15, fig.height=5, fig.cap="**Figure 2**. The proportions with arrows highlighting the missing data"}
library(ggplot2)   # for ylab(), annotate()
library(ggh4x)     # for at_panel()
library(gridExtra) # for grid.arrange()

#Add annotations to show where missing cells are...
annotationsA <- list(
    annotate("segment", linewidth=1.25, x = 1.5, y = 0.45, xend = 1, yend = 0.15, color="green", arrow = arrow()),
    annotate("segment", linewidth=1.25, x = 1.5, y = 0.45, xend = 1, yend = 0.10, color="red", arrow = arrow()),
    annotate("segment", linewidth=1.25, x = 1.5, y = 0.45, xend = 2.8, yend = 0.25, color="red", arrow = arrow()),
    annotate("text", x = 1.5, y = 0.5, label = "No observations")
)
annotationsB <- list(
    annotate("segment", linewidth=2, x = 2.2, y = 0.08, xend = 1.95, yend = 0.04, color="red", arrow = arrow()),
    annotate("segment", linewidth=2, x = 2.2, y = 0.12, xend = 2.05, yend = 0.36, color="blue", arrow = arrow()),
    annotate("text", x = 2.0, y = 0.1, label = "Merge an imputed cell\nwith a regular cell")
)
annotationsC <- list(
    annotate("segment", linewidth=2, x = 1.4, y = 0.275, xend = 1.9, yend = 0.285, color="red", arrow = arrow()),
    annotate("segment", linewidth=2, x = 1.4, y = 0.275, xend = 2.0, yend = 0.16, color="blue", arrow = arrow()),
    annotate("text", x = 1., y = 0.275, label = "Merge an imputed cell\nwith a regular cell")
)

pla <- anopaPlot(w, ~ Trophism * Location * Diel)+  
  ylab("Proportion empty stomach") +
  lapply(annotationsA,  \(ann) {at_panel(ann, Diel == "Nocturnal")})

plb <- anopaPlot(w, ~ Diel * Trophism ) +  
  ylab("Proportion empty stomach") +
  annotationsB

plc <- anopaPlot(w, ~ Diel * Location ) +  
  ylab("Proportion empty stomach") +
  annotationsC

grid.arrange( pla, plb, plc, ncol = 3 )
```

Suppose we wish to impute the missing cells with the harmonic mean number
of success (21.6) and the harmonic mean number of observations per cell (286.9), 
we could use from the package `psych` 

```{r, message=FALSE, warning=FALSE}
library(psych)

options("ANOPA.zeros" = c(harmonic.mean(ArringtonEtAl2002$s), harmonic.mean(ArringtonEtAl2002$n)))
```

prior to running the above analyses. In which case it is found that
all the effects are massively significant, another way to see that
the sample size is huge (over 36,000 fishes measured).






# References