---
title: "Wrangling simulated outbreak data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Wrangling simulated outbreak data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

The {simulist} R package can generate line list data (`sim_linelist()`), contact tracing data (`sim_contacts()`), or both (`sim_outbreak()`). By default the line list produced by `sim_linelist()` and `sim_outbreak()` contains 12 columns. Some amount of post-simulation data wrangling may be needed to use the simulated epidemiological case data to certain applications. This vignette demonstrates some common data wrangling tasks that may be performed on simulated line list or contact tracing data.

```{r setup}
library(simulist)
library(epiparameter)
library(dplyr)
library(epicontacts)
```

This vignette provides data wrangling examples using both functions available in the R language (commonly called "base R") as well as using [tidyverse R packages](https://www.tidyverse.org/), which are commonly applied to data science tasks in R. The tidyverse examples are shown by default, but select the "Base R" tab to see the equivalent functionality using base R. There are many other tools for wrangling data in R which are not covered by this vignette (e.g. [{data.table}](https://rdatatable.gitlab.io/data.table/)).

::: {.alert .alert-info}
See these great resources for more information on general data wrangling in R:

* [R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund](https://r4ds.hadley.nz/)
* [{dplyr} R package](https://dplyr.tidyverse.org/)
* [{tidyr} R package](https://github.com/tidyverse/tidyr)
* [Wrangling data frames chapter in An Introduction to R by Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau](https://intro2r.com/wrangling-data-frames.html)
:::

## Simulate an outbreak

To simulate an outbreak we will use the `sim_outbreak()` function from the {simulist} R package.

::: {.alert .alert-info}
If you are unfamiliar with the {simulist} package or the `sim_outbreak()` function [Get Started vignette](simulist.html) is a great place to start.
:::

First we load in some data that is required for the outbreak simulation. Data on epidemiological parameters and distributions are read from the {epiparameter} R package.

```{r read-epidist}
# create contact distribution (not available from {epiparameter} database)
contact_distribution <- epiparameter(
  disease = "COVID-19",
  epi_name = "contact distribution",
  prob_distribution = create_prob_distribution(
    prob_distribution = "pois",
    prob_distribution_params = c(mean = 2)
  )
)

# create infectious period (not available from {epiparameter} database)
infectious_period <- epiparameter(
  disease = "COVID-19",
  epi_name = "infectious period",
  prob_distribution = create_prob_distribution(
    prob_distribution = "gamma",
    prob_distribution_params = c(shape = 1, scale = 1)
  )
)

# get onset to hospital admission from {epiparameter} database
onset_to_hosp <- epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to hospitalisation",
  single_epiparameter = TRUE
)

# get onset to death from {epiparameter} database
onset_to_death <- epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to death",
  single_epiparameter = TRUE
)
```

The seed is set to ensure the output of the vignette is consistent. When using {simulist}, setting the seed is not required unless you need to simulate the same line list multiple times.

```{r, set-seed}
set.seed(123)
```

```{r, sim-outbreak}
outbreak <- sim_outbreak(
  contact_distribution = contact_distribution,
  infectious_period = infectious_period,
  prob_infection = 0.5,
  onset_to_hosp = onset_to_hosp,
  onset_to_death = onset_to_death
)
linelist <- outbreak$linelist
contacts <- outbreak$contacts
```

## Censoring dates

The date event columns in simulated line lists are stored to double point precision, meaning they are the exact event times. It is unusual to not store `<Date>` objects as integers, as explained in `?Dates`, and the `print()` function for `<Date>`s does not show that they may be part way through a day.

Here we show this by printing the date of symptom onset for the simulated data, and then unclass it to show how it is stored internally.

```{r}
linelist$date_onset
unclass(linelist$date_onset)
```

The `censor_linelist()` function can be used after `sim_linelist()` to censor the event dates to a given precision. Here we show censoring the dates to daily and weekly intervals. The daily censoring dates will look that same as before, but the dates will have any value after the decimal point set to zero. The weekly censored dates will be printed differently.

```{r}
daily_cens_linelist <- censor_linelist(linelist, interval = "daily")
head(daily_cens_linelist)

weekly_cens_linelist <- censor_linelist(linelist, interval = "weekly")
head(weekly_cens_linelist)
```

See `?censor_linelist()` for more information on how to use this function.

By using `censor_linelist()` it avoids common mistakes when working with `<Date>` objects. For example, rounding a date that is over half way through a day will mistakenly result in the next day. Using `censor_linelist()` avoids this and other common mistakes.

```{r}
linelist$date_onset
round(linelist$date_onset)
daily_cens_linelist$date_onset
```

The censored line list dates can be used with methods that account for censoring when fitting delay distributions such as [{primarycensored}](https://primarycensored.epinowcast.org/).

## Under-reporting of cases and contacts {.tabset}

In this section we'll show how case line lists and contact tracing data sets can be subset to represent under-reporting, a common feature of real-world outbreak data, especially in resource-limited settings.

In the line list each case in unlinked (i.e. information on each row is independent of information on every other row). This means we can remove rows in the line list without having to augment any remaining rows. We assume for this example that the probability of being reported, and thus included in the line list, is independent on case type, sex, age and the phase of the outbreak. 

For this example we'll assume the case reporting probability in the line list is 50%.

### Tidyverse

```{r}
linelist %>%
  filter(as.logical(rbinom(n(), size = 1, prob = 0.5)))
```

### Base R

```{r}
idx <- as.logical(rbinom(n = nrow(linelist), size = 1, prob = 0.5))
linelist[idx, ]
```

## {-}

The above example randomly sample rows in the line list using the reporting probability resulting in different number of cases being kept each time the code is run. To subset the line list data and get the same number rows (i.e. cases) returned `slice_sample()` can be used instead.

```{r}
linelist %>%
  dplyr::slice_sample(prop = 0.5) %>%
  dplyr::arrange(id)
```

`slice_sample()` can reorder rows so we order by ID to keep the cases in order of symptom onset date.

On to under-reporting in contact tracing data. Unlike line list data, contact tracing data is linked. The direction of contact and possibly transmission is recorded in the `$from` and `$to` columns. 

For this example we will assume that the contact tracing under-reporting is applicable to infections and contacts that were not infected. However, the same method could be applied for under-reporting of the transmission chain by first subsetting to infections only (see `vis-linelist.Rmd` vignette for example).

We plot the full contact network so it can be compared to the contact networks with under-reporting plotted below.

```{r}
epicontacts <- make_epicontacts(
  linelist = linelist,
  contacts = contacts,
  id = "case_name",
  from = "from",
  to = "to",
  directed = TRUE
)
plot(epicontacts)
```

First we randomly sample who is not reported in the outbreak data. For this example we assume the pool of people that can be unreported is everyone in the contact network (infections and contacts), and assume a 50% reporting probability.

```{r}
all_contacts <- unique(c(contacts$from, contacts$to))
not_reported <- sample(x = all_contacts, size = 0.5 * length(all_contacts))
not_reported
```

Next we subset the contact tracing data by removing infectees if that are not reported. Because the contact tracing data is linked across rows, we also need to set any unreported infectees to `NA` for any secondary infections they cause.

```{r}
# make copy of contact tracing data for under-reporting
contacts_ur <- contacts
for (person in not_reported) {
  contacts_ur <- contacts_ur[contacts_ur$to != person, ]
  contacts_ur[contacts_ur$from %in% person, "from"] <- NA
}
head(contacts_ur)
```

We can plot this new contact network with {epicontacts}. We'll need to subset the line list to have the same unreported cases. 

```{r}
linelist_ur <- linelist[!linelist$case_name %in% not_reported, ]
epicontacts <- make_epicontacts(
  linelist = linelist_ur,
  contacts = contacts_ur,
  id = "case_name",
  from = "from",
  to = "to",
  directed = TRUE
)
plot(epicontacts)
```

The above example can be thought of as resulting from incomplete recording or recall of contacts. A second method for under-reporting of contact tracing data is to assume that if a case is unreported then all of the cases and contacts stemming from the unreported case are lost.  

For this example we'll sample a single individual not to report and then prune all cases and contacts from that individual in the network.

```{r}
all_contacts <- unique(c(contacts$from, contacts$to))
not_reported <- sample(x = all_contacts, size = 1)
not_reported
```

Then we can recursively pruned all cases and contacts that are the result from this individual (this can be zero if the person had no secondary cases or contacts).

```{r}
# make copy of contact tracing data for under-reporting
contacts_ur <- contacts
while (length(not_reported) > 0) {
  contacts_ur <- contacts_ur[!contacts_ur$to %in% not_reported, ]
  not_reported_ <- contacts_ur$to[contacts_ur$from %in% not_reported]
  contacts_ur <- contacts_ur[!contacts_ur$from %in% not_reported, ]
  not_reported <- not_reported_
}
head(contacts_ur)
```

Just as above we can plot the new contact network using {epicontacts}.

```{r}
# subset line list to match under-reporting in contact tracing data
linelist_ur <- linelist[linelist$case_name %in% unique(contacts$from), ]

epicontacts <- make_epicontacts(
  linelist = linelist_ur,
  contacts = contacts_ur,
  id = "case_name",
  from = "from",
  to = "to",
  directed = TRUE
)
plot(epicontacts)
```

There are more complex under-reporting depending on covariates in the line list and contact tracing data such as `$case_type` in the line list, with suspected cases most likely to go unreported, or `$status` in the contact tracing data, with `unknown` or `lost_to_followup` more likely to be under-reported.

## Removing a line list column {.tabset}

Not every column in the simulated line list may be required for the use case at hand. In this example we will remove the `$ct_value` column. For instance, if we wanted to simulate an outbreak for which no laboratory testing (e.g Polymerase chain reaction, PCR, testing) was available and thus a Cycle threshold (Ct) value would not be known for confirmed cases.

### Tidyverse

```{r, rm-ct-col-tidyverse}
# remove column by name
linelist %>% # nolint one_call_pipe_linter
  select(!ct_value)
```

### Base R

```{r, rm-ct-col-base}
# remove column by numeric column indexing
# ct_value is column 12 (the last column)
linelist[, -12]

# remove column by column name
linelist[, colnames(linelist) != "ct_value"]

# remove column by assigning it to NULL
linelist$ct_value <- NULL
linelist
```

## {-}