---
title: "Quick Start Guide: Estimating Epidemiological Parameters with `mitey`"
author: "Kylie Ainslie"
date: "`r Sys.Date()`"
bibliography: quick_start_guide_refs.bib
csl: nature.csl
output: 
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
    number_sections: true
vignette: >
  %\VignetteIndexEntry{Quick Start Guide: Estimating Epidemiological Parameters with `mitey`}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7, 
  fig.height = 5,
  dpi = 300,
  warning = FALSE,
  message = FALSE
)
```

## Setup

First, let's load the required packages:

```{r setup}
library(mitey)
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
library(fdrtool)
library(cowplot) 
library(here)
library(outbreaks)
```

# Introduction

This vignette demonstrates how to use the `mitey` package to estimate two key epidemiological parameters:

 - **Serial interval:** The time between symptom onset in an infector and symptom onset in an infectee
 - **Time-varying reproduction number (Rt):** The average number of secondary cases generated by each infected individual at time t

Understanding these parameters is crucial for characterizing disease transmission dynamics, evaluating intervention effectiveness, and designing control strategies. They provide the foundation for mathematical models that inform public health decision-making.

We'll use both simulated and real-world examples for each parameter, with a focus on scabies (a skin infestation caused by the mite Sarcoptes scabiei) for the real-world applications.

# Estimating the Serial Interval

The serial interval distribution is crucial for understanding disease transmission dynamics. The `mitey` package implements the method developed by Vink et al. (2014) [@vink2014] to estimate the mean and standard deviation of the serial interval from outbreak data. Accurate serial interval estimates are essential because they directly inform how we interpret the spread of disease over time and are key inputs for estimating the reproduction number.

## Simulated Data Example

First, let's generate simulated data with a known serial interval distribution to validate our method. We'll create Index Case-to-Case (ICC) intervals, which represent the time in days between symptom onset in the first identified case (index case) and symptom onset in subsequent cases:

```{r simulate-si-data}
set.seed(1234)

# Parameters for simulation
N <- 500           # Number of observations
true_mean <- 15    # True mean serial interval (days)
true_sd <- 3       # True standard deviation (days)
route_weights <- c(0.2, 0.5, 0.2, 0.1)  # Weights for transmission routes

# Generate data for different transmission routes
CP <- rhalfnorm((route_weights[1]*N), theta=sqrt(pi/2)/(sqrt(2)*true_sd))  # Co-Primary
PS <- rnorm(route_weights[2]*N, mean=true_mean, sd=true_sd)                # Primary-Secondary
PT <- rnorm(route_weights[3]*N, mean=2*true_mean, sd=sqrt(2)*true_sd)      # Primary-Tertiary
PQ <- rnorm(route_weights[4]*N, mean=3*true_mean, sd=sqrt(3)*true_sd)      # Primary-Quaternary

# Combine and round to days
sim_icc_intervals <- round(c(CP, PS, PT, PQ))

# Visualize the simulated data
hist(sim_icc_intervals, 
     breaks = seq(min(sim_icc_intervals)-0.5, max(sim_icc_intervals)+0.5, by=1),
     main = "Simulated ICC Intervals", 
     xlab = "Days since index case onset",
     col = "lightblue")
```

In this simulation, we generate different types of transmission routes:

- **Co-Primary (CP):** Cases infected by the same source around the same time
- **Primary-Secondary (PS):** Direct transmission from index case to secondary cases
- **Primary-Tertiary (PT):** Unobserved secondary case between two cases
- **Primary-Quaternary (PQ):** Unobserved secondary and tertiary cases between two cases

Now, let's estimate the serial interval using the `si_estim()` function:

```{r si-estimate-sim}
# Estimate serial interval assuming Normal distribution
si_results <- si_estim(sim_icc_intervals, dist = "normal")
si_results
```

```{r si_estimate-sim-res, echo=FALSE}
# Display results
cat("True parameters:\n")
cat("Mean:", true_mean, "days\n")
cat("SD:", true_sd, "days\n\n")

cat("Estimated parameters:\n")
cat("Mean:", round(si_results$mean[1], 2), "days\n")
cat("SD:", round(si_results$sd[1], 2), "days\n")
```

The `si_estim()` function uses a mixture model to distinguish between different transmission routes. The output includes:

- `mean`: Estimated mean serial interval (days)
- `sd`: Estimated standard deviation (days)
- `wts`: Weights for each transmission component in the mixture model

The choice between normal and gamma distributions for the serial interval depends on the disease characteristics. Normal distributions can accommodate negative serial intervals (which can occur when an infectee shows symptoms before their infector), while gamma distributions are strictly positive and often better suited for diseases with minimal pre-symptomatic transmission.

Let's visualize how well our estimated distribution fits the data:

```{r plot-sim-fit, fig.height=5, fig.width=7}
# Extract weights
weights <- c(si_results$wts[1], 
            si_results$wts[2] + si_results$wts[3],
            si_results$wts[4] + si_results$wts[5], 
            si_results$wts[6] + si_results$wts[7])

# Plot the fitted distribution
plot_si_fit(
  dat = sim_icc_intervals,
  mean = si_results$mean[1],
  sd = si_results$sd[1],
  weights = weights,
  dist = "normal"
) +
  ggtitle("Fitted Serial Interval Distribution (Simulated Data)") +
  theme(plot.title = element_text(hjust = 0.5))
```

The red curve shows the fitted mixture density, with the dashed vertical line indicating the estimated mean serial interval. Our estimation accurately recovers the true parameters from the simulated data.

## Real Data Example: Scabies Outbreaks

Now, let's analyze real data from scabies outbreaks. We'll use data from multiple studies published in the literature:

```{r load-real-si-data}
# Load scabies ICC interval data
file_path <-  here("vignettes", "data", "si_data.rds")
scabies_si_data <- readRDS(file_path)
```

Scabies has an unusually long serial interval compared to many other infectious diseases due to its biology and transmission dynamics. The mite requires time to establish on a new host, reproduce, and build up sufficient numbers to cause symptoms and become transmissible. This contrasts sharply with respiratory viruses like influenza, which typically have serial intervals of only 2-4 days.

We'll use `si_estim()` to estimate the serial interval for each study:

```{r si-estimate-real}
# Estimate serial interval for each study
result_by_study <- scabies_si_data %>%
  group_by(study) %>%
  summarise(result = list(si_estim(icc_interval))) %>%
  mutate(
    mean = map_dbl(result, "mean"),
    sd = map_dbl(result, "sd"),
    wts = map(result, "wts")
  ) %>%
  select(-result)

# Display results
result_by_study %>%
  select(study, mean, sd) %>%
  mutate(across(c(mean, sd), round, 2)) %>%
  arrange(mean) %>%
  knitr::kable(caption = "Estimated mean and standard deviation of serial interval (days) by study")
```

The notable variation in estimated serial intervals between studies (ranging from 98 to 167 days) likely reflects differences in study populations, settings (schools vs. nursing homes), and data collection methods. This heterogeneity underscores the importance of context-specific estimates for epidemiological modeling.

Let's process the results for visualization:

```{r process-si-results}
# Process weights for plotting
result_wide <- result_by_study %>%
  unnest(wts) %>%
  pivot_longer(
    cols = c(mean, sd, wts),
    names_to = "statistic",
    values_to = "value"
  ) %>%
  group_by(study, statistic) %>%
  mutate(
    occurrence = row_number(),
    statistic = if_else(statistic == "wts", paste0("weight_", occurrence), statistic)
  ) %>%
  filter(statistic != "mean" | occurrence == 1) %>%
  filter(statistic != "sd" | occurrence == 1) %>%
  select(-occurrence) %>%
  ungroup() %>%
  pivot_wider(
    names_from = statistic,
    values_from = value
  )

# Merge with original data for plotting
df_merged <- scabies_si_data %>%
  left_join(result_wide, by = "study", relationship = "many-to-many")
```

Let's visualize the fitted serial interval distribution for each study:

```{r plot-si-multi, fig.height=8, fig.width=10}
# Create a function to generate plot for each study
plot_study <- function(study_data) {
  study_name <- unique(study_data$study)
  
  plot_si_fit(
    dat = study_data$icc_interval,
    mean = study_data$mean[1],
    sd = study_data$sd[1],
    weights = c(study_data$weight_1[1], 
                study_data$weight_2[1] + study_data$weight_3[1],
                study_data$weight_4[1] + study_data$weight_5[1], 
                study_data$weight_6[1] + study_data$weight_7[1]),
    dist = "normal",
    scaling_factor = 0.25
  ) +
    ggtitle(study_name) +
    theme(plot.title = element_text(hjust = 0.5, size = 11))
}

# Generate plots for each study
study_plots <- df_merged %>%
  group_by(study) %>%
  group_split() %>%
  map(plot_study)

# Combine plots
combined_plot <- plot_grid(
  plotlist = study_plots,
  labels = "AUTO",
  ncol = 2
)

# Display combined plot
combined_plot
```

These plots show considerable variation in estimated serial intervals between studies (ranging from ~98 days to ~167 days). The differences may be due to variations in study settings, population characteristics, and data collection methods.

# Estimating Time-Varying Reproduction Number (Rt)

## Introduction

The time-varying reproduction number (Rt) is a key parameter in infectious disease epidemiology that quantifies how transmission potential changes over time. The `mitey` package implements the Wallinga-Lipsitch method for estimating Rt from time series of case counts. This vignette demonstrates how to use the `wallinga_lipsitch()` function with both simulated and real data.

### Method details

The Wallinga-Lipsitch method [@wallinga2007] establishes a relationship between the growth rate of an epidemic and the reproduction number. This approach:

- Uses the serial interval distribution to connect observed cases over time
- Calculates relative likelihoods of transmission between cases
- Adjusts for right-truncation (recent cases that haven't had time to infect others)
- Produces estimates of the case reproduction number (Rt)

The `wallinga_lipsitch()` function in `mitey` takes the following parameters:

```{r func_arguments, eval=FALSE}
wallinga_lipsitch(
  incidence,          # Vector of case counts
  dates,              # Vector of dates corresponding to incidence
  si_mean,            # Mean of serial interval distribution (days)
  si_sd,              # Standard deviation of serial interval 
  si_dist = "gamma",  # Distribution type ("gamma" or "normal")
  smoothing = 0,      # Window size for smoothing (0 for no smoothing)
  bootstrap = FALSE,  # Whether to compute bootstrap CIs
  n_bootstrap = 1000, # Number of bootstrap samples
  conf_level = 0.95,  # Confidence level for intervals
  shift = FALSE       # Whether to shift estimates by one serial interval
)
```

## Simulated Data Example

Let's generate epidemic data with a known time-varying reproduction number to validate our implementation:

```{r sim-rt}
set.seed(42)

# Simulation parameters
t_end <- 100        # Simulation duration (days)
si_mean <- 7        # Mean serial interval (days)
si_sd <- 2          # SD of serial interval (days)

# Define a time-varying reproduction number function
# This creates a pattern where Rt starts high, decreases below 1, then increases again
true_rt <- function(t) {
  if(t < 20) return(2.5)                  # Initial high Rt
  if(t < 40) return(2.5 - 0.075 * (t - 20)) # Linear decrease
  if(t < 60) return(1.0)                  # Stable period at Rt=1
  if(t < 80) return(1.0 + 0.05 * (t - 60))  # Linear increase
  return(2.0)                             # Final high Rt
}

# Create vector of true Rt values for plotting
true_rt_values <- sapply(1:t_end, true_rt)

# Initialize with some seed cases
cases <- numeric(t_end)
cases[1:5] <- c(1, 2, 3, 5, 8)

# Serial interval distribution (discretized normal)
si_pmf <- dnorm(0:30, mean = si_mean, sd = si_sd)
si_pmf <- si_pmf / sum(si_pmf)  # Normalize to sum to 1

# Generate incidence using renewal equation model
for(t in 6:t_end) {
  # Calculate expected new cases
  lambda <- 0
  for(s in 1:min(t-1, length(si_pmf))) {
    lambda <- lambda + cases[t-s] * true_rt(t-s) * si_pmf[s]
  }
  
  # Add randomness (negative binomial distribution)
  cases[t] <- rnbinom(1, mu = lambda, size = 10)
}

# Create dates sequence
sim_dates <- seq.Date(as.Date("2023-01-01"), by = "day", length.out = t_end)

# Create data frame
sim_epidemic <- data.frame(
  date = sim_dates,
  cases = cases,
  true_rt = true_rt_values
)

# Plot the simulated epidemic curve with true Rt
p1 <- ggplot(sim_epidemic, aes(x = date, y = cases)) +
  geom_col(fill = "steelblue") +
  labs(
    x = "Date",
    y = "Daily Cases",
    title = "Simulated Epidemic Curve"
  ) +
  theme_minimal()

p2 <- ggplot(sim_epidemic, aes(x = date, y = true_rt)) +
  geom_line(color = "red", linewidth = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  labs(
    x = "Date",
    y = "Reproduction Number",
    title = "True Time-Varying Rt"
  ) +
  ylim(0, 3) +
  theme_minimal()

# Combine plots
plot_grid(p1, p2, ncol = 1)
```

Note that the epidemic curve directly reflects changes in the reproduction number:

- When Rt > 1 (days 1-50, then after day 80), the epidemic grows
- When Rt = 1 (around days 50-60), the epidemic plateaus
- When Rt < 1 (days 60-80), the epidemic declines

Now, let's estimate Rt using the `wallinga_lipsitch()` function:

```{r estimate-rt}
# Estimate Rt using wallinga_lipsitch
rt_estimates <- wallinga_lipsitch(
  incidence = sim_epidemic$cases,
  dates = sim_epidemic$date,
  si_mean = si_mean,
  si_sd = si_sd,
  si_dist = "normal",
  smoothing = 7,        # 7-day smoothing window
  bootstrap = TRUE,
  n_bootstrap = 100     # Use more in practice
)

# Convert to data frame for plotting
rt_est_df <- as.data.frame(rt_estimates)

# Compare estimated vs true Rt
ggplot() +
  # True Rt
  geom_line(
    data = sim_epidemic, 
    aes(x = date, y = true_rt, color = "True Rt"),
    linewidth = 1
  ) +
  
  # Estimated Rt (with right-truncation correction)
  geom_line(
    data = rt_est_df, 
    aes(x = date, y = R_corrected, color = "Estimated Rt"),
    linewidth = 1
  ) +
  
  # Confidence intervals
  geom_ribbon(
    data = rt_est_df,
    aes(x = date, ymin = R_corrected_lower, ymax = R_corrected_upper),
    fill = "blue", alpha = 0.2
  ) +
  
  # R=1 threshold
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  
  # Styling
  scale_color_manual(
    values = c("True Rt" = "red", "Estimated Rt" = "blue"),
    name = ""
  ) +
  labs(
    x = "Date",
    y = "Reproduction Number (Rt)",
    title = "Comparison of True vs. Estimated Rt",
    subtitle = "Using Wallinga-Lipsitch method with 7-day smoothing"
  ) +
  ylim(0, 3) +
  theme_minimal() +
  theme(legend.position = "bottom")
```

Note that the estimated Rt (blue line) seems to lag behind the true Rt (red line). This is expected because the case reproduction number we're estimating is based on when cases are observed, while the instantaneous reproduction number (our true Rt) represents current transmission potential. The choice of smoothing window is important - too small and the estimates will be noisy, too large and they may miss important changes in transmission dynamics.

### Understanding the 'shift' Parameter

The `wallinga_lipsitch()` function includes a `shift` parameter that can help align case reproduction number estimates with instantaneous reproduction number estimates. When set to `TRUE`, this parameter shifts Rt estimates forward by one mean serial interval to better match the timing of changes in transmission:
```{r rt-shift}
# Re-estimate with shift=TRUE
rt_shifted <- wallinga_lipsitch(
  incidence = sim_epidemic$cases,
  dates = sim_epidemic$date,
  si_mean = si_mean,
  si_sd = si_sd,
  si_dist = "normal",
  smoothing = 7,
  bootstrap = FALSE,  # Skip bootstrap for speed
  shift = TRUE        # Use shift parameter
)

# Compare true Rt with both original and shifted estimates
ggplot() +
  # True Rt
  geom_line(
    data = sim_epidemic, 
    aes(x = date, y = true_rt, color = "True Rt"),
    linewidth = 1
  ) +
  
  # Original estimated Rt
  geom_line(
    data = rt_est_df, 
    aes(x = date, y = R_corrected, color = "Estimated Rt (original)"),
    linewidth = 1
  ) +
  
  # Shifted estimated Rt
  geom_line(
    data = as.data.frame(rt_shifted), 
    aes(x = shifted_date, y = R_corrected, color = "Estimated Rt (shifted)"),
    linewidth = 1
  ) +
  
  # R=1 threshold
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  
  # Styling
  scale_color_manual(
    values = c("True Rt" = "red", 
               "Estimated Rt (original)" = "blue",
               "Estimated Rt (shifted)" = "green"),
    name = ""
  ) +
  labs(
    x = "Date",
    y = "Reproduction Number (Rt)",
    title = "Effect of the 'shift' parameter on Rt estimates",
    subtitle = "Shifting forward by one serial interval (7 days)"
  ) +
  ylim(0, 3) +
  theme_minimal() +
  theme(legend.position = "bottom")
```

The shifted estimates align better with the true Rt values. The shift parameter moves estimates forward by one mean serial interval, which helps address the inherent lag in case-based Rt estimation.

## Real Data Example

### Zika outbreak in Giradot, Colombia (2015)
To illustrate how to apply the Wallinga-Lipsitch method to real data, we'll use data on daily incidence of Zika virus disease in Giradot, Colombia from October 2015 to January 2016. The data is available from the `outbreaks` package [@kamvar2019].

The Zika epidemic in Colombia was part of a larger outbreak across the Americas in 2015-2016. Zika virus is primarily transmitted by Aedes mosquitoes, and the outbreak in Colombia was notable for its rapid spread and the associated health impacts, including an increase in cases of microcephaly in newborns and Guillain-Barré syndrome [@rojas2016].

```{r get_zika_data}
data(zika_girardot_2015)

# Examine the data structure
str(zika_girardot_2015)
```

Let's have a look at the epidemic curve.
```{r zike_epicurve, echo=FALSE}
zika_epicurve <- ggplot(zika_girardot_2015, aes(x = date, y = cases)) +
  geom_col(fill = "steelblue") +
  scale_x_date(date_breaks = "14 days", date_labels = "%d-%m-%Y") +
  labs(
    x = "Date of onset",
    y = "Number of cases",
    title = "Daily Zika virus cases in Girardot, Colombia (2015-2016)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

zika_epicurve
```

Now we'll estimate the reproduction number using the `wallinga_lipsitch()` function we created. For Zika virus, we'll use a mean serial interval of 7 days with a standard deviation of 1.5 days, based on literature estimates for mosquito-borne diseases:

```{r rt-estimate-zika}
# Parameters for Zika
zika_si_mean <- 7  # days
zika_si_sd <- 1.5     # days

# Estimate Rt using the Wallinga-Lipsitch method
rt_zika <- wallinga_lipsitch(
  incidence = zika_girardot_2015$cases,
  dates = zika_girardot_2015$date,
  si_mean = zika_si_mean,
  si_sd = zika_si_sd,
  si_dist = "gamma",
  smoothing = 0,
  bootstrap = TRUE,
  n_bootstrap = 100,
  conf_level = 0.95
)

head(rt_zika)
```

Now let's visualise the reproduction number estimates.

```{r rt_viz}
# Prepare data for visualization
rt_plot_data <- rt_zika %>%
  filter(!is.na(R_corrected)) %>%
  # Skip the first 7 days of unstable estimates
  filter(date > min(date) + 7)

# Plot Rt over time
rt_plot <- ggplot(rt_plot_data, aes(x = date)) +
  geom_ribbon(aes(ymin = R_corrected_lower, ymax = R_corrected_upper), 
              fill = "#21908C", alpha = 0.2) +
  geom_line(aes(y = R_corrected), color = "#21908C", size = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "black") +
  #coord_cartesian(ylim = c(0, 3)) +
  scale_x_date(date_breaks = "14 days", date_labels = "%d-%m-%Y") +
  labs(
    x = "Time", 
    y = "Reproduction number (Rt)",
    title = "Estimated reproduction number for Zika in Girardot, Colombia",
    subtitle = paste0("Serial interval: ", zika_si_mean, " days (SD: ", zika_si_sd, " days)")
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Show Rt plot with epicurve
plot_grid(zika_epicurve, rt_plot, ncol = 1, rel_heights = c(1, 1.5))
```

The reproduction number plot tells an important epidemiological story:

- Initially, Rt was well above 1, indicating rapid epidemic growth
- Around November-December, Rt crosses below 1, suggesting that control measures or other factors were beginning to reduce transmission
- By January, Rt remains below 1, suggesting the epidemic was being controlled

This analysis demonstrates how Rt estimation can provide valuable insights for evaluating the effectiveness of interventions and predicting the future course of an epidemic.

# References