The reconstruction of the individual transmissions events that led to an outbreak gives valuable information on the causes of spread of an infectious disease. In recent years, methods have been developed to infer transmission trees from onset dates and genetic sequences. Nevertheless, sequences can be uninformative for pathogens with slow mutation rates, or sequencing can be scarce. We developed the package *o2geosocial* to integrate more variables from routinely collected surveillance data to reconstruct transmission trees when genetic sequences are not informative. The model introduced in *o2geosocial* takes the reported age-group, onset date, location and genotype of infected cases to infer probabilistic transmission trees. In this vignette, we describe the structure and the different functions of the package. We also provide a tutorial on a simulated measles outbreak showing how to run a model, evaluate the output and visualise the results of the inference. In the second part of the tutorial, we customise the likelihood functions to introduce a different mobility model.

The risk of outbreak in a given region is higher if there is low immunity against the virus in the population. Regional immunity against infectious diseases is built up by past infections and, if a vaccine is available, vaccination campaigns. Social and spatial heterogeneity in disease incidence or vaccine coverage lead to under-immunised areas, also called pockets of susceptibles. Importation of cases into these areas can cause large transmission clusters and long-lasting outbreaks. The most vulnerable areas of a country could be identified using historical data on local vaccine coverage and incidence, but these data are often scarce, or unreliable. Another solution is to infer probabilistic transmission trees and clusters to identify in which regions importations repeatedly caused large clusters of local transmission.

The *Wallinga-Teunis method* was developed to infer probabilistic transmission trees from onset dates, serial intervals and latent periods in a maximum likelihood framework. As genetic sequencing of pathogens during an outbreak became more common, new tools such as the R package *outbreaker2* showed that combining the timing of infection and the genetic sequences could improve the accuracy of inferred transmission trees. Nevertheless, sequencing pathogens remains costly, and the efficacy of the reconstruction methods depends on the proportion of sequenced cases, the quality of the sequences, and on the characteristics of the pathogen. For instance, the measles virus evolves very slowly, so sequences from unrelated cases can be very similar, which makes methods combining onset dates and genetic sequences ineffective.

Building upon the framework presented in *outbreaker2*, we developed the R package *o2geosocial* to estimate the cluster size distribution from the onset date, age, location and genotype of the cases. Those variables are routinely collected by surveillance systems and often well reported. Using age-stratified contact matrices and mobility models, we combined the different variables into a likelihood of connection between cases. The package *o2geosocial* is ideal to study outbreaks where sequences are uninformative, either because only a small proportion of cases were sequenced or because the virus evolves too slowly. In this vignette, we first introduce the overall structure of *o2geosocial*, the components of the individual likelihood and the parameters; then we present a tutorial on how to develop a model to reconstruct the cluster size distribution with *o2geosocial*.

The general implementation of *o2geosocial* follows the structure of *outbreaker2* and incorporates C++ functions into a R framework using *Rcpp*. The main function of the package, called `outbreaker()`

, uses Monte Carlo Markov Chains (MCMC) to sample from the posterior distribution. For each case, it infers the infection date, the infector, and the number of missing generations between the case and their infector. It takes five lists as inputs: i) ‘moves’, ii) ‘likelihoods’, iii) ‘priors’, iv) ‘data’, and v) ‘config’. These five lists can be generated and modified using the functions `custom_moves()`

, `custom_likelihoods()`

, `custom_priors()`

, `create_config()`

and `outbreaker_data()`

. Examples of how these functions are used to customize the model will be presented in the Tutorial section.

The package *o2geosocial* was developed for the study of outbreaks where full information on the onset date, location and age group of the cases is available. It builds upon *outbreaker2* by adding the effect of the location and the age-stratified contact data on the reconstruction of the transmission clusters. However, unlike *outbreaker2*, *o2geosocial* does not take genetic sequences as input. Genetic information (or genotype) is only used to exclude connections between cases, i.e. two cases cannot be from the same cluster if they belong to different genetic groups. Therefore, *o2geosocial* is adapted to reconstructing transmission clusters from large datasets (up to several thousand cases) where genetic sequencing is not informative, either because the mutation rate of the virus is slow, or because sequencing is scarce. On the other hand, *outbreaker2* is more adapted to reconstructing transmission chains if genetic sequences are informative and available for a high proportion of the cases.

In *o2geosocial* the genetic component of the likelihood is a binary value depending on the genotype of the cases. There can be only one genotype reported per transmission tree, hence the number of trees will be at least equal to the number of unique genotypes reported in the dataset. The genotype of a tree T is the genotype reported for at least one of the cases belonging to T. For each genotyped case \(i_{gen}\) and at every iteration, only cases from trees with the same genotype as \(i_{gen}\), or without reported genotype should be listed as potential infectors. Therefore, ungenotyped cases belonging to a tree with a different genotype will not be potential infectors of \(i_{gen}\) at this iteration. Ungenotyped cases can be placed on any tree.

Routinely collected surveillance data may include thousands of cases from unrelated outbreaks. Therefore, it was crucial to develop the algorithm to be computationally efficient. We added a pre-clustering step to the function `outbreaker()`

to reduce the pool of potential infectors for each case, which can greatly reduce the computing time of the MCMC (Figure 1). In the pre-clustering step, the potential infectors for each case are listed. Cases with overlapping potential infectors, and their potential infectors, are grouped together. Cases from different groups cannot infect one another. The group each case belongs to is listed in the variable ‘is_cluster’, which is an element of the list returned by the function `outbreaker_data()`

. We used three criteria to group together cases that may be connected: For each case \(i\), only cases reported in a radius of \(\gamma\) km, less than \(\delta\) days before \(i\), and from similar or unreported genotype can be classified as potential infectors of \(i\). The parameter \(\gamma\) and \(\delta\) are defined as inputs in the function `create_config()`

.

Cases classified in the same group after the pre-clustering stage may still be unrelated (e.g. if several importations were reported simultaneously in a nearby region). Because of this, we defined a likelihood threshold \(\lambda\) which allows connections to be discarded as unlikely. If the likelihood of connection between cases \(i\) and \(j\) is less than \(\lambda\), we consider it to be more likely that \(i\) was an import and unrelated to \(j\). In *o2geosocial*, the variable \(\lambda\) can either be an absolute (the log-likelihood threshold will be \(log(\lambda)\)) or relative value (a quantile of the likelihood of all connections in all samples), and is defined by the variables ‘outlier_threshold’ and ‘outlier_relative’ in `create_config()`

. If the initial number of importations is too small, unlikely connections between cases can alter the inferred infection dates of cases and bias the generated transmission trees. Therefore, we first run a short MCMC and compute \(n\), the minimum number of connections with a likelihood lower than \(\lambda\) in the samples; then we add \(n\) imports to the starting tree and run a long MCMC. Finally, we remove the connections with likelihood lower than \(\lambda\) in the final samples and return the infector, infection date and probability of being an import for each case (Figure 1).

The functions `custom_likelihoods()`

and `custom_priors()`

can be used to edit each component of the likelihood and priors.

The genetic component of the likelihood that a case \(i\) of genotype \(g_i\) was infected by a case \(j\) belonging to the tree \(\tau_j\) is defined as a binary value: \[G(g_{i},g_{\tau_{j}})= \left\{\begin{array}{l} 1\text{ if }g_{i}\text{ unknown}\\ 1\text{ if }g_{\tau_{j}}\text{ unknown }\\ 1\text{ if }g_{i}\text{ and }g_{\tau_{j}}\text{ both known and }g_i=g_{\tau_{j}} \\ 0\text{ otherwise} \end{array} \right. \]

Therefore, ungenotyped cases may not be potential infector if they belong to a tree that contains another genotype than \(g_i\). The algorithm must consider the genotype of each tree containing potential infectors.

As in the packages *outbreaker* and *outbreaker2*, we allow for missing cases in transmission chains. The number of generations between linked cases \(i\) and \(j\) \(\kappa_{ji}\) is equal to one if \(j\) infected \(i\). We define \(\rho\) as the conditional report ratio of the trees. The conditional report ratio is not the same as the overall report ratio of an outbreak because entirely unreported clusters, or missing cases before the ancestor of a tree will not change the value of \(\rho\). Only unreported cases within transmission chains will impact the conditional report ratio. By default, the probability of observing \(\kappa_{ji}\) missing generation between \(i\) and \(j\) from the conditional report ratio \(p(\kappa_{ji} | \rho)\) follows an exponential distribution.

The conditional report ratio is estimated during the MCMC runs using a beta distribution prior. The two parameters of the beta prior can be changed using the variable `prior_pi`

in `create_config()`

(default to \(Beta(10, 1)\)).

The time component of the likelihood is similar to what is used in the default version of *outbreaker2*. The probability of \(t_i\) being the infection date of the case \(i\) reported at time \(T_i\) depends on the distribution of the incubation period \(f\). The incubation period should be defined using the variable `f_dens`

in the function `outbreaker_data()`

.

The probability that \(i\) was infected by \(j\) given their respective inferred dates of infection \(t_i\) and \(t_j\) is defined by the generation time of the disease \(w^{\kappa_{ji}}(t_i-t_j)\) (variable `w_dens`

in `outbreaker_data()`

), and the number of generations \(\kappa_{ji}\) between \(i\) and \(j\). The operator \(w^{\kappa_{ji}}\) was defined as \(w^{\kappa_{ji}}= \Pi_{\kappa_{ji}}w\), where \(\Pi\) is the convolution operator.

In *o2geosocial*, we introduce a spatial component to the likelihood of connection between cases. By default, we implemented a gravity model to illustrate the probability of connection between two regions \(k\) and \(l\). Gravity models depend on the population sizes \(m_k\) and \(m_l\), and the distance between regions \(d_{kl}\). Given spatial parameters \(a\) and \(b\), the probability that a case in the region \(k\) was infected by a case reported in \(l\) is \(s(k,l)\):

\[s(k,l)= \frac{p_{kl}}{\Sigma_hp_{hl}}=\frac{F(d_{kl}, b)* m_k^a*m_l^c}{\Sigma_h(F(d_{hl}, b)*m_h^a *m_l^c)}=\frac{F(d_{kl}, b)* m_k^a}{\Sigma_h(F(d_{hl}, b)*m_h^a)}\]

If we use an exponential gravity model, \(F(d_{kl},a)=e^{-b*d_{kl}}\); for a power-law gravity model: \(F(d_{kl},a)=(\frac{1}{d_{kl}})^b\). The exponential gravity model has been shown to perform well to represent short-distance mobility patterns. As *o2geosocial* aims at reconstructing transmission in a community or a region, the default model in the function `outbreaker()`

is an exponential gravity model. The type of gravity model can be changed by setting the parameter `spatial_method`

to “power-law”: `create_config(spatial_method = "power_law")`

. Other mobility models can be used by developing modules. We give an example of module in the tutorial where we replace the exponential gravity model by a Stouffer’s rank model.

The parameters \(a\) and \(b\) are estimated during the MCMC run. This requires re-computing the probability matrix twice per iteration and can be time-consuming. Therefore, if either \(a\) or \(b\) is not fixed, we allow for a maximum of 1 missing generation between cases (\(max(\kappa_{ji})=2\)) and only compute \(s^1(k,l)\) and \(s^2(k,l)\) for regions that could potentially be connected. By default, the prior distribution of \(a\) and \(b\) are uniform.

Social contact matrices provided by large scale quantitative investigations such as the *POLYMOD study* can be used to calculate the probability of contact between cases of different age groups in different countries. These social contact matrices are imported using the R package *socialmixr*. In *o2geosocial*, given the age group of each case \(\alpha_{(1,..,N)}\) and the age-stratified social contact matrix, we introduced \(a^{\kappa_{ji}} (\alpha_i,\alpha_j)\), the probability that a case aged \(\alpha_j\) infected a case aged \(\alpha_i\). This corresponds to the proportion of contacts to \(\alpha_i\) that came from individuals of age \(\alpha_j\). The contact matrix can be modified using the variable `a_dens`

in `outbreaker_data()`

.

The overall likelihood that a case \(i\) was infected by the case \(j\) is equal to \(L_i(t_i, j, t_j, \theta ) = log(f(t_i - T_i)) + L_{ji}(t_i, t_j, \theta)\) where \(L_ji(t_i,t_j,\theta)\) is the likelihood of connection between \(i\) and \(j\) and is defined as:

\[L_ji(t_i,t_j,\theta)=log(\rho(\kappa_{ji}|\rho)*w^{\kappa_{ji}}(t_i-t_j)*a^{\kappa_{ji}} (\alpha_i,\alpha_j )*G(g_i,g_{\tau_j})*s^{\kappa_{ji}} (r_i,r_j | a,b))\]

At every iteration of the MCMC, a set of movements is used to propose an update of the transmission trees. This update is then accepted or rejected depending on the posterior density of the new trees. In *o2geosocial*, eight default movements are tested at each iteration. Three of them were already part of *outbreaker2* and were not modified (`cpp_move_t_inf()`

to change the infection date of cases; `cpp_move_pi()`

to change the conditional report ratio; `cpp_move_kappa()`

to change the number of generations between connected cases); two were edited to scan of the transmissions trees to prevent from having different genotypes in the same tree (`cpp_move_alpha()`

. `cpp_move_swap_cases()`

); The last three are new movements and were not part of *outbreaker2*:

`cpp_move_a()`

and`cpp_move_b()`

propose new values of the spatial parameters \(a\) and \(b\) and compute the matrix of probability of connection between regions.`cpp_move_ancestor()`

: An ancestor is defined as the first case of a transmission tree. As only non-ancestors are moved in`cpp_move_alpha()`

, we added this function to ensure good mixing of the ancestors of the transmission trees. For each ancestor \(i\), an index case is drawn from the poll of potential infectors, while another link is randomly picked and deleted. We then compare the new posterior value, and accept or reject the change.

There are two simulated datasets attached to *o2geosocial*: `toy_outbreak_short`

and `toy_outbreak_long`

. Both are lists describing simulated outbreaks and include 3 elements: i)`cases`

: a data table with the ID, location, onset date, genotype, age group, import status, cluster, generation and infector of each case; ii) `dt_regions`

: a data table with the ID, population, longitude and latitude of each region; iii)`age_contact`

: a numeric matrix of the proportion of contact between age groups. Both simulations were ran using distributions of the serial interval and the latent period typically associated with measles outbreaks The dataset `toy_outbreak_long`

contains 1940 cases simulated in the United States between 2003 and 2016, it can be used to reproduce the results described in https://github.com/alxsrobert/datapaperMO.

In this tutorial, we will analyse `toy_outbreak_short`

. We will run a first model where the import status is inferred, and a second model that takes the import status from the reference dataset and only estimates the transmission trees. We will then evaluate the agreement between the inferred and the reference transmission clusters, and observe the added value of knowing the import status of the cases. Finally, we will give insight into the interpretation of the results and the geographical heterogeneity of the reconstructed transmission dynamics.

We used the package *data.table* for handling data through the tutorial as it is optimised to deal with large datasets. The methods defined in `o2geosocial`

would work similarly if we had used the `data.frame`

syntax and format.

The results presented in this tutorial were generated using only 5,000 iterations to ensure a short run-time for the vignette. We recommend running longer chains to reach the convergence and sample from the posterior distribution, and using the package *coda* to evaluate the convergence of the MCMC chains.

The dataset contains 75 simulated cases from different census tracts of Ohio in 2014 (variable `cens_tract`

). The variable `cluster`

describes the transmission tree of each case, and “generation” is equal to the number of generations between the first case of the tree (`generation = 1`

) and the case. We import *o2geosocial* and extract the dataset `cases`

from `toy_outbreak_short`

.

```
library(o2geosocial)
## We used the data.table syntax throughout this example
library(data.table)
data("toy_outbreak_short")
print(toy_outbreak_short$cases, nrows = 10)
```

```
## ID State Date Genotype Cens_tract age_group import cluster
## 1: 112 Ohio 2014-01-01 B3 39005970100 6 TRUE 16
## 2: 75 Ohio 2014-01-06 D8 39139002400 11 TRUE 14
## 3: 116 Ohio 2014-01-12 B3 39101000400 11 TRUE 17
## 4: 113 Ohio 2014-01-13 B3 39005970100 6 FALSE 16
## 5: 145 Ohio 2014-01-13 D8 39117965300 8 TRUE 26
## ---
## 71: 55 Ohio 2014-05-27 B3 39147962500 9 FALSE 6
## 72: 129 Ohio 2014-05-30 G3 39139001500 3 TRUE 20
## 73: 142 Ohio 2014-05-31 B3 39101000600 4 TRUE 25
## 74: 143 Ohio 2014-06-10 B3 39101000200 4 FALSE 25
## 75: 144 Ohio 2014-06-12 B3 39101000501 13 FALSE 25
## generation infector_ID
## 1: 1 <NA>
## 2: 1 <NA>
## 3: 1 <NA>
## 4: 2 112
## 5: 1 <NA>
## ---
## 71: 3 54
## 72: 1 <NA>
## 73: 1 <NA>
## 74: 2 142
## 75: 2 142
```

First we plot the cluster size distribution of the reference data (Figure 2). \(95\%\) of the clusters contain less than 5 cases, \(47.6\%\) of the clusters are isolated (also called singletons). One larger cluster includes \(31\) cases (Figure 2).

```
# Reference cluster size distribution
hist(table(dt_cases$cluster), breaks = 0:max(table(dt_cases$cluster)),
ylab = "Number of clusters", xlab = "Cluster size", main = "", las=1)
```

We set up the distributions the model will use to reconstruct the transmission trees. We define `f_dens`

as the duration of the latent period, and `w_dens`

as the number of days between the infection dates of two connected cases, also called the serial interval. These distributions have previously been described for measles outbreaks.

```
# Distribution of the latent period
f_dens <- dgamma(x = 1:100, scale = 0.43, shape = 26.83)
# Distribution of the generation time
w_dens <- dnorm(x = 1:100, mean = 11.7, sd = 2.0)
```

The age specific social contact patterns can be imported from the element `age_contact`

of the list `toy_outbreak_short`

. Alternatively, one can use the R package *socialmixr* to import a social contact matrix from the POLYMOD survey.

```
# From the list toy_outbreak_short
a_dens <- toy_outbreak_short$age_contact
# Alternatively, from POLYMOD:
polymod_matrix <-
t(socialmixr::contact_matrix(socialmixr::polymod,
countries="United Kingdom",
age.limits=seq(0, 70, by=5),
missing.contact.age="remove",
estimated.contact.age="mean",
missing.participant.age="remove")$matrix)
polymod_matrix <-data.table::as.data.table(polymod_matrix)
# Compute the proportion of connection to each age group
a_dens <- t(t(polymod_matrix)/colSums(polymod_matrix))
```

Finally, the distance matrix between regions is set up from the data table `dt_regions`

, element of `toy_outbreak_short`

. We use the column `population`

to set up the population vector `pop_vect`

. We compute the distance between each region into the distance matrix `dist_mat`

using the package *geosphere*.

```
# Extract all regions in the territory
dt_regions <- toy_outbreak_short[["dt_regions"]]
# Extract the population vector
pop_vect <- dt_regions$population
# Create the matrices of coordinates for each region (one "from"; one "to")
mat_dist_from <- matrix(c(rep(dt_regions$long, nrow(dt_regions)),
rep(dt_regions$lat, nrow(dt_regions))), ncol = 2)
mat_dist_to <- matrix(c(rep(dt_regions$long, each = nrow(dt_regions)),
rep(dt_regions$lat, each = nrow(dt_regions))),
ncol = 2)
# Compute all the distances between the two matrices
all_dist <- geosphere::distGeo(mat_dist_from, mat_dist_to)
# Compile into a distance matrix
dist_mat <- matrix(all_dist/1000, nrow = nrow(dt_regions))
# Rename the matrix columns and rows, and the population vector
names(pop_vect) <- rownames(dist_mat) <- colnames(dist_mat) <-
dt_regions$region
```

Now that all the distributions and matrices are set up, we create the lists `data`

, `config`

, `moves`

, `likelihoods`

and `priors`

to run the main function of the package. In this example, we use the default parameters to build `moves`

, `likelihoods`

and `priors`

with the same elements as described in section “Implementation”. The list `data`

contains the distributions `f_dens`

and `w_dens`

, the population vector and the distance matrix, along with the onset dates, age group, location and genotype of the cases.

We implement two different models: in `out1`

the import status of the cases is inferred by the model, whereas in `out2`

the import status is set before the MCMC. The first short run in `out1`

is run with \(1,500\) iterations to find the import status, and the main run lasts for \(5,000\) iterations in both models. As the import status of the cases is inferred in `out1`

, we have to set a threshold to quantify what is an unlikely likelihood of transmission between cases. We use a relative outlier threshold at 0.9, which means that the threshold will be the \(9^{th}\) decile of the negative log-likelihoods \(L_i(t_i, j, t_j, \theta)\) in every sample.

```
# Default moves, likelihoods and priors
moves <- custom_moves()
likelihoods <- custom_likelihoods()
priors <- custom_priors()
# Data and config, model 1
data1 <- outbreaker_data(dates = dt_cases$Date, #Onset dates
age_group = dt_cases$age_group, #Age group
region = dt_cases$Cens_tract, #Location
genotype = dt_cases$Genotype, #Genotype
w_dens = w_dens, #Serial interval
f_dens = f_dens, #Latent period
a_dens = a_dens, #Age stratified contact matrix
population = pop_vect, #Population
distance = dist_mat #Distance matrix
)
config1 <- create_config(data = data1,
n_iter = 5000, #Iteration number: main run
n_iter_import = 1500, #Iteration number: short run
burnin = 500, #burnin period: first run
outlier_relative = T, #Absolute / relative threshold
outlier_threshold = 0.9 #Value of the threshold
)
# Run model 1
out1 <- outbreaker(data = data1, config = config1, moves = moves,
priors = priors, likelihoods = likelihoods)
# Set data and config for model 2
data2 <- outbreaker_data(dates = dt_cases$Date,
age_group = dt_cases$age_group,
region = dt_cases$Cens_tract,
genotype = dt_cases$Genotype, w_dens = w_dens,
f_dens = f_dens, a_dens = a_dens,
population = pop_vect, distance = dist_mat,
import = dt_cases$import #Import status of the cases
)
config2 <- create_config(data = data2,
find_import = FALSE, # No inference of import status
n_iter = 5000,
sample_every = 50, # 1 in 50 iterations is kept
burnin = 500)
# Run model 2
out2 <- outbreaker(data = data2, config = config2, moves = moves,
priors = priors, likelihoods = likelihoods)
```

The matrices `out1`

and `out2`

now contain the posterior, likelihood, and prior of the trees generated at every iteration, along with the values of the spatial parameters `a`

and `b`

, the conditional report ratio `pi`

, and the index, estimated infection date and number of generations for each case.

The function `summary`

prints a summary of the features of the output matrix generated by `outbreaker()`

. The function generates a list with the number of steps, the distributions of the posterior, likelihood and priors, the parameter distributions, the most likely infector and the probability of being an import for each case, and the cluster size distribution. For example, we can use it to describe the distribution of the inferred parameters. We remove the burn-in period, which corresponds to the number of iterations before the MCMC converged.

```
burnin <- config1$burnin
# Summary parameters a and b
#Model 1
print(summary(out1, burnin = burnin)$a)
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2293 0.6110 0.8198 0.8482 1.0952 1.3956
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.07378 0.08785 0.09573 0.09639 0.10441 0.13056
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3292 0.7156 0.8961 0.9330 1.1760 1.4921
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1071 0.1235 0.1351 0.1348 0.1441 0.1704
```

In order to evaluate the fit of the model, we plot the median inferred cluster size distribution in `out1`

and `out2`

and the credible intervals, and compare it with the reference data. First, we group together clusters of similar sizes. The breaks of each group is written in the vector `group_cluster`

. In this example, we defined the size categories as \(1\); \(2\); \(3-4\); \(5-9\); \(10-15\); \(15-40\) and \(40+\). The inferred cluster size distributions are shown in the element `cluster`

of the list generated by `summary(out1)`

, and can be aggregated using the input variable `group_cluster`

. The generated barplot shows that the number of isolated cases in `out1`

is lower than in the data. Therefore, when the import status of cases was inferred, the model underestimated the number of clusters and tended to link together unrelated cases (Figure 3). On the other hand, the cluster size distribution in `out2`

is very similar to the data.

Nevertheless, the cluster size distribution when the import status of the cases is inferred depends on the likelihood threshold set in `outlier_threshold`

and `outlier_relative`

. Using a looser or stronger definitions of \(\lambda\) would impact the cluster size distribution inferred in `out1`

.

```
# We create groups of cluster size: initialise the breaks for each group
group_cluster <- c(1, 2, 3, 5, 10, 15, 40, 100) - 1
# Reference data: h$counts
h <- hist(table(dt_cases$cluster), breaks = group_cluster, plot = FALSE)
# Grouped cluster size distribution in each run
clust_infer1 <- summary(out1, burnin = burnin,
group_cluster = group_cluster)$cluster
clust_infer2 <- summary(out2, group_cluster = group_cluster,
burnin = burnin)$cluster
# Merge inferred and reference cluster size distributions into one matrix
clust_size_matrix <- rbind(clust_infer1["Median",], clust_infer2["Median",],
h$counts)
# Plot
b <- barplot(clust_size_matrix, names.arg = colnames(clust_infer1), las=1,
ylab = "Number of clusters", xlab = "Cluster size", main = "",
beside = T, ylim = c(0, max(c(clust_infer1, clust_infer2))))
# Add the 50% CI
arrows(b[1,], clust_infer1["1st Qu.",], b[1,], clust_infer1["3rd Qu.",],
angle = 90, code = 3, length = 0.1)
arrows(b[2,], clust_infer2["1st Qu.",], b[2,], clust_infer2["3rd Qu.",],
angle = 90, code = 3, length = 0.1)
# Add legend
legend("topright", fill = grey.colors(3), bty = "n",
legend = c("Inferred import status",
"Known import status", "Simulated dataset"))
```

Although the aggregated cluster size distributions are close to the reference data, we have to investigate the reconstructed transmission trees to ensure the index assigned to each case is in agreement with the reference dataset. We write two functions: in `index_infer`

we compute the proportion of iterations where the inferred index is the actual index for each case (perfect match); In `index_clust`

we compute the proportion of iterations where the inferred index is from the same reference cluster as the actual index (close match).

```
#' Title: Compute the proportion of iterations in the outbreaker() output
#` where the inferred index matches the actual index in dt_cases
#'
#' @param dt_cases: reference dataset
#' @param out: Matrix output of outbreaker()
#' @param burnin: Numeric, length of the burnin phase
#'
#' @return Numeric vector showing the proportion of iterations pointing to
#' the correct index case
index_infer <- function(dt_cases, out, burnin){
out_index <- out[out$step > burnin, grep("alpha", colnames(out))]
ID_index <- matrix(dt_cases[unlist(out_index), ID], ncol = nrow(dt_cases))
match_infer_data <- t(ID_index) == dt_cases$infector_ID
match_infer_data[is.na(t(ID_index)) & is.na(dt_cases$infector_ID)] <- TRUE
prop_correct <- rowSums(match_infer_data, na.rm = T)/ncol(match_infer_data)
return(prop_correct)
}
# Same as index_infer, except it returns the proportion of inferred indexes
# who are in the same reference cluster as the case
index_clust <- function(dt_cases, out, burnin){
out_index <- out[out$step > burnin, grep("alpha", colnames(out))]
clust_index <- matrix(dt_cases[unlist(out_index), cluster],
ncol = nrow(dt_cases))
match_infer_data <- t(clust_index) == dt_cases$cluster
match_infer_data <- match_infer_data[!is.na(dt_cases$infector_ID),]
prop_correct <- rowSums(match_infer_data, na.rm = T)/ncol(match_infer_data)
return(prop_correct)
}
# Run index_infer for each model
index_infer1 <- index_infer(dt_cases = dt_cases, out = out1, burnin = burnin)
index_infer2 <- index_infer(dt_cases = dt_cases, out = out2, burnin = burnin)
# Run index_clust for each model
index_clust1 <- index_clust(dt_cases = dt_cases, out = out1, burnin = burnin)
index_clust2 <- index_clust(dt_cases = dt_cases, out = out2, burnin = burnin)
```

The plots show that inferring the import status of cases decreased the proportion of perfect and close match for most cases (Figure 4). This highlights the importance of using reliable contact tracing investigations to investigate the import status of cases.

```
# Plot the sorted proportion in each model
oldpar <- par(no.readonly =TRUE)
par(bty = "n", mfrow = c(1, 2), mar = c(5,4,2,0), oma = c(0, 0, 0, 0))
# Panel A
plot(sort(index_infer1), type = "l", ylab = "Proportion", xlab = "Case",
main = "A", las=1, col = grey.colors(3)[1], lwd = 3)
lines(sort(index_infer2), col = grey.colors(3)[2], lwd = 3)
# Panel B
plot(sort(index_clust1), type = "l", xlab = "Case", ylab = "",
main = "B", las=1, col = grey.colors(3)[1], lwd = 3)
lines(sort(index_clust2), col = grey.colors(3)[2], lwd = 3)
legend("bottomright", col = grey.colors(3)[1:2], lwd = 3, bty = "n",
legend = c("Inferred import status",
"Known import status"))
```

We now generate two maps to investigate the geographical distribution of the importations, and the average number of secondary cases per region in `out1`

and `out2`

. The maps are generated using the package *ggplot2*

First, we retrieve the boundary files of the census tract in Ohio. They will be used to generate the background of the maps, and are available using the library *tigris*. The boundary files are also available on the website *census.gov*. We import them in a format compatible with the package *sf* and create one background map for each model.

```
library(ggplot2)
# Read the shapefile and create one map for each model
map1 <- tigris::tracts(state = "Ohio", class = "sf", progress_bar = FALSE)
map1$INTPTLON <- as.numeric(map1$INTPTLON)
map1$INTPTLAT <- as.numeric(map1$INTPTLAT)
map2 <- map1
map1$model <- "Model 1"
map2$model <- "Model 2"
```

We are interested in two outputs of the models: i) the number of imports per region to show the regions where importations of cases are most likely, and ii) the geographical distribution of the number of secondary cases per case, which give insight into the areas most susceptible to the spread of the disease.

We compute the proportion of iterations where each case was an import in out1 and out2. The element `tree`

of `summary(out1)`

gives the most likely infector, the proportion of iterations where the index is the most likely infector (i.e. the support of the connection) and the median number of generations between the two cases, the most likely infection date and the chances of being an import for each case. We therefore add the columns `prop_infer1`

and `prop_infer2`

to `dt_cases`

. As the import status is not inferred in `out2`

, `prop_infer2`

is a binary value, and is the same as `dt_cases$import`

.

```
# Add the proportion of iteration in model 1 where each case is an import
dt_cases[, prop_infer1 := summary(out1, burnin = burnin)$tree$import]
# Add the proportion of iteration in model 2 where each case is an import
dt_cases[, prop_infer2 := summary(out2, burnin = burnin)$tree$import]
```

We then aggregate the number of imports per region in each model, and name these vectors `prop_reg1`

and `prop_reg2`

. We then add the number of imports in each region to the matrices describing the maps.

```
# Number of import per region in model 1
prop_reg1 <- dt_cases[, .(prop_per_reg = sum(prop_infer1)),
by = Cens_tract]$prop_per_reg
# Number of import per region in model 2
prop_reg2 <- dt_cases[, .(prop_per_reg = sum(prop_infer2)),
by = Cens_tract]$prop_per_reg
names(prop_reg1) <- names(prop_reg2) <- unique(dt_cases$Cens_tract)
# Add the number of imports in each region to the maps
map1$prop_reg <- prop_reg1[as.character(map1$GEOID)]
map2$prop_reg <- prop_reg2[as.character(map2$GEOID)]
```

We now plot the number of imports per region in each model (Figure 5). Regions where no cases were reported are shown in grey. The right panel (Model `out2`

) shows the geographical distribution of importations in the data.

We observe discrepancies between the two panels: although some regions have the correct number of imports inferred in most iterations, there are uncertainties for some imports in the left panel. Furthermore, the right panel shows there should be one import in 4 of the regions in the central area, which seems to be underestimated in the left panel. These maps highlight the uncertainty added by the inference of the import status of each case.

```
# Merge maps
maps <- rbind(map1, map2)
limits_long <- c(-84, -82)
limits_lat <- c(40, 41.5)
maps <- maps[maps$INTPTLON > limits_long[1] & maps$INTPTLON < limits_long[2],]
maps <- maps[maps$INTPTLAT > limits_lat[1] & maps$INTPTLAT < limits_lat[2],]
# Plot: number of imports per region, two panels
ggplot(maps) + geom_sf(aes(fill = prop_reg)) + facet_grid(~model) +
scale_fill_gradient2(na.value = "lightgrey", midpoint = 0.8,
breaks = c(0, 0.5, 1, 1.5), name = "Nb imports",
low = "white", mid = "lightblue", high = "darkblue") +
coord_sf(xlim = c(-83.8, -82.2), ylim = c(40.2, 41.3)) +
theme_classic(base_size = 9)
```

We now compute the average number of secondary cases per case in each region. We define the function `n_sec_per_reg¬, which computes the number of secondary cases per case and aggregates it per region at each iteration of the model. We then extract the median number of secondary cases per case in each region. The dispersion of the number of secondary cases per region can be generated by taking another quantile than the median. For example,`

n_sec1 <- apply(n_sec_tot1[,-1], 1, function(X) quantile(X, 0.25))` would show the first quartile in each region.

```
#' Title: Compute the number of secondary cases per case in each region
#'
#' @param dt_cases: reference dataset
#' @param out: Matrix output of outbreaker()
#' @param burnin: Numeric, length of the burnin phase
#'
#' @return A numeric matrix: the first column is the census tract ID, the
#' other columns show the number of secondary cases per case. Each row
#' corresponds to a different iteration.
n_sec_per_reg <- function(dt_cases, out, burnin){
## Number of secondary cases per case
n_sec <- apply(out[out$step > burnin, grep("alpha", colnames(out))], 1,
function(X){
X <- factor(X, 1:length(X))
return(table(X))})
## Aggregate by region
tot_n_sec_reg <- aggregate(n_sec, list(dt_cases$Cens_tract), sum)
## Divide by the number of cases in each region
tot_n_sec_reg <- cbind(tot_n_sec_reg[, 1],
tot_n_sec_reg[, -1] / table(dt_cases$Cens_tract))
return(tot_n_sec_reg)
}
n_sec_tot1 <- n_sec_per_reg(dt_cases = dt_cases, out = out1, burnin = burnin)
n_sec_tot2 <- n_sec_per_reg(dt_cases = dt_cases, out = out2, burnin = burnin)
n_sec1 <- apply(n_sec_tot1[,-1], 1, median)
n_sec2 <- apply(n_sec_tot2[,-1], 1, median)
names(n_sec1) <- names(n_sec2) <- unique(dt_cases$Cens_tract)
```

We then add the number of secondary cases per region to the matrices describing the maps.

```
map1$n_sec <- as.numeric(n_sec1[as.character(map1$GEOID)])
map2$n_sec <- as.numeric(n_sec2[as.character(map2$GEOID)])
```

We can now plot the geographical distribution of the median number of secondary cases in each region (Figure 6). The maps generated by the two models are very similar. Both of them show the spatial heterogeneity of the number of secondary infections. Some regions seem to consistently cause more secondary cases. There are minor discrepancies between the maps, but they show the same pattern. If we change the vectors `n_sec1`

and `n_sec2`

to plot different deciles, we can get an idea of the dispersion of the number of secondary cases in the different iterations of the models.

```
# Merge maps
maps <- rbind(map1, map2)
limits_long <- c(-84, -82)
limits_lat <- c(40, 41.5)
maps <- maps[maps$INTPTLON > limits_long[1] & maps$INTPTLON < limits_long[2],]
maps <- maps[maps$INTPTLAT > limits_lat[1] & maps$INTPTLAT < limits_lat[2],]
# Plot the geographical distribution of the number of secondary cases
ggplot(maps) + geom_sf(aes(fill = n_sec)) + facet_grid(~model) +
scale_fill_gradient2(na.value = "lightgrey", mid = "lightblue",
low = "white", midpoint = 1, high = "darkblue",
breaks = seq(0, 5, 0.5),name = "Sec cases") +
coord_sf(xlim = c(-83.8, -82.2), ylim = c(40.2, 41.3)) +
theme_classic(base_size = 9)
```

In the previous section, we ran and evaluated two different models to reconstruct transmission clusters from simulated surveillance data, and highlighted the spatial heterogeneity of measles transmission in the region. These models were run using the default likelihood, prior and movement functions. Now we develop a third model, where the spatial connection between regions is based on the Stouffer’s rank method.

The Stouffer’s rank model does not take absolute distance into account to compute the probability of connection between regions. In this model, the connectivity between the regions \(k\) and \(l\) only depends on the summed population of all the towns closer to \(l\) than \(k\). If we define this collection of towns \(\Omega_{k,l} = \{i: 0 \leq d(i,l) \leq d(k,l) \}\), the distance of Stouffer is then \(p_{kl} = m_l^c * \left(\frac{m_k}{\sum_{i \in \Omega_{k,l}} m_i}\right)^a\). From this, we deduce the probability that a case from region \(l\) was infected by a case from region \(k\) as \[s(k,l)= \frac{p_{kl}}{\Sigma_hp_{hl}}=\frac{\left(\frac{m_k}{\sum_{i \in \Omega_{k,l}} m_i}\right)^a}{\Sigma_h\left(\frac{m_h}{\sum_{i \in \Omega_{h,l}} m_i}\right)^a}\]

This model is similar to the power-law gravity model with two main differences:

Each cell of the distance matrix should be equal to \(\sum_{i \in \Omega_{k,l}} m_i\).

There is only one spatial parameter \(a\) to estimate.

First, we create the initial distance matrix in the Stouffer’s rank model `dist_mat_stouffer`

:

```
# For every column of the distance matrix, use the cumulative sum of the
# population vector ordered by the distance. Remove the values where
# the distance between the regions is above gamma
dist_mat_stouffer <- apply(dist_mat, 2, function(X){
pop_X <- cumsum(pop_vect[order(X)])
omega_X <- pop_X[names(X)]
# omega_X is set to -1 if the distance between two regions is above gamma
omega_X[X > config1$gamma] <- -1
return(omega_X)
})
# The new value of gamma is equal to the maximum of dist_mat_stouffer + 1
gamma <- max(dist_mat_stouffer) + 1
# The values previously set to -1 are now set to the new value of gamma
dist_mat_stouffer[dist_mat_stouffer == -1] <- max(dist_mat_stouffer) * 2
```

We write the function `cpp_stouffer_move_a`

to replace the default movement function `cpp_move_a`

. As the formula used to compute the Stouffer’s rank values is similar to the power law gravity models, we do not need to re-write `cpp_log_like`

, the default function to compute the probability matrix. Other distance models may require a rewriting of `cpp_log_like`

, which should then be called in the new movement function. In the Stouffer’s rank model, there is no parameter \(b\), both spatial parameters are equal to \(a\). We use the package *Rcpp* to source the new movement function.

```
// [[Rcpp::depends(o2geosocial)]]
#include <Rcpp.h>
#include <Rmath.h>
#include <o2geosocial.h>
// [[Rcpp::export()]]
Rcpp::List cpp_stouffer_move_a(Rcpp::List param, Rcpp::List data, Rcpp::List config,
Rcpp::RObject custom_ll, Rcpp::RObject custom_prior) {
// Import parameters
Rcpp::List new_param = clone(param);
double gamma = config["gamma"];
int max_kappa = config["max_kappa"];
Rcpp::String spatial = config["spatial_method"];
Rcpp::IntegerVector region = data["region"];
Rcpp::NumericMatrix distance = data["distance"];
Rcpp::NumericMatrix can_be_ances_reg = data["can_be_ances_reg"];
Rcpp::NumericVector population = data["population"];
Rcpp::NumericVector limits = config["prior_a"];
// Size of the probability matrix
Rcpp::List new_log_s_dens = new_param["log_s_dens"];
Rcpp::NumericMatrix probs = new_log_s_dens[0];
int nb_cases = pow(probs.size(), 0.5);
// Draw new value of a
Rcpp::NumericVector new_a = new_param["a"];
double sd_a = static_cast<double>(config["sd_a"]);
double old_logpost = 0.0, new_logpost = 0.0, p_accept = 0.0;
// proposal (normal distribution with SD: config$sd_a)
new_a[0] += R::rnorm(0.0, sd_a); // new proposed value
if (new_a[0] < limits[0] || new_a[0] > limits[1]) {
return param;
}
// Generate new probability matrix
new_param["log_s_dens"] = o2geosocial::cpp_log_like(population, distance,
can_be_ances_reg,
new_a[0], new_a[0],
max_kappa, gamma,
spatial, nb_cases);
// Compare old and new likelihood values
old_logpost = o2geosocial::cpp_ll_space(data, config, param,
R_NilValue, custom_ll);
new_logpost = o2geosocial::cpp_ll_space(data, config, new_param,
R_NilValue, custom_ll);
// Add prior values
old_logpost += o2geosocial::cpp_prior_a(param, config, custom_prior);
new_logpost += o2geosocial::cpp_prior_a(new_param, config, custom_prior);
// Accept or reject proposal
p_accept = exp(new_logpost - old_logpost);
if (p_accept < unif_rand()) {
return param;
}
return new_param;
}
```

We modify the element \(a\) of the list `moves`

, and replace it with `cpp_stouffer_move_a`

. As `b`

is not estimated in this model, we create the null function `f_null`

, and modify the list `priors`

.

```
moves3 <- custom_moves(a = cpp_stouffer_move_a)
f_null <- function(param) {
return(0.0)
}
priors3 <- custom_priors(b = f_null)
```

Finally, we set up the lists `data`

and `config`

: the distance matrix for this run is `dist_mat_stouffer`

; we do not move the parameter `b`

, and we change the value of `gamma`

. We then run `out_stouffer`

.

```
# Set data and config lists
data3 <- outbreaker_data(dates = dt_cases$Date, #Onset dates
age_group = dt_cases$age_group, #Age group
region = dt_cases$Cens_tract, #Location
genotype = dt_cases$Genotype, #Genotype
w_dens = w_dens, #Serial interval
f_dens = f_dens, #Latent period
a_dens = a_dens, #Age stratified contact matrix
population = pop_vect, #Population
distance = dist_mat_stouffer #Distance matrix
)
config3 <- create_config(data = data3,
gamma = gamma,
move_b = FALSE, # b is not estimated
init_b = 0,
spatial_method = "power-law",
n_iter = 5000, #Iteration number: main run
n_iter_import = 1500, #Iteration number: short run
burnin = 500, #burnin period: first run
outlier_relative = T, #Absolute / relative threshold
outlier_threshold = 0.9 #Value of the threshold
)
# Run the model using the Stouffer's rank method
out_stouffer <- outbreaker(data = data3, config = config3, moves = moves3,
priors = priors3, likelihoods = likelihoods)
```

As we did in the previous section, we plot the inferred cluster size distribution and compare it to the reference data (Figure 7). We observe discrepancies between the inferred distribution and the data: the model over-estimates the number of larger clusters (above 5 cases).

```
# Grouped cluster size distribution in the Stouffer's rank model
clust_infer_stouf <- summary(out_stouffer, burnin = burnin,
group_cluster = group_cluster)$cluster
# Merge inferred and reference cluster size distributions
clust_size_matrix <- rbind(clust_infer_stouf["Median",], h$counts)
# Plot the two distributions
b <- barplot(clust_size_matrix, names.arg = colnames(clust_infer_stouf),
beside = T)
# Add CIs
arrows(b[1,], clust_infer_stouf["1st Qu.",], b[1,],
clust_infer_stouf["3rd Qu.",], angle = 90, code = 3, length = 0.1)
legend("topright", fill = grey.colors(2), bty = "n",
legend = c("Inferred import status, Stouffer's rank method",
"Simulated dataset"))
```

Finally, we plot the proportion of perfect and close match for each case (Figure 8). We observe that the fit obtained with the Stouffer’s rank method is consistently worse than the first two models. The Stouffer’s rank method did not improve the agreement between the inferred trees and the reference data.

The reference data used in the study were simulated using an exponential gravity model, which explains why introducing the Stouffer’s rank method did not improve the inference. This is not representative of the performance of each mobility model on real-world data.

```
index_infer_stouf <- index_infer(dt_cases = dt_cases, out = out_stouffer,
burnin = config1$burnin)
index_clust_stouf <- index_clust(dt_cases = dt_cases, out = out_stouffer,
burnin = config1$burnin)
# Plot the sorted proportion in each model
par(bty = "n", mfrow = c(1, 2), mar = c(5,4,2,0), oma = c(0, 0, 0, 0))
# Panel A
plot(sort(index_infer_stouf), type = "l", xlab = "Case", ylab = "",
main = "A", las=1, col = grey.colors(2)[1], lwd = 3)
# Panel B
plot(sort(index_clust_stouf), type = "l", ylab = "Proportion", xlab = "Case",
main = "B", las=1, col = grey.colors(2)[1], lwd = 3)
```

The R package *o2geosocial* is a new tool for data analysis building upon the framework developed in *outbreaker2*. It is highly flexible and only requires routinely collected surveillance data. It can be used to identify large transmission clusters and highlight the dynamics of transmission in different regions. An application of the algorithm on a small geographical scale was presented in the tutorial, it can also be used to study datasets of cases scattered across larger areas. The methods presented in the tutorial can be applied to the second dataset included in the package `toy_outbreak_long`

, which includes more than 1900 cases simulated in the United States. The computation time increases with the number of regions and the number of cases.

We showed how the model could be edited to implement a range of mobility models. Describing human mobility during infectious diseases outbreaks can be challenging, and the performances of the models depend on the setting studied. The package can be extended to take into account the variety of existing mobility models. We encourage the development of extensions and the use of *o2geosocial* to study different pathogens and settings where sequence data cannot be used to reconstruct transmission clusters. We hope that wider use of *o2geosocial* can help maximise what can be reconstructed from routinely collected data and epidemiological investigations to improve our understanding of outbreak dynamics.