PandemicLP - Sum of Regions

library(PandemicLP)

In this vignette, we present an advanced use of the PandemicLP package for adjusting pandemic models for multiple regions. The package uses the theory presented in http://est.ufmg.br/covidlp/home/en/ and the main goal of this vignette is to present a way to make predictions for 2 or more regions in which the future values are calculated as the sum of each individual forecast.

If the user wishes to use any other epidemic or pandemic data, it is his responsibility to prepare the database for each of the regions considered. To see how to prepare the data check ?covid19BH.

Determining the data settings

Considering that the wish is to run the model for Covid-19 data, the first step is to define the vector that contains all the desired regions and define the final date to be considered for the available data. Besides that, the case_type can be ‘confirmed’ or ‘deaths’ and it determines which type of Covid data will be considered.

To exemplify, this vignette will make the prediction for the South Region of Brazil Covid-19 deaths using the data until October 1st.

regions <- c("PR", "SC","RS")
last_date <- "2020-10-01"
case_type <- "deaths"

If the user needs to run this example for other regions, he needs to change these variables above (regions, last_date and case_type) and then run the rest of the code without needing to change anything.

Loading data, estimating model and making predictions

With the regions well defined, a loop is created to download the covid data from online repositories using function load_covid() and then run the model and make predictions using functions pandemic_model() and posterior_predict(), respectively. The explanations about each of these functions can be accessed through the help of the package.

Inside the loop, it is possible to see that an “ifesle” statement is done to check if the regions specified are some Brazilian states. The check is done comparing the regions provided with the available states, listed by function state_list(). Besides that, it is important to mention that with the function pandemic_model(), it is possible to adjust models with different configurations (check ?pandemic_model) so the user can change it to their need. In this vignette, however, the configuration will be equal to the one used on the app (http://www.est.ufmg.br/covidlp/), which can be accessed through the argument ‘covidLPconfig = TRUE’ at pandemic_model() function.

The data, outputs, and predictions for each region are stored in lists.

data <- list()
outputs <- list()
preds <- list()
states <- state_list()
# Load precomputed MCMC
download.file("http://github.com/CovidLP/PandemicLP/raw/master/temp/PandemicLP_SumRegions.rda","./PandemicLP_SumRegions.rda")
load("./PandemicLP_SumRegions.rda")
for(i in 1:length(regions)) { 
  if (is.na(match(regions[i],states$state_abb))){
    data[[i]] <- load_covid(country_name=regions[i],last_date=last_date)
  } else {
    data[[i]] <- load_covid(country_name="Brazil", state_name = regions[i],last_date=last_date)
  }
  # Commenting to reduce vignette runtime
  # outputs[[i]] <- pandemic_model(data[[i]],case_type = case_type, covidLPconfig = TRUE)
  preds[[i]] <- posterior_predict(outputs[[i]])
}
# Load precomputed MCMC
download.file("http://github.com/CovidLP/PandemicLP/raw/master/temp/PandemicLP_SumRegions.rda","./PandemicLP_SumRegions.rda")
load("./PandemicLP_SumRegions.rda")

Preparing the data

After doing all the sums, we want the final object (data_base) to be pandemicPredicted class and to contain the predictions, the data, the name of the regions considered, the type of case used, and the past and future mu’s. So, to make sure that the data_base reflects all necessary format, we make it equal to the result of the posterior_predict function for the first considered region, and then we will add the information using a loop starting from the second region until all the regions are added.

# consider the file for the first region (to save the right format)
data_base <- preds[[1]]
bind_regions <- regions[1]
for (i in 2:length(regions)) {
  bind_regions <- paste(bind_regions, "and", regions[i])
}
data_base$location <- bind_regions

Doing the sum

In order to make the predictions considering the sum of all regions, some data manipulation is required. The predictions (long and short) and the futures mu’s can be summed since the end date is the same for all regions, so the forecasts start at the same moment. For the sum of the data, however, it is necessary to pay attention to the dates because the pandemic starts in each region at a different time so we must sum the cases only on the same date. All this will be done in a loop starting from the second region until the last one, since the information of the first region is already contained in the object (explained previously).

In the case of the past mu, the data frames containing the mu values are transposed in a way that each line represents a date and the columns are the values of each mu. After adding the column with the date, the data frame for each region is combined in only one and, at the end of the loop, it is aggregated by date. The final data frame must have the date column deleted and needs to be transposed again to stay in the original format.

To sum the covid data for each region, each dataset is merged by date in a way that all the information is considered for both objects in the merge. When there is data on a given date for one region and not for another, the column for the region that does not have the information is shown as “NA”, so it is changed to 0 to be added. Then, each column from one dataset is added to the corresponding column of the second one, respecting the dates. In the end, the duplicated column is deleted so that the loop can start again for another region without any problem.

Finally, to ensure that the pandemic_stats function has a horizon of data long enough to calculate pandemic summary information, some hidden objects are created by the posterior_predict function. These objects are created considering the number of steps to predict equals the maximum number between 1000 and the horizon specified by the user. In order to do predictions as the sum of regions it is important to sum these hidden objects as well. Once again, they can be summed directly because all the data are ending on the same date so the predictions are starting at the same moment.

# get the mean sample and set it to be dates x mcmc sample
mu_t <- t(data_base$pastMu)

# include the dates in the data frame
mu_final <- data.frame(data = data_base$data$date,mu_t)
names_mu <- names(data_base$pastMu)

# get hidden objects (necessary for the pandemic_stats function)
hidden_short_total <- methods::slot(data_base$fit,"sim")$fullPred$thousandShortPred 
hidden_long_total <- methods::slot(data_base$fit,"sim")$fullPred$thousandLongPred 
hidden_mu_total <- methods::slot(data_base$fit,"sim")$fullPred$thousandMus

# loop for each region - starting with the second one 
for (u in 2:length(regions)) {
  
  # preds for the selected region
  data_region <- preds[[u]]
  
  # sum the variables predictive_Long, predictive_Short and futMu 
  data_base$predictive_Long <- data_base$predictive_Long +
    data_region$predictive_Long
  data_base$predictive_Short <- data_base$predictive_Short +
    data_region$predictive_Short
  data_base$futMu <- data_base$futMu + data_region$futMu
  
  # create a large data frame by concatenating samples for current state in the mean data frame
  mu_t <- t(data_region$pastMu)
  mu_2 <- data.frame(data = data_region$data$date,mu_t)
  names_mu <- c(names_mu,names(data_region$pastMu))
  mu_final <- rbind(mu_final,mu_2)
  
  # merge datasets by date since they can differ on start
  data_base$data <- merge(data_base$data,data_region$data, 
                          by = "date", all = TRUE)
  data_base$data[is.na(data_base$data)] = 0
  data_base$data$cases.x = data_base$data$cases.x +
    data_base$data$cases.y
  data_base$data$deaths.x = data_base$data$deaths.x +
    data_base$data$deaths.y
  data_base$data$new_cases.x = data_base$data$new_cases.x +
    data_base$data$new_cases.y
  data_base$data$new_deaths.x = data_base$data$new_deaths.x +
    data_base$data$new_deaths.y
  data_base$data <- data_base$data[,-c(6:9)]
  names(data_base$data) <- c("date","cases","deaths",
                             "new_cases","new_deaths")
  
  # sum hidden objects (necessary for the pandemic_stats function) 
  hidden_short_region <- methods::slot(data_region$fit,"sim")$fullPred$thousandShortPred 
  hidden_short_total <- hidden_short_total + hidden_short_region
  hidden_long_region <- methods::slot(data_region$fit,"sim")$fullPred$thousandLongPred 
  hidden_long_total <- hidden_long_total + hidden_long_region
  hidden_mu_region <- methods::slot(data_region$fit,"sim")$fullPred$thousandMus
  hidden_mu_total <- hidden_mu_total + hidden_mu_region
  
}

# create hidden object (necessary for the pandemic_stats function)
methods::slot(data_base$fit,"sim")$fullPred$thousandShortPred <- hidden_short_total
methods::slot(data_base$fit,"sim")$fullPred$thousandLongPred <- hidden_long_total
methods::slot(data_base$fit,"sim")$fullPred$thousandMus <- hidden_mu_total

# aggregate the mean samples
mu_final <- aggregate(. ~ data, data=mu_final, FUN=sum)
mu_final <- mu_final[,-1]
mu_final <- t(mu_final)
names_mu <- unique(names_mu)
colnames(mu_final) <- names_mu
data_base$pastMu <- mu_final

Using the information

After finishing all the data manipulation, the object data_base is the final object that can be used to create plots or any other analysis that the user may want. If the user doesn’t want to see the summary information about the pandemic, the argument “summary = FALSE” can be included on the plot function to take out the notes.

plots <- plot(data_base,term = "both")
#> Plotting deaths cases
#> Short term plot created successfully
#> Long term plot created successfully
#> The generated plot(s) can be stored in a variable.
#> Long term plot: variable$long
#> Short term plot: variable$short
plots$long
plots$short