---
title: "bigMICE: multiple imputation for Big Data"
bibliography: reference.bib
link-citations: yes
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{bigMICE: multiple imputation for Big Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

bigMICE is an R package built on the `sparklyr` library, designed for multiple imputation of large datasets using an efficient and scalable approach. Detailed information about the package and the associated numerical experiments can be found in our manuscript [@morvan2026bigMICE].

## Setup and recommendations

### Spark and sparklyr

Optionally, the following commands can be run once to set up an isolated environment for a new project:

```{r eval=FALSE}
install.packages("renv")
library(renv)
renv::init()
```

Install sparklyr and Spark (run once) with the following commands. Make sure to install compatible versions of `sparklyr` and Spark: for the latest sparklyr release (1.9.1), the compatible Spark version is 4.0.0; for sparklyr versions < 1.9.0, you will need a Spark version < 4.0.0.

```{r eval=FALSE}
install.packages("sparklyr") # version 1.9.1
options(timeout = 6000)
library(sparklyr)
spark_install(version = "4.0.0")
```

To check that a compatible combination of Spark and sparklyr has been installed, use the following two commands:

```{r eval=FALSE}
sparklyr::spark_installed_versions()
utils::packageVersion("sparklyr")
```

### Hadoop

For robust execution of Spark on big datasets, checkpointing can be needed. To be able to enable checkpointing, Hadoop must be installed. For smaller datasets or for running toy examples, the Hadoop installation can be skipped.
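Once Hadoop is available, checkpointing is enabled by registering a checkpoint directory on the Spark connection. A minimal sketch using sparklyr's `spark_set_checkpoint_dir()`; the directory path shown is only an example and should be adapted to your system:

```{r eval=FALSE}
library(sparklyr)

# Connect to a local Spark session, then register a directory where Spark
# can persist checkpointed intermediate results. This requires a working
# Hadoop setup (on Windows, this includes winutils.exe).
sc <- spark_connect(master = "local")
spark_set_checkpoint_dir(sc, "/tmp/spark_checkpoints")
```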
**On Linux**: [https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html](https://hadoop.apache.org/docs/r1.2.1/single_node_setup.html)

**On Windows**: [https://gist.github.com/vorpal56/5e2b67b6be3a827b85ac82a63a5b3b2e](https://gist.github.com/vorpal56/5e2b67b6be3a827b85ac82a63a5b3b2e)

**Note** that specific Java versions are needed to run Spark: [https://spark.apache.org/docs/latest/](https://spark.apache.org/docs/latest/) (JDK 17 or JDK 21 at the time of writing).

## Installation

To install bigMICE from CRAN, use the following command in R:

```{r eval=FALSE}
install.packages("bigMICE")
```

Once installed, load the package:

```{r eval=FALSE}
library(bigMICE)
```

## Example Usage

Load the necessary libraries:

```{r eval=FALSE}
library(bigMICE)
library(dplyr)
library(sparklyr)
```

Create a local Spark session:

```{r eval=FALSE}
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"
conf$spark.memory.fraction <- 0.8
conf$`sparklyr.cores.local` <- 4
# conf$`spark.local.dir` <- "/local/data/spark_tmp/" # needed for checkpointing.
# If not possible, add the argument checkpointing = FALSE to the mice.spark call
sc <- spark_connect(master = "local", config = conf)
```

Download the dataset boys.rda from the `mice` R package [here](https://github.com/amices/mice/tree/master/data) and save it to the current R working directory. Then run the following commands.
```{r eval=FALSE}
# Load the data
load("boys.rda")

# Make a binary outcome
boysBin <- boys %>%
  mutate(
    phb = as.factor(case_when(
      phb == "P1" ~ 1,
      is.na(phb) ~ NA,
      TRUE ~ 0
    ))
  )

# Write the recoded data to a temporary CSV file and read it back into Spark
tmpdir <- tempdir()
csv_file <- paste(tmpdir, "data.csv", sep = "/")
write.csv(boysBin, csv_file, row.names = FALSE)
sdf <- spark_read_csv(sc, "data", csv_file,
                      header = TRUE,
                      infer_schema = TRUE,
                      null_value = "NA") %>%
  select(-all_of(c("hgt", "wgt", "bmi", "hc")))
unlink(tmpdir, recursive = TRUE)

# Prepare the elements needed before running bigMICE
variable_types <- c(age = "Continuous_float",
                    gen = "Nominal",
                    phb = "Binary",
                    tv = "Continuous_int",
                    reg = "Nominal")
analysis_formula <- as.formula("phb ~ age + gen + tv + reg")
```

Call the `mice.spark` function to obtain m = 2 imputed datasets:

```{r eval=FALSE}
imputation_results <- bigMICE::mice.spark(data = sdf,
                                          sc = sc,
                                          variable_types = variable_types,
                                          analysis_formula = analysis_formula,
                                          predictorMatrix = NULL,
                                          m = 2,
                                          maxit = 1,
                                          checkpointing = FALSE)
print(imputation_results)
spark_disconnect(sc)
```
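A multiple-imputation analysis is typically completed by fitting the analysis model on each of the m imputed datasets and combining the estimates with Rubin's rules. The helper below is a hypothetical base-R sketch, not part of the bigMICE API; the function name `pool_rubin` and the example coefficients and variances are illustrative only:

```{r eval=FALSE}
# Rubin's rules: combine m point estimates q and their variances u.
pool_rubin <- function(q, u) {
  m <- length(q)
  qbar <- mean(q)              # pooled point estimate
  w <- mean(u)                 # within-imputation variance
  b <- var(q)                  # between-imputation variance
  t <- w + (1 + 1 / m) * b     # total variance
  list(estimate = qbar, variance = t, se = sqrt(t))
}

# Hypothetical coefficient for age from models fitted on m = 2 imputations:
res <- pool_rubin(q = c(0.52, 0.48), u = c(0.010, 0.012))
```

Here `res$estimate` is the average of the per-imputation coefficients (0.5), and `res$variance` adds the between-imputation variability, inflated by the finite-m correction factor (1 + 1/m), to the average within-imputation variance.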