--- title: "![](logo.png){width=1in} Introduction to the eVCGsampler package" output: html_document: toc: true toc_float: true toc_collapsed: false toc_depth: 1 vignette: > %\VignetteIndexEntry{eVCGsampler} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(eVCGsampler) library(knitr) ``` # Description Step-by-step usage examples for VCG sampling with the eVCGsampler package. The eVCGsampler package provides a powerful and flexible framework for Virtual Control Group (VCG) sampling, utilizing energy distance-based covariate balancing. It features intuitive visualization tools to evaluate covariate balance and includes a built-in permutation test to assess the statistical significance of imbalances. Unlike traditional matching or weighting approaches, no specific downstream analysis methods are required after balancing. This is because the method selects a subset of real, whole (unweighted) animals, preserving the integrity of the historical control data. Please note that this method does not replace careful pre-filtering of historical control data to make it comparable to your study data, but rather serves as a final refinement step at the end of VCG generation. It reduces biological heterogeneity but does not address technical heterogeneity or differences caused by variations in data collection methods, processing pipelines, or experimental conditions. # Balancing of one covariate A simple example with one covariate to be balanced. ```{r, echo=TRUE, warning=FALSE, fig.dim = c(6, 6)} # Generate data POOL (n=100) to sample from # Generate baseline data (n=30), for three treatment groups n=10 (all treated groups taken together) set.seed(3453) dat <- data.frame( treat = rep(0:1, c(100, 30)), weight = c(rlnorm(100, 2, 1), rnorm(30, 8.5, 1.7)) ) # Check whether the POOL has a different covariate distribution than the TG energy_test(treat ~ weight, data=dat) # Sample a smaller subset that is balanced to TG (it balance group=0 to group=1) out <- VCG_sampler(treat ~ weight, data=dat, n=10) dat2 <- out[[1]] # data frame with balance information, with new variable VCG, if 1, mean the observation was selected to be in VCG. out[[2]] # Balance target plot (optional) # Permutation ellipses are generated by randomly permuting the pool and treat groups to estimate usual (random) variability. # Only the X and Y axes are computed directly; the ellipse is interpolated between the axes. # This method is intended as a visual approximation rather than a precise statistical test. vcg <- dat2[which(dat2$VCG==1), ] # select the VCG that was sampled by VCG_sampler # Test if the balancing was successful (variable "treat_balanced", was created by the VCG_sampler function) energy_test(treat_balanced ~ weight, data=dat2) # Check covariate balance visual for the 'weight' covariate. plot_var(dat2, what='weight') ``` # Balancing of several covariates Example demonstrates how multiple covariates, including a categorical one, can be balanced. ```{r, echo=TRUE, warning=FALSE, fig.dim = c(6, 6)} set.seed(2244) # Generate baseline data for treatment groups (all treated groups taken together) d_tg <- data.frame( age = rnorm(20, 10, 1), weight= rnorm(20, 7, 1), class = rbinom(20, 1, 0.3) ) # Generate data lake (pool) data (to sample from) d_pool <- data.frame( age = rnorm(50, 12, 2), weight= rnorm(50, 6, 2), class = rbinom(50, 1, 0.6) ) # Combine the data to balance the covariates between the two data sets (creates the variable treat POOL vs TG) dat <- combine_data(d_pool, d_tg, indicator_name='treat') # Calculate current energy distance (optional) energy_distance(treat ~ age + weight + class, data=dat) # Check whether the POOL has a different covariate distribution than the TG energy_test(treat ~ age + weight + class, data=dat) # If you are unsure about the size of VCG, try the BestVCGsize function to determine the optimal size. # This is a purely exploratory approach! BestVCGsize(treat ~ age + weight + class, data=dat) # It is only intended for exploratory purposes, as the VCG size is normally given. # Sample a smaller subset that is balanced to TG (it balance treat=0 to treat=1) out <- VCG_sampler(treat ~ age + weight + class, data=dat, n=10) dat2 <- out[[1]] out[[2]] # Balance target plot (optional) # Test if the balancing was successful energy_test(treat_balanced ~ age + weight + class, data=dat2) # Check covariate balance one by one plot_var(dat2, what='age') plot_var(dat2, what='weight') plot_var(dat2, what='class') ``` # Balancing by stratum To achieve covariate balance within distinct subgroups (e.g., by sex) ```{r, echo=TRUE, warning=FALSE, fig.dim = c(6, 6)} # Generate data POOL (n=100) to sample from # Generate baseline data (n=30), for all treatment groups together. # And a stratum factor variable sex set.seed(14495) dat <- data.frame( treat = rep(0:1, c(100, 30)), cov1 = c(rnorm(100, 11, 2), rnorm(30, 10, 1)), cov2 = c(rnorm(100, 11, 2), rnorm(30, 10, 1)), cov3 = c(rnorm(100, 9, 2), rnorm(30, 10, 1)), sex = c(rbinom(100, 1, 0.5), rbinom(30, 1, 0.5))) dat$sex <- factor(dat$sex, labels=c('M', 'F')) # Sample a VCG of size 7 for male and 5 for female animals. out <- VCG_sampler(treat ~ cov1 + cov2 + cov3 | sex, data=dat, n=c(7, 5), plot=TRUE) dat2 <- out[[1]] out[[2]] # Test if the balancing was successful energy_test(treat_balanced ~ cov1 + cov2 + cov3 | sex, data=dat2, plot=F) # plot cov1 plot_var(dat2, what='cov1') ``` # Example with a difficult distribution The method is non-parametric and makes no assumptions about the distribution, as the following example shows. ```{r, echo=TRUE, warning=FALSE, fig.dim = c(6, 6)} # Create some bimodal sample data for the TG, the POOL data is normal distributed set.seed(5435) dat <- data.frame( treat = rep(0:1, c(100, 50)), cov_1 = c(rnorm(100, 7, 2), c(rnorm(25, 5, 1), rnorm(25, 9, 1)))) # Balance cov_1 covariate dat2 <- VCG_sampler(treat ~ cov_1, data=dat, n=25, plot=F) # Create histogram plot to show the results data <- data.frame( value = c(dat$cov_1[which(dat$treat==0)], dat$cov_1[which(dat$treat==1)], dat2$cov_1[which(dat2$VCG==1)]), group = factor(rep(c("POOL", "TG", "VCG"), c(100, 50, 25))) ) custom_colors <- c("POOL" = "#00A091", "TG" = "#7A00E6", "VCG" = "#FE7500") sample_sizes <- data.frame(label=c('n=100', 'n=50', 'n=25'),group=c("POOL", "TG", "VCG")) ggplot(data, aes(x = value, fill = group)) + geom_density(alpha = 0.6, color = NA) + scale_fill_manual(values = custom_colors) + facet_wrap(~ group, ncol = 1) + labs(title = "", x = "", y = "Density") + geom_text(data = sample_sizes, aes(x = Inf, y = Inf, label = label), hjust = 1.1, vjust = 1.5, inherit.aes = F) + theme_minimal() + theme(legend.position = "none",strip.text = element_text(size = 14, face = "bold"), plot.title = element_text(size = 16, face = "bold" ) ) ``` # Multiple control groups (VCGs) If multiple VCGs are needed, the multiSampler function can be used. It will use the VCG_sampler function with parameter "random"=TRUE. That means, it will use the energy distance as inverse probability for the observation to be sampled. This will incorporate some randomness in the process, but will reduce the balancing quality to some extent. This function should only be used if you really need multiple VCG, e.g. for PoC studies. It is not intended for selecting one VCG from them afterwards! In this case, the VCG_sampler function should be used directly and only one VCG should be generated. ```{r, echo=TRUE, warning=FALSE, fig.dim = c(6, 6)} # Create data dat <- data.frame( treat = rep(0:1, c(50, 30)), cov_1 = c(rnorm(50, 5, 2), rnorm(30, 5, 1)), cov_2 = c(rnorm(50, 11, 2), rnorm(30, 10, 1)) ) # Generate 10 VCGs of size 5 out <- multiSampler(treat ~ cov_1 + cov_2, data = dat, n = 5, Nsamples = 10) out[[2]] str(out[[1]]) ``` # References Székely, G. J., & Rizzo, M. L. (2016). Energy distance. Wiley Interdisciplinary Reviews: Computational Statistics, 8(1), 27–38.