---
title: "dtrackr - Grouping, Nesting and Long format data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{dtrackr - Grouping, Nesting and Long format data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
library(dplyr)
library(tidyr)
library(dtrackr)
knitr::opts_chunk$set(echo = TRUE)
```

## Long format data

`dtrackr` assumes a tidy data paradigm where one row of data is relevant to one logical entity, whether it be cars, irises, diamonds, or anything else. This is not always the case, for example when the data you are processing comes from a join of data sets. Here we simulate a set of patients, test samples, and test results in a hypothetical trial:

```{r}
age_cats = factor(sprintf("%02d-%02d",seq(0,80,5),seq(4,84,5)))

# A set of synthetic patients:
patients = tibble::tibble(
  patient_id = 1:100,
  age_category = sample(age_cats,100, replace=TRUE),
  ethnicity = sample(1:6, 100, replace = TRUE),
  gender = sample(c("Male","Female"), 100, replace=TRUE),
  group = sample(c("Cases","Controls"), 100, replace=TRUE)
)

# each patient is going to have a random selection of tests
tests = tibble::tibble(
  test_id = 1:1000,
  patient_id = sample(1:100,1000, replace = TRUE),
  test_type = sample(c("FBC","LFT","Electrolytes"), 1000, replace=TRUE),
  test_date = as.Date("2025-01-01")+sample.int(50, 1000, replace=TRUE)
)

# and each test a random selection of results consisting of components and
# values:
tests = tests %>% mutate(
  result = purrr::map(test_type, ~ case_when(
    .x == "FBC" ~ list(tibble::tibble(
      component = c("HB","platelets","WCC"),
      value = c( runif(1,13.5,15), runif(1,100,1000), runif(1,0,30))
    )),
    .x == "LFT" ~ list(tibble::tibble(
      component = c("AST","GGT"),
      value = c( runif(1,0,100), runif(1,0,100))
    )),
    .x == "Electrolytes" ~ list(tibble::tibble(
      component = c("NA","K","Glucose"),
      value = c( runif(1,130,150), runif(1,3.3,5.2), runif(1,50,150))
    ))
  ))
)

data = patients %>%
  inner_join(
    tests %>% unnest(result) %>%
      unnest(result),
    by="patient_id"
  )

data %>% glimpse()
```

We might have an objective to prepare this data set for analysis but have inclusion or exclusion criteria that apply at different levels. We might need to exclude patients who are too young or too old, tests that were taken at the wrong time, patients who have evidence of diabetes, or specific test results that are out of range. All of this needs to be done while stratifying by case or control status.

To achieve this we use nesting to collapse the data frame into one row per patient, one row per test or one row per test result, depending on what we are trying to exclude. This allows `dtrackr` to dynamically change what it regards as a single countable thing, depending on the context of the pipeline.

```{r}
processed = data %>%
  # the data is originally long format with one row per test result:
  track("{.count} test results") %>%
  mutate(maybe_diabetic = any(component == "Glucose" & value>130), .by = patient_id) %>%
  nest(test_panel = c(component,value), .messages="") %>%
  # Now the data is long format with one row per test:
  comment("{.count} tests") %>%
  nest(tests = starts_with("test_"), .messages="") %>%
  # and now long format with one row per patient:
  comment("{.count} patients") %>%
  group_by(group) %>%
  comment("{.count} patients") %>%
  # these exclusions are at the patient level
  exclude_all(
    .headline = "people",
    maybe_diabetic ~ "{.excluded} diabetics",
    age_category %in% age_cats[1:4] ~ "{.excluded} under 20"
  ) %>%
  # these are now back at the test level
  unnest(tests) %>%
  comment("{.count} tests",.headline = "") %>%
  exclude_all(
    .headline = "tests",
    test_date < "2025-01-07" ~ "{.excluded} with invalid dates"
  ) %>%
  count_subgroup(test_type, .headline = "") %>%
  # and finally at the granular test result level
  unnest(test_panel) %>%
  exclude_all(
    .headline = "results",
    component == "HB" & value < 14 ~ "{.excluded} invalid Hb results",
    component == "K" & value < 3.5 ~ "{.excluded} haemolysed K+"
  ) %>%
  group_by(test_type, .add=TRUE, .messages="By tests") %>%
  count_subgroup(component, .headline = "{test_type}") %>%
  ungroup(.messages = "{.count} eligible results") %>%
  nest(test_panel = c(component,value), .messages="") %>%
  comment("{.count} eligible tests") %>%
  nest(tests = starts_with("test_"), .messages="") %>%
  comment("{.count} eligible patients")

processed %>% flowchart()
```

## Maximum groupings

Going back to the original example data, let's assume, in a slightly contrived example, that we want to exclude age categories that don't have a close gender match between cases and controls. To count these we have to create a lot of small groups.

```{r}
data %>%
  group_by(age_category, gender, group) %>%
  summarise(
    n = n_distinct(patient_id)
  ) %>%
  pivot_wider(values_from = n, names_from = group) %>%
  filter(abs(Cases-Controls) <= 1) %>%
  glimpse()
```

If we were to try to track this data frame through the pipeline, there would be a problem with the flowchart because too many groups are generated. This causes performance and legibility issues for the resulting graph, and it results from an interim stage of the data pipeline where grouping is used to do a fine-scale summarisation operation. The maximum number of groups that `dtrackr` will attempt to keep track of is configurable but defaults to 16. If the number of groups exceeds that, it pauses tracking until the number of groups drops back to the limit or below, at which point it starts tracking again. A "< hidden steps >" message is inserted into the graph when this happens, but this message can be changed, or disabled altogether with `options(dtrackr.hidden_steps = "")`. By default `dtrackr` does not warn the user when tracking is paused in this way unless `options(dtrackr.verbose=TRUE)` is set.
```{r}
old = options(dtrackr.verbose=TRUE)

data %>%
  track() %>%
  group_by(gender) %>%
  comment(c("{.count} items","before pause")) %>%
  # the tracking is paused on this next step as the number of groups becomes >16
  group_by(age_category, group, .add=TRUE) %>%
  comment("This message is not tracked") %>%
  summarise(
    n = n_distinct(patient_id)
  ) %>%
  pivot_wider(values_from = n, names_from = group) %>%
  filter(abs(Cases-Controls) <= 1) %>%
  # the tracking is automatically resumed at this point as the grouping has
  # returned to manageable levels.
  group_by(gender) %>%
  comment(c("{.count} summarised rows","after resume")) %>%
  flowchart()

options(old)
```

By default this behaviour is triggered when there are more than 16 groups. This can be changed by setting the option:

```R
options(dtrackr.max_supported_groupings = 16)
```

Pausing and resuming the tracking can also be done manually by calling `dtrackr::pause()` and `dtrackr::resume()`. This is a fairly experimental feature, and I don't expect it to be heavily used.
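As a minimal sketch of manual pausing, assuming `pause()` and `resume()` can be placed in a pipeline like any other tracked step, something along these lines should work. The `iris` data, the filter and the messages are purely illustrative and are not part of the trial example above:

```R
library(dplyr)
library(dtrackr)

tracked = iris %>%
  track() %>%
  comment("{.count} flowers before pausing") %>%
  # stop recording steps in the history
  pause() %>%
  # this step is intended to go unrecorded while tracking is paused
  filter(Sepal.Length > 5) %>%
  # start recording again
  resume() %>%
  comment("{.count} flowers after resuming")
```

Piping `tracked` into `flowchart()`, as in the earlier examples, is one way to check which steps ended up in the history.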