Automatic Variable Labeling

library(sumExtras)
library(gtsummary)
library(dplyr)

use_jama_theme()

Raw variable names like trt, marker, and grade don’t belong in a publication table. If you’re building 20+ tables across an analysis, manually relabeling the same variables in every tbl_summary() call is time-consuming and error-prone. add_auto_labels() lets you define labels once and apply them everywhere.

Creating a Data Dictionary

A dictionary is a data frame with two columns: variable (exact variable names) and description (the labels you want displayed). Column names are case-insensitive.

dictionary <- tibble::tribble(
  ~variable,    ~description,
  "trt",        "Chemotherapy Treatment",
  "age",        "Age at Enrollment (years)",
  "marker",     "Marker Level (ng/mL)",
  "stage",      "T Stage",
  "grade",      "Tumor Grade",
  "response",   "Tumor Response",
  "death",      "Patient Died"
)

dictionary
#> # A tibble: 7 × 2
#>   variable description              
#>   <chr>    <chr>                    
#> 1 trt      Chemotherapy Treatment   
#> 2 age      Age at Enrollment (years)
#> 3 marker   Marker Level (ng/mL)     
#> 4 stage    T Stage                  
#> 5 grade    Tumor Grade              
#> 6 response Tumor Response           
#> 7 death    Patient Died

In practice, you could load this from a CSV or define it once at the top of your analysis script.
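
As a rough sketch of the CSV route, the base R snippet below writes a small dictionary to a temporary file and reads it back. The file path, the object name dictionary_from_csv, and the sanity checks are illustrative assumptions, not part of sumExtras. In a real project the CSV would already exist alongside the analysis.

```r
# Write a small example dictionary to a temporary CSV (stand-in for a real file).
dict_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "variable,description",
    "trt,Chemotherapy Treatment",
    "age,Age at Enrollment (years)",
    "marker,Marker Level (ng/mL)"
  ),
  dict_path
)

# Read it back as character columns so labels are not converted to factors.
dictionary_from_csv <- read.csv(dict_path, stringsAsFactors = FALSE)

# Basic sanity checks before using the dictionary for labeling:
# both expected columns are present and no variable is listed twice.
stopifnot(
  all(c("variable", "description") %in% tolower(names(dictionary_from_csv))),
  !anyDuplicated(dictionary_from_csv$variable)
)
```

The resulting data frame has the same two-column shape as the tribble above, so it could be passed as add_auto_labels(dictionary = dictionary_from_csv).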

Labeling gtsummary Tables

Pass the Dictionary Explicitly

trial |>
  tbl_summary(by = trt, include = c(age, grade, marker)) |>
  extras() |> 
  add_auto_labels(dictionary = dictionary)
                      Overall            Drug A             Drug B             p-value²
                      N = 200¹           N = 98¹            N = 102¹
Age                   47 (38, 57)        46 (37, 60)        48 (39, 56)        0.718
    Unknown           11                 7                  4
Grade                                                                          0.871
    I                 68 (34%)           35 (36%)           33 (32%)
    II                68 (34%)           32 (33%)           36 (35%)
    III               64 (32%)           31 (32%)           33 (32%)
Marker Level (ng/mL)  0.64 (0.22, 1.41)  0.84 (0.23, 1.60)  0.52 (0.18, 1.21)  0.085
    Unknown           10                 6                  4
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Automatic Discovery

If a dictionary object exists in your environment, add_auto_labels() finds it without you passing it:

# dictionary already exists from above
trial |>
  tbl_summary(by = trt, include = c(age, stage, response)) |>
  extras() |> 
  add_auto_labels()
                Overall      Drug A       Drug B       p-value²
                N = 200¹     N = 98¹      N = 102¹
Age             47 (38, 57)  46 (37, 60)  48 (39, 56)  0.718
    Unknown     11           7            4
T Stage                                                0.866
    T1          53 (27%)     28 (29%)     25 (25%)
    T2          54 (27%)     25 (26%)     29 (28%)
    T3          43 (22%)     22 (22%)     21 (21%)
    T4          50 (25%)     23 (23%)     27 (26%)
Tumor Response  61 (32%)     28 (29%)     33 (34%)     0.530
    Unknown     7            3            4
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Pre-Labeled Data

If your data already has label attributes (e.g., from haven::read_sas() or manual assignment), add_auto_labels() reads those directly:

labeled_trial <- trial
attr(labeled_trial$age, "label") <- "Patient Age at Baseline"
attr(labeled_trial$marker, "label") <- "Biomarker Concentration (ng/mL)"

labeled_trial |>
  tbl_summary(by = trt, include = c(age, marker)) |>
  extras() |> 
  add_auto_labels()
                                 Overall            Drug A             Drug B             p-value²
                                 N = 200¹           N = 98¹            N = 102¹
Patient Age at Baseline          47 (38, 57)        46 (37, 60)        48 (39, 56)        0.718
    Unknown                      11                 7                  4
Biomarker Concentration (ng/mL)  0.64 (0.22, 1.41)  0.84 (0.23, 1.60)  0.52 (0.18, 1.21)  0.085
    Unknown                      10                 6                  4
¹ Median (Q1, Q3)
² Wilcoxon rank sum test
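
To see which label attributes a dataset carries before building a table, a small base R helper can list them. This is only a sketch: get_label_attrs() and the toy data frame are hypothetical and not part of sumExtras, and the package may read attributes differently internally.

```r
# Hypothetical helper (not part of sumExtras): collect the "label" attribute
# of every column in a data frame into a named character vector.
get_label_attrs <- function(data) {
  labels <- vapply(
    data,
    function(col) {
      lbl <- attr(col, "label", exact = TRUE)
      # Columns without a label are marked NA so they can be dropped below.
      if (is.null(lbl)) NA_character_ else as.character(lbl)[1]
    },
    character(1)
  )
  labels[!is.na(labels)]
}

# Toy data frame standing in for an imported dataset.
toy <- data.frame(age = c(34, 51), marker = c(0.2, 1.1), site = c("A", "B"))
attr(toy$age, "label") <- "Patient Age at Baseline"
attr(toy$marker, "label") <- "Biomarker Concentration (ng/mL)"

# Returns entries for age and marker only; site has no label attribute.
get_label_attrs(toy)
```

Columns missing from the result are the ones that would need a dictionary entry or a manual label.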

Manual Overrides Always Win

Labels set via label = list(...) in tbl_summary() always take priority over dictionary or attribute labels:

trial |>
  tbl_summary(
    by = trt,
    include = c(age, grade, marker),
    label = list(age ~ "Age (from tbl_summary function)")
  ) |>
  extras() |> 
  add_auto_labels(dictionary = dictionary)
                                 Overall            Drug A             Drug B             p-value²
                                 N = 200¹           N = 98¹            N = 102¹
Age (from tbl_summary function)  47 (38, 57)        46 (37, 60)        48 (39, 56)        0.718
    Unknown                      11                 7                  4
Grade                                                                                     0.871
    I                            68 (34%)           35 (36%)           33 (32%)
    II                           68 (34%)           32 (33%)           36 (35%)
    III                          64 (32%)           31 (32%)           33 (32%)
Marker Level (ng/mL)             0.64 (0.22, 1.41)  0.84 (0.23, 1.60)  0.52 (0.18, 1.21)  0.085
    Unknown                      10                 6                  4
¹ Median (Q1, Q3); n (%)
² Wilcoxon rank sum test; Pearson’s Chi-squared test

Regression Tables

add_auto_labels() works the same way with tbl_regression(). Here the dictionary is discovered from the environment:

lm(marker ~ age + grade + stage, data = trial) |>
  tbl_regression() |>
  add_auto_labels()
Characteristic             Beta   95% CI        p-value
Age at Enrollment (years)  0.00   -0.01, 0.01   >0.9
Tumor Grade
    I
    II                     -0.35  -0.67, -0.04  0.027
    III                    -0.12  -0.43, 0.19   0.4
T Stage
    T1
    T2                     0.33   -0.01, 0.67   0.057
    T3                     0.21   -0.17, 0.58   0.3
    T4                     0.14   -0.22, 0.50   0.4
Abbreviation: CI = Confidence Interval

Label Priority

When both dictionary labels and attribute labels exist for the same variable, attribute labels take priority by default. The full order is:

  1. Manual labels (from label = list(...) in tbl_summary()) always win.
  2. Attribute labels (from attr(data$var, "label")) take priority over dictionary labels.
  3. Dictionary labels are used as a fallback.

We recommend setting options(sumExtras.prefer_dictionary = TRUE) so dictionary labels take priority over attribute labels. This is especially useful when your imported data has generic attribute labels but your dictionary has the labels you actually want in publication tables. See vignette("options") for details.
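
The priority rules can be sketched as a plain R function. This is an illustration of the documented order only, not the sumExtras implementation: resolve_label() and its arguments are hypothetical, and falling back to the raw variable name when no label exists is an assumption.

```r
# Hypothetical sketch of the documented priority order; not sumExtras internals.
resolve_label <- function(var, manual = NULL, attribute = NULL, dictionary = NULL,
                          prefer_dictionary = getOption("sumExtras.prefer_dictionary", FALSE)) {
  # Look up the dictionary label for this variable, if there is one.
  dict_label <- NULL
  if (!is.null(dictionary)) {
    hit <- dictionary$description[dictionary$variable == var]
    if (length(hit) > 0) dict_label <- hit[1]
  }

  # Manual labels always come first; the option only swaps the last two.
  candidates <- if (isTRUE(prefer_dictionary)) {
    list(manual, dict_label, attribute)
  } else {
    list(manual, attribute, dict_label)
  }

  # Return the first available label, else the variable name (assumed fallback).
  for (label in candidates) {
    if (!is.null(label)) return(label)
  }
  var
}

dict <- data.frame(
  variable = "age",
  description = "Age from Dictionary",
  stringsAsFactors = FALSE
)

resolve_label("age", attribute = "Age from Attribute", dictionary = dict,
              prefer_dictionary = FALSE)
#> [1] "Age from Attribute"

resolve_label("age", attribute = "Age from Attribute", dictionary = dict,
              prefer_dictionary = TRUE)
#> [1] "Age from Dictionary"
```

Passing prefer_dictionary explicitly keeps the example independent of session options; leaving it out would make the sketch read the sumExtras.prefer_dictionary option instead.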

trial_both <- trial
attr(trial_both$age, "label") <- "Age from Attribute"

dictionary_conflict <- tibble::tribble(
  ~variable, ~description,
  "age", "Age from Dictionary"
)

# Attribute wins over dictionary
trial_both |>
  tbl_summary(by = trt, include = age) |>
  add_auto_labels(dictionary = dictionary_conflict) |>
  extras()
                    Overall      Drug A       Drug B       p-value²
                    N = 200¹     N = 98¹      N = 102¹
Age from Attribute  47 (38, 57)  46 (37, 60)  48 (39, 56)  0.718
    Unknown         11           7            4
¹ Median (Q1, Q3)
² Wilcoxon rank sum test

Automatic Labeling via Options

If you always keep a dictionary in your environment, you can skip calling add_auto_labels() entirely. Set this once per session (or put it in your .Rprofile):

options(sumExtras.auto_labels = TRUE)

Now every extras() call picks up the dictionary automatically:

dictionary <- tibble::tribble(
  ~variable,    ~description,
  "age",        "Age at Enrollment (years)",
  "marker",     "Marker Level (ng/mL)",
  "grade",      "Tumor Grade"
)

# No add_auto_labels() needed
trial |>
  tbl_summary(by = trt) |>
  extras()

If no dictionary is found and the data has no label attributes, extras() continues normally. If something goes wrong, it warns and moves on. You can still call add_auto_labels() explicitly whenever you need per-table control.

See vignette("options") for more on .Rprofile setup.
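
As a minimal sketch, an .Rprofile that turns on both options mentioned in this vignette might look like the fragment below. The option names come from the text above; whether to use a user-level or project-level .Rprofile is a choice left to the reader.

```r
# ~/.Rprofile (or a project-level .Rprofile)
# Both option names are taken from this vignette.
options(
  sumExtras.auto_labels = TRUE,        # extras() applies labels automatically
  sumExtras.prefer_dictionary = TRUE   # dictionary labels beat attribute labels
)
```

A project-level .Rprofile keeps the setting with the analysis, which may make the tables easier for collaborators to reproduce.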

More Vignettes