Weighted Statistics With table1

Benjamin Rich

2025-09-11

Introduction

Weighted descriptive statistics are required in some contexts, for instance, in the analysis of survey data. This can be accomplished by using the provided weighted wrapper class. Internally, this will cause the functions wtd.mean, wtd.var, wtd.quantile and wtd.table from the Hmisc package (which is optional in general but required to use this functionality) to be used in place of their standard non-weighted counterparts.

Example

We take an example from the survey package (note that this is for illustration purposes only, it is not meant to be a real application):

library(survey, quietly=TRUE)
## 
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
## 
##     aml
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
data(myco)

myco$Leprosy <- factor(myco$leprosy, levels=1:0, labels=c("Leprosy Cases", "Controls"))

myco$AgeCat <- factor(myco$Age,
    levels=c(7.5,      12.5,       17.5,       22.5,       27.5,       32.5      ),
    labels=c("5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34")
)

myco$ScarL <- as.logical(myco$Scar)

label(myco$Age) <- "Age"
units(myco$Age) <- "years"
label(myco$AgeCat) <- "Age Group"
label(myco$ScarL) <- "BCG vaccination scar"

table1(~ ScarL + Age + AgeCat | Leprosy, data=weighted(myco, wt), big.mark=",")
Leprosy Cases
(N=258)
Controls
(N=61,310)
Overall
(N=61,568)
BCG vaccination scar
Yes 100 (38.8%) 34,847 (56.8%) 34,947 (56.8%)
No 158 (61.2%) 26,463 (43.2%) 26,621 (43.2%)
Age (years)
Mean (SD) 21.2 (8.32) 16.8 (8.36) 16.8 (8.37)
Median [Min, Max] 22.5 [7.50, 32.5] 17.5 [7.50, 32.5] 17.5 [7.50, 32.5]
Age Group
5 to 9 25 (9.7%) 17,327 (28.3%) 17,352 (28.2%)
10 to 14 50 (19.4%) 13,172 (21.5%) 13,222 (21.5%)
15 to 19 44 (17.1%) 10,325 (16.8%) 10,369 (16.8%)
20 to 24 39 (15.1%) 8,026 (13.1%) 8,065 (13.1%)
25 to 29 47 (18.2%) 5,981 (9.8%) 6,028 (9.8%)
30 to 34 53 (20.5%) 6,479 (10.6%) 6,532 (10.6%)

It also works in “transpose” mode:

table1(~ Age + ScarL | Leprosy, data=weighted(myco, wt), transpose=T, big.mark=",")
Age
(years)
BCG vaccination scar
Leprosy Cases
(N=258)

Mean (SD): 21.2 (8.32)
Median [Min, Max]: 22.5 [7.50, 32.5]

Yes: 100 (38.8%)
No: 158 (61.2%)
Controls
(N=61,310)

Mean (SD): 16.8 (8.36)
Median [Min, Max]: 17.5 [7.50, 32.5]

Yes: 34,847 (56.8%)
No: 26,463 (43.2%)
Overall
(N=61,568)

Mean (SD): 16.8 (8.37)
Median [Min, Max]: 17.5 [7.50, 32.5]

Yes: 34,947 (56.8%)
No: 26,621 (43.2%)

For more flexibility, we may not want the weighting to be applied globally, but only to some of the variables. We can do this as well, by using weighted on individual variables:

table1(~ weighted(ScarL, wt) + Age + AgeCat | Leprosy, data=myco, big.mark=",")
Leprosy Cases
(N=258)
Controls
(N=258)
Overall
(N=516)
BCG vaccination scar
Yes 100 (38.8%) 34,847 (56.8%) 34,947 (56.8%)
No 158 (61.2%) 26,463 (43.2%) 26,621 (43.2%)
Age (years)
Mean (SD) 21.2 (8.32) 21.2 (8.32) 21.2 (8.31)
Median [Min, Max] 22.5 [7.50, 32.5] 22.5 [7.50, 32.5] 22.5 [7.50, 32.5]
Age Group
5 to 9 25 (9.7%) 25 (9.7%) 50 (9.7%)
10 to 14 50 (19.4%) 50 (19.4%) 100 (19.4%)
15 to 19 44 (17.1%) 44 (17.1%) 88 (17.1%)
20 to 24 39 (15.1%) 39 (15.1%) 78 (15.1%)
25 to 29 47 (18.2%) 47 (18.2%) 94 (18.2%)
30 to 34 53 (20.5%) 53 (20.5%) 106 (20.5%)

This implementation allows for simple weighted statistics, but does not currently support more complex designs from the survey package like stratified sampling or cluster sampling.

The weighted and indexed classes

The weighted class is just a wrapper around a vector or data.frame that adds a vector of weights as an attribute. These weights are carried along or subsetted appropriately during operations like slicing or subsetting. See ?weighted for some examples.

The indexed class is similar, but it simply maintains the indices of a vector (row indices for a data.frame) when a subset or slide is taken. This leads to some interesting possibilities when we want to do more complex things.

The following example also comes from the survey package:

data(api)

dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)

svyby(~api99+api00, ~stype, dclus1, svymean)
##   stype    api99    api00 se.api99 se.api00
## E     E 607.7917 648.8681 22.81660 22.36241
## H     H 595.7143 618.5714 41.76400 38.02025
## M     M 608.6000 631.4400 32.56064 31.60947
svytable(~sch.wide+stype, dclus1)
##         stype
## sch.wide         E         H         M
##      No   406.1640  101.5410  270.7760
##      Yes 4467.8035  372.3170  575.3989

Using table1, the same results can be presented more beautifully:

myrender <- function(x, name, ...) {
    if (is.numeric(x)) {
        r <- svymean(as.formula(paste0("~", name)), subset(dclus1, (1:nrow(dclus1)) %in% indices(x)))
        r <- c(Mean=as.numeric(r), SE=sqrt(attr(r, "var", exact=T)))
        r <- unlist(stats.apply.rounding(as.list(r), big.mark=","))
    } else {
        r <- svytable(as.formula(paste0("~", name)), subset(dclus1, (1:nrow(dclus1)) %in% indices(x)))
        r <- unlist(stats.apply.rounding(as.list(r), big.mark=",", digits=1, rounding.fn=round_pad))
    }
    c("", r)
}

apiclus1$stype2 <- factor(apiclus1$stype, levels=c("E", "M", "H"),
    labels=c("Elementary", "Middle School", "High School"))

label(apiclus1$api99)    <- "API in 1999"
label(apiclus1$api00)    <- "API in 2000"
label(apiclus1$sch.wide) <- "Met school-wide growth target?"

table1(~ api99 + api00 + sch.wide | stype2, indexed(apiclus1), render=myrender,
    render.strat=names)
Elementary Middle School High School Overall
API in 1999
Mean 608 609 596 607
SE 22.8 32.6 41.8 24.2
API in 2000
Mean 649 631 619 644
SE 22.4 31.6 38.0 23.5
Met school-wide growth target?
No 406.2 270.8 101.5 778.5
Yes 4,467.8 575.4 372.3 5,415.5