Weighted descriptive statistics are required in some contexts, for
instance, in the analysis of survey data. This can be accomplished by
using the provided weighted
wrapper class. Internally, this
will cause the functions wtd.mean
, wtd.var
,
wtd.quantile
and wtd.table
from the
Hmisc
package (which is optional in general but required to
use this functionality) to be used in place of their standard
non-weighted counterparts.
We take an example from the survey
package (note that
this is for illustration purposes only, it is not meant to be a real
application):
##
## Attaching package: 'survival'
## The following object is masked from 'package:boot':
##
## aml
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
data(myco)
myco$Leprosy <- factor(myco$leprosy, levels=1:0, labels=c("Leprosy Cases", "Controls"))
myco$AgeCat <- factor(myco$Age,
levels=c(7.5, 12.5, 17.5, 22.5, 27.5, 32.5 ),
labels=c("5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34")
)
myco$ScarL <- as.logical(myco$Scar)
label(myco$Age) <- "Age"
units(myco$Age) <- "years"
label(myco$AgeCat) <- "Age Group"
label(myco$ScarL) <- "BCG vaccination scar"
table1(~ ScarL + Age + AgeCat | Leprosy, data=weighted(myco, wt), big.mark=",")
Leprosy Cases (N=258) |
Controls (N=61,310) |
Overall (N=61,568) |
|
---|---|---|---|
BCG vaccination scar | |||
Yes | 100 (38.8%) | 34,847 (56.8%) | 34,947 (56.8%) |
No | 158 (61.2%) | 26,463 (43.2%) | 26,621 (43.2%) |
Age (years) | |||
Mean (SD) | 21.2 (8.32) | 16.8 (8.36) | 16.8 (8.37) |
Median [Min, Max] | 22.5 [7.50, 32.5] | 17.5 [7.50, 32.5] | 17.5 [7.50, 32.5] |
Age Group | |||
5 to 9 | 25 (9.7%) | 17,327 (28.3%) | 17,352 (28.2%) |
10 to 14 | 50 (19.4%) | 13,172 (21.5%) | 13,222 (21.5%) |
15 to 19 | 44 (17.1%) | 10,325 (16.8%) | 10,369 (16.8%) |
20 to 24 | 39 (15.1%) | 8,026 (13.1%) | 8,065 (13.1%) |
25 to 29 | 47 (18.2%) | 5,981 (9.8%) | 6,028 (9.8%) |
30 to 34 | 53 (20.5%) | 6,479 (10.6%) | 6,532 (10.6%) |
It also works in “transpose” mode:
Age (years) |
BCG vaccination scar | |
---|---|---|
Leprosy Cases (N=258) |
Mean (SD): 21.2 (8.32) Median [Min, Max]: 22.5 [7.50, 32.5] |
Yes: 100 (38.8%) No: 158 (61.2%) |
Controls (N=61,310) |
Mean (SD): 16.8 (8.36) Median [Min, Max]: 17.5 [7.50, 32.5] |
Yes: 34,847 (56.8%) No: 26,463 (43.2%) |
Overall (N=61,568) |
Mean (SD): 16.8 (8.37) Median [Min, Max]: 17.5 [7.50, 32.5] |
Yes: 34,947 (56.8%) No: 26,621 (43.2%) |
For more flexibility, we may not want the weighting to be applied
globally, but only to some of the variables. We can do this as well, by
using weighted
on individual variables:
Leprosy Cases (N=258) |
Controls (N=258) |
Overall (N=516) |
|
---|---|---|---|
BCG vaccination scar | |||
Yes | 100 (38.8%) | 34,847 (56.8%) | 34,947 (56.8%) |
No | 158 (61.2%) | 26,463 (43.2%) | 26,621 (43.2%) |
Age (years) | |||
Mean (SD) | 21.2 (8.32) | 21.2 (8.32) | 21.2 (8.31) |
Median [Min, Max] | 22.5 [7.50, 32.5] | 22.5 [7.50, 32.5] | 22.5 [7.50, 32.5] |
Age Group | |||
5 to 9 | 25 (9.7%) | 25 (9.7%) | 50 (9.7%) |
10 to 14 | 50 (19.4%) | 50 (19.4%) | 100 (19.4%) |
15 to 19 | 44 (17.1%) | 44 (17.1%) | 88 (17.1%) |
20 to 24 | 39 (15.1%) | 39 (15.1%) | 78 (15.1%) |
25 to 29 | 47 (18.2%) | 47 (18.2%) | 94 (18.2%) |
30 to 34 | 53 (20.5%) | 53 (20.5%) | 106 (20.5%) |
This implementation allows for simple weighted statistics, but does
not currently support more complex designs from the survey
package like stratified sampling or cluster sampling.
weighted
and indexed
classesThe weighted
class is just a wrapper around a vector or
data.frame
that adds a vector of weights as an attribute.
These weights are carried along or subsetted appropriately during
operations like slicing or subsetting. See ?weighted
for
some examples.
The indexed
class is similar, but it simply maintains
the indices of a vector (row indices for a data.frame
) when
a subset or slide is taken. This leads to some interesting possibilities
when we want to do more complex things.
The following example also comes from the survey
package:
data(api)
dclus1<-svydesign(id=~dnum, weights=~pw, data=apiclus1, fpc=~fpc)
svyby(~api99+api00, ~stype, dclus1, svymean)
## stype api99 api00 se.api99 se.api00
## E E 607.7917 648.8681 22.81660 22.36241
## H H 595.7143 618.5714 41.76400 38.02025
## M M 608.6000 631.4400 32.56064 31.60947
## stype
## sch.wide E H M
## No 406.1640 101.5410 270.7760
## Yes 4467.8035 372.3170 575.3989
Using table1
, the same results can be presented more
beautifully:
myrender <- function(x, name, ...) {
if (is.numeric(x)) {
r <- svymean(as.formula(paste0("~", name)), subset(dclus1, (1:nrow(dclus1)) %in% indices(x)))
r <- c(Mean=as.numeric(r), SE=sqrt(attr(r, "var", exact=T)))
r <- unlist(stats.apply.rounding(as.list(r), big.mark=","))
} else {
r <- svytable(as.formula(paste0("~", name)), subset(dclus1, (1:nrow(dclus1)) %in% indices(x)))
r <- unlist(stats.apply.rounding(as.list(r), big.mark=",", digits=1, rounding.fn=round_pad))
}
c("", r)
}
apiclus1$stype2 <- factor(apiclus1$stype, levels=c("E", "M", "H"),
labels=c("Elementary", "Middle School", "High School"))
label(apiclus1$api99) <- "API in 1999"
label(apiclus1$api00) <- "API in 2000"
label(apiclus1$sch.wide) <- "Met school-wide growth target?"
table1(~ api99 + api00 + sch.wide | stype2, indexed(apiclus1), render=myrender,
render.strat=names)
Elementary | Middle School | High School | Overall | |
---|---|---|---|---|
API in 1999 | ||||
Mean | 608 | 609 | 596 | 607 |
SE | 22.8 | 32.6 | 41.8 | 24.2 |
API in 2000 | ||||
Mean | 649 | 631 | 619 | 644 |
SE | 22.4 | 31.6 | 38.0 | 23.5 |
Met school-wide growth target? | ||||
No | 406.2 | 270.8 | 101.5 | 778.5 |
Yes | 4,467.8 | 575.4 | 372.3 | 5,415.5 |