Tutorial: Inverse probability weights to correct for ascertainment bias using R package ‘wcox’

This tutorial shows a step-by-step analysis of a toy data set.

Let’s start!

Step 1: Load R package ‘wcox’ and toy data.

Load the package.


library(wcox)

require(dplyr)
require(tidyr)
require(survival)

Load the toy data set.

# Load toy data.
data("fam_dat")

Note that this is a simulated data set, not real observations. However, it is similar in structure to data one might come across in, for example, familial cancer studies.

# Show the first few lines of the toy data.
head(fam_dat)
#>    family_id           x event_indicator age individual_id
#> 8          1  0.33093247               1  38             1
#> 9          1  0.18551392               1  67             2
#> 10         1 -1.65904097               0 100             3
#> 11         1 -0.38816903               1  14             4
#> 12         1  0.03958956               1   2             5
#> 18         2  1.05115476               1   9             6

Families share the same ‘family_id’; ‘individual_id’ is unique per individual. Risk modifier ‘x’ is a continuous variable. The event indicator (‘event_indicator’) takes value 1 if the individual experienced the event during follow-up and 0 otherwise. We consider the follow-up time since birth, i.e. age. For individuals experiencing the event, ‘age’ is the time-to-event. For others, this is the time-to-censoring.
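
As a quick, optional sanity check (not part of the weighting workflow itself), one can inspect the variable types and the event distribution:

# Inspect variable types and the distribution of the event indicator.
str(fam_dat)
table(fam_dat$event_indicator)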

# Show the number of families and top rows.
cat( "There are", unique(fam_dat$family_id) %>% length() , "families in data set. ")
#> There are 219 families in data set.

# Examine family sizes.
fam_sizes <- fam_dat %>% group_by(family_id) %>% count()

cat( "Family size is on average ", 
     fam_sizes$n %>% mean() %>% round(1), 
     " and ranges from ", 
     fam_sizes$n %>% min(), " to ", 
     fam_sizes$n %>% max(), ".", sep = "")
#> Family size is on average 4.9 and ranges from 2 to 7.

It is good practice to report the distribution of family size together with the analysis results.
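
For example, a simple frequency table of family sizes (using the ‘fam_sizes’ object from above) can be reported alongside the results:

# Tabulate how many families there are of each size.
table(fam_sizes$n)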

Besides sample data, we need some information about the population in order to calculate weights that correct for ascertainment bias.

Step 2: Acquire population incidence rate in the correct format.

In the weight calculation ‘wcox’ uses the difference between the incidence rate of the event of interest in the sample versus the population. Therefore, the incidence rate in the population at risk is needed.

The incidence rate is the number of new events in a certain time window divided by the size of the population at risk.

Think of sentences like “Breast cancer occurs in xx out of 100 000 women between age xx and xx.” For many (medical) events, such data is available in national registries, for example the Dutch cancer registry (https://iknl.nl/nkr/cijfers-op-maat; select crude rate for incidence to get the rate per 100 000). For high-risk populations, such as carriers of certain pathogenic genetic variants, incidence rates are often available in scientific publications or can be inferred by multiplying published hazard ratios with the general population-based incidence rates.
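
As a worked example of the arithmetic, with made-up numbers purely for illustration: if a registry were to report 340 new cases among 25 000 women in some age group during the registration period, the incidence rate per 100 000 would be:

# Hypothetical counts, for illustration only.
n_events <- 340
n_at_risk <- 25000
n_events / n_at_risk * 100000
#> [1] 1360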

Two aspects are important here. We need the incidence rate per 100 000 individuals, as that is what the package expects. And we need to choose age groups: ‘wcox’ allows you to select standard 5-year or 10-year age groups or to define age groups manually.

The population incidence needs to be a so-called vector, i.e. a sequence of incidence rates per age group. For entering the age group cut-offs there are two options. 1) We can specify them manually, in which case the vector ‘breaks’ needs to be one item longer than the incidence rate vector, because the end of the last age group is also included. As the ‘k’ column of the prepared data below shows, the intervals are of the form (begin, end], so in our example a follow-up time above 100 would fall outside the last age group (see the illustration after the code block below). 2) We can choose 5-year categories (0-4, 5-9, 10-14 et cetera) or 10-year categories (0-9, 10-19, 20-29 et cetera); ‘wcox’ will then define the cut-offs automatically.


# Enter the incidence by age group per 100 000, starting with the youngest age group.
incidence_rate <- c(2882, 1766, 1367, 1155, 987, 845, 775, 798, 636, 650)

# Define the age group cutoffs.
breaks_manually <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
breaks_10yrs <- "10 years"
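
As announced above, here is a minimal illustration of how ages map into these groups, mimicking the grouping with base R’s cut(). Note that Prepare_data() does the actual grouping internally; this is for illustration only.

# Illustrate the interval convention with a few example ages:
# 5 and 10 fall in (0,10], 38 falls in (30,40], and 100 falls in (90,100].
cut(c(5, 10, 38, 100), breaks = breaks_manually)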

We are almost ready to calculate.

Step 3: Prepare the sample data using Prepare_data().

The function Prepare_data() expects a data.frame in which each row concerns one individual (wide format). Moreover, it looks for the individual identifier, event indicator and time-to-event/censoring (age) by searching for the variables ‘id’, ‘d’ and ‘y’, respectively. In our case these variables are named differently and we have to rename them.

# Rename variables.
my_dat <- rename(fam_dat, id = individual_id, d = event_indicator, y = age)

Now it’s time to combine the three pieces (sample data, external information, choice of age categories) to prepare the data for weight calculation. Remember that the calculation is based on the comparison between the sample and the population, which is why the population incidence is needed.

The function Prepare_data() takes three arguments: ‘dat’ inputs the sample data.frame with ‘id’, ‘d’, ‘y’ and a family identifier; ‘population_incidence’ inputs the vector (i.e. a sequence of the form c(xxx, xxx, xxx, …)) with incidence rates per 100 000; ‘breaks’ inputs either a vector of breaks or one of the pre-set options (“5 years” or “10 years”). The code below uses the manual breaks. To try out the pre-set option for 10-year age groups, remove the “#” in front of the last lines.


# Using option 1 (manual cut-offs):
my_dat_out <- Prepare_data(dat = my_dat, population_incidence = incidence_rate, 
                           breaks = breaks_manually)

# Uncomment to use option 2 (pre-set cut-offs):
# my_dat_out <- Prepare_data(dat = my_dat, population_incidence = incidence_rate, 
# breaks = "10 years")

Let’s see what the prepared data looks like.


# Select the newly added columns.
my_dat_out %>% select(id, k, d, S_k, S_k.) %>% arrange(id) %>% head(8)
#>   id        k d       S_k      S_k.
#> 1  1  (30,40] 1 0.8909206 0.9182261
#> 2  2  (60,70] 1 0.9254270 0.9634811
#> 3  3 (90,100] 0 0.9370675 0.9924243
#> 4  4  (10,20] 1 0.8381150 0.8331193
#> 5  5   (0,10] 1 0.7496117 0.6760318
#> 6  6   (0,10] 1 0.7496117 0.6760318
#> 7  7   (0,10] 1 0.7496117 0.6760318
#> 8  8   (0,10] 0 0.7496117 0.6760318

The weights will be calculated based on the newly added columns S_k and S_k. .

In the paper, sample-based quantities carry a superscript ‘0’, which is replaced by a dot (.) here: S_k. is the sample-based quantity, while S_k is based on the external information (the population incidence rate).
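
To see the ascertainment bias that the weights will correct for, one can compare the two columns per age group; a small sketch using the prepared data:

# Compare the population-based (S_k) and sample-based (S_k.) quantities per age group.
my_dat_out %>% distinct(k, S_k, S_k.) %>% arrange(k)

In the output shown above, S_k. is lower than S_k in the youngest age group (0.68 versus 0.75 for (0,10]), consistent with oversampling of events in the ascertained families.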

With the prepared data set, we are ready to calculate the weights!

Step 4: Calculate weights using Calculate_weights().

Function Calculate_weights() requires the data set prepared in the previous step, which we will refer to as prepared data. N.B.: Using the original data set directly will fail as the external information is not integrated there.

# Calculate weights using the object that is output from the Prepare_data() function.

w <- Calculate_weights(dat = my_dat_out)
#> [1] "No negative weights"

The function indicates that there are no negative weights. This means that our weights are valid. We will have a look at them now.

# Show the weights.
my_dat_out %>% mutate(weight = w) %>% 
  select(id, d, weight) %>% 
  arrange(id) %>% 
  filter(id %in% c(1,3))
#>   id d   weight
#> 1  1 1 1.374799
#> 2  3 0 1.000000

Individual 1 (id = 1) experiences the event (d = 1) in the age interval (30, 40] and receives a weight of about 1.37. Individual 3 never experiences the event during follow-up (d = 0), is censored within the interval (90, 100] and gets a weight of 1.
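
As an optional check, one can summarize the calculated weights by event status; a small sketch:

# Summarize the weights for events (d = 1) and censored observations (d = 0).
my_dat_out %>% 
  mutate(weight = w) %>% 
  group_by(d) %>% 
  summarise(mean_weight = mean(weight), 
            min_weight = min(weight), 
            max_weight = max(weight))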

Step 5: Fit a weighted Cox model using R package ‘survival’.

Now, we show how to fit a Cox proportional hazards model with the calculated weights, using R package ‘survival’, to correct for ascertainment bias. Weights can be included using the argument ‘weights’. Note that because we entered the exact same data.frame into Calculate_weights() in the previous step, i.e. the prepared data, the resulting weight vector can be used directly: the order of individuals is the same. The covariate of interest is risk modifier ‘x’. In order to obtain robust standard errors, family membership needs to be supplied as a cluster term (here via the ‘cluster’ argument).

# Fit the model.
fit <- coxph(Surv(y, d == 1) ~ x, data = my_dat_out, weights = w, cluster = family_id)

fit
#> Call:
#> coxph(formula = Surv(y, d == 1) ~ x, data = my_dat_out, weights = w, 
#>     cluster = family_id)
#> 
#>      coef exp(coef) se(coef) robust se    z      p
#> x 1.39863   4.04963  0.06048   0.06789 20.6 <2e-16
#> 
#> Likelihood ratio test=600.2  on 1 df, p=< 2.2e-16
#> n= 1069, number of events= 631

What does this say about the effect of the risk modifier?

# Extract estimates.
fit.summary <- summary(fit)

# Summarize findings.
cat("Covariate effect (95% CI): ",
exp(fit.summary$coefficients[1]), " (",
exp(fit.summary$coefficients[1] - 1.96*fit.summary$coefficients[4]), ", ",
exp(fit.summary$coefficients[1] + 1.96*fit.summary$coefficients[4]), ").", sep = "")
#> Covariate effect (95% CI): 4.049629 (3.545061, 4.626013).

The risk of experiencing the event in the next instant of time is estimated to be about 4 times higher for a unit increase in the risk modifier (hazard ratio 4.05). The corresponding 95% confidence interval does not include 1, so this positive association is significant (using alpha = 0.05).
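
To gauge the impact of the weighting, one can refit the model without weights and compare the estimate for ‘x’ (a comparison sketch; output not shown here):

# Unweighted fit, for comparison with the weighted analysis above.
fit_unweighted <- coxph(Surv(y, d == 1) ~ x, data = my_dat_out)
fit_unweighted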

References

Citation paper.