The twangRDC R package is a streamlined version of the
twang R package that was created specifically for use in the Federal Statistical Research Data Centers (FSRDCs), which are restricted data enclaves run by the Census Bureau to promote research use of data products from the Census Bureau and other federal statistical agencies. In particular, the twangRDC package contains functions to address linkage failures, which may systematically affect some records, creating bias in population estimates. Further, the same functions can be used to ensure that a comparison group is equivalent to a treatment group on observable characteristics.
The package utilizes gradient boosted models to estimate weights for linkage failures and comparison group construction. Results using twangRDC will not necessarily be reproduced using
twang due to important differences in implementation. The twangRDC package allows for much larger datasets, like those used in population sciences, and many more covariates, but users should note that with smaller datasets, the
twang package is computationally more efficient. The twangRDC package uses xgboost for gradient boosting, which provides users with an optimized gradient boosting library that can utilize parallel computation within the FSRDCs.
The twangRDC package uses gradient boosted models to derive weights for nonequivalent groups. The algorithm alternates between adding iterations to the gradient boosted model (increasing its complexity) and evaluating balance between the two groups on the covariates. The algorithm automatically stops when additional iterations no longer improve balance. The package allows the user to generate weights for two different scenarios: linkage failures and the construction of a comparison group.
If the goal is the construction of a comparison group, twangRDC can apply gradient boosted propensity score weights for the average treatment effect on the treated. Treatment observations receive a weight of 1, while comparison observations receive a weight of the odds of treatment based on the propensity score model. The steps of the algorithm are described here assuming the user specifies strata in the model, but the steps are the same without strata by considering the data coming from a single stratum. Guidance on specific decisions relevant for each step is provided throughout the examples.
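The ATT weighting scheme just described can be sketched in a few lines of base R. This is an illustration, not twangRDC internals; the propensity scores and treatment indicator below are simulated stand-ins.

```r
# Illustrative sketch of ATT weighting (not twangRDC code): treated
# units get weight 1; comparison units get the odds of treatment,
# p / (1 - p), where p is the estimated propensity score.
set.seed(1)
p     <- runif(10, 0.1, 0.9)   # hypothetical propensity scores
treat <- rbinom(10, 1, p)      # hypothetical treatment indicator
w     <- ifelse(treat == 1, 1, p / (1 - p))
```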
At each step, the algorithm adds iters.per.step iterations to the gradient boosted model.
The primary function of twangRDC is
ps.xgb, which performs the methodology previously described. In Table 1, we describe the options of
ps.xgb. Additional details can be found in the help file.
| Option | Description |
|--------|-------------|
| `formula` | A symbolic description of the model to be fit, with an observed data indicator or a treatment indicator on the left side of the formula and the variables to be balanced on the right side. |
| `linkage` | An indicator of whether the weighting should be for linkage failure or comparison group construction. |
| `strata` | An optional factor variable identifying the strata. If specified, balance is optimized within strata. |
| `params` | xgboost parameters. Details below. |
| `file` | An optional filename to save intermediate results. The file is overwritten at each step of the algorithm. |
| `iters.per.step` | An integer specifying the number of iterations to add to the model at each step of the algorithm. |
| `max.steps` | An integer specifying the maximum number of steps to take while optimizing the gradient boosted model. The maximum number of iterations considered during optimization is `max.steps * iters.per.step`. |
| `id.var` | A variable that uniquely identifies observations. |
| `min.width` | An integer specifying the minimum number of iterations between the current number of model iterations and the optimal value. |
| `save.model` | A logical value indicating whether the xgboost model is saved as part of the output object. |
| `weights` | An optional variable identifying user-defined weights to be incorporated into the optimization, e.g., sampling design weights. If specified, the output automatically accounts for these weights. |
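For orientation, the options above might be combined as in the following call for the linkage-failure case. This is a hypothetical sketch using the simulated variables described later in this vignette; the argument values are illustrative, not a tested specification.

```r
library(twangRDC)

# Hypothetical linkage-failure weighting call; variable names follow
# the simulated data file described below, and the tuning values are
# illustrative starting points only.
params <- list(eta = 0.3, max_depth = 5, min_child_weight = 50,
               objective = "binary:logistic")

res.pik <- ps.xgb(sim_pik ~ c00_age12 + c00_sex + c00_race + c00_nphu,
                  strata = "tract_id_str",
                  data = dta,
                  params = params,
                  max.steps = 25,
                  iters.per.step = 100,
                  id.var = "id",
                  linkage = TRUE)
```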
A detailed description of the xgboost options can be found in the xgboost documentation. Here, we briefly describe the options that are most useful in this context.
`eta` is a value between 0 and 1 that controls the learning rate. Smaller values of `eta` reduce overfitting but require additional iterations to achieve optimal balance.

`max_depth` is the maximum allowable depth of the gradient boosted trees. A larger value allows the model to consider more complex interactions between the covariates. Increasing `max_depth` may require a reduction in `eta` to avoid overfitting.

`min_child_weight` is the minimum number of observations needed to further partition a tree. If a leaf node has fewer than `min_child_weight` observations, the tree partitioning process stops. Users can conceptualize this as similar to the number of observations in a level of a categorical variable in a standard regression model. Note that if weights are specified in the `weights` option, `min_child_weight` references the sum of the weights.
The options for twangRDC should be set to achieve optimal balance. Many of the arguments are tuning parameters that should be adjusted to explore whether the resulting model achieves better or worse balance than other choices. We recommend that users run the models several times, modifying the arguments to optimize their choice. In most cases, we have found that small changes to the arguments will not result in substantial differences in the underlying model fit. However, it is important to explore this possibility for each new analysis. Some general recommendations are provided here.
`formula` should include all covariates that are to be balanced. When working with small to moderate sample sizes, achieving balance on a large set of covariates is likely not feasible, and the user should keep the choice of covariates as parsimonious as possible. The underlying gradient boosted model will automatically consider interactions and nonlinearities, but balance is only calculated based on the form of the covariates in the `formula`. To assess balance for specific interactions, the interactions must be included in the `formula`.
Larger values of `iters.per.step` will reduce the computation time necessary for the model to finish. However, there is a tradeoff between the choice of `iters.per.step` and the balance achieved by the model: larger values will result in worse balance. We recommend values between 50 and 500 for datasets exceeding 100,000 records.
`min_child_weight` should be set, in combination with the other tuning parameters, to values that allow the xgboost model to run for at least a few thousand iterations. If the optimal iteration is too small, users can decrease `min_child_weight`. Conversely, if the algorithm does not stop within the user's expectations, `max_depth` can be increased or `min_child_weight` decreased. Our experience suggests that values of `eta` between 0.1 and 0.3 are a good place to start.
We will highlight two uses of the twangRDC R package. First, we will generate weights that address potential bias due to linkage failures by ensuring individuals with PIKs are representative of their geographic region’s population. Second, we will highlight using the twangRDC R package to generate propensity score weights such that a group of southern metropolitan areas can be used as a comparison group for residents of Orleans Parish, Louisiana, which corresponds exactly with the boundaries of the City of New Orleans.
A simulated data file was created for use in this vignette. It contains simulated records for residents of Orleans Parish and three other metropolitan areas in the South census region (Atlanta, GA, Memphis, TN, and Washington, D.C.). We built the file exclusively from public use data, but it mirrors the structure of restricted versions of the 2000 Decennial Census available through the FSRDC network.² Each simulated record includes basic individual demographic characteristics and basic household characteristics. Each record also includes two important indicators, one to simulate whether the individual record received a PIK and another to denote whether the individual lived in Orleans Parish.
The data was created by extracting all variables for households and individuals from the 5% Integrated Public Use Microdata Sample (IPUMS³) of the 2000 Decennial Census. We simulated assignment of households to census tracts and attached tract identifiers and characteristics.
We also simulated an indicator of PIK assignment (PIK=yes/no) to person records. Public use data do not include PIK assignment, so we estimated a predicted probability of receiving a PIK using estimated regression parameters from Bond et al.⁴ We added a random error to the predicted probability so that the PIK assignment status is not deterministic and converted the predicted probability into a dichotomous PIK=yes/no variable.
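The dichotomization step can be sketched as follows. This is a toy illustration with made-up probabilities, not the authors' simulation code.

```r
# Toy illustration: convert predicted PIK probabilities into a binary
# indicator after adding random error (all values are made up).
set.seed(2000)
p_hat   <- plogis(rnorm(1000, mean = 1))                     # predicted probability of a PIK
p_noisy <- pmin(pmax(p_hat + rnorm(1000, sd = 0.05), 0), 1)  # add error, keep in [0, 1]
sim_pik <- rbinom(1000, 1, p_noisy)                          # dichotomous PIK = yes/no
```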
Lastly, we pooled Orleans Parish records with those from other southern metropolitan areas, created an indicator for Orleans Parish residence and, for the purposes of this vignette, sampled the data to shrink the size of the dataset. The simulated file contains individual records for 8,893 residents of Orleans Parish, LA and 9,503 individual records for residents from select southern metropolitan areas.
Table 2 provides a description of the data elements included in the simulated data file.
| Data element | Description | Labels and Coding |
|--------------|-------------|-------------------|
| `metarea` | A categorical variable for metropolitan area. | Atlanta, GA (52); Memphis, TN/AR/MS (492); New Orleans, LA (556); Washington, DC/MD/VA (884) |
| `c00_age12` | A categorical variable for age in years at the 2000 Decennial Census. | 0 to 2 years old (1); 3 to 5 years old (2); 6 to 9 years old (3); 10 to 14 years old (4); 15 to 18 years old (5); 19 to 24 years old (6); 25 to 34 years old (7); 35 to 44 years old (8); 45 to 54 years old (9); 55 to 64 years old (10); 65 to 74 years old (11); 75 and older (12) |
| `c00_sex` | A binary indicator of sex as reported on the 2000 Decennial Census. | Male (0); Female (1) |
| `c00_race` | A categorical variable for race as reported on the 2000 Decennial Census. | White (1); African American (2); American Indian or Alaskan Native (3); Other Asian or Pacific Islander (5); Some other race (6) |
| `c00_nphu` | The number of persons in housing unit as reported on the 2000 Decennial Census. | 1 to 16 |
| `tract_id_str` | Census tract identifier. | |
| `sim_pik` | Simulated binary indicator of PIK assignment. | No PIK assigned (0); PIK assigned (1) |
| `nola_rec` | Binary indicator for record from Orleans Parish. | Not Orleans Parish Record (0); Orleans Parish Record (1) |
| `id` | An ID variable that uniquely identifies individuals in the dataset. | |
The `get.weights` function extracts the weights at the optimal iteration. The resulting data contain the weights and the ID variable specified in `id.var`. The weights can then be merged back into the original data using the `id.var`. Note that the base R `merge` function is slow compared to modern alternatives; if your data is large, consider a faster alternative such as the `data.table` package.
```r
# extract weights
w = get.weights(res.pik)

# merge weights into data
dta = merge(dta, w, by = 'id', all = TRUE)
```
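For large files, the same merge can be done with the data.table package. The snippet below uses toy stand-ins for the analysis file and the extracted weights; data.table is not a twangRDC dependency.

```r
library(data.table)

# toy stand-ins for the analysis file and the extracted weights
dta <- data.table(id = 1:3, x = c(10, 20, 30))
w   <- data.table(id = 1:3, wgt = c(1.0, 0.8, 1.2))

# keyed merge; all = TRUE keeps records without a weight
dta <- merge(dta, w, by = "id", all = TRUE)
```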
These weights can be used in subsequent steps of the analysis. For example, if matching a specific marginal population total is necessary, the weights output by this function can be used as an input to a post-stratification step. Alternately, if the goal of the study is to compare two groups, these weights can be used as inputs into a propensity score model.
Our second example use of twangRDC will generate a comparison group for Orleans Parish consisting of residents of the three other southern metropolitan areas. The steps of the process remain the same as the PIK weighting example, but with minor adjustments to the interpretation and specification of the model. First,
linkage = FALSE specifies that we are no longer interested in weights to account for linkage failure, but instead, we wish to generate a comparison group using propensity score weights. Note that
`ps.xgb` only estimates the average treatment effect on the treated, with the treatment group identified by records with a value of 1 in the left-hand side of the formula. Second, our formula now includes an interaction between age and sex. This informs the algorithm to evaluate balance based on unique combinations of age and sex (e.g., it will attempt to balance the proportion of each sex-by-age combination). Currently, twangRDC only allows for specification of interactions between factor variables. Finally, we no longer specify a stratification variable, as we are attempting to create a comparison group that is representative of Orleans Parish as a whole. If we were interested in using the linkage failure weights from our previous example, they could be specified in the
weights option, and the output of this model would automatically account for those weights.
```r
# set the model parameters
params = list(eta = 0.3,
              max_depth = 5,
              subsample = 1,
              max_delta_step = 0,
              gamma = 0,
              lambda = 0,
              alpha = 0,
              min_child_weight = 50,
              objective = "binary:logistic")

# fit the xgboost model
res.ps = ps.xgb(nola_rec ~ c00_age12:c00_sex + c00_race,
                data = nola_south,
                params = params,
                max.steps = 25,
                iters.per.step = 100,
                min.iter = 1000,
                min.width = 500,
                id.var = "id",
                linkage = FALSE)
```
After fitting the model, the same steps covered in the linkage failure example should be used here as well. The first step is to evaluate the convergence of the model using the
plot function (Figure 2). We are verifying that the algorithm has run for a sufficient number of iterations such that a clear minimum has been achieved.
Next, we evaluate the performance of the algorithm by looking at balance tables (Table 5). Since we specified `linkage = FALSE`, the `bal.table` function will now produce the treatment group mean for each covariate, as well as the mean before and after propensity score weighting among comparison cases. It also provides the standardized differences for each covariate. As in the linkage failure example, the goal is for the standardized differences after weighting to all be close to zero.
| Treatment Group Mean | Unadjusted Comparison Group Mean | Adjusted Comparison Group Mean | Unadjusted Standardized Difference | Adjusted Standardized Difference |
|---|---|---|---|---|
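As a reference point, one common definition of the standardized difference reported in such tables can be computed as below. This is a sketch that scales by the treatment group standard deviation; twangRDC's exact formula may differ.

```r
# Standardized difference between a treatment-group mean and a
# (possibly weighted) comparison-group mean, scaled by the
# treatment-group standard deviation.
std_diff <- function(x_t, x_c, w_c = rep(1, length(x_c))) {
  (mean(x_t) - weighted.mean(x_c, w_c)) / sd(x_t)
}
```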
Since no strata were specified in the model, the
type='strata' option used in the linkage failure example is no longer useful. These weights can be extracted and merged in with the data using the same steps described in the linkage failure example.
For more information, tutorials, and tools for weighting and analysis of nonequivalent groups, see the TWANG website at https://www.rand.org/statistics/twang.html.
Wagner, Deborah and Mary Layne. The Person Identification Validation System (PVS): Applying the Center Administrative Records Research and Applications’ (CARRA) Record Linkage Software. US Census Bureau, Center Administrative Records Research and Applications Working Paper #2014-01. 2014.↩︎
Steven Ruggles, Sarah Flood, Ronald Goeken, Josiah Grover, Erin Meyer, Jose Pacas and Matthew Sobek. IPUMS USA: Version 10.0 [dataset]. Minneapolis, MN: IPUMS, 2020. https://doi.org/10.18128/D010.V10.0↩︎
Bond, Brittany, J. David Brown, Adela Luque and Amy O’Hara. The nature of the bias when studying only linkable person records: Evidence from the ACS. US Census Bureau, Center Administrative Records Research and Applications Working Paper #2014-08, 2014.↩︎