The Statistical Analysis of Misreporting on Sensitive Survey Questions

Using the misreport package in R

Gregory Eady (February 26, 2017)

1. Introduction

This document provides a brief introduction to the R package misreport. The package implements the method introduced in Eady (Forthcoming) to permit researchers to statistically examine the predictors of misreporting on sensitive survey questions. In brief, the goal of the method is to model whether survey respondents provide one response to a sensitive item in a list experiment — a measurement technique designed to elicit a truthful response — but answer otherwise when asked to reveal that response openly on a direct question.1 misreport is made available through the Comprehensive R Archive Network (CRAN).
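
Because the package is hosted on CRAN, it can be installed in the usual way:

install.packages("misreport")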

2. Covariates

Before turning to the data-generating process that characterizes the setup for the list experiment and direct question, we first simulate survey respondents and their characteristics. These respondent characteristics will later be used as predictors of the control items, sensitive belief, and misreporting.

To begin, let’s say that we have \(10000\) respondents \(i = 1, \ldots, 10000\) in a data.frame A:

n <- 10000
A <- data.frame(i = 1:n)

For concreteness, we will generate respondent characteristics that we will say represent age, gender, and education:

A$age <- round(runif(n, 18, 90))
A$gender <- sample(c("Woman", "Man"), n, replace = TRUE, prob = c(0.52, 0.48))
A$education <- sample(c("Below high school",
                        "High school",
                        "College"),
                      n, replace = TRUE, prob = c(0.25, 0.35, 0.4))
A$education <- factor(A$education, levels = c("Below high school",
                                              "High school",
                                              "College"))

Our simulated data now appear as follows:

# Display first 10 rows of the data
print(A[1:10, ], row.names = FALSE)
  i age gender         education
  1  47  Woman       High school
  2  87  Woman Below high school
  3  57    Man Below high school
  4  73    Man           College
  5  58    Man           College
  6  48  Woman       High school
  7  52  Woman       High school
  8  62  Woman       High school
  9  35    Man       High school
 10  36  Woman Below high school

3. Data-generating process

We now simulate responses to a list experiment and a direct question. In this vignette, we’ll assume that answering affirmatively to the sensitive item is to give the socially unacceptable response. In other words, responses of \(Z_i = 1\) or \(D_i = 1\) indicate providing the socially unacceptable response, where \(Z_i\) denotes the response to the sensitive item in the list experiment and \(D_i\) denotes the response to the direct question.

To begin, we’ll first assign each respondent at random to the treatment or control group:

A$treatment <- sample(c(rep(0, n/2), rep(1, n/2)), n)

We’ll define the population parameters in the sensitive-item sub-model as follows and then simulate whether respondents hold the sensitive belief:

param_sensitive <- c("Intercept" = -0.5,
                     "gender (Woman)" = 0.25,
                     "age" = 0.01,
                     "education (High school)" = -0.3,
                     "education (College)" = -0.5)

lin_pred <- cbind(1,
                  A$gender == "Woman",
                  A$age,
                  A$education == "High school",
                  A$education == "College") %*% param_sensitive

# Simulate whether respondents hold sensitive belief
A$true_belief <- rbinom(n, 1, prob = plogis(lin_pred))

The proportion of respondents in our sample that holds the sensitive belief is:

prop.table(table(A$true_belief))

     0      1 
0.5319 0.4681 

We’ll now simulate whether each respondent who holds the sensitive belief misreports it when asked directly. For the purpose of this vignette, we’ll set the effect of treatment assignment on misreporting to 0. In other words, respondents who receive the treatment list are neither more nor less likely to misreport on the direct question when it is asked later in the survey (one can also relax this constraint and model it if desired).

param_misreport <- c("Intercept" = -0.5,
                     "gender (Woman)" = -0.3,
                     "age" = -0.01,
                     "education (High school)" = 0.3,
                     "education (College)" = 0.5,
                     "treatment" = 0)

lin_pred <- cbind(1,
                  A$gender == "Woman",
                  A$age,
                  A$education == "High school",
                  A$education == "College",
                  A$treatment) %*% param_misreport

A$misreport <- rbinom(n, 1, prob = plogis(lin_pred))

# By the monotonicity assumption, only those who hold the sensitive
# belief can misreport it: if true_belief = 0, a respondent does not
# misreport.
A$misreport[A$true_belief == 0] <- 0

Lastly, for the control-items sub-model, we’ll set the number of control items to \(J = 4\). We’ll also set the parameter \(U\) in the control-items sub-model to 0. In other words, those who misreport do not respond to the control items differently from those who do not (this too can be relaxed).

J <- 4

param_control <- c("Intercept" = -0.25,
                   "gender (Woman)" = 0.25,
                   "age" = 0.01,
                   "education (High school)" = -0.25,
                   "education (College)" = -0.5,
                   "U" = 0,
                   "Z" = 0.25)

lin_pred <- cbind(1,
                  A$gender == "Woman",
                  A$age,
                  A$education == "High school",
                  A$education == "College",
                  A$misreport,
                  A$true_belief) %*% param_control

# Simulate responses to the control items
A$y_star <- rbinom(n, J, prob = plogis(lin_pred))

Putting it all together, we can calculate respondents’ answers to both the list experiment and direct question:

# List experiment response
A$y <- A$y_star + A$true_belief * A$treatment

# Direct question response
A$direct <- ifelse(A$misreport == 1, 0, A$true_belief)

Our data now appear as follows:

# Display first 10 rows of the data
print(A[1:10, ], row.names = FALSE)
  i age gender         education treatment true_belief misreport y_star y direct
  1  47  Woman       High school         0           0         0      3 3      0
  2  87  Woman Below high school         0           1         0      3 3      1
  3  57    Man Below high school         0           1         1      4 4      0
  4  73    Man           College         1           1         0      3 4      1
  5  58    Man           College         0           1         1      3 3      0
  6  48  Woman       High school         0           1         1      2 2      0
  7  52  Woman       High school         0           1         1      1 1      0
  8  62  Woman       High school         0           0         0      0 0      0
  9  35    Man       High school         1           0         0      4 4      0
 10  36  Woman Below high school         1           0         0      2 2      0

In our simulated data, the first three respondents happen to represent the three respondent types of interest: respondent 1 does not hold the sensitive belief; respondent 2 holds the sensitive belief and reports it truthfully on the direct question; and respondent 3 holds the sensitive belief but misreports it when asked directly.
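
As a quick check, we can label each respondent's latent type from the simulated columns (the vector type below is a helper created only for illustration):

# Classify respondents by latent type using the simulated variables
type <- with(A, ifelse(true_belief == 0, "Non-sensitive",
                ifelse(misreport == 1, "Misreport sensitive",
                       "Truthful sensitive")))
type[1:3]
[1] "Non-sensitive"       "Truthful sensitive"  "Misreport sensitive"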

4. Analysis

To model the responses to the list experiment and direct question, we use the function listExperiment(), the workhorse of the misreport package. This function jointly models (1) the responses to the control items, (2) the response to the sensitive item, and (3) whether the response to the direct question reflects a respondent's true belief.

We run the function as follows:

library(misreport)

model.1 <- listExperiment(y ~ 1 + gender + age + education,
                          data = A, J = J,
                          treatment = "treatment",
                          direct = "direct",
                          sensitive.response = 1,
                          control.constraint = "partial",
                          misreport.treatment = FALSE)

There are two notable arguments here. First, control.constraint is set to "partial". This matches our simulated data, in which those who misreport do not respond differently to the control items than those who do not misreport (i.e. implicitly \(U = 0\) in the control-items sub-model). To relax this constraint, we could instead set the argument to "none", which includes a parameter for \(U\) in the control-items sub-model to be estimated. Alternatively, setting control.constraint to "full" would remove both of the parameters \(U\) and \(Z\) from the control-items sub-model (i.e. implicitly set \(U = 0\) and \(Z = 0\)). This would amount to assuming that neither misreporting nor holding the sensitive belief predicts responses to the control items.

Second, misreport.treatment is set to FALSE. This matches our data-generating process, in which receipt of the treatment list does not affect whether respondents misreport. Theoretically, such a relationship might exist if respondents in the treatment group recall their response to the sensitive item in the list experiment and provide the same response to the direct question for reasons of, say, cognitive simplicity, or to be consistent on principle. Researchers are advised to place the list experiment and the direct question far apart in a survey when possible to help avoid this. If treatment assignment does affect misreporting, however, setting misreport.treatment to TRUE adds a parameter for treatment status to the misreport sub-model, as shown below.
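
For reference, the two alternative specifications discussed above would be fit as follows (a sketch: neither matches the data-generating process simulated here, and the object names model.2 and model.3 are ours):

# Estimate the parameter for U in the control-items sub-model
model.2 <- listExperiment(y ~ 1 + gender + age + education,
                          data = A, J = J,
                          treatment = "treatment",
                          direct = "direct",
                          sensitive.response = 1,
                          control.constraint = "none",
                          misreport.treatment = FALSE)

# Allow treatment assignment to predict misreporting
model.3 <- listExperiment(y ~ 1 + gender + age + education,
                          data = A, J = J,
                          treatment = "treatment",
                          direct = "direct",
                          sensitive.response = 1,
                          control.constraint = "partial",
                          misreport.treatment = TRUE)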

After model fitting, summary output from the model can be obtained using summary() as follows:

# Show 3 significant digits
summary(model.1, digits = 3)

List experiment sub-models

Call: listExperiment(formula = y ~ 1 + gender + age + education, data = A, 
    treatment = "treatment", J = J, direct = "direct", sensitive.response = 1, 
    control.constraint = "partial", misreport.treatment = FALSE, 
    n.runs = 1)

CONTROL ITEMS Pr(Y* = y)
                       est.    se       z     p
(Intercept)          -0.192 0.048  -3.975 0.000
genderWoman           0.274 0.026  10.465 0.000
age                   0.009 0.001  14.956 0.000
educationHigh school -0.280 0.034  -8.322 0.000
educationCollege     -0.543 0.033 -16.507 0.000
Z                     0.240 0.027   8.976 0.000
---

SENSITIVE ITEM Pr(Z* = 1)
                       est.    se      z     p
(Intercept)          -0.400 0.183 -2.185 0.029
genderWoman           0.024 0.101  0.237 0.813
age                   0.012 0.002  5.279 0.000
educationHigh school -0.335 0.118 -2.826 0.005
educationCollege     -0.619 0.118 -5.229 0.000
---

MISREPORT Pr(U* = 1)
                       est.    se      z     p
(Intercept)          -0.260 0.297 -0.876 0.381
genderWoman          -0.524 0.166 -3.168 0.002
age                  -0.010 0.004 -2.557 0.011
educationHigh school  0.291 0.192  1.511 0.131
educationCollege      0.364 0.199  1.827 0.068
---

Observations: 10,000 (0 of 10,000 observations removed due to missingness)
Log-likelihood -20398.8
# Recalling the population parameters for comparison:
Control:   -0.25  0.25  0.01 -0.25 -0.50  0.25 
Sensitive: -0.50  0.25  0.01 -0.30 -0.50 
Misreport: -0.50 -0.30 -0.01  0.30  0.50
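
Note also that the estimated coefficient vectors and the variance-covariance matrix of the maximum-likelihood estimates are stored as components of the fitted model object; we make direct use of these components in Section 5.2:

# Coefficient vectors for each sub-model and the MLE variance-covariance matrix
model.1$par.control
model.1$par.sens
model.1$par.misreport
model.1$vcov.mle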

5. Some useful quantities of interest

5.1. Predicted probabilities

Researchers will often want to summarize model output by generating predicted probabilities. The simplest way to do this is with the function predict(), which generates predictions separately for whether an individual holds the sensitive belief (z.hat) and whether an individual misreports (u.hat). To demonstrate, we focus below on the variable gender and calculate the mean difference in the predicted probability of misreporting between men and women:

# Predicted probabilities of misreporting (covariates held at their observed values)
pred.woman <- predict(model.1, newdata = data.frame(A[, names(A) != "gender"], gender = "Woman"))$u.hat

pred.man <- predict(model.1, newdata = data.frame(A[, names(A) != "gender"], gender = "Man"))$u.hat
# Mean predicted probability of misreporting (gender set to "Woman")
mean(pred.woman)
[1] 0.2587654
# Mean predicted probability of misreporting (gender set to "Man")
mean(pred.man)
[1] 0.3694153
# Mean difference in predicted probabilities of misreporting between women and men
mean(pred.woman - pred.man)
[1] -0.1106498

Because misreporting occurs, by assumption, only among those who hold the sensitive belief, we might prefer to calculate differences in predicted probabilities only among the sub-sample of respondents who hold the sensitive belief. By the nature of the list experiment, we don’t know who among our sample truly holds the sensitive belief, but we can instead weight by the posterior predicted probabilities that each respondent does. These probabilities fall out naturally from the EM procedure and are available in the fitted model object as the component w:

# Display first 10 rows of posterior predicted probabilities
model.1$w[1:10, ]
      Misreport sensitive Truthful sensitive Non-sensitive
 [1,]           0.2204961                  0     0.7795039
 [2,]           0.0000000                  1     0.0000000
 [3,]           0.3737128                  0     0.6262872
 [4,]           0.0000000                  1     0.0000000
 [5,]           0.2677970                  0     0.7322030
 [6,]           0.1824359                  0     0.8175641
 [7,]           0.1507975                  0     0.8492025
 [8,]           0.1256164                  0     0.8743836
 [9,]           0.6479408                  0     0.3520592
[10,]           0.0731082                  0     0.9268918

If we sum across the first two columns of each row, we have the probabilities that each respondent holds the sensitive belief.2 We might therefore weight our predicted probabilities of misreporting as follows:

weighted.mean(pred.woman - pred.man, rowSums(model.1$w[, 1:2]))
[1] -0.1088317

Note that whether a respondent is a man or woman also affects responses to the sensitive item itself, and thus who is predicted to hold the sensitive belief in the first place. Researchers should therefore choose the sample over which these predicted probabilities are averaged as appropriate to the question at hand.

5.2. Simulating predicted probabilities

To quantify uncertainty in the predicted probabilities, we can use simulation (King, Tomz, and Wittenberg 2000). First, we’ll simulate 1000 batches of model parameters:

n_sims <- 1000

# Simulate model parameters
coefs <- c(model.1$par.control, model.1$par.sens, model.1$par.misreport)
par_sim <- mvtnorm::rmvnorm(n_sims, coefs, model.1$vcov.mle)

# Get the matrix of parameters for the misreport submodel
# Note that the parameter estimates in par_sim are in the following order: control, sensitive, misreport
par_sim_misreport <- par_sim[, (length(coefs)-length(model.1$par.misreport)+1):length(coefs)]

We now have 1000 batches of parameters for the misreport sub-model, the first 10 rows of which appear as follows:

par_sim_misreport[1:10, ]
      (Intercept) genderWoman          age educationHigh school educationCollege
 [1,]  -0.4997965  -0.3554420 -0.008850399           0.49011359        0.4378888
 [2,]   0.1570283  -0.8961896 -0.014097916           0.19254676        0.4921500
 [3,]   0.0760989  -0.5320369 -0.014109480          -0.02994769        0.2643022
 [4,]  -0.4323941  -0.5119460 -0.006656752           0.37450219        0.6242772
 [5,]  -0.3539957  -0.3561838 -0.010066673           0.32685083        0.4652617
 [6,]  -0.3599190  -0.3308619 -0.006737198           0.01199662        0.4432342
 [7,]  -0.2213338  -0.5662163 -0.011097655           0.31579604        0.3449018
 [8,]  -0.3359534  -0.5654059 -0.005542622           0.28559999        0.1076779
 [9,]  -0.5889371  -0.1545445 -0.008821937           0.66426281        0.6433846
[10,]  -0.3778983  -0.5888047 -0.006623658           0.44411094        0.1663846

We can calculate the mean difference in predicted probabilities for each batch of parameters using predict(), setting gender first to "Woman" and then to "Man":

pp_diff <- rep(NA, n_sims)

# For each row of parameters, calculate the mean difference in predicted probabilities
for(i in 1:n_sims) {

  pp_woman <- predict(model.1, newdata = data.frame(A[, names(A) != "gender"], gender = "Woman"), par.misreport = par_sim_misreport[i, ])$u.hat

  pp_man <- predict(model.1, newdata = data.frame(A[, names(A) != "gender"], gender = "Man"), par.misreport = par_sim_misreport[i, ])$u.hat

  pp_diff[i] <- mean(pp_woman - pp_man)

}

Alternatively, we could use weighted.mean() instead of mean() to weight the predicted differences in misreporting by the predicted probabilities of each respondent holding the sensitive belief or those calculated using the parameters in the sensitive-item sub-model.
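
A minimal sketch of that alternative, reusing the loop above but weighting each simulated difference by the posterior probabilities that respondents hold the sensitive belief (the names w_z and pp_diff_weighted are ours):

# Posterior probabilities of holding the sensitive belief
w_z <- rowSums(model.1$w[, 1:2])

pp_diff_weighted <- rep(NA, n_sims)

for(i in 1:n_sims) {

  pp_woman <- predict(model.1, newdata = data.frame(A[, names(A) != "gender"], gender = "Woman"), par.misreport = par_sim_misreport[i, ])$u.hat

  pp_man <- predict(model.1, newdata = data.frame(A[, names(A) != "gender"], gender = "Man"), par.misreport = par_sim_misreport[i, ])$u.hat

  # Weight each respondent's difference by Pr(holding the sensitive belief)
  pp_diff_weighted[i] <- weighted.mean(pp_woman - pp_man, w_z)

}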

Finally, we use these simulated differences to quantify our uncertainty in the predicted difference in misreporting between men and women:

mean(pp_diff)
[1] -0.1101213
quantile(pp_diff, c(0.05, 0.95)) # 90% interval
        5%        95% 
-0.1679831 -0.0486495 

References


Blair, Graeme, Kosuke Imai, Bethany Park, Alexander Coppock, and Winston Chou. 2016. “list: Statistical Methods for the Item Count Technique and List Experiment.” The Comprehensive R Archive Network (CRAN).

Eady, Gregory. Forthcoming. “The Statistical Analysis of Misreporting on Sensitive Survey Questions.” Political Analysis, 1–19.

King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 341–55.


  1. Another useful vignette on the analysis of a list experiment and direct question is provided by the authors of the list package (Blair et al. 2016).

  2. Note that the “Truthful sensitive” column contains only probabilities of 0 or 1 because respondents either admit to holding the sensitive belief on the direct question or do not. By the monotonicity assumption, those who openly admit to holding the sensitive belief are assumed not to be misreporting.