List Experiments for Sensitive Questions – a Methods Bleg

About a year ago, I wrote a blog post on issues surrounding data collection and measurement. In it, I talked about “list experiments” for sensitive questions, on which I was not sold at the time. However, now that I have a bunch of studies going to the field at different stages of data collection, many of which are about sensitive topics in adolescent female target populations, I am paying closer attention to them. In my reading and thinking about the topic and how to implement these methods in our surveys, I came up with a bunch of questions surrounding their optimal implementation. In addition, there is probably more to be learned about these methods to improve them further, opening up the possibility of experimenting with them when we can. Below are a bunch of things that I am thinking about, and, as we still have some time before our data collection tools are finalized, you, our readers, have a chance to help shape them with your comments and feedback.

What is a sensitive question? In any setting, a sensitive question is one for which you’d expect the likelihood of your respondents being untruthful to be significantly higher than for the rest of your questions. For various reasons, people are more reluctant to talk about some topics than others, even when the enumerator emphasizes their anonymity and privacy. However, in an RCT setting, I like to define a sensitive question as any key (primary) outcome for which the possibility of differential self-reporting bias is a concern. Think schooling, for example, in an evaluation of a scholarship program. If you have the resources, you can get other independent estimates of school attendance (random audits, etc.), but sometimes you don’t have the money to do this. Methods such as list experiments can perhaps then help.

What is a list experiment? The basic list experiment works something like this: you randomly split your sample into two groups: direct response and veiled response. In the direct response group, you ask your sensitive question directly, as you normally would when you don’t worry about self-reporting bias. In another section, you’d have a set of, say, four “innocuous statements.” The respondent is instructed to listen to these statements and respond with only the total number of correct statements. Notice that the enumerator does not learn the answers to the individual statements (more on caveats later). Via random assignment, this group serves as the control group that gives us the average number of correct statements in our sample. In the veiled response group, the sensitive question is included in the list of innocuous statements as an additional item. The difference in the average number of correct statements between the veiled and the direct response groups is then an estimate of the prevalence of your sensitive behavior in your sample. See this paper from a health messaging intervention in Uganda and this paper on antigay sentiment in the United States, as recent examples using this method.
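The arithmetic behind the estimator is just a difference in mean counts. A minimal sketch, with made-up counts standing in for real survey responses:

```python
# Difference-in-means estimator for a list experiment (all counts hypothetical).
import statistics

# Reported number of "correct statements" per respondent.
direct = [2, 3, 1, 2, 3, 2, 1, 3, 2, 2]  # 4 innocuous statements only
veiled = [2, 3, 2, 3, 3, 2, 2, 3, 2, 2]  # same 4 statements + the sensitive one

# Estimated prevalence of the sensitive behavior in the sample.
prevalence = statistics.mean(veiled) - statistics.mean(direct)
print(round(prevalence, 2))  # → 0.3, i.e., an estimated 30% prevalence
```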

The power tradeoff from experimenting within your experiment: Notice that this method requires you to split your sample. If you’re going to ignore the answers to your key outcome from the direct response group (as I am leaning towards doing), this leads to a straightforward loss of power from reduced sample size in your experiment. If you have the money and the energy, you could go to some individuals outside your study sample (who are identical to your study sample) and learn the average number of correct responses from them, after which you can subtract it from the counts of everyone in your study, all of whom would then be in the veiled response group. But that can be costly, and, unless you thought this through at baseline and have a group that you excluded just for this purpose, you cannot be sure that this average is what would have held in your sample. However, this is individual randomization, so you don’t need a lot of individuals allocated to the direct response group to get a good estimate of the average number of correct statements. Instead of allocating 50% of your sample, do, say, 10-20%.
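To see what an unequal split costs in precision for the prevalence estimate itself, here is a back-of-the-envelope calculation, under the purely illustrative assumptions that the standard deviation of the reported counts is 1.0 in both groups and the total sample is 2,000:

```python
# Standard error of the veiled-minus-direct difference under different splits.
# All numbers are hypothetical; SD of counts assumed equal in both groups.
import math

N = 2000   # total sample size (hypothetical)
sd = 1.0   # assumed SD of the reported counts in each group

def se_of_difference(share_direct):
    n_direct = N * share_direct
    n_veiled = N - n_direct
    return math.sqrt(sd**2 / n_direct + sd**2 / n_veiled)

for share in (0.5, 0.2, 0.1):
    print(f"{share:.0%} direct: SE = {se_of_difference(share):.4f}")
```

A 50/50 split minimizes the standard error of the prevalence estimate itself (about 0.045 here), but a 20/80 split raises it only to about 0.056 while freeing up 30% of the sample for the veiled version used in the main analysis – which is the tradeoff described above.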

Should I block the randomization by original treatment arms? I currently think that this is a good idea. It is possible that the answers to the innocuous questions you came up with are influenced by your original treatment, in which case you want more of a difference-in-differences than a single difference for the prevalence of the sensitive behavior by original treatment arm: for each individual, you could consider subtracting the average number of correct statements for the intervention group that individual belongs to, rather than the average for everyone. However, this becomes more demanding from a power perspective. So, I’d favor blocking the random assignment by intervention status, but then still analyzing the data by using the average in the entire direct response group – as long as there are no apparent large differences within this group by intervention arm.
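With blocking in place, the choice is between subtracting a pooled direct-group average and an arm-specific one. A small sketch, with hypothetical counts, of how the two can diverge when the original treatment shifts the innocuous answers:

```python
# Pooled vs. arm-specific baselines (all counts hypothetical).
import statistics

# Reported counts by original intervention arm.
direct = {"treatment": [2, 3, 2, 3], "control": [2, 2, 1, 2]}
veiled = {"treatment": [3, 3, 2, 3], "control": [2, 3, 2, 2]}

# Pooled baseline: one average across the entire direct response group.
pooled = statistics.mean(direct["treatment"] + direct["control"])

for arm in ("treatment", "control"):
    arm_specific = statistics.mean(veiled[arm]) - statistics.mean(direct[arm])
    vs_pooled = statistics.mean(veiled[arm]) - pooled
    print(arm, round(arm_specific, 3), round(vs_pooled, 3))
```

In these made-up numbers the treatment arm reports more correct innocuous statements on average (2.5 vs. 1.75), so the pooled baseline overstates prevalence in the treatment arm (0.625 vs. 0.25) and understates it in the control arm (0.125 vs. 0.5); the arm-specific version corrects this, at the cost of a noisier baseline per arm.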

What about the innocuous questions? Notice that, in my example above, if you answer 0 or 5 as your number of correct statements in the veiled response group, the enumerator still knows your answer to the sensitive question (and every other one). Perhaps you don’t care about one tail, because people in that tail are saying they did not bribe officials, etc. But what should we do about the other tail? One way to tackle this problem would be to come up with a statement that is obviously false. For example, in our study in Liberia, I could have a statement that says, “Barack Obama is the President of Liberia.” Then, no one (hopefully) would answer with a 5. However, quick respondents will catch on to this and know that the enumerator will discern their answer to the sensitive question if they say “4.” This is getting academic, but it’s been suggested to me that one way to get around this problem is to find one statement that a large majority of individuals in your sample would say is correct, and then find another statement that is highly negatively correlated with it. So, if you answer “correct” to one, you’d almost always answer “false” to the other (“I live in an apartment,” and “I have a pit latrine.”).

What experiment should I run? Currently, I am leaning towards experimenting with some unknowns. First, one method to deal with sensitive questions is CASI or ACASI (Audio Computer-Assisted Self-Interviewing). However, their effectiveness has not been established when it comes to eliciting more truthful answers on, say, risky sexual activity. I would first introduce a CASI/ACASI group in addition to the direct response group. Second, as I mentioned in the previous post linked above, the veiled response requires the respondents to listen, count/add, and report the total number of correct statements. Some people I talked to worry about the noise that this introduces. So, one way of dealing with this would be to have the veiled response group also use ACASI. Then, at least, they could take their time to count the correct number of statements on the screen before responding. Third, one could add a further improvement here to reduce noise: imagine that the list of questions is on one screen (in an ACASI setting where the enumerator cannot see the responses) and, as the respondent answers each question, the question disappears and a counter updates the number of correct statements. At the end, the screen has only a big number at its center. If you instructed the respondent, before the administration of the section, that the enumerator cannot know her answers due to this method of administration, she might be more truthful. But the question is whether she’d be more truthful than with simple ACASI, both of which require her to record the answer to each individual question on the screen, even though the answers cannot be seen by the enumerator.
So, an experiment may include a direct response group, an ACASI direct response group, a traditional veiled response group, an ACASI veiled response group, and an ACASI veiled response group where the respondent still answers the questions individually but knows that only the total is visible at the end (you could make the total disappear at the end as well, before the tablet is passed back to the enumerator). Lots of choices…

Finally, some people have commented that making a big deal about the innocuous questions, telling the respondent that the enumerator cannot know their answers, as well as the contrast between the innocuous and sensitive questions, may only draw more attention to the whole thing. A better method may be to simply bundle some questions that you would have in your survey anyway into the group of “innocuous questions” (so they’re not so innocuous) in the direct response group, and then add your sensitive one to these in the veiled response group without prompting the respondent about privacy, etc. at all. Most people will realize that their answer is hidden from the enumerator. This is a hypothesis that can be tested, as is the hypothesis that where these questions appear in the survey may matter.

If you have comments on any of these ideas, or have more ideas of your own, please use the comment section to share. Thanks!

Comments

Thanks for sharing. We just completed an effort where we applied this technique in Kerala in the context of gender-based violence. Here are the abstract and the link, in case they help.

This paper analyzes the incidence and extent to which domestic violence and physical harassment on public/private buses is underreported in Kerala, India, using the list randomization technique. The results indicate that the level of underreporting is over nine percentage points for domestic violence and negligible for physical harassment on public/private buses. Urban households, especially poor urban households, tend to have higher levels of incidence of domestic violence. Further, women and those who are professionally educated tend to underreport more than others. Underreporting is also higher among the youngest and oldest age cohorts. For physical harassment on public/private buses, rural population—especially the rural non-poor and urban females—tend to underreport compared with the rural poor and urban males.

Some additional comments
Hi Berk, we used a list randomization in our last survey to elicit condom use among sex workers in Senegal. We found that condom use was over-reported by 20 pp (condom use was 97% when self-reported and 77% with the list experiment). Overall, I think this is a good and inexpensive method to elicit sensitive behaviours in LMICs, and it does allow sub-group analysis, which can provide important information. In our case, we found that it was mainly HIV-positive sex workers and those at high risk of HIV who were not using condoms, as one may expect.
I have a couple of additional comments on the method. Firstly, while in theory the number and choice of non-key items should not affect the results, I strongly believe that they do, and that non-key items should be related to the sensitive behaviours. There is some evidence that non-key items related to sexual behaviors were better for eliciting unbiased HIV-related behaviors. For instance, Droitcour (1991) found that the use of non-key items that were unrelated to the key content made participants suspicious about the survey, and therefore reduced the success of the list randomization method. Despite this, most list randomizations I am aware of are designed using non-sensitive items that have nothing to do with the sensitive behaviour. Secondly, the results obtained from the list randomization are to some extent imprecise, and, given the implementation challenges when doing a list randomization, the method is often applied to small samples. In this case, using a double list randomization, where each group serves once as the control group and once as the treated group, can increase precision. Thirdly, I still believe that the condom use elicited with our list experiment is over-estimated, hence it would be useful to test the validity of the results obtained with the list randomization against other indirect elicitation methods or other sources (get condom use from clients?). Finally, I think that more methodological research should be undertaken to be able to use the results from the list randomization at the individual level, which is the main drawback of the method. We were thinking of using information from the sub-group analysis to estimate the probability of condom use from the list experiment at the participant level, but of course this is a bit tricky to implement.

The double list requires two separate list experiments: respondents who are in the treatment group for list A become the control group for list B, and vice versa. There is an example in Glynn (2010): https://scholar.harvard.edu/files/aglynn/files/sts.pdf
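Under this design, each list yields its own difference-in-means estimate, and the two can be averaged; every respondent contributes to both a veiled and a direct mean, which is where the precision gain comes from. A sketch with hypothetical counts:

```python
# Double list randomization estimator (all counts hypothetical).
import statistics

# Group 1 sees list A with the sensitive item and list B without it;
# group 2 sees the reverse.
g1_listA_veiled = [3, 2, 3, 3]
g1_listB_direct = [2, 2, 1, 2]
g2_listA_direct = [2, 3, 2, 2]
g2_listB_veiled = [2, 3, 2, 2]

# One prevalence estimate per list, then average the two.
est_A = statistics.mean(g1_listA_veiled) - statistics.mean(g2_listA_direct)
est_B = statistics.mean(g2_listB_veiled) - statistics.mean(g1_listB_direct)
print(round((est_A + est_B) / 2, 3))  # → 0.5
```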

I'll add some of my own experience here, for which we are currently in the process of journal submission. I can't share the full abstract quite yet, but the very short version is that we tested list randomization in KwaZulu-Natal against both direct questionnaires, and more importantly, against known truths (HIV status, acquired by linkage with a local surveillance system). The list randomization arm not only did not improve upon direct questionnaires, but it performed much worse for some questions, even among those who definitively knew their own status. We think we know roughly why it performed so poorly, and it may be fixable, but that should help add a bit of caution.

List randomization, it seems, is not quite ready for prime time, and I would urge caution until we better research and understand best practices.

I looked at the working paper on your website – interesting work. I am discounting the results on HIV testing, as your explanation was immediately what I thought when I saw the findings (that people may have been thinking about other offers of HIV tests than the specific one mentioned in your question). However, on the HIV status: is it possible that the fact that you knew the answers played a role? Meaning that if the respondents knew that you (or your enumerators) knew, that would make them less likely to lie to the enumerator's face. This could also have influenced how they answered the block of questions including the HIV one. I wonder whether verifiable HIV status was the elephant in the room, and whether other information that really only the respondent (and a few other anonymous people) knows would see better performance from LR...

One more question: do I understand correctly that the "direct elicitation" group was not asked the non-sensitive questions in a block, but rather one by one? If so, can we really say that the list-block performed worse than the direct response group? I understand that you had other aims in doing this, but this raises issues of interpretation for the aim of comparing DR to LR, no? One way of interpreting the results through that lens would be similar to Aurelia's worry above, which is that LR may improve things over DR while still not coming close to the truth...

Thanks for your interest! That reminds me that I need to update the working paper version (now updated to 1.3). I am also e-mailing a colleague of mine to comment here who ran a study in Ghana, with similarly disappointing findings. A few responses below:

re: "However, on the HIV status: is it possible that the fact that you knew the answers played a role?"
Yes, this is certainly possible. However, we note that we would have expected the direct answers to be closer to 0 for the doubly known truth group in that case as well, so to get a number that was higher even than the doubly known truth group is troubling.

re: "One more question: do I understand correctly that the "direct elicitation" group did not ask the non-sensitive questions in a block, but rather one by one?"
Yes, you have interpreted the methods correctly, and that may have had an influence on the results, albeit only if the counting method itself was biased in some way (which it turns out is likely to be true). This design, as you note, was for two reasons: 1) we wanted to precisely explore the proportions of non-sensitive questions to better construct a future questionnaire in this population, and 2) if the counting method (list count vs direct count) influenced the number of true answers, it is likely that we have introduced additional design effects, and as such are unlikely to get close to the true value. One of the main discussion points from our pilot is that the counting method itself is likely one of the main culprits, and unfortunately one we didn't test directly during the pilot itself.

I am starting to worry that there is something substantively different about testing this among HIV+ populations, or when the sensitive question is HIV status or HIV-related. We'll see what transpires when we test other sensitive topics (violence, personal hygiene, etc.) among adolescent female populations...

Looking at their list experiment, I think the results could also stem from the design (cf. my first comment). I am convinced this is a great method to elicit sensitive behaviours, but of course it requires careful design; otherwise participants could become suspicious about the task. Also, I would not use a list experiment to elicit HIV status. We used subjective expectations in a survey among sex workers to elicit this information and, comparing it to biological markers, it performed pretty well as far as sensitivity was concerned, but specificity was not great. Indeed, the main issue we had is that a few participants thought they were infected while they were not, according to their last HIV test, and of course the LR will not help with this. Also, I anticipate that revealing HIV status can be such a big deal that I am not sure the LR could work for that.

I conducted a list experiment in Ghana with adolescent girls about sexual behavior as part of a larger RCT and it was, frankly, a complete failure. The results were a mess. I think part of this was due to the fact that it was self-administered and some girls were either confused or just not interested (some evidence of that as I made sure to put in innocuous items I knew for sure should be true for them and some answered 0). Another problem was power (I had 3 arms with 800 girls, but clustered into 34 schools). I think a bigger problem with research on sexual behavior of adolescents is that adolescent responses to these questions are inherently full of inconsistencies, recall error, and misunderstanding. When I conducted qualitative work after the RCT, that became very clear to me. It wasn't that girls were shy about providing answers about their sexual health; it was more that they misremember, they say completely inconsistent things minutes apart in the interview, and they interpret questions in different ways. This, to me, is the big problem in research on adolescent sexual health. Perhaps the list experiment can work in other contexts, but I wouldn't do it again with adolescents.