
Can anyone point me to some think pieces on what to do with census-type data where at least some of the people surveyed do not self-identify a race/ethnicity, gender, or other demographic data?

In studies (and datasets) examining diversity in various communities, I see a wide range of strategies. Some simply remove records where a person declines to specify a race/ethnicity or gender. Some include them and create an "unknown" category. If looking at race/ethnicity, for example, some studies just state they assume all non-responders are white.

None of these solutions seem perfect. Are there any others? And if you remove responses from people who, to take one example, decline to specify a gender, aren't you potentially just erasing part of the population that does not identify as male or female?

A number of surveys now allow participants to "decline to state" a gender, race/ethnicity, or similar, or to simply skip the question. Is there any developing consensus on what to do after that data is collected? Does it depend on the population being studied?

1 Answer

It depends on what you want to do with the data. One common approach is to treat non-response as "missing", meaning you drop the observation. This can bias the data, e.g. if a certain group refuses to answer a question more often than others. As a remedy, some use ex-post weighting of the data. You can do this if you know how often each group occurs in the population as a whole: if group A is underrepresented in your data, you assign a higher weight to each observation in group A. You can then use these weights to adjust descriptive statistics or regression results. The technique is called inverse probability weighting: https://en.m.wikipedia.org/wiki/Inverse_probability_weighting.
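To make the reweighting step concrete, here is a minimal sketch with made-up numbers, assuming the true population share of each group is known (the groups, values, and shares are purely illustrative):

```python
# Ex-post (inverse probability) weighting of a sample mean.
# Assumption: we know the true share of each group in the population.
population_share = {"A": 0.5, "B": 0.5}

# Toy sample: group A is underrepresented (2 of 6 instead of 3 of 6).
sample = [("A", 10.0), ("A", 12.0),
          ("B", 20.0), ("B", 22.0), ("B", 18.0), ("B", 24.0)]

n = len(sample)
sample_share = {g: sum(1 for grp, _ in sample if grp == g) / n
                for g in population_share}

# Weight = population share / sample share,
# so observations from underrepresented groups count more.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

weighted_mean = (sum(weights[g] * x for g, x in sample)
                 / sum(weights[g] for g, _ in sample))
unweighted_mean = sum(x for _, x in sample) / n
```

Here the unweighted mean (about 17.7) overstates group B's contribution, while the weighted mean (16.0) matches what a balanced sample would give.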

Note that this approach has limitations. Ex-post weighting is reasonable for variables such as income if you have sufficient data, under the (important!) assumption that, say, the low-income households in your data are representative of all low-income households in the population. This does not necessarily hold in other cases, e.g. if you study certain lifestyles.

Non-response may also occur because people are ashamed to answer a question. In that case the data contain a systematic bias with an unknown data-generating process, which makes the problem more or less impossible to correct.

Yet another option is imputation, meaning you estimate what someone would have answered (often applied to income, for instance). However, in this case you can never be sure you got it right, so imputation is often problematic.
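As a toy illustration of why naive imputation is problematic, here is the simplest possible variant, mode imputation of a categorical field (the data are invented; real studies typically use model-based or multiple imputation instead):

```python
# Naive mode imputation for a categorical survey field (toy example).
from collections import Counter

responses = ["female", "male", None, "female", None, "female", "male"]

# Replace each missing answer with the most frequent observed category.
observed = [r for r in responses if r is not None]
mode = Counter(observed).most_common(1)[0][0]

imputed = [r if r is not None else mode for r in responses]
```

Note that this assigns every non-responder the majority category, which is exactly the erasure concern raised in the question: any non-responders who belong to a minority or unlisted category disappear from the data.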