I have some questions about calculating prevalence rates for pre-1996 data.

I understand that I need to use different weights for different years. I also learned that each question has its own universe. For example, to calculate the prevalence of diabetes, I should use DIABETICYRC, and:

For 1973 and 1975, DIABETICYRC should be weighted with PERWEIGHT.
For 1978 to 1981, DIABETICYRC should be weighted with DIABWT.
For 1982 to 1996, DIABETICYRC should be weighted with CONDWT4.
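As a minimal sketch, the year-to-weight mapping listed above could be written as a small lookup in Python. The function name `weight_for_year` is hypothetical, and years not listed above (for example 1974) are deliberately left unmapped rather than guessed.

```python
def weight_for_year(year: int) -> str:
    """Return the weight variable name listed above for DIABETICYRC in a given year."""
    if year in (1973, 1975):
        return "PERWEIGHT"
    if 1978 <= year <= 1981:
        return "DIABWT"
    if 1982 <= year <= 1996:
        return "CONDWT4"
    # Years not covered by the list above are not guessed.
    raise ValueError(f"no weight listed for year {year}")
```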

Question:
How should I calculate the prevalence rate specifically?

What I have done is like this:

if year == 1996:
    numerator = sum over respondents of ONE(answering "20/21/22" for DIABETICYRC in 1996) * CONDWT4
    denominator = sum over respondents of ONE(answering "10/20/21/22" for DIABETICYRC in 1996) * CONDWT4
    prevalence rate = numerator / denominator
where ONE(x) returns 1 if condition x is satisfied and 0 if not.

DIABETICYRC:
00: not in universe
10: no
20: yes
21: Yes, indicated by response to direct survey question
22: Yes, indicated by other source

Is my method correct? I ask because I calculated this for diabetes and the result seems unreasonable.

This seems correct in general. A few notes that might be helpful. First, in 1996 the only available response categories are 00 “NIU,” 10 “No,” and 20 “Yes.” I don’t think specifying unused codes will make any difference, but I thought it was worth mentioning. Additionally, it is not clear to me from what is shared above how you are specifying the sampling weight. Be sure to check your statistical software's documentation for how to specify the sampling weight correctly.
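For reference, the calculation described in the question can be sketched in Python as a ratio of two weighted sums. This is only an illustration of the arithmetic, not a substitute for your software's survey routines. The function name `weighted_prevalence` and the sample records are made up for the example; the codes follow the DIABETICYRC list above.

```python
def weighted_prevalence(records, yes_codes=(20, 21, 22),
                        universe_codes=(10, 20, 21, 22)):
    """Weighted share of "yes" answers among in-universe respondents.

    records: iterable of (DIABETICYRC code, weight) pairs.
    """
    numerator = 0.0
    denominator = 0.0
    for code, weight in records:
        if code in universe_codes:       # skip 00 (not in universe)
            denominator += weight
            if code in yes_codes:
                numerator += weight
    if denominator == 0:
        raise ValueError("no in-universe records")
    return numerator / denominator


# Made-up records: (code, weight). The code-0 record is out of universe.
sample = [(10, 1000.0), (20, 500.0), (0, 2000.0), (10, 1500.0)]
prevalence = weighted_prevalence(sample)
print(prevalence)
```

Note that the out-of-universe record contributes to neither sum, so a large weight on code 00 cannot distort the ratio.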

Thank you, Jeff! I am using STATA 15.2 and treat the sampling weight as an analytic weight (aweight). Specifically, I summarize the numerator and the denominator separately, both with the analytic weight, store the sum of weights for each, and divide one by the other. The code looks like below; do you think this is appropriate?

I have another question about the choice of samples. If I want to calculate the prevalence rate for subgroups (age, gender, income, etc.), should I keep the current computation and only add more restrictive conditions (e.g. age>=50 & age<=59)? Or should I change the weight and sample? Thank you!

You are facing a trade-off between temporal precision and statistical precision. Sticking with one sample and adding more restrictive conditions will reduce the number of observations that meet the given criteria and will increase the margin of error associated with your prevalence estimates. Pooling a number of samples together will give you more observations that meet the given criteria, but the result will only represent an average of the prevalence rate over the pooled time period. So I can’t tell you which way is best, as this depends on your ultimate research objectives.
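To make the trade-off concrete, here is a minimal Python sketch that computes a subgroup prevalence either for a single year or for pooled years, and reports how many observations each estimate rests on. The function name `subgroup_prevalence`, the row layout, and all data values are made up for illustration. It does not compute a margin of error.

```python
def subgroup_prevalence(rows, years, age_min, age_max):
    """Return (unweighted n, weighted prevalence) for an age band in the given years."""
    yes_codes = {20, 21, 22}
    universe = {10, 20, 21, 22}
    n = 0
    num = 0.0
    den = 0.0
    for row in rows:
        if row["year"] not in years:
            continue
        if not (age_min <= row["age"] <= age_max):
            continue
        if row["code"] not in universe:   # drop 00 (not in universe)
            continue
        n += 1
        den += row["weight"]
        if row["code"] in yes_codes:
            num += row["weight"]
    prevalence = num / den if den > 0 else None
    return n, prevalence


# Made-up rows; "weight" stands in for the year-appropriate weight variable.
rows = [
    {"year": 1995, "age": 52, "code": 10, "weight": 1200.0},
    {"year": 1995, "age": 58, "code": 20, "weight": 800.0},
    {"year": 1995, "age": 34, "code": 10, "weight": 1500.0},
    {"year": 1996, "age": 55, "code": 10, "weight": 1000.0},
    {"year": 1996, "age": 51, "code": 20, "weight": 1000.0},
    {"year": 1996, "age": 70, "code": 0, "weight": 900.0},
]

single = subgroup_prevalence(rows, {1996}, 50, 59)        # one year, fewer cases
pooled = subgroup_prevalence(rows, {1995, 1996}, 50, 59)  # two years, more cases
print(single, pooled)
```

The pooled call rests on more observations than the single-year call, but its value is a weighted average across both years rather than an estimate for either one.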