Stata FAQ
How can I analyze a subpopulation of my survey data in Stata?

NOTE: This page was created using Stata 9. All of the code on
this page will work with Stata 10. The code on this page will not work
with Stata 8 (or earlier versions of Stata).

When analyzing survey data, it is common to want to look only at certain
respondents, perhaps only women, or only respondents over age 50. When
analyzing these subpopulations (AKA domains), you need to use the appropriate
option. Stata 9 has two subpopulation options, subpop and over, that are very flexible and
easy to use. Using the subpopulation option(s) is extremely important when
analyzing survey data. If the data set is subset, meaning that
observations not to be included in the subpopulation are deleted from the data
set, the standard errors of the estimates cannot be calculated correctly.
When the subpopulation option(s) is used, only the cases defined by the
subpopulation are used in the calculation of the estimate, but all cases are
used in the calculation of the standard errors. For more information on
this issue, please see Sampling Techniques, Third Edition by William G.
Cochran (1977) and Small Area Estimation by J. N. K. Rao (2003).
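
The contrast described above can be sketched as follows. This is only an illustration: it assumes the survey data are already in memory, that svyset has already been run to declare the design, and that yr_rnd is a 0/1 indicator (it is introduced later on this page).

```stata
* Assumes the data are in memory and svyset has already been run.

* Not recommended: deleting the cases outside the subpopulation.
* The point estimate is unchanged, but the standard error may be wrong.
preserve
keep if yr_rnd == 1
svy: mean ell
restore

* Recommended: keep every case and name the subpopulation instead.
svy, subpop(yr_rnd): mean ell
```

Comparing the standard errors from the two runs is one way to see how much the choice matters for a given design.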

For the sake of consistency, we will use the mean
command for all of our examples. However, the subpop option works the
same way for all svy commands, and the over option works the same way
for the svy commands that support it.

We will start by looking at the mean of our continuous
variable, ell. Next, we will consider two variables to use with the
subpop option, yr_rnd, which is coded 0/1, and both, which
is coded 1/2. As you will see, the subpop option handles these two
variables differently.
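
A minimal sketch of this starting point, assuming the data set is already in memory and the design has already been declared with svyset:

```stata
* Estimate the mean of ell for the full population.
svy: mean ell
```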

Here we can see that yr_rnd is coded 0/1. (The missing
option is used here to show that there are no missing values for this variable.
We will want to know this later on.) Notice in the output of the svy: tab
command that there are 789.6 cases coded 1. (It is not a whole number because we
are estimating this value using the probability weights.) In the output of
the svy: mean command, we also see that 789.552 cases are included in the
subpopulation.
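
Commands along these lines would produce the output described above. This is a sketch, assuming the data are in memory and svyset has been run; the count option is an assumption about how the weighted counts were displayed.

```stata
* Weighted counts for yr_rnd; missing values, if any, get their own row.
svy: tab yr_rnd, count missing

* Mean of ell for the subpopulation of cases where yr_rnd is 1.
svy, subpop(yr_rnd): mean ell
```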

Now let's try to use a variable coded 1/2 instead of 0/1. Here we can
see that both is coded 1/2. (The missing option is used
here to show that there are no missing values for this variable. We will
want to know this later on.) Notice in the output of the svy: tab
command that there are 1888 cases coded 1. However, in the output of the
svy: mean command, we see that all of the observations, 6194 cases, are
included in the subpopulation. This is because the subpop option
must have a true/false variable. As stated on page 39 of the Stata 9 Survey
manual, when the subpop option is used, the subpopulation is actually
defined by the 0s (false), which indicate those cases to be excluded from the
subpopulation. Non-0 values are included in the analysis, except for missing values, which
are excluded from the analysis. Because we have no cases coded as 0, all
of the cases are included in the subpopulation, as explained in the note in the
output.
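
The same pair of commands with the 1/2 variable might look like this (a sketch under the same assumptions: data in memory, svyset already run, and count used to display weighted counts):

```stata
* Weighted counts for both, which is coded 1/2 and has no 0s.
svy: tab both, count missing

* Because no case is coded 0, every case falls in the subpopulation.
svy, subpop(both): mean ell
```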

Now let's create a copy of both and recode the 1s to 0s. We
will also set some values to missing, to see what happens with missing values in
the subpopulation variable. The output of the tab command shows us
that the recoding went as planned. The output of the svy: mean
command shows that all of the cases not coded 0 or missing (the 424 cases
coded as 2) are
included in the subpopulation. Notice the note that Stata provides when
the subpopulation variable is not coded 0/1.
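
One way to carry out these steps is sketched below. The variable name both0 and the choice of which observations to set to missing are hypothetical and chosen purely for illustration, so the counts will not match those quoted above.

```stata
* Copy both so the original variable is left untouched (both0 is a hypothetical name).
generate both0 = both

* Recode the 1s to 0s; the 2s are left as they are.
recode both0 (1 = 0)

* Set an arbitrary handful of values to missing, for illustration only.
replace both0 = . in 1/10

* Check the recoding, showing missing values as well.
tab both0, missing

* Only cases that are neither 0 nor missing enter the subpopulation.
svy, subpop(both0): mean ell
```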

You can also use if when defining your subpopulation. It should
be stressed that this is VERY different from using if to remove cases
from an analysis. Using if in the subpop option does not
remove cases from the analysis. The cases excluded from the subpopulation
by the if are still used in the calculation of the standard errors, as
they should be.
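
As a sketch of this syntax, assuming the data are in memory and svyset has been run (the particular conditions are arbitrary examples):

```stata
* Subpopulation defined by an expression alone.
svy, subpop(if both == 1): mean ell

* Subpopulation defined by an indicator variable plus an expression.
svy, subpop(yr_rnd if both == 1): mean ell

* By contrast, an if placed after the command restricts the estimation
* sample itself and is not a substitute for subpop():
* svy: mean ell if both == 1
```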

You can use either subpop or over with multiple variables to
create the subpopulation that you want. Let's see some examples using the
over option. First, we will use yr_rnd, our 0/1 variable, then
both, our 1/2 variable. Notice that the output is different from the
output using the subpop option in that both categories of the variable
are given, and there is no note when a 1/2 variable is used. Please note
that the over option is only available for the survey commands mean,
proportion, ratio and total.
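
A minimal sketch of the two over examples just described, under the same assumptions as before (data in memory, svyset already run):

```stata
* One estimate of the mean of ell for each category of the 0/1 variable.
svy: mean ell, over(yr_rnd)

* The same with the 1/2 variable; the coding does not matter for over().
svy: mean ell, over(both)
```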

Now let's use both yr_rnd and both as the subpopulation
variables. First we will use the svy: tab command to ensure that
there are cases in all four categories. Then we use the svy: mean
command with the over option.
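
These two steps could be written as follows (a sketch; the count option is an assumption about how the cell sizes were displayed):

```stata
* Two-way table to confirm that all four cells contain cases.
svy: tab yr_rnd both, count

* Mean of ell for each combination of yr_rnd and both.
svy: mean ell, over(yr_rnd both)
```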

Below we create a new variable from emer with four categories. Then
we will use this variable with yr_rnd and both; all combinations of the
variables are shown in the output. This is often very useful and saves
you from having to create a new subpopulation variable. However, if each
of your variables has many categories, the output can become long and
cumbersome, especially if you are only interested in a few combinations of
categories.
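
One possible way to do this is sketched below. The name emer4 and the use of quartiles are assumptions made for illustration; the cutpoints actually used to form the four categories are not given here, and with heavily tied values xtile may produce fewer than four groups.

```stata
* Split emer into (up to) four groups at its quartiles; emer4 is a hypothetical name.
xtile emer4 = emer, nquantiles(4)

* Mean of ell for every combination of the three variables.
svy: mean ell, over(yr_rnd both emer4)
```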