1. The proper -svyset- should include the stage of selecting dwellings.

svyset censustract [pweight=???], strata(area) || dwelling || _n

For the proper pweight, see point 4 below.

2. You did not really stratify on gender, so drop all reference to a
gender stratum.

3. Your design, selecting one person at random, and hoping to get
enough elderly people, is not one I recommend. There are standard
approaches for oversampling sub-populations in household surveys. At
the least, one can list older and younger people in each dwelling and
select separately from each list.

4. The design makes it very difficult to calculate the sampling
weights. You appear to be saying that you stopped interviewing when
you had enough elderly and younger people (or when you ran out of
dwellings). This is a version of 'sequential sampling' (Sharon Lohr,
Sampling: Design and Analysis, Duxbury, p. 403).

Here are my best guesses at sample weights.

4a. person weight =
   [1/(prob. of selecting tract)]
   x [(no. of dwellings in tract)/(no. of dwellings where you
      obtained interviews)]
   x (no. of people in the person's dwelling)

4b. If you listed the ages of all people in the 12 selected
dwellings, not just those where you conducted interviews, you can do
more:

4c. If you have ages of all people in the sampled dwellings,
substitute 'no. of dwellings where you obtained interviews' for '12
sampled dwellings' in the formulas in 4b. These weights may
slightly over-estimate the proportion of elderly people.

5. If there are census figures available for your target population,
apply a post-stratification weighting to make the ratio of 'elderly'
and 'younger' people match that in the census. See Lohr, Chapter 8.
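
As a rough sketch of how points 1, 4a and 5 could fit together in
Stata: every variable name below (p_tract, n_dw_tract, n_dw_int,
n_persons, agegrp, census_n, outcome) is made up for illustration, and
the placement of -poststrata()- and -postweight()- should be checked
against the -svyset- help for your version of Stata.

```stata
* Sketch only: all variable names are hypothetical.
*   p_tract     probability that the census tract was selected
*   n_dw_tract  number of dwellings listed in the tract
*   n_dw_int    number of dwellings in the tract with an interview
*   n_persons   number of people living in the respondent's dwelling
*   agegrp      age group (under 65 / 65 and over)
*   census_n    census count for the respondent's age group

* Point 4a: person weight
generate double pw = (1/p_tract) * (n_dw_tract/n_dw_int) * n_persons

* Points 1 and 5: tract, dwelling and person stages,
* post-stratified on age group
svyset censustract [pweight=pw], strata(area) ///
    poststrata(agegrp) postweight(census_n)   ///
    || dwelling || _n

svy: proportion outcome
```

This only covers the 4a weight. It does not account for the
sequential stopping rule described in point 4.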

-Steven

On Mar 31, 2008, at 6:27 AM, Angel Rodriguez Laso wrote:

Thank you, Steven, for your interest.

To answer your questions: I didn't go into more detail on the
sampling procedure because I didn't think it was needed to define
strata and PSUs. There was intermediate sampling of dwellings. There
was a list of all dwellings in each census tract, and from this list
12 dwellings in each selected census tract were chosen at random. From
each dwelling one person was taken at random (and his/her weight
calculated from the number of people living in the dwelling). People
were interviewed until a sample of 7 below 65 and 3 over 65 was
obtained in each census tract. Twelve dwellings were selected
initially because it was expected that taking only 10 would not yield
the desired 7/3 split. Nevertheless, 7 and 3 individuals could not be
selected in every census tract, and that is the reason (more than the
existence of missing items) why there are census tracts with only one
individual over 65.

I'm trying to check whether following your advice (merging strata in
census tracts that have a single PSU per stratum) or just dropping the
second-stage specification would give very different results, but
when I run a svy: prop under the first specification:

Angel-
I'm sorry that I missed your initial post; I was on vacation and
canceled my Statalist subscription. I agree with Stas's suggestion
for the first specification.

I have some questions

1. Your description implies that you created a list of ALL people in
each selected tract, stratified by age. Then selected by simple
random sampling: 7 from the below 65 list; 3 from the over 65 list.
Is that a correct description? Or, was there intermediate sampling
of dwellings?

2. Your PSUs are census tracts, not people. ("Primary" refers only
to the first stage.) You are saying that in some of the census
tracts, you had only one person either under or 'over' 65. Is that
correct?

For those tracts, I suggest that you go with option 1 and keep the
sampling probabilities, but ignore the age stratification. That is,
create a single stratum for those tracts by recoding.
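
One way the recoding might look, as a sketch only: the variable names
(tract, person, area, age_group, pw) are assumptions, age_group is
assumed never to take the value 0, and pw stands for whatever
sampling weight you already have.

```stata
* Sketch only: all variable names are hypothetical.
* Count sampled people in each tract-by-age-group cell.
bysort tract age_group: generate n_cell = _N

* Smallest cell size within each tract.
bysort tract: egen min_cell = min(n_cell)

* Second-stage stratum: the age group where every cell has at least
* two people, one pooled stratum (coded 0) for the remaining tracts.
generate age_stratum = age_group
replace age_stratum = 0 if min_cell < 2

svyset tract [pweight=pw], strata(area) || person, strata(age_stratum)
```

Cells can also shrink to one person when an analysis variable has
missing values, so the counts may need to be redone on the estimation
sample.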

You may still analyze your outcomes by age. The analysis age groups
need not match the stratum age-groups.

-Steven

On Mar 28, 2008, at 10:40 AM, Angel Rodriguez Laso wrote:

Thank you for your answer, Stas.

I've tried both specifications, and the first surprise was that
Stata 9 ignores further stages when stage 1 is sampled with
replacement. It was good to come across this warning, because in our
survey sampling was without replacement and the sampling fraction of
the census tracts was quite high (more than one third in some strata),
which precludes assuming that selection was with replacement.

The problem with using age groups as second-stage strata is that,
with only 3 people over 65 selected per census tract, whenever there
are missing values in the variables some strata become single-PSU
(person) strata, which prevents Stata from calculating standard
errors. So the two specifications I've tried are:

Not surprisingly, standard errors under the two specifications differ
only by a few hundredths. I believe this is mainly because in both
cases the degrees of freedom are very large. This is something I want
to check with you: from reading Korn and Graubard, "Analysis of Health
Surveys", I've understood that in complex surveys degrees of freedom
are calculated as #PSUs - #strata (624 for the first specification and
1244 for the second, because Stata doubles the number of census
tracts, since each of them belongs to two different strata). I do not
follow you very well when you recommend doing a small simulation with
census or simulated data to ascertain degrees of freedom, or when you
state that Taylor series expansion standard errors might be badly off
with small samples. It's usual practice to work with such low numbers
of individuals per PSU (10 in my case), and I've never heard that
small sample size was a problem in that setting.

Unfortunately, I don't have enough knowledge to go for option 3.

To conclude, although both specifications yield similar results, I
agree with you that the second one implies linked selection of PSUs,
while the first one is conceptually sounder.

Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid

I would say your first specification makes better sense, even though
the design it produces is quite weird, and the degrees of freedom in
that design are strange (and 7 initial strata won't get you very far,
anyway). In Stata 10, that's doable with

svyset tract, strata(area) || person, strata(age_group)

if I am getting your design right.

In the second specification with region by age strata, you have some
sort of coupled sampling, where selecting a PSU in one stratum implies
selecting a certain PSU in the other stratum linked by geography.
You could still analyze that, but you would need accurate pairwise
probabilities of selection to compute the Horvitz-Thompson estimator
and the Sen-Yates-Grundy estimator of its variance (which I don't
think is implemented anywhere commercially, as those higher-order
probabilities of selection are rarely known; Jeff P, that might
produce a cutting-edge addition to Stata's set of -svy- tools,
although I've no idea how to input and parse those :)). Any reasonably
high-level book would have it (Kish, Cochran, Mary Thompson's books
spring to mind). For special cases, I think that can be programmed in
Mata. Let's call that option 3. Note that the naive implementation as

egen area_age = group(area age_group)
svyset tract, strata(area_age) || person

produces wrong probabilities of selection, and the variances are
likely to be understated, as there is more variability in this
specification than in your actual design.

If I were in your shoes, I would try both specifications you described
and see whether they produce comparable substantive results. Keep in
mind that either way you are getting asymptotic Taylor series
expansion standard errors, and they might be badly off with small
samples like those you have. And I think you need to worry about your
degrees of freedom, not your number of PSUs; I would do a small
simulation to determine the approximate d.f. for your main variables
-- from census data if you have it, or from simulated data resembling
the actual population. If I had infinite time to work on that project
(meaning, a week or two of devoted programming), I would implement
option 3 as the most proper.

Greetings to all members of the list,
I have the following questions on svysetting for an analysis of a
complex survey:
We have carried out a regional health population survey. We defined
strata initially as geographic areas in the region (n=7) and allocated
to each of them a sample proportional to their population. But because
we wanted to over-represent the elderly, we required that the number
of people over 65 years sampled in every area reach a minimum. We
didn't change the sample size of people below 65 obtained through the
proportional allocation. Therefore the sampling fractions (and
consequently the weights) are different for each area by age group
(below/over 65) category.
Then we selected census tracts in each geographic area with
probabilities proportional to their total population, and randomly
sampled 10 individuals in those selected, always keeping the
proportion 7 below 65 years/3 over 65 years, which was the regional
overall age distribution after the oversampling explained above. My
first question is whether strata should be defined as geographic areas
alone or as geographic area by age group (below/over 65 years) (n=14)
when svysetting. The first possibility looks more reasonable, because
census tracts were selected within geographic areas, not within
geographic area by age group cells. If this is correct, then probably
the way to svyset would be declaring geographic areas as first stage
strata, census tracts as first stage PSUs and age groups as second
stage strata.
Alternatively, if the answer is that strata should be defined as
region by two age-group categories, then the same census tract can
belong to two different strata (for example area A below 65/area A
over 65) depending on the age of the individual considered. If I
svyset with strata = region by age group categories and PSU = census
tracts, Stata interprets that there are twice as many PSUs as there
are real census tracts. Is that correct?
Many thanks.
Ángel Rodríguez Laso
Institute of Public Health of the Region of Madrid