Re: st: DHS svy questions on weights and merged datasets

A look at the DHS reports suggests that:
1. 16 regions were the sampling strata in 1998
2. 17 regions were the sampling strata in 2008
Perhaps you can match those up with the variables you have (v024 in 2008, I believe).
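One rough way to check a candidate variable is to count its distinct values in each year's file and compare that with the number of regions in the report. This is only a sketch on made-up data: the toy v024 below is an assumption standing in for the real region variable, and a matching count suggests, but does not prove, that the variable is the stratum identifier.

```stata
* Sketch only: toy data; v024 here stands in for the real region variable.
clear
set obs 34
generate v024 = mod(_n, 17) + 1        // 17 made-up region codes
egen byte first = tag(v024)            // flags one observation per distinct value
count if first                         // number of distinct values is left in r(N)
display "distinct values of v024: " r(N)
```

On the real files you would run the last three lines separately for 1998 and 2008.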
Steve
sjsamuels@gmail.com
Perhaps that will help you.
This is a lot of questions for one post and ensures that most people will not read to the bottom (see the FAQ about "bundling questions"). I can only give advice about the first two. For the third, choosing the proper stratum variable from each year, my experience with DHS and similar surveys is that you must closely read the documentation and reports.
1. Weights: If you are analyzing individuals, then use the individual weight, even if you have merged in household information.
2. Combining surveys: see: http://www.stata.com/statalist/archive/2008-10/msg00521.html
You append the two data sets. First make sure that your variables of interest have identical names in each and, just as importantly, that they are identically coded. If not, you will have to create new, compatible versions. Be sure to add a "year" variable before appending.
You create uniquely numbered PSUs as in Stas's post, by:
egen psuXyear = group(psu year)
You will probably have to do the same with the strata, assuming that you have properly identified them.
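Putting the steps above together, a minimal sketch might look like the following. It runs on toy data, not the real recodes: the names psu, stratum, wt, treat, post and y are illustrative assumptions, and in the real files you would substitute the PSU and stratum variables you have identified from the documentation. Whether the weights need rescaling across the two years is a separate question that this sketch does not address.

```stata
* Sketch only: toy data stand in for the 1998 and 2008 individual recodes.
* psu, stratum, wt, treat, post and y are illustrative names, not DHS variables.
clear all
set seed 12345

* toy "1998" file: 2 strata x 2 PSUs x 5 respondents
set obs 20
generate psu     = ceil(_n/5)          // PSU ids 1..4
generate stratum = ceil(psu/2)         // stratum ids 1..2
generate v005    = 1000000             // DHS stores weights multiplied by 1,000,000
generate treat   = mod(psu, 2)         // made-up treatment indicator
generate y       = rnormal()           // made-up outcome
generate year    = 1998
tempfile dhs1998
save `dhs1998'

* toy "2008" file: deliberately reuses the same PSU and stratum numbers
replace year = 2008
replace y    = rnormal() + 0.5*treat
replace v005 = 1500000

* stack the two surveys (append, not merge)
append using `dhs1998'

generate wt = v005/1000000                 // individual weight
egen psuXyear     = group(psu year)        // PSU ids unique across surveys
egen stratumXyear = group(stratum year)    // stratum ids unique across surveys
generate post     = (year == 2008)

svyset psuXyear [pweight = wt], strata(stratumXyear)
svy: regress y i.treat##i.post
```

Without the two -egen- lines, PSU 1 from 1998 and PSU 1 from 2008 would be treated as the same cluster.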
Steve
sjsamuels@gmail.com
On Jun 19, 2012, at 11:17 AM, Julian Doczi wrote:
Dear Statalist Members,
I am undertaking an impact evaluation using Demographic and Health
Survey (DHS) data from the Philippines and difference-in-difference
regression, and have a couple of conceptual questions regarding the
handling of survey data with Stata's -svy- command that I hope someone
can assist me with. I have searched through Statalist and the internet
already, but have not been able to find satisfying answers to these
questions, though my apologies are in order if they have indeed
already been asked. I am using Stata 11.2 in Windows.
Firstly, when using -svy-, is it possible to simultaneously make use
of the sample weights from both the "Individual Recode" of the DHS and
the "Household Recode" of the DHS? - For my analysis, I am mainly
using the individual recode, but I also had to merge in a variable
from the household recode on the household's source of non-drinking
water. When I use the -svyset- command, I specify only the sample
weight of the individual recode (for those who know DHS, the variable
would be "v005 / 1,000,000") as a 'pweight'. But am I creating some
sort of error if I then go ahead and do a -svy: regress __ - using the
merged household variable, for which its dataset has its own
_different_ sample weight (again, for those who know DHS, it is the
"hv005 / 1,000,000" variable and is calculated with a different
formula)? Is this something that should concern me and is there any
way around it? Should I apply the weight somehow to the household
variable before I merge it over into the individual recode dataset?
Secondly, since I wish to do an impact evaluation using
difference-in-difference regression, this means that I will be using
DHS data from both 1998 and 2008. Normally, to do a dif-in-dif, one
merges both data sets together and creates a dummy variable for
whether their time period is 1998 or 2008. Conceptually, though, do
the -svy- commands function properly if the single dataset is actually
composed of two different datasets? For example, in the -svyset-
command, although both datasets (composed as one, long dataset) would
have PSU, sample weight, and strata variables of the same type, the
actual meaning of these variables for each separate dataset would be
very different. For example, the 1998 DHS for the Philippines has 752
unique PSUs, ranging from 1 to 755, while its 2008 DHS has 792 unique
PSUs, ranging from 1 to 794, that may or may not be the same PSUs as
those from 1998 (and even if there is overlap, there is a very small
probability that the PSUs would share the same numerical values).
Likewise, the sample weights and strata are similarly tailored
specifically to the particular dataset. So, if I merged these two
datasets, I will have a situation where, for example, the merged PSU
variable would range from 1 to 794 and essentially have two replicates
for 752 of these values (i.e. 1, 1, 2, 2, 3, 3, etc.).
So, can Stata's -svy- command handle this, or do I need to use a
different command, or a different way of merging / preparing my data
for dif-in-dif regression? I imagine that -svy- will just treat the
data as one big dataset, but this is not correct in terms of accurate
calculations of standard errors/variances, is it?
Surely I am not the first person to either merge data between DHS
recodes (question one) or to attempt estimations using data from two
DHS (question two), so I am hoping that someone with previous
experience will be able to assist me with this. As I mentioned, I
searched as well as I could through Statalist, but did not come across
answers to these. I also apologise in advance if the questions I am
asking have fairly obvious answers; my formal econometric/statistical
training to date has been very limited!
Finally, just to confirm, as I have read conflicting accounts on this,
for the -svyset- command using DHS, the strata variable I should use
is "v022" - "Sample stratum number"? I have read that whether one uses
this or uses "v023" (Sample Domain) or "v024" (Region) or "v025" (Type
of place of Residence - Urban/Rural) depends specifically on how the
country's particular survey was sampled.
(e.g. http://www.stata.com/statalist/archive/2009-07/msg00906.html ;
http://www.stata.com/statalist/archive/2011-07/msg00614.html)
Based on that, here is the basic description of sampling for the 1998
and 2008 Philippine DHS, which I have summarised from the country's
DHS final reports (available from MeasureDHS):
"The DHS is a multi-stage stratified design, designed to represent all
17 regions of the country. In each region, a stratified 3-stage sample
design was employed. First, PSUs were selected with probability
proportional to the estimated number of HHs from the 2000 Census. PSUs
consisted of one barangay (village) or a group of contiguous barangays
(villages). Second, enumeration areas (EAs) were selected within
sampled PSUs with probability proportional to size. Third, housing
units were selected with equal probability within EAs. EAs = area
within barangay (village) consisting of ~150 contiguous HHs - these
were identified during the 2000 Census."
In these datasets, "v023" simply equals zero (i.e. a national focus -
see: http://www.measuredhs.com/faq.cfm, question 4 under "Using Data
Files"), and since a consideration of urban vs. rural is not mentioned
anywhere, I assume that my strata must either be v022 (which includes
about 356 unique values for 2008 {as mentioned above, v021 contains
792 unique PSUs for 2008}) or v024 (with 17 unique values for 2008).
Based on my above description, can you help me decide which? Although
the first link I included above discourages use of v022, other links I
have seen made use of it. Although the aforementioned MeasureDHS FAQ
recommends using v023, it does not specifically state what to do if
v023 equals zero, as it does for me - it only says to 'investigate
your specific survey'.
I would greatly appreciate any assistance that could be offered,
and/or further reading/resources that could assist me on these issues.
Thank you very much in advance and Best Regards,
--
Julian Doczi (Mr.)
University of East Anglia, Norwich, U.K.
juliandoczi@gmail.com
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/