Re: st: average value among differing numbers of variables

Radu Ban <rban@nber.org> writes,
> This is a data management question. The data that I'm looking at (daily U.S.
> weather) has the following structure.
>
> day1 flag1 day2 flag2 day3 flag3 day4 flag4 day5 flag5 ... day31 flag31
> 0 s a2 a 0 s 0 s a5 a a31
> 0 s 0 s b3 a b4 b5 b31
> c1 0 s 0 s 0 s c5 a c31
>
> the "s" flag means that the measured element (say inches of rain) is
> accumulated over those days, which are assigned a 0 value, and the
> accumulated amount is reported in the day flagged with "a". i would like to
> replace the 0 value for the accumulation days with the average of the
> accumulated value over those days.
>
> given the notations above, specifically, i would like to replace 0, 0, b3
> (in the second row) with b3/3; 0, 0, 0, c5 (in the third row) with c5/4, and
> so on. note that, as in the first row there can be more than one
> accumulation series per row.
I do now know what the solution to this problem is yet -- I will -- but I
do know these kinds of problems are easier viewed the long way:
orig_obs i day flag
1 1 0 s
1 2 a2 a
1 3 0 s
1 4 0 s
1 5 a5 a
...
1 31 a31
2 1 0 s
2 2 0 s
2 3 b3 a
2 4 0 s
2 5 c5 a
...
2 31 c31
etc.
-reshape- can make the data look like that -- we'll worry about the details
later.
I am unsure from Radu's description whether variable flag ever contains
anything on that "a" and "s". Radu sort of implies " " is also possible,
in which case the measurement would be for that day. If so, I could
view that as an "a" observation: One accumulates the single day and
divides by one. So just in case there are any blanks,
. replace flag = "a" if flag=="s"
and then, just to verify that these data are as they have been explained to
be, let's verify that flag is now always "s" or "a":
. assert flag=="s" | flag=="a"
Now the problem is getting easier:
1. We start accumulation at the first observation for each orig_obs
group.
2. We continue accumulation up until we see an "a".
3. Then, we start accumulation again.
So let's add a new variable begin that will record 1 every time an accumulation
begins:
. by orig_obs: gen begin = cond(_n==1 | flag[_n-1]=="a", 1, 0)
Now our dataset looks like,
orig_obs i day flag begin
1 1 0 s 1
1 2 a2 a 0
1 3 0 s 1
1 4 0 s 0
1 5 a5 a 0
...
1 31 a31 a <could be 0 or 1>
2 1 0 s 1
2 2 0 s 0
2 3 b3 a 0
2 4 0 s 1
2 5 c5 a 0
...
2 31 c31 a <could be 0 or 1>
Understand what I did: I merely created a variable equal to 1 marking
the beginning of each group in which we need to distribute sum. If I now
sum variable begin, I will have group numbers:
. gen group = sum(begin)
The data set now looks like:
orig_obs i day flag begin group
1 1 0 s 1 1
1 2 a2 a 0 1
1 3 0 s 1 2
1 4 0 s 0 2
1 5 a5 a 0 2
...
1 31 a31 a ? 10 <- i just made up 10
2 1 0 s 1 11
2 2 0 s 0 11
2 3 b3 a 0 11
2 4 0 s 1 12
2 5 c5 a 0 12
...
2 31 c31 a ? ?
etc.
Now the problem is easy: within group, replace all the observations
with the value of the last observation in the group, divided by the
number of observations in the group:
. sort group
. by group: gen newday = day[_N]/_N
I have my solution. Actually, I could have typed
. by group: replace day = day[N]/_N
but I want to -list- the result and make sure it looks right to me. Then
I can replace day:
. replace day = newday
. drop newday
Now all I need to do is switch the data back to being in the wide format.
So here is the complete solution:
. gen orig_obs = _n
. reshape long day flag, i(i)
. replace flag = "a" if flag=="s"
. assert flag=="s" | flag=="a"
. sort orig_obs i
. by orig_obs: gen begin = cond(_n==1 | flag[_n-1]=="a", 1, 0)
. gen group = sum(begin)
. sort group
. by group: gen newday = day[_N]/_N
. list <-- look to make sure right
. replace day = newday
. drop newday group
. reshape wide
-- Bill
wgould@stata.com
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/