Re: st: More efficient processing of nested loops?

Nick,
Thank you for your help with this. Your approach gives me the return
for a single day. There is some variation, but that day is usually the
return on the date the two datasets were merged. These observations
have non-missing values for datadate_2 and date_t_4. I had a difficult
time writing this up, and I apologize if my description of the merged
datasets or the problem was not clear enough.
Dataset A has around 12 years of daily returns for 3,500 firms.
Dataset B has the earnings announcement date for quarter t
(datadate_2) and quarter t-4 (date_t_4) in each row. I merged these
two datasets (1:1 cusip date) into Dataset C and ran the two loops.
There are typically around 250 days between datadate_2 and date_t_4
that need to be added together.
I can provide any additional information if needed. Here is the code I
ran from your responses:
bysort cum_id_cum (datadate_2) : gen d2_temp_2 = datadate_2[1]
bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1]
egen cusip_mode = mode(cusip), by(cum_id_cum)
local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip"
egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
replace cum_ab = cum_ab_temp if `mark'
I wrote the two inefficient loops based on what I would do by hand to
sum up the abnormal returns. Go into the first cusip and identify the
two boundary dates for the first earnings announcement. Go back
through all the daily returns for that cusip and mark each day that is
between these two dates. Sum up these individual returns into one
number. Do this again for the first cusip but second earnings
announcement, etc.
Thanks again for the time you have already spent on this,
Robson
On Wed, Aug 15, 2012 at 1:48 PM, Nick Cox <njcoxstata@gmail.com> wrote:
> More speed-ups:
>
> bysort cum_id_cum (datadate2) : gen d2_temp_2 = datadate_2[1]
> bysort cum_id_cum (date_t_4) : gen dt_temp_2 = date_t_4[1]
>
> On Wed, Aug 15, 2012 at 6:07 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>> replace cum_ab = r(sum) if `mark'
>>
>> should be
>>
>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>> replace cum_ab = cum_ab_temp if `mark'
>>
>> On Wed, Aug 15, 2012 at 5:59 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>>> The good news is that I think you are right. This code appears to be
>>> much more complicated than it needs to be.
>>>
>>> I can't follow your word description -- doesn't mean it's unclear,
>>> just means that it is too much for me to absorb -- but from looking at
>>> your code there are several major and minor inefficiencies.
>>>
>>> The main problems are
>>>
>>> 1. You have two outer loops tangled together, one Stata's and one
>>> home-made, but neither appears necessary.
>>>
>>> 2. The inner loop is a loop over one case, and so not needed.
>>>
>>> 3. -egen- calls are very inefficient to calculate constants that
>>> -summarize- can calculate, except that #1 implies to me that you can
>>> do most of the work in a few -egen- calls.
>>>
>>> 4. Some copying of values from one place to another to no obvious purpose.
>>>
>>> With some guesswork, your code to me boils down to
>>>
>>> egen d2_temp_2 = min(datadate_2), by(cum_id_cum)
>>> egen dt_temp_2 = min(date_t_4), by(cum_id_cum)
>>> egen cusip_mode = mode(cusip), by(cum_id_cum)
>>> local mark "date <= d2_temp_2 & date >= dt_temp_2 & cusip_mode==cusip"
>>> egen cum_ab_temp = total(ab_ret) if `mark', by(cum_id_cum)
>>> replace cum_ab = r(sum) if `mark'
>>>
>>> Note that your use of "temporary variables" is not the same as Stata's.
>>>
>>> Nick
>>>
>>> On Wed, Aug 15, 2012 at 4:26 PM, Robson Glasscock <glasscockrc@vcu.edu> wrote:
>>>
>>>> I am running Stata 12. I have written code that creates a variable
>>>> that is the sum of abnormal returns for each firm. The abnormal
>>>> returns are accumulated from three days after the firm makes its
>>>> earnings announcement in quarter t-4 to three days after the firm
>>>> makes its earnings announcement in quarter t. My problem is that it
>>>> takes around 2 minutes for my code to execute for each firm/quarter
>>>> announcement, and there are around 150,000 earnings announcements in
>>>> the dataset.
>>>>
>>>> I modified a panel dataset with the earnings announcement dates for
>>>> each firm so that each observation contains both the quarter t
>>>> earnings announcement date and the quarter t-4 earnings announcement
>>>> date (with the three days added to each per above). I then merged this
>>>> dataset with a second panel dataset that contains the daily returns
>>>> for each firm. The merged dataset has around 10.2 million
>>>> observations.
>>>>
>>>> Next, I created a count variable, cum_id_cum, which is a running total
>>>> of each earnings announcement. This variable is truly cumulative and
>>>> does not reset back to 1 when the next firm releases its first
>>>> earnings announcement. The loop contains a variable, runn, that starts
>>>> with a value equal to "1" and increases by 1 each time the loop is
>>>> processed. I'm using that to help identify the particular dates of the
>>>> quarter t and quarter t-4 earnings announcement so that the abnormal
>>>> returns are accumulated over the correct period. Datadate_2 is the
>>>> quarter t earnings announcement and date_t_4 is the quarter t-4
>>>> earnings announcement. Cusip is the identifier for each firm. Date is
>>>> the date of the firm's return in the stock market for each trading
>>>> day.
>>>>
>>>> The big picture of my approach was to create temporary variables that
>>>> will equal the cusip, datadate_2, and date_t_4 when runn equals
>>>> cum_id_cum. These first-step temporary variables (d2_temp, dt_temp,
>>>> and cusip_temp) have missing values except in the observation where
>>>> runn equals cum_id_cum so I created second-step temporary variables
>>>> (cusip_temp_2, d2_temp_2, and dt_temp_2) which place the cusip,
>>>> datadate_2, and date_t_4 values for each particular run through the
>>>> entire dataset. This allows me to mark the days for each firm that are
>>>> between the dates of the earnings announcements and then sum up the
>>>> abnormal returns in a temporary variable called cum_ab_temp. The final
>>>> variable with the sum of the abnormal returns for each firm is cum_ab
>>>> and is retained in the observation where _merge==3 (from the merge
>>>> mentioned in the second paragraph of this post).
>>>>
>>>> My code is below. Note that I constrain it to the first 25,000
>>>> cum_id_cum values due to macro size constraints:
>>>>
>>>> gen runn= 1
>>>> levelsof cum_id_cum if cum_id_cum !=. & cum_id_cum <25000, local(cum_id_cum)
>>>> foreach 1 of local cum_id_cum{
>>>> gen d2_temp= datadate_2 if runn== cum_id_cum
>>>> gen dt_temp= date_t_4 if runn== cum_id_cum
>>>> gen cusip_temp= cusip if runn== cum_id_cum
>>>> egen cusip_temp_2= mode(cusip_temp)
>>>> egen d2_temp_2= min(d2_temp)
>>>> egen dt_temp_2= min(dt_temp)
>>>> foreach x of varlist date{
>>>> replace mark= 1 if `x' <= d2_temp_2 & `x' >= dt_temp_2 & cusip_temp_2==cusip
>>>> egen cum_ab_temp= total(ab_ret) if mark==1
>>>> replace cum_ab= cum_ab_temp if datadate_2== d2_temp & dt_temp==
>>>> date_t_4 & cusip_temp==cusip & d2_temp !=. & dt_temp !=.
>>>> replace runn= runn+1
>>>> drop d2_temp
>>>> drop dt_temp
>>>> drop d2_temp_2
>>>> drop dt_temp_2
>>>> drop cusip_temp
>>>> drop cusip_temp_2
>>>> drop cum_ab_temp
>>>> replace mark=0
>>>>
>>>> }
>>>> }
>>>>
>>>> I'm wondering if there is a more efficient way to do the above which
>>>> will result in a significantly faster processing time. My fear is that
>>>> the above ignores the functionality of Stata and instead uses
>>>> inefficient brute force.
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/