Re: st: speed question: -collapse- vs -egen-

Hello All!
Jeph has asked about an efficient way of creating a dataset with means
of one variable over the categories of another variable. He suggested
two possible solutions and Stas added a third one.
Below I report performance of each of these methods and compare it
with the fourth: a plugin.
I use an expanded version of auto.dta and tabulate mean {price} by
different levels of {rep78}.
1. All methods resulted in the following table of results*
meanprice rep78
4564.5 1
5967.625 2
6429.233 3
6071.5 4
5913 5
2. The timing is as follows (Stata SE, Windows Server 2003, 32-bit)
1: 33.80 / 1 = 33.7960
2: 31.22 / 1 = 31.2190
3: 21.33 / 1 = 21.3280
4: 5.58 / 1 = 5.5780
3. Since the plugin was intended for similar but not exactly the same
purposes, it does some extra work (simultaneously computing
frequencies, etc), which means that this is not the ultimate record.
4. The plugin must be "plugged-in" before use. To achieve this, I
first call the plugin without timing, so that Stata loads the DLL and
becomes aware of it. This process takes about 3 seconds on this
particular machine, because each DLL loaded into memory is scanned by
an antivirus on-the-fly. This time-loss is a one-time loss, and if
Jeph calls this program routinely, he should not be concerned about
this fixed cost. Even with this overhead, the plugin still easily
beats all of the competition above.
5. The benchmark program is listed below. -ftabstat- is an ado-wrapper
for the plugin. So if anyone wants to reproduce these results, must
first obtain this plugin from me (please write "ftabstat" in the email
subject line).
6. This particular plugin has a limitation of the matrix size for
groups (which is 11000 in most versions of Stata). It also does not
properly handle the missing values in categorical variable (that's why
I discard observations with missing {rep78}), but this can all be done
without a penalty in terms of execution time. Another limitation is
that the plugin is platform-specific, and this one is for Win32 only
(it will run with Stata 32-bit on Windows 64-bit, though).
7. If Jeph is concerned about speed - plugins are the way to go.
8. I would be glad to see a solution in Mata posted here to see how
efficient would that be.
9. the name "ftabstat" comes from "fast tabstat" as originally the
plugin was "tabstat on steroids". But then a couple of other features
were added.
tabstat price, by(rep78) save
takes ~45 seconds to just create a matrix, so it does not even qualify.
Have a good weekend,
Sergiy Radyakin
// Benchmark program speed.do
------------------------------------------------------
set more off
set mem 500m
log using speed.txt, text replace
// get data
sysuse auto, clear
keep rep78 price
keep if !missing(rep78)
expand 100000
timer clear
timer on 1
ftabstat price if _n==1 // kludge to load the plugin
timer off 1
timer list 1
timer clear 1
preserve // start Jeph Herrin 1
timer on 1
bys rep78: egen meanprice=mean(price)
bys rep78: keep if _n==1
keep rep78 meanprice
timer off 1
list rep78 meanprice, clean noobs abb(10)
restore
preserve // start Jeph Herrin 2
timer on 2
collapse (mean) meanprice=price,by(rep78)
timer off 2
list rep78 meanprice, clean noobs abb(10)
restore
preserve // start Stas Kolenikov
timer on 3
gen byte one=1 if !missing(price)
bys rep78: gen meanprice = sum(price)/sum(one)
by rep78: keep if _n==_N
keep rep78 meanprice
timer off 3
list rep78 meanprice, clean noobs abb(10)
restore
preserve // start Sergiy Radyakin
timer on 4
ftabstat price, by(rep78)
drop _all
matrix A=e(b)'
svmat A,names(col)
rename y1 meanprice
matrix A=e(Row)'
svmat A, names(col)
rename r1 rep78
timer off 4
list rep78 meanprice, clean noobs abb(10)
restore
timer list
log close
// end of benchmark program speed.do
------------------------------------------------------
*PS: The solution suggested by Stas will yield incorrect results,
since the sum of {one} will also count observations for which the
variable of interest is missing. This is definitely a typo and I have
corrected it before doing any comparisons by changing the following
line:
gen byte one = 1
to
gen byte one = 1 if !missing(price)
On 4/25/08, Stas Kolenikov <skolenik@gmail.com> wrote:
> NJC can offer a precise answer, but my take would be
>
> gen byte one = 1
> bys group: gen varmean = sum(mean)/sum(one)
> by group: keep if _n==_N
> keep whatever
>
> Topics like those should've been covered somewhere in Nick's column in
> Stata Journal, or in Stata tips. -egen- is slow as it does a lot of
> checks and parsing and stuff -- for big processing jobs, single-liners
> like above are always notably faster. -collapse- should be at least a
> tad faster than -egen-, but again I would expect it to lose to the
> above code.
>
> On Fri, Apr 25, 2008 at 2:37 PM, Jeph Herrin <junk@spandrel.net> wrote:
> >
> > I'm optimizing some code that needs to run often
> > for a simulation, and am wondering if I should
> > expect any difference in processing time between
> >
> > bys group: egen varmean=mean(myvar)
> > bys group: keep if _n==1
> > keep group varmean
> >
> > and
> >
> > collapse (mean) varmean=myvar, by(group)
> >
> > and if so, which would be faster?
> >
> > I know I could run some tests myself, but figured
> > that others had either already done so or at least
> > would have some insight.
> >
>
>
>
>
> --
> Stas Kolenikov, also found at http://stas.kolenikov.name
>
> Small print: Please do not reply to my Gmail address as I don't check
> it regularly.
> *
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/
>
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/