Re: st: egen and Mata

This was very interesting to read. Thanks! It must require huge
experience and expertise to make statistical code run faster. Just
this morning on another listserv, Douglas Bates posted this comment,
which sounded very similar:

"Some of the best mathematics I have ever done was
the derivation of the analytic gradient and the ECME updates for the
profiled deviance and then I found that the use of these elegant
formulas actually slowed things down on large problems. (There's a
lesson there for those who think that statistical computing research
consists of deriving a formula and letting others worry about the
trivial implementation details.)"

-Dave

On Jul 16, 2008, at 9:25 AM, William Gould, StataCorp LP wrote:

Paulo Guimaraes <guimaraes@moore.sc.edu> reports,

[...] I tried to code in Mata the equivalent to the following Stata
command:
. egen newvar=mean(var), by(id)

His goal was to see if he could write code in Mata that would be
faster than -egen- (an ado-file), and he failed. He made two attempts.
He reported these results, and I have duplicated them:

3. Evidently I (Gould) have a faster computer than Paulo,
although mine is a middle-of-the-road computer by modern
standards. Mine is an Intel 2.13 GHz Core 2 Duo running
64-bit Stata under Linux.
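For readers who missed the earlier posts, the general shape of such an
attempt -- Mata code that sorts the data and computes each group's
mean -- might be sketched as follows. This is my own illustration of
the approach, not Paulo's actual code, and the function name is
invented:

```stata
// Illustrative sketch only, not Paulo's code: put id and x in one
// matrix, sort it by id, then walk the sorted rows computing each
// group's mean and writing it back to the original row positions.
real colvector group_mean(real colvector id, real colvector x)
{
    real matrix    data
    real colvector result
    real scalar    i, j, n

    n = rows(id)
    data = sort((id, x, (1::n)), 1)     // col 3 keeps original row numbers
    result = J(n, 1, .)
    i = 1
    while (i <= n) {
        j = i                           // find the end of this id group
        while (j < n) {
            if (data[j+1, 1] != data[i, 1]) break
            j++
        }
        // mean of x over sorted rows i..j, spread to original positions
        result[data[|i,3 \ j,3|]] = J(j-i+1, 1, mean(data[|i,2 \ j,2|]))
        i = j + 1
    }
    return(result)
}
```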

One of Paulo's questions was why Attempt 1 is running so slowly. I
haven't
looked into that because, in addition to running slowly, Attempt 1 is
producing incorrect results on my computer, and I assume it is
doing the same on Paulo's. The program has a bug.

Attempt 2, on the other hand, works perfectly and, in fact, runs
pretty quickly. Paulo finds Attempt 2 takes 1.46 times as long as
-egen-, and on my computer, the multiplier is 1.15.

Paulo wants to know how to speed up Attempt 2, too. Answer: I don't
know how. I made two attempts and produced code that I considered
prettier but in fact ran more slowly. Actually, I wasn't surprised by
that.

My expectation was based on a comparison of what is true and what is
theoretically possible in comparisons of C and Mata. Mata produces
P-code, which must be interpreted. C produces machine language.
Mata's low-level functions (i=0, i++, etc.) run 2 to 4 times slower
than compiled C code. Said differently, they run at an overhead of
1:1 to 3:1. One-to-one (1:1) means that, to execute one machine
instruction on behalf of the user, Mata executes one additional
instruction just to figure out what needs to be done. Three-to-one
(3:1) means three additional instructions are executed to execute the
one instruction on behalf of the user.

1:1 may seem like a theoretical lower bound, but it is not. Consider
a high-level function such as matrix inversion. It might take the
Mata P-code interpreter 5 or even 10 machine instructions to figure
out what needs to be done, and then it would execute the million
machine instructions the problem requires, the same million
instructions C would have to execute, too. Thus, the overhead would
be .000005:1.

But for very low level instructions for which there is a one-to-one
matching
with the instruction set on the chip, such as i++, 1:1 is a
theoretical
lower bound. So why not produce machine language rather than P-code?
Because in that case, the compiled code would not be portable.

All the above has to do with a comparison of Mata and C. Paulo,
however, is comparing Mata to ado-code. The ado-code to which Paulo
is comparing is not typical, or rather, it is what should be typical
when ado-code is used for the appropriate problem. In this case, most
of the ado-code time is being spent in just a few highly efficient
Stata commands which are themselves implemented in C.

On my computer, the time required for solution (1) was 0.24 seconds.
Solution (2) would have taken 0.11 seconds. In fact, solution (2)
is exactly what -egen- implements. So -egen- is running at
45.8% efficiency. What could have been done by Stata's C code in
.11 seconds took -egen- .24 seconds.

As an aside, let me add that -sort-, -by id:-, -gen-, and -replace-
are four features we have worked very hard on to make as efficient
as possible. There are 4,901 lines of C code involved in implementing
these functions for the problem given.
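The sort/by/gen/replace recipe behind solution (2) can be sketched in
ado-code roughly as follows. This is my reconstruction of the
standard idiom, not a listing of -egen-'s actual implementation,
which may differ in details:

```stata
* Hypothetical sketch of the sort/by/gen/replace grouped-mean idiom.
sort id
* sum() is a running sum; sum(var < .) is a running count of nonmissing
by id: generate double newvar = sum(var) / sum(var < .)
* the last observation in each group now holds the group mean; spread it
by id: replace newvar = newvar[_N]
```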

In Paulo's Attempt 2, he wrote 27 lines of code, and on my computer,
it ran in .47 seconds. Thus, the ratio is .47/.11 = 4.27. This is
about the ratio we expect to see on low-level code.

Paulo did as well as could be expected.

Note well, when Mata is used on problems requiring matrix inversion
or other heavy lifting, which includes the implementation of
statistical commands for which Mata was designed, this multiplier of
4.27 falls; it can drop to 2, or even to just a little more than 1.
That is why we use Mata for our statistical development. The fact
remains, however, that for problems where the solution involves only
Mata's low-level features, we expect a multiplier of about 4.

So when I saw Paulo's question, I knew I was unlikely to have a better
answer for him. Nonetheless, I tried.

My embarrassment is that this took 0.77 seconds to run on my computer,
versus .47 seconds for Paulo's Attempt 2. I am nonetheless convinced
that a combination of Paulo's code and mine could run faster. Who
knows,
we might even get down to .40 seconds from Paulo's .47.

Paulo, my one comment about your Attempt 2 concerns putting all the
data together in one matrix and then sorting it. In my attempt, I
keep the data in separate vectors, and rather than using sort(), I
use order() to obtain the permutation vector and then use it where I
need it. My goal was to avoid coding things like data[,2], which
requires extracting the second column from a matrix and so takes
extra time to execute.
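Concretely, the separate-vector pattern I have in mind looks roughly
like this (an illustrative sketch with an invented name, not my
actual submission): order() yields a permutation vector p with id[p]
ascending, and p is then used directly as a subscript, so no combined
matrix is ever built and no columns need extracting:

```stata
// Sketch of the separate-vector pattern (illustration only).
// order() returns a permutation vector p such that id[p] is ascending;
// the data themselves are never copied into a combined matrix.
real colvector group_mean2(real colvector id, real colvector x)
{
    real colvector p, result
    real scalar    i, j, n

    n = rows(id)
    p = order(id, 1)
    result = J(n, 1, .)
    i = 1
    while (i <= n) {
        j = i                   // find the end of this id group
        while (j < n) {
            if (id[p[j+1]] != id[p[i]]) break
            j++
        }
        // assign the group mean to every member of the group
        result[p[|i \ j|]] = J(j-i+1, 1, mean(x[p[|i \ j|]]))
        i = j + 1
    }
    return(result)
}
```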