After everything is said and done, a couple of regexes turn the results into CSV format. We add a line at the beginning naming the columns and switch back into GNU R.

3. Loading and plotting data

This R code will load our data into the system and attach the columns so that we can refer to them by name:
data <- read.table("~/repos/comparealgos.csv", header=TRUE, sep=",")
attach(data)  # assuming the header names the columns algo, n, time and bytes

Let’s draw some plots.

plot(n, time)

plot(n, bytes)

The intuitive story of (a) running time depending on the size of n and (b) the `mod` algorithm being faster than the gcd one seems to hold. But let’s put numbers into this.
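
Both algorithms end up overplotted in the same panel, so it helps to colour the points by algorithm before going further. A minimal sketch, assuming the algo column is coded 0 for gcd and 1 for mod as in the model below:

# black points for gcd (algo = 0), red for mod (algo = 1)
plot(n, time, col=ifelse(algo == 0, "black", "red"))
legend("topleft", legend=c("gcd", "mod"), col=c("black", "red"), pch=1)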

4. Linear regression analysis

In a nutshell, linear regression analysis is about fitting data to an equation (possibly a system of multiple equations in more sophisticated models) and performing statistical tests of whether the fit means anything. The first model we’re going to try to fit is the following:

$$\mathrm{time} = \beta_0 + \beta_1 n + \beta_2 n^2 + \beta_3\,\mathrm{algo} + \varepsilon$$

where $\mathrm{algo}$ is 0 for the gcd algorithm and 1 for the mod algorithm, and $\varepsilon$ accounts for serendipity.

Note the quadratic term: with this single term, we can screen for both more-than-linear and less-than-linear growth of time on n. Another way to look at this is to differentiate:

$$\frac{\partial\,\mathrm{time}}{\partial n} = \beta_1 + 2\beta_2 n$$

As we see, $\beta_1$ is a constant effect of scale, while $2\beta_2 n$ is an effect that varies with the size of n. If $\beta_2$ is negative, we have a smaller effect as n grows and vice-versa.

It’s also interesting to look at $E[\mathrm{time} \mid \mathrm{algo}]$:

$$E[\mathrm{time} \mid \mathrm{algo} = 1] - E[\mathrm{time} \mid \mathrm{algo} = 0] = \beta_3$$

where $E$ is the expected value. In this model, the effect of changing algorithms is supposed to be constant for any $n$.
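
Estimating the model in R is a one-liner. A minimal sketch, assuming the data frame is attached as above; I(n^2) makes R treat the squared term arithmetically, and model1 is just an illustrative name:

# fit time on n, n squared, and the algorithm dummy
model1 <- lm(time ~ n + I(n^2) + algo)
# print the coefficient table (estimates, standard errors, t values)
summary(model1)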

The estimate column holds the values of the parameters defined in the equation above. The t value column is a measure of how far the parameter is from zero given its uncertainty. If the regression residuals follow a Gaussian (normal) distribution, 1.96 (or -1.96) is a good critical value for considering the variable in question significant; in any case, values above 3 tend to indicate an important relationship.

The effect of $n$ ($\beta_1$) is apparently zero, but that of $n^2$ ($\beta_2$) is significant and positive, which means the algorithm grows more-than-linearly on size. The effect of $\mathrm{algo}$ ($\beta_3$) is very strongly negative, which indicates the second algorithm really tends to be faster.

We can run, just for kicks, the same regression for memory usage. The algebraic analysis remains the same.
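
A sketch of the analogous call, this time with bytes on the left-hand side (again, the name model1.mem is just illustrative):

# same specification, but regressing memory usage instead of time
model1.mem <- lm(bytes ~ n + I(n^2) + algo)
summary(model1.mem)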

Compare the multiple $R^2$ value between the two regressions. The $R^2$ statistic measures how closely the model fits the data. One way to look at this is to affirm that memory usage is apparently also determined by something else we haven’t mentioned: evidently, the size of the divisor list.
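
If you just want the fit statistics for the comparison, the summary objects expose them directly; for instance:

# multiple R-squared of the time and memory regressions
summary(model1)$r.squared
summary(model1.mem)$r.squared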

5. A better model

Let’s revise our equation:

$$\mathrm{time} = \beta_0 + \beta_1 n + \beta_2 n^2 + \beta_3\,\mathrm{algo} + \beta_4\,(\mathrm{algo} \cdot n) + \varepsilon$$

We’ve added an additional “interaction” term. To see what this means, let’s look at expected values again:

$$E[\mathrm{time} \mid \mathrm{algo} = 1] - E[\mathrm{time} \mid \mathrm{algo} = 0] = \beta_3 + \beta_4 n$$

In this new model, the effect of changing algorithms is allowed to vary with the size of $n$ as well. Let’s estimate this model:
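
A sketch of the estimation in R, where algo:n is the formula syntax for the interaction term (model2 being another illustrative name):

# add the interaction between the algorithm dummy and input size
model2 <- lm(time ~ n + I(n^2) + algo + algo:n)
summary(model2)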

Now all the coefficients are indeed very significant, including the linear term on $n$. This gives us the following formulas for predicting the running time of both algorithms on a given $n$:

$$\mathrm{time}_{\mathrm{gcd}} = \beta_0 + \beta_1 n + \beta_2 n^2$$

$$\mathrm{time}_{\mathrm{mod}} = (\beta_0 + \beta_3) + (\beta_1 + \beta_4)\,n + \beta_2 n^2$$

An interesting observation: the coefficient for $\mathrm{algo}$ ($\beta_3$) being positive means that the running time of the GCD algorithm starts lower; on the other hand, since $\beta_4$ is negative, MOD will quickly outrun it (solve a simple inequality to find the ranges of $n$ if you’re so inclined).
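
Once the model is estimated, predict() evaluates those formulas for us; for example, comparing the two algorithms at an arbitrary size (n = 1000 here is just an illustrative value):

# predicted running times at n = 1000: first entry gcd (algo=0), second mod (algo=1)
predict(model2, newdata=data.frame(n=c(1000, 1000), algo=c(0, 1)))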

6. Conclusions

I might return to this topic if it attracts enough interest. For the time being, I tried to keep it short and sweet, even if that meant glossing over all the relevant theoretical details. Discussion is appreciated, etc. etc. Oh, and remember that this was all run on an old G4 Mac mini, on the interpreter (not compiled/optimized), all while browsing and IMing at the same time. Don’t just benchmark this on JRuby on your dual Opteron and say your language leaves my language biting the dust.

Related

There are much faster ways of generating the list of divisors; it’s far better to factor the number, and then trivially generate the list of divisors from the list of prime factors. Even trial factorization, the stupidest prime factorization you could write, would beat this one, since as soon as you find _one_ divisor you can reduce the size of the problem…


