Share it:

Curiously, even though we support arbitary arithmetic expressions for a while, I’ve never actually benchmarked how does our implementation compare to other ones, for instance, to MySQL. Time to fill that gap.

I used current trunk version of Sphinx (namely r2265) and MySQL 5.0.37, both running on Windows. Test data was 1 million random rows generated with following PHP script. Table type defaults to InnoDB.

The queries were in the form of SELECT MIN(a) AS result, 1 AS groupkey FROM bench GROUP BY groupkey, because of SphinxQL syntax being too tight (that’s to be fixed as time permits). Exactly the same queries were run against Sphinx and MySQL. Every query was repeated 3 times and minimum run time was picked, as it jittered for 0.01 or 0.02 sec across runs. Without further ado, there go the results!

N

expression

MySQL result

time

Sphinx result

time

k

1

min(a)

2244

0.39s

2244

0.08s

4.9x

2

min(a+b*c)

790494849806

0.45s

309

0.09s

5.0x

3

min(bigint(a)+b*c)

790494849806

0.45s

790494849806

0.11s

4.1x

4

sum(sin(a+b+c))

-695.3422782..

0.50s

765.542419

0.17s

2.9x

5

max(a+b*2-c/3*4+5+6+7+8+9)

6404668209

1.42s

6404668416

0.14s

10.1x

6

max(a+b*2-c*4/3+35)

6404668209

1.05s

6404668416

0.14s

7.5x

7

max(a+b*2-c*1.333333333+35)

6404668209.003..

0.84s

6404668416

0.14s

6.0x

8

max(a+b*2-idiv(bigint(c),3)*4+5+6+7+8+9)

6404668209

1.42s

6404668209

0.14s

10.1x

9

avg(a*b/c/1/2/3/4)

326342841.944..

2.55s

326041280.000

0.13s

19.6x

10

avg(a*b/c/24)

326342841.944..

1.39s

326041280.000

0.13s

10.7x

So there’s a noticeable difference in performance (and now I know that our expression calculator performs pretty well actually). However there are also some discrepancies in the results. Why?

Generally the reason is that Sphinx expression type system is (slightly) different. Let’s dissect specific examples.

All arguments to (a+b*c) expression are unsigned 32-bit integers (uint32), so Sphinx computed the expression as uint32. The result is however too big for uint32 and wraps around. This is how C/C++ compilers behave, actually, so you get have exactly the same result from uint32 res = a*b+c line in a C program. Databases are cleary different and auto-switch to a wider type. Now, when we enforce BIGINT type on one of the arguments (which is signed 64-bit aka int64), results start to match!

As for other expression that involve floating-point operations, like division or SIN, in those cases Sphinx (again working more like C/C++ compiler than a DB) computes results in 32-bit floats, while MySQL (supposedly) uses 64-bit doubles. Hence the roundoff errors and different results. Those are especially visible on this randomized data. For one, results are very different in SIN() case, basically because the sine values are small and 1 million times the 32-bit float roundoff error ends up being pretty big.

Another interesting finding is that MySQL does not optimize expressions really well, if at all. See for instance how expressions 5-6 and expressions 9-10 resulted in upto almost 2x different times from MySQL. Sphinx expression optimizer folds the constants and shows exactly the same timings. Of course examples like 5+6+7 are exaggerated but c/3*4 on the other hand is not, and it results in 20% difference.

Last but not least, despite being faster, we do still have plenty of room for optimizations. The expression calculations could be batched. GROUP BY 1 trick can not only fixed in syntax but also optimized for. I’ve made a very simple JIT expression compiler prototype once, boosting expressions like a+b*c to almost native performance levels. All that combined, I think there’s a solid chance to beat MySQL not just 20x but over 100x at complex enough expressions. Anybody out there doing lots of math in MySQL who can share samples of production queries?

The bottom line? Sphinx expressions can perform 3x to 20x better, and our optimizer also seems better. On int and bigint paths, there are no precision issues, but you have to be careful with C-style typing. On float path, there is a natural precision issue with 32-bit single precision type, C-style float once again. So I’ll be adding double precision support and redoing these benchmarks.

This entry was posted
on Monday, March 29th, 2010 at 10:00 and is filed under General.
You can follow any responses to this entry through the RSS 2.0 feed.
You can leave a response, or trackback from your own site.

[...] from Sphinx continues to work on improving SQL (or SphinxQL) support and now he published benchmarks comparing arithmetic expression handling in Sphinx to one in MySQL. The result ? Sphinx scored 3x [...]