
As before, the results reported here were measured with the feature/analytics branch of the v7fasttrack open source distribution, which will soon be available as a preconfigured Amazon EC2 image. The updated benchmarks AMI with this version of the software will be published within the next week and announced on this blog.

On the Cost of RDF Query Optimization

RDF query optimization is harder than its relational equivalent: first, because there are more joins, hence an NP-complete explosion of the plan search space; and second, because cardinality estimation is harder and usually less reliable. The work on characteristic sets, pioneered by Thomas Neumann in RDF-3X, exploits regularities in structure to treat properties that usually occur with the same subject as columns of a table. The same idea is applied to tuning the physical representation in the joint Virtuoso / MonetDB work published at WWW 2015.
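The idea behind characteristic sets can be illustrated with a short sketch (toy data and names of my own invention, not the RDF-3X or Virtuoso implementation): group triples by subject and take the set of predicates each subject carries; subjects sharing the same predicate set form a candidate wide table.

```python
from collections import defaultdict

# Toy triples: (subject, predicate, object). Hypothetical data.
triples = [
    ("s1", "name", "Alice"), ("s1", "age", 30),
    ("s2", "name", "Bob"),   ("s2", "age", 25),
    ("s3", "title", "RDF-3X"),
]

# The characteristic set of a subject is the set of its predicates.
preds_by_subject = defaultdict(set)
for s, p, o in triples:
    preds_by_subject[s].add(p)

# Group subjects by characteristic set; frequently occurring sets
# suggest properties that can be stored as columns of one wide table.
subjects_by_cset = defaultdict(list)
for s, preds in preds_by_subject.items():
    subjects_by_cset[frozenset(preds)].append(s)

for cset, subjects in subjects_by_cset.items():
    print(sorted(cset), "->", sorted(subjects))
```

Here s1 and s2 share the set {age, name} and would land in one "person-like" table, while s3 stays apart.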

The Virtuoso results discussed here, however, are all based on a single RDF quad table with Virtuoso's default index configuration.

Introducing query plan caching raises the Virtuoso score from 80 qps to 144 qps at the 256 Mtriple scale. The SPB queries are not extremely complex; lookups with many more triple patterns exist in actual workloads, e.g., Open PHACTS. In such applications, query optimization truly dominates execution time. In SPB, the data volume touched by a query grows near-linearly with data scale. At the 256 Mtriple scale, nearly half of all CPU cycles are spent deciding a query plan. Below are the CPU cycles for execution and compilation per query type, sorted by descending sum of the two, scaled to milliseconds per execution. These were taken from a one-minute sample of running at full throughput.
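What plan caching buys can be sketched roughly as follows (the structure and names are hypothetical, not Virtuoso internals): key the cache on the query after normalizing away literals, so the expensive cost-based plan search runs once per query shape rather than once per execution.

```python
import re

plan_cache = {}  # normalized query text -> compiled plan

def normalize(query: str) -> str:
    # Replace literals so queries differing only in constants share a
    # plan. Crude textual sketch: real systems normalize the parse tree.
    return re.sub(r'"[^"]*"|\b\d+\b', "?", query)

def optimize(query: str):
    # Stand-in for the costly plan search; returns a dummy plan object.
    return ("plan-for", normalize(query))

def get_plan(query: str):
    key = normalize(query)
    plan = plan_cache.get(key)
    if plan is None:
        plan = optimize(query)   # full cost-based search, paid once
        plan_cache[key] = plan   # reused by every later execution
    return plan

get_plan('SELECT ?x WHERE { ?x name "Alice" }')
get_plan('SELECT ?x WHERE { ?x name "Bob" }')  # cache hit: same shape
print(len(plan_cache))  # 1
```

With nearly half the cycles going to plan selection, removing that cost per execution roughly matches the 80-to-144 qps jump reported above.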

We measured compile and execute times, with and without considering hash join. When hash join is considered, throughput is 80 qps; when it is not, 110 qps. With query plan caching, throughput is 145 qps whether or not hash join is considered. In other words, using hash join is not significant for this workload, but considering it during query optimization causes significant extra work.

With hash join

Compile    Execute    Total      Query
3156 ms    1181 ms    4337 ms    Total
1327 ms      28 ms    1355 ms    query 01
 444 ms     460 ms     904 ms    query 08
 466 ms      54 ms     520 ms    query 06
 123 ms     268 ms     391 ms    query 05
 257 ms       5 ms     262 ms    query 11
 191 ms      59 ms     250 ms    query 10
   9 ms     179 ms     188 ms    query 04
 114 ms      26 ms     140 ms    query 07
  46 ms      62 ms     108 ms    query 09
  71 ms      25 ms      96 ms    query 12
  61 ms      13 ms      74 ms    query 03
  47 ms       2 ms      49 ms    query 02

Without hash join

Compile    Execute    Total      Query
1816 ms    1019 ms    2835 ms    Total
 197 ms     466 ms     663 ms    query 08
 609 ms      32 ms     641 ms    query 01
 188 ms     293 ms     481 ms    query 05
 275 ms      61 ms     336 ms    query 09
 163 ms      10 ms     173 ms    query 03
 128 ms      38 ms     166 ms    query 10
 102 ms       5 ms     107 ms    query 11
  63 ms      27 ms      90 ms    query 12
  24 ms      57 ms      81 ms    query 06
  47 ms       1 ms      48 ms    query 02
  15 ms      24 ms      39 ms    query 07
   5 ms       5 ms      10 ms    query 04

Considering hash join always slows down compilation, and sometimes improves and sometimes worsens execution. Some improvement in the cost model and in plan-space traversal order is possible, but removing compilation altogether via caching is better still. The results are as expected, since a lookup workload such as SPB by its nature has little use for hash join.

The rationale for considering hash join in the first place is that analytical workloads rely heavily on it; a good TPC-H score is simply infeasible without it, as previously discussed on this blog. If RDF is to be a serious contender beyond serving lookups, then hash join is indispensable. The decision to use it, however, depends on accurate cardinality estimates on either side of the join.
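The decision just described can be sketched with a toy cost model (all the unit costs here are made-up illustration, not Virtuoso's actual model): a hash join pays a one-time build over the smaller input plus a probe per row of the larger, while an index join pays an index lookup per outer row, so the choice hinges entirely on the cardinality estimates.

```python
def choose_join(left_rows: int, right_rows: int,
                build_cost: float = 1.5, probe_cost: float = 1.0,
                index_lookup_cost: float = 4.0) -> str:
    """Pick a join method from cardinality estimates.

    Hypothetical unit costs: hash join builds a table over the smaller
    side and probes it once per row of the larger side; index join
    drives from the smaller side and does one index lookup per row.
    """
    small, large = sorted((left_rows, right_rows))
    hash_cost = small * build_cost + large * probe_cost
    index_cost = small * index_lookup_cost
    return "hash" if hash_cost < index_cost else "index"

# Lookup-style query: a handful of bound subjects against a big table.
print(choose_join(10, 1_000_000))     # index
# Analytical join between two large intermediate results.
print(choose_join(100_000, 200_000))  # hash
```

If the estimates feeding `small` and `large` are off by orders of magnitude, as is common with RDF, the optimizer picks the wrong branch, which is exactly why the estimate quality matters.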

Previous work (e.g., papers from FORTH around MonetDB) advocates doing away with a cost model altogether, since one is hard to build and unreliable for RDF anyway. The idea is not without its attraction, but it would mean either missing out on analytics or relying on query hints to trigger hash join.

The present Virtuoso thinking is that rule-based optimization is not the preferred solution; rather, characteristic sets should be used to reduce triples into wider tables, which also cuts down the plan search space and increases the reliability of cost estimation.
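The search-space argument can be made concrete with back-of-the-envelope arithmetic: with n joinable inputs the optimizer faces on the order of n! join orders, so folding, say, six triple patterns into two wide-table scans collapses the space dramatically. The example counts below are illustrative, not measurements.

```python
from math import factorial

# Six triple patterns joined pairwise: each is a separate join input.
triple_pattern_orders = factorial(6)  # 720 join orders to consider

# Characteristic sets fold the same patterns into two wide-table scans.
wide_table_orders = factorial(2)      # 2 join orders

print(triple_pattern_orders, wide_table_orders)  # 720 2
```

Fewer inputs also means fewer intermediate cardinalities to estimate, which is where the reliability gain comes from.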

Looking at execution alone, actual database operations are low in the profile, with memory management at the top at 19%. This is due to CONSTRUCT queries allocating many small blocks for the graphs they return, which is entirely avoidable.
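One standard way to avoid per-triple small-block allocation is arena-style batching, sketched here in Python purely as an illustration (Virtuoso's actual fix would be in its C engine and may differ): results are appended into large chunks, so the allocator is invoked once per chunk rather than once per triple.

```python
# Arena-style batching sketch. Instead of allocating one small block
# per result triple of a CONSTRUCT query, append into large chunks and
# let one allocation cover thousands of triples. CHUNK is hypothetical.
CHUNK = 4096  # triples per chunk

class TripleArena:
    def __init__(self):
        self.chunks = [[]]

    def add(self, s, p, o):
        chunk = self.chunks[-1]
        if len(chunk) >= CHUNK:   # current chunk full:
            chunk = []            # one allocation per CHUNK triples
            self.chunks.append(chunk)
        chunk.append((s, p, o))

    def __iter__(self):
        for chunk in self.chunks:
            yield from chunk

arena = TripleArena()
for i in range(10_000):
    arena.add(f"s{i}", "p", i)
print(len(arena.chunks))  # 3 chunks instead of 10,000 small blocks
```

The returned graph is still iterated triple by triple; only the allocation pattern changes, which is what takes memory management out of the top of the profile.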