DBpedia Usage Report

We've just published the latest DBpedia Usage Report, covering v3.3 (released July, 2009) to v3.9 (released September, 2013); v3.10 (sometimes called "DBpedia 2014"; released September, 2014) will be included in the next report.

We think you'll find some interesting details in the statistics. There are also some important notes about Virtuoso configuration options and other sneaky technical issues that can surprise you (as they did us!) when exposing an ad-hoc query server to the world.

LDBC SPB (Semantic Publishing Benchmark) is based on the BBC Linked Data use case. Thus the data modeling and transaction mix reflect the BBC's actual utilization of RDF. But a benchmark is not only a condensation of current best practice. The BBC Linked Data is deployed on Ontotext GraphDB (formerly known as OWLIM).

So, in SPB we wanted to address substantially more complex queries than the lookups that the BBC linked data deployment primarily serves. Diverse dataset summaries, timelines, and faceted search qualified by keywords and/or geography are examples of online user experience that SPB needs to cover.

SPB is not an analytical workload, per se, but we still find that the queries fall broadly in two categories:

Some queries are centered on a particular search or entity. The size of the data touched by the query does not grow at the same rate as the dataset.

Some queries cover whole cross sections of the dataset, e.g., find the most popular tags across the whole database.

These different classes of questions need to be separated in a metric, otherwise the short lookup dominates at small scales, and the large query at large scales.

Another guiding factor of SPB was the BBC's and others' express wish to cover operational aspects such as online backups, replication, and fail-over in a benchmark. True, most online installations have to deal with these, yet these things are as good as absent from present benchmark practice. We will look at these aspects in a different article; for now, I will just discuss the matter of workload mix and metric.

Normally, the lookup and analytics workloads are divided into different benchmarks. Here, we will try something different. There are three things the benchmark does:

Updates - These sometimes insert a graph, sometimes delete and re-insert the same graph, sometimes just delete a graph. These are logarithmic to data size.

Short queries - These are lookups that most often touch on recent data and can drive page impressions. These are roughly logarithmic to data scale.

Analytics - These cover a large fraction of the dataset and are roughly linear to data size.

A test sponsor can decide on the query mix within certain bounds. A qualifying run must sustain a minimum, scale-dependent update throughput and must execute a scale-dependent number of analytical query mixes, or run for a scale-dependent duration. The minimum update rate, the minimum number of analytics mixes and the minimum duration all grow logarithmically to data size.

Within these limits, the test sponsor can decide how to mix the workloads. Publishing several results emphasizing different aspects is also possible. A given system may be especially good at one aspect, leading the test sponsor to accentuate this.

The benchmark has been developed and tested at small scales, between 50M and 150M triples. Next we need to see how it actually scales. There we expect to see how the two query sets behave differently. One effect that we see right away when loading data is that creating the full text index on the literals is in fact the longest running part. For an SF 32 (1.6 billion triples) SPB database we have the following space consumption figures:

46,886 MB of RDF literal text

23,924 MB of full text index for RDF literals

23,598 MB of URI strings

21,981 MB of quads, stored column-wise with default index scheme

Clearly, applying column-wise compression to the strings is the best move for increasing scalability. The literals are individually short, so literal-by-literal compression will do little or nothing, but compressing by the column is known to get a 2x size reduction with Google Snappy.
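The difference between compressing each literal on its own and compressing a whole column can be seen in a small sketch. This is only an illustration under stated assumptions: it uses zlib from the Python standard library as a stand-in for Snappy, and the literals are made-up strings, not SPB data, so the ratios it prints say nothing about the 2x figure above.

```python
import zlib

# Hypothetical short literals standing in for RDF literal values (not SPB data).
literals = [
    f"BBC News article {i}: headline about topic {i % 7}".encode("utf-8")
    for i in range(1000)
]

raw_bytes = sum(len(lit) for lit in literals)

# Literal by literal: each value is compressed on its own, so the compressor
# never sees the repetition that exists across values.
per_literal_bytes = sum(len(zlib.compress(lit)) for lit in literals)

# Column-wise: the values are concatenated and compressed as one block.
# A real column store would also keep offsets for random access; omitted here.
column_bytes = len(zlib.compress(b"\n".join(literals)))

print(f"raw: {raw_bytes}  per-literal: {per_literal_bytes}  column-wise: {column_bytes}")
```

The point of the sketch is only the ordering of the two compressed sizes: the shared vocabulary across short strings is visible to the compressor only when it is handed the whole column.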

The full text index does not get much from column store techniques, as it already consists of words followed by space efficient lists of word positions. The above numbers are measured with Virtuoso column store, with quads column-wise and the rest row-wise. Each number includes the table(s) and any extra indices associated to them.

Let's now look at a full run at unit scale, i.e., 50M triples.

The run rules stipulate a minimum of 7 updates per second. The updates are comparatively fast, so we set the update rate to 70 updates per second. This is seen not to take too much CPU. We run 2 threads of updates, 20 of short queries, and 2 of long queries. The minimum run time for the unit scale is 10 minutes, so we do 10 analytical mixes, as this is expected to take a little over 10 minutes. The run stops by itself when the last of the analytical mixes finishes.

The SUT is dual Xeon E5-2630, all in memory. The platform utilization is steadily above 2000% CPU (over 20/24 hardware threads busy on the DBMS). The DBMS is Virtuoso Open Source (v7fasttrack at github.com, feature/analytics branch).

The minimum update rate of 7/s was sustained, but fell short of the target of 70/s. In this run, most demand was put on the interactive queries. Different thread allocations would give different ratios of the metric components. The analytics mix, for example, is about 3x faster without other concurrent activity.

Is this good or bad? I would say that this is possible but better can certainly be accomplished.

The initial observation is that Q17 is the worst of the interactive lot. 3x better is easily accomplished by avoiding a basic stupidity. The query does the evil deed of checking for a substring in a URI. This is done in the wrong place and accounts for most of the time. The query is meant to test geo retrieval but ends up doing something quite different. Optimizing this right would by itself almost double the interactive score. There are some timeouts in the analytical run, which as such disqualify the run. This is not a fully compliant result, but is close enough to give an idea of the dynamics. So we see that the experiment is definitely feasible, is reasonably defined, and that the dynamics seen make sense.

As an initial comment on the workload mix, I'd say that interactive should have a few more very short point-lookups, to stress compilation times and give a higher absolute score of queries per second.

Adjustments to the mix will depend on what we find out about scaling. As with SNB, it is likely that the workload will shift a little so this result might not be comparable with future ones.

In the next SPB article, we will look closer at performance dynamics and choke points and will have an initial impression on scaling the workload.

A benchmark is known by its primary metric. An actual benchmark implementation may deal with endless complexity but the whole point of the exercise is to reduce this all to an extremely compact form, optimally a number or two.

For SNB, we suggest clicks per second Interactive at scale (cpsI@ so many GB) as the primary metric. To each scale of the dataset corresponds a rate of update in the dataset's timeline (simulation time). When running the benchmark, the events in simulation time are transposed to a timeline in real time.

Another way of expressing the metric is therefore acceleration factor at scale. In this example, we run a 300 GB database at an acceleration of 1.64; i.e., in the present example, we did 97 minutes of simulation time in 58 minutes of real time.

Another key component of a benchmark is the full disclosure report (FDR). This is expected to enable any interested party to reproduce the experiment.

The system under test (SUT) is Virtuoso running an SQL implementation of the workload at 300 GB (SF = 300). This run gives an idea of what an official report will look like but is not one yet. The implementation differs from the present specification in the following:

The SNB test driver is not used. Instead, the workload is read from the file system by stored procedures on the SUT. This is done to circumvent latencies in update scheduling in the test driver which would result in the SUT not reaching full platform utilization.

The workload is extended by 2 short lookups, i.e., person profile view and post detail view. These are very short and serve to give the test more of an online flavor.

The short queries appear in the report as multiple entries. This should not be the case. This inflates the clicks per second number but does not significantly affect the acceleration factor.

As a caveat, this metric will not be comparable with future ones.

Aside from the composition of the report, the interesting point is that with the present workload, a 300 GB database keeps up with the simulation timeline on a commodity server, also when running updates. The query frequencies and run times are in the full report. We also produced a graphic showing the evolution of the throughput over a run of one hour.


We see steady throughput except for some slower minutes which correspond to database checkpoints. (A checkpoint, sometimes called a log checkpoint, is the operation which makes a database state durable outside of the transaction log.) If we run updates only at full platform, we get an acceleration of about 300x in memory for 20 minutes, then 10 minutes of nothing happening while the database is being checkpointed. This is measured with six 2 TB magnetic disks. Such a behavior is incompatible with an interactive workload. But with a checkpoint every 10 minutes and updates mixed with queries, checkpointing the database does not lead to impossible latencies. Thus, we do not get the TPC-C syndrome which requires tens of disks or several SSDs per core to run.

This is a good thing for the benchmark, as we do not want to require unusual I/O systems for competition. Such a requirement would simply encourage people to ignore the specification on this point and would limit the number of qualifying results.

The full report contains the details. This is also a template for later "real" FDRs. The supporting files are divided into test implementation and system configuration. With these materials plus the data generator, one should be able to repeat the results using a Virtuoso Open Source cut from v7fasttrack at github.com, feature/analytics branch.

In later posts we will analyze the results a bit more and see how much improvement potential we find. The next SNB article will be about the business intelligence and graph analytics areas of SNB.

I will here develop some ideas on the platform of Peter Boncz's inaugural lecture mentioned in the previous post. This is a high-level look at where the leading edge of analytics will be, now that the column store is mainstream.

Peter's description of his domain was roughly as follows, summarized from memory:

The new chair is for data analysis and engines for this purpose. The data analysis engine includes the analytical DBMS but is a broader category. For example, the diverse parts of the big data chain (including preprocessing, noise elimination, feature extraction, natural language extraction, graph analytics, and so forth) fall under this category, and most of these things are usually not done in a DBMS. For anything that is big, the main challenge remains one of performance and time to solution. These things are being done, and will increasingly be done, on a platform with heterogeneous features, e.g., CPU/GPU clusters, possibly custom hardware like FPGAs, etc. This is driven by factors of cost and energy efficiency. Different processing stages will sometimes be distributed over a wide area, as for example in instrument networks and any network infrastructure, which is wide area by definition.

The design space of database and all that is around it is huge, and any exhaustive exploration is impossible. Development times are long, and a platform might take ten years to be mature. This is ill compatible with academic funding cycles. However, we should not leave all the research in this to industry, as industry maximizes profit, not innovation or absolute performance. Architecting data systems has aspects of an art. Consider the parallel with architecture of buildings: There are considerations of function, compatibility with environment, cost, restrictions arising from the materials at hand, and so forth. How a specific design will work cannot be known without experiment. The experiments themselves must be designed to make sense. This is not an exact science with clear-cut procedures and exact metrics of success.

This is the gist of Peter's description of our art. Peter's successes, best exemplified by MonetDB and Vectorwise, arise from focus over a special problem area and from developing and systematically applying specific insights to a specific problem. This process led to the emergence of the column store, which is now a mainstream thing. The DBMS that does not do columns is by now behind the times.

Needless to say, I am a great believer in core competence. Not every core competence is exactly the same. But a core competence needs to be broad enough so that its integral mastery and consistent application can produce a unit of value valuable in itself. What and how broad this is varies a great deal. Typically such a unit of value is something that is behind a "natural interface." This defies exhaustive definition but the examples below may give a hint. Looking at value chains and all diverse things in them that have a price tag may be another guideline.

There is a sort of Hegelian dialectic to technology trends: At the start, it was generally believed that a DBMS would be universal like the operating system itself, with a few products with very similar functionality covering the whole field. The antithesis came with Michael Stonebraker declaring that one size no longer fit all. Since then the transactional (OLTP) and analytical (OLAP) sides are clearly divided. The eventual synthesis may be in the air, with pioneering work like HyPer led by Thomas Neumann of TU München. Peter, following his Humboldt prize, has spent a couple of days a week in Thomas's group, and I have joined him there a few times. The key to eventually bridging the gap would be compilation and adaptivity. If the workload is compiled on demand, then the right data structures could always be at hand.

This might be the start of a shift similar to the column store turning the DBMS on its side, so to say.

In the mainstream of software engineering, objects, abstractions and interfaces are held to be a value almost in and of themselves. Our science, that of performance, stands in apparent opposition to at least any naive application of the paradigm of objects and interfaces. Interfaces have a cost, and boxes limit transparency into performance. So inlining and merging distinct (in principle) processing phases is necessary for performance. Vectoring is one take on this: An interface that is crossed just a few times is much less harmful than one crossed a billion times. Using compilation, or at least type-and-data-structure-specific variants of operators and switching their application based on run-time observed behaviors, is another aspect of this.
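To make the point about vectoring concrete, here is a toy sketch that counts how often an interface is crossed when values are handed over one at a time versus in batches. The function names and the batch size are made up for illustration; this is not how any of the engines mentioned here is implemented.

```python
from typing import Dict, List

# Counters for how many times each interface is crossed.
calls: Dict[str, int] = {"row_at_a_time": 0, "vectored": 0}

def add_one_row(x: int) -> int:
    calls["row_at_a_time"] += 1   # one crossing per value
    return x + 1

def add_one_vector(xs: List[int]) -> List[int]:
    calls["vectored"] += 1        # one crossing per batch of values
    return [x + 1 for x in xs]

data = list(range(10_000))
VECTOR_SIZE = 1_000  # hypothetical batch size, chosen only for the example

# Row at a time: the operator boundary is crossed once per value.
row_result = [add_one_row(x) for x in data]

# Vectored: the same boundary is crossed once per batch.
vec_result: List[int] = []
for start in range(0, len(data), VECTOR_SIZE):
    vec_result.extend(add_one_vector(data[start:start + VECTOR_SIZE]))

print(calls)
```

Both paths compute the same result; only the number of crossings differs, which is the whole argument for keeping the interface but amortizing its cost.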

Information systems thus take on more attributes of nature, i.e., more interconnectedness and adaptive behaviors.

Something quite universal might emerge from the highly problem-specific technology of the column store. The big scan, selective hash join plus aggregation, has been explored in slightly different ways by all of HyPer, Vectorwise, and Virtuoso.

Interfaces are not good or bad, in and of themselves. Well-intentioned naïveté in their use is bad. As in nature, there are natural borders in the "technosphere"; declarative query languages, processor instruction sets, and network protocols are good examples. Behind a relatively narrow interface lies a world of complexity of which the unsuspecting have no idea. In biology, the cell membrane might be an analogy, but this is in all likelihood more permeable and diverse in function than the techno examples mentioned.

With the experience of Vectorwise and later Virtuoso, it turns out that vectorization without compilation is good enough for TPC-H. Indeed, I see a few percent of gain at best from further breaking of interfaces and "biology-style" merging of operators and adding inter-stage communication and self-balancing. But TPC-H is not the end of all things, even though it is a sort of rite of passage: Jazz players will do their take on Green Dolphin Street and Summertime.

Science is drawn towards a grand unification of all which is. Nature, on the other hand, discloses more and more diversity and special cases, the closer one looks. This may be true of physical things, but also of abstractions such as software systems or mathematics.

So, let us look at the generalized DBMS, or the data analysis engine, as Peter put it. The use of DBMS technology is hampered by its interface, i.e., declarative query language. The well known counter-reactions to this are the NoSQL, MapReduce, and graph DB memes, which expose lower level interfaces. But then the interface gets put in the whole wrong place, denying most of the things that make the analytics DBMS extremely good at what it does.

We need better and smarter building blocks and interfaces at zero cost. We continue to need blocks of some sort, since algorithms would stop being understandable without any data/procedural abstraction. At run time, the blocks must overlap and interpenetrate: Scan plus hash plus reduction in one loop, for example. Inter-thread, inter-process status sharing for things like top k for faster convergence, for another. Vectorized execution of the same algorithm on many data for things like graph traversals. There are very good single blocks, like GPU graph algorithms, but interface and composability are ever the problem.
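As a rough illustration of "scan plus hash plus reduction in one loop," the sketch below runs the same small query twice: once as three separate blocks that each materialize their output, and once with the blocks merged. The tables and names are hypothetical toy data, and a real engine would do this over vectors and in compiled code rather than in Python.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy data: (part_key, quantity) rows and a small build side.
lineitem: List[Tuple[int, int]] = [(k % 5, k) for k in range(100)]
selected_parts: Dict[int, str] = {1: "red", 3: "blue"}

def staged(rows, build):
    """Three separate blocks, each materializing its full output."""
    scanned = [r for r in rows]                                  # scan
    joined = [(build[k], q) for k, q in scanned if k in build]   # hash probe
    totals = defaultdict(int)
    for color, q in joined:                                      # reduction
        totals[color] += q
    return dict(totals)

def fused(rows, build):
    """The same three steps interleaved in one loop, with no intermediates."""
    totals = defaultdict(int)
    for k, q in rows:
        color = build.get(k)          # probe while scanning
        if color is not None:
            totals[color] += q        # aggregate right away
    return dict(totals)

print(staged(lineitem, selected_parts) == fused(lineitem, selected_parts))
```

The algorithm is still described as scan, probe, and aggregate; the fused form just refuses to pay for the boundaries between them at run time.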

So, we must unravel the package that encapsulates the wonders of the analytical DBMS. These consist of scan, hash/index lookup, partitioning, aggregation, expression evaluation, scheduling, message passing and related flow control for scale-out systems, just to mention a few. The complete list would be under 30 long, with blocks parameterized by data payload and specific computation.

By putting these together in a few new ways, we will cover much more of the big data pipeline. Just-in-time compilation may well be the way to deliver these components in an application/environment tailored composition. Yes, keep talking about block diagrams, but never once believe that this represents how things work or ought to work. The algorithms are expressed as distinct things, but at the level of the physical manifestation, things are parallel and interleaved.

The core skill for architecting the future of data analytics is correct discernment of abstraction and interface. What is generic enough to be broadly applicable yet concise enough to be usable? When should the computation move, and when should the data move? What are easy ways of talking about data location? How can the application developer be protected from various inevitable stupidities?

No mistake about it, there are at present very few people with the background for formulating the blueprint for the generalized data pipeline. These will be mostly drawn from architects of DBMS. The prospective user is any present-day user of analytics DBMS, Hadoop, or the like. By and large, SQL has worked well within its area of applicability. If there had never been an anti-SQL rebel faction, SQL would not have been successful. Now that a broader workload definition calls for redefinition of interfaces, so as to use the best where it fits, there is a need for re-evaluation of the imperative vs. declarative question.

T. S. Eliot once wrote that humankind cannot bear very much reality. It seems that we in reality can deconstruct the DBMS and redeploy the state of the art to serve novel purposes across a broader set of problems. This is a cross-over that slightly readjusts the mental frame of the DBMS expert but leaves the core precepts intact. In other words, this is a straightforward extension of core competence with no slide into the dilettantism of doing a little bit of everything.

People like MapReduce and stand-alone graph programming frameworks, because these do one specific thing and are readily understood. By and large, these are orders of magnitude simpler than the DBMS. Even when the DBMS provides in-process Java or CLR, these are rarely used. The single-purpose framework is a much narrower core competence, and thus less exclusive, than the high art of the DBMS, plus it has a faster platform development cycle.

In the short term, we will look at opening the SQL internal toolbox for graph analytics applications. I was discussing this idea with Thomas Neumann at Peter Boncz's party. He asked who would be the user. I answered that doing good parallel algorithms, even with powerful shorthands, was an expert task; so the people doing new types of analytics would be mostly on the system vendor side. However, modifying such for input selection and statistics gathering would be no harder than doing the same with ready-made SQL reports.

There is significant possibility for generalization of the leading edge of database. How will this fare against single-model frameworks? We hope to shed some light on this in the final phase of LDBC and beyond.

Last Friday, I attended the inaugural lecture of Professor Peter Boncz at the VU University Amsterdam. As the reader is likely to know, Peter is one of the database luminaries of the 21st century, known among other things for architecting MonetDB and Actian Vector (Vectorwise) and publishing a stellar succession of core database papers.

The lecture touched on the fact of the data economy and the possibilities of E-science. Peter proceeded to address issues of ethics of cyberspace and the fact of legal and regulatory practice trailing far behind the factual dynamics of cyberspace. In conclusion, Peter gave some pointers to his research agenda; for example, use of just-in-time compilation for fusing problem-specific logic with infrastructure software like databases for both performance and architecture adaptivity.

There was later a party in Amsterdam with many of the local database people as well as some from further away, e.g., Thomas Neumann of Munich, and Marcin Zukowski, Vectorwise founder and initial CEO.

I should have had the presence of mind to prepare a speech for Peter. Stefan Manegold of CWI did give a short address at the party, while presenting the gifts from Peter's CWI colleagues. To this I will add my belated part here, as follows:

If I were to describe Prof. Boncz, our friend, co-worker, and mentor, in one word, this would be man of knowledge. If physicists define energy as that which can do work, then knowledge would be that which can do meaningful work. A schematic in itself does nothing. Knowledge is needed to bring this to life. Yet this is more than an outstanding specialist skill, as this implies discerning the right means in the right context and includes the will and ability to go through with this. As Peter now takes on the mantle of professor, the best students will, I am sure, not fail to recognize excellence and be accordingly inspired to strive for the sort of industry changing accomplishments we have come to associate with Peter's career so far. This is what our world needs. A big cheer for Prof. Boncz!

I did talk to many at the party, especially Pham Minh Duc, who is doing schema-aware RDF in MonetDB, and many others among the excellent team at CWI. Stefan Manegold told me about Rethink Big, an FP7 for big data policy recommendations. I was meant to be an advisor and still hope to go to one of their meetings for some networking about policy. On the other hand, the EU agenda and priorities, as discussed with, for example, Stefano Bertolo, are, as far as I am concerned, on the right track: The science of performance must meet with the real, or at least realistic, data. Peter did not fail to mention this same truth in his lecture: Spinoffs play a key part in research, and exposure to the world out there gives research both focus and credibility. As René Char put it in his poem L'Allumette (The Matchstick), "La tête seule à pouvoir de prendre feu au contact d'une réalité dure." ("The head alone has power to catch fire at the touch of hard reality.") Great deeds need great challenges, and there is nothing like reality to exceed man's imagination.

For my part, I was advertising the imminent advances in the Virtuoso RDF and graph functionality. Now that the SQL part, which is anyway the necessary foundation for all this, is really very competent, it is time to deploy these same things in slightly new ways. This will produce graph analytics and structure-aware RDF to match relational performance while keeping schema-last-ness. Anyway, the claim has been made; we will see how it is delivered during the final phase of LDBC and Geoknow.

In Hoc Signo Vinces (part 20 of n): 100G and 1000G With Cluster; When is Cluster Worthwhile; Effects of I/O

In the introduction to the scale-out piece, I promised to address the matter of data-to-memory ratio, and to talk about when scale-out makes sense. Here we will see that scale-out makes sense whenever data does not fit in memory on a single commodity server. The gains in processing power are immediate, even when going from one box to just two, with both systems having all in memory.

As an initial take on the issue we run 100 GB and 1000 GB on the test system. 100 GB is trivially in memory, 1000 GB is not, as the memory is 384 GB total, of which 360 GB may be used for the processes.

We run 2 workloads on the 100 GB database, having pre-loaded the data in memory:

run   power       throughput   composite
1     349,027.7   420,503.1    383,102.1
2     387,890.3   433,066.6    409,856.5

This is directly comparable to the 100 GB single-server results. Comparing the second runs, we see a 1.53x gain in power and a 1.8x gain in throughput from 2x the platform. This is fully on the level for a workload that is not trivially parallel, as we have seen in the previous articles. The difference between the first and second runs at 100 GB comes, for both single-server and cluster, from the latency of allocating transient query memory. For an official run, where the weakest link is the first power test, this would simply have to be pre-allocated.

We run 2 workloads on the 1000 GB database, starting from cold.

The result is:

run   power       throughput   composite
1     136,744.5   147,374.6    141,960.1
2     199,652.0   125,161.1    158,078.0

The 1000 GB result is not for competition with this platform; more memory would be needed. For actual applications, the numbers are still in the usable range, though.
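For readers who want to check the composite column: in TPC-H, the composite score is the geometric mean of the power and throughput scores. The small sketch below recomputes it from the power and throughput figures reported above; the dictionary layout and function name are just for illustration.

```python
from math import sqrt

# (power, throughput, reported composite), copied from the results above.
runs = {
    "100 GB run 1": (349_027.7, 420_503.1, 383_102.1),
    "100 GB run 2": (387_890.3, 433_066.6, 409_856.5),
    "1000 GB run 1": (136_744.5, 147_374.6, 141_960.1),
    "1000 GB run 2": (199_652.0, 125_161.1, 158_078.0),
}

def composite(power: float, throughput: float) -> float:
    """Geometric mean of the power and throughput scores."""
    return sqrt(power * throughput)

for name, (power, throughput, reported) in runs.items():
    print(f"{name}: computed {composite(power, throughput):,.1f}, reported {reported:,.1f}")
```

Because the metric is a geometric mean, a run that is lopsided between power and throughput, like the second 1000 GB run, scores lower than a balanced run with the same arithmetic average.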

The 1000 GB setup uses 4 SSDs for storage, one per server process. The server processes are each bound to their own physical CPU.

We look at the meters: 32M pages (8M per process) are in memory at any one time. Over the 2 benchmark executions there are a total of 494M disk reads. The total CPU time is 165,674 seconds, of which about 10% is system time, over 10,063 seconds of real time. Cumulative disk-read wait time is 130,177 s. This gives an average disk read throughput of 384 MB/s.
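The arithmetic behind these meters can be redone in a few lines. This sketch assumes that the 384 MB/s figure is the total of the per-index I/O volumes (listed in the per-index table further down in this post) divided by the elapsed real time; that reading is an inference from the numbers, which do match. The average number of busy hardware threads is derived the same way from CPU seconds over real seconds.

```python
# Per-index I/O volume in MB over the two executions, from the per-index table.
io_mb_per_index = {
    "LINEITEM": 1_987_483,
    "ORDERS": 1_440_526,
    "PARTSUPP": 199_335,
    "PART": 161_717,
    "CUSTOMER": 43_276,
    "O_CK": 19_085,
    "SUPPLIER": 13_393,
}

real_seconds = 10_063   # elapsed time of the two executions
cpu_seconds = 165_674   # total CPU time, user plus system

total_mb = sum(io_mb_per_index.values())

# Assumption: average read throughput = total MB read / elapsed seconds.
read_mb_per_s = total_mb / real_seconds

# Average number of hardware threads kept busy over the run.
avg_busy_threads = cpu_seconds / real_seconds

print(f"total read: {total_mb:,} MB")
print(f"average read throughput: {read_mb_per_s:.0f} MB/s")
print(f"average busy threads: {avg_busy_threads:.1f}")
```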

This is easily sustained by 4 SSDs; in practice, the maximum throughput we see for reading is 1 GB/s (256 MB/s per SSD). Newer SSDs would do maybe twice that. Using rotating media would not be an option.

Without the drop in CPU caused by waiting for SSD, we would have numbers very close to the 100 GB numbers.

The interconnect traffic for the two runs was 1,077 GB with no message compression. The write block time was 448 seconds of thread-time. So we see that blocking on write hurts platform utilization when running under optimal conditions, but compared to going to secondary storage, it is not a large factor.

The 1000 GB scale has a transient peak memory consumption of 42 GB. This consists of hash-join build sides and GROUP BYs. The greatest memory consumers are Q9 with 9 GB, Q13 with 11 GB, and Q16 with 7 GB. Having many of these at a time drives up the transient peak. The peak gets higher as the scale grows, also because a larger scale requires more concurrent query streams. At the 384 GB for 1000 GB ratio, we do not yet get into memory saving plans like hash joins in many passes or index use instead of hash. When the data size grows, replicated hash build sides will become less convenient, and communication will increase. Q9 and Q13 can be done by index with almost no transient memory, but these plans are easily 3x less efficient for CPU. These will probably help at 3000 GB and be necessary at least part of the time at 10,000 GB.

The I/O volume in MB per index over the 2 executions is:

index       MB
LINEITEM    1,987,483
ORDERS      1,440,526
PARTSUPP      199,335
PART          161,717
CUSTOMER       43,276
O_CK           19,085
SUPPLIER       13,393

Of this, maybe 600 GB could be saved by stream compressing o_comment. Otherwise this cannot be helped without adding memory. The lineitem reads are mostly for l_extendedprice, which is not compressible. If compressing o_comment made l_extendedprice always fit in memory, then there would be a radical drop in I/O. Also, as a matter of fact, the buffer management policy of least-recently-used works the very worst for big scans, specifically those of l_extendedprice: If the head is replaced when reading the tail, and the next read starts from the head, then the whole table/column is read all over again. Caching policies that specially recognized scans of this sort could further reduce I/O. Clustering lineitems/orders on date, as Actian Vector TPC-H implementations do, also starts yielding a greater gain when not running from memory: One column (e.g., l_shipdate) may be scanned for the whole table but, if the matches are bunched together, then most of l_extendedprice will not be read at all. Still, if going for top ranks in the races, all will be from memory, or at least there will be SSDs with read throughput around 150 MB/s per core, so these tricks become relatively less important.
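The LRU problem with repeated big scans is easy to reproduce in a toy simulation. The sketch below counts buffer hits when a column slightly larger than the buffer pool is scanned head to tail several times. The page counts are made up, and the most-recently-used (MRU) alternative is only a simple stand-in for a scan-aware policy; it is not a description of what Virtuoso or any other engine does.

```python
from collections import OrderedDict

def scan_hits(policy: str, n_pages: int, capacity: int, passes: int) -> int:
    """Count buffer hits for repeated head-to-tail scans of n_pages pages.

    policy is "lru" (evict least recently used) or "mru" (evict most recently used).
    """
    cache: "OrderedDict[int, None]" = OrderedDict()  # ordered oldest to newest use
    hits = 0
    for _ in range(passes):
        for page in range(n_pages):
            if page in cache:
                hits += 1
                cache.move_to_end(page)   # mark as most recently used
                continue
            if len(cache) >= capacity:
                # LRU drops the oldest entry, MRU the newest.
                cache.popitem(last=(policy == "mru"))
            cache[page] = None
    return hits

# Hypothetical sizes: a 100-page column scanned 3 times through an 80-page buffer.
lru_hits = scan_hits("lru", n_pages=100, capacity=80, passes=3)
mru_hits = scan_hits("mru", n_pages=100, capacity=80, passes=3)
print(f"LRU hits: {lru_hits}, MRU hits: {mru_hits}")
```

With LRU, each page is evicted shortly before the scan comes back to it, so the buffer contributes nothing even though it holds most of the column. A policy that keeps the head of the scan resident retains most of the benefit of the memory that is there.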

In the 100 GB numerical quantities summaries, we see much the same picture as in the single-server. Queries get faster, but their relative times are not radically different. The throughput test (many queries at a time) times are more or less multiples of the power (single user) times. This picture breaks at 1000 GB where I/O first drops the performance to under half and introduces huge variation in execution times within a single query. The time entirely depends on which queries are running along with or right before the execution and on whether these have the same or different working sets. All the streams have the same queries with different parameters, but the query order in each stream is different.

The numerical quantities follow for all the runs. Note that the first 1000 GB run is cold. A competition grade 1000 GB result can be made with double the memory, and the more CPU the better. We will try one at Amazon in a bit.

***

The conclusion is that scale-out pays from the get-go. At present prices, a system with twice the power of a single node of the test system is cost effective. Scales of up to 500 GB fit on a single commodity server, under $10K. Rather than going from a mid-to-large dual-socket box to a quad-socket box, one is likely to be better off having two cheaper dual-socket boxes. These are also readily available on clouds, whereas scale-up configurations are not. Onwards of 1 TB, a cluster is expected to clearly win. At 3 TB, a commodity cluster will clearly be the better deal for both price and absolute performance.

Scalability, specifically linear scalability, means that twice the data takes twice as long to process, or that double the gear processes the same data in half the time. This is only literally true for "embarrassingly parallel" workloads.

There are parts of TPC-H which have an embarrassingly parallel nature, like Q1 and Q6. There are parts that are almost as easy, like Q14, Q17, Q19, and Q21, where there is a big scan and a selective hash join with a hash table small enough to replicate everywhere. The scan scales linearly; building the hash does not, since it is done at single-server speed (once in each process). Some queries like Q9 and Q13 end up doing a big cross-partition join which runs into communication overheads.
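The "almost as easy" case can be put into a toy model: the scan divides across processes, while the replicated hash build is paid in full by every process. The timings below are hypothetical and the function name is ours; this is a sketch of the reasoning, not a measurement.

```python
def replicated_hash_join_time(scan_s, build_s, n_processes):
    """Elapsed time when the scan is split n ways but each process builds the full hash table."""
    return build_s + scan_s / n_processes


# Hypothetical single-process costs: 100 s of scan, 5 s of hash build.
single = replicated_hash_join_time(100.0, 5.0, 1)
eight = replicated_hash_join_time(100.0, 5.0, 8)
speedup = single / eight
print(single, eight, speedup)
```

With these numbers, 8 processes give a 6x speedup rather than 8x, and the gap widens as the build grows relative to the scan.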

This is our first look at how performance behaves with bigger data and a larger platform. The results shown here are interesting but are not final. I bet I can do better; by how much is what we'll find out soon enough.

We will here compare a 1000G setup on my desktop, and a 3000G setup at the CWI's Scilens cluster. The former is 2 boxes of dual Xeon E5 2630, and the latter is 8 boxes of dual Xeon E5 2650v2. All things run from memory and both have QDR IB interconnect. Counting cores and clock, the CWI cluster is 6x larger.

As a rough approximation, for the worst queries, 6x the gear runs 3x the data in the same amount of real time. The 1000G setup has near full platform utilization and the 3000G setup has about half platform utilization. In both cases, running two instances of the same query at the same time takes twice as long.

We use Q9 for this study. The plan makes a hash table of part with 1/14 of all parts, replicating to all processes. Then there is a hash table of partsupp with a key of ps_partkey, ps_suppkey, and a dependent of ps_supplycost. This is much larger than the part hash table and is therefore partitioned on ps_partkey. The build is for 1/14th of partsupp. Then there is a scan of lineitem filtered by the part hash table; then a cross-partition join to the partsupp hash table; then a cross partition join to orders, this time by index; then a hash join on a replicated hash table of supplier; then nation; then aggregation. The aggregation is done in each slice; then the slices are added up at the end.

The plan could be made better by one fewer partition crossing. Now there is a crossing from l_orderkey to l_partkey and back to o_orderkey. This would not be so if the cost model knew that the partsupp always hits. The cost model thinks it hits 1/14 of the time, because it does not know that the selection on the build is exactly the same as on the probe.

For the present purposes, the extra crossing just serves to make the matter of interest more visible.

The platform utilization on the small system is better, at 31/48 (running/total threads); the large one has 73/256.

The large case is clearly network bound. If this were for CPU only, it should be done in half the time it takes the small system to do 1000G.

We confirm this by looking at write wait: 3940 seconds of thread time blocked on write over 50s of real time. The figures on the small one are 3.9s of thread time blocked for 39s of real time. The data transfer on the large one is 93 GB.

How to block less? One idea would be to write less. So we try compression; there is a Google snappy-based message compression option in Virtuoso.

The write block time is 397 s of thread time over 39 s of real time, 10x better. The data transfer is 50.9 GB after compression. Snappy is somewhat effective for compression and very fast; in CPU profile, it is under 3% of Q9 on the small system. Gains on the small system are less, though, since blocking is not a big issue to start with.

This is still not full platform. But if the data transfer is further cut in half by a better plan, the situation will be quite good. Now we have 102/256 threads running, meaning that there could be another 40-50% of throughput to be added. The last 128 threads are second threads of a core, so count for roughly 30% of a real core.

The main cluster-specific operation is a send from one to many. This is now done by formulating the message to each recipient in a chain of string buffers; then, after all the messages are prepared, these are optionally compressed and sent to their recipients. This is needlessly simple: Compressing can proceed whenever there is a would-block situation on writing. If all the compression is done, then a blocked write should switch to another recipient, and only after all recipients have a would-block situation should the thread call select with all descriptors and block on them collectively. There is a piece of code to this effect, but it is not currently being used. It has been seen to add no value in small cases, but could be useful here.
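The smarter one-to-many send outlined above can be sketched with non-blocking sockets. This is not Virtuoso's code: `send_to_many` and `roundtrip` are hypothetical names, `zlib` stands in for Snappy because it ships with Python, and the sketch prepares one message between rounds of send attempts rather than compressing strictly on would-block.

```python
import select
import socket
import threading
import zlib


def send_to_many(socks, payloads, compress=True):
    """Send payloads[i] on socks[i] without stalling on one slow recipient."""
    raw = dict(enumerate(payloads))   # messages not yet prepared
    pending = {}                      # recipient index -> bytes still to send
    for s in socks:
        s.setblocking(False)

    def prepare_one():
        # Compress (or just wrap) one unprepared message; False if none are left.
        if not raw:
            return False
        i, data = raw.popitem()
        pending[i] = memoryview(zlib.compress(data) if compress else data)
        return True

    prepare_one()
    while raw or pending:
        progressed = False
        for i in list(pending):
            try:
                sent = socks[i].send(pending[i])
            except BlockingIOError:
                continue              # would block: move on to the next recipient
            progressed = True
            pending[i] = pending[i][sent:]
            if len(pending[i]) == 0:
                del pending[i]
        if prepare_one():
            continue                  # prepared another message; retry the sends
        if pending and not progressed:
            # Every recipient would block: wait on all descriptors together.
            select.select([], [socks[i] for i in pending], [])


def roundtrip(payloads, compress):
    """Send payloads over local socket pairs and return what each reader got."""
    pairs = [socket.socketpair() for _ in payloads]
    received = [b""] * len(payloads)

    def drain(i):
        chunks = []
        while True:
            chunk = pairs[i][1].recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        received[i] = b"".join(chunks)

    readers = [threading.Thread(target=drain, args=(i,)) for i in range(len(pairs))]
    for t in readers:
        t.start()
    send_to_many([a for a, _ in pairs], payloads, compress)
    for a, _ in pairs:
        a.close()                     # signals end of stream to the reader
    for t in readers:
        t.join()
    for _, b in pairs:
        b.close()
    return [zlib.decompress(r) if compress else r for r in received]
```

The point of the design is that a full socket buffer on one recipient never idles the sender while other recipients can still accept data; the collective select is the last resort.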

The IB fabric has been seen to do 1.8 GB/s bidirectionally on multiple independent point-to-point TCP links. This is about half the nominal 4 GB/s (40 Gbit/s signaling with 8b/10b encoding). So the aggregate throughputs that we see here are nowhere near the nominal spec of the network. Lower level interfaces and the occasional busy wait on the reading end could be tried to some advantage. We have not tried 10GbE either; but if that works at nominal speed, then 10GbE should also be good enough. We will try this at Amazon in due time.

In the meantime, there is a 3000G test made at the CWI cluster without message compression. The score is about 4x that of the single server at 300G using the same hardware. The run is with approximately half platform utilization. There are three runs of power plus throughput, the first run being cold.

Run      Power          Throughput     Composite
Run 1      305,881.5    1,072,411.9      572,739.8
Run 2    1,292,085.1    1,179,391.6    1,234,453.1
Run 3    1,178,534.1    1,092,936.2    1,134,928.4
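The composite column follows the usual TPC-H convention of taking the geometric mean of the power and throughput scores. A quick check of the table, with the variable names being ours:

```python
from math import isclose, sqrt

# (power, throughput, reported composite) for each run in the table above.
runs = {
    "Run 1": (305_881.5, 1_072_411.9, 572_739.8),
    "Run 2": (1_292_085.1, 1_179_391.6, 1_234_453.1),
    "Run 3": (1_178_534.1, 1_092_936.2, 1_134_928.4),
}


def composite(power, throughput):
    """Geometric mean of the power and throughput scores."""
    return sqrt(power * throughput)


for name, (power, throughput, reported) in runs.items():
    print(name, round(composite(power, throughput), 1), reported)
```

This also shows why the cold Run 1 composite is so low: the geometric mean is pulled down by the weak power score even though the throughput score is close to that of the warm runs.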

The numerical quantities summaries follow. One problem of the run is a high peak of query memory consumption leading to slowdown. Some parts should probably be done in multiple passes to keep the peak lower and not run into swapping. The details will have to be sorted out. This is a demonstration of capability; the perfected accomplishment is to follow.

This article is about how scale-out differs from single-server. This shows large effects of parameters whose very existence most would not anticipate, and some low level metrics for assessing these. The moral of the story is that this is the stuff which makes the difference between merely surviving scale-out and winning with it. The developer and DBA would not normally know about this; thus these things fall into the category of adaptive self-configuration expected from the DBMS. But since this series is about what makes performance, I will discuss the dynamics such as they are and how to play these.

We take the prototypical cross partition join in Q13: Make a hash table of all customers, partitioned by c_custkey. This is independently done with full parallelism in each partition. Scan the orders, get the customer (in a different partition), and flag the customers that had at least one order. Then, to get the customers with no orders, return the customers that were not flagged in the previous pass.
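The shape of this join can be sketched in a few lines. This is a single-process illustration of the data flow only; the modulo partitioning function and all names are assumptions, not Virtuoso's implementation.

```python
from collections import defaultdict


def customers_without_orders(custkeys, order_custkeys, n_partitions):
    """Flag customers that have orders, then return the unflagged ones."""
    def partition_of(key):
        return key % n_partitions  # assumed partitioning function

    # Build: each partition makes a hash table of its own customers.
    flags = [dict() for _ in range(n_partitions)]
    for c in custkeys:
        flags[partition_of(c)][c] = False

    # Probe: the orders scan batches o_custkey values per destination partition.
    outbox = defaultdict(list)
    for ck in order_custkeys:
        outbox[partition_of(ck)].append(ck)

    # Each recipient flags the customers named in the batch it received.
    for p, batch in outbox.items():
        for ck in batch:
            if ck in flags[p]:
                flags[p][ck] = True

    # Second pass: customers never flagged have no orders.
    return sorted(c for table in flags for c, seen in table.items() if not seen)


no_orders = customers_without_orders(range(1, 11), [1, 2, 2, 5, 9], n_partitions=4)
print(no_orders)
```

On a real cluster the `outbox` lists are the messages that cross process and machine boundaries, which is where the batching and blocking effects discussed below come from.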

The single-server times in part 12 were 7.8 s and 6.0 s with a single user. We consider the better of the times. The difference is due to allocating memory on the first go; on the second go the memory is already in reserve.

With default settings, we get 4595 ms (milliseconds), with per node resource utilization at:

The top line is the summary; the lines below are per-process. The m/s is messages-per-second; KB/s is interconnect traffic per second; clw % is idle time spent waiting for a reply from another process. The cluster is set up with 4 processes across 2 machines, each with 2 NUMA nodes. Each process has affinity to the NUMA node, so local memory only. The time is reasonable in light of the overall CPU of 2700%. The maximum would be 4800% with all threads of all cores busy all the time.

The catch here is that we do not have a steady half-platform utilization all the time, but full platform peaks followed by synchronization barriers with very low utilization. So, we set the batch size differently:

cl_exec ('__dbf_set (''cl_dfg_batch_bytes'', 50000000)');

This means that we set, on each process, the cl_dfg_batch_bytes to 50M from a default of 10M. The effect is that each scan of orders, one thread per slice, 48 slices total, will produce 50MB worth of o_custkeys to be sent to the other partition for getting the customer. After each 50M, the thread stops and will produce the next batch when all are done and a global continue message is sent by the coordinator.

The platform utilization is better as we see. The throughput is nearly double that of the single-server, which is pretty good for a communication-heavy query.

This was done with a vector size of 10K. In other words, each partition gets 10K o_custkeys and splits these 48 ways to go to every recipient. 1/4 are in the same process, 1/4 in a different process on the same machine, and 2/4 on a different machine. The recipient gets messages with an average of 208 o_custkey values, puts them back together in batches of 10K, and passes these to the hash join with customer.
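The fractions and the average message size follow directly from the topology. A small check, assuming an even spread of keys over slices; the layout encoding is ours:

```python
from fractions import Fraction

# 2 machines x 2 processes each, identified as (machine, process).
processes = [(m, p) for m in range(2) for p in range(2)]
N_SLICES = 48


def locality(sender):
    """Fractions of traffic staying in-process, on-machine, and going off-machine."""
    n = len(processes)
    same_process = Fraction(sum(d == sender for d in processes), n)
    same_machine = Fraction(
        sum(d != sender and d[0] == sender[0] for d in processes), n
    )
    return same_process, same_machine, 1 - same_process - same_machine


def avg_values_per_message(vector_size, n_slices=N_SLICES):
    """Average number of o_custkey values per recipient message."""
    return vector_size / n_slices


print(locality(processes[0]))
for size in (10_000, 100_000, 1_000_000):
    print(size, round(avg_values_per_message(size)))
```

Note that of the traffic that actually leaves the sending process, 1/3 stays on the same machine and 2/3 crosses the network.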

We try different vector sizes, such as 100K:

cl_exec ('__dbf_set (''dc_batch_sz'', 100000)');

There are two metrics of interest here: The write block time, and the scheduling overhead. The write block time is a count of microseconds, which increases whenever a thread must wait before it can write to a connection. The scheduling overhead is the cumulative clocks spent by threads while waiting for a critical section that deals with dispatching messages to consumer threads. Long messages cause blocking; short messages cause frequent scheduling decisions.

cl_sys_stat gets the counters from all processes and returns the sum. clr=>1 means that the counter is cleared after read.

We do Q13 with vector sizes of 10, 100, and 1000K.

Vector size    msec    mtx (clocks)      wblock (microseconds)
10K            3297    10,829,910,329            0
100K           3150     1,663,238,367       59,132
1000K          3876       414,631,129    4,578,003

So, 100K seems to strike the best balance between scheduling and blocking on write.

The times are measured after several samples with each setting. The times stabilize after a few runs, as the appropriate size memory blocks are in reserve. Calling mmap to allocate these on the first run with each size has a very high penalty, e.g., 60s for the first run with 1M vector size. We note that blocking on write is really bad even though 1/3 of the time there is no network and 2/3 of the time there is a fast network (QDR IB) with no other load. Further, the affinities are set so that the thread responsible for incoming messages is always on core. Result variability on consecutive runs is under 5%, which is similar to single-server behavior.

It would seem that a mutex, as bad as it is, is still better than a distributed cause for going off core (blocking on write). The latency for continuing a thread thus blocked is of course higher than the latency for continuing one that is waiting for a mutex.

We note that a cluster with more machines can take a longer vector size because a vector spreads out to more recipients. The key seems to be to set the message size so that blocking on write is not common. This is a possible adaptive execution feature. We have seen no particular benefit from SDP (Sockets Direct Protocol) and its zero copy. This is a TCP replacement that comes with the InfiniBand drivers.

We will next look at replication/partitioning tradeoffs for hash joins. Then we can look at full runs.

This is an update presenting sample results on a newer platform for a single-server configuration. This is to verify that performance scales with the addition of cores and clock speed. Further, we note that the jump from 100G to 300G changes very little about the score. 3x larger takes approximately 3x longer, as long as things are in memory.

For the 100G, we go from 240 to 395, which is about 1.64x. The new platform has 16 vs 12 cores and a clock of 2.6 as opposed to 2.3. This makes a multiplier of 1.5. The rest of the acceleration is probably attributable to faster memory clock. Anyway, the point of more speed from larger platform is made.
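The arithmetic behind these ratios, using only the figures quoted above:

```python
# Platform figures as quoted in the text.
old_cores, new_cores = 12, 16
old_clock_ghz, new_clock_ghz = 2.3, 2.6
old_score, new_score = 240, 395

expected = (new_cores / old_cores) * (new_clock_ghz / old_clock_ghz)
observed = new_score / old_score
unexplained = observed / expected  # the part attributed to faster memory

print(round(expected, 2), round(observed, 3), round(unexplained, 2))
```

Cores and clock account for roughly 1.5x, leaving about 9% that has to come from elsewhere, such as the faster memory clock.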

The top level scores per run are as follows; the numerical quantities summaries are appended.

For the 300G runs, we note a much longer load time; see below, as this is seriously IO bound.

The first power test at 300G is a non-starter, even though this comes right after bulk load. Still, the data is not in working set and getting it from disk is simply an automatic disqualification, unless maybe one had 300 separate disks. This happens in TPC benchmarks, but not very often in the field. Looking at the first power run, the first queries take the longest, but by the time the throughput run starts, the working set is there. By an artifact of the metric (use of geometric mean for the power test), long queries are penalized less there than in the throughput run.
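The effect of the geometric mean can be seen with made-up timings: one query out of 22 runs 10x slower because it starts cold. The timings are hypothetical, and the real power metric also includes the refresh functions, which this sketch leaves out.

```python
from math import prod


def geometric_mean(xs):
    return prod(xs) ** (1.0 / len(xs))


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


# Hypothetical timings in seconds for the 22 queries.
warm = [1.0] * 22
cold_start = [10.0] + [1.0] * 21

geo_penalty = geometric_mean(cold_start) / geometric_mean(warm)
arith_penalty = arithmetic_mean(cold_start) / arithmetic_mean(warm)
print(round(geo_penalty, 3), round(arith_penalty, 3))
```

The one slow query costs about 11% under the geometric mean but about 41% under a total-elapsed-time measure, which is closer to how the throughput test counts.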

So, we run 3 executions instead of the prescribed 2, to have 2 executions from warm state.

To do 300G well in 256 GB of RAM, one needs either to use several SSDs, or to increase compression and keep all in memory, so no secondary storage at all. In order to keep all in memory, one could have stream-compression on string columns. Stream-compressing strings (e.g., o_comment, l_comment) does not pay if one is already in memory, but if stream-compressing strings eliminates going to secondary storage, then the win is sure.

As before, all caveats apply; the results are unaudited and for information only. Therefore we do not use the official metric name.

So far, we have analyzed TPC-H in a single-server, memory-only setting. We will now move to larger data and cluster implementations. In principle, TPC-H parallelizes well, so we should expect near-linear scalability; i.e., twice the gear runs twice as fast, or close enough.

In practice, things are not quite so simple. Larger data, particularly a different data-to-memory ratio, and the fact of having no shared memory, all play a role. There is also a network, so partitioned operations, which also existed in the single-server case, now have to send messages across machines, not across threads. For data loading and refreshes, there is generally no shared file system, so data distribution and parallelism have to be considered.

As an initial pass, we look at 100G and 1000G scales on the same test system as before. This is two machines, each with dual Xeon E5-2630, 192 GB RAM, 2 x 512 GB SSD, and QDR InfiniBand. We will also try other platforms, but if nothing else is said, this is the test system.

As of this writing, there is a working implementation, but it is not guaranteed to be optimal as yet. We will adjust it as we go through the workload. One outcome of the experiment will be a precise determination of the data-volume-to-RAM ratio that still gives good performance.

A priori, we know of the following things that complicate life with clusters:

Distributed memory — The working set must be in memory for a run to have a competitive score. A cluster can have a lot of memory, and the data is such that it partitions very evenly, so this appears at first not a problem. The difficulty comes with query memory: If each machine has 1/16th of the total RAM and a hash table would be 1/64th of the working set, on a single-server it is no problem just building the hash table. On a scale-out system, the hash table would be 1/4 of the working set if replicated on each node, which will not fit, especially if there are many such hash tables at the same time. Two main approaches exist: The hash table can be partitioned, but this will force the probe to go cross-partition, which takes time. The other possibility is to build the hash table many times, each time with a fraction of the data, and to run the probe side many times. Since hash tables often have Bloom filters, it is sometimes possible to replicate the Bloom filter and partition the hash table. One has also heard of hash tables that go to secondary storage, but should this happen, the race is already lost; so, we do not go there.

We must evaluate different combinations of these techniques and have a cost model that accurately predicts the performance of each variant. Adding to realism is always safe but halfway difficult to do.
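The memory arithmetic in the distributed memory example works out as follows. Fractions are of the total working set, and we assume each of the 16 machines holds an equal share; the variable names are ours.

```python
from fractions import Fraction

n_machines = 16
node_share = Fraction(1, n_machines)   # each node holds 1/16 of the working set
hash_table = Fraction(1, 64)           # hash table size relative to the working set

# Replicated: every node holds the whole hash table next to its share of the data.
replicated_vs_node = hash_table / node_share

# Partitioned: every node holds only its slice of the hash table.
partitioned_vs_node = (hash_table / n_machines) / node_share

print(replicated_vs_node, partitioned_vs_node)
```

So a replicated hash table adds a quarter of each node's data footprint, while a partitioned one adds 1/64th, at the price of a cross-partition probe.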

NUMA — Most servers are NUMA (non-uniform memory architecture), where each CPU socket has its own local memory. For single-server cases, we use all the memory for the process. Some implementations have special logic for memory affinity between threads. With scale-out there is the choice of having a server process per-NUMA-node or per-physical-machine. If per-NUMA-node, we are guaranteed only local memory accesses. This is a tradeoff to be evaluated.

Network and Scheduling — Execution on a cluster is always vectored, for the simple reason that sending single-tuple messages is unfeasible in terms of performance. With an otherwise vectored architecture, the message batching required on a cluster comes naturally. However, the larger the cluster, the more partitions there are, which rapidly gets into shorter messages. Increasing the vector size is possible and messages become longer, but indefinite increase in vector size has drawbacks for cache locality and takes memory. To run well, each thread must stay on core. There are two ways of being taken off core ahead of time: Blocking for a mutex, and blocking for network. Lots of short messages run into scheduling overhead, since the recipient must decide what to do with each, which is not really possible without some sort of critical section. This is more efficient if messages are longer, as the decision time does not depend on message length. Longer messages are however liable to block on write at the sender side. So one pays in either case. This is another tradeoff to be balanced.

Flow control — A query is a pipeline of producers and consumers. Sometimes the consumer is in a different partition. The producer must not get indefinitely ahead of the consumer because this would run out of memory, but it must stay sufficiently ahead so as not to stop the consumer. In practice, there are synchronization barriers to check even progress. These will decrease platform utilization, because two threads never finish at exactly the same time. The price of not having these is having no cap on transient memory consumption.

Un-homogenous performance — Identical machines do not always perform identically. This is seen especially with disk, where wear on SSDs can affect write speed, and where uncontrollable hazards of data placement will get uneven read speeds on rotating media. Purely memory-bound performance is quite close, though. Un-anticipatable and uncontrollable hazards of scheduling cause different times of arrival of network messages, which introduces variation in run time on consecutive runs. Single-servers have some such variation from threading, but the effects are larger with a network.
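The flow control constraint described above, a producer that stays ahead of the consumer but only by a bounded amount, can be illustrated with a bounded queue. This is a simpler stand-in for the synchronization barriers mentioned in the text, not the mechanism Virtuoso uses; the names are ours.

```python
import queue
import threading


def run_pipeline(n_batches, max_ahead):
    """Producer/consumer pair where at most max_ahead batches are in flight."""
    q = queue.Queue(maxsize=max_ahead)
    consumed = []
    peak = [0]  # highest queue depth observed by the producer

    def producer():
        for batch in range(n_batches):
            q.put(batch)  # blocks while max_ahead batches are already queued
            peak[0] = max(peak[0], q.qsize())
        q.put(None)  # end-of-stream marker

    def consumer():
        while True:
            batch = q.get()
            if batch is None:
                break
            consumed.append(batch)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, peak[0]


consumed, peak = run_pipeline(n_batches=1000, max_ahead=8)
print(len(consumed), peak)
```

The `maxsize` bound is the cap on transient memory; every time the producer blocks on a full queue is a moment of lost platform utilization, which is the tradeoff the text describes.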

The logical side of query optimization stays the same. Pushing down predicates is always good, and all the logical tricks with moving conditions between subqueries stay the same.

Schema design stays much the same, but there is the extra question of partitioning keys. In this implementation, there are only indices on identifiers, not on dates, for example. So, for a primary key to foreign key join, if there is an index on the foreign key, the index should be partitioned the same way as the primary key. So, joining from orders to lineitem on orderkey will be co-located. Joining from customer to orders by index will be co-located for the c_custkey = o_custkey part (assuming an index on o_custkey) and cross-partition for getting the orders row on o_orderkey, supposing that the query needs some property of the order other than o_custkey or o_orderkey.

A secondary question is the partition granularity. For good compression, nearby values should be consecutive, so here we leave the low 12 bits out of the partitioning. This has effect on bulk load and refreshes, for example, so that a batch of 10,000 lineitems, ordered on l_orderkey will go to only 2 or 3 distinct destinations, thus getting longer messages and longer insert batches, which is more efficient.
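The effect of leaving out the low bits can be shown with a toy partitioning function. The modulo over 48 slices, the dense key range, and 4 lineitems per order are assumptions for illustration; only the 12 skipped bits and the batch of 10,000 come from the text.

```python
def destination(orderkey, n_partitions, skip_bits=12):
    """Partition on the key with its low bits dropped, so nearby keys stay together."""
    return (orderkey >> skip_bits) % n_partitions


# Hypothetical batch: 2,500 consecutive orders with 4 lineitems each.
batch = [key for key in range(1_002_000, 1_004_500) for _ in range(4)]

with_skip = {destination(key, 48) for key in batch}
without_skip = {destination(key, 48, skip_bits=0) for key in batch}
print(len(batch), len(with_skip), len(without_skip))
```

With the low 12 bits ignored, the batch lands on 2 destinations; partitioning on the full key would scatter it over all 48, giving many short messages instead of a few long ones.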

This is a quick overview of the wisdom so far. In subsequent installments, we will take a quantitative look at the tradeoffs and consider actual queries. As a conclusion, we will show a full run on a couple of different platforms, and likely provide Amazon machine images for the interested to see for themselves. Virtuoso Cluster is not open source, but the cloud will provide easy access.