A few blog posts back, I wrote about the comeback of SQL (i.e., relational) federation. True, Virtuoso has always had this, but in the general perception it is known primarily for RDF and Linked Data. Now these two areas of functionality are combined in a way that makes sense and delivers real performance, often surpassing the relational installed base while running SPARQL with no predefined schema.

So, we take our standard TPC-H 100 GB dataset and turn it into RDF. We adjust the schema a little, so that each customer, its orders, and the lineitems of those orders go into a per-customer graph. The part, partsupp, and other tables all go into a public graph. The per-customer graph can serve as a security label; for example, when customers have self-service access to the warehouse, or when access is compartmentalized by area of responsibility (e.g., customer countries or market segments).
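The intended layout can be sketched in TriG. The graph IRIs, prefixes, and property names below are purely illustrative assumptions, not the actual mapping the server generates:

```trig
# Hypothetical IRIs and properties, for illustration only.
@prefix tpch: <http://example.org/tpch/> .

# One graph per customer: the customer, its orders, their lineitems.
tpch:graph-customer-12345 {
    tpch:customer-12345   a tpch:Customer ; tpch:c_name "Customer#000012345" .
    tpch:order-67890      a tpch:Order    ; tpch:o_has_customer tpch:customer-12345 .
    tpch:lineitem-67890-1 a tpch:Lineitem ; tpch:l_has_order    tpch:order-67890 .
}

# Shared reference data (part, partsupp, supplier, etc.) in one public graph.
tpch:graph-public {
    tpch:part-555 a tpch:Part .
}
```

With this split, granting a user access to one customer graph plus the public graph gives exactly the self-service view described above.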

We use two Virtuoso processes. The first contains the TPC-H 100G dataset, the same one discussed in the TPC-H bulk load article. The second process attaches the tables from the first via SQL federation and constructs an RDF translation into its own RDF store. The mapping is made with an RDF view, also known as a Linked Data View. The initial RDF view can be generated from the relational schema, then edited to select the desired properties. If the mapping calls for modeling or unit changes, these are most easily done with SQL views, in which case the RDF mapping is defined on top of the views rather than the actual tables. The SQL views reside on the same server that has the RDF views, so no write access to the source database is needed.
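The attach-then-view step might look roughly as follows in Virtuoso SQL; this is a sketch, with the DSN name, credentials, and the unit change in the view all hypothetical:

```sql
-- Attach remote TPC-H tables over SQL federation (Virtuoso VDB).
-- 'tpch100' is an assumed ODBC DSN pointing at the source server.
ATTACH TABLE CUSTOMER FROM 'tpch100' USER 'dba' PASSWORD 'dba';
ATTACH TABLE ORDERS   FROM 'tpch100' USER 'dba' PASSWORD 'dba';
ATTACH TABLE LINEITEM FROM 'tpch100' USER 'dba' PASSWORD 'dba';

-- Modeling or unit changes go into a local SQL view over the attached
-- table; the RDF view is then defined on top of this view, so the
-- source database is never written to.
CREATE VIEW LINEITEM_V AS
  SELECT l_orderkey, l_partkey, l_suppkey,
         l_extendedprice / 100.0 AS l_extendedprice_usd  -- hypothetical unit change
  FROM LINEITEM;
```

The RDF view itself is best generated from this schema by the server's mapping generator and then edited, as described above.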

The server configuration is found in virtuoso.ini. This configuration is for 4 disks and 192 GB RAM; if you try this, make sure you have at least that much, or use a correspondingly scaled-down dataset.
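The relevant parts of such a configuration are the buffer pool sizing and the striping across the disks. The fragment below is illustrative only; the exact values and file paths are assumptions and should follow the Virtuoso sizing guidelines for your machine:

```ini
; Illustrative virtuoso.ini fragment for 192 GB RAM and 4 disks.
[Database]
Striping        = 1

[Parameters]
NumberOfBuffers = 16000000    ; ~122 GB of 8K database buffers
MaxDirtyBuffers = 12000000

[Striping]
; One segment striped across the 4 disks (hypothetical paths).
Segment1 = 100G, /disk1/virt/seg1.db, /disk2/virt/seg2.db, /disk3/virt/seg3.db, /disk4/virt/seg4.db
```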

The load takes 11.8 hours end-to-end, at just under 280 Kt/s (kilotriples per second). Worse has been heard of. This is single-server speed on a small machine, the usual test system with dual Xeon E5-2630 and 192 GB RAM. A single server at double the price might deliver double the throughput; beyond that, scale-out is clearly the better deal. An elastic cluster gets throughput linear in the number of machines for this type of workload.
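As a sanity check, the stated duration and rate imply the total triple count of the translated dataset:

```python
# Back-of-the-envelope: 11.8 hours end-to-end at just under 280 Kt/s.
hours = 11.8
rate_triples_per_sec = 280_000  # "just under 280 Kt/s"

total_triples = hours * 3600 * rate_triples_per_sec
print(f"{total_triples / 1e9:.1f} billion triples")  # → 11.9 billion triples
```

So the RDF translation of the 100 GB dataset comes to roughly 12 billion triples.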

This shows that deploying mid-size enterprise data as RDF is a job that goes easily overnight with a commodity box, reading directly from the source system; no file-system-based staging areas are needed.

The dataset is 600M order lines; 150M orders; 15M customers; 20M parts, each with 4 suppliers; 1M total suppliers. You can compare this with what you have in-house to get a rough estimate of what your own DW would come to.
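For such an estimate, the row counts above plus the load figures (11.8 hours at just under 280 Kt/s) give an average triples-per-row ratio; this is a rough back-of-the-envelope, assuming the rate held steady:

```python
# Approximate source row counts from the text, in millions.
rows_m = {
    "lineitem": 600,
    "orders":   150,
    "customer":  15,
    "part":      20,
    "partsupp":  20 * 4,  # 4 suppliers per part
    "supplier":   1,
}
total_rows = sum(rows_m.values()) * 1_000_000

# Total triples implied by the load: 11.8 h at just under 280 Kt/s.
triples = 11.8 * 3600 * 280_000

print(total_rows)                          # → 866000000
print(round(triples / total_rows, 1))      # → 13.7 triples per row on average
```

Multiplying your own row counts by a ratio in this ballpark gives a first guess at the triple count of your warehouse as RDF.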

Later, we will use this dataset to illustrate how to scope queries to security categories with graph-level security. Of course, this dataset also provides a point of SQL-to-SPARQL comparison for the ongoing TPC-H series. More installments will follow before long.