I’ve recently been implementing an HDT parser in Swift and had some thoughts on the process and on the HDT format more generally. Briefly, I think having a standardized binary format for RDF triples (and quads) is important and HDT satisfies this need. However, I found the HDT documentation and tooling to be lacking in many ways, and think there’s lots of room for improvement.

Benefits

HDT’s single binary file format has benefits for network and disk IO when loading and transferring graphs. That’s its main selling point, and it does a reasonably good job at that. HDT’s use of an RDF term dictionary with pre-assigned numeric IDs means importing into some native triple stores can be optimized. And being able to store RDF metadata about the RDF graph inside the HDT file is a nice feature, though one that requires publishers to make use of it.

Problems and Challenges

I ran into a number of outright problems when trying to implement HDT from scratch:

DictionarySection says the “section starts with an unsigned 32bit value preamble denoting the type of dictionary implementation,” but the implementation actually uses an unsigned 8-bit value for this purpose

FourSectionDictionary conflicts with the previous section on the format URI (http://purl.org/HDT/hdt#dictionaryPlain vs. http://purl.org/HDT/hdt#dictionaryFour)

The paper cited for “VByte” encoding claims that value data is stored in “the seven most significant bits in each byte”, but the HDT implementation uses the seven least significant bits
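
For concreteness, here is a minimal sketch of a decoder matching the behavior I observed, assuming one common convention for the stop bit (a set high bit marking the final byte, as I read the C++ sources; this detail is an assumption, not documented):

    // Decode one VByte-encoded integer starting at `offset`,
    // advancing `offset` past the consumed bytes. Returns nil if
    // the input ends before a stop bit is seen.
    func decodeVByte(_ bytes: [UInt8], offset: inout Int) -> UInt64? {
        var value: UInt64 = 0
        var shift: UInt64 = 0
        while offset < bytes.count {
            let byte = bytes[offset]
            offset += 1
            // Data lives in the seven *least* significant bits,
            // least significant group first.
            value |= UInt64(byte & 0x7F) << shift
            shift += 7
            if byte & 0x80 != 0 { return value } // stop bit set: done
        }
        return nil
    }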

“Log64” referenced in BitmapTriples does not seem to be defined anywhere

There doesn’t seem to be documentation on exactly how RDF term data (“strings”) is encoded in the dictionary. Example datasets are enough to intuit the format, but it’s not clear why \u and \U escapes are supported, as this adds complexity and inefficiency. Moreover, without a canonical format (including when/if escapes must be used), it is impossible to efficiently implement dictionary lookup

The W3C submission seems to differ dramatically from the current format. I understood this to mean that the W3C document was very much outdated compared to the documentation at rdfhdt.org, and the available implementations seem to agree with this understanding

There doesn’t seem to be any shared test suite between implementations, and existing tooling makes producing HDT files with non-default configurations difficult/impossible

The secondary index format seems to be entirely undocumented

In addition, there are issues that make the format unnecessarily complex, inefficient, non-portable, etc.:

The distinct value spaces for predicate and subject/object dictionary IDs are at odds with many triple stores and make interoperability difficult (e.g. dictionary lookup is not just dict[id] -> term, but dict[id, pos] -> term; a single term might have two IDs if it is used as both predicate and subject/object)
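
In code, the resulting lookup interface ends up looking something like this hypothetical sketch (not any particular implementation’s API):

    // The position a term ID was found in; predicate IDs and
    // subject/object IDs live in separate value spaces.
    enum TermPosition {
        case subjectObject
        case predicate
    }

    protocol TermDictionary {
        // Both directions need the position, not just the ID or term:
        // ID 7 as a predicate and ID 7 as a subject/object can name
        // two entirely different terms.
        func term(for id: Int, position: TermPosition) -> String?
        func id(for term: String, position: TermPosition) -> Int?
    }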

The use of 3 different checksum algorithms seems unnecessarily complex with unclear benefit

A long-standing GitHub issue seems to indicate that there may be licensing issues with the C++ implementation, precluding it from being distributed in Debian systems (and more broadly, there seems to be a lack of responsiveness to GitHub issues, many of which have been open for more than a year without response)

The example HDT datasets on rdfhdt.org are of varying quality; e.g. the SWDF dataset was clearly compiled from multiple source documents, but did not ensure unique blank nodes before merging

Open Questions

Instead of an (undocumented) secondary index file, why does HDT not allow multiple triples sections, each with a different triple ordering? A secondary index file might still be useful in some cases, but there would be obvious benefits to being able to store and access multiple triple orderings without the extra implementation burden of an entirely separate file format.

Future Directions

In his recent DeSemWeb talk, Axel Polleres suggested that widespread HDT adoption could help to address several challenges faced when publishing and querying linked data. I tend to agree, but think that if we as a community want to choose HDT, we need to put some serious work into improving the documentation, tooling, and portable implementations.

Beyond improvements to existing HDT resources, I think it’s also important to think about use cases that aren’t fully addressed by HDT yet. The HDTQ extension to support quads is a good example here; allowing a single HDT file to capture multiple named graphs would support many more use cases, especially those relating to graph stores. I’d also like to see a format that supported both triples and quads, allowing the encoding of things like SPARQL RDF Datasets (with a default graph) and TriG files.

I recently began taking a look at the Wikidata query logs that were published a couple of months ago and wanted to look into how some features of SPARQL were being used on Wikidata. The first thing I’ve looked at is the use of property paths: how often paths are used, what path operators are used, and with what frequency.

Using the “interval 3” logs (2017-08-07 through 2017-09-03, representing ~78M successful queries), I found that ~25% of queries used property paths. The vast majority of these use just a single property path, but there are queries that use as many as 19 property paths:

Pct.        Count      Number of Paths
74.3048%    58161337    0
24.7023%    19335490    1
 0.6729%      526673    2
 0.2787%      218186    4
 0.0255%       19965    3
 0.0056%        4387    7
 0.0037%        2865    8
 0.0030%        2327    9
 0.0011%         865    6
 0.0008%         604   11
 0.0006%         434    5
 0.0005%         398   10
 0.0002%         156   12
 0.0001%         110   15
 0.0001%         101   19
 0.0001%          56   13
 0.0000%          12   14

I normalized IRIs and variable names used in the paths so that I could look at just the path operators and the structure of the paths.
The path operators used skew heavily towards * (ZeroOrMore), as well as sequence and inverse paths that can be rewritten as simple BGPs.
Here are the structures representing at least 0.1% of the paths in the dataset:

Pct.        Count      Path Structure
49.3632%    10573772   ?s <iri1> * ?o .
39.8349%     8532772   ?s <iri1> / <iri2> ?o .
 4.6857%     1003694   ?s <iri1> / ( <iri2> * ) ?o .
 1.8983%      406616   ?s ( <iri1> + ) / ( <iri2> * ) ?o .
 1.4626%      313290   ?s ( <iri1> * ) / <iri2> ?o .
 1.1970%      256401   ?s ( ^ <iri1> ) / ( <iri2> * ) ?o .
 0.7339%      157212   ?s <iri1> + ?o .
 0.1919%       41110   ?s ( <iri1> / ( <iri2> * ) ) / ( ^ <iri3> ) ?o .
 0.1658%       35525   ?s <iri1> / <iri2> / <iri3> ?o .
 0.1496%       32035   ?s <iri1> / ( <iri1> * ) ?o .
 0.1124%       11889   ?s ( <iri1> / <iri2> ) / ( <iri3> * ) ?o .
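
To make the normalization concrete, a common Wikidata pattern like the following (the instance-of/subclass-of idiom; the specific properties are just an illustration) would be counted under the structure ?s <iri1> / ( <iri2> * ) ?o:

    ?item wdt:P31/wdt:P279* wd:Q5 .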

There are also some rare but interesting uses of property paths in these logs:

One such use involves a blank node shared between a basic graph pattern and a property path. A pattern such as _:s <p> ?x ; <q>* ?y does not result in a single BGP. Intuitively, I think this should be allowed; its intention seems clear. However, it results in two primary algebraic components: a basic graph pattern with the triple pattern _:s <p> ?x, and a property path _:s ZeroOrMorePath(<q>) ?y. This certainly breaks the rule about only using blank nodes in a single basic graph pattern.

The language in section 5.1.1 of the SPARQL specification originated in SPARQL 1.0, and I believe it was simply overlooked when the language was updated to add property paths.

When handling blank node labels, instead of following the exact language of the specification, I believe SPARQL implementations should instead allow blank node labels to be shared across any adjacent set of basic graph patterns and property paths.

Explicit ordering in GROUP_CONCAT, and quads support for both the HTTP Graph Store Protocol and CONSTRUCT queries (items 2, 5, and 8 in Jindřich’s list) seem like obvious improvements to SPARQL with a clear path forward for semantics and implementation.

Here are some of the other wished-for features:

Explicitly specify the REDUCED modifier (#1)

As an implementor, I quite like the fact that REDUCED is “underspecified.”
It allows optimization opportunities that are much cheaper than a full
DISTINCT would be, while still reducing result cardinality. I think it’s
unfortunate that REDUCED hasn’t seen much use over the years, but I’m not
sure what a better-specified REDUCED operator would do differently from DISTINCT.

Property path quantifiers (#3)

The challenge of supporting path quantifiers like elt{n,m} is figuring out
what the result cardinality should be. The syntax for this was drafted
during the development of SPARQL 1.1, but we couldn’t find consensus on whether
elt{n,m} should act like a translation to an equivalent BGP/UNION pattern
or like the arbitrary length paths
(which do not introduce duplicate results). For small values of n and m,
the translation approach seems natural, but as they grow, it’s not obvious
that use cases would only want the translation semantics and not the
non-duplicating semantics.
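
To make the difference concrete: under the translation semantics, a
pattern like ?s :p{1,2} ?o (draft quantifier syntax) would be expanded
to something like

    { ?s :p ?o }
    UNION
    { ?s :p/:p ?o }

which can return a given (?s, ?o) pair once per matching path, while
the non-duplicating arbitrary-length semantics would return each pair
at most once.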

Perhaps a new syntax could be developed which would allow the query author
to indicate the desired cardinality semantics.

Date time/duration arithmetic functions (#6)

This seems like a good idea, and very useful to some users, though it would substantially increase the number of built-in functions and operators.
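
A hypothetical example of what such arithmetic might look like (this is not valid SPARQL 1.1; subtracting two xsd:dateTime values to produce a duration is exactly the kind of operator that would need to be added):

    PREFIX : <http://example.org/>
    SELECT ?event ((?end - ?start) AS ?duration)
    WHERE {
        ?event :starts ?start ;
               :ends ?end .
    }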

Support for non-scalar-producing aggregates (#9)

I’m interested to see how this plays out as a SPARQL extension in systems like Stardog.
It likely has a lot of interesting uses, but I worry that it would greatly complicate
the query and data models, leading to calls to extend the semantics of RDF, and add new
query forms, operators, and functions.

Structured serialization format for SPARQL queries (#10)

I’m indifferent to this. I suspect some people would benefit from such a format,
but I don’t think I’ve ever had need for one (where I couldn’t just parse a query
myself and use the resulting AST) and it would be another format to support for
implementors.

Beyond that, here are some other things I’d like to see worked on (either standardization, or cross-implementation support):

Passing the active graph to SERVICE endpoints

This can be partially accomplished right now for hard-coded graphs by using an endpoint URL with the default-graph-uri query parameter, but I’d like more general support that could work dynamically with the active graph when the SERVICE block is evaluated.
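
The hard-coded version looks something like this (hypothetical endpoint and graph URLs):

    SERVICE <http://example.org/sparql?default-graph-uri=http%3A%2F%2Fexample.org%2Fg1> {
        ?s ?p ?o
    }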

Structured errors for use in the SPARQL Protocol

My preference for this would be using the RFC 7807 “Problem Details” JSON format, with a curated list of IRIs and associated metadata representing common error types (syntax errors, query-too-complex or too-many-requests refusals, etc.). There’s a lot of potential for smarter clients if errors contain structured data (e.g. SPARQL editors can highlight/fix syntax issues; clients could choose alternate data sources such as triple pattern fragments when the endpoint is overwhelmed).
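
For example, a syntax error might be reported something like this (the type IRI and the extension members here are hypothetical, pending any curated list):

    {
      "type": "http://example.org/sparql-errors#syntax-error",
      "title": "Query syntax error",
      "status": 400,
      "detail": "Expected '}' but found 'FILTER' at line 3, column 12.",
      "line": 3,
      "column": 12
    }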

As part of work on the Attean Semantic Web toolkit, I found some time to work through limit-by-resource, an oft-requested SPARQL feature and one that my friend Kjetil lobbied for during the SPARQL 1.1 design phase. As I recall, the biggest obstacle to pursuing limit-by-resource in SPARQL 1.1 was that nobody had a clear idea of how to fit it nicely into the existing SPARQL syntax and semantics. With hindsight, and some time spent working on a prototype, I now suspect that this was because we first needed to nail down the design of aggregates and let aggregation become a first-class feature of the language.

Now, with a standardized syntax and semantics for aggregation in SPARQL, limit-by-resource seems like just a small enhancement to the existing language and implementations by the addition of window functions. I implemented a RANK operator in Attean, used in conjunction with the already-existing GROUP BY. RANK works on groups just like aggregates, but instead of producing a single row for each group, the rows of the group are sorted, and given an integer rank which is bound to a new variable. The groups are then “un-grouped,” yielding a single result set. Limit-by-resource, then, is a specific use-case for ranking, where groups are established by the resource in question, ranking is either arbitrary or user-defined, and a filter is added to only keep rows with a rank less than a given threshold.

I think the algebraic form of these operations should be relatively intuitive and straightforward. New Window and Ungroup algebra expressions are introduced akin to Aggregation and AggregateJoin, respectively. Window(G, var, WindowFunc, args, order comparators) operates over a set of grouped results (either the output of Group or another Window), and Ungroup(G) flattens out a set of grouped results into a multiset.

If we wanted to use limit-by-resource to select the two eldest students per school, we might end up with something like this:
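
One possible (hypothetical) rendering in the algebra sketched above:

    Filter(?rank <= 2,
        Ungroup(
            Window(
                Group((?school), BGP(?student :school ?school . ?student :age ?age)),
                ?rank, Rank, (), (DESC(?age)))))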

Students with their ages and schools are matched with a BGP. Grouping is applied based on the school. Rank with ordering by age is applied so that, for example, the result for the eldest student in each school is given ?rank=1, the second eldest ?rank=2, and so on. Finally, we apply a filter so that we keep only results where ?rank is 1 or 2.

The syntax I prototyped in Attean allows a single window function application applied after a GROUP BY clause:
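
A sketch of that syntax, using the eldest-students example (the RANK clause and the use of HAVING to filter on the rank variable are reconstructed from the description above, and may not match the prototype exactly):

    PREFIX : <http://example.org/>
    SELECT ?school ?student WHERE {
        ?student :school ?school ;
                 :age ?age .
    }
    GROUP BY ?school
    RANK(DESC(?age)) AS ?rank
    HAVING (?rank <= 2)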

However, a more complete solution might align more closely with existing SQL window function syntaxes, allowing multiple functions to be used at the same time (appearing syntactically in the same place as aggregation functions).

I recently found myself wanting a simple and efficient way to parse RDF files in Swift, and ended up writing a small library that uses David Robillard’s excellent Serd library for RDF syntax. The resulting projects (available on GitHub) are swift-serd, a low-level wrapper around the C API, and SerdParser, a more Swift-y higher-level API that allows parsing of RDF content and handling the resulting triples.

The resulting code benefits from serd’s performance and standards-compliance, and allows very simple parsing of N-Triples and Turtle files:
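
A sketch of what that usage looks like (the exact type and method names here are assumptions on my part, so check the project README for the current API):

    import SerdParser

    // Parse a Turtle file and handle each resulting triple.
    let parser = SerdParser(syntax: .turtle)
    do {
        _ = try parser.parse(file: "data.ttl") { (s, p, o) in
            print("\(s) \(p) \(o)")
        }
    } catch {
        print("Parse error: \(error)")
    }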

I’m devastated and heartbroken. I’m worried about the people in my life who would not have health insurance if not for Obamacare. I’m worried for Muslims, immigrants, the disabled, women, and anyone else who will be on the receiving end of the harassment and intolerance that has been normalized by our President-elect. I’m worried that we will dangerously backtrack on progress made in fighting climate change, and waste time obstinately defending fossil fuels and opposing green alternatives. I’m worried about living in a world where truth and facts don’t seem to carry any weight, where science is viewed with skepticism, and experts are dismissed out of hand.

I’m drawing inspiration from this. Regardless of what happens today, there will be a lot of work to do. I hope we can come together, with the understanding that we all want to make this country better. That will require listening to others, and trying to understand their concerns. And it will require being open to facts and evidence, even when they contradict beliefs. It’s election day. Let’s do this.

Answering SPARQL Queries over Databases under OWL 2 QL Entailment Regime by Kontchakov et al. caught my attention during the main conference. The authors show a technique to map query answering under the OWL 2 QL entailment regime to pure SPARQL 1.1 queries under simple entailment, allowing OWL 2 QL reasoning to be used with SPARQL 1.1 systems that do not otherwise support OWL. An important challenge highlighted by the work (and echoed by the audience after the presentation) was the limited performance of existing property path implementations. It seems as if there might be a subset of SPARQL 1.1 property paths (e.g. with limited nesting of path operators) that could be implemented more efficiently while providing enough expressivity to allow things like OWL reasoning.

There seemed to be a lot of interest in more flexible solutions to SPARQL query answering. Ruben’s Linked Data Fragments work addresses one part of this issue (though I remain somewhat skeptical about it as a general-purpose solution). That work was received very well at the conference, winning the Best Demo award. In conversations I had, there was also a lot of interest in the development of SPARQL systems that could handle capacity and scalability issues more gracefully (being smart enough to use appropriate HTTP error codes when a query cannot be answered due to system load or the complexity of the query, allowing a fallback to linked data fragments, for example).