08 September 2007

Counting and GROUP BY

One thing people miss from SPARQL is counting. It's a feature that the working
group didn't have time for.

There's an implementation, following the design in SQL, in ARQ SVN which will
be in the next release (v2.1). v2.1 also introduces the cost-based optimizer
for in-memory basic graph patterns by Markus.

It's a syntactic extension, not strict SPARQL, so you have to tell the system
to parse queries in the "ARQ" language by passing Syntax.syntaxARQ
to the query factory.

The following queries will work:

SELECT count(*) { ... }

SELECT (count(*) AS ?count) { ... }

This is based on having SELECT expressions as well as grouping. Using AS to
give a named variable is better style because the results can then go into the
SPARQL XML results format; otherwise, internal variables are allocated and
their names are not legal SPARQL variable names.

Other examples:

SELECT (count(*) AS ?rows)
{ ... }
GROUP BY ?x

SELECT count(distinct *)
{ ... }
GROUP BY ?x

SELECT count(?y)
{ ... }
GROUP BY ?x

What is being counted is solutions in the case of count(*), and
names in the case of count(?var).
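
To make the difference between the three forms concrete, here is a minimal Python sketch that mimics grouping and counting over a list of solutions. It is not ARQ code: the sample data and helper names are hypothetical, and it assumes that count(?var) counts the solutions in a group where the variable is bound, which is how SPARQL 1.1 later defined COUNT over a variable.

```python
from collections import defaultdict

# Hypothetical solutions (variable bindings) produced by a pattern { ... }.
# A missing key means the variable is unbound in that solution.
solutions = [
    {"x": "a", "y": 1},
    {"x": "a", "y": 1},
    {"x": "a"},
    {"x": "b", "y": 2},
]


def group_by(sols, var):
    """Partition solutions by the value bound to var (None if unbound)."""
    groups = defaultdict(list)
    for s in sols:
        groups[s.get(var)].append(s)
    return dict(groups)


def count_star(group):
    """count(*): the number of solutions in the group."""
    return len(group)


def count_distinct_star(group):
    """count(distinct *): the number of distinct solutions in the group."""
    return len({frozenset(s.items()) for s in group})


def count_var(group, var):
    """count(?var): solutions in which var is bound (assumed semantics)."""
    return sum(1 for s in group if var in s)


groups = group_by(solutions, "x")
rows = {
    key: (count_star(g), count_distinct_star(g), count_var(g, "y"))
    for key, g in groups.items()
}
print(rows)  # {'a': (3, 2, 2), 'b': (1, 1, 1)}
```

For the group where ?x is "a", the three solutions give count(*) = 3, the duplicate collapses under count(distinct *) to give 2, and the solution with no ?y is skipped by count(?y), also giving 2.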

9 comments:

Such syntax had been discussed a while ago in the WG, and ARQ's is one of the choices from those discussions. The consensus then was that, without any markers, expressions are a nuisance to deal with when parsing. SPARQL already uses () for expressions in ORDER BY and FILTER. This syntax merely reuses that idea. "GROUP BY (expr)" also works in ARQ.

Without () "?x-?y" is ambiguous. Is it "?x-?y" or "?x" and "-?y"?

I couldn't get the Virtuoso style to work without going beyond the level of parsing difficulty that DAWG has avoided (no problem for javacc, but harder for pure LL(1)).

I'd guess that the Virtuoso syntax requires the aggregate to be last else "count distinct ?s ?p ?o" is ambiguous. "?o count distinct ?s ?p" seems to be legal by other examples so the meaning of "?o" varies by its position.

The RAP extensions introduce the use of comma in SELECT. There are other syntax extensions that SPARQL already provides for (e.g. SQRT instead of math:sqrt).

For expressions in CONSTRUCT, I wondered about CONSTRUCT-SELECT (that would be quite SQL-like).

But the other way to do it would be to add a clause (after GROUP BY) such as WITH (named expressions) so "WITH (?x+?y AS ?z)".

I did think of doing this for SELECT expressions. It does make the relationship of aggregates and GROUP BY more natural in my view, because the use of aggregates in SELECT has a sort of side-effect of both aggregating (so it's in the GROUP BY) and delivering the value.

Regarding COUNT syntax, I take it that the ARQ syntax is the best. My concern is that there are incompatible implementations around now, and by the time COUNT reaches REC status, they might be firmly entrenched. Some co-ordination now might reduce headaches later on.

Regarding SELECT-CONSTRUCT: This seems a bit counter-intuitive to me, because the only meaningful use of the SELECT clause would be to assign expressions to new variables, and that isn't really about SELECTing but about ... EXTENDing or ASSIGNing or doing something WITH a new variable. Something like CONSTRUCT-WITH-WHERE sounds more friendly to me.

I'm not sure I understand how SELECT, GROUP BY and WITH would interact. If you add CONSTRUCT to the mix, it might get quite complicated for a query writer to track the flow of a variable through the different stages.

I also could imagine expressions embedded directly into the CONSTRUCT clause in place of the usual RDF nodes: CONSTRUCT { ?x foaf:name EXPRESSION(str:concat(?first, " ", ?last)) }

We've gotten a lot of mileage out of a PREMISE clause in our queries. In our hacked version of Joseki, the model is duplicated and the premise statements are added to the copy before the query is made.

It is a bit of a performance hit (hey, it's research, right?) but it provides a pretty flexible way to do a lot of neat stuff, like querying hypothetical scenarios, enabling certain rules on a per-query basis, or providing a space where a client application can set up certain magic variables available in the query.

ted: another way to get the same effect is to query a union model. Put the premises in the updated part of the union and the underlying data in another model in the union. This avoids copying the data all the time; it may slow the query a little but (for a research prototype!) it may well be faster than copying the data all the time.
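
As a rough illustration of the union idea (not Jena or Joseki code), here is a small Python sketch of a read-only view over two sets of triples. The class name, the `find` method, and the sample triples are all hypothetical. The point is only that the premises sit alongside the base data and nothing is copied.

```python
class UnionGraph:
    """Read-only view over several triple sets; no data is copied."""

    def __init__(self, *graphs):
        self.graphs = graphs

    def find(self, s=None, p=None, o=None):
        """Yield each matching triple once; None acts as a wildcard."""
        seen = set()
        pattern = (s, p, o)
        for graph in self.graphs:
            for triple in graph:
                if triple in seen:
                    continue
                if all(q is None or q == v for q, v in zip(pattern, triple)):
                    seen.add(triple)
                    yield triple


# Hypothetical base data and per-query premises.
base = {("ex:alice", "ex:knows", "ex:bob")}
premises = {("ex:bob", "ex:knows", "ex:carol")}

view = UnionGraph(premises, base)
matches = sorted(view.find(p="ex:knows"))
print(matches)
```

Each lookup has to consult both underlying sets, which is the small per-query cost mentioned above. The base data is never duplicated or modified.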

The SERVICE extension is very useful. An additional useful extension for remote data sources would be to allow querying small static RDF documents inside a query, similar to SERVICE. While SERVICE points to a SPARQL endpoint, the new feature would simply fetch the RDF data from a static URI.

As far as I know this is not possible with NAMED GRAPH, or am I wrong?

dorgon: For a static source of data, FROM NAMED/GRAPH should do what you want. Doing it dynamically, choosing the source during the query, isn't possible and would need a syntax extension, but it's an interesting idea. Maybe it's part of something more general: the ability to extend the graph (any graph) being queried so as to walk the GGG.