I've been packaging the nghttp2 HTTP/2.0 proxy and client by Tatsuhiro Tsujikawa, both for Debian and with Docker, and noticed that it takes some time to get the build dependencies (C++, cough) as well as to do the build.

In the Debian packaging case it's easy to create minimal dependencies thanks to pbuilder, and to ensure the binary package contains only the right files. See the debian nghttp2 packaging.

For Docker, since you work with containers, it's harder to see what changed, but you still really want the images as small as possible, since you have to download them to run the app and they take up disk space. While doing this I kept seeing huge images (480 MB), way larger than the base image I was using (123 MB), which didn't make sense since I was just packaging a few binaries and some small files, plus their dependencies. My estimate was that the delta should be well under 100 MB.

I pored over multiple blog posts about Docker images and how to make them small. I even looked at some of the squashing tools like docker-squash, which involve export and import, but those didn't seem quite the right approach.

It took me a while to really understand that each Dockerfile command creates a new layer containing just the deltas. So when you see all those layers being downloaded in a docker pull of an image, it can be a lot of data that is mostly unused.

So if you want to make it small, you need to make each Dockerfile command touch the smallest number of files, and use a standard base image so that most people do not have to download your custom l33t base.

It doesn't matter if you rm -rf the files in a later command; they continue to exist in an earlier intermediate layer.
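One way to see this for yourself is docker history, which lists each layer of an image along with its size, so a build tree that was only deleted in a later command still shows up as a large earlier layer (the image name here is just a placeholder):

    $ docker history example/image:latest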

So: prepare, configure, build, make install and clean up in one RUN command if you can. If the lines get too long, put the steps in separate scripts and call them.
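For example, a single RUN along these lines keeps everything in one layer. This is only a sketch: the package list, URL and paths are illustrative, not the exact nghttp2 Dockerfile.

    RUN BUILD_PACKAGES="build-essential curl ca-certificates libxml2-dev zlib1g-dev" && \
        apt-get update && \
        apt-get install -y $BUILD_PACKAGES && \
        # download, unpack, build and install, all within this one layer
        curl -sSL https://example.org/src-1.0.tar.gz | tar -xz -C /tmp && \
        cd /tmp/src-1.0 && ./configure && make && make install && \
        # remove the build-only packages and every temporary file before the layer is committed
        apt-get remove --purge -y $BUILD_PACKAGES && apt-get autoremove -y && \
        rm -rf /var/lib/apt/lists/* /tmp/src-1.0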

Lots of Docker images are based on Debian images because they are a
small and practical base. The debian:jessie image is smaller than
the Ubuntu (and CentOS) images. I haven't checked out the fancy
'cloud' images too much:
Ubuntu Cloud Images,
Snappy Ubuntu Core,
Project Atomic, ...

In a Dockerfile building from some downloaded package, you generally need wget or curl and maybe git. When you install, for example, curl and ca-certificates to get TLS/SSL certificates, they pull in a lot of extra packages, such as openssl in the standard Debian curl build.

You are pretty unlikely to need curl or git after the build stage of
your package. So if you don't need them, you could - and you should
- remove them, but that's one of the tricky parts.

If $BUILD_PACKAGES contains the list of build dependency packages, such as libxml2-dev and so on, you would think that this would get you back to the start state:
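Roughly this (a sketch, not the exact commands from my Dockerfile):

    $ apt-get remove --purge -y $BUILD_PACKAGES
    $ apt-get autoremove -y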

I added --purge too since we don't need any config files in /etc
for build packages we aren't using.

Having done that, you might have removed some runtime package dependencies of something you built. That's harder to find automatically, so you'll have to list and install those by hand:

$ RUNTIME_PACKAGES="...."
$ apt-get install -y $RUNTIME_PACKAGES

Finally you need to clean up apt, which you should do with rm -rf /var/lib/apt/lists/*; this removes all the index files that apt-get update downloaded. It appears in many best-practice documents and example Dockerfiles.

You could add apt-get clean which removes any cached downloaded
packages, but that's not needed in the official Docker images of
debian and ubuntu since the cached package archive feature is
disabled.

Finally, don't forget to delete your build tree, and do it in the same RUN in which you did the compile, so the tree never ends up in a layer. This might not make sense for some languages where you work from inside the extracted tree, but why not delete the src dirs? Definitely delete the tarball!

The massive difference is the source tree and the 232 MB of build
dependencies that apt-get pulls in. If you don't clean all that up
before the end of the RUN you end up with a huge transient layer.

The final size of 149.8 MB, compared to the 122.8 MB debian:jessie base image, is a delta of 27 MB, which for a few servers, a client and their libraries sounds great! I could probably get it down a little more if I installed just the binaries. The runtime libraries I use are 5.9 MB.

Official Docker images bootstrap script
It includes multiple fixes, including for initrd, a dpkg/apt speedup, and preventing services starting via policy-rc.d and/or upstart. It configures apt so that it is "effectively running apt-get clean after every install" and disables package caching.

phusion image base script
The base preparation of phusion's images (which are quite opinionated) is also interesting if you want to find out how to set base languages and locales.

This acquihire does NOT include me. I will be changing jobs shortly
but have nothing further to announce at this time.

I wish my former Digg colleagues the best of luck in their new roles. I
had a great time at Digg and learned a lot about working in a small
company, social news, analytics, public APIs and the technology stack
there.

I got tired of posting release announcements to my blog so I just emailed the announcements to the redland-dev list, tweeted a link from @dajobe and announced them on Freshmeat, which a lot of places still pick up.

Here are the tweets for the 13 releases I didn't blog since the start of
2011:

You know, it's quite tricky to collapse months of changelogs (GIT history) into release notes, compress that further into a news summary of a few lines, and even harder to compress that into less than 140 characters. It's way less than 140 once you include room for a link URL, space for retweeting and sometimes a hashtag for context.

Releases that stand out here are Raptor 2.0.0, which was a major release with lots of changes, and the change from Rasqal 0.9.21 upwards, which was both an API break and a lot of new functionality.

Sources

Taken from my GitHub repositories by extracting the tagged releases, excluding ChangeLog* files, and running diffstat over the output of a recursive diff -uRN.

| Date | Package | Old version | New version | Source files changed | Source lines inserted | Source lines deleted | Source lines net |
|------------|-----------|--------|--------|-----|--------|--------|---------|
| 2011-01-03 | raptor    | 1.4.21 | 2.0.0  | 215 | 34,018 | 30,348 | 64,366  |
| 2011-01-12 | rasqal    | 0.9.21 | 0.9.22 | 94  | 11,641 | 5,712  | 17,353  |
| 2011-01-27 | rasqal    | 0.9.22 | 0.9.23 | 25  | 5,663  | 5,199  | 10,862  |
| 2011-01-30 | rasqal    | 0.9.23 | 0.9.24 | 48  | 1,107  | 227    | 1,334   |
| 2011-02-01 | redland   | 1.0.12 | 1.0.13 | 96  | 3,721  | 5,627  | 9,348   |
| 2011-02-22 | rasqal    | 0.9.24 | 0.9.25 | 64  | 3,857  | 1,333  | 5,190   |
| 2011-03-20 | raptor    | 2.0.0  | 2.0.1  | 42  | 6,163  | 5,833  | 11,996  |
| 2011-03-20 | raptor    | 2.0.1  | 2.0.2  | 9   | 55     | 12     | 67      |
| 2011-03-26 | flickcurl | 1.20   | 1.21   | 19  | 737    | 308    | 1,045   |
| 2011-06-01 | raptor    | 2.0.2  | 2.0.3  | 88  | 2,827  | 2,232  | 5,059   |
| 2011-06-27 | rasqal    | 0.9.25 | 0.9.26 | 116 | 7,130  | 4,272  | 11,402  |
| 2011-07-23 | raptor    | 2.0.3  | 2.0.4  | 33  | 808    | 103    | 911     |
| 2011-07-25 | redland   | 1.0.13 | 1.0.14 | 75  | 3,681  | 5,477  | 9,158   |
| Total      |           |        |        | 924 | 81,408 | 66,683 | 148,091 |


Again Raptor 2.0.0 stands out as changing a huge number of files and lines. You can also see the mistake that was Raptor 2.0.1 being corrected the same day by Raptor 2.0.2 with a few changes; that one didn't seem to get tweeted. However, also note that several of the Rasqal releases, like 0.9.22 and 0.9.26, changed many files. The 'source lines net' column is the sum of the inserted and deleted lines, although some of those lines are the same.

Words

Words from the changelog, the release notes and the news post
comparing the number of words in the rendered output.

| Date | Package | Old version | New version | Changelog words | Release note words | Changelog:release word ratio | News words | Changelog:news word ratio |
|------------|-----------|--------|--------|--------|-------|------|-------|-------|
| 2011-01-03 | raptor    | 1.4.21 | 2.0.0  | 15,411 | 2,709 | 5.69 | 365   | 42.22 |
| 2011-01-12 | rasqal    | 0.9.21 | 0.9.22 | 3,465  | 1,199 | 2.89 | 162   | 21.39 |
| 2011-01-27 | rasqal    | 0.9.22 | 0.9.23 | 318    | 135   | 2.36 | 52    | 6.12  |
| 2011-01-30 | rasqal    | 0.9.23 | 0.9.24 | 450    | 254   | 1.77 | 59    | 7.63  |
| 2011-02-01 | redland   | 1.0.12 | 1.0.13 | 778    | 235   | 3.31 | 73    | 10.66 |
| 2011-02-22 | rasqal    | 0.9.24 | 0.9.25 | 1,649  | 558   | 2.96 | 136   | 12.13 |
| 2011-03-20 | raptor    | 2.0.0  | 2.0.1  | 247    | 76    | 3.25 | 50    | 4.94  |
| 2011-03-20 | raptor    | 2.0.1  | 2.0.2  | 42     | 27    | 1.56 | 42    | 1.00  |
| 2011-03-26 | flickcurl | 1.20   | 1.21   | 119    | -     | -    | 68    | -     |
| 2011-06-01 | raptor    | 2.0.2  | 2.0.3  | 872    | 266   | 3.28 | 28    | 31.14 |
| 2011-06-27 | rasqal    | 0.9.25 | 0.9.26 | 4,410  | 970   | 4.55 | 96    | 45.94 |
| 2011-07-23 | raptor    | 2.0.3  | 2.0.4  | 517    | 345   | 1.50 | 77    | 6.71  |
| 2011-07-25 | redland   | 1.0.13 | 1.0.14 | 1,347  | 620   | 2.17 | 88    | 15.31 |
| Total      |           |        |        | 29,625 | 7,394 |      | 1,296 |       |


So now we get to words. Yes, lots of words, most of them by me. The changelog, which is a hand-edited version of the SVN and later GIT changes, was over 15K words for Raptor 2.0.0, and that gets boiled down a lot into release notes, then news, and then a terse tweet. Since the changelog corresponds roughly to source changes but the news to user-visible changes like APIs, you can see that the oddity is again Rasqal 0.9.26, where there were lots of changes but not so much news; it was mostly internal work.

Now I need to go summarise this blog post in a tweet:
Releases = Tweets in 1156 words http://bit.ly/n88ZIQ

I'd like to emphasize a couple of the changes to the roqet(1) utility program: you can now use it to query over data from standard input, i.e. use it as a filter, although only if you are querying over one graph. You can also specify the format of the data graphs, whether they come from standard input or from URIs, if the format can't be determined or guessed from the MIME type, name or bytes. Finally, roqet(1) can execute remote queries at a SPARQL Protocol HTTP service, sometimes called a "SPARQL endpoint".

The new support for SPARQL Query 1.1 aggregate queries (and other
features) led me to make comments to the SPARQL working group about
the latest SPARQL Query 1.1 working draft based on the implementation
experience. The comments (below) were based on the implementation
I previously outlined in
Writing an RDF query engine. Twice
at the end of October 2010.

The implementation work to create the new features was substantial but relatively simple to describe: new rowsources were added for each of the aggregation operations and then included in the execution plan when the query structure, after parsing, indicated their presence. There was some additional glue code that needed to be added to allow rows to indicate their presence in a group; a simple integer group ID was sufficient, and the ID value has no semantics: a check for a change of ID is enough to know when a group starts or ends.
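The shape of that glue, as a rough sketch (the type and function names here are illustrative, not the actual Rasqal code):

    /* Sketch only: each row carries an opaque integer group ID; the value
     * itself has no meaning, only a change of value marks a group boundary. */
    typedef struct {
      int group_id;
      /* ... the row's variable bindings ... */
    } row;

    /* Returns non-zero when 'current' starts a new group. */
    static int
    row_starts_new_group(const row *current, const row *previous)
    {
      if(!previous)
        return 1;                    /* the first row always opens a group */
      return current->group_id != previous->group_id;
    }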

I also introduced an internal variable to bind the result of SELECT aggregate expressions after grouping ($$aggID$$, which is an illegal name in SPARQL). I then used that to replace the aggregate expression in the SELECT and HAVING expressions and used the normal evaluation mechanisms. As I understand it, the SPARQL WG is now considering adding a way to name these explicitly in the GROUP BY clause; a happy coincidence, since I had implemented it without knowing that.

To prepare this I thought about the approach a lot and developed a couple of diagrams for the grouping and aggregation rowsources that might help explain how they work and how they can be implemented and tested as standalone unit modules, which they were.

Rasqal Group By Rowsource

As always, the above isn't quite how it is implemented. There is no group-by node when there is an implicit group, i.e. when GROUP BY is missing but an aggregate expression is used; instead the Rasqal rowsource class synthesizes one group around the incoming rows when grouping is requested.

Rasqal Aggregation Rowsource

This shows the guts of the aggregate expression evaluation where
internal variables are introduced, substituted into the SELECT and
HAVING expressions and then created as the aggregate expressions are
executed over the groups.

The rest of this post is my detailed thoughts on this draft of SPARQL 1.1 Query, as posted to the SPARQL WG comments list but turned into HTML markup here.

I. General comments

I felt the specification introduced more optional features bundled together, where it was not entirely clear what the combination of those features would do. For example, a query with no aggregate expression but with a GROUP BY and HAVING is allowed by the syntax, and the main document doesn't say whether it's allowed or what it means.

I found it hard to assemble all the pieces from the mathematical
explanations into something I could code.

The spec has several terms in the grammar that are not in the query document. After asking, these turned out to be from federated query (BINDINGS) or update (LOAD, ...), but they are not pointed out or linked to clearly, although the documents are mentioned in the status section. Please make these clearer.

I decided to concentrate on the new Aggregates feature since I had already implemented SELECT expressions, leaving Subqueries and Negation until later. Property paths should be in the list of new features in the status section of the document.

"SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs"
is rather a long title; what does 'Uniform' or 'HTTP' add? SOAP is
dead.
suggest "SPARQL 1.1 RDF Graph Management Protocol"
or RDF dataset

With all the additions, especially property paths (a new query language), update (a data management language) and federated query (remote query support), plus, as I understand it, ~30 additional keywords being added beyond this draft for functions and operators, I see this as a major change to SPARQL 1.0, more of a SPARQL 2. You should consider renaming it.

II. Aggregates

I found the math in the aggregation and grouping sections rather hard to understand, so I also looked at what MySQL and SQLite did, and wrote my own diagram based on the data flow:

so that for me it was easier to see the individual components/stages (which roughly correspond to SPARQL algebra terms).

I had to make several of my own tests with my guesses at what the answers should be. With all the pieces for aggregate expressions (grouping, the aggregate expression itself, DISTINCT, HAVING, and counting with COUNT(*) vs COUNT(expr)), there need to be several tests with good coverage.

As they are all clearly optional, it is probably worth explaining what it means when some are absent, such as GROUP BY plus HAVING with no aggregate expression, as mentioned above.

III. Bindings is a new syntax

BINDINGS essentially gives a new way to write down a variable
bindings result set. Even though it is discussed in the federated
query spec about using it for SERVICE, it's not restricted to that
by
the grammar or specifications.

My question is whether this is correct, and a request to clarify the intended use in the spec: whether or not it is intended for use with SERVICE only.

IV. Section-by-section comments

Section: Status of this Document

It should mention property paths as new, since that is a major addition after SPARQL 1.0.

Please link to the documents in the status section; they are currently just text.

Sections 1-8

Skipped; they are the same as SPARQL 1.0, I hope.

9 Property Paths

I am unlikely to ever implement any of this, it's a second query
language inside SPARQL. How many systems implemented this before
the SPARQL 1.1 work was started?

10 Aggregates

I took all the examples in this section and turned them into test
cases where possible.

10.2

The explanation of errors and ListEvalE is rather opaque. It is
still not clear to me what is done with errors in GROUP BY, HAVING
and arguments to aggregate expressions. Some are skipped, some
are ignored and return NULL. Examples and tests will enable checking
this but the spec needs to be clearer.

Definition: Group and Aggregation were hard for me to understand. The input to Aggregation being a 'scalar', actually meaning a set of key:value pairs, was confusing. It is also not clear whether those are a set or an ordered set of parameters. This is only used today for the 'separator' with GROUP_CONCAT.

10.2.1 HAVING

What happens when there is an expression error?

What variables and expressions can be used here and what is their scope?

10.2.2 Set Functions

Another confusing section. I mostly ignored this and did what SQL did.

None of the functions, as far as I can tell, ever uses 'err'.

10.2.3 Mapping from Abstract Syntax to Algebra

scalarvals argument is used here - I think this is called 'scalar'
earlier.

Un-numbered Section after 10.2.3: Joining Aggregate Values

I never figured out what this was trying to define, but my code executes the example.

11. Subqueries

(Ignored in my current work)

12 RDF Dataset

(Same as SPARQL 1.0 I assume so no comments)

13 Basic Federated Query

Yes, please merge in the text here.

14 Solution Sequences and Modifiers

(Aside: this is one of those SPARQL parts where everything mentioned is optional. Otherwise this section has no change from SPARQL 1.0; I am just mentioning it as a pointer to a trend.)

15. Query Forms

No comments.

16. Testing Values

16.3 Operator Mapping

Is it worth noting the new operators in SPARQL 1.1?

Operators: implemented isNUMERIC()

16.4 Operators Definitions

My current state of implementation of expressions new to SPARQL 1.1:

16.4.16 IF - implemented

16.4.17 IN - implemented

16.4.18 NOT IN - implemented

16.4.19 IRI - implemented

16.4.20 URI - implemented

16.4.21 BNODE - implemented

16.4.22 STRDT - implemented

16.4.23 STRLANG - implemented

No comments on the above

16.4.24 NOT EXISTS and EXISTS

I am lumping these together with sub-SELECT to implement.
My concern here is that the syntax gets super-complex since all the
graph
pattern syntax can now appear inside any expression syntax.

There is a filter operator "exists" that ...

Does this imply these can only appear in FILTER
expressions? Please clarify.

17 Definition of SPARQL

I looked at 17.2.3 for aggregate queries and it was more helpful than the math earlier. The pseudo-code in Step 4 is a bit unclear. Is that an example implementation or the required one?

17.6 Extending SPARQL Basic Graph Matching

Ignored.

18 SPARQL Grammar

Clearly this is not complete; there are lots of notes to update it.

19 Conformance

If property paths are not removed, please add a conformance level
that includes SPARQL 1.1 without property paths.

Does SPARQL 1.1 Query require implementation of the dependent specs, federated query and update? It looks to me like the protocol may also be a dependency.

Raptor V1 was last
released in January 2010
and Raptor V2 seems pretty stable and working. I am
therefore announcing that from early 2011, Raptor V2 will replace
Raptor V1 and be a requirement for
Rasqal and
librdf.

End of life timeline

The next librdf release will support Raptor V1 and Raptor V2
(and requires Rasqal built with the same Raptor version).

2011

The next Rasqal release will support Raptor V2 only.

The next librdf release will support Raptor V2 only
(and require a Rasqal built with Raptor V2).

In the style of open source I've been using for the Redland libraries,
which might be described as "release when it's ready, not release by
date",
these dates may slip a little but the intention is that Raptor V2
becomes
the mainline.

I do NOT rule out that there will be another Raptor V1 release but it
will
be ONLY for security issues. It will contain minimal changes and not
add
any new features or fix any other type of bug.

Developer Actions

If you use the Raptor V1 ABI/API directly, you will need to
upgrade. If you want to write conditional code, that's possible.
The redland librdf GIT source (or 1.0.12) uses the approach of macros
that rewrite V2 into V1 forms and I recommend this way since dropping
Raptor V1 support then amounts to removing the macros.
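Roughly, the trick looks like this (a sketch, not the actual librdf macros; the configure-set symbol name here is an assumption): write the calls in the V2 form, with the raptor_world as the first argument, and when building against Raptor V1 let a macro discard that argument.

    /* Sketch only: map a V2-style call onto the V1 function when building
     * against Raptor V1.  A function-like macro is not re-expanded inside
     * its own replacement, so reusing the name is safe. */
    #ifndef HAVE_RAPTOR2_API            /* hypothetical configure-set symbol */
    #define raptor_new_uri(world, uri_string) raptor_new_uri(uri_string)
    #endif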

Packager Actions

If you are a packager of the redland libraries, you need to prepare for
the
Raptor V1 / Raptor V2 transition which can vary depending on your
distribution's standards. The two versions share two files: the rapper
binary and the rapper.1 man page. I do not want to rename them to
rapper2
etc. since rapper is a well known utility name in RDF and I want
'rapper'
to provide the latest version.

In the Debian packaging which I maintain, these are already planned to
be
in separate packages so that both libraries can be installed and you
can
choose the raptor-utils2 package over raptor-utils (V1).

In other distributions where everything is in one package (BSD Ports
for example) you may have to move the rapper/rapper.1 files to the
raptor V2 package and create a new raptor1 package without them.
i.e. something like this

Raptor V1 package 1.4.21-X:

/usr/lib/libraptor1.so.1* ...

(no /usr/bin/rapper or /usr/share/man/man1/rapper.1 )

Raptor V2 package 2.0.0-1:

/usr/lib/libraptor2.so.0* ...

/usr/bin/rapper

/usr/share/man/man1/rapper.1

conflicts with older raptor1 packages before 1.4.21-X

The other thing to deal with is that when Rasqal is built against Raptor V2, it has internal changes that mean librdf also has to be built against that rasqal-with-raptor2. This needs enforcing with packaging dependencies.

This packaging work can be done/started as soon as Raptor V2 2.0.0
is released which will be "soon".

"You'll end up writing a database" said Dan Brickley prophetically in
early 2000. He was of course, correct. What started as an RDF/XML parser
and a BerkeleyDB-based triple store and API, ended up as a much more
complex system that I named Redland with the
librdf API. It does indeed have persistence, transactions (when using
a relational database) and querying. However, RDF query is not quite the
same thing as SQL since the data model is schemaless and graph centric
so when RDQL and later SPARQL came along, Redland gained a query engine
component in 2003 named Rasqal: the RDF
Syntax and Query Library for Redland. I still consider it
not a 1.0 library after over 7 years of work.

Query Engine The First

The first query engine was written to execute
RDQL which today looks like a
relatively simple query language. There is one type of SELECT query
returning sequences of sets of variable bindings in a tabular result
like SQL. The query is a fixed pattern and doesn't allow any optional,
union or conditional pattern matching. This was relatively easy to
implement in what I've called a static execution model:

Break the query up into a sequence of triple patterns: triples
that can include variables in any position which will be found by
matching against triples. A triple pattern returns a sequence of
sets of variable bindings.

Match each of the triple patterns in order, top to bottom, to bind
the variables.

If there is a query condition like ?var > 10 then check that it
evaluates true.

Return the result.

Repeat at step #2.

The only state that needed saving was where in the sequence of triple patterns the execution had got to - pretty much an integer - so that the looping could continue. When a particular triple pattern was exhausted it was reset, the previous one was advanced to its next match, and the execution continued.
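In pseudocode the whole engine boils down to something like this (a sketch of the idea, not the actual Rasqal code):

    /* Sketch: 'column' (which triple pattern we are on) is essentially
     * the only execution state kept between results. */
    int column = 0;
    while(column >= 0) {
      if(match_next(triple_patterns[column])) {   /* binds that pattern's variables */
        if(column == last_column) {
          if(constraints_evaluate_true())
            emit_result();                        /* hand one row of bindings back */
        } else {
          column++;                               /* move on to the next pattern */
        }
      } else {
        reset(triple_patterns[column]);           /* exhausted: reset, back up one */
        column--;
      }
    }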

This worked well and executed all of RDQL with no problem. In particular it was a lazy execution model: it only did work when the application asked for an additional result. However, in 2004 RDF query standardisation started and the language grew.

Enter The Sparkle

The new standard RDF query language, which was named SPARQL, had many additions to the static patterns of the RDQL model; in particular it added OPTIONAL, which allowed optionally (sic) matching an inner set of triple patterns (a graph pattern) and binding more variables. This is useful when querying heterogeneous data, where there are sometimes useful bits of data that can be returned but not every graph has them.

This meant that the engine had to be able to match multiple graph
patterns - the outer one and any inner optional graph pattern - as well
as be able to reset execution of graph patterns, when optionals were
retried. Optionals could also be nested to an arbitrary depth.

This combination meant that the state that had to be preserved for
getting the next result became a lot more complex than an integer. Query
engine #1 was updated to handle 1 level of nesting and a combination of
outer fixed graph pattern plus one optional graph pattern. This mostly
worked but it was clear that having the entire query have a fixed state
model was not going to work when the query was getting more complex and
dynamic. So query engine #1 could not handle the full SPARQL Optional
model and would never implement Union which required more state
tracking.

This meant that Query Engine #1 (QE1) needed replacing.

Query Engine The Second

The first step was a lot of refactoring. In QE1 there was a lot of shared state that needed pulling apart: the query itself (graph patterns, conditions, the parse tree), the engine that executed it, and the query result (a sequence of rows of variable bindings). Separating these meant the query engine could be changed independently of the query or the results.

Rasqal 0.9.15 at the end of 2007 was the first release with the start of
the refactoring. During the work for that release it also became clear
that an API and ABI break was necessary as well to introduce a Rasqal
world object, to enable proper resource tracking - a lesson hard learnt.
This was introduced in 0.9.16.

There were plenty of other changes to Rasqal going on outside the query engine model, such as supporting reading and writing result formats, providing result ordering and distincting, completing the value expression and datatype handling, and general resilience fixes.

The goals of the refactoring were to produce a new query engine that was
able to execute a more dynamic query, be broken into understandable
components even for complex queries, be testable in small pieces and to
continue to execute all the queries that QE1 could do. It should also
continue to be a lazy-evaluation model where the user could request a
single result and the engine should do the minimum work in order to
return it.

Row Sources and SPARQL

The new query engine was designed around a new concept: a row source. This is an active object that, on request, returns a row of variable bindings; it generates what corresponds to a row in a SQL result. This active object is the key to implementing the lazy evaluation. At the top level of the query execution there would be basically one call to top_row_source.getRow(), which itself calls the inner rowsources' getRow() in order to execute the query and return the next result.
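In simplified, illustrative C the idea looks roughly like this; these are not the real Rasqal structs or function names:

    /* Sketch of the rowsource idea: an active object that returns the next
     * row of variable bindings on demand, pulling from its inner rowsources. */
    typedef struct row_s row;                  /* one row of variable bindings */
    typedef struct rowsource_s rowsource;

    struct rowsource_s {
      void *state;                             /* this rowsource's execution state */
      rowsource **inputs;                      /* inner rowsources it pulls from */
      row *(*get_row)(rowsource *self);        /* next row, or NULL when exhausted */
    };

    /* Lazy evaluation: getting one more query result is one call on the top
     * rowsource, which recurses into the inner get_row implementations. */
    static row *query_get_next_row(rowsource *top)
    {
      return top->get_row(top);
    }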

Each rowsource corresponds approximately to a SPARQL algebra concept, and since the algebra has a well-defined way to turn a query structure into an executable structure, or query plan, the query engine's main role in preparing the query was to become a SPARQL query algebra implementation. The algebra concepts were added to Rasqal, enabling the hierarchical graph pattern structure to be turned into algebra concepts and the optimization and algebra transformations in the specification to be performed. These transformations were tested and validated against the examples in the specification. The resulting tree of "top down" algebra structures was then used to build the "bottom up" rowsource tree.

The rowsource concept also allowed breaking up the complete query engine
execution into understandable and testable chunks. The rowsources
implemented at this time include:

Assignment: allowing binding of a new variable from an input
rowsource

Distinct: apply distinctness to an input rowsource

Empty: returns no rows; used in legitimate queries as well as in
transformations

Filter: evaluates an expression for each row in an input
rowsource and passes on those that return True.

Graph: matches against a graph URI and/or binds a graph variable

Join: (left-)joins two inner rowsources, used for OPTIONAL.

Project: projects a subset of input row variables to output row

Row Sequence: generates a rowsource from a static set of rows

Sort: sort an input rowsource by a list of order expressions

Triples: match a triple pattern against a graph and generate a
row. This is the fundamental triple pattern or Basic Graph Pattern
(BGP) in SPARQL terms.

Union: return results from the two input rowsources, in order

The QE1 entry point was refactored to look like getRow() and the two query engines were tested against each other. In the end QE2 produced identical results, and eventually QE2 was improved such that it passed more DAWG SPARQL tests than QE1.

So in summary QE2 works like this:

Parse the query string into a hierarchy of graph patterns such as
basic, optional, graph, group, union, filter etc. (This is done in
rasqal_query_prepare())

Create a SPARQL algebra expression from the graph pattern tree that
describes how to evaluate the query. (This is in
rasqal_query_execute() calling QE2 )

Invert the algebra expression to a hierarchy of rowsources where the
top rowsource getRow() call will evaluate the entire query (Ditto)

(If you want to see some of the internals on a real query, run roqet -d debug query.rq with roqet built in maintainer mode, and both the query structure and the algebra version will be generated.)

The big advantage from a maintenance point of view is that it is divided
into small understandable components that can be easily added to.

The result was released in Rasqal 0.9.17 at the end of 2009; 15 months
after the previous release. It's tempting to say nobody noticed the new
query engine except that it did more work. There is no way to use the
old query engine except by a configure argument when building it. The
QE1 code is never called and should be removed from the sources.

Example execution

When QE2 is called by the command line utility
roqet, there is a lot going on
inside Rasqal and Raptor. A simplified
version of what goes on when a query like this is run:

is described in the following picture if you follow the numbers in
order:

This doesn't include details of content negotiation, base URIs, result
formatting, or the internals of the query execution described above.

SPARQL 1.1

Now it is the end of 2010 and SPARQL 1.1 work is underway to update the original SPARQL Query, which was completed in January 2008. It is a substantial new version that adds greatly to the language. The SPARQL 1.1 2010-10-14 draft version adds (these items may or may not be in the final version):

Assignment with BIND(expr AS ?var)

Aggregate expressions such as SUM(), COUNT() including grouping
and group filtering with HAVING

Negation between graph patterns using MINUS.

Property path triple matching.

Computed select expressions: SELECT ... (expr AS ?var)

Federated queries using SERVICE to make SPARQL HTTP query requests

Sub SELECT and BINDINGS to allow queries/results inside queries.

Updates allowing insertion, deletion and modification of graphs via
queries as well as other graph and dataset management

The above is my reading of the major items in the latest draft SPARQL 1.1 query language and its dependent required specifications.

Rasqal next steps

So does SPARQL 1.1 mean Rasqal Query Engine 3? Not yet, although the
Rasqal API is still changing too much to call it stable and another
API/ABI break is possible. There's also the question of making an
optimizing query engine, a more substantial activity. At this time,
I'm not motivated to implement property paths since it seems like a lot
of work and there are other pieces I want to do first. Rasqal in
GIT handles most of the syntax and is
working towards implementing most of the execution of aggregate
expressions, sub selects and SERVICE although no dates yet. I work on
Rasqal in my spare time when I feel like it, so maybe it won't be mature
with a stable API (which would be a 1.0) until SPARQL 2 rolls by.

I have just released Redland librdf library
version 1.0.11 which has been in progress for some time, delayed by the
large amount of work to get out Raptor V2
as well as initial SPARQL 1.1 draft work for Rasqal 0.9.20.

The main features in this release are as follows:

Virtuoso storage backend querying now fully works.

Several new convenience APIs were added and others deprecated.

Support building with Raptor V2 API if configured with
--with-raptor2.

The main change is to start adding to Rasqal's APIs and query engine the changes for the new SPARQL 1.1 working drafts. This release adds support for the syntax of all the changes for Query and Update. The new draft syntax is available via the 'laqrs' query language name until the SPARQL 1.1 syntax is finalized; the 'sparql' query language provides SPARQL 1.0 support.

On Query 1.1, the addition is primarily syntax and API support for the new syntax. There is expression execution for the new functions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN(), which are now usable as part of the normal expression grammar. The existing aggregate function support was extended to add the new SAMPLE() and GROUP_CONCAT(), but remains syntax-only. Finally, the new GROUP BY with HAVING conditions was added to the syntax with consequent API updates, but there is no query engine execution of them yet.

For Update 1.1 the syntax for the full set of update operations was added, and it creates API structures. Note, however, that there seem to be some ambiguities in the draft syntax, especially around multiple optional tokens in a row near WITH, which are particularly hard to implement in flex and bison (aka "lex and yacc").

The main non-SPARQL 1.1 related change is to allow building Rasqal with the Raptor V2 APIs rather than V1. Raptor V2 is in beta, so this is not a final API and is thus not the default build; it has to be enabled with --enable-raptor2 at configure time. When Raptor V2 is stable (2.0.0), Rasqal will require it.