This is to let you know what you’re getting yourself into. For example, you probably don’t want to add the rdfs:subClassOf relationship for everything: you’d be dealing with a heck of a lot of statements, enough to trap the bot for hours. Here, we can see that it’s a relatively reasonable subset of the model, so we can pass this construct result into the model:

I’d be interested to hear how people think this could be used, or if it’s useful in a more general sense. Would it be better to simply provide a temporary URI where people could fetch the data? Hm, that sounds almost like a useful service - POST data, get back a URI and some data after having parsed the data and stored it in a local location. Wonder if I’m the only person who could use that…

So, I implemented SPARQL CONSTRUCT queries in Julie yesterday (http://crschmidt.net/noets/115) along with ASK queries (http://crschmidt.net/noets/114). Really, the CONSTRUCT queries are pretty silly to have: who wants to see serialized triples on an IRC interface? but I was thinking: what if instead of spitting the triples into IRC, it sent them into the backend?

I’m sure that the people in charge of SPARQL thought of this long ago, but it just occured to me, that this is a simple way to achieve rules-like processing:

So often lately, I’ve seen some prominent people in the Semantic Web advising that the possibilities of using RDF as a backing store for an application are great, and that maybe people should use “SPARQL as a query language for the application”.

Stop. Don’t do that.

Right now, we’re in a situation where RDF implementations are really usable - if you are aware of their limitations, and avoid them. This is also true in MySQL: You don’t make every one of your queries include half a dozen joins, so you don’t want to do a similar thing in SPARQL. The problem is that with RDF query languages, unless you’ve spent a lot of time with them, you’re working on something that’s much more difficult to understand the work behind. It’s extremely easy to write a query that is not well optimized that take the application a very long time to compute, at the cost of making RDF and SPARQL look as if they are too slow.

They’re not.

For small size web applications, there is no real reason that a properly built site could not be extremely quick, and built on RDF tools. (At the current point in time, I won’t say the same for large applications: I don’t have any knowledge of what happens to triplestores once they get past 2 million triples.) However, most queries that people write are slow, simply because they are not optimized. Depending on how exactly the application-level query translator works, you may be dealing with something that’s not completely optimized, which can have an extreme negative impact on your query time.

An example? Well, my knowledge is mostly in Redland, so I’ll just toss out a query via julie, the redlandbot.

You have a multi-minute wait. (I’ll fill this in when it comes back: it’s been 3 minutes so far, and still going.) Ah, it finished. Just shy of 13 minutes. Someone less versed in Redland would look at that and say “Why?”

Redland’s query engine, Rasqal, works based on the triples it finds as it goes through the query. In this case, for the first query, it finds approximately 100 triples: “Here are the foaf:knows triples pointing from the crschmidt node. Now let’s go find their nicks.” It then has approximately 100 distinct subjects to match in the second part of the query. Now, look at the second query: You’ll notice that the first triple pattern is going to match a lot more triples than the first: in this case, I think it’s approximately 20,000 triples. Then, each of those triples will be matched against the second part of the query. It has to ask approximately 200 times as many questions to the triplestore behind the data. That’s 200 times as long to wait to get all the data out. Since most of the data will probably be “cold” (not cached in the MySQL table cache), you’ll not only be waiting to get it out, but you’ll also probably be emptying out the MySQL cache of anything but this one useless query. All because you got your triple patterns mixed up.

Perhaps this is just a Redland problem: I don’t know. I’ve not used any of the Java-based tools, and I don’t know of any other non-Redland tools for working with SPARQL against a large data store. However, it is an image problem when you advise or offer to “just use SPARQL”: People do not, by intuition, recognize that the above two queries will be any different. Since it’s extremely hard to notice the differences on something that is small scale, it’s hard to catch these mistakes when they start showing up without a fair amount of experience in the application level RDF tool. Ah, it finished, so back up to show you timeframes…

Before you start using SPARQL for an application query language, consider the application you’re using, and how you can optimize it. SPARQL queries can be made to run much faster, generally, if you have a good knowledge of what you’re doing. However, working with the raw triples using the methods the library provides will oftentimes provide a level of insight as to what’s actually going on under the hood that can be extremely useful for knowing how things can work better.

I think that people need to stop advocating SPARQL as the end-all, be all solution for everything in RDF. I’m sure that it can be great, and that there’s tons of great ways to use it. However, one of the most popular RDF engines does not work well with most SPARQL queries that I’ve seen people throw at it. The first step in learning how to properly use SPARQL (for any kind of time-neccesary things: anything over HTTP basically counts here) is learning how to properly manipulate the triples in the way that works best in the application without the query language.

Lately, I’ve been watching Chimezie play with Emeka, his RDF bot. It’s basically a Versa/4suite counterpart to Julie, the redlandbot (based, obviously, on Redland.)

I can’t say anything for his code, but I do know that through people working with Emeka, I’ve seen some Versa queries recently, and I have to say that they confuse the heck out of me.

I just read through the Versa Article on XML.com, and was helped not at all. The language itself makes sense when I read the examples, but I simply can’t come up with the way to do what I would in SPARQL. I’m sure that it’s really easy once you’re used to it, but to me, it seems like a query language with a 663 mode. Sure, I might be able to write it, with a reference handy, and execute it (hi Emeka!), but I sure as heck can’t read it.

I think I like SPARQL because it feels familiar: the turtle patterns are just turtle statements with some variables. The triple patterns *look* like triples. Versa doesn’t have this benefit: triple patterns in Versa become something along the lines of (all() ->rdfs:label->*) rather than (?s rdfs:label ?l). I think it may just be the fact that all the extra syntax confuses me: why put all these bits in the middle of my triples? They’re triples! Spaces are enough!

Anyway, this isn’t very helpful: I haven’t used Versa enough to have useful comments. Just know that reading this stuff, and even trying to wrap my head around the stuff Chimezie is ending to Emeka, I have problems. This probably is true for SPARQL/julie as well for most people - but for me it “just works”.

I’ve been doing some playing with goofy Javascript stuff lately to try to get my head wrapped around it, since I’m going to be needing to implement it in a few tools at work in the near future.

I’ve so far used it in
1. An admin interface for Athena’s email accounts,
2. An inventory listing for a work project
3. The newest one, a “suggestion” field for Wordnet searches against the RDF store I just imported this morning.

Danny alerted me to the existence of a new Wordnet dataset. I grabbed the full set, dropped it into Redland, and set up a sparql search against it. The top box there is the nifty one though: type in a string (say, apple) and watch the right side as a list of suggestions is populated.

I still need to get it actually doing a Google Suggest-like dropdown box, but haven’t had the time to hack WICK to do what I want as far as that goes.

I’m still learning, and as such, the code is sucky. I wouldn’t recommend reading it for an example: it’s a quick hack, but it works. Still many bugs to work out - for example, if you type apple, it still searches for app, appl, apple in the process. But I’ll get there. (Okay, so I just did a few bug fixes that make it much better, and switched the search mechanism to use MySQL rather than an 11 Meg PHP array. Much better now.)

Anyway, I think it’s cool. RDF people can mark it down in the “another SPARQL datastore”, Javascript people can mark it down as “Another idiot trying to use XmlHttpRequest and doing it wrong.”

One of the things I’ve noticed with my Image Region stuff, which I posted about recently, is that it’s slow. I didn’t really think about why: at first, a lot of it seemed to have to do with the client side XSL, or the CSS cropping of gigantic images.

However, I’m now realizing that this is using a regex with a pretty heavy query: The kind of query that I wouldn’t want anyone to run against julie, because it would just take too long.

The reason for this is Redland’s current REGEX implementation: It basically loads all the literals out of the store and does a regex against them after it has them, which is obviously not ideal.

With that in mind, I tried to think of interesting queries which could be done without requiring a regex, and came up with the idea of flickr images searches: show me a closeup of all the regions in a flickr image of mine.

So, now there’s an additional search box on my SPARQL interface: Flickr ID/URI. It then uses the foaf:page part of the photo to query against, which is obviously much faster.

Maybe I’ll expand this: let people put in any flickr photo ID, and display the information using XSLT against an RDF datasource, with a link to the output of the datasource. I’ve got all the tools to do it now running in Python locally, so I don’t think it would be too difficult: I would need to get some error parsing together though. I really wish I could tie PHP / Python code on the web together more easily though…

My post last night was a bit cryptic, so let me walk through a bit more clearly what I’ve been doing, since I seem to have picked up the interest of some more people.

I currently am using Flickr to annotate my photos: primarily because I like their image region annotations, and partially because their API offers me a way to get lots of data out that I’ve put in, which is useful to me. So, that’s what I’m using for photo annotation at the moment, which may change at any point.

Masahide has a flickr2rdf service: flickr2rdf takes a Flickr Photo page URI and exports RDF from it: For example, a picture of myself, my ex girlfriend, and Foghorn Leghorn can be seen, fully annotated, using XSLT+RDF, via the flickr2rdf tool.

Once I have the photo_id of a photo, I can collect all these statements together. Additionally, since I am using tags from GeoBloggers for geolocation, I have a tool which parses out these tags (using the Flickr API) and creates Geo data for them.

I add a few tracking statements: specifically, links to seeAlso the final RDF/XSLT view of the image, (again, Foghorn Leghorn example). I serialize the Model out from Redland, and get a directory full of files full of RDF singletons. From here, I use cwm to process the singletons into an abbreviated RDF/XML file. These files are then synced to the http://crschmidt.net/albums/flickr/ directory. Here, I use a couple little tricks to add an XSLT declaration as the first line of each file, so that the content negotiated version offers XML delivered as application/xml, rather than just application/rdf+xml (which Firefox won’t display in a browser).

Next step is to add each of these files into an RDF model. Since I’m still occasionally changing statements, I’ve been dropping the whole model and readding every time: this doesn’t take too long, as it’s only a few hundred files, and Redland is speedy quick.

So, now we have a database full of RDF statements. Fine. But that’s not too useful. So, I have my SPARQL query interface. Which is all well and good, for people who have lots of knoweldge of RDF. It can provide some cool results.

But it doesn’t really do anything *fun*. So last night, I added an optional checkbox, that said “If you ahve something in a specific query format, process an XSLT file against it”. I tweaked this XSLT from masahide’s example, linked yesterday, into what it is now, which you can see, if you’re interested.

Well, that’s all well and good, but most people don’t understand SPARQL enough to know what they should type in. What’s the use of having to learn a language just to see some pictures? So, my next step was to add a search box specifically for Regions: my sparql page has a search box now specifically for this purpose.

I realized after a couple times, though, that using client side XSLT to process the XML was really slow, clunky, and generally ugly. Not to mention that Mozilla’s XSLT doesn’t let me disable-output-escaping on variables: so, I installed php4-xslt, and started using that implementation on the server side.

Yeah, that’s all well and good too, but now my pretty RDF with queries and all went away! So, I added them back: at the end of the Foghorn Search, in a comment, you’ll see:

Generated using the XSLT stylesheet at http://crschmidt.net/xslt/imgreg.xsl against the data generated by the query:

Converted an XSLT Stylesheet to a different result format. Loaded ~400 RDF files into a Model, totalling 33,000 statements. Added an option to my Sparql Interface. Changed the default query. Made the extra option add the stylesheet.

Ran a query. Tweaked until it worked. Typed it all up here, to share with all of you.

Dave released a new Raptor and new Rasqal today. I’ve built both, and rebuilt my Python bindings so I no longer get segfaults (Almost thought it was a bug, then Dave reminded me of previous “bugs” which were my fault).

As a result, all of my tools on both zeus and athena are now running the latest and greatest in the way of SPARQL, meaning the new query syntax (and I believe, new XML output syntax). I still need to update the examples on my PHP pages, but julie’s code is all up to date.

While I was at it, I took the oppourtunity to do some cleanups that I’ve been wanting to do for a while: You can see the revisions on the rdfpython trunk in trac’s timeline, but here’s a summary:

* Did some rewriting on mortenf’s smusher. I now get owl:sameAs triples in the store, so I have a reversible process to some extent for smushing, as well as making the smusher look for the shortest URI rather than just grabbing the first node it sees as “canonical”. Of course, I did this after a lot of URIs got tossed in my last smushing run… ah well, live and learn.
* Moved more code to use the “parse_anything” function that I wrote, which uses heuristics and logic to try and guess what kind of content we’re dealing with. It depends a lot on Content Types, but is also something I can edit and reload without restarting the bot, which is a major boon for me. This means that if something is broken, I can fix it, and make it more robust, without any kind of guilty concious about flooding channels with joins/parts/quits.
* GRDDL support (with newest raptor) in parse_anything. Since ^add is really now parse_anything, this means that if you add a page with a GRDDL description Redland supports, you’ll get the triples out of it.
* Heuristics of queries, guessing which is which. (Really ‘dumb’ right now: it just looks for ” {”, and considers it Sparql if it has it.)

What does this mean to you, dear user?

Well, quite simply, it means that you will probably support more formats (RSS, SVG, HTML+GRDDL, Atom, Turtle, ntriples) with less work (it’s all done through ^add). You can run queries in either the old format (RDQL) or new format (Sparql), or store either one.

I know that I’m far too lazy to actually describe my images. I never do it. I write tools to help me, and I still don’t. So, my goal is to use tools which do it for me. With Masahide’s EXIF tools, flickr, and flickr2rdf, I can do this, with a little fudging to get the output to flow together better.

I have a lot of photos to describe, and I was going to get to work on it, when I reached for my keyboard… and found the cat laying on it. So, I switched to the other computer (zeus, rather than hermes) and got to work, creating a SPARQL interface for my photos. Maybe if I can search them, I’ll actually describe them.

I haven’t done a whole lot yet, but the start of my work is in place, with a nice SPARQL query against it. Of course, so far there’s only one photo, but this example should get you started, and if you care, you can check out the data to get you started.