SPARQL; useful, but not a game-changer

Not that anyone would ever mistake me for a query language guru, but that’s really part of the problem; I’m not a query language guru, because I’m a Web guru, and to a certain extent those two roles are incompatible.

The Web doesn’t do generic query, and it’s a better large scale distributed computing platform as a result. The cost of satisfying an arbitrary query is too large for most publishers to absorb, as they do when they internalize the cost of operating a public Web server and answer GET requests for free.

The Web does per-resource query, which is a far more tightly constrained form of query, if you can even call it that. It makes use of hypermedia to drive an application to the results of a query without the client needing to perform an explicit query. Think of a Facade in front of an actual query processor, where the user provides the arguments for the query, but has no visibility into the actual query being performed. FWIW, this isn’t an unfamiliar way of doing things, as it’s how millions of developers use SQL when authoring Web apps; a user enters “30000” in a form field, hits submit, and then some back-end CGI invokes “select name, salary from emp_sal where salary > 30000”.
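To make that facade concrete, here’s a minimal sketch of the back-end pattern, using SQLite as a stand-in for a real database. The emp_sal table and its columns follow the example above; everything else (the function name, the sample rows) is illustrative.

```python
import sqlite3

def salaries_above(threshold):
    """The facade: the client supplies only a threshold; the actual
    SQL query stays hidden behind the server's form handler."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp_sal (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO emp_sal VALUES (?, ?)",
                     [("Alice", 45000), ("Bob", 25000)])
    # The client never sees this query; it only provides the parameter.
    rows = conn.execute(
        "SELECT name, salary FROM emp_sal WHERE salary > ?",
        (threshold,)).fetchall()
    conn.close()
    return rows

print(salaries_above(30000))  # → [('Alice', 45000)]
```

The point of the pattern: the server fixes the shape of the query up front, so it can index, cache, and bound the cost of every request.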

I’m confident that SPARQL will be used primarily the same way SQL is used today, and that you won’t see many public SPARQL endpoints on the Web, just as you don’t see many SQL endpoints on the Web. There’s nothing wrong with that of course, but I think it’s important to keep our expectations in check; SPARQL is likely not going to enable new kinds of applications, nor help much with multi-agency data integration, nor do much else that doesn’t involve helping us with our triples behind the scenes.

Comments (21)

I echo your sentiments about being realistic about the value SPARQL brings to the table. I personally think using a query syntax based on a relational model to facilitate querying a knowledge representation is a little backwards, but that’s where the main thrust of standardizing RDF query languages has been aimed.

Having said that, though, I do think that while SPARQL will not change the face of web applications, it *will* have more of an impact on web applications than SQL did, primarily because it *does* help with multi-agency data integration: unambiguous representation (or at least more so than its predecessors in the web space) is the crown jewel of RDF and RDF-related technologies and *is* the first major step towards multi-agency data integration.

I think in advocacy of the Semantic Web, people tend to get carried away with the components and miss the bigger (engineering) picture. It’s the combined value of: 1) relying on a ubiquitous transport layer (the current web), 2) unambiguous knowledge representation (the primary reason the web in its current state does per-resource query better than generic query is the lack of a common machine-understandable representation), and 3) a common query language for software processes (agents) to work with.

Though I think the choice of syntax leaves a bit to be desired, I wouldn’t agree that RDF querying languages (in general) will (in the end) only be as useful as their predecessor: SQL.

A thought that has crossed my mind that may be directly relevant:

Consider a GRDDL profile of XHTML documents that extracts XForms components into your RForms. By virtue of relying on RDF for representation, and on a mechanism for extracting unambiguous RDF statements about a URL (a distributed resource) that includes visual mechanisms to edit itself, you have some amount of transport semantics that could be more useful to a software agent (pointed at the URL) than a top-heavy, oversized service model (top-heavy precisely because such models fight an uphill battle in not having a way to capture certain semantics in a universally understood way).

Of course, this is a scenario where you bypass the query language (it wouldn’t make sense to have a SPARQL endpoint over RForm data when you can extract it for ‘free’) to export a resource with a different (and more expressive) modality. However, the immediate value in doing something like this (where there isn’t much of an alternative with the ‘vanilla’ web) is an argument (an indirect one) for the likelihood that SPARQL will be more useful for the web than SQL was, mostly due to the value of RDF itself.

Actually, I think the key reason why we WILL see more SPARQL endpoints than SQL endpoints (yuck, endpoints) is that SPARQL as specified is read-only. That leaves privacy as the primary concern around exposing public SPARQL services.

Plus I’ll be putting rather a lot of them public myself over the coming months :) (and I’ve never ever put a public SQL endpoint on the web)

Chimezie, thanks for the response. It sounds like our main disagreement is over the utility of SPARQL for multi-agency integration. What I mean by that, if it wasn’t clear, is that because I don’t see public SPARQL endpoints becoming common, that integration will have to occur by simply GETting RDF from multiple sites and shoving it into a triple store, at which point SPARQL can be useful on it. And in the 2nd paragraph above, I tried to explain why I felt public SPARQL endpoints wouldn’t be common. Do you disagree with those reasons? Let me elaborate on them in my response to Ian …

Ian – I reckon it’s trivial to configure a SQL database to make it read-only, so I’m not sure it’s that big a deal. I suppose it might help make it a little more attractive, but I don’t see it being enough of a difference to make SPARQL endpoints “common”.
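For what it’s worth, here’s how trivial that configuration can be; a sketch using SQLite, where read-only access is a single URI flag (the file path and table are illustrative):

```python
import os
import sqlite3
import tempfile

# Create a throwaway database with one row, then reopen it read-only.
path = os.path.join(tempfile.mkdtemp(), "data.db")
rw = sqlite3.connect(path)
rw.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1);")
rw.close()

# mode=ro makes the connection read-only at the database-driver level.
ro = sqlite3.connect("file:" + path + "?mode=ro", uri=True)
print(ro.execute("SELECT x FROM t").fetchone())  # → (1,)

try:
    ro.execute("INSERT INTO t VALUES (2)")
except sqlite3.OperationalError:
    print("write rejected")  # the read-only flag blocks all writes
```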

I think the issue is much more subtle because you need to consider the costs to the would-be publisher. A large part of the reason the Web is so successful is because it managed to dramatically lower the cost of making data available to third parties. SPARQL changes that because instead of exporting the brain-dead-simple-and-cheap-to-support (because it’s trivial to optimize for) GET interface, it exposes a much more complex interface that cannot be easily optimized for. This makes the cost of executing an arbitrary query significantly more expensive.

If there were to be more SPARQL endpoints, I’d expect one or both of the following to be the case: that clients will need a pre-existing relationship with the publisher and need to authenticate to use the service, or that the publisher has a vested interest in seeing SPARQL succeed and is therefore willing to absorb those increased costs I mentioned.

I am a little late to the discussion!
SPARQL solves the major issue of querying a federation of Data Spaces such as blogs, wikis, discussion forums, etc. There are going to be numerous SPARQL endpoints in time. The pieces are coming together really fast in the form of shared ontologies and instance data for those ontologies, etc.

The key to all of this is infrastructure for integrating disparate schemas via ontology mapping, which is what you get via RDF, RDFS, OWL, etc. With such in place, the power of SPARQL (Language, Protocol, and Results Serialization) is much clearer. Here is an example of SPARQL working against real data from a SQL schema that’s mapped to SIOC: http://virtuoso.openlinksw.com/wiki/main/Main/ODSSIOCRef .

SPARQL is ultimately going to have a layer or two above it as part of the general evolution of the Web. It isn’t perfect, but it is headed in the right direction and better than the alternatives (IMHO).

SPARQL solves the federation problem, I agree, but not the decentralization problem which I think a lot of people think it solves. The difference is that the former is data from a single authority (e.g. Bob has 4 data sources he wants to integrate), while the latter is multiple authorities (e.g. Alice has a data source, Mary has a data source, …).

I tried to explain why I felt this was the case in the post; that the SPARQL interface is too complex and therefore too costly to expose to the public.

Sean Martin 06/09/21

Hello Mark,
It seems to me there are few other things you might consider factoring in to your thinking.

While it is true that open-ended query is going to be a lot more expensive than simple web GETs (just look at what happened to servers when we crazies moved from static content to dynamic page composition in the early 90s), many found then that it was well worth the extra expense because the new value generated for them was so immense and/or the cost savings were so significant. Note that we were forced to come up with a whole series of new technologies & techniques to scale up this functionality to meet the demand, and at the time it really did look more or less insurmountable. But just look at what we can do on the web now as a result.

I would argue that the current crop of Web 2.0 companies generating their value using mash-up techniques to do fairly simple & limited data source integration are my (very) earliest proof points. While it may or may not be RDF/SPARQL in the end, the value of the kind of wide-scale integration of information that they seem to promise is so enormous that it seems certain to me that technologies & standards that make it easier to do wide-area federation and integration are going to have to exist in the not too distant future. Oh, and another thing: the availability of digital data continues to rise exponentially, moving from a fire hose to a stream to Niagara Falls in a relatively short space of time. You know that people are going to find interesting, valuable ways to combine it all.

Finally, remember that it gets much cheaper to do more compute as hardware (both CPU and storage), software, and data center administration techniques improve, and costs per cycle/per GB continue to plummet. Just look at the task Google takes on; I read just today that they plan to index a billion pages. This is possible both because the cost of compute has fallen and because they have found business value in taking on the task. All of this was more or less unimaginable just ten years ago, and at least twice as more-or-less unimaginable eleven years ago.

I certainly agree that mashups are a big deal, but they all use GET AFAICT.

Regarding your first paragraph though, I think we agree about the mechanism, but disagree about the scale; my reasoning for concluding that public SPARQL endpoints won’t be common is based on that same reasoning, but I think it’s enough to drive its use to, say, a few dozen or hundred services rather than the millions many might envision. But it’s also not like SPARQL permits data to be exchanged that is otherwise unable to be exchanged; it just does it in a more expensive way than GET provides, which is why I expect GET to win.

I don’t buy the cheaper-resources argument though. The cost will always be considerably greater than GET because there are far more degrees of freedom in how a query can be constructed, and therefore far fewer ways in which to optimize. We’re talking orders of magnitude difference here.

Chris Bizer 07/03/08

Another late comment on this discussion.

I think we will end up with a mixture of RDF crawling and SPARQL queries over the crawled data.

It sounds like we’re in violent agreement, Chris, because you said “I agree that requiring each data source to provide a SPARQL endpoint is not a good idea” and that was really the point of my post.

My other comment to you – that SPARQL doesn’t have much to do with the Web – was premised on exactly that point, because SPARQL isn’t something that two independent agents on the Web need to agree upon.

You and I have discussed this endlessly, so I don’t expect to convince you; but I really must flag yr incredibly tendentious claims here for others who don’t know better. SPARQL does *not* “expose a much more complex interface than GET”. That’s just *wrong*, and I know that you know it’s wrong. Bad form, Mark.

SPARQL encodes queries exactly into GETs! The first SPARQL protocol client for Python is *dozens*, not hundreds, of lines of code, most of which is XML results handling. SPARQL servers are as simple to develop.

What you are hung up on, of course, is that we used WSDL 2.0 to specify the SPARQL protocol, and you think WSDL is bad for the Web, or something. Hey, fine, ride yr REST hobby horse for all it’ll gain you. I think that’s great if it works for you.

(I’ll point out briefly that the SPARQL protocol spec merely uses WSDL to formalize the protocol; it in no way requires anyone, developers or users, to use WSDL *in any way whatsoever*. None. All of Mark’s very vague claims about “complexity” notwithstanding.)

But, in reality, SPARQL makes nice, if not very adventurous, use of HTTP, using GET to pass around queries, and POST when the query is too long to serialize in a URL.
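As a rough illustration of that point, a SPARQL protocol GET is just the query text serialized into a URL parameter. The endpoint URL and query below are made up; only the `query` parameter convention comes from the protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the SPARQL protocol carries the query text in a
# single 'query' parameter of an ordinary HTTP GET.
ENDPOINT = "http://example.org/sparql"

def sparql_get_url(query):
    # Percent-encode the query and append it to the endpoint URL.
    return ENDPOINT + "?" + urlencode({"query": query})

q = "SELECT ?name WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }"
url = sparql_get_url(q)
print(url.startswith("http://example.org/sparql?query="))  # → True
```

A real client would then issue the GET (falling back to POST when the URL grows too long, as noted above) and parse the XML results.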

Everything else you’ve said about its complexity, and about how it violates WebArch, is suspect and disputed by other people who also know a few things about the Web.

I expect more fairness and accuracy when summarizing views you don’t agree with, Mark! :>

[Sorry for the delay in posting the comment - it got caught by Akismet.]

Kendall – I’m saying that SPARQL *is* useful! I look forward to using it myself, as I do a reasonable amount of work with RDF. I’m just pointing out its limitations based on my understanding of Web architecture.

You’re free to disagree of course, but it would be good if you could provide technical reasons why I’m wrong rather than an appeal to authority (as fine an authority as that is, assuming you mean DanC).

Regarding your “I know that you know it’s wrong” comment, I really truly don’t. Encoding a query into a URI is a significantly different approach than a more Web-like hypermedia based URI/form/GET interaction. Also, my position outlined here wouldn’t change had WSDL not been mentioned at all.

Jeryl – I didn’t say “more”, I said “not many”, where “not many” is measured relative to how many queryable data sources there are. I haven’t personally seen any SPARQL endpoints listed alongside RSS endpoints, though I don’t doubt that there are some. So what proportion of RSS feeds have SPARQL endpoints? 1 in 100000? 1 in 1000000? I’d count all of those as “not many”.

1. I think that SPARQL is a lot more flexible than one thinks at first. Of course one major use is for clients to query a server. But another one that seems very appealing is as a replacement for forms, as a way for servers to query clients. I have described a sketch of how this could work here: http://blogs.sun.com/bblfish/entry/restful_semantic_web_services
This would of course not at all be heavy on the server. Something to investigate.

2. As for clients asking servers, one should not underestimate the complexity of queries sent to search engines. Search engines like AltaVista, where I worked, had some very long-running queries. AV would let them run if there was enough CPU available. Most queries, though, were just one-word queries (which had their own problems, of course: how the hell do you know what someone is looking for when they just enter one word?). What is needed is for SPARQL endpoints to be toughened for the open world: you should be able to specify policies for how long queries can last, how much CPU (or what percentage of CPU) they can use, and so on. Without that kind of functionality I agree with Mark that SPARQL endpoints will not be viable (unless the engine only makes a very limited number of relations available).
Yes GET to static resources will always be the most common scenario on the web. Those resources are easy to represent, easy to cache, easy to optimise. SPARQL queries are really powerful. But one does not calculate the value of something by how many of those things there are. Without search engines the web would not be anywhere close to as interesting as it is. So SPARQL endpoints (server side) may not be numerous, but that does not mean they may not be game changers.

So from the reasoning above I see the following:
a. Just as every large company has a search engine, every large company will have a SPARQL endpoint.
b. Every client will become a SPARQL endpoint.
c. Every application will become SPARQL conscious (and of course RESTful)
d. The above seems to point to a p2p SPARQL world

Each of these scenarios requires a different type of SPARQL technology, btw.
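The open-world policy idea above (time/CPU budgets per query) could look something like this sketch. Here run_with_budget and the stand-in query functions are hypothetical; a real engine would also meter CPU and memory, not just wall-clock time.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def run_with_budget(run_query, query, seconds):
    """Run a query with a wall-clock budget; refuse results over budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_query, query)
    try:
        return future.result(timeout=seconds)
    except TimeoutError:
        return None  # policy: reject queries that exceed the budget
    finally:
        pool.shutdown(wait=False)

# A cheap query completes within budget ...
print(run_with_budget(lambda q: q.upper(), "select *", 1.0))  # → SELECT *

# ... while an expensive one gets cut off.
def slow(q):
    time.sleep(0.5)  # stand-in for an expensive query
    return q

print(run_with_budget(slow, "select *", 0.1))  # → None
```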

Your resource-consumption limitation idea is an interesting one, but my POV already accommodates it, in the sense that all “queries” that fall below this threshold for a given publisher will be handled via forms, URIs, and GET rather than SPARQL-to-a-SPARQL-endpoint (see the cost-internalizing paragraph in my post). You might argue, as you say there, that SPARQL queries are a kind of form, but I don’t think so, because they give the client more expressive power than what a traditional form affords, and therefore increase costs for the server/publisher.

Another way to look at it is that different publishers will clearly have different resource consumption limits, and that there’s obviously going to be a lot more with low limits than there will be with high limits. So for any given query, the number of endpoints that will process it successfully will be low, thereby reducing the value of querying that data in that way. The only way to increase the number of endpoints that can process an arbitrary query is to keep the query really simple … which is the motivation behind a forms driven approach to “query”.

As for your point about search engines already having resource-consumption limits built in, I know that’s true. But it’s the same situation whether the query arrives via an HTML form submission or a SPARQL document, so I don’t think it matters for the purposes of comparing the two approaches.

Thanks for the update, Kingsley. I do agree things have evolved, but I don’t see how any of those changes attempts to resolve the architectural differences I described. I did mention federation (as “multi-agency”), but I didn’t single it out because of the lack of a standard (which SPARQL-FED clearly addresses); rather, for similar architectural reasons. So like SPARQL itself, I see SPARQL-FED as suitable for pre-planned partnerships that have all agreed a priori to support federation, rather than the ad-hoc mashup aggregations we’re used to seeing on the Web.
