Introducing Spanner: From Documents to Linked Data Apps

TLDR? Spanner is a new product that automagically turns documents (of
any kind) into a full-featured semantic web—or, if you prefer, “Linked
Data”—application that is easily customized or extended via JavaScript.

Interested? Read on for the long-play version, where I address some common
objections to semantic technologies and show how Spanner handles them.

From What to What?

A common lament about semantic technologies: where do the RDF and OWL
come from? If you need to integrate a lot of databases and other structured
sources, converting to RDF and OWL is feasible. In fact, once you know how
to do it, it’s pretty simple.

But what if your data is unstructured, i.e., what if most of it lives in
documents?

Spanner extracts information from documents and converts
that information into RDF and OWL. It’s especially good at entity
extraction when a gazetteer (a list of names of entities, organized by type)
is provided, but works reasonably well without one, too. Spanner uses
machine learning to discover connections between documents, between
entities, and between entities and documents; it will also learn categories
or tags for documents if some of the documents are already tagged. Finally,
it will extract keywords and a single key sentence from every document.

It works for pretty much any kind of MS Office document, as well as HTML,
plain text, email, PDF, etc.

We call this part of the process “data bootstrapping” and it’s fully
automated (of course, if you provide gazetteers or other inputs, the quality
of the process improves, but the only required input is documents).
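
If you want a feel for what "a typed list of entity names in, RDF out" can look like, here is a deliberately tiny sketch. It is not Spanner's implementation (which uses machine learning rather than plain string matching), and everything in it is an assumption made up for illustration: the example.org namespace, the "mentions" property, the entity list, and the function names.

```javascript
// Hypothetical typed entity list: names grouped by type.
// None of these names, URIs, or functions come from Spanner.
const entityList = {
  Person: ["Ada Lovelace", "Alan Turing"],
  Organization: ["Bletchley Park"],
};

const BASE = "http://example.org/"; // placeholder namespace
const RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
const MENTIONS = BASE + "vocab/mentions"; // made-up property

// Turn a name into something safe to put in a URI path.
function slug(name) {
  return encodeURIComponent(name.replace(/\s+/g, "_"));
}

// Case-insensitive substring match of every listed name against the text;
// each hit yields a "document mentions entity" triple and a type triple,
// serialized as N-Triples lines.
function extractTriples(docId, text, list) {
  const triples = [];
  const haystack = text.toLowerCase();
  for (const [type, names] of Object.entries(list)) {
    for (const name of names) {
      if (haystack.includes(name.toLowerCase())) {
        const entity = `<${BASE}entity/${slug(name)}>`;
        triples.push(`<${BASE}doc/${slug(docId)}> <${MENTIONS}> ${entity} .`);
        triples.push(`${entity} <${RDF_TYPE}> <${BASE}vocab/${type}> .`);
      }
    }
  }
  return triples;
}

const triples = extractTriples(
  "memo-1",
  "Alan Turing worked at Bletchley Park.",
  entityList
);
console.log(triples.join("\n"));
```

Naive matching like this misses misspellings, aliases, and ambiguous names, which is the gap a learned extractor is meant to close.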

We Just Want to Publish Linked Data

We’ve got you covered. The core of Spanner is a Linked Data publishing
solution: give it an RDF file or a SPARQL endpoint, and it will publish that
data as Linked Open Data automagically, with very minimal configuration.

Going even further, in Spanner 1.1, you won’t even have to convert information
to RDF or stand up a SPARQL endpoint: Spanner 1.1 will support publishing
native RDBMS data as RDF dynamically, on-the-holy-crap-is-that-cool-fly…
If you don’t need to do anything else but publish Linked Data, Spanner has you
covered.
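
For the curious, the general idea of exposing relational rows as RDF can be sketched in a few lines. This is only an illustration of one common approach (row becomes subject, column becomes predicate, cell becomes literal), not a description of how Spanner 1.1 does it. The table, columns, namespace, and function names below are all hypothetical.

```javascript
const DB_BASE = "http://example.org/db/"; // placeholder namespace

// Quote a value as a plain literal. Only backslashes and double quotes
// are escaped here; a real serializer would also handle newlines etc.
function literal(value) {
  const escaped = String(value).replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `"${escaped}"`;
}

// Map one row to triples: the subject URI is built from the table name and
// the key column's value, each column becomes a predicate, and NULL cells
// produce no triple at all.
function rowToTriples(table, keyColumn, row) {
  const subject = `<${DB_BASE}${table}/${encodeURIComponent(row[keyColumn])}>`;
  return Object.entries(row)
    .filter(([, value]) => value !== null && value !== undefined)
    .map(
      ([column, value]) =>
        `${subject} <${DB_BASE}${table}#${column}> ${literal(value)} .`
    );
}

// A made-up row from a made-up "customer" table.
const row = { id: 7, name: "Acme Corp", city: null };
const rowTriples = rowToTriples("customer", "id", row);
console.log(rowTriples.join("\n"));
```

Doing this per request, instead of exporting the whole database up front, is what "dynamically" buys you: the published RDF never goes stale relative to the source tables.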

Making an Ontology or Schema is Too Hard

There’s good news and there’s better news.

You don’t need an ontology or schema before Spanner
performs data bootstrapping. Don’t have an ontology? Can’t find a
publicly available one? Don’t want to build one? Don’t worry about it.
That’s the good news.

If you do have an ontology or schema, the data bootstrapping process
will just work better. That’s the better news.

There is no bad news.

My Org isn’t Full of SemWeb Developers

You don’t need any semantic technology expertise to use Spanner. Most of
Spanner can be extended, customized, skinned, rearranged, or otherwise
manipulated by writing ordinary JavaScript code against the Spanner APIs,
which are thin and simple and RESTful.

The better news is that you probably won’t need to do much other than
customize the default look-and-feel because Spanner is quite feature-rich.

Anyone who can write JavaScript and use a RESTful
interface will be a savvy semantic technology developer when using
Spanner. And, yes, that means that there still isn’t any bad news.

NLP isn’t Perfect, Or: What About Data Quality?

That’s right—sometimes the results of Natural Language Processing are
awful. What does that mean for users or developers?

Let’s be honest: it means that you’re going to have to take on some data
curation and data quality burden; but, hey, you knew that already. Any
system that helps you pivot from document-based information management to,
well, anything else…is going to require you to curate data. The trick is
doing that at the lowest possible cost.

We realized that, by using machine learning (both unsupervised and
semi-supervised) plus some other tricks, we could build a system that offers
users a flexible means to improve data quality explicitly, while also
allowing (and training) machine learning systems to improve data quality
automatically, too. (We’ll post more technical details about this aspect of
the system as the Spanner 1.0 launch date gets closer.)

Spanner lets regular users—who don’t know anything about any of this
technology stuff—build complex, flexible, ontology-driven apps, all from
very simple web pages…with no “technology leakage”.

There’s no such thing as a free lunch. Combining machine learning, training,
and data curation into one system is the next best thing.

Next Steps

Spanner’s been under development for a year and is in production at some of
our customers already. Everything I’ve talked about in this post is real.

Now we’re looking for a few more reference customers, i.e., early adopters
who are willing to be guinea pigs as we finish up the last bits of
polishing, etc. As the man says, if you need this stuff, you need it bad.
We’re targeting late Q1 2011 for Spanner general availability to early
adopters, reference customers, etc.