
Last week I was at MBC09, a conference about all things microblogging. It was organized by Cem Basman, who not only managed to get very interesting speakers on stage, but also got the event a mention in Germany's main TV news.

The conference started slowly, with somewhat underwhelming sponsored keynotes, but then quickly turned into a great source of inspiration, thanks to barcamp-style tracks during the rest of the event. I particularly enjoyed the session about Communote (a microblogging system for corporate use), a talk by Marco Kaiser about near-client XMPP at Seesmic, and a panel about "Twitter and Journalism" (really entertaining panel speakers).

As usual, I pulled a near-all-nighter to hack on a funky demo, only to get on stage and face a projector that didn't like MacBooks. Luckily, I was co-speaking with Sebastian Kurt, who had slides on using Twitter as an interface to remote apps (todo list management and similar things), so I didn't have to fill the whole 30 minutes stuttering about demos that no one could see. Given the circumstances, the session didn't go too badly. Interestingly, the Zemanta and Calais APIs triggered most of the questions.

I've now uploaded my slides (and added some screenshots of the demo prototypes), in case you're interested in this stuff, wonder what I would have talked about, or didn't get to see the demos after the session.

A couple of weeks ago I wrote about the exciting possibilities of LOD-enabled NLP APIs, and asked whether they could bring another power-up to RDF developers by simplifying the creation of DIY Semantic Web apps. When Thomson Reuters released Calais 4.0 two days ago, I had a go.

The idea: Create a simple tool that aggregates bookmarks and microposts for a given set of tags (from Twitter, identi.ca, Delicious, and ma.gnolia), pumps them through Calais and Zemanta, and then lets me browse the incoming stream of items based on typed entities, not just keywords. Something like a poor man's Twine, but with a fully-fledged SPARQL API and content automagically enhanced from LOD sources. Check out this month's Semantic Web Gang podcast for more details about Calais.
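The first step of that pipeline is unglamorous plumbing: pulling matching items from the four services and normalizing them into one shape before they go to the NLP APIs. A minimal sketch of that aggregation step (field names like `source`, `url`, and `text` are illustrative, not the actual HD Streams schema):

```python
def normalize(source, raw):
    """Map a service-specific item (tweet, dent, bookmark) to a common shape.
    Different services name the link/text fields differently."""
    return {
        "source": source,
        "url": raw.get("link") or raw.get("url"),
        "text": raw.get("text") or raw.get("description", ""),
    }

def aggregate(feeds):
    """feeds: {service_name: [raw items]} for a given tag set.
    Returns normalized items, de-duplicated by URL."""
    seen, items = set(), []
    for source, raws in feeds.items():
        for raw in raws:
            item = normalize(source, raw)
            if item["url"] and item["url"] not in seen:
                seen.add(item["url"])
                items.append(item)
    return items
```

De-duplicating by URL matters here because the same link often shows up on Twitter and Delicious within minutes of each other.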

I set myself a time limit of one person-day, so I ended up with just a very basic prototype, but it already shows the network effect that kicks in when distributed data fragments can be connected through shared identifiers. Each of the discovered facets can be used as a smart filter (e.g. "Show me only items related to the Person Tim Berners-Lee"), and we could also pull in more information about the entities, as we know their respective LOD URIs.
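To give an idea of how a facet click turns into a query: the sketch below builds a SPARQL SELECT that filters items down to those mentioning a given typed entity. The `hds:entity` property and the `hds:` namespace URI are hypothetical stand-ins, not the real HD Streams vocabulary:

```python
PREFIXES = """PREFIX hds: <http://example.org/hds#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
"""

def facet_query(type_uri, entity_uri):
    """Build a query for 'items related to <entity> of type <type>',
    e.g. the Person Tim Berners-Lee."""
    return PREFIXES + """SELECT DISTINCT ?item WHERE {
  ?item hds:entity <%s> .
  <%s> rdf:type <%s> .
}
""" % (entity_uri, entity_uri, type_uri)
```

The point is that the entity URI, not a keyword string, is what both the facet and the item share.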

Wish I had funds to explore this a little more, but below is a screenshot showing the "HD Streams" test app in action. It basically sends each micropost and bookmark to the APIs and then does lookups against DBpedia, Semantic CrunchBase, and Freebase to retrieve additional type information. It also runs a set of SPARQL+ INSERT queries to accelerate later filtering.
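Those lookups only need to happen once per entity; afterwards the result can be cached locally with an ARC2-style SPARQL+ `INSERT INTO <graph> { ... }`. A sketch of what such a caching query could look like (graph name and properties are illustrative):

```python
def cache_type_insert(graph, entity_uri, type_uri, label):
    """Build a SPARQL+ INSERT that stores an entity's type and label
    locally after a DBpedia/CrunchBase/Freebase lookup, so facet
    queries don't have to hit the remote sources again."""
    return """PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT INTO <%s> {
  <%s> rdf:type <%s> ;
       rdfs:label "%s" .
}
""" % (graph, entity_uri, type_uri, label.replace('"', '\\"'))
```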

There are some false positives (e.g. the Calais NLP service itself gets typed as a place), but the APIs return a score for each detection, and I've set the threshold for inclusion very low. The interesting thing is that the grouping of items in the facets column is actually done via LOD information. The APIs only return IDs (or URIs), say, for Berlin, but this reference allows HD Streams to pull in more information and then associate Berlin with the "Place" filter.
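The threshold itself is a one-liner. Both APIs report some per-entity relevance score; the field name below is illustrative, and the deliberately low bar trades a few false positives for a richer set of facets:

```python
THRESHOLD = 0.1  # deliberately low: more facets, at the cost of noise

def keep(entities, threshold=THRESHOLD):
    """Keep any detected entity whose score clears the bar."""
    return [e for e in entities if e.get("score", 0) >= threshold]
```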

This, however, is only the simplest use. The really exciting next step would be smart facets based on the aggregated information. Thanks to SPARQL, I could easily add filters that dive deeper into the LOD-enhanced graph, like "posts related to capitals in Europe", or to places within a certain lat/long boundary, or with a population larger than x, or about products by competitors of y.
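The population variant of such a smart facet would only be a FILTER clause away, since DBpedia carries that data. A sketch, with `hds:entity` as a hypothetical item-to-entity property and `dbo:populationTotal` as the DBpedia property:

```python
def population_facet(min_population):
    """Build a query for items mentioning places whose population,
    looked up via the LOD graph, exceeds min_population."""
    return """PREFIX hds: <http://example.org/hds#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?item ?place WHERE {
  ?item hds:entity ?place .
  ?place dbo:populationTotal ?pop .
  FILTER (?pop > %d)
}
""" % min_population
```

The app itself never stored population figures; the facet works purely because the shared entity URI bridges the local items and the remote data.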

Something the prototype is not doing yet is expanding shortened URLs; those could be normalized. Calais 4.0 does URL extraction already, so this would just be another SPARQL query and a little PHP loop. We could then add a simple ranking algorithm based on the number of tweets about a certain URL.

The whole app took just about 12 hours of work; RDF's extensible data model accelerated development through all stages of the process (well, ok, not during the design/theming phase ;). I didn't have to analyze the data coming from the two APIs at all, and there were no pre-coding schema considerations. I just loaded everything into my schema-free RDF store and then used incrementally improved graph queries to identify the paths I needed. For geeks: below is the SPARQL+ snippet that injects LOD entity and label shortcuts from Zemanta results directly into the item descriptions ($res is the URL of an RSS item or bookmark; hds is the namespace prefix used by HD Streams):
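The original snippet is not reproduced here, so the following is an illustrative reconstruction, not the actual HD Streams code: a SPARQL+ INSERT that copies entity URIs and labels found in the Zemanta result graph onto the item itself, so that later facet queries need only one hop. The `z:` properties and graph layout are assumptions about the Zemanta RDF, and the `res` parameter plays the role of `$res` from the PHP app:

```python
def shortcut_insert(res):
    """Reconstruction sketch: inject LOD entity/label shortcuts from a
    Zemanta result into the item's own description. z:doc, z:object and
    z:title are hypothetical stand-ins for the real Zemanta vocabulary."""
    return """PREFIX hds: <http://example.org/hds#>
PREFIX z:   <http://s.zemanta.com/ns#>
INSERT INTO <%s> {
  <%s> hds:entity ?entity ;
       hds:entityLabel ?label .
}
WHERE {
  ?markup z:doc <%s> ;
          z:object ?entity .
  ?entity z:title ?label .
}
""" % (res, res, res)
```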

I've said it before, but it's worth repeating: RDF and SPARQL are great solutions for today's (and tomorrow's) data integration problem, but they are equally impressive as productivity boosters for software developers.