Posts tagged with: semsol

The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing", where I wondered if it could be applied to basically any weblog.

Over the last few days I spent some time building an evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have any spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.

In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make it easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers etc.) should not be needed. Even RDF experts don't have much experience with side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:
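A sketch of such a query, with made-up vocabulary terms standing in for the site-specific schema the store actually uses:

```sparql
# List the indexed by-month archive pages, newest first
# (rww: is a hypothetical namespace for illustration)
PREFIX rww: <http://example.com/rww-schema#>

SELECT ?page ?month WHERE {
  ?page a rww:ArchivesPage ;
        rww:month ?month .
}
ORDER BY DESC(?month)
```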

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).
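For illustration, a "most popular authors" widget could be fed by a query along these lines (the SIOC terms and the rww: properties are assumptions, not necessarily the demo's exact schema):

```sparql
# Rank authors by accumulated Facebook Likes across their posts
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX rww:  <http://example.com/rww-schema#>

SELECT ?author (SUM(?likes) AS ?totalLikes) WHERE {
  ?post a sioc:Post ;
        sioc:has_creator ?author ;
        rww:likes ?likes .
}
GROUP BY ?author
ORDER BY DESC(?totalLikes)
LIMIT 10
```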

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can alternatively do the job, too:
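Such a generic query needs no site-specific terms at all; assuming the store supports aggregates (ARC's SPARQL+ does), something like this surfaces the main entity types:

```sparql
# Which entity types occur in the enhanced data, and how often?
SELECT ?type (COUNT(?s) AS ?instances) WHERE {
  ?s a ?type .
}
GROUP BY ?type
ORDER BY DESC(?instances)
```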

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

Aligning the data with the target ontology

In this step, we again use a software agent and break the work down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands. Here are some example operations that I applied to the RWW data:

For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBpedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.

Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).
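The label-based merging mentioned above can be expressed as a single SPARQL 1.1 update. A simplified sketch (it only repoints incoming statements and picks the lexically smaller URI as canonical; a real run would handle outgoing statements too and be iterated):

```sparql
# Merge entities that share an exact label
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

DELETE { ?s ?p ?dupe }
INSERT { ?s ?p ?canonical }
WHERE {
  ?canonical rdfs:label ?name .
  ?dupe rdfs:label ?name .
  FILTER (?dupe != ?canonical && STR(?canonical) < STR(?dupe))
  ?s ?p ?dupe .
}
```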

Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public linked data varies, but the cloud is very powerful: each optimization step adds to the network effects, and you constantly discover new consolidation options. I spent just a few hours on the inferencer; after all, the Linked RWW demo is only meant as a proof of concept.

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are interested in a solution similar to the one described in this post, please get in touch.

"Dynamic Semantic Publishing" is a new technical term which was introduced by the BBC's online team a few weeks ago. It describes the idea of utilizing Linked Data technology to automate the aggregation and publication of interrelated content objects. The BBC's World Cup website was the first large mainstream website to use this method. It provides hundreds of automatically generated, topically composed pages for individual football entities (players, teams, groups) and related articles.

Now, the added value of such linked "entity hubs" would clearly be very interesting for other websites and blogs as well. They are multi-dimensional entry points to a site and provide a much better, more engaging way to explore content than the usual flat archives pages, which normally don't have dimensions beyond date, tag, and author. Additionally, HTML aggregations with embedded Linked Data identifiers can improve search engine rankings and enable semantic ad placement, two attractive by-products.

The architecture used by the BBC is optimized for their internal publishing workflow and thus not necessarily suited for small and medium-scale media outlets. So I've started thinking about a lightweight version of the BBC infrastructure, one that would integrate more easily with typical web server environments and widespread blog engines.

What could a generalized approach to dynamic semantic publishing look like?

We should assume setups where direct access to a blog's database tables is not available. Working with already published posts requires a template detector and custom parsers, but it lowers the entry barrier for blog owners significantly. And content importers can be reused to a large extent when sites are based on standard blog engines such as WordPress or Movable Type.

Step 2: Not-yet-imported posts from the generated blog index are parsed into core structural elements such as title, author, date of publication, main content, comments, Tweet counters, Facebook Likes, and so on. The semi-structured post information is added to the triple store for later processing by other agents and scripts. Again, we need site (or blog engine)-specific code to extract the various possible structures. This step could be accelerated by using an interactive extractor builder, though.

Step 3: Post contents are passed to APIs like OpenCalais or Zemanta in order to extract stable and re-usable entity identifiers. The resulting data is added to the RDF Store.

After the initial semantification in step 3, a generic RDF data browser can be used to explore the extracted information. This simplifies general consistency checks and the identification of the site-specific ontology (concepts and how they are related). Alternatively, this could be done (in a less comfortable way) via the RDF store's SPARQL API.

Step 4: Once we have a general idea of the target schema (entity types and their relations), custom SPARQL agents process the data and populate the ontology. They can optionally access and utilize public data.
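A typical ontology population operation is a simple type mapping, for example pulling extractor-provided DBpedia types into the site schema (namespaces and target terms here are illustrative):

```sparql
# Map DBpedia company types onto the site's target ontology
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX site: <http://example.com/schema#>

INSERT { ?entity a site:Organization }
WHERE  { ?entity a dbo:Company }
```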

After step 4, the rich resulting graph data allows the creation of context-aware widgets. These widgets ("Related articles", "Authors for this topic", "Product experts", "Top commenters", "Related technologies", etc.) can now be used to build user-facing applications and tools.

Use case 2: Improving the source blog. The typical "Related articles" sections in standard blog engines, for example, don't take social data such as Facebook Likes or re-tweets into account. Often, they are just based on explicitly defined tags. With the enhanced blog data, we can generate aggregations driven by rich semantic criteria.
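As a sketch, a socially ranked "Related articles" widget could be driven by a query like the following (the post URI and the rww: properties are made up for illustration):

```sparql
# Related articles for a given post, ranked by social buzz
# rather than shared tags alone
PREFIX sioc: <http://rdfs.org/sioc/ns#>
PREFIX rww:  <http://example.com/rww-schema#>

SELECT ?related (SUM(?likes + ?tweets) AS ?buzz) WHERE {
  <http://example.com/post/123> sioc:topic ?topic .
  ?related sioc:topic ?topic ;
           rww:likes ?likes ;
           rww:tweets ?tweets .
  FILTER (?related != <http://example.com/post/123>)
}
GROUP BY ?related
ORDER BY DESC(?buzz)
LIMIT 5
```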

Use case 3: Authoring extensions. Even the best automated entity extraction APIs are not perfect. With the site-wide ontology in place, we could provide content creators with convenient annotation tools to manually highlight some text and then associate the selection with a typed entity from the RDF store. Or they could add their own concepts to the ontology and share them with other authors. The manual annotations help increase the quality of the entity hubs and blog widgets.

The code bundles on the ARC website are generated in an inefficient manual process, and each patch has to wait for the next to-be-generated zip file. The developer community is growing (there are now 600 ARC downloads each month), I'm increasingly receiving patches and requests for a proper repository, and the Trice framework is about to get online as well. So I spent last week on building a dedicated source code site for all semsol projects at code.semsol.org.

So far, it's not much more than a directory browser with source preview and a little method navigator. But it will simplify code sharing and frequent updates for me, and hopefully also for ARC and Trice developers. You can check out various Bazaar code branches and generate a bundle from any directory. The app can't display repository messages yet (the server doesn't have bzr installed, I'm just deploying branches using the handy FTP option), but I'll try to come up with a work-around or an alternative when time permits.

Running an R&D-heavy agency in the current economic climate is pretty tough, but there are also a couple of new opportunities for semantic solutions that help reduce costs and do things more efficiently. I'm finally starting to get project requests that include some form of compensation. Not much yet (all budgets seem to be very tight these days), but it's a start, and together with support from Susanne, I could now continue working on Paggr, semsol's Netvibes-like dashboard system for the growing web of Linked Data.

An article about Paggr will be in the next Nodalities Magazine, and the ESWC2009 technologies team is considering a custom system for attendees, which is a great chance to get other conference organizers interested. (I see much potential in a white-label offering, but a more mainstream-ish version for Web 2.0 data is still on my mind. Just have to focus on getting self-sustained first.)

Below is a short screencast that demonstrates a first version of the sparqlet (= semantic widget) builder. I've de-coupled sparqlet-serving from the dashboard system, so that I'll be able to open-source the infrastructure parts of Paggr more easily. Another change from the October prototype is the theme-ability of both dashboards and widget servers. Lots of sun, sky, and sea for ESWC ;-)

I've been semi-silently working on something new. A combination of many semwebby things I came across and played with during the last 3 years or so:

semantic markup

smart data

an rdf clipboard

ajax

sparql sparql sparql

sparql + scripting

sparql + templates

sparql + widgets

lightweight, federated semweb services and bots

UIs for open data

semwikis

agile and collaborative web development

So, what happens when you put all this together? At least something interesting, and perhaps semsol's first commercial service. (Or a product: this is all just LAMP stuff and can easily be run on an intranet or a hosted server.) Anyway, there's still some way to go. It's called paggr, the landing page is up, and today I created a first teaser/intro video.

I'll demo the beta (launch planned for November) at the upcoming ISWC during the poster session (my poster is about SPARQL+ and SPARQLScript, the two SPARQL extensions that paggr is based on). I may have early invites by then.

As a preparation for the hopefully busy fall and winter months, though, I'll be on vacation for the next two weeks. No Email, no Web, no Phone. Yay!

I'm still working on the new website for ARC, but I managed to set up a group mailing list yesterday. It's a little (*cough*) experimental, based on ARC2 and Trice (another forthcoming semsol product). So, this is a shout-out to ARC users and developers with an invitation to subscribe and help me test that "DIY SPARQL Mailman" before I do a proper announcement for the new site and community tools (hopefully later this week).

Introduced semsol and gave a mini-talk on the Semantic Web at Düsseldorf's first Web Monday.

Ha, I haven't even fully made the move to my new (self-)employer yet, and Web Monday is already coming to Düsseldorf (joining about 20 other cities in Germany). The first event was yesterday and happened in the cool (style-wise) and hot (summer is back!) Lounge of the Mediadesign University.

I took the opportunity to introduce semsol to the local Web crowd, but also put on my SWEO hat and signed up for a short presentation. For better marketing, I've been thinking a bit about distributing a set of single-page tech flyers recently (called "SemWeb on a Slide", inspired by the classical "Semantic Web Illustrated" series, although I'm not there yet). So, I tried a first version, and given the feedback I think this sort of scoped material has a lot of potential. Someone already asked for a version covering semantic markup. Anyway, the other talks were way cooler than mine (at least for me ;), I especially liked Siggi Becker's "Utopia is not a Trend", and the presentation of MIXXT, which seems to be People Aggregator done right.

Just in case you're waiting for email replies: I'm moving to the new office for my semweb startup this/next week and probably won't be online again before Mon 14th. Well, I'm sort-of online now, but this ~0K dial-up can't really be called online...

This is going to change everything. Well, almost. I will continue to work on my Semantic Web solutions, but there will be a major re-branding and finally a focused roadmap. My code experiments and projects are going to be critically reviewed and consolidated. (I can't tell yet what stuff is going to be continued, but I'll keep my SWEO commitments, esp. the knowee community project which is going to start in April).
Quite some orga action coming up, but I'm looking forward to a clean bengee.reboot()

I'll move from Essen to Düsseldorf, which is closer to Cologne and the DUS airport, and a bit less on the Web periphery here, with the Ruhr Valley still within reach, though.

The appmosphere wordplay is going to be discontinued. No German really managed to pronounce or remember it correctly, and the *-osphere naming is rather overused these days anyway.

The new brand will most probably be semsol.com, which is going to be transformed into a Semantic Web Agency. (I've always been a frontend developer; combining this with an in-house RDF system will hopefully form a nice USP for the anticipated move towards info-driven Web apps.)

The open source RDF framework currently named semsol will get a new name (perhaps just "semsol suite", we'll see), and there will be more product-style solutions (a browser, an editor, a schema manager, etc.).

ARC will keep its name, but is going to be re-coded as ARC2 based on the experience and feedback obtained so far.

Yeah, I know, these prediction posts are getting inflationary, but when they mention the core ingredients I'm trying to integrate into a product, it may be worth a post.

So, Richard MacManus' Read/WriteWeb team predicts for 2007:
"structured data will be a big trend next year""Widgets exploded in 2006 but will continue rising in 2007""Semantic Web products will come of age in 2007""social networks will probably also become more open - and data portability will start to occur"
Would be cool if they are right ;)

I've put up a little preview site for the SemSol framework today. Not much yet, just a basic description and some screenshots. A first public release is planned for Q1/2007, I'd like to test it with some other projects first.

Speaking of projects: you may have noticed that my 10/2006 re-re-launch version of the SemanticWeb.org site has been removed (if you noticed the relaunch at all), DERI is going to put more internal resources on the portal from now on. Although it would have been a great stress-test project, I have to admit that using already mature tools like WordPress, MediaWiki, and Drupal reduces risks on their side and also frees a lot of resources here. The new site is going to get a conceptual change, but I'll try to make the already aggregated and manually created data available via some other community project. The new site I plan to test SemSol with is going to be a Semantic Social Networking Service which will also provide some editorial content (my personal little non-W3C SWEO initiative). The SNS will of course be RDF-based, so there is still a back-integration option for semanticweb.org, should it get a SPARQL upgrade at some later stage.

The great thing about creating apps with the SemSol framework is that I'll never have to worry about database tables and multi-table SQL queries again. One of the few "taxes" that come with such an RDF-everywhere approach, though, is the need for proper terms for each type of data structure (users, permissions, pages, posts, comments, events, etc).

Luckily, there are a lot of vocabularies available already, and today I noticed that usable namespace URIs were added to the HTTP Vocabulary in RDF Editor's Draft a couple of days ago. It then took only a 30-minute routine to add request tracking to SemSol, and with SPARQL being so easy to use it should be simple to create basic usage reports as well.
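A basic usage report then boils down to a simple aggregate query (the http: namespace and term names follow the editor's draft and may differ between draft versions):

```sparql
# Count logged requests per requested URI
PREFIX http: <http://www.w3.org/2006/http#>

SELECT ?uri (COUNT(?req) AS ?hits) WHERE {
  ?req a http:Request ;
       http:requestURI ?uri .
}
GROUP BY ?uri
ORDER BY DESC(?hits)
```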

Just a reminder: The call for the SemWeb Challenge 2006 ends this week (Friday, 14th). If you are working on an RDF app, consider participating. It's a lot of fun, you'll get incredibly useful feedback, and the organizers clearly deserve loads of submissions for running this community event!

Unfortunately, I'm not going to participate this year as I have to focus on coding during the next months in order to push my SPARQL CMS thingy to version 1.0. And maybe I should stop listening to Star Wars tunes while I'm working on layout stuff...