Posts tagged with: paggr

Several months ago (ugh, time flies) I posted a screencast demo'ing a semantic HTML editor. Back then I used a combination of client-side and server-side components, which I have to admit led to quite a number of unnecessary server round-trips.

In the meantime, others have shown that powerful client-side editors can be implemented on top of HTML5, and so I've now rewritten the whole thing and turned it into a pure JavaScript tool as well. It now supports inline WYSIWYG editing and HTML5 Microdata annotations.

The code is still at beta stage, but today I put up an early demo website which I'll use as a sandbox. The editor is called Swipe (like the dance move, but it's an acronym, too). What makes Swipe special is its ability to detect the caret coordinates even when the cursor is inside a text node, which is usually not possible with W3C range objects. This little difference enables several new possibilities, like precise in-place annotations or "linked-data-as-you-type" functionality for user-friendly entity suggestions. More to come soon...

The previous post described a generic approach to BBC-style "Dynamic Semantic Publishing", where I wondered if it could be applied to basically any weblog.

Over the last few days I spent some time on a test evaluation and demo system using data from the popular ReadWriteWeb tech blog. The application is not public (I don't want to upset the content owners and don't have any spare server anyway), but you can watch a screencast (embedded below).

The application I created is a semantic dashboard which generates dynamic entity hubs and allows you to explore RWW data via multiple dimensions. To be honest, I was pretty surprised myself by the dynamics of the data. When I switched back to the official site after using the dashboard for some time, I totally missed the advanced filtering options.

In case you are interested in the technical details, fasten your data seatbelt and read on.

Behind the scenes

As mentioned, the framework is supposed to make it easy for site maintainers and should work with plain HTML as input. Direct access to internal data structures of the source system (database tables, post/author/commenter identifiers etc.) should not be needed. Even RDF experts don't have much experience with side effects of semantic systems directly hooked into running applications. And with RDF encouraging loosely coupled components anyway, it makes sense to keep the semantification on a separate machine.

In order to implement the process, I used Trice (once again), which supports simple agents out of the box. The bot-based approach already worked quite nicely in Talis' FanHubz demonstrator, so I followed this route here, too. For "Linked RWW", I only needed a very small number of bots, though.

Archives indexer and monitor

The archives indexer fetches the by-month archives, extracts all link URLs matching the "YYYY/MM" pattern, and saves them in an ARC Store.
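To make the indexing step a bit more concrete, here is a minimal Python sketch of the link extraction (the actual bot was written in PHP and stores its results in ARC; none of that is reproduced here). The `blog.example` URLs, the exact regular expression, and the function names are assumptions for illustration only.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed pattern: the URL path ends in ".../YYYY/MM" (optional trailing slash).
ARCHIVE_PATH_RE = re.compile(r"/\d{4}/\d{2}/?$")


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_archive_urls(html, base_url):
    """Return absolute, de-duplicated URLs whose path matches YYYY/MM."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if ARCHIVE_PATH_RE.search(urlparse(absolute).path) and absolute not in urls:
            urls.append(absolute)
    return urls


# Hypothetical archives page fragment.
sample = """
<ul>
  <li><a href="/archives/2010/07/">July 2010</a></li>
  <li><a href="/archives/2010/06/">June 2010</a></li>
  <li><a href="/about.html">About</a></li>
  <li><a href="/archives/2010/07/">July 2010 (again)</a></li>
</ul>
"""
archive_urls = extract_archive_urls(sample, "http://blog.example/")
```

In a real setup the resulting URLs would be written to the RDF store instead of a Python list.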

The implementation of this bot was straightforward (less than 100 lines of PHP code, including support for pagination); this is clearly something that can be turned into a standard component for common blog engines very easily. The result is a complete list of archives pages (so far still without any post URLs) which can be accessed through the RDF store's built-in SPARQL API:

A second bot (the archives monitor) receives either a not-yet-crawled index page (if available) or the most current archives page as a starting point. Each post link of that page is then extracted and used to build a registry of post URLs. The monitoring bot is called every 10 minutes and keeps track of new posts.
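The monitor's selection logic can be sketched in a few lines of Python. This is only an illustration of the "not-yet-crawled page first, otherwise the most current page" rule described above; the function names and the in-memory sets are assumptions, and the real bot keeps its state in the RDF store.

```python
def pick_start_page(archive_pages, crawled):
    """Return the first archives page that has not been crawled yet.

    If every page has been crawled, fall back to the most current one.
    Assumes all URLs share a prefix and end in YYYY/MM, so that plain
    string comparison orders them chronologically.
    """
    for url in archive_pages:
        if url not in crawled:
            return url
    return max(archive_pages)


def register_posts(registry, post_links):
    """Add unseen post URLs to the registry and return the new ones."""
    new_links = [url for url in post_links if url not in registry]
    registry.update(new_links)
    return new_links


# Hypothetical data for a single monitoring run.
pages = [
    "http://blog.example/archives/2010/06/",
    "http://blog.example/archives/2010/07/",
]
crawled = set()
registry = set()

start = pick_start_page(pages, crawled)
crawled.add(start)
new_posts = register_posts(registry, [start + "first-post.php", start + "second-post.php"])
```

Calling these two functions from a scheduler every 10 minutes would first work through the backlog and then keep polling the latest archives page for new posts.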

Post loader and parser

In order to later process post data at a finer granularity than the page level, we have to extract sub-structures such as title, author, publication date, tags, and so on. This is the harder part because most blogs don't use Linked Data-ready HTML in the form of Microdata or RDFa. Luckily, blogs are template-driven and we can use DOM paths to identify individual post sections, similar to how tools like the Dapper Data Mapper work. However, given the flexibility and customization options of modern blog engines, certain extensions are still needed. In the RWW case I needed site-specific code to expand multi-page posts, to extract a machine-friendly publication date, Facebook Likes and Tweetmeme counts, and to generate site-wide identifiers for authors and commenters.
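As a rough illustration of the DOM path idea, the Python sketch below maps field names to paths and applies them to a post page. The markup, class names, and paths are made up for this example (they are not RWW's template), and the snippet assumes well-formed XHTML so that the standard library parser can be used; real-world pages usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: field name -> DOM path.
POST_PATHS = {
    "title": ".//div[@class='post']/h1",
    "author": ".//span[@class='author']",
    "date": ".//span[@class='date']",
}


def extract_post(xhtml, paths=POST_PATHS):
    """Extract post sub-structures from a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    record = {}
    for field, path in paths.items():
        node = root.find(path)
        # itertext() also picks up text inside nested inline elements.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    record["tags"] = [a.text for a in root.findall(".//a[@rel='tag']")]
    return record


sample = """<html><body><div class="post">
<h1>Hello <em>Linked</em> Data</h1>
<span class="author">Jane Doe</span>
<span class="date">2010-07-15</span>
<a rel="tag" href="/tag/rdf">rdf</a><a rel="tag" href="/tag/sparql">sparql</a>
</div></body></html>"""

post = extract_post(sample)
```

Keeping the paths in a plain dictionary means the generic part of the bot stays the same across blogs, while site-specific extensions (multi-page posts, counters, identifiers) live in separate code.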

Writing this bot took several hours and almost 500 lines of code (after re-factoring), but the reward is a nicely structured blog database that can already be explored with an off-the-shelf RDF browser. At this stage we could already use the SPARQL API to easily create dynamic widgets such as "related entries" (via tags or categories), "other posts by same author", "most active commenters per category", or "most popular authors" (as shown in the example in the image below).

Named entity extraction

Now, the next bot can take each post's main content and enhance it with Zemanta and OpenCalais (or any other entity recognition tool that produces RDF). The result of this step is a semantified, but rather messy dataset, with attributes from half a dozen RDF vocabularies.

Schema/Ontology identification

Luckily, RDF was designed for working with multi-source data, and thanks to the SPARQL standard, we can use general purpose software to help us find our way through the enhanced assets. I used a faceted browser to identify the site's main entity types (click on the image below for the full-size version).

Although spotting inconsistencies (like Richard MacManus appearing multiple times in the "author" facet) is easier with a visual browser, a simple, generic SPARQL query can alternatively do the job, too:
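As a rough sketch of the idea (not necessarily the query used for the demo), a generic "which types are in here, and how often" query could look like the SPARQL 1.1 string below. The Python function next to it computes the same counts over an in-memory list of triples, so the example runs without an endpoint; the `ex:` identifiers are placeholders.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed SPARQL 1.1 formulation; stores with their own aggregate
# syntax may need a slightly different form. Not executed here.
TYPE_COUNT_QUERY = """
SELECT ?type (COUNT(?s) AS ?n)
WHERE { ?s a ?type }
GROUP BY ?type
ORDER BY DESC(?n)
"""


def count_types(triples):
    """Return (type, count) pairs, most frequent type first."""
    return Counter(o for s, p, o in triples if p == RDF_TYPE).most_common()


# Hypothetical sample data.
triples = [
    ("ex:p1", RDF_TYPE, "ex:Person"),
    ("ex:p2", RDF_TYPE, "ex:Person"),
    ("ex:o1", RDF_TYPE, "ex:Organization"),
    ("ex:p1", "ex:name", "Richard MacManus"),
]
type_counts = count_types(triples)
```

The same grouping trick works for any property: group by label instead of type and every count above 1 is a candidate duplicate.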

Specifying the target ontology

The central entity types extracted from RWW posts are Organizations, People, Products, Locations, and Technologies. Together with the initial structures, we can now draft a consolidated RWW target ontology, as illustrated below. Each node gets its own identifier (a URI) and can thus be a bridge to the public Linked Data cloud, for example to import a company's competitor information.

Aligning the data with the target ontology

In this step, we are again using a software agent and break things down into smaller operations. These sub-tasks require some RDF and Linked Data experience, but basically, we are just manipulating the graph structure, which can be done quite comfortably with a SPARQL 1.1 processor that supports INSERT and DELETE commands. Here are some example operations that I applied to the RWW data:

For each untyped entity, retrieve typing and label information from the Linked Data cloud (e.g. DBPedia, Freebase, or Semantic CrunchBase) and try to map them to the target ontology.

Try to consolidate "obviously identical" entities (I cheated by merging on labels here and there, but it worked).
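The label-based consolidation can be sketched as follows. In the demo this kind of rewrite was done with SPARQL INSERT and DELETE operations against the store; the Python version below only illustrates the graph manipulation on an in-memory list of triples. The normalization rule (lowercase, collapsed whitespace), the choice of the canonical identifier, and the `ex:` identifiers are all assumptions.

```python
from collections import defaultdict

LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def merge_on_labels(triples):
    """Merge entities that share a normalized label.

    Every identifier in a label group is rewritten to one canonical
    identifier (here simply the smallest one in string order).
    """
    by_label = defaultdict(list)
    for s, p, o in triples:
        if p == LABEL:
            key = " ".join(o.lower().split())
            if s not in by_label[key]:
                by_label[key].append(s)

    canonical = {}
    for subjects in by_label.values():
        keep = min(subjects)
        for s in subjects:
            canonical[s] = keep

    merged = []
    for s, p, o in triples:
        triple = (canonical.get(s, s), p, canonical.get(o, o))
        if triple not in merged:
            merged.append(triple)
    return merged


# Hypothetical data: the same author under two identifiers.
triples = [
    ("ex:author/1", LABEL, "Richard MacManus"),
    ("ex:author/2", LABEL, "richard  macmanus"),
    ("ex:post/1", "ex:author", "ex:author/2"),
    ("ex:post/2", "ex:author", "ex:author/1"),
]
merged = merge_on_labels(triples)
```

Merging purely on labels is exactly the kind of cheating mentioned above: it conflates distinct entities that happen to share a name, so in practice the entity type should at least be part of the grouping key.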

Data alignment and QA is an iterative process (and a slightly slippery slope). The quality of public linked data varies, but the cloud is very powerful. Each optimization step adds to the network effects and you constantly discover new consolidation options. I spent just a few hours on the inferencer; after all, the Linked RWW demo is just meant to be a proof of concept.

After this step, we're basically done. From now on, the bots can operate autonomously and we can (finally) build our dynamic semantic publishing apps, like the Paggr Dashboard presented in the video above.

Conclusion

Dynamic Semantic Publishing on mainstream websites is still new, and there are no complete off-the-shelf solutions on the market yet. Many of the individual components needed, however, are available. Additionally, the manual effort to integrate the tools is no longer incalculable research, but is getting closer to predictable "standard" development effort. If you are perhaps interested in a solution similar to the ones described in this post, please get in touch.

Let's face it, building semantic web sites and apps is still far from easy. And to some extent, this is due to the configuration overhead. The RDF stack is built around declarative languages (for simplified integration at various levels), and as a consequence, configuration directives often end up in some form of declarative format, too. While fleshing out an RDF-powered website, you have to declare a ton of things. From namespace abbreviations to data sources and API endpoints, from vocabularies to identifier mappings, from queries to object templates, and what have you.

Sadly, many of these configurations are needed to style the user interface, and because of RDF's open world context, designers have to know much more about the data model and possible variations than usually necessary. Or webmasters have to deal with design work. Not ideal either. If we want to bring RDF to mainstream web developers, we have to simplify the creation of user-optimized apps. The value proposition of semantics in the context of information overload is pretty clear, and some form of data integration is becoming mandatory for any modern website. But the entry barrier caused by large and complicated configuration files (Fresnel anyone?) is still too high. How can we get from our powerful, largely generic systems to end-user-optimized apps? Or the other way round: How can we support frontend-oriented web development with our flexible tools and freely mashable data sets? (Let me quickly mention Drupal here, which is doing a great job at near-seamlessly integrating RDF. OK, back to the post.)

Enter RDF widgets. Widgets have obvious backend-related benefits like accessing, combining and re-purposing information from remote sources within a manageable code sandbox. But they can also greatly support frontend developers. They simplify page layouting and incremental site building with instant visual feedback (add a widget, test, add another one, re-arrange, etc.). And, more importantly in the RDF case, they can offer a way to iteratively configure a system with very little technical overhead. Configuration options could not only be scoped to the widget at hand, but also to the context where the widget is currently viewed. Let's say you are building an RDF browser and need resource templates for all kinds of items. With contextual configuration, you could simply browse the site and at any position in the ontology or navigation hierarchy, you would just open a configuration dialog and define a custom template, if needed. Such an approach could enable systems that worked out of the box (raw, but usable) and which could then be continually optimized, possibly even by site users.

A lot of "could" and "would" in the paragraphs above, and the idea may sound quite abstract without actually seeing it. To illustrate the point I'm trying to make I've prepared a short video (embedded below). It uses Semantic CrunchBase and Paggr Prospect (our new faceted browser builder) as an example use case for in-context configuration.

And if you are interested in using one of our solutions for your own projects, please get in touch!

A combination of RDFa and Microdata would allow for separate semantic layers.

Apart from grumpy rants about the complexity of W3C's RDF specs and semantic richtext editing excitement, I haven't blogged or tweeted a lot recently. That's partly because there finally is increased demand for the stuff I'm doing at semsol (agency-style SemWeb development), but also because I've been working hard on getting my tools in a state where they feel more like typical Web frameworks and apps. Talis' Fanhu.bz is an example where (I think) we found a good balance between powerful RDF capabilities (data re-purposing, remote models, data augmentation, a crazy army of inference bots) and a non-technical UI (simplistic visual browser, Twitter-based annotation interfaces).

Another example is something I've been working on during the last months: I somehow managed to combine essential parts of Paggr (a drag&drop portal system based on RDF- and SPARQL-based widgets) with an RDF CMS (I'm currently looking for pilot projects). And although I decided to switch entirely to Microdata for semantic markup after exploring it during the FanHubz project, I wonder if there might be room for having two separate semantic layers in this sort of widget-based website. Here is why:

As mentioned, I've taken a widget-like approach for the CMS. Each page section is a resource on its own that can be defined and extended by the web developer, it can be styled by themers, and it can be re-arranged and configured by the webmaster. In the RDF CMS context, widgets can easily integrate remote data, and when the integrated information is exposed as machine-readable data in the front-end, we can get beyond the "just-visual" integration of current widget pages and bring truly connectable and reusable information to the user interface.

Ideally, both the widgets' structural data and the content can be re-purposed by other apps. Just like in the early days of the Web, we could re-introduce a copy & paste culture of things for people to include in their own sites. With the difference that RDF simplifies copy-by-reference and source attribution. And both developers and end-users could be part of the game this time.

Anyway, one technical issue I encountered is when you have a page that contains multiple page items, but describes a single resource. With a single markup layer (say Microdata), you get a single tree where the context of the hierarchy is constantly switching between structural elements and content items (page structure -> main content -> page layout -> widget structure -> widget content). If you want to describe a single resource, you have to repeatedly re-introduce the triple subject ("this is about the page structure", "this is about the main page topic"). The first screenshot below shows the different (grey) widget areas in the editing view of the CMS. In the second screenshot, you can see that the displayed information (the marked calendar date, the flyer image, and the description) in the main area and the sidebar is about a single resource (an event).

Trice CMS editing view

Trice CMS page view with inline widgets describing one resource

If I used two separate semantic layers, e.g. RDFa for the content (the event description) and Microdata for the structural elements (column widths, widget template URIs, widget instance URIs), I could describe the resource and the structure without repeating the event subject in each page item.

To be honest, I'm not sure yet if this is really a problem, but I thought writing it down could kick off some thought processes (which now tend towards "No"). Keeping triples as stand-alone-ish as possible may actually be an advantage (even if subject URIs have to be repeated). No semantic markup solution so far provides full containment for reliable copy & paste, but explicit subjects (or "itemid"s in Microdata-speak) could bring us a little closer.

Conclusions? Err.., none yet. But hey, did you see the cool CMS screenshots?

In case you missed the tweets or a local announcement: The first Paggr application went online a few days ago. This year's ESWC Technologies Team pushed things a little further, with RFID tracking during the event and extended conference data that includes detailed session and date/time information (kudos to Michael Hausenblas for RDFizing even PDFs).

Based on this dataset, we provided a conference explorer and stress-tested the "Dog Food" server while at it. The system survived, but I also learned a lot. We used about 50 RDF stores for the different public and user-specific dashboards, which basically worked nicely. However, rendering non-ugly resource summaries requires a bit of endpoint hammering, and some of the more complex path queries resulted in timeouts. Yesterday, I had to create a mirror from the data dump to route a couple of widgets through a replicated (ARC :-) endpoint. But then this is also one of the powerful possibilities that come with semantic web technologies. You can often switch or double the back-end repository in no time, and without any code changes. (And as all the Sparqlets are created in a web-based tool, I didn't even have to upload a changed configuration file. I simply tweaked a SPARQLScript parameter.)

Anyway, there are a couple of public dashboards, in case you'd like to give it a try (it's still an early version). I also embedded a short screencast below. The system is going to be moved to a DERI server when the conference is over, but the URIs and data will probably stay stable. (And no, it won't really work with IE yet.) More to come!

I just returned from a short, doc-enforced trip to Nice (awesome place, savoir-vivre and all that) and will fly to the NYC SemWeb Meetup in a few days. Before we went to France, I created another Paggr screencast. This one is the first to show the (user-facing) dashboard and widgets we plan to make available as a semantic conference explorer at ESWC 2009. Still some way to go, but I'm optimistic that we'll have a number of handy helpers online by the beginning of the event. I won't be able to attend in person, so I'm highly motivated to have at least a twitter and twitpic tracker up and running then.

Running an R&D-heavy agency in the current economic climate is pretty tough, but there are also a couple of new opportunities for these semantic solutions that help reduce costs and do things more efficiently. I'm finally starting to get project requests that include some form of compensation. Not much yet (all budgets seem to be very tight these days), but it's a start, and together with support from Susanne, I could now continue working on Paggr, semsol's Netvibes-like dashboard system for the growing web of Linked Data.

An article about Paggr will be in the next Nodalities Magazine, and the ESWC2009 technologies team is considering a custom system for attendees which is a great chance to maybe get other conference organizers interested. (I see much potential in a white-label offering, but a more mainstream-ish version for Web 2.0 data is still on my mind. Just have to focus on getting self-sustained first.)

Below is a short screencast that demonstrates a first version of the sparqlet (= semantic widget) builder. I've de-coupled sparqlet-serving from the dashboard system, so that I'll be able to open-source the infrastructure parts of Paggr more easily. Another change from the October prototype is the theme-ability of both dashboards and widget servers. Lots of sun, sky, and sea for ESWC ;-)

I'm currently writing an article about paggr for the Nodalities Magazine. As there is not too much to write about yet, I'm focusing on the basic idea (customizable Linked Data dashboards), its inspiration (TimBL's RDF Clipboard concept), enabling technologies and trends (Live Clipboard, widgets, AJAX homepages, sub-page-level interaction), and the user interface challenges related to generic interaction with Linked Data.

One thing that I thought might be worth sharing separately is the "Linked Data Value Spiral" below. It tries to illustrate that semantic data don't have a single-loop life cycle, but that re-distributing utilized (= newly meshed/combined) information will create a self-reinforcing "Linked Data ecosystem". I tried to associate the individual value creation processes with SemWeb market sectors. (RDF stores, for example, are typical information organization products, paggr tries to remove the bottleneck between utilization and re-distribution, etc.)

It's just an abstraction, the boundaries are of course blurry (a SPARQL endpoint can help with both utilization and discovery), but I still find the simple spiral and its segments handy to classify current products and companies. It even helped me a little to identify market opportunities and gaps:

The recent VoiD effort could have a significant impact on the whole Semantic Web progress.

Entity extraction providers like Zemanta and OpenCalais could play a huge role to boost creation processes.

What about "accelerator" products that offer a shortcut between utilization and creation (i.e. apps that create Linked Data while you are using them, with instant re-distribution)?

Is it a problem when a service like Freebase exports RDF but doesn't provide links to external datasets?

ISWC 2008 in Karlsruhe was just great. Even won the Semantic Web Challenge.

What can I say? I'm still smiling like on the pic on the left (credits: Keith Alexander). And you have no idea how urgently I need the money ;-)

paggr has received very encouraging feedback (or premature praise, rather), so I'm busily working on getting the beta out as soon as possible. Especially given that paggr wouldn't have had a chance to convince the judges without the great amount of Linked Data and all the painful spec work by the Semantic Web Community. The ball's in my court to actually deliver now.

There are some items left on my todo list before I dare sending out more invitation codes (some were added after feedback at ISWC):

improved RDF exporter for portals and individual widgets (just finished the first version, using a new thingy called poshRDF)

the widget and agent builders should be visual, more like the cool SPARQLMotion editor (I'm working on that now).

dropping a widget item on the canvas should auto-open a corresponding details widget

widgets should be able to "listen to" other widgets for auto-refreshes

a setup wizard that lets you specify initial accounts and data sources

I assume that a fully generic semantic widget and agent platform might be either over- or underwhelming, so I plan to provide a set of ready-to-run apps for paggr. Here are some ideas:

I've been semi-silently working on something new. A combination of many semwebby things I came across and played with during the last 3 years or so:

semantic markup

smart data

an rdf clipboard

ajax

sparql sparql sparql

sparql + scripting

sparql + templates

sparql + widgets

lightweight, federated semweb services and bots

UIs for open data

semwikis

agile and collaborative web development

So, what happens when you put this all together? At least something interesting, and perhaps semsol's first commercial service. (Or product, this is all just LAMP stuff and can easily be run in an intranet or on a hosted server). Anyway, still some way to go. It's called paggr, the landing page is up, and today I created a first teaser/intro video.

I'll demo the beta (launch planned for November) at the upcoming ISWC during the poster session (my poster is about SPARQL+ and SPARQLScript, the two SPARQL extensions that paggr is based on). I may have early invites by then.

As a preparation for the hopefully busy fall and winter months, though, I'll be on vacation for the next two weeks. No Email, no Web, no Phone. Yay!

The visual-x mag just published a webinale report that contains a nice summary of my talk (and even a link to paggr). Phew, this means that at least some people were not scared off, which is great personally, but also (and more importantly) from a SWEO perspective.

I didn't really have a lot to talk about yet (I plan to do some screencasts for the paggr system), so this one was more for testing a number of different tools (the screencast itself is about paggr's soon-polymorphic drag and drop). I used a Windows box and found the following tools quite useful:

and a free FLV player I found on the web somewhere. It hopefully isn't doing evil things on my server now..

The tools make the technical side of things rather convenient, but it still was more work than expected, and I sound just horrible (sometimes close to "zank you for traffeling wiz Deutsche Bahn", if you know what I mean).