Mashups 101: Using XQuery for Data Selection

XQuery enables you to take advantage of the inherent structure of web pages to create data extracts that are easy to write and maintain, a building block for the new ad-hoc web applications called mashups.

by Dan McCreary

Mar 26, 2009

Web 3.0 (aka the semantic web) is the transformation of the web into a data source for building new applications, which combine data in ways that the data source's original author may not have anticipated. Although most web designers don't think of their HTML web pages as data sources, the use of semantic markup is starting to fuel this new generation of applications that can extract the designers' data and create new applications from it (see Sidebar 1. How Did the Web Get Here?).

These new "ad-hoc" applications are called mashups, because they combine (or "mash" together) data in innovative ways to create new views of the web. Many semantic web standards and technologies are contributing to the creation of mashups. These include semantic markup, CSS, RSS, Atom, the Atom Publishing Protocol (APP), RDFa, GRDDL, and plain old XML.

Mashups all share three salient characteristics:

They draw on sources of data directly on the web.

They transform, combine, and re-transform this data to create innovative new outputs. Maps and timeline displays are typical mashup output formats.

They can usually be done in a few hours. That means that the transformations are created rapidly in a high-productivity environment.

Regarding this last characteristic, most mashups would take much longer if you had to create relational database tables by hand using a traditional RDBMS Data Definition Language (DDL). Fortunately, modern systems can use the metadata in XML files to create in-memory data structures and indexes automatically for high-performance mashups. You would need a CREATE TABLE statement only in legacy systems.
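
As a rough illustration of skipping the DDL step, the sketch below builds an in-memory index straight from an XML document, using the element names already present in the data instead of a declared table. It is written in Python rather than XQuery only so that it runs standalone. The sample data and the names build_index, record_tag, and key_tag are hypothetical and do not come from any particular product.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; a real mashup would pull this from the web.
SAMPLE = """<cities>
  <city><name>Minneapolis</name><state>MN</state></city>
  <city><name>Madison</name><state>WI</state></city>
  <city><name>Duluth</name><state>MN</state></city>
</cities>"""


def build_index(xml_text, record_tag, key_tag):
    """Group every <record_tag> element by the text of its <key_tag> child.

    The XML element names play the role that column definitions would
    play in a CREATE TABLE statement.
    """
    index = defaultdict(list)
    for record in ET.fromstring(xml_text).iter(record_tag):
        key = record.findtext(key_tag)
        if key is not None:
            index[key].append(record)
    return dict(index)


by_state = build_index(SAMPLE, "city", "state")
minnesota_cities = [city.findtext("name") for city in by_state["MN"]]
```

A lookup such as by_state["MN"] then touches only the matching records, which is the kind of shortcut an automatically built index is meant to provide.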

High-Level Architecture of a Mashup

Figure 1. The Architecture of a Mashup System: Here is the high-level architecture of a mashup system.

Note that you will need a tool to pull your raw data from the web and convert it into XML. Many tools do this. For example, the eXist-db open-source database provides a module called HTTPClient that performs HTTP GET requests on an input URL. To enable it, you need to change the database configuration file so that the module is loaded. The HTTPClient library will transform any malformed HTML into well-formed XML.
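
To make the "malformed HTML in, well-formed XML out" step concrete, here is a minimal Python sketch built on the standard library. It is not how eXist-db's HTTPClient does the job. The function name tidy, the wrapper element named document, and the short list of void elements are all assumptions made for illustration. The sketch guarantees only well-formedness: it closes unclosed tags and ignores stray end tags, but it does not apply HTML's implicit-close rules the way a real tidy tool would.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML (partial list, an assumption).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _WellFormedBuilder(HTMLParser):
    """Feed in sloppy HTML and build a properly nested element tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        # Synthetic root so the result always has exactly one top-level element.
        self.builder.start("document", {})

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {name: (value or "") for name, value in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close everything left open inside the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        self.close()  # flush any text the parser is still buffering
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def tidy(html):
    """Return an ElementTree element for the (possibly malformed) HTML string."""
    parser = _WellFormedBuilder()
    parser.feed(html)
    return parser.finish()


root = tidy("<p>Hello <b>world<br></p>")  # <b> is never closed in the input
xml_text = ET.tostring(root, encoding="unicode")
```

Once the page is well-formed XML, it can be queried like any other XML document.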

Another feature of the above architecture is the ability to store relevant data directly in a local XML store. Users who create mashups frequently reuse the same data sets. If the application also performs a "store" operation on the incoming data, later requests for that data can be served locally without repeating the HTTP GET. Many systems, such as the eXist-db database, also index these files automatically, so that even very large collections (hundreds of thousands of documents) have very fast retrieval times.
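
The store-then-reuse pattern can be sketched in a few lines. The Python class below is a hypothetical stand-in for a local XML store, not eXist-db's API: LocalStore, its get method, and the injected fetch function are names invented for this example, and an in-memory dictionary takes the place of a real database collection.

```python
from typing import Callable, Dict


class LocalStore:
    """Keep a local copy of each document, keyed by its source URL."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch              # function that performs the HTTP GET
        self._docs: Dict[str, str] = {}  # stands in for the local XML store

    def get(self, url: str) -> str:
        # Go to the network only the first time a URL is requested.
        if url not in self._docs:
            self._docs[url] = self._fetch(url)
        return self._docs[url]


# Usage with a fake fetcher that records each simulated network call.
calls = []


def fake_fetch(url: str) -> str:
    calls.append(url)
    return "<doc/>"


store = LocalStore(fake_fetch)
first = store.get("http://example.org/data.xml")
second = store.get("http://example.org/data.xml")  # served from the local copy
```

A production version would also need a policy for refreshing stale copies, which this sketch leaves out.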

XQuery and Cloud Computing

If you are building mashups from data sets hosted on cloud computing infrastructure and you use XQuery to select the data, both CPU time and I/O can drop dramatically. This is because XQuery is one of the few languages designed to use indexes on unstructured documents and to be very precise about what data is extracted from these data sets. If you are paying for cloud computing infrastructure, you may want to consider XQuery as your data selection language.
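
To show what "precise about what data is extracted" means in practice, the Python sketch below pulls only the titles of entries in one category and leaves the much larger body text untouched. It returns roughly what the path expression //entry[category = 'news']/title would select. The feed, its element names, and the function select_titles are hypothetical, and the sketch scans the document in memory rather than using a database index.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; each entry carries a large body the query never needs.
FEED = """<feed>
  <entry><category>news</category><title>Launch day</title><body>...</body></entry>
  <entry><category>sports</category><title>Final score</title><body>...</body></entry>
  <entry><category>news</category><title>Follow-up</title><body>...</body></entry>
</feed>"""


def select_titles(xml_text, category):
    """Return only the title text of entries whose category matches."""
    root = ET.fromstring(xml_text)
    return [
        entry.findtext("title")
        for entry in root.iter("entry")
        if entry.findtext("category") == category
    ]


news_titles = select_titles(FEED, "news")
```

The smaller the selected result, the less data has to be read, moved, and paid for.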