March 15, 2008

XML and Modeling

Data modeling is a big thing at Burton Group - a significant amount of the airtime expended in the ether around the virtual water cooler is devoted to teasing out the way that models interact, the best language for expressing such models, and what characteristics best define a good model. In a way this isn't surprising - many of the people within the organization are former application or systems architects, and as such share a belief that application developers don't necessarily hold: before you write a single line of code, you should have a reasonably deep understanding of the particular piece of the real world you are attempting to model in that code.

Modeling, for the record, is hard. It is in a very real sense an attempt to predict the future by trying to determine, for a given set of problems, what the final application that solves those problems will look like. What makes this particularly difficult in many settings is that these predictions often have to be made before the real unknowns - what development environments and languages the systems will be built in, whether people with adequate skills to program the pieces can be found, whether new technologies will emerge that eclipse what is currently under development, and so forth - can even be resolved.

Moreover, in a realm as fast-paced as business - or the governance of
business - any such application must also assume that the business
problems to be solved tomorrow will almost certainly differ
considerably from the business problems of today. The model therefore
has to build in a certain degree of flexibility, which makes it much
harder to build tightly coupled solutions and suggests that such tight
couplings, artifacts of Tayloresque business efficiency, are
incompatible with building robust applications in the face of change.

There are a number of different ways to model (and nearly as many
interpretations of what in fact makes up a good model), but one of the
things that has emerged over the years is that it is in general far
better to design the characteristics of your inputs and outputs before
touching process. Why? Inputs and outputs represent invariants in the
model, things that in general will not change significantly over time.

Let's say that I'm developing a foreign exchange (forex) trading
environment. At the end of the day, the goal with such a forex system
is to enable transactions between buyers and sellers in the face of
constantly changing prices. Expressed in systems terms, this means that
you have, for each transaction, a statement of available resources for
sale, one or more bids on those resources (let's call it a bid-sheet),
and a contract on the final bid (which assumes identity of the
contracting parties). Everything else in the application - in terms of
the model - is irrelevant. While the specific values (the instances) of
each of these objects will change dramatically on any given day, what
is most significant is that the underlying data structures won't. They
are invariants.

There is a fairly significant distinction between a conceptual model
and a specific implementation of that model. The first is essentially the human contract, the representation that participants in the model can readily understand. The second is usually written in some encoded format that may or may not capture the full fidelity of the conceptual model; it is an approximation that is only as close as the underlying modeling language allows. However, especially if you can move
into a mode where you are attempting only to describe the invariants of
the system, the match between conceptual and implementation models is usually pretty good.

In the XML arena, one such modeling language is the W3C XML Schema Definition language, or XSD. This particular language is meant to describe a largely hierarchical view of a given data environment, one in which you can theoretically place everything under a single overnode and thus be able to treat the data space as a giant tree.
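
As a rough sketch of what that looks like for the forex example above, the three invariants (the offer, the bid-sheet and the contract) can be hung under a single root element. Every element and type name here is an assumption made for illustration, as is the choice to make the contract optional until a bid is accepted:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical single root ("overnode"): the whole transaction is one tree -->
  <xs:element name="Transaction">
    <xs:complexType>
      <xs:sequence>
        <!-- Statement of the resources available for sale -->
        <xs:element name="Offer" type="OfferType"/>
        <!-- The bid-sheet: one or more bids on those resources -->
        <xs:element name="BidSheet">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Bid" type="BidType" maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <!-- The contract on the final bid; absent until a bid is accepted -->
        <xs:element name="Contract" type="ContractType" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:complexType name="OfferType">
    <xs:sequence>
      <xs:element name="Seller" type="xs:string"/>
      <xs:element name="currency" type="xs:string"/>
      <xs:element name="Amount" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="BidType">
    <xs:sequence>
      <xs:element name="Buyer" type="xs:string"/>
      <xs:element name="Price" type="xs:decimal"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="ContractType">
    <xs:sequence>
      <xs:element name="Buyer" type="xs:string"/>
      <xs:element name="Seller" type="xs:string"/>
      <xs:element name="ContractDate" type="xs:date"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

The schema pins down shape and datatypes only. It says nothing about how the values in one part of the tree relate to values elsewhere.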

XSD's limitations mostly have to do with external constraints. One characteristic of most data models is interdependency, where parts of a model are constrained in what values they can hold by other parts of the model. Such constraints can take the form of parts of the model only being relevant when a given property is in a certain state, can be calculations for certain property values based upon other values, or can be requirements that certain properties be ordered relative to other values (such as the start-date for a given event always being prior to the end-date).

Rick Jelliffe, CTO of Topologi, recognized this deficiency some years ago, and helped to establish an alternative form of schema language called Schematron. Despite the rather unorthodox name, Schematron is a very legitimate schema language, one devoted specifically to constraint-based, or rules-based, validation. Those constraints may be inter-nodal (the validity of one element depends upon the values or presence of other nodes) or even inter-document (validity depends upon a set of elements in another document), neither of which XSD can express beyond its limited identity constraints (xs:key and xs:keyref).

Schematron works by establishing patterns consisting of one or more rules. Each rule in turn carries an XPath-based context expression that identifies particular elements or attributes in an XML structure. If a matching node is found, then the rule applies a set of assertions or reports, each evaluated within the context of the rule itself. An assertion is a test that a particular condition holds, and its message is emitted when the test fails; a report is the flip side, a test whose message is emitted when the condition is found to be true. For instance,

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <pattern id="contract-date">
    <title>Contract date must be in the past</title>
    <rule context="Contract">
      <!-- "<" must be escaped as &lt; inside an attribute value -->
      <assert test="ContractDate &lt; current-date()">ContractDate must be in the past.</assert>
    </rule>
  </pattern>
</schema>

In this particular case, the pattern is for the situation "Contract date must be in the past". The rule has the Contract element as its context - any time a <Contract> element is encountered, the rule is applied. The assertion in this case tests that Contract/ContractDate is in fact prior to the current date (the example uses the XPath 2.0 current-date() function).

The only tricky aspect here is that the warning message is "printed" only when the assertion fails - nothing gets printed when it succeeds. Given the role of Schematron as a validation mechanism, this makes sense - you're not interested in success at any given individual element, only that success is achieved globally ... but you are interested in isolating where failures are occurring.

The <report> element is a little different in that when the report test succeeds, then the report is printed:
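
A minimal sketch of such a rule, assuming a Contract carries StartDate and EndDate children (both names, the pattern id and the 30-day window are illustrative choices, not taken from any real schema):

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
  <!-- Declare the xs prefix so it can be used inside the XPath 2.0 tests -->
  <ns prefix="xs" uri="http://www.w3.org/2001/XMLSchema"/>
  <pattern id="contract-window">
    <title>Contract must have started, and may be close to expiry</title>
    <rule context="Contract">
      <!-- Assertion: the message is emitted only when the test is FALSE -->
      <assert test="xs:date(StartDate) le current-date()">The contract has not yet started.</assert>
      <!-- Report: the message is emitted only when the test is TRUE -->
      <report test="xs:date(EndDate) ge current-date()
                    and xs:date(EndDate) - current-date() le xs:dayTimeDuration('P30D')">The contract expires within 30 days.</report>
    </rule>
  </pattern>
</schema>
```
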

Such a rule can include both an assertion that the contract is past its initial date and a report that prints a warning when the contract is about to go out of scope. This shows that the role of a report is to provide status information about the particular instance, independent of any assertions about validity. Any given rule can contain multiple assertions and reports, and any pattern may contain multiple rules.

Schematron was standardized by ISO as one of a family of schema languages, specifically ISO/IEC 19757 - Document Schema Definition Languages (DSDL) - Part 3: Rule-based validation - Schematron. Schematron was originally intended to be implemented in XSLT (or XSLT 2), and indeed this is still the simplest implementation, but there are also an increasing number of standalone Schematron validators written in Java and C#.

Recently, the W3C began an activity to create a Service Modeling Language, or SML (not to be confused with the SMIL multimedia standard). SML is intended to formally codify what had until then been an ad-hoc integration of Schematron and XSD. Specifically, SML is used to embed Schematron documents within XSD documents, usually within the xs:appinfo element of a simpleType or complexType definition. For instance, consider the following test for an IP address (an example from the SML specification):
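
A sketch along the lines of that example, reconstructed from the description here rather than quoted verbatim. The tns prefix, the urn:example:IPAddress namespace, and the version/address element names are assumptions:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sch="http://purl.oclc.org/dsdl/schematron"
           xmlns:tns="urn:example:IPAddress"
           targetNamespace="urn:example:IPAddress"
           elementFormDefault="qualified">

  <xs:simpleType name="IPAddressVersionType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="V4"/>
      <xs:enumeration value="V6"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="IPAddress">
    <xs:annotation>
      <xs:appinfo>
        <!-- Schematron rules ride along inside the type definition -->
        <sch:schema>
          <sch:ns prefix="tns" uri="urn:example:IPAddress"/>
          <sch:pattern id="Length">
            <!-- context="." : whatever element is declared with this type -->
            <sch:rule context=".">
              <sch:assert test="tns:version != 'V4' or count(tns:address) = 4">A v4 IP address must have 4 bytes.</sch:assert>
              <sch:assert test="tns:version != 'V6' or count(tns:address) = 16">A v6 IP address must have 16 bytes.</sch:assert>
            </sch:rule>
          </sch:pattern>
        </sch:schema>
      </xs:appinfo>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="version" type="tns:IPAddressVersionType"/>
      <!-- XSD alone can only say "between 4 and 16 bytes" -->
      <xs:element name="address" type="xs:unsignedByte" minOccurs="4" maxOccurs="16"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```
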

The rule in this case covers the situation where a given IP address is identified as being IPv4 or IPv6, with one assertion for each of the two cases. Note that the rule given here takes as its context the context in which it is bound (in this case, any element that has the complex type IPAddress). When an XML document is run against an SML-compliant validator, this will generate the relevant messages for any element of type IPAddress.

Note that, especially since XPath 2.0 is now supported as a valid context and test language within Schematron, this also opens up the possibility of validating against external data resources. For instance, a rule for a <currency> element might include an assertion test such as:

<sch:pattern id="supported-currency">
  <sch:title>Currency is a supported one</sch:title>
  <sch:rule context="currency">
    <sch:assert test="index-of(doc('active-currencies.xml')//store/@id, .)">The currency '<sch:value-of select="."/>' is not one being currently traded.</sch:assert>
  </sch:rule>
</sch:pattern>

The doc() function returns the XML document at the given URI (which may be relative or absolute), and index-of() returns a sequence of the positions at which the value of the context element appears among the ids. If no match is found, the empty sequence is returned, which the assert treats as the equivalent of false(). (This relies on each id appearing only once in the file, since a test that yields more than one integer has no boolean value and raises an error.) The implication is that you can determine not only that a given currency is syntactically valid, but that it is in fact one of the currencies supported in your forex practice, without having to update your schema every time you add or remove a currency from trading.
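
To make the lookup concrete, active-currencies.xml might look something like the following. The currencies wrapper element and the three ids are made up for illustration; only the store/@id path comes from the assertion above.

```xml
<!-- active-currencies.xml: hypothetical list of currently traded currencies -->
<currencies>
  <store id="USD"/>
  <store id="EUR"/>
  <store id="JPY"/>
</currencies>
```

Against this file, <currency>EUR</currency> passes, because index-of() yields the single position 2, while <currency>XYZ</currency> yields the empty sequence and triggers the message. Adding or retiring a currency then means editing this file, not the schema.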

Schematron (and by extension SML) is not necessarily a heavyweight rules engine (even Jelliffe acknowledges that it is essentially a "featherduster for the corners of your schema"). It nonetheless provides a powerful tool for establishing the interrelationships and complexities inherent in XML models, while keeping those models sufficiently flexible in the face of changes in the "semi-invariants" - those things that may change (such as which currencies are currently available to trade), but do so slowly enough that they should be part of the model rather than part of the application layer.

Comments

While the SML/Schematron approach may be a reasonable enhancement of expressiveness for XSD constraints, just adding a "feather duster for the corners of your schema" (Rick Jelliffe himself) does not make a Service Modeling Language deserving such a challenging label. Surely a schema is kind of a model, but a model is more than a schema (even one with brushed corners).

Has there been much discussion about beefing up XML Schema's relational support (i.e., better support for inter-document relationships as a first class citizen)? Using xs:anyURIs or whatever is OK, until you start needing typed relationships.