The best name I can think of for this is Highly Generic Schemas, though there is probably something better. (It is in the same kind of area that might be addressed by abstract base types in XSD, and by the old architectural forms for that matter.) I have only tried it on one particularly problematic application so far, but it seemed to cut by about two-thirds the programming time needed to implement a pair of transformations: one from the initial data into the intermediate format, and one from the intermediate format out to publication. That looks pretty good.

The application has some peculiarities: the data inputs are changing in schema as well as data (not only the immediate inputs, but all the components of the processes that feed the initial data are being renovated over time); the business rules are changing; and the outputs and their formats are changing regularly. In fact, the changes to the actual data values each month are relatively few compared to the size and coordination effort of coping with the system and requirements changes. The output fields in each publication have many different variations too, such as calculated values for different markets and different statutory wording. Coping with these small variations in records that were otherwise similar was difficult for developers, who have in the past felt there was always some fresh unstated gotcha lurking. The lack of specific-enough identification of fields made requirements hard to express and to understand. And there is a must-have monthly deadline.

So the emphasis has to be on agility, clarity of markup, and hackability: it must be easy for the developers to understand immediately what the data in the intermediate format is, what kind of operations they will need to be looking at, and how to refactor it with minimal disruption and minimal bureaucratic impediments.

The Approach

The approach starts from a corollary of the situation. Because the specific requirements are changing rapidly, and data from different sources with slightly different semantics may be present, we ask "How do we have markup that reduces the amount of cross-referencing a developer needs to do?", i.e. how do we make as much of this context and metadata manifest as possible?

The approach is a combination of

dividing the schema into parts that more closely fit the organization (shades of Conway's Law?), namely the highly generic schema, the controlled vocabulary and the arrangement (detailed below),

using highly generic element names which make the nature and purpose of the field obvious: if it is a <flag> element then it will only be used for tests rather than printed, for example; but if it is a <rich-block> then it may have paragraphs and inline markup that need to be dealt with,

giving each field a quite long, unequivocal and explicit name (using @is), which removes any need to resort to context to understand the semantics of the element (and maintaining these names in a controlled vocabulary), and

denormalizing the data so that if the developer is interested in a set of information it is all there (i.e. both set-specific fields and fields that belong to the context), and if they are iterating over multiple sets (or multiple objects) then the fields relevant to the object are available immediately.

We separate the concerns of the schema into three parts:

A highly generic schema (we use a RELAX NG Compact grammar like the one below and provide an XSD mapping of it using Trang), which has a minimal number of elements (about 20), is highly systematic, and has a low semantic fanout, which I will explain below. The elements in this highly generic schema are not related to the particular business at all, but to the general class of document: we have the containers object, set and metadata, and the fields price, quantity, member, code, flag, text, rich-text, rich-block, object-ref and sort, all of which can have the attributes @is (giving the controlled vocabulary name), @code (giving the machine-processable value of the field), and @id.

A controlled vocabulary (we use Schematron) which has complete lists of the specific names that can appear in @is attribute values. Anything that belongs to a business rule is in this schema.

An arrangement (we use UML and text) which says how information is arranged into the generic schema using the controlled vocabulary. For example, it says that we want one intermediate file per output (to maximize the independence of developers working in parallel on different outputs), and so on.

So the highly generic schema is fairly fixed, and instances can be validated against it easily using the simple grammar, but that validation does not check any business rules. The idea of semantic fanout is that, rather than having possibly hundreds of element names or relying on XML context to understand the meaning, it is better to figure out a small number of common names which make the general processing semantics very obvious to the programmer: in this case we came up with about 10 field names, 3 container names, and a handful of HTML elements.

Almost everything can also have the common attributes mentioned: @is, @code and @id.
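To illustrate how a low semantic fanout might show up in processing code, here is a minimal XSLT 2.0 sketch. It is not taken from the application: the HTML output elements and the convention of carrying @is across as a class are assumptions made for the example. The point is only that a few templates keyed on the generic element names cover every field, whatever its @is value.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Flags exist only to be tested, so they are never printed. -->
  <xsl:template match="flag"/>

  <!-- Simple fields: print the value, carrying @is along as a class
       (an assumed convention, for illustration only). -->
  <xsl:template match="price | quantity | code | member | text">
    <span class="{@is}"><xsl:value-of select="."/></span>
  </xsl:template>

  <!-- Rich blocks hold paragraphs and inline markup, so copy them through. -->
  <xsl:template match="rich-block">
    <div class="{@is}"><xsl:copy-of select="node()"/></div>
  </xsl:template>

</xsl:stylesheet>
```

A new field name added to the controlled vocabulary needs no new template here unless it really does need special treatment.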

The controlled vocabulary is a much more ad hoc (or, at least, extensible) affair: indeed, one of the aims is that by bringing the controlled vocabulary under the control of the developers, it provides a mechanism for them to cooperate where they can prototype changes or make initial versions without having to wait for prior review from the schema god.
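As a rough sketch of what such a controlled vocabulary could look like in ISO Schematron (the @is names are invented for illustration and are not the project's real vocabulary, and the xslt2 query binding is assumed so that XPath 2.0 sequences can be used):

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron"
            queryBinding="xslt2">
  <sch:pattern>
    <sch:title>Controlled vocabulary for @is (all names invented for this sketch)</sch:title>

    <!-- Kinds of object the project currently recognizes. -->
    <sch:rule context="object">
      <sch:assert test="@is = ('drug', 'brand')">
        Unrecognized kind of object: "<sch:value-of select="@is"/>".
      </sch:assert>
    </sch:rule>

    <!-- Names that may appear on price fields. A developer prototyping
         a new field simply adds its name to the list. -->
    <sch:rule context="price">
      <sch:assert test="@is = ('dispensed-price-for-maximum-quantity',
                               'manufacturer-price')">
        Unrecognized price name: "<sch:value-of select="@is"/>".
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Because the lists live in one small file that the developers own, extending the vocabulary is a one-line change rather than a schema revision.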

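# The xs prefix used for xs:token below is not predeclared in RELAX NG Compact, so bind it.
datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"

## objects is the top-level element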
## It contains a section in simplified HTML which should explain the purpose of the file
## and give links to codes and object types enough so that other developers can easily use it.
objects = element objects { html?, object+ }

## The object is the primary unit of organization.
## An object is equivalent to an important table, such as a Drug, or Brand
##
## Fields:
## @is The specific kind of object. Objects are very generic, so @is gives the details.
## This is a token, controlled by the project. Required.
## @code If the object has a natural code, this is where it goes.
## This is a token, depending on the data. Optional.
## @id A unique identifier.
## This may be any token, not just conforming to XML ID rules. Optional
## @sort A sort key that may be used for objects of this kind
## name A clear name for the object. For higher-level objects this is probably for
## information and clear markup, not for direct usage.
## metadata These are housekeeping elements used to tie into the feeding system and
## help with any effectivity requirements
## set All information in an object is grouped into sets. This allows more accurate
## representation of containers
object = element object {
  attribute is { xs:token },
  attribute code { xs:token }?,
  attribute id { xs:token }?,
  attribute sort { text }?,
  element name { text }?,
  metadata?,
  set+
}  # eg drug

## A set is a group of related information.
## A set is equivalent to an important collection, in particular for a line in a schedule
##
## Fields:
## @is The specific kind of set. Sets are very generic, so @is gives the details.
## This is a token, controlled by the project. Optional.
## @code If the set has a natural code, this is where it goes.
## This is a token, depending on the data. Optional.
## @id A unique identifier.
## This may be any token, not just conforming to XML ID rules. Optional
## metadata These are housekeeping elements used to tie into the feeding system and
## help with any effectivity requirements
set = element set {
  attribute is { xs:token }?,
  attribute code { xs:token }?,
  attribute id { xs:token }?,
  metadata?,
  data-object+
}

## A rich and generic set of data objects are allowed.
## These use @is and @code to keep business requirements (almost) completely out of this schema.
##
## price Any simple price. Units can be represented using @units if needed.
## quantity Any simple quantities of any amount. Units can be represented using @units if needed.
## code Any simple code value. (The value of the element is the code, not the @code attribute.)
## flag Any boolean value. True is signalled by having the element. False by not having the element.
## Any binary information that will simplify formatting the data etc. can be a flag.
## member The member group code
## goes as a data value, not a @code.
## object-ref This is a reference to an object elsewhere in the XML, by the target's @id.
## These might be used, for example, if shared restrictions were useful.
## The sort attribute is an order key that may be used for ordering the object.
## text A simple plain text field. Be careful to only use this when necessary
## rich-text A rich-text field. This text is marked up using the simplest HTML blocks
## a An HTML link to related information.

In the conventional case, we might have an XML Schema which has abstract base types (or elements) for objects, text etc., and which uses type derivation (or substitution groups) for the specific schema.

The version of this instance using a highly generic schema is a permutation which takes the generic information (the abstract type or substitution group head) and uses it as the element name (i.e. the generic identifier), and denormalizes (repeats) common fields:
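As a hypothetical illustration of that permutation (every element name, @is name and value here is invented rather than taken from the application), a conventional instance might look something like this:

```xml
<Drugs>
  <Drug code="D0001">
    <GenericName>Examplamycin</GenericName>
    <ScheduleLine code="L01">
      <BrandName>Examplex</BrandName>
      <MaxQuantity>20</MaxQuantity>
      <DispensedPrice>12.50</DispensedPrice>
      <AuthorityRequired>true</AuthorityRequired>
    </ScheduleLine>
  </Drug>
</Drugs>
```

and a highly generic counterpart, following the shape of the grammar above, something like this:

```xml
<objects>
  <object is="drug" code="D0001" id="drug-D0001">
    <name>Examplamycin</name>
    <set is="drug-schedule-line" code="L01">
      <!-- Repeated from the object, so the set can be read on its own. -->
      <text is="drug-generic-name">Examplamycin</text>
      <text is="brand-name">Examplex</text>
      <quantity is="maximum-quantity-per-prescription" units="tablets">20</quantity>
      <price is="dispensed-price-for-maximum-quantity" units="AUD">12.50</price>
      <!-- A flag is true simply by being present. -->
      <flag is="requires-authority-approval"/>
    </set>
  </object>
</objects>
```

The element name now says how the field is processed, and @is says exactly which field it is.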

I think this second way has advantages over the first when maneuverability is a concern.

XML Schemas is weak in the area of maintenance: if you want to mark that a certain element is obsolete but can still be used, there is simply no reliable way, which makes it a poor choice when the schema is changing rapidly. In Schematron, you just make a report element that lets you know that an old name is being used, without it being significant for validation. Having to look up the (abstract) types in the schema is also a level of indirection (and of knowledge about schemas) more than should be necessary: many good XML developers have, want or need little awareness of XML Schemas. Furthermore, using an XML Schema will inevitably mean that a schema change cannot be made ad hoc but will have to be negotiated with some supervising party.
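A minimal sketch of such a report in ISO Schematron follows. The old and new names are invented for the example, and role="warning" is a common convention for marking a message as non-fatal, not something the application is known to use.

```xml
<sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <!-- Kept in its own pattern so that it still fires when another
       pattern also has a rule matching the same element. -->
  <sch:rule context="*[@is = 'max-qty']">
    <sch:report test="true()" role="warning">
      The name "max-qty" is obsolete but still accepted;
      use "maximum-quantity-per-prescription" in new data.
    </sch:report>
  </sch:rule>
</sch:pattern>
```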

And the lack of denormalization means that the developer using the data will always have to do an extra step of looking up: if the city field is part of the Address element then it won't be in the BillableAddress and the OfficeAddress, while if it is part of those two it won't be part of the Address element.

Functions make using a Highly Generic Schema and XSLT2 easier

In this approach, access to data using XSLT2 has some extra safeguards too: we use custom functions for XPaths used in xsl:for-each or other selection and sorting cases. This overcomes the common problem that long XPaths need to be documented, but infrequently are: having to make up a function name is a way to tightly couple simple documentation (i.e. the name) with an XPath. So function accesses look like this:
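A minimal sketch of such a function and its use might be as follows. The f namespace URI, the function name and the @is values are all invented for the example; only the general technique of wrapping a selection XPath in a named xsl:function comes from the text above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:f="http://example.com/ns/hgs-functions"
    exclude-result-prefixes="f">

  <!-- The function name documents what the XPath selects. -->
  <xsl:function name="f:schedule-lines-for-drug" as="element(set)*">
    <xsl:param name="drug" as="element(object)"/>
    <xsl:sequence select="$drug/set[@is = 'drug-schedule-line']"/>
  </xsl:function>

  <xsl:template match="object[@is = 'drug']">
    <!-- Iterate over the drug's schedule lines, ordered by their code. -->
    <xsl:for-each select="f:schedule-lines-for-drug(.)">
      <xsl:sort select="@code"/>
      <xsl:value-of select="price[@is = 'dispensed-price-for-maximum-quantity']"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Declaring the parameter and return types means that a call made on the wrong kind of element fails early instead of silently selecting nothing.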