This paper addresses the translation of BUFR bulletins into a canonical XML. “Canonical” here means a standard XML, repeatable across implementations, which contains all the information available in the BUFR bulletin.

BUFR itself is highly structured (both model and format), but most existing decoders create structured text and/or XML which do not address the hierarchical processes in BUFR Generalised Coordinates, which is necessary to address ISO and OGC data models and formats.

This is the main target of the paper.

Transforming BUFR to an ISO/OGC XML data model and format is the ultimate aim, but the vastness and detail of the 450 BUFR Tables, with 1500 “descriptors” and millions of potential modified descriptors, make it impossible to transform directly and automatically to ISO Feature Catalogues.

The task of automatically transforming all BUFR features, inherited features, discriminated features, association, attributes and coverages into ISO/OGC feature catalogues is beyond our capabilities at present. It would also produce a specialist’s feature catalogue, rather than the catalogue for less experienced users which is generally intended.

Instead it is recommended that the canonical XML should be developed and used to identify features required in restricted feature catalogues for individual communities, such as for Aviation, for Hydrology or for particular public requirements such as INSPIRE. This would allow the different catalogues to be maintained in line with the WMO BUFR maintenance process.

2

BUFR to XML

What is so difficult about converting BUFR to XML? In Table 1, Column 1 is a truncated decode from a BUFR decoder. This is strongly structured data (as is a BUFR bulletin, but heavily coded) and so can convert to XML readily. Columns 2 and 3 are two forms of XML which are direct transposes of the BUFR decode of Column 1.

What is so difficult?

Where’s the problem?

What was the value in doing this? What is the extra value of XML over structured BUFR decodes?

Draft 0.7

13/12/2013

3

of25

3

Comments on Table 1:

Before expanding on these questions, it is convenient to make several comments on arbitrary options which have been made in Table 1.

3.1

BUFR Coding and Decoding.

3.1.1

BUFR Coding gets much of its compressive power from separating descriptor names from descriptor values but retaining a 1-1 mapping. Then both the descriptors and the values are compressed by referring everything, descriptor IDs, names, descriptions, code tables, units and value scaling, to tables. Everything is a reference. Standard groups of descriptors are tabulated as Table D descriptors –

in a pre-defined bit length. Further compression can also be performed by grouping similar values together and repeating the value transformations but to sample scaling.1

3.1.2

BUFR Decoding is the process of reversing the coding: referring to the tables; expanding the descriptor sequences; removing the coding descriptors; re-establishing the values. An extra stage, sometimes, is to expand the code table references and remap the descriptors and values to descriptor-value pairs (with or without including the units, names or descriptions).

1

It is instructive that the December 2009 W3C compression mechanism EXI Efficient XML Interchange http://www.w3.org/XML/EXI/

could have been, and still could in later processing. Replications stay to help parse the decoded list. Initially Table D descriptors were also retained in the XML conversion, but it was quickly recognised that once expanded they had little residual parsing value: they are not modules or proper collections, but really only macros to help parsimonious coding.

3.2.2

Instead of the common BUFR descriptor ID of 0 01 001, this document uses F0X01Y001. This is a convenience in a number of ways, but ultimately XML forces a change from the WMO ID. XML element names and IDs are defined as NCNames2. These may not begin with a number nor contain spaces. It was convenient to delimit the BUFR ID with F, X and Y. The CREX-ified version of B01001 is less recognisable and there are circumstances where clear delimiting into F, X and Y is necessary in further processing the XML.
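The renaming described in 3.2.2 can be sketched in a few lines of Python. This is an illustration only and not the pilot code; the function names fxy_to_ncname and ncname_to_fxy are hypothetical.

```python
import re

def fxy_to_ncname(fxy):
    """Turn a WMO-style ID such as '0 01 001' into a name such as 'F0X01Y001'."""
    digits = re.sub(r"\s+", "", fxy)          # drop the spaces between F, X and Y
    if not re.fullmatch(r"\d{6}", digits):
        raise ValueError("expected F XX YYY (six digits), got %r" % fxy)
    # The leading letter makes the name a legal NCName; X and Y delimit the fields.
    return "F%sX%sY%s" % (digits[0], digits[1:3], digits[3:])

def ncname_to_fxy(name):
    """Split 'F0X01Y001' back into the integers (F, X, Y)."""
    match = re.fullmatch(r"F(\d)X(\d{2})Y(\d{3})", name)
    if match is None:
        raise ValueError("not an FfXxxYyyy name: %r" % name)
    return tuple(int(part) for part in match.groups())

print(fxy_to_ncname("0 01 001"))
print(ncname_to_fxy("F0X12Y101"))
```

Keeping F, X and Y separately recoverable is what later processing relies on, for example when testing whether X is less than 10.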

3.2.3

The XML in Column 2 is a sort of “strong” or “hard” typing, where the element names are the BUFR IDs; consequently a full BUFR schema will require as many element tags as there are BUFR descriptors: around 1500. However, for checking indentation and branch matching later in this paper, this form is used extensively. It also simplifies the use of XPATH definitions. It was convenient to enclose the whole subset in a <Subset> </Subset> root tag, and replications in a <Repl> </Repl> tag.

3.2.4

Column 3 is an example of “weak” or “soft” typing, where the element name is general and the specific identity is assigned to an attribute, usually an IDREF. Where the schema for Column 2 would be extensive, for Column 3 it would be brief. Another difference in this version is that missing values are listed as empty tags <descriptor ../>. They could have retained the value “MISSING” as in Column 2, or descriptors with missing values could have been removed completely. All the element tags are either <coordinate>, to denote BUFR Generalised Coordinates (Table B classes with X less than 10), or <descriptor>, to denote classes of 10 or greater. Perhaps more ISO-like terms are needed.

3.2.5

In the example of Column 3, this allows the BUFR ID to be assigned to an XML attribute of bufrID. This could be used as a reference to Table B and to the Code Tables expressed in XML, if the BUFR IDs were expressed as XML IDs. This would allow standard XML cross referencing (with certain expansions in the XML which have been removed for clarity). It is an easy XSLT conversion to convert all F0XxxYyyy element tags into XML IDREFs.
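The step from the hard typing of Column 2 to the soft typing of Column 3 is described above as an XSLT conversion. The sketch below uses Python's standard ElementTree as a stand-in for XSLT, purely for illustration; the function name to_soft_typing and the sample values are assumptions, while the element and attribute names (coordinate, descriptor, bufrID) follow the text.

```python
import re
import xml.etree.ElementTree as ET

FXY_TAG = re.compile(r"F0X(\d{2})Y\d{3}")

def to_soft_typing(elem):
    """Return a copy of elem in which F0XxxYyyy tags become generic elements."""
    match = FXY_TAG.fullmatch(elem.tag)
    if match:
        # Classes with X < 10 are Generalised Coordinates, the rest plain descriptors.
        tag = "coordinate" if int(match.group(1)) < 10 else "descriptor"
        new = ET.Element(tag, {"bufrID": elem.tag, **elem.attrib})
    else:
        new = ET.Element(elem.tag, dict(elem.attrib))
    new.text = elem.text
    new.tail = elem.tail
    for child in elem:
        new.append(to_soft_typing(child))
    return new

# Made-up sample in the hard-typed style.
hard = ET.fromstring(
    "<Subset><F0X04Y001>2013</F0X04Y001><F0X12Y101>281.5</F0X12Y101></Subset>"
)
soft = to_soft_typing(hard)
print(ET.tostring(soft, encoding="unicode"))
```

The conversion is lossless in this direction because the original tag survives in bufrID, so either form can serve as the working representation.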

3.2.6

Indeed there are lots of possible expansions of the XML. These would load up the components of the BUFR tables into the XML, particularly for further processing. However, once the BUFR is in XML, it is possible to use standard and universal software, written in XML-enabled languages (even in XSLT which itself is XML), to augment or extract information as required.

2

http://www.w3.org/TR/REC-xml-names/#NT-NCName


4

The Problem…:

The problem with the XML transpositions in Table 1 is that there is considerable BUFR coding still to be resolved. The problem is the way that Generalised Coordinates (BUFR classes less than 10, and particularly class 8, significance qualifiers) are used, which is not decoded in Table 1.

4.1

BUFR hierarchical elements

4.1.1

Generalised Coordinates are the way BUFR handles at least 4 mechanisms in XML and in the ISO 19100 standards, which include ISO 19136, the ISO standard of OGC’s GML (Geography Markup Language).

4.1.2

The BUFR tables are a mix of two processes, modelling tables and coding tables. Unfortunately there is not a totally clean separation, and that is one way in which a direct comparison with ISO is a problem. Table D, Table C and the replication descriptors are mostly coding constructs, while Table B, the Code and Flag tables, the Common Code tables and Table A are mostly modelling constructs. The ISO standards are mainly concerned with modelling, because the coding and formats of instances are derived from the models.

4.1.3

GML is a language to create application schemas, which are XML schemas for particular domains of interest. GML is based on geographic features. A feature in ISO terms is an “abstract representation of a real world phenomenon”. We might regard a way of specifying temperature as one of our feature types. Types of features are defined in feature catalogues, and there are 4 ISO standards related to features, feature catalogues and catalogue registries. These define the types of features, but individual features are found in instances of the feature types. For example, a temperature measurement at a specific place and time is an instance of a temperature feature within a BUFR bulletin (a BUFR “instance”).

4.1.4

The other way in which a direct comparison between the BUFR modelling tables and ISO feature catalogues is a problem is that BUFR tables define ways to create features, as well as containing simple features. Feature catalogues, on the other hand, effectively contain complete (and small compared to BUFR) lists of features. Although the ISO modelling process allows ways to derive features from other features using concepts of inheritance (in ISO 19110) and derived features (in ISO 19126), these are intended to be fully enumerated in a feature catalogue.

4.1.5

There are far too many potential BUFR features and derived features to list exhaustively in a complete BUFR feature catalogue. It is possible, though, to define a subset of BUFR features in a catalogue for a specific community. EUROCONTROL’s WXCM is a project to model an aviation meteorological feature catalogue.

4.1.6

Coming back to the ways in which BUFR Generalised Coordinates (GCs) handle XML and ISO:

1)

BUFR has no immediate hierarchy between descriptors, except through GCs. GCs handle some of XML’s flexibility in parent/child and ancestor/descendant structures.


2)

Features have feature attributes (quite distinct from XML attributes), which are subordinate properties that help distinguish one feature from another. Some examples:

a)

Class 2 defines instrumentation information

b)

F0X04Y080 defines averaging periods

c)

F0X08Y023 defines statistical roles (e.g. maximum or minimum)

3)

ISO recognises some features as coverages. Coverages are geographical (lat/long) distributions of features with a different value at every location, which can also be dependent on a second variable with a defined value. Geographic features are more often an object (such as a bridge) which has an unchanging position. In fact (almost) all meteorological features are coverages. BUFR GCs, particularly of class 8, often define the second variable; e.g. vertical significance defines the role which the second variable carries: a sounding will give temperature, RH, wind speed and direction at successive standard levels in the vertical.

4)

GCs define derived features. A dry-bulb temperature of F0X12Y001 can be modified by significance qualifiers to define screen temperature, ground temperature, concrete, sea surface, soil depth, ocean depth, and upper atmosphere temperatures (although there are also some specific descriptors which are the same derived features, e.g. sea surface temperature is F0X22Y049, amongst others!)

4.1.7

GCs can exhibit several such functions at the same time. For example, in a sounding, Class 8 vertical significance qualifiers group the dependent features of temperature, RH and winds, as well as signifying the role of the independent vertical variable which is the coverage term.

4.1.8

This potential confusion between ISO modelling functions suggests two things: that the ISO modelling functions have a degree of arbitrariness if one BUFR function can replicate their duty; and that automatic conversion to a GML application will be difficult to achieve.

4.1.9

There are yet more complications: GCs in a BUFR decoding have start and end terms, or start, change and end terms. Some GCs (e.g. temporal values), when they are repeated, define ranges rather than point values.

4.1.10

So, in Table 1, the GCs contain both coding and modelling information which must be further processed to give explicit descriptions of the instances of BUFR features. Actually, this is done already, whenever BUFR bulletins are decoded and stored in databases, but not in BUFR form (e.g. broken into components in a relational database). Unfortunately this is usually done for multiple BUFR bulletin types in multiple applications which are very specific to individual database implementations. There is little commonality.

5

Why should we want to do this?

Why should we want to decode BUFR terms into XML, in particular XML specified by a GML application schema?


5.1

Motivation

5.1.1

There are many reasons, of which four are:

1.

consistent with WIS, to refactor WMO standards to be compatible with ISO geographic standards;

2.

to bring our standards up-to-date and so to gain economic advantage (cheap commodity software will be able to use our data with little effort);

3.

XML/GML formats are aimed at Web Service (WS) delivery of data on the basis of a WS request, and will carry extractions and aggregations of data, e.g. a WS request for all the temperatures for Western Europe;

4.

within Europe, there are two pieces of legislation (SES for aviation and INSPIRE for many environmental themes) which mandate ISO 191xx standards in data modelling and derived data formats. External requirements for ISO standards will only increase.

5.1.2

BUFR was designed more than 20 years ago, in the era when SGML was being developed and long before XML. Over the years some of the design features have been compromised; for example, the distinction between modelling and coding has not been fully adhered to in later additions to the BUFR tables.

On the other hand, over the last decade, ISO, W3C and OGC have been developing geographic standards which have considerable commonality with WMO standards. The new standards are gaining traction, intellectually and commercially, in the wider world outside WMO. It is inevitable that there will be considerable economic advantage to us and to our users to provide data in a widely accepted standard form.

5.1.3

However, in linking BUFR to ISO it is recognised that mapping the concepts, vocabularies and practices between them is difficult, particularly when we require continuity in exchanging the gigabytes of the existing daily data exchange over the GTS. This use of ISO standards is consistent with WIS.

6

What is the aim in creating a BUFR-XML?

6.1

The ISO–OGC task

6.1.1

As explained above, there are lots of choices to make in creating XML in any case. Since one eventual target is to create XML based on a GML application schema, this XML would look like the Observations and Measurements models from OGC3. Many of these forms are quite natural for BUFR data. The process of creating a GML application schema for environmental data based on O&M and all the ISO standards is to be found in the Solid Earth and Environmental GRID (SEEGRID) TWiki describing the Hollow World application schema template4.

These are extensive documents which describe the modelling process, and the Hollow World process gives a detailed prescription on how to do this.

6.1.2

However at the root of it all is the requirement to identify features.

3

http://www.opengeospatial.org/standards/om

4

https://www.seegrid.csiro.au/twiki/bin/view/AppSchemas/HollowWorld


6.2

Individual Community requirements

6.2.1

For the requirements of a particular community (for example, Aviation), it is not conceptually difficult to identify the features of interest, which are a small subset of the BUFR descriptors.

It is much less clear how to specify the process generally.

6.2.2

Many BUFR recipients will have gone through the process themselves: decoding BUFR then further parsing the output to extract recognisable “features” to store in relational databases. Whereas some databases use BUFR structurally, in many others BUFR is just the exchange format, discarded after decoding and further parsing.

What is needed is a formal link between the BUFR decode and the target feature catalogue(s).

6.3

The Aim of this paper

The target of this paper is an intermediate XML, a “canonical” XML in which the principles of the BUFR model are embedded, and with which the features can be formally defined using XML dialects such as XPATH.

These formally defined features can be listed in feature catalogues.

7

How do we decode the Generalised Coordinates?

7.1

Basic Assumptions about Generalised Coordinates

7.1.1

The way in which Generalised Coordinates are transformed into an XML representation relies on the understanding that BUFR GCs modify each BUFR descriptor upon which they operate5. In the XML here, a GC is treated as a parent or ancestor of the descriptor. Multiple GCs modifying a descriptor keep their order as an ancestral tree as they open. They are closed in inverse order, maintaining proper XML validity, where parent opening and closing tags wrap a child tag.

7.1.2

As in Table 1, Column 3, GCs with an X value less than or equal to 9 are distinguished as “coordinates”, and descriptors with an X value (class) greater than 9 as simply descriptors. Alternatively they might be called “modifiers” and “features”, which is more ISO-like.

7.1.3

The set of features modified by a GC until it is closed or changed in value is the scope of the GC. Indeed, when an existing GC is changed, this is represented in the XML by the closure of the scope of the existing GC and the opening of a new GC. This new GC actually has the same GC ID but a new value. GCs with a missing value have a null scope.

5

There are known to be exceptions; for example, there are GCs or GC branches which can stand alone without a child descriptor/feature, and GCs may assign a “role” to a GC rather than a feature. These are addressed later.


7.1.4

Issue

One problem recognised in developing this paper is an apparently inconsistent treatment of missing GCs in the examples tested. Sometimes they are not closed appropriately, and sometimes there are missing or unspecified values which open a GC. This should be clarified by the ET DR&C.

7.1.5

A second property is to recognise the type of a GC. For this algorithm, the types include single opening and closing GCs. GCs can be doubled to define a range or duration. These must be recognised as the first and second opening double GC and will be closed in the new sequence by a single closing GC, even if the range is changed by a new double GC with the same ID. In the XML output a double GC is replaced with a single GC with two values. Features or simple descriptors are just that, and have no other discrimination.
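The replacement of a double GC by a single GC with two values can be sketched as follows. This is not the pilot Perl code. It assumes that “doubled” means two consecutive entries with the same GC ID, and that the decoded sequence is a list of (ID, value) pairs; the function names and the sample values are made up for the illustration.

```python
def is_gc(descriptor_id):
    """A Generalised Coordinate is a Table B descriptor whose class X is below 10."""
    return descriptor_id.startswith("F0X") and int(descriptor_id[3:5]) < 10

def merge_double_gcs(sequence):
    """Collapse two consecutive GCs with the same ID into one GC carrying two values."""
    merged = []
    i = 0
    while i < len(sequence):
        current_id, value = sequence[i]
        is_double = (
            is_gc(current_id)
            and i + 1 < len(sequence)
            and sequence[i + 1][0] == current_id
        )
        if is_double:
            merged.append((current_id, [value, sequence[i + 1][1]]))
            i += 2          # both halves of the double GC are consumed
        else:
            merged.append((current_id, [value]))
            i += 1
    return merged

# Illustrative values only: a time-period GC given twice, then one feature.
decoded = [("F0X04Y024", -6), ("F0X04Y024", 0), ("F0X13Y011", 2.4)]
ranged = merge_double_gcs(decoded)
print(ranged)
```

A repeated feature descriptor is deliberately left alone, since only GCs are described as defining ranges.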

7.2

An outline algorithm

7.2.1

The algorithm to parse the Generalised Coordinates is described in outline form, although it has been implemented in a pilot program written in Perl. The Perl program parses BUFR from binary into the decode sequence as in Table 1 Column 1, then decodes GCs. It then writes out the decoded BUFR in valid XML. The program does not work for all BUFR functions; specifically, it does not deal with secondary BUFR compression, the quality control operations or Table C bit map coding.

7.2.2

The Perl GC parsing algorithm is a pilot and is not designed for efficiency: there are multiple (forward) passes through the BUFR sequence. However, since the BUFR sequence is replicated across each subset, parsing a subset is not lengthy. The greatest time taken is reading and parsing the BUFR tables at the beginning of each run.

7.2.3

The Perl algorithm requires that the BUFR decoded sequence (such as Table 1 Column 1) is held in an ordered list of structures. Each structure holds the important properties of the descriptor such as the ID, the F, X and Y codes separately, the units, the decoded value and a namespaced descriptor name (this last property is novel; these names are available from the author, but are not required for the algorithm). The structure’s property set can be expanded as required, for example to identify GC scope and type.

The list of structures must be capable of being added to and reduced, both internally and at the front or end if required. In the pilot program this was done by forming a second list which holds the pseudo-XML list, which is the original BUFR list with amended opening and properly closed GCs. Perl arrays of references to hash structures work very well here.

7.2.4

During the parsing, at any point in traversing the sequence, a secondary traverse often needs to be instigated. In order to keep track of changed or closed GCs which are added to the pseudo-XML list, stacks representing the depth of the GC ancestor tree are needed for the current point of each traverse. As GCs open or close, the stack should be “pushed” or “popped” with the GC ID, as a stack is a LIFO structure.

7.2.5

However, in BUFR there is no requirement to maintain the opening and closing order as in XML. This can be illustrated easily. Let A, B, C and D be different GCs. /A, /B, /C and /D are closing GCs, although the end of a subset implies that all opened GCs are closed, so that closing GCs are then not necessary. Here x is any non-GC descriptor.

BUFR allows:

Start A x B x /A /B end

While XML requires

<root>
  <A>
    <x/>
    <B>
      <x/>
    </B>
  </A>
</root>

Where </B> must come before </A> since tag A is the parent of tag B.

7.2.6

This must be handled properly. Table 2 illustrates the sequence. The new task is that empty GC opening and closing sets can arise. These should be recognised as constructs of the algorithm and removed from the pseudo-XML list. If the removal leaves any further empty GC tags in the pseudo-XML list, these are removed too.
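The reordering and the removal of empty GC tags described in outline above might be sketched as below. This is a simplified Python illustration, not the pilot Perl program: the token format, the function names nest and prune_empty, and the choice to ignore a close for a GC that is not open are all assumptions, and GC values are omitted.

```python
import xml.etree.ElementTree as ET

def nest(tokens):
    """Build a properly nested tree from (kind, name) tokens.

    kind is 'open' or 'close' for a GC, or 'feature' for any non-GC descriptor.
    """
    root = ET.Element("root")
    stack = [root]                     # LIFO: stack[-1] is the innermost open element
    gc_tags = set()
    for kind, name in tokens:
        if kind == "feature":
            ET.SubElement(stack[-1], name)
        elif kind == "open":
            gc_tags.add(name)
            stack.append(ET.SubElement(stack[-1], name))
        elif kind == "close":
            depths = [i for i, elem in enumerate(stack) if i > 0 and elem.tag == name]
            if not depths:
                continue               # assumption: a close without an open is ignored
            depth = depths[-1]
            inner = [elem.tag for elem in stack[depth + 1:]]
            del stack[depth:]          # close the inner GCs, then the named GC
            for tag in inner:          # reopen the inner GCs outside the closed branch
                stack.append(ET.SubElement(stack[-1], tag))
    # End of subset: anything still open is implicitly closed.
    prune_empty(root, gc_tags)
    return root

def prune_empty(elem, gc_tags):
    """Remove GC elements left without children, working from the leaves upwards."""
    for child in list(elem):
        prune_empty(child, gc_tags)
        if child.tag in gc_tags and len(child) == 0:
            elem.remove(child)

tokens = [("open", "A"), ("feature", "x"), ("open", "B"), ("feature", "x"),
          ("close", "A"), ("close", "B")]
tree = nest(tokens)
print(ET.tostring(tree, encoding="unicode"))
```

With the sequence Start A x B x /A /B end, closing A forces B to close first and reopen outside A; the reopened B then closes empty and is pruned, leaving the nesting that XML requires.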

7.3

Issues with the basic assumptions.

7.3.1

The algorithm presumes that GCs must modify a descriptor. They cannot stand in a hierarchy empty of features. However, one exception has been encountered and a solution found. Other exceptions are likely to exist, but have not yet been recognised in testing so far.

7.3.2

The modifier F0X08Y021 is a significance qualifier which assigns a time significance role to a set of time/date qualifiers. This assigns roles such as “Time Series”, “Time Average”, “Accumulated”, “Forecast” etc. The algorithm to produce XML-valid trees will close all GCs/modifiers internal to F0X08Y021, then close F0X08Y021 and reopen the internal modifiers externally to the F0X08Y021 branch. It will then iterate through the nested empty tagset, removing empty tags and then removing the F0X08Y021 tag!

7.3.3

Particular examples occur in BUFR bulletins holding BUOY data. Here the F0X08Y021 code value is 26, assigning “Time of last known position” to the time stamp.

7.3.4

A workaround was found to stop the modified modifier tagsets being deleted. In an initial step at the start of the parsing sequence all occurrences of Class 08 were detected in a traverse. If the scope of the Class 08 modifier included a feature/descriptor, no further action was taken. If however no feature was found, a generic feature was inserted just before the Class 08 modifier closed. Arbitrarily an F0X35Y000 feature with a (non-existent) value of 0 was chosen. This is the FM number for International and Regional codes.
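The workaround can be sketched in Python as below. The paper describes it as an initial pass over the decoded sequence; for brevity this sketch assumes it is applied instead to an XML tree before empty tags are pruned. The function names and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

FXY_TAG = re.compile(r"F0X(\d{2})Y\d{3}")

def is_feature(elem):
    """True for descriptors of class 10 or greater (not Generalised Coordinates)."""
    match = FXY_TAG.fullmatch(elem.tag)
    return match is not None and int(match.group(1)) >= 10

def protect_class08(root):
    """Give every feature-less Class 08 modifier a placeholder F0X35Y000 child."""
    inserted = 0
    for elem in list(root.iter()):     # snapshot, since the tree is modified below
        match = FXY_TAG.fullmatch(elem.tag)
        if match is None or match.group(1) != "08":
            continue
        if not any(is_feature(node) for node in elem.iter()):
            # Appended last, i.e. just before the modifier closes.
            ET.SubElement(elem, "F0X35Y000").text = "0"
            inserted += 1
    return inserted

# Made-up BUOY-like fragment: a time significance modifying only a date coordinate.
doc = ET.fromstring(
    "<Subset><F0X08Y021><value>26</value>"
    "<F0X04Y001><value>2013</value></F0X04Y001></F0X08Y021></Subset>"
)
count = protect_class08(doc)
print(count, ET.tostring(doc, encoding="unicode"))
```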

7.3.5

Issue:

This workaround retains the information for later processing, but it is certainly not an optimal solution. A better one needs to be found.

but where the higher precision set is introduced after descriptor 100 (the 2.F0X12Y001 has BT1_) - BT2 and the namespace is delimited by the underscore – an allowable XML character.

8.2.6

Missing elements are retained for testing. In practice these could and probably should be deleted.

<Repl>
  <F0X08Y002 name="BCS_surfaceVerticalSig">
    <value units="CODE TABLE ">21</value>
    <F0X20Y011 name="BObs_cloudAmount">
      <value units="CODE TABLE ">6</value>
    </F0X20Y011>
  </F0X08Y002>
</Repl>

Vertical significance (surface observations))

has a First instrument detected cloud layer).

8.2.8

Issue for further development

Code/Flag tables need to be distinguished. They need to be treated differently downstream.

8.2.9

Here the modifier gives a Significance Qualifier to the child features. In ISO terms this would be a feature attribute (probably a role) of the features. But feature attributes are child properties of features. So in a pseudo-OGC O&M format this might be coded as

<Repl>
  <feature bufrID="F0X20Y011" name="BObs_cloudAmount">
    <value units="CODE TABLE " code="6">6 oktas</value>
    <featureAttribute bufrID="F0X08Y002" name="BCS_surfaceVerticalSig">
      <value units="CODE TABLE " code="21">First instrument detected cloud layer</value>
    </featureAttribute>
  </feature>
</Repl>
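The inversion from a modifier wrapping its features to features carrying the modifier as a child can be sketched as follows. This is a single-level illustration only: nested modifiers are not handled, the code table look-up that yields text such as “6 oktas” is omitted, and the function name to_feature_form is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

def to_feature_form(modifier):
    """Turn <GC><value/><feature/>...</GC> into features with a featureAttribute child."""
    gc_value = modifier.find("value")
    features = []
    for child in modifier:
        if child.tag == "value":
            continue                          # the modifier's own value, handled below
        feature = ET.Element("feature", {"bufrID": child.tag, **child.attrib})
        for sub in child:                     # carry over the feature's own content
            feature.append(copy.deepcopy(sub))
        attribute = ET.SubElement(
            feature, "featureAttribute", {"bufrID": modifier.tag, **modifier.attrib}
        )
        if gc_value is not None:
            attribute.append(copy.deepcopy(gc_value))
        features.append(feature)
    return features

nested = ET.fromstring(
    '<F0X08Y002 name="BCS_surfaceVerticalSig">'
    '<value units="CODE TABLE ">21</value>'
    '<F0X20Y011 name="BObs_cloudAmount"><value units="CODE TABLE ">6</value></F0X20Y011>'
    "</F0X08Y002>"
)
features = to_feature_form(nested)
print(ET.tostring(features[0], encoding="unicode"))
```

Each feature receives its own copy of the modifier, so a modifier with several features in scope is repeated once per feature.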

8.2.11

However, it is possible to create feature catalogues much in the way in which all current BUFR recipients populate RDBMS columns. That is, to formally identify required features and to list these in a feature repository with a formal mapping to the finalised Canonical XML. This would be used either for server extraction from BUFR bulletin databases, or alternatively translated to local RDBMS implementations.

8.2.12

A hidden capability would be with the BUFR maintenance process. This mapping would make it possible to identify at an early stage any future BUFR Code changes which will have a knock-on effect on the feature catalogues. This could be communicated to the owners of the feature catalogues, which need NOT be the responsibility of DR&C or even of WMO.

Transforming to the formal Canonical XML would also be a useful stage in validating any new type of BUFR bulletin.


9

XPATH Identification of features in a Canonical BUFR-XML

XPATH6 is a language to address parts of an XML document. It is used in XSLT and with XLINK in XPointer (to include fragments of an XML document) for inclusion of parts of remote documents. It is an integral part of XQUERY, an XML querying language.

9.1

Formal reference to Canonical XML

9.1.1

An XPATH reference using the XML style of Table 3 might be used to address any dry-bulb temperature in any BUFR-XML document:

//F0X12Y001 | //F0X12Y101

This will find all nodes anywhere in the document (using //) with the element name F0X12Y001 (low precision temperature) and (using |) all nodes with the element name F0X12Y101 (high precision temperature). (This is to identify the feature, not really to select the data. To select the data along with station name, position and data time a more complicated XPATH could be used, or alternatively a more developed XQUERY.)
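As an illustration, the same selection can be reproduced in Python. The standard library's ElementTree supports only a subset of XPath and has no | union operator, so this sketch emulates the union by filtering a document-order traversal. The function name find_dry_bulb and the sample document are made up.

```python
import xml.etree.ElementTree as ET

DRY_BULB = {"F0X12Y001", "F0X12Y101"}   # low and high precision dry-bulb temperature

def find_dry_bulb(root):
    """Emulate //F0X12Y001 | //F0X12Y101: matching elements at any depth, in document order."""
    return [elem for elem in root.iter() if elem.tag in DRY_BULB]

doc = ET.fromstring(
    "<Subset>"
    "<F0X12Y001>281.5</F0X12Y001>"
    "<Repl><F0X12Y101>280.15</F0X12Y101></Repl>"
    "<F0X12Y103>275.0</F0X12Y103>"
    "</Subset>"
)
hits = find_dry_bulb(doc)
print([(elem.tag, elem.text) for elem in hits])
```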

9.1.2

Unfortunately the XPATH above would also detect all dry-bulb temperatures even if they are modified by a significance qualifier, which might be screen temperature (probably what is wanted), ground temperature, concrete, sea surface, soil depth, ocean depth, or upper atmosphere temperatures (probably what is not wanted).

9.1.3

So an air-temperature feature would have an XPATH definition which allowed unmodified dry-bulb temperatures (or perhaps only those identified as screen temperatures).

These can be expressed as XPATH functions, but not concisely. A full definition is out of place here. However, a specific community will use a restricted set of BUFR bulletins, probably a restricted set of Table A types, and this information can be used to make an XPATH definition:

//BUFRmetadata[dataCategory = "Surface data – Land"]/../../data//F0X12Y101

This would select all elements named “F0X12Y101” which are descendants of the element “data”, which has siblings (metadata – not specified) which have grandchildren (../../ comes up two generations) named “BUFRmetadata”, which have dataCategory elements with the value “Surface data – Land”.
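One possible reading of the “unmodified dry-bulb temperature” definition mentioned in 9.1.3 is a temperature element with no Class 08 significance qualifier among its ancestors. The Python sketch below implements that reading only; it is an assumption rather than a definition from this paper, and the function name and sample values are invented.

```python
import xml.etree.ElementTree as ET

DRY_BULB = ("F0X12Y001", "F0X12Y101")

def unmodified_temperatures(root):
    """Dry-bulb temperature elements that have no Class 08 ancestor."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    hits = []
    for elem in root.iter():
        if elem.tag not in DRY_BULB:
            continue
        node = parent_of.get(elem)
        modified = False
        while node is not None:
            if node.tag.startswith("F0X08Y"):   # a significance qualifier is in scope
                modified = True
                break
            node = parent_of.get(node)
        if not modified:
            hits.append(elem)
    return hits

# Made-up document: one plain temperature, one inside a Class 08 modifier.
doc = ET.fromstring(
    "<Subset>"
    "<F0X12Y101>281.5</F0X12Y101>"
    "<F0X08Y001><value>64</value><F0X12Y101>216.7</F0X12Y101></F0X08Y001>"
    "</Subset>"
)
plain = unmodified_temperatures(doc)
print([elem.text for elem in plain])
```

A community catalogue could equally keep only temperatures under a specific qualifier value; the ancestor test is the part that the canonical nesting makes straightforward.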

9.1.4

This is complicated but precise. Specifying the XPATH in a Feature Catalogue defining an “Air Temperature” feature would allow an XQUERY to be defined which extracted all the information from a set of BUFR messages, and which could be converted to an SQL query to a specific database. This would be set up automatically, and not manually.

10

Recommendation

Rather than take up the onerous task of automatically transforming all BUFR features, inherited features, discriminated features, association, attributes and coverages into ISO/OGC feature catalogues, it is recommended that the Canonical XML should be developed and used to identify features required in restricted feature catalogues for individual communities, such as for Aviation, for Hydrology or for particular public requirements such as INSPIRE.

6

http://www.w3.org/TR/xpath/

Further work would be needed for particular implementations, but there it would be possible to define formal definitions using W3C and ISO standards.