CML: Syntax and Semantics

An important property of SGML is that it clearly separates the syntax of
information from the semantics. Since these may not be concepts you've
come across, here's a simple introduction, and I hope that computer
scientists and SGML experts will forgive imprecisions.

Some analogies might help. Syntax could represent the nuts and bolts that
a car is built from, whilst semantics describes what you want to use it for.
Syntax is the frequencies that TV or radio use, whilst semantics is the
programming.

Syntax

Most legacy systems (existing chemical software) have a sizeable amount
of code dealing with problems like:

Will it display on a Mac?

What happens to lines longer than 80 characters?

How do I pass a carriage-return?

Are there any non-printing characters in this whitespace?

Can I send it through a mailer?

How do I read a VAX file?

and, later:

Which part of the program do I pass this data to?

How many more lines are there in this file?

Has this file been transferred without corruption?

How do I read in a triangular matrix?

Can I strip the table headings of this output before I send
it to another program?

Does the whitespace in this table represent a data item?

Am I expecting to read another table before I come to the bibliography?

These are all syntactic problems that are not related to molecules (or any
other discipline), and SGML allows you to tackle them independently of
what the information means and what it should be used for.

I have configured CML so that you don't need to worry about some of these
problems, and others (e.g. how control characters are passed) have
well-established mechanisms in SGML. SGML allows you to determine the
abstract structure of a document, for example: "first we have a FOO,
which must contain one or more XYZZYs but no BARs. Then there is either
a WIDGET or a PLUGH (but not both) and then no more than one FOO". Note that
what a FOO is doesn't matter at a syntactic level.
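
To make the FOO example concrete, here is a minimal sketch in Python (not SGML, and not part of CML) that treats a content model as a pattern over element names. The enclosing element name DOC and the function name children_are_valid are made up for illustration; in SGML the same rules would live in the DTD, as the comment suggests.

```python
import re

# Hypothetical content models, written as regular expressions over
# space-terminated element names. In a DTD these would be roughly:
#   <!ELEMENT DOC - - (FOO, (WIDGET | PLUGH), FOO?)>
#   <!ELEMENT FOO - - (XYZZY+)>
# DOC is an invented name for the enclosing element.
CONTENT_MODELS = {
    "DOC": re.compile(r"FOO (WIDGET |PLUGH )(FOO )?"),
    "FOO": re.compile(r"(XYZZY )+"),
}


def children_are_valid(parent, children):
    """Return True if the child element names fit the parent's content model."""
    sequence = "".join(name + " " for name in children)
    return CONTENT_MODELS[parent].fullmatch(sequence) is not None


print(children_are_valid("DOC", ["FOO", "WIDGET"]))           # allowed
print(children_are_valid("DOC", ["FOO", "WIDGET", "PLUGH"]))  # both: rejected
print(children_are_valid("FOO", ["XYZZY", "BAR"]))            # BAR: rejected
```

Note that the check never asks what a FOO or a XYZZY means; it only looks at names and their order, which is the point of a purely syntactic check.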

Essentially this is all that SGML gives you - a validated abstract document
structure, with validated character data. For example:
<PAIN QUANT=10>Danger!</PAIN>
could be part of a valid SGML document, but without defined semantics, what
it means could depend upon the language spoken in the country :-).

The syntactic components of CML are:

The DTDs. These must never
be altered without consent as they are used to validate the structure of the
CML documents. Changes will almost certainly cause the parsing to abort.

A parser (we use sgmls). The parser takes an SGML document
(e.g. 1ins.cml) and checks the syntax against the DTDs. If it's syntactically
incorrect the parser gives a (partially intelligible) error message and aborts.
If it's OK, a transformed document (1ins.esis) is output. This is not only
easier for a postprocessor to read, but also has the default attribute values filled in.
(The document has also been normalised, i.e. closing tags have been added
where required.)
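
As a rough illustration of why the parser output is easier for a postprocessor, here is a small Python sketch that reads it line by line. The sample text is written by hand in the line-oriented style that sgmls emits (attribute lines starting with A, then ( for a start tag, - for data, ) for an end tag, and a final C for a conforming document); it is not real output from 1ins.cml, and the CDATA attribute type is an assumption about how QUANT would be declared. The function name read_esis is invented.

```python
# Hand-written sample in sgmls output style (an assumption, not real output).
SAMPLE_ESIS = """AQUANT CDATA 10
(PAIN
-Danger!
)PAIN
C
"""


def read_esis(lines):
    """Yield ("start" | "data" | "end", payload) events from ESIS lines."""
    pending = {}  # attributes seen since the last start tag
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # Attribute lines come before the start tag they belong to.
            parts = rest.split(" ", 2)
            name, kind = parts[0], parts[1]
            value = parts[2] if len(parts) > 2 else None
            pending[name] = (kind, value)
        elif code == "(":
            yield ("start", (rest, pending))
            pending = {}
        elif code == ")":
            yield ("end", rest)
        elif code == "-":
            # Escape sequences inside data are left undecoded in this sketch.
            yield ("data", rest)
        # Any other line type, such as the final "C", is ignored here.


for event in read_esis(SAMPLE_ESIS.splitlines()):
    print(event)
```

Every line starts with a one-character code, so the postprocessor needs no knowledge of the DTD or of minimisation rules to recover the structure.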

Important note. SGML allows for considerable 'minimisation'
(i.e. components which can be unambiguously inferred by a parser can be omitted,
either to save space or to make the document more readable). For example, in
HTML you can write:
<UL>
<LI>Item one
<LI>Item two
</UL>
There is no need for a closing </LI> on each line because it's possible
to work out when the <LI> should be terminated. Note, however, that
you need to know how to write an SGML parser to do this, and it's non-trivial.
One way round this is for the software to read the output of the parser
(the *.esis file); if necessary the later parsing software from J. Clark
(SP, SPAM and nsgmls) can be used to produce normalised CML files. Note
that you must have access to the DTDs if you are going to parse or
normalise an SGML file.
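
To give a feel for what normalisation involves, here is a toy Python sketch that restores the omitted </LI> tags in the list example. It is emphatically not an SGML parser: the rule "a new <LI> or a closing </UL> ends the current LI" is hard-coded here, whereas a real parser derives it from the DTD. It also ignores attributes and nested lists. The function name normalise_list is invented.

```python
import re

# Matches either a bare tag such as <UL> or </LI>, or a run of text.
TAG_OR_TEXT = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>|([^<]+)")


def normalise_list(markup):
    """Insert the </LI> end tags that the minimised markup leaves out."""
    out = []
    li_open = False
    for closing, name, text in TAG_OR_TEXT.findall(markup):
        if text:
            out.append(text)
            continue
        name = name.upper()
        if name == "LI" and not closing:
            if li_open:
                out.append("</LI>")  # a new item ends the previous one
            li_open = True
        elif name == "LI" and closing:
            li_open = False          # the end tag was already present
        elif name == "UL" and closing and li_open:
            out.append("</LI>")      # the end of the list ends the last item
            li_open = False
        out.append("<" + closing + name + ">")
    return "".join(out)


print(normalise_list("<UL>\n<LI>Item one\n<LI>Item two\n</UL>\n"))
```

Even this one-element case needs some state to be carried along; handling every element of a real DTD this way is why reading the parser's output is the more practical route.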

Semantics

Semantics can be added to a parsed CML document in several ways:

Humans reading the document (preferably with a browser like cmlcost).

Linking the terms in the document to a glossary, so that further
meaning is added.

Inputting the document into a program which has been written or adapted
to take CML input and to 'know' the meanings of the terms.

Transforming the document into some other semantically-rich format.

We haven't really started on CML semantics yet - there are many possibilities:

running cmlcost with links to a glossary of terms.

retrieval of glossary items which can be used for semantic
validation ("Is this a reasonable value for this item?", "Are the values of
FOO and BAR in this file compatible?").
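
As a sketch of what glossary-based validation might look like (none of this exists in CML yet), the Python below looks a term up in a glossary and asks whether a value is reasonable. The glossary entries, their ranges and the function name check_value are all invented, and FOO and BAR are the same placeholders as above.

```python
# Hypothetical glossary: each term has a definition and an allowed range.
GLOSSARY = {
    "FOO": {"definition": "placeholder term", "min": 0.0, "max": 10.0},
    "BAR": {"definition": "placeholder term", "min": 1.0, "max": 5.0},
}


def check_value(term, value):
    """Return a list of complaints; an empty list means the value looks reasonable."""
    entry = GLOSSARY.get(term)
    if entry is None:
        return [f"{term}: no glossary entry"]
    if not entry["min"] <= value <= entry["max"]:
        return [f"{term}: {value} is outside {entry['min']}..{entry['max']}"]
    return []


print(check_value("FOO", 3.0))
print(check_value("BAR", 7.5))
```

A document can pass the parser and still fail a check like this, because the parser only validates structure and character data, not whether the values make sense.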