MGML - an SGML Application for Describing Document Markup Languages

Why MGML?

The Standard Generalized Markup Language (SGML) is the most
fully developed specification of the use of descriptive markup languages for
electronic documents. The idea of descriptive markup is
simple and powerful, and in fact has proved to be a basic requirement for many
advanced information processing applications.

Unfortunately, the adoption of SGML has proved surprisingly difficult,
expensive and slow, given that the underlying ideas are simple and
self-evidently good. Some of the perceived reasons have included:

The SGML standard itself is large, complex, and difficult to
understand.

The standard specifies several optional and advanced markup features, some
of which remain unimplemented.

Some of the features of SGML have proven counter-productive
in practical use.

Practical use of SGML requires learning several other languages,
including the language used to write DTDs, various stylesheet and
formatting languages, and the SGML/Open Entity Catalogue language.

The design of SGML takes little account of the contemporary theory of
formal languages and finite automata. One practical result is that SGML
parsers are unable to make use of some advanced tools and techniques made
possible by that theory. Consequently, they are large and complex pieces of
computer software; as such they (a) suffer from reliability problems, (b) have
in practice proven difficult to integrate into applications, and (c) change
slowly in response to advances in software and document processing
technology.

Nonetheless, there remains a consensus that SGML's basic design partition
into entities, elements, and attributes is correct and useful. One result is
a common tendency, in strategic projects involving SGML, to avoid using many
advanced features and operate within the bounds of a highly restricted subset.
This approach has generally met with success. However, this restricted subset
has been re-invented by each successive group that has attacked the
problem.

It is our opinion that SGML exhibits an extreme case of the "80-20
syndrome"; that is to say, 80% of the benefit is gained by applying only
20% of the machinery. It is the goal of
this project to formalize the definition of this useful subset, which we call
Minimal Generalized Markup Language, MGML.

The design goals are that MGML shall:

be an SGML application, and process a proper subset of
SGML documents

provide full support for the basic mechanisms (entities, elements, and
attributes) which have made SGML successful

unify the syntax of the meta-language and the generated languages (the DTD
and the instances)

be defined by a simple, compact, formal specification that allows the easy
implementation of MGML processors by taking advantage of standard
formal-language technology

exclude those portions of the SGML design which impair ease of
understanding, use, and portability

The Specification of MGML

The syntactic structure of MGML, which enables markup to be distinguished
from data, is hardwired and has been straightforwardly and completely
implemented using lex-style regular expressions.
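
As a rough illustration of what separating markup from data with regular expressions can look like, here is a minimal Python sketch. It is not the MGML lexer: the token grammar (SGML-style angle-bracket tags and &name; entity references) and the names TOKEN and scan are assumptions made for this example only, since the actual lexical rules are given by the MGML specification.

```python
import re

# Hypothetical token grammar, assumed for illustration: SGML-style tags and
# entity references. The real MGML lexical rules are not reproduced here.
TOKEN = re.compile(
    r"(?P<end_tag></[A-Za-z][\w.-]*\s*>)"
    r"|(?P<start_tag><[A-Za-z][\w.-]*(?:\s+[^<>]*)?>)"
    r"|(?P<entity_ref>&[A-Za-z][\w.-]*;)"
    r"|(?P<data>[^<&]+)"
)


def scan(text):
    """Split text into (kind, lexeme) pairs, separating markup from data."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            # A stray '<' or '&' that starts no recognized token.
            raise ValueError(f"unrecognized input at offset {pos}")
        # lastgroup is the name of the alternative that matched.
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, lexeme in scan("<p>Hi &amp; bye</p>"):
        print(kind, repr(lexeme))
```

Each alternative in the pattern plays the role of one lex rule; because the rules are plain regular expressions, no parser state is needed to decide whether a given stretch of text is markup or data.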

MGML is based on the Document Structure Definition (DSD). A DSD is a set
of structure definitions that apply to all documents of a given class. The
required content and structure of a DSD are defined by the MGML Reference DSD.
The behavior of a conforming MGML processor is defined in the list appearing
below in this document, and in commentary text attached to the structure
definitions in the MGML Reference DSD. These behavior specifications and the
MGML Reference DSD together constitute the sole and complete definition of
MGML.

The MGML Reference DSD defines a total of 21 elements and 18 attributes.
In printed form, it occupies only 5 pages.
An electronic form may be obtained
here.
To help in understanding, a real SGML DTD for the MGML Reference DSD
may be obtained
here.

A reference parser for a slightly earlier version, including fairly
complete entity processing, implemented as two lex
modules, one C module, and one yacc module, comprised about 1000 lines
of code.

A conforming MGML processor shall:

Optionally, for any DSD, write a corresponding SGML declaration and SGML
Document Type Definition which define a class of documents including all those
accepted as valid by an MGML processor with respect to the DSD. Thus, every
MGML document is an SGML document.

Scan the text of each element's content to distinguish markup and data.

Replace entity references by their entities.

Validate the element and attribute structure against the model described
by the DSD.

Supply all defaulted attributes.

Provide to an external processing system (a) complete information about the
entity, element, and attribute structure and (b) access to its content.
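
Two of the duties listed above, replacing entity references and supplying defaulted attributes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the &name; reference syntax, the dictionary-based representation of declarations, and the function names replace_entities and supply_defaults are all hypothetical and are not taken from the MGML Reference DSD.

```python
import re

# Assumed reference syntax for illustration: &name;
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def replace_entities(text, entities):
    """Replace each &name; reference with its declared replacement text.

    This is a single pass: replacement text is not rescanned for
    further references.
    """
    def lookup(match):
        name = match.group(1)
        if name not in entities:
            raise KeyError(f"undeclared entity: {name}")
        return entities[name]

    return ENTITY_REF.sub(lookup, text)


def supply_defaults(element, attributes, declared):
    """Return the element's attributes with declared defaults filled in.

    `declared` is a hypothetical structure mapping an element name to a
    dict of {attribute name: default value}.
    """
    allowed = declared.get(element, {})
    unknown = set(attributes) - set(allowed)
    if unknown:
        raise ValueError(
            f"undeclared attributes on {element}: {sorted(unknown)}"
        )
    merged = dict(allowed)      # start from the declared defaults
    merged.update(attributes)   # explicitly specified values win
    return merged


if __name__ == "__main__":
    print(replace_entities("AT&amp;T", {"amp": "&"}))
    print(supply_defaults("note", {"lang": "fr"},
                          {"note": {"lang": "en", "status": "draft"}}))
```

Rejecting undeclared entities and attributes here is a design choice for the sketch; it stands in for the validation step, which a real processor would perform against the full structure model in the DSD.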