Many readers will find the concepts covered in this chapter difficult to
grasp at first reading. Do not worry if you do not understand the role of every
part of the SGML declaration at this stage: you are not meant to! You are asked
to read quickly through this chapter at the beginning of this explanation of
SGML because the restrictions imposed by the SGML declaration are fundamental
to understanding many of the rules in SGML. Terms introduced in this chapter
will be used throughout the remainder of the book. When you return to this
chapter to remind yourself of the concepts behind these terms, you should find
that the summary given here explains the restrictions imposed on other SGML
constructs.

When interchanging documents it is important that each transmitted code has
a well-defined function. In addition, it is important that document markup can
be correctly distinguished from the codes that form the text of the document.

The rules defining the meanings of the constructs used by a particular
language are known as the syntax of that language. Two
distinct types of syntax have been defined for SGML:

an abstract syntax is used to specify how SGML markup
should be constructed in terms of abstract concepts such as delimiter roles and
character classes

a concrete syntax is used to define how these abstract
concepts have been coded within specific sets of SGML documents.

This chapter will introduce you to many of the terms used to describe
SGML's abstract syntax. The use to which the abstract syntax is put will be
explained in the following chapters.

One particular concrete syntax, called the reference concrete
syntax, has been formally defined within ISO 8879:1986 to provide a
reference against which variant concrete syntaxes can be compared. It is a
requirement of conforming SGML systems that they be able to parse documents
conforming to the reference concrete syntax.

Each SGML document transferred to another system should be accompanied by a
declaration, called the SGML declaration, which defines the
coding scheme used in its preparation. Figure 4.1 shows
the SGML declaration that should be used if a document is transmitted without an
SGML declaration. (Such documents are referred to as basic SGML documents.)

The SGML declaration starts with a markup declaration open
(mdo) sequence consisting of the codes <! .
The declaration is closed by a matching markup declaration close
(mdc) angle bracket (>) at the end of the
declaration.

The rest of the first line of the SGML declaration consists of the letters
SGML followed by a delimited string containing the number and date
of the ISO standard in which SGML is defined ("ISO 8879:1986").
This statement indicates which version of the standard was used to prepare the
following declarations.

The second line of the default SGML declaration contains some text bracketed
by pairs of hyphens. Text entered in an SGML markup declaration between pairs of
hyphens is treated as a comment. In this case the comment acts
as a heading explaining the purpose of the following entries.
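The opening of a declaration, as described above, can be illustrated with a minimal Python sketch. The function name `parse_declaration_start` and the sample string (including its comment text) are assumptions made for illustration. The sketch recognises only the mdo sequence, the SGML keyword, the quoted literal and comments, and it does not cope with hyphen pairs inside quoted strings.

```python
import re

def parse_declaration_start(text):
    """Return (version_literal, comments) for the start of an SGML declaration.

    Hypothetical helper: it handles only the opening pieces described in the
    text, not the clauses that make up the rest of the declaration.
    """
    text = text.strip()
    if not text.startswith("<!"):
        raise ValueError("declaration must start with the mdo sequence '<!'")
    if not text.endswith(">"):
        raise ValueError("declaration must end with the mdc character '>'")
    body = text[2:-1]
    # Text between pairs of hyphens is treated as a comment.
    comments = [c.strip() for c in re.findall(r"--(.*?)--", body, flags=re.S)]
    body = re.sub(r"--.*?--", " ", body, flags=re.S)
    # What remains must begin with SGML followed by a delimited string.
    match = re.match(r'\s*SGML\s+"([^"]*)"', body)
    if match is None:
        raise ValueError("expected SGML followed by a quoted literal")
    return match.group(1), comments

# Illustrative input only; the comment wording is invented for this example.
sample = '<!SGML "ISO 8879:1986" -- Declaration for basic SGML documents -- >'
version, comments = parse_declaration_start(sample)
```

Here `version` holds the number and date of the standard, and `comments` holds the heading comment with its surrounding hyphens removed.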

The names of the six main clauses that make up an SGML declaration are shown
in the first column of the SGML declaration. They identify:

The declaration for SGML's reference concrete syntax given in the SYNTAX
clause shown in Figure 4.2 contains eight subclause
definitions, each identified by a keyword. These define:

the decimal numbers of any codes which the program is to ignore because
they are control characters (SHUNCHAR:
shunned characters)

the syntax character set, consisting of a base character set
(BASESET) declaration followed by a description of how these
characters are to be used to define the concrete syntax (DESCSET:
described character set)

which codes represent function characters required by the
syntax (FUNCTION)

the naming rules to be applied when defining element,
attribute and entity names (NAMING)

the markup delimiters to be used in the document (DELIM)

reserved names used within markup declarations (NAMES)

the quantity set required for the document (QUANTITY).

The base character set used for the reference concrete syntax is that
defined in international standard ISO 646.
This 7-bit character set, known as the International Reference Version (IRV), is
used as a starting point for all international standards that define character
sets, e.g.
ISO 6937,
ISO 8859 and ISO/IEC 10646.

Note: A revision of ISO 646 took place in 1991. The revision (ISO
646:1991) matches the American Standard Code for Information Interchange (ASCII)
used by many computer systems. (ISO 646 does allocate different names to some of
the control characters, but these names do not affect the way these codes are
used.) In addition it has been identified that the ISO 2022 Escape sequence used
for ISO 646 in the SGML reference concrete syntax was incorrect: it should have
been ESC 2/8 4/0. Strictly speaking, therefore, the reference
concrete syntax should be updated to read "ISO 646:1991//CHARSET
International Reference Version (IRV)//ESC 2/8 4/0". In practice it
is likely that the next revision of SGML will adopt the 16-bit version of
ISO/IEC 10646:1993 as its default code set.

The described character set portion of the reference concrete syntax
character set description shows that 128 characters, starting from position 0 in
the list, should be mapped to identical positions in the reference concrete
syntax. Figure 4.3 shows the 128 codes defined in ISO
646.

Codes with values less than 32, and that with a value of 127, have been
allocated to control functions, while the 95 codes with values from 32 to 126
are associated with printable (data) characters. Note that the character numbers
entered in the SHUNCHAR section of the syntax clause shown in
Figure 4.2 are those defined as control codes within ISO
646, e.g.:

There are, however, certain control codes that are significant within an
SGML document, not as characters but as codes which serve particular functions.
These codes are identified in the FUNCTION section of the syntax
definition. In the case of the reference concrete syntax four functions are
defined:

Record End (RE)

Record Start (RS)

the space character (SPACE)

the horizontal tab code (TAB).

The carriage return code (13) is used as the Record End
code for the reference concrete syntax, with the line feed code (10) being used
for the Record Start. The special rules that apply to the
processing of these codes are explained in the section headed The effect of record boundaries in Chapter
11.

The Space character (32) is treated as
a function character because it has a special function as a separator
within SGML markup declarations. The Tab code (9) can also be used as a
separator, but as it does not have exactly the same role as the space it is
placed into a special group of separator characters identified
by the SEPCHAR control word.
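As a rough illustration of how the control, function and data codes described above divide up the 128 positions of ISO 646, the following Python sketch classifies a code number. The names `FUNCTION_CODES` and `classify` are assumptions made for this example, and the sketch treats the space as a function character only, although the text notes that it is also a data character.

```python
# Function characters of the reference concrete syntax, as listed in the text.
FUNCTION_CODES = {13: "RE", 10: "RS", 32: "SPACE", 9: "TAB"}

def classify(code):
    """Classify an ISO 646 code number (hypothetical helper)."""
    if not 0 <= code <= 127:
        raise ValueError("ISO 646 defines only codes 0 to 127")
    if code in FUNCTION_CODES:
        return "function:" + FUNCTION_CODES[code]
    if code < 32 or code == 127:
        return "control"
    return "data"

# Tally the three groups across the whole code set.
counts = {"function": 0, "control": 0, "data": 0}
for code in range(128):
    counts[classify(code).split(":")[0]] += 1
```

The tally gives 4 function codes, 30 remaining control codes and 94 data characters other than the space.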

Additional function codes can be specified by adding to the list a triplet
consisting of:

a function name of up to 8
alphanumeric characters starting with a letter

a function class keyword

the decimal number of the code used to activate the function.

The types of function class that can be identified in SGML are:

SEPCHAR - separator character

MSOCHAR - markup-scan-out character

MSICHAR - markup-scan-in character

MSSCHAR - markup-scan-suppress character

FUNCHAR - unspecified form of function character.

The most commonly used function classes are SEPCHAR, which is
used for all codes that can separate the component parts of a markup declaration
(in addition to RE, RS and SPACE), and
FUNCHAR, which is used to identify system specific functions.

Note: Markup scanning is suppressed between codes defined as
markup-scan-out characters and codes defined as markup-scan-in characters, and
for the code immediately following a markup-scan-suppress character.
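The effect described in this note can be sketched in Python. This is a simplified model rather than the behaviour of any particular parser: the function name `markup_scan_mask` and the code numbers used in the example (14, 15 and 27) are hypothetical assignments, and the sketch assumes that the function characters themselves are never scanned as markup.

```python
def markup_scan_mask(codes, mso, msi, mss):
    """Return one boolean per code: True where markup scanning is active.

    mso, msi and mss are sets of code numbers declared as MSOCHAR, MSICHAR
    and MSSCHAR respectively (assumed values, chosen by the caller).
    """
    mask = []
    scanning = True
    suppress_next = False
    for code in codes:
        if suppress_next:
            # The code immediately after a markup-scan-suppress character.
            mask.append(False)
            suppress_next = False
        elif code in mss:
            mask.append(False)
            suppress_next = True
        elif code in mso:
            scanning = False
            mask.append(False)
        elif code in msi:
            scanning = True
            mask.append(False)
        else:
            mask.append(scanning)
    return mask

# 60 is '<'; 14, 15 and 27 are hypothetical MSOCHAR, MSICHAR and MSSCHAR codes.
example = markup_scan_mask([60, 14, 60, 15, 60, 27, 60, 60], {14}, {15}, {27})
```

In the example only the first, fifth and last codes would be examined as possible markup.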

The NAMING section of the syntax clause identifies which
characters can be used in tag or entity names, and in SGML unique identifiers.
By default SGML presumes that names can
only start with alphabetic characters, in either shift, with subsequent
characters being alphanumeric. The LCNMSTRT and UCNMSTRT
entries in the syntax clause allow other, non-alphanumeric, characters to be
defined as name start characters, the
LCNMCHAR and UCNMCHAR entries defining which
non-alphanumeric characters can be used as name
characters after a name start character.

The reference concrete syntax only allows alphabetic characters to be used
as name start characters, but within names the unaccented alphanumeric
characters (a-z, A-Z and 0-9) can be supplemented by full stops and hyphens.

Note: Digits cannot be used as the first character of an SGML name.

Other characters that are required as parts of tag, attribute or entity
names, or within unique identifiers, must be declared as valid name characters
by putting the appropriate characters in the uppercase and lowercase name start
or name character strings. The position of the entries in the string is
important as characters in position n in the lowercase string may be
replaced by the character in position n in the uppercase string during
parsing. If there is no uppercase equivalent the lowercase character must be
repeated in the uppercase string (and vice versa).

The NAMECASE entries of the syntax clause show that, by default, the
reference concrete syntax allows uppercase substitution of lowercase characters
within element names and related markup (GENERAL YES), but that such
substitution is not permitted for entity names (ENTITY NO). This allows
different entity declarations to be defined for &Eacute; and &eacute;, etc.,
while allowing <p> and <P> to be treated identically.
The GENERAL SGMLREF entry in the DELIM section
of the syntax clause shows that the general default set of SGML
delimiters is used in the reference concrete syntax.
Figure 4.4 lists these default delimiters and shows the
formal name assigned to each.

Note that some codes are assigned more than one meaning. This is because the
meaning of a markup delimiter is dependent on the context in which it is
encountered. There are 10 different markup contexts:

The SHORTREF SGMLREF entry in the DELIM section of the
syntax clause shows that the standard set of SGML short reference delimiters,
shown in Figure 4.6, can be used in conjunction with the
reference concrete syntax.

In the reference concrete syntax most punctuation characters can be used as
short reference delimiters, though tag delimiters (&, <,
/, !, ? and >), and
certain other significant symbols (e.g. apostrophe, backslash, full stop and the
general currency sign) are excluded. Six special code sequences are also
defined, five of which allow common word processor line ending conventions to be
used as short reference strings.

The QUANTITY entry at the end of the syntax clause also
requires the presence of the SGMLREF keyword to indicate that
unless otherwise specified the default quantity set
will be used. Figure 4.7 shows the default quantity
limits.

Reserved Name  Value  Purpose
ATTCNT         40     Maximum number of attribute names and name tokens in an attribute definition list
ATTSPLEN       960    Maximum length of a start-tag attribute specification
BSEQLEN        960    Maximum length of blank sequence mappable to a short reference string
DTAGLEN        16     Maximum length of data tag string
DTEMPLEN       16     Maximum length of data tag template or pattern template
ENTLVL         16     Maximum number of nesting levels for entities
GRPCNT         32     Maximum number of tokens in group (one level)
GRPGTCNT       96     Maximum number of tokens at all levels in a model group (data tag groups count as 3 tokens)

GRPCNT, which restricts the number of
elements within a single model group to 32.

These entries often need to be increased from their default values. When
SGML is revised it is anticipated that the default values will be changed to 32,
2048, 32 and 64 respectively. Most SGML parsers already default to these, or
higher, levels, though they should still warn users when the standard values
have been exceeded.
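The kind of warning mentioned above can be sketched in Python as follows. The dictionary contains only the quantities whose default values are listed in this section, and the names `REFERENCE_QUANTITIES` and `exceeded_quantities` are assumptions made for this example rather than part of any real parser.

```python
# Default limits taken from the quantity table in this section.
REFERENCE_QUANTITIES = {
    "ATTCNT": 40,
    "ATTSPLEN": 960,
    "BSEQLEN": 960,
    "DTAGLEN": 16,
    "DTEMPLEN": 16,
    "ENTLVL": 16,
    "GRPCNT": 32,
    "GRPGTCNT": 96,
}

def exceeded_quantities(measured, limits=REFERENCE_QUANTITIES):
    """Return a warning for every measured quantity above its limit."""
    warnings = []
    for name, value in sorted(measured.items()):
        limit = limits.get(name)
        if limit is not None and value > limit:
            warnings.append(
                f"{name}: {value} exceeds the reference limit of {limit}"
            )
    return warnings

# A model group with 40 tokens at one level breaks the default GRPCNT limit.
report = exceeded_quantities({"GRPCNT": 40, "ATTCNT": 12})
```

A parser that accepts larger values could still emit such messages so that users know the document may not be processable by a system that enforces the standard limits.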

The BASESET and DESCSET clauses in the character
set description (CHARSET) that starts the SGML
declaration are used to define the character set used within an SGML document.
By default the ISO 646 character set used for markup is defined as the first
component of the document's character set. This default document character set
can be extended by referencing other ISO character sets. For example, the 96
character supplementary set of Latin accented characters, as defined in
ISO 8859/1, could be added to the
document's character set by placing the following entries underneath the
standard DESCSET entry in the CHARSET clause at the start of the SGML
declaration shown in
Figure 4.1:

Note: The above character set was proposed as the definition to be used
for the internationalized version of the HyperText Markup Language (HTML) on the
World Wide Web in August 1996.

The described character set portion of the default document character set
description shown in Figure 4.1 defines the purpose of
the characters in ISO 646 more clearly than the matching entry in the syntax
clause. It can be interpreted as:

the nine control codes starting from 0 (i.e. 0-8), the two control codes
starting from position 11 (11 and 12), the 18 control codes starting from
position 14 (14-31) and the control code in position 127 are not used within the
document (they are, therefore, considered to be non-SGML characters)

the two control codes starting at position 9 (i.e. Tab and Record Start)
and the one in position 13 (Record End) have special significance within the
document (they are SGML function codes)

there are 95 data characters, starting from position 32 (the position of
the space, which is also one of the special function characters).
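The interpretation given above can be expressed as data in a short Python sketch. Each tuple holds a starting position, a count and either the matching position in the base set or `None` for unused codes; this representation, and the names `DOCUMENT_DESCSET` and `expand_descset`, are assumptions made for illustration.

```python
# (start, count, base position or None when the codes are unused)
DOCUMENT_DESCSET = [
    (0, 9, None),    # control codes 0-8: non-SGML characters
    (9, 2, 9),       # Tab and Record Start
    (11, 2, None),   # control codes 11 and 12
    (13, 1, 13),     # Record End
    (14, 18, None),  # control codes 14-31
    (32, 95, 32),    # the 95 data characters, starting with the space
    (127, 1, None),  # control code 127
]

def expand_descset(descset):
    """Map every described code to its base-set position (None if unused)."""
    table = {}
    for start, count, base in descset:
        for offset in range(count):
            code = start + offset
            if code in table:
                raise ValueError(f"code {code} is described twice")
            table[code] = None if base is None else base + offset
    return table

table = expand_descset(DOCUMENT_DESCSET)
non_sgml = sorted(code for code, base in table.items() if base is None)
```

Expanding the entries confirms that all 128 positions are described exactly once and that 30 of them are non-SGML characters.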

The capacity set used with the reference concrete syntax is shown in
Figure 4.8. This
reference capacity set restricts the total number of stored
markup characters within an SGML document to 35000 characters, but places no
restrictions on the capacity of any one of the component parts of the markup,
which can take up all 35000 characters of storage if required. In some large
documents it is possible for this default total capacity to be exceeded.

Note: Most current SGML systems will ignore the default capacity set
restrictions, perhaps providing a warning message to users if the default limits
are exceeded. Modern large-memory systems do not have the memory restrictions
that were typically found in desktop systems of the 1980s, where it was
important to warn users of large documents that they could exceed the program's
memory allocation. Many of the existing restrictions defined in the capacity set
clause will be removed when ISO 8879 is next updated.

By default the SCOPE clause of an SGML declaration is the
whole document (i.e. the syntax is used in both the document prolog and the
document instance). If, however, the character set defined in the syntax section
is only used to mark up the text (i.e. all declarations have been coded using the
reference concrete syntax) the default
SCOPE DOCUMENT entry can be changed to read
SCOPE INSTANCE.

The last clause in the SGML declaration can be used to transmit any
application-specific information (APPINFO)
needed to process the document. For example, a document that uses the ISO/IEC
10744 Hypermedia/Time-based Structuring Language (HyTime) application of SGML
would have an entry reading
APPINFO "HyTime". When no application-specific
information needs to be exchanged the default entry of APPINFO NONE
applies.

ISO 8879 also identifies some special sets of alternative concrete syntaxes.
The most important of these are:

the core concrete syntax

basic and core multicode concrete syntaxes

public concrete syntaxes.

The core concrete syntax is exactly the same as the
reference concrete syntax except that the SHORTREF entry in the
DELIM section is followed by NONE rather than
SGMLREF. A document prepared using the core concrete syntax is
referred to as a minimal SGML document.

Where the code extension techniques defined in ISO 2022 are being used to
extend the character set beyond the 95 characters available in the reference
concrete syntax, the multicode basic concrete syntax defined
in Annex D of ISO 8879 can be used. If the short reference facility is not
required the equivalent multicode core concrete syntax can be
used.

Where characters outside the standard ISO 646 unaccented Latin alphabet are
required in markup, variants of the reference concrete syntax will be needed.
Each such variant concrete syntax can be publicly declared as
a public concrete syntax and given a public identifier
that can be used to call it from within the SGML declaration. For example, a
German variant concrete syntax might be identified as:

The most famous variant concrete syntax is that used for the HyperText
Markup Language (HTML). In the definition of Version 2.0 of this language, in
RFC 1866, the following SGML declaration was specified:

This SGML declaration specifies the following changes to the default SGML
declaration:

the document character set is extended to include the 96 accented and
special characters that make up the ECMA-94 Latin Alphabet Nr. 1 character set

Note: As these characters have not also been specified as part of the
SYNTAX clause they cannot be used within markup, only within the
text of the document instance.

the total capacity to be reserved for SGML token storage has been increased
to 150,000 octets, with the capacity for group and entity storage within this
total capacity also being extended to the same limit

the maximum length for an attribute specification has been extended from
960 to 2100 characters (to allow for long URLs within attribute specifications,
etc.): the maximum length of a markup tag, including its attribute
specification, has also been increased to 2100 characters

the maximum length of literal strings has been increased to 1024 characters
(again to allow for long URLs and other program parameters)

the maximum length of names has been extended to 72
characters (the limit being determined by the preferred maximum line length for
HTTP transmission, which requires that a space occurs at least every 72
characters)

the maximum length of processing instructions has been increased to 1024
characters

the maximum number of nested tags has been increased to 100

the maximum number of tokens in a model group has been extended to 150,
which must be split into no more than 64 groups

formal public identifiers must conform to the formal rules for their
definition

a special application for the use of SGML document access (SDA) attributes
to identify types of HTML elements is associated with the document type
definition via the APPINFO clause.

For version 4.0 of the HTML DTD, which supports multiple languages and the
use of bidirectional texts, the following SGML declaration should be used to
invoke the full ISO/IEC 10646 character set:

Warning: Before using the extensions listed below you should ensure
that both the document creation and document receiving systems can process these
additional features.

Two extensions to the facilities provided by SGML declarations have been
defined in the form of optional annexes to ISO 8879:

Annex J: Extended Naming Rules

Annex K: Web SGML Adaptations

Annex J allows SGML to make full use of the extensive ranges of characters
provided in the ISO/IEC 10646 character set by providing for the specification
of ranges of characters and for identifying name characters for which case
substitution is not permitted.

Annex K provides additional controls for optional features of SGML, and
relaxes some of the previously mandatory restrictions to allow for situations
where parts of the document type definition may not be accessible due to network
constraints. Annex K also includes facilities for identifying where externally
defined constraints on the use of declarations have been defined.

The following example shows how these extensions can
be used to create an SGML declaration that defines the syntax used for the World
Wide Web Consortium's Extensible Markup Language (XML):

When both the extensions in Annex J and those in Annex K apply to an SGML
declaration the minimum literal at the start of the definition is extended to
read "ISO 8879:1986 (WWW)", where WWW
stands for World Wide Web.

The ability to switch off quantity checking by
specifying QUANTITY NONE in the SYNTAX clause.

Two additional delimiters, to allow character
references to be entered using hexadecimal numbers and to allow the
null end-tag start delimiter to differ from that
used to indicate the position of the null end-tag.

Because ISO/IEC 10646 character sets are typically displayed as 'planes' of
256 characters (16 columns of 16 characters) it is often easier to reference
them using a hexadecimal (base 16) number than a decimal (base 10) number. For
this reason the Web SGML adaptations include a new delimiter name, hexadecimal
character reference open (HCRO), which can be used in the DELIM
section of the SYNTAX clause. A typical use of this option is:

DELIM GENERAL SGMLREF HCRO "&#38;#x"

Note particularly the use of an embedded (decimal)
character reference, which indicates that the
delimiter starts with an ampersand (&) code. This form of
double escaping is required to ensure that an error is not reported when the
parser checks the contents of the string defining the delimiter.
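The double escaping can be traced with a small Python sketch that resolves numeric character references in a single pass. The function name `resolve_char_refs` is hypothetical, and the sketch assumes that `&#` opens a decimal reference, `&#x` opens a hexadecimal one and a semicolon closes either form.

```python
import re

# Either "&#x" plus hexadecimal digits or "&#" plus decimal digits, then ";".
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def resolve_char_refs(text):
    """Replace each numeric character reference with its character.

    The replaced text is not rescanned, so one call resolves one level
    of escaping only.
    """
    def replace(match):
        hex_digits, dec_digits = match.groups()
        number = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(number)
    return _CHAR_REF.sub(replace, text)

delimiter = resolve_char_refs("&#38;#x")    # the string given for HCRO
letter = resolve_char_refs("&#x41;&#65;")   # hexadecimal and decimal forms
```

Resolving the declared string once yields the three characters `&#x`, which then act as the delimiter, and the hexadecimal reference `&#x41;` selects the same character as the decimal reference `&#65;`.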

At the end of the SYNTAX clause an optional new entry can be
added to specify named character data entities that can be used to escape markup
characters. It is recommended that all characters that are defined as the first
character in a markup delimiter be provided with escape entity names to allow
them to be used within the contents of elements or entities. For example, XML
defines the following set of default entities:

ENTITIES "amp" 38 "lt" 60 "gt" 62 "quot" 34 "apos" 39

This declaration states that the characters used to identify the start or
end of markup declarations, processing instructions, elements, attribute values
and entity references within a document instance can be identified using
predefined character data entities as follows:
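The mapping implied by the ENTITIES entry can be sketched in Python. The helper names `parse_entities` and `escape` are assumptions made for illustration; the sketch simply pairs each quoted name with the character whose decimal code follows it.

```python
import re

ENTITIES_ENTRY = '"amp" 38 "lt" 60 "gt" 62 "quot" 34 "apos" 39'

def parse_entities(entry):
    """Return {entity name: character} from name/code pairs."""
    pairs = re.findall(r'"([^"]+)"\s+(\d+)', entry)
    return {name: chr(int(code)) for name, code in pairs}

def escape(text, entities):
    """Replace each markup character in text with its entity reference."""
    by_char = {ch: "&" + name + ";" for name, ch in entities.items()}
    return "".join(by_char.get(ch, ch) for ch in text)

entities = parse_entities(ENTITIES_ENTRY)
escaped = escape('a<b & "c"', entities)
```

The result pairs `amp` with the ampersand, `lt` and `gt` with the angle brackets, `quot` with the quotation mark and `apos` with the apostrophe, so each of these characters can appear in text as `&amp;`, `&lt;`, `&gt;`, `&quot;` or `&apos;`.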

In the MINIMIZE section of the FEATURES clause
the options that can be associated with SHORTTAG
have been extended. Instead of just saying NO to indicate that
minimization of tags is not allowed or YES to say that all forms
of tag minimization are allowed, you can now select each of the minimization
options individually. If the new option is used then three new keywords must be
added to the declaration, and each of these keywords must be followed by entries
that consist of a keyword identifying a minimization option followed by an
appropriate value.

NETENABL, to indicate whether null
end-tags can be enabled using the new NETSC delimiter for all
elements (ALL), for empty elements whose end-tag immediately
follows the start-tag (IMMEDNET) or for NO elements.

ENDTAG, to indicate options that apply to end-tags, which
are:

EMPTY, to indicate whether empty
end-tags are to be allowed (YES) or not (NO)

To enable the use of end-tags with empty
elements users can add the optional
EMPTYNRM YES empty element ending rules specification to the end
of the MINIMIZE part of the FEATURES clause. When
this empty element normalization option is activated, omission of the end-tag is
controlled by the tag omission rules of the element. If the default rules for
the automatic omission of end-tags from empty elements are to apply then
EMPTYNRM NO must be specified.

Note: If this option is specified the options for implicit definitions
must immediately follow it.

To specify that public identifiers must be entered in the form of Internet
Uniform Resource Names (URNs) you can now add URN YES after the
FORMAL NO option at the end of the FEATURES clause.
By default, or if no entry is specified, URN NO is assumed,
indicating that no checking of public identifiers is required.

Note: If this option is used it must be immediately followed by the
options listed under the Other new features heading below.

When the URN feature has been added to the FEATURES clause it
must be followed by specifications relating to the following new options:

KEEPRERS, to indicate whether Record
End and Record Start codes found between elements in mixed content are to be
retained (YES) or not (NO)

VALIDITY, to indicate which of the following types of
validity checking is to be applied to elements in the document instance:

TAG, to indicate that checking is only required to ensure that
the document is fully tagged, i.e. that every non-EMPTY element within the
document instance has both a start and an end-tag

TYPE, to indicate that checking is required to ensure that
the element and its attributes are permitted in the current context (the
default condition when validity is not specified)

TAG-TYPE, to indicate that checking is required to ensure
that the element and its attributes are permitted in the current context, and
that non-EMPTY elements have both a start and an end-tag

NOASSERT, to indicate that no validity checking needs to be
applied to elements in the document instance.

ENTITIES, to indicate the types of checks to be performed on
entities referenced in the document instance, which can be defined using the
following options:

NOASSERT, to indicate that no validity checking of entity
references is required

REF, to indicate whether ANY entity references
can occur or only references to INTERNAL entities, or that
document instances can contain no entity references other than those that
reference predefined data character entities (NONE)

INTEGRAL, to indicate whether or not elements must be
integrally stored so that every element starts and ends in the same entity (YES)
or not (NO).

Note: If the REF option is present it must be immediately
followed by the INTEGRAL option. The NOASSERT option
must be used on its own.

When the Web SGML Adaptations are being used the APPINFO
clause can be extended by adding a SEEALSO statement which is
followed by a public identifier that
references a file that contains information on any additional constraints to be
applied by applications using the SGML declaration. For example, the constraints
specified in the XML specification can be referenced using the following
statement:

APPINFO NONE SEEALSO "http://www.w3.org/TR/PR-xml-971208"

More than one public identifier can be specified if appropriate. If no
additional rules apply the entry can either be omitted or changed to SEEALSO
NONE.

When the Web SGML Adaptations are being used a shortened form of SGML
declaration can be associated with a document type declaration. This takes the form
of an SGML declaration reference to an externally stored
SGML declaration body. The format of the shortened declaration
is:

<!SGML name external-identifier? >

where name is a reference concrete syntax name used to
identify the SGML declaration and the optional external-identifier
identifies the external entity which contains
the clauses of the SGML declaration body. (If it is omitted the system is
assumed to be able to use the name to find the relevant SGML
declaration: this is equivalent to having an unqualified system identifier of
SYSTEM for the external identifier.)

The file referenced must start with the minimum literal that indicates which
version of the SGML declaration is being used, followed by definitions for each
of the clauses that make up the SGML declaration body. Comments can be
interspersed between clauses, but may not precede the minimum literal.

If the system knows where to find an SGML declaration known as XML
all that needs to be added to the start of the file is <!SGML XML>.
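A minimal Python sketch of recognising the shortened form is given below. The function name `parse_sgml_reference` and the file name used in the example are hypothetical. The sketch assumes that an external identifier is written as PUBLIC followed by a quoted public identifier and an optional quoted system identifier, or as SYSTEM followed by an optional quoted system identifier, and it accepts only double-quoted literals.

```python
import re

_SGML_REFERENCE = re.compile(
    r'<!SGML\s+(?P<name>[A-Za-z][A-Za-z0-9.\-]*)'
    r'(?:\s+(?:PUBLIC\s+"(?P<public>[^"]*)"(?:\s+"(?P<pubsys>[^"]*)")?'
    r'|SYSTEM(?:\s+"(?P<system>[^"]*)")?))?'
    r'\s*>'
)

def parse_sgml_reference(text):
    """Return the parts of a shortened SGML declaration, or None.

    None is returned for anything else, including a full declaration
    that starts with a quoted literal instead of a name.
    """
    match = _SGML_REFERENCE.fullmatch(text.strip())
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "public": match.group("public"),
        "system": match.group("pubsys") or match.group("system"),
    }

by_name = parse_sgml_reference("<!SGML XML>")
# "xml.dcl" is an invented file name used purely for illustration.
by_file = parse_sgml_reference('<!SGML XML SYSTEM "xml.dcl">')
```

When no external identifier is present the caller is left to locate the declaration body from the name alone, which mirrors the unqualified SYSTEM case described above.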

References

Readers wishing to know more about the role of the SGML declaration should
refer to the following books: