SGML '93 Conference Report, by Michael Popham

Credits

The following report was obtained from the Exeter SGML Project FTP server as Report No. 9, in UNIX "tar" and "compress" (.Z) format. It is unchanged here except for the conversion of SGML markup characters into entity references, in support of HTML.

THE SGML PROJECT SGML/R25
CONFERENCE REPORT
SGML '93
BOSTON, MA, USA 6TH-9TH DECEMBER 1993
Issued by
Michael G Popham
22nd December 1993
BACKGROUND
The conference opened with a welcome from Norm Scharpf of the
GCA. This year, around 450 attendees were anticipated, about
50% of whom were attending their first SGML conference. Norm
announced that the GCA will be taking over the maintenance of
the AAP/ISO 12083 DTDs from EPSIG.
In his initial remarks, the Conference Chair (Yuri Rubinsky) noted
that this was the largest SGML event in history. The size of the
conference meant that it was split into two tracks running
concurrently, one for SGML novices, the other for experts. (I
attended the technical track for experts.)
SESSIONS ATTENDED
1. "The SGML Year in Review" - Yuri Rubinsky, B
Tommie Usdin, and Debbie Lapeyre
2. Keynote Address: "TEI Vision, Answers and
Questions: SGML for the Rest of Us" - Lou Burnard
(Text Encoding Initiative)
3. Poster Session
4. Reports from the Front
5. "Multi-company SGML Application Standard
Development Process" - Bob Yencha (National
Semiconductor), Patricia O'Sullivan (Intel
Corporation), Jeff Barton (Texas Instruments), Tom
Jeffery (Hitachi Micro Systems, Inc.), Alfred
Elkerbout (Philips Semiconductors)
6. "Archetypal Early Adopters? Documentation of the
Computer Industry"
6.1 "Information Development Strategy and Tools -
IBM and SGML" - Eliot Kimber and Wayne Wohler
(IBM)
6.2 Eve Maler (Digital Equipment Corporation)
6.3 "Implementation of a Corporate Publishing System"
- Beth Micksh (Intergraph)
6.4 Jon Bosak (Novell)
7. International SGML Users' Group Meeting
8. "Real World Publishing with SGML" - Terry Allen
(Digital Media Group, O'Reilly & Associates, Inc.)
9. "HyTime Concepts and Tools" - Dr. Charles
Goldfarb (IBM)
10. "HyTime: Today's Toolset" - Dr. Charles Goldfarb
(IBM) and Erik Naggum (Naggum Software)
11. "Charles Goldfarb, Please Come Home: SGML, the
Law, and Public Interest" - Ina Schiff (Attorney at
Law) and Allen H Renear (Brown University)
12. "Online Information Distribution" - Dave Hollander
(Hewlett Packard)
13. "Xhelp: What? Why? How? Who?" - Kent Summers
(Electronic Book Technologies)
14. "Digital Publications Standards Development (DPSD):
A Modular Approach" - Horace Layton (Computer
Sciences Corporation)
15. "A Practical Introduction to SGML Document
Transformation" - David Sklar (Electronic Book
Technologies)
16. "SGML Transformers: Five Ways" - Chair: Pam
Gennusa (Database Publishing Systems Limited)
17. "The Scribner Writers Series on CD-ROM: From a
Great Pile of Paper to SGML and Hypertext on a
Platter" - Harry I. Summerfield (Zandar
Corporation), Anna Sabasteanski (Macmillan New
Media)
18. "The Attachment of Processing Information to SGML
Data in Large Systems" - Lloyd Harding (Mead Data
Central)
19. "ISO 12083 Announcement" - Beth Micksch
(Intergraph Corp.)
20. Reports from the SGML Open Technical Committees
21. "A Technical Look at Authoring in SGML" - Paul
Grosso (ArborText)
22. "Implementing an Interactive Electronic Technical
Manual" - Geoffrey von Limbach, Bill
Kirk (InfoAccess Inc.)
23. "The Conversion of Legacy Technical Documents into
Interactive Electronic Technical Manuals: A NAVAIR
Phase II SBIR Status Report" - Timothy Billington,
Robert F. Fye (Aquidneck Management Associates,
Ltd.)
24. New Product Announcements and Product Table Top
Demonstrations
25. Poster Session
26. "Implementing SGML Structures in the Real World"
- Tim Bray (Open Text Corp.)
27. "User Requirements for SGML Data Management" -
Paula Angerstein (Texcel)
28. "A Document Query Language for SGML Databases"
- Ping-Li Pang, Bernd Nordhausen, Lim Jyh Jang,
Desai Narasimhalu (Institute of Systems Science,
National University of Singapore)
29. Closing Keynote - Michael Sperberg-McQueen (Text
Encoding Initiative)
1. "The SGML Year in Review" - Yuri Rubinsky, B Tommie Usdin,
and Debbie Lapeyre
The full text of "The Year in Review" will be published in <TAG>
and also posted to comp.text.sgml.
The review was loosely split into a number of categories, the first
of which focused on Standards Activity. The interim report of the
review of ISO 8879 will be published in <TAG>; however, it is
clear that changes to the SGML Standard will be required. ISO
10744, the HyTime Standard, is being adopted by IBM, whilst
TechnoTeacher are producing relevant tools; also, at least 3
HyTime-related books are currently in preparation. A revised
3-level version of DSSSL will be out early next year, and the SPDL
Standard is now ready to go to press. Information on ISO 12083
(the fast-tracked version of the revised AAP DTD) will be given at
this conference.
User group activity - the Swedish group has been very active in
the last twelve months as has the Japanese SGML forum, which
attracted 400 people to an open meeting on SGML. Erik Naggum
was welcomed as the new Chair of SGML SIGhyper.
Major Public Initiatives - SGML Open was founded this year;
more information was given later in the week. The NAA and IPTC
(both major news industry bodies), have been working on an
SGML-based universal text-format, the "News Industry Markup
Language" [?], for interchanging news service data. Co-ordinators
of The Text Encoding Initiative (TEI) met with people developing
the World Wide Web (WWW) to discuss the production of
HTML+ (a revised version of the Hypertext Markup Language, the
markup scheme recognized by WWW browsers). The TEI have
now completed all their major goals and some supplementary
work, which is now publicly available (via ftp). The
DAVENPORT group made an amicable split into DAVENPORT
and CApH; more details were given later in the conference. In the
US, 18 other states have followed Texas in requiring textbooks to
be produced using SGML and following the ICADD
guidelines (the International Committee for Accessible Document
Design) - several companies have said they will be providing
tools to handle the ICADD tagset. By 1995, all companies in the
US will be able to provide their financial information according to
EDGAR. British Airways is developing an ATA DTD-based
system, and Lufthansa already have an SGML-based system in
place (details of which were given at SGML '93 Europe).
Publications - SGML has received an increasing amount of
coverage in the mainstream computer press. Prentice Hall will be
publishing a series of books to do with open information
interchange, under the guidance of Charles Goldfarb. Kluwer will
publish "Making Hypermedia Work", a handbook on HyTime by
Steve DeRose and David Durand. A new version of Eric van
Herwijnen's "Practical SGML" will be available early in 1994.
Van Nostrand Reinhold will be publishing a manager's guide to
SGML. Exoterica released their "Compleat SGML CD-ROM" in
1993, and will be releasing a conformance suite CD-ROM next
year. Eliot Kimber has written a HyQ tutorial (HyQ is the query
language described in the HyTime standard), which is available
via ftp.
Major corporations and government initiatives - The American
Memory Project (run by the Library of Congress), has chosen to
use SGML to create a text base of materials. IBM has developed
an internal SGML-based system, called IBMIDDOC, which they
will use to create, manage and produce/deliver all their product
documentation. The OCLC have been selected to develop an
SGML-based publishing system for use by the ACM. The British
National Corpus, a 100 million word tagged corpus, is to be made
available next year. Springer Verlag is currently producing 50
journals a year using an SGML-based system, and next year this
figure will rise to 100 [150?]. Various patent/trademark bodies,
including the US Patent and Trademark Office, the European
Patent Office and the Japanese Patent Office, are adopting SGML-
based approaches. In France, SGML is being used by a number of
key players in various industries (maritime, aerospace, power
industry, publishing), whilst SGML uptake in Australia is also
increasing. UCLA is adopting SGML for all its campus-wide
information publishing, both on-line and on paper, as is the Royal
Institute of Technology in Sweden.
Miscellaneous - Adobe has an agreement with Avalanche to
develop SGML filters to move information to/from Acrobat.
Lotus Development Corporation is looking at incorporating SGML
awareness into a future version of Ami Pro. Microsoft announced
the development of an SGML-based add-on for Word (to be called
"SGML Author"). The American Chemical Society is updating its
SGML DTDs, whilst the IEEE is developing a suite of DTDs for
publishing its standards. The Oxford Text Archive now has about
100 works tagged with TEI-conformant SGML, which are
available for ftp over the Internet. The first SGML summer school
was organized in France, and attracted 35 attendees. Joan Smith,
the "Godmother of SGML" and founder of the International
SGML Users' Group, has retired this year; Yuri remarked that her
presence will be greatly missed and thanked her for all her efforts
over the years. And in the "believe it or not" category, came the
news that IBM will be producing an SGML-tagged CD-ROM of
interviews published in Playboy magazine between 1962 and
1992.
2. Keynote Address: "TEI Vision, Answers and Questions:
SGML for the Rest of Us" - Lou Burnard (Text Encoding Initiative)
This presentation looked at the wider relevance of the Text
Encoding Initiative (TEI) - its origins, goals and achievements. It
included an overview (tasting?) of the TEI's DTD pizza model and
the TEI class system, and a look at some TEI techniques and
applications.
The TEI comes primarily from the efforts of the Humanities
research community. Sponsors include the Association for
Computational Linguistics (ACL), the Association for Literary and
Linguistic Computing (ALLC), and the Association for Computing
and the Humanities (ACH). Funding bodies include the US
National Endowment for the Humanities, the Mellon Foundation,
DG XIII of the European Commission, and the Social Science and
Humanities Research Council of Canada.
The TEI addresses itself to a number of the fundamental problems
facing Humanities researchers, although their findings are widely
applicable throughout academia and beyond. It looks at the re-
usability of information, particularly with regard to issues of
platform-, application-, and language-independence. It accepts the
need to support varied information sources (such as text, image,
audio, transcription, editorial, linking, analysis etc.). The
developers of the TEI's guidelines have also given careful
consideration to the interchange of information - issues such as
what and how to encode, generality vs. specificity, providing an
extensible architecture and so on.
The basics of the TEI's approach were published in the first draft
of "Guidelines for the Encoding and Interchange of Machine-
Readable Texts" (also known as TEI P1). Specialist workgroups
were set up as part of the process to develop the guidelines. They
focused on identifying significant particularities, ensuring that the
guidelines were independent of notation or realization, avoiding
controversy, over-delicacy, or inadequacy, and seeking
generalizable solutions. The second draft, TEI P2, covers such
areas as: segmentation, feature-structures, certainty, manuscripts,
historical sources, graphs, graphics, formulae and tables,
dictionaries, terminology, corpora, spoken texts, and so on. The
consequences of the TEI's approach mean that they have focused
on content, not presentation. The guidelines are descriptive, not
prescriptive in nature - and any redundancy has been cut out.
The aim of the TEI's work (in addition to producing the
guidelines) was to create a modular, extensible DTD.
To date, the TEI has produced a number of outputs. It has
created a coherent set of extensible recommendations (contained
in the guidelines). It has made available a large number of SGML
tagsets, which can be downloaded from several public ftp servers
around the world. In addition, the TEI has developed a powerful
general purpose SGML mechanism (using global attributes, etc.).
TEI P2 serves as a reference manual for all of these aspects. The
current version of TEI P2 is now available as a series of fascicles
(available via ftp) which outline the basic architecture and core
tagsets. For each element, TEI P2 provides descriptive
documentation, examples of usage, and reference information. It
also contains some DTD fragments (i.e. tagsets) which can be
used to create TEI-conformant applications.
Lou then raised the more general question of "How many DTDs
does the world really need?" - to which there are several possible
answers. One massive and/or general (vague) DTD might meet
the lowest common denominator of people's needs, but the TEI
could only develop it if they adopted a "we know what's best for
you" philosophy. At the other extreme, one could argue that the
world does not need any generalized DTDs, because no-one could
write a DTD (or set of DTDs) which could truly address all the
specific concerns of individual users. An intermediate alternative
would be to develop "as many [DTDs] as it takes" to support a
particular level of user requirements. However, the TEI believes
that it is possible to adopt an entirely different approach to the
three (extremes) mentioned above, which it calls "The Chicago
Pizza Model".
The Pizza Model allows users to make selections from a
prescribed variety of options, in order to meet their particular
needs as closely as possible. The overall model assumes that a
pizza is made up of a choice of bases, mandatory tomato sauce and
cheese, plus a choice of any combination from a given list of
toppings. In terms of the "TEI Menu", the base consists of a
choice of one tagset to describe prose, verse, drama, transcribed
speech, letters and memoranda, dictionaries, or terminology. The
TEI's header and core tag sets are mandatory (i.e. they are the
equivalent of the cheese and tomato sauce which come with every
pizza), after which the user can select one or more tagsets to
describe linking, analysis, feature structures etc.
The mandatory parts of the TEI's model cover a number of vital
aspects. The TEI header consists of an AACR2-compatible
bibliographic description of an electronic document and its
sources. The header also contains standardized descriptions of the
encoding systems applied, codebooks and other metadata, and the
document's revision status. The core tag set provides markup for
highlighted phrases (e.g. emphasis, technical terms, foreign
language matter, titles, quotations, mention, glosses etc.), "data"
(e.g. names, numbers, dates, times, addresses), editorial
intervention (e.g. corrections, regularizations, additions,
omissions, etc.) as well as lists of all kinds, notes, links and
pointers, bibliographic references, page and line breaks, verse and
drama. Lou gave some examples of the use of core tags, and how
to customize the TEI DTD to rename elements, undefine elements
and select tagsets.
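The selection mechanism behind the pizza model can be sketched with
standard SGML marked sections controlled by parameter entities. The
fragment below is only an illustration and is not taken from the TEI
DTD: every entity and element name in it (base.prose, base.verse,
top.linking, phrase, text, header, body, p, l, emph, ptr) is
hypothetical. It relies on the SGML rule that the first declaration of
an entity wins, so the user's choices, declared first, override the
defaults. In practice the choices would sit in the document's own
declaration subset and the rest in a shared external DTD; they are
shown in a single subset here only to keep the example self-contained.

```sgml
<!DOCTYPE text [
<!-- The user's order. The first declaration of an entity wins, so
     these override the IGNORE defaults declared below. -->
<!ENTITY % base.prose   "INCLUDE">
<!ENTITY % top.linking  "INCLUDE">

<!-- Defaults (hypothetical names): every optional tagset is off. -->
<!ENTITY % base.prose   "IGNORE">
<!ENTITY % base.verse   "IGNORE">
<!ENTITY % top.linking  "IGNORE">

<!-- Topping: if selected, it adds ptr to the phrase-level elements. -->
<![ %top.linking; [
<!ENTITY % phrase "emph | ptr">
<!ELEMENT ptr    - O EMPTY>
<!ATTLIST ptr    target IDREF #REQUIRED>
]]>
<!-- Fallback, used only when the topping above was not selected. -->
<!ENTITY % phrase "emph">

<!-- Mandatory part (the "cheese and tomato sauce"). -->
<!ELEMENT text   - - (header, body)>
<!ELEMENT header - - (#PCDATA)>
<!ELEMENT emph   - - (#PCDATA)>

<!-- Bases: exactly one should be switched on. -->
<![ %base.prose; [
<!ELEMENT body   - - (p+)>
<!ELEMENT p      - - (#PCDATA | %phrase;)*>
<!ATTLIST p      id ID #IMPLIED>
]]>
<![ %base.verse; [
<!ELEMENT body   - - (l+)>
<!ELEMENT l      - - (#PCDATA | %phrase;)*>
<!ATTLIST l      id ID #IMPLIED>
]]>
]>
<text><header>Sample</header>
<body><p id="p1">First <emph>paragraph</emph>.</p>
<p>See <ptr target="p1">.</p></body></text>
```

In this sketch, switching on two bases at once would declare body
twice, which a validating parser reports as an error; that is one
simple way of enforcing the "one base per pizza" rule.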
The TEI have also adopted a class system. Element classes consist
of semantically related elements which may all appear at the same
place in a content model. Attribute classes are made up of semantically
related elements which share a common set of attributes. The
classes are implemented using parameter entities, which
simplifies documentation and facilitates controlled modification
of the DTD by the user. Lou gave an example of how new
elements could be added to a class using the TEI's technique.
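As a rough illustration of classes built from parameter entities, the
fragment below declares one element class and one attribute class, and
then adds a new element to the element class without editing the
shared declarations. It is a sketch under stated assumptions, not Lou's
example: the entity names (x.phrase, m.phrase, a.global) and the
elements (p, emph, term, chem) are hypothetical, and the user's
declarations are shown in the same subset as the shared ones only so
that the example is self-contained.

```sgml
<!DOCTYPE p [
<!-- User extension, declared first so that it takes precedence:
     add a new element, chem, to the phrase class. -->
<!ENTITY % x.phrase "chem |">
<!ELEMENT chem - - (#PCDATA)>

<!-- Shared declarations (hypothetical). The empty x.phrase below is
     the default and is ignored here because x.phrase already exists. -->
<!ENTITY % x.phrase "">
<!-- Element class: everything allowed at phrase level. -->
<!ENTITY % m.phrase "%x.phrase; emph | term">
<!-- Attribute class: attributes shared by every member element. -->
<!ENTITY % a.global "id    ID    #IMPLIED
                     lang  NAME  #IMPLIED">

<!ELEMENT p             - - (#PCDATA | %m.phrase;)*>
<!ELEMENT (emph | term) - - (#PCDATA)>
<!ATTLIST (p | emph | term | chem) %a.global;>
]>
<p id="p1">Add <chem>NaCl</chem> and <emph>stir</emph>.</p>
```

Because every content model refers to the class entity rather than
listing its members, a single redefinition is enough to make the new
element available wherever the class is used.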
Unfortunately, due to time constraints, Lou did not present his
slides on the TEI's use of global attributes, nor how the problems
of alignment, using identifiers, pointers and links are dealt with.
Several major scholarly publishers (e.g. Chadwyck Healey, Oxford
University Press, Cambridge University Press) have begun to
adopt the TEI's guidelines and implement their techniques. The
same is true of a number of large academic projects - such as the
Women Writers project (at Brown University), CURIA, the
Wittgenstein Archive, the Oxford Text Archive, etc. - and some
Linguistic Research and Engineering (LRE) projects (e.g.
EAGLES, and the British National Corpus). The TEI's work is
also being taken up by librarians and archivists (such as in the
Library of Congress' American Memory Project, at the Center for
Electronic Texts in the Humanities (CETH), and so on).
Developing a TEI-conformant application still requires the
essential processes of systems analysis/document design. Once
this has been done, the designers can choose the relevant TEI
tagsets using the "pizza model" approach. Any restrictions,
modifications or extensions must be carefully identified and
documented (to help ensure the usefulness of the tagged data to
later scholars). The TEI DTD (i.e. its various constituent tagsets)
is now available for beta test, and can be downloaded from a
number of sites around the world (e.g. ftp from sgml1.ex.ac.uk
[144.173.6.61] in directories tei/p2/drafts and tei/p2/dtds; or send
email to listserv@uicvm.uic.edu containing the single line:
sub tei-l <Your Real Name>).
Once the testing of the TEI P2 DTD is complete, any revisions
will be incorporated into a final version of the "Guidelines..." to be
published as TEI P3. The next phase of work will involve the
development of application-specific tutorials (including electronic
versions), development of appropriate software tools (e.g. TEI-
aware application packages), and the creation of new tagsets to
extend the TEI's guidelines to support new kinds of application.
3. Poster Session
Poster sessions formed a much larger part of this year's conference
schedule than on previous occasions. The idea behind these sessions
is that they allow speakers to give a short, informal presentation on
any topic, after which they are available for questions and
discussion. Each speaker can support his/her presentation with
one or more specially created posters which may consist of
anything from summary diagrams or a list of points, to the full-text
of a presentation.
It would be impossible to provide full details of all the poster
presentations. However, given below are the title and extracts
from the poster abstracts for each of the presentations mentioned
in the programme. (N.B. Some poster sessions were put on
impromptu, and these are not shown below). The posters were
loosely grouped into categories.
SGML Transformations:
"From SGML to Acrobat Using Shrinkwrapped Tools" - the
transformational process the BCS uses to transform its documents
into PDF files freely distributed over the Internet. (Sam Hunting,
BCS Magazine)
"SGML Transform GUI" - describes a language-based, syntax-
independent GUI for SGML structure and style semantic
transformation, that supports both declarative and procedural
processing models. (Michael Levanthal, Oracle Corporation)
"A Tale of Two Translations" - a comparison of the development
of translation programs using Exoterica's OmniMark, and
Avalanche's SGML Hammer. (Peter MacHarrie, ATLIS
Consulting Group)
"Data Conversion Mappings" - mapping old data formats to new,
using 1 to 1, 1 to many, 0 to 1 and 0 to many conversions, and how
these mappings can be automated. (David Silverman, Data
Conversion Laboratory)
"DTD to DTD Conversion for Producing Braille Automatically
from SGML" - following the techniques created by the
International Committee on Accessible Document Design
(ICADD) to produce braille, large print and voice synthesized
books from SGML source files. (David Slocombe, SoftQuad)
"Let's Play UnTag!" - untagging an SGML document to get a
proprietary format. (Harry Summerfield, Zandar Inc.)
"Introducing Rainbow" - a DTD archetype for representing a
wide variety of proprietary word processor data formats to
facilitate proprietary-to-SGML interchange and transformation.
(Kent Summers, Electronic Book Technologies)
"Converting Tables to SGML" - converting legacy table data
from typesetting files or formatted visual representation into
SGML. (Brian Travis, SGML Associates)
Business Case for SGML:
"Fear and Loathing in SGML: Life After CALS" - an overview
of a recent study of SGML products and markets. (Antoinette
Azevedo, InterConsult)
"Designing Open Systems with SGML" - the business role and
benefits of using SGML, and how to design an SGML based
system. (Larry Bohn, Interleaf)
"The Commercialization of SGML" - a review of the strengths
and benefits of SGML and its current perception in the
commercial business world. (Allen Brown, XSoft)
"SGML: Setting up the Business Case" - an approach to making
the business case for SGML. (Eric Severson, Avalanche
Development Corporation)
"Document Management Lingo: Why Executives Buy SGML" -
a framework for selling SGML in relation to document
management. (Ludo Van Vooren, Interleaf)
How To......
"To 'INCLUDE' or 'EXCLUDE' That is the Question" - the use
of INCLUSION and EXCLUSION in DTDs. (Bob Barlow,
Barlow Associates)
"Communicating Table Structures Using Word-processor Ruler
Lines" - a method for writers to indicate the structures of tables
using simple, mnemonic "ruler line encoding". (Gary Benson,
Fluke Corporation)
"Pre-Fab Documents: Modularization in DTD design" - groups
of related document types are traditionally described in large,
monolithic DTDs. If the related document types contain similar
structures, they can be described as a series of related DTD
modules. (Michael Hahn, ATLIS Consulting Group)
"SGML + RFB = Access to Documents" - how Recording for the
Blind, Inc. (RFB) provides E-Text materials to print-disabled
persons.
"Handling Format/Style Information" - using FOSIs to describe
formatting/style information, and how to develop FOSIs using the
Output Specification DTD provided in MIL-M-28001B. (Denise
Kusinski and Pushpa Merchant, World Computer Systems Inc.)
"Remodeling Ambiguous Content Models Through ORs" - using
factorization to avoid ambiguity in model groups resulting from
improper use of occurrence indicators and OR connectors. (John
Oster, McAfee & McAdam, Ltd.)
"Reuse and conditional processing in IBM IBMDOC" - [not
listed in programme] (Wayne Wohler and Eliot Kimber, IBM)
"An easy way to write DTDs" - [not listed in programme]
([speaker unknown])
Case Studies:
"SGML and Natural Language Processing of Wire Service
Articles" - the Mitre Corporation's use of Natural Language
Processing and SGML-tagging to add value to news-wire articles
and other kinds of document. (John D Burger, MITRE)
"SGML Databases" - (Mike Doyle, CTMG Officesmiths)
"Active Information Sharing System (AISS) SGML Database API"
- the applications interface to the AISS SGML database to
produce a total solution integrated SGML environment. (Hee Joh
Lian, Information Technology Institute (ITI) Singapore)
"AISS Document Formatting API" - a processing model of a
native SGML formatter, the issues involved and architectural forms
to define them. (Yasuhiro Okui, NIHON UNITEC CO. LTD.)
"AISS Document Management API" - controlling workflow using
SGML and HyTime. (Roger Connelly, Fujitsu; Steven R
Newcomb, TechnoTeacher)
"Paperless Classroom" - building an interface to large amounts
of information through the use of SGML-based hypermedia
technology. (Barbara A Morris, Navy Personnel Research and
Development Center)
"Integrating SGML into On-line Component Information
Delivery" - a comparison of using manual and SGML-based
processes for database loading. (Javier Romeu, Info Enterprises
Inc. - A Motorola Company)
"SGML Support for Software Reuse" - using SGML to mark up
software for reuse. (John Shockro, CEA Inc.)
"What is an IETM?" - what constitutes an "Interactive Electronic
Technical Manual" (IETM)? The different classes of IETM and
how to build one. (Geoffrey von Limbach, InfoAccess
Corporation)
"Use of SGML to Model Semiconductor Documents" - tagging
information on electronic components, which can be used to
produce printed documents and treated as machine-readable data.
(Various speakers, Pinnacles Group)
"Producing a CALS-Compliant Technical Manual" - ensuring
that the DTDs, Tagged Instances, and FOSIs support the users'
needs for creating CALS compliant Air Force Technical Manual
Standards and Specifications. (Susan Yucker and Matthew
Voisard, RJO Enterprises Inc.)
HyTime:
"SGML/HyTime and the Object Paradigm" - a comparison of the
object-oriented and SGML/HyTime ways of representing
information. (Steven R Newcomb, TechnoTeacher Inc.)
"An object-oriented API to SGML/HyTime documents" - the
design of some of the HyMinder C++ library (developed by
TechnoTeacher), and a comparison of SGML/HyTime constructs
and HyMinder object classes. (Steven R Newcomb,
TechnoTeacher Inc.)
"SlideShow: An Application of HyTime Application" - the
system design and architectural forms used to create a sample
HyTime application called SlideShow. (Lloyd Rutledge,
University of Massachusetts).
"HyQ: HyTime's ISO-standard SGML query language" - a
discussion of the main features and advantages of HyQ. (Steven R
Newcomb, TechnoTeacher/Fujitsu/ISI)
Technical Gems:
"A Document Manipulation System Based on Natural Semantics"
- Natural Semantics, its relationship to SGML, and the results of
some document manipulation experiments. (Dennis S Arnon,
Xerox PARC; Isabelle Attali, INRIA Sophia Antipolis; Poul
Franchi-Zannettacii, University of Nice Sophia Antipolis).
"Digital Signatures Using SGML" - a schema for digital
signatures of electronic documents using SGML. (Bernd
Nordhausen, Chee Yeow Meng, Roland Yeo, and Daneel Pang
Swee Chee, National Information Infrastructure Division).
"CADE - Computer Aided Document Engineering" - a
framework of six methodologies for the Document Development
Life Cycle. (G Ken Holman, Microstar Software Ltd.)
"Using SGML to Address the 'Real' Problems in Electronic
Publishing" - using Requirements Driven Development (RDD)
for the generation of information capture and production
environments, so as to ensure the balance of the supply and
demand for data. (Barry Schaeffer, Information Strategies Inc.)
"Recursion in Complex Tables" - a recursive-row table model,
that can model the structure of most tables with multi-row
subheads. (Dave Peterson)
4. Reports from the Front
In this session, several speakers briefly outlined major SGML-
related industry activities.
Beth Micksh summarized the history and purpose behind parts of
CALS MIL-M-28001, dealing with the use of SGML. The latest
version, MIL-M-28001B, was made available last summer, and it
supports the electronic review of documents. Early in 1994, the
publication of a "MIL SGML Handbook" is expected; it will cover
some of the fundamental and general aspects of SGML, as well as
providing some valuable CALS-specific information.
Terry Allen described the work of the Davenport Group. He
outlined their general purpose and main members, full details of
which are given in the publicly available material circulated by the
Group. The DOCBOOK DTD has been the Davenport Group's
major accomplishment to date - and v2.1 should be ready for
release in the week commencing December 13th 1993;
announcements will be posted to the comp.text.sgml newsgroup.
Steve Newcomb talked about CApH, a breakaway group from
Davenport, which focuses on Conventions for the Application of
HyTime. They intend to provide guidance on how to use HyTime
for anyone who wishes to adopt ISO 10744 for the interchange of
multimedia and/or hypertext information. CApH will provide
general policies and guidelines on how to tackle typical problems
(e.g. how to generate a master index for a set of documents which
all have their own separate indexes), but will not deal with how to
enforce such policies.
The next speaker discussed the joint efforts of the International
Press and Telecommunications Council (IPTC) and the
Newspaper Association of America (NAA) to devise a base set of
tags for marking up newswire text. This work will affect not only
the providers of this kind of information (e.g. companies such as
Reuters), but also the newspapers and broadcast services which
make use of it, and the database/archiving specialists (such as
Mead Data Central) who will want to store it.
Eddie Nelson spoke about the ATA DTDs, which he stressed were
designed for interchange only. The ATA DTDs will influence the
documentation activities of all the manufacturers, component
suppliers, operators etc. in the commercial aviation industry. To
date, the ATA has released 8-9 DTDs, mostly dealing with the
major types of technical manual.
Dianne Kennedy discussed the Society of Automotive Engineers
(SAE) J2008 standard, which has been prompted by the emissions
regulations in the Clean Air Act. By model year 1998, all new
automobiles sold in the US must be supported by documentation
conforming to the J2008 standard. J2008 is actually a suite of
standards covering such aspects as the use of text, graphics and the
interchange of electronic information - it is not just a DTD.
Also, J2008 includes a data modelling (database) approach, which
is separate from the actual documentation considerations;
information required by the former is essentially relational in
nature, whilst that for the latter is hierarchical. Careful thought
has been required to ensure that the documentation DTD will
sufficiently support mapping into (i.e. populating) the data model.
The next meeting of those involved in developing J2008 will take
place in January 1994.
5. "Multi-company SGML Application Standard Development Process" -
Bob Yencha (National Semiconductor), Patricia O'Sullivan
(Intel Corporation), Jeff Barton (Texas Instruments), Tom
Jeffery (Hitachi Micro Systems, Inc.), Alfred Elkerbout
(Philips Semiconductors)
This presentation described the work of the Pinnacles Group, a
consortium of the five major semiconductor manufacturers, to
develop a common form of electronic product datasheet for
interchange between themselves and their customers.
Product datasheets are relatively few (typically < 10,000 per
company) and small (typically < 100 pages), but they are very
complex in terms of their structure and content. The decision to
adopt a common, SGML-based electronic solution, was based on
the need to simultaneously resolve business problems (i.e. collect
and deliver information efficiently), and respond to market
pressures (i.e. customers wanted the information quickly and in
electronic form).
Developing the DTD jointly by all the members of the Pinnacles
group ensured harmonization, distributed the development costs,
and encouraged the development of tools for both information
providers and users. The speakers repeatedly stressed the
importance of having access to the knowledge of content experts
during the document analysis phase, and the benefits of having
observers to ensure continuity between the various analysis
sessions that were held. A cumulative document analysis process
was strongly recommended.
The draft DTD is due out at the end of 1993. Following a period
for review and revision, it is expected to become an industry
standard by April 1994. Each individual company still needs to
consider how it will customize the DTD for its own use, and how
the standard will be implemented within the company.
The speakers felt that if members of any other industries were
considering forming a similar group, companies should join early,
plan carefully, and ensure that the anticipated benefits are
continually "sold" to participants and stakeholders throughout the
entire development process.
6. "Archetypal Early Adopters? Documentation of the Computer
Industry"
6.1 "Information Development Strategy and Tools -
IBM and SGML" - Eliot Kimber and Wayne Wohler (IBM)
Wayne Wohler began by describing the "BookMaster Legacy".
BookMaster is a GML application used by IBM Information
Developers to create IBM product documentation. GML is very
like SGML, although it lacks a concept comparable to the DTDs
of ISO 8879. BookMaster is a fairly extensive authoring language,
which has met IBM's information interchange and reuse
requirements for several years. However, now that IBM supports
more platforms and delivery vehicles (and wishes to interchange
information with other enterprises), and to answer the growing
demand from users, IBM has decided to migrate its Information
Development (ID) operations to SGML.
Eliot Kimber described how this migration is being carried out.
The procedure first involved the design of a processing
architecture (InfoMaster) on which to base the application
language and semantics (IBMIDDOC). Tools had to be found for
authors, editors and users, existing data had to be migrated to the
new environment, with documentation and educational materials
being developed along the way.
InfoMaster is an architecture for technical documentation that
defines the base set of element classes for technical documents; it
defines how DTDs should specify the semantics of such
documents and how programs should use the information. IBM
drew on the HyTime concept of Architectural Forms to
standardize the application semantics, and to facilitate the
interchange of information between different DTDs.
IBMIDDOC is based on industry standards, and was designed
without bias towards any particular processing model. It makes
use of explicit containment to describe all containing
relationships and uses elements as the basis of all processing
semantics. All relationships between elements are treated as
(HyTime-conforming) hyperlinks. IBMIDDOC does not use
inclusions or exclusions, short references, or #CURRENT
attributes.
Documents conforming to IBMIDDOC are organized into
conventional high level structures (e.g. prolog, front-, body- and
back-matter), each of which can contain recursive divisions.
Below the division level, elements are classified either as
information unit elements (e.g. paragraphs, lists, figures etc.), or as
data pool elements (e.g. phrases and other flowed material).
IBMIDDOC supports multimedia objects, and hyperlinks that can
either be cross references or explicit hyperlink phrases. Any
element can be a hyperlink source or target anchor, and
IBMIDDOC also supports HyTime's nameloc, dataloc, and name
query features.
6.2 Eve Maler (Digital Equipment Corporation)
(I missed this session, as I was slogging round Boston trying to get
my ailing laptop repaired.)
6.3 "Implementation of a Corporate Publishing System"
- Beth Micksh (Intergraph)
(I also missed this session, as I was still slogging round Boston
trying to get my ailing laptop repaired - this brief write up is
based upon the copy of Beth's overheads included in the
conference proceedings).
Intergraph has 15 documentation departments with 120 full-time
writers and a number of third party documentation suppliers. They
publish and maintain over 400,000 pages per year at an annual
cost of $20 million. The objectives behind developing and
implementing a corporate publishing system were to standardize
and facilitate the document creation and maintenance process, and
to create a corporate documentation production system.
The new system would be required to provide a standard
documentation source format capable of supporting four different
document types. The system would need to be robust enough to
handle the production of large software and hardware manuals, (in
all the required data formats), and also facilitate the reuse of
source data in a multi-platform environment. In addition, it should
also make possible the provision of on-line information, allow the
translation of existing system data, and support multilingual
documents.
SGML was the obvious solution to many of these problems, and
was adopted accordingly. The anticipated benefits to both the
corporation and to users were comparable to those that have been
outlined previously in other presentations (i.e. cost savings and
improved productivity, consistent document structures, greater
information interchange and re-use etc. etc.). The new system was
implemented in a two phase process - the first phase designed to
prove the principal concept(s); the second to produce the
production-ready system. Development was done jointly by three
divisions (systems integration, electronic publishing and corporate
publishing services) using a modular approach.
One DTD provides the structure necessary to support all the
required variants of Intergraph user documentation. Filters have
been developed to allow the conversion of legacy data from
existing tagged ASCII and FrameMaker formats.
The success of the introduction of the new system has depended
upon the cooperative efforts of all concerned: work by developers,
input from users, and support from management.
6.4 Jon Bosak (Novell)
(I also missed this session.)
7. International SGML Users' Group Meeting
This was the mid-year meeting of the International SGML Users'
Group (ISUG), the AGM having been held at SGML Europe '93 in
Rotterdam earlier this year.
After a welcome from Pam Gennusa (President of the ISUG),
representatives from many of the Users' Group National Chapters
addressed the meeting. Most of the Chapters claimed a
membership of around 45-70 individual members and a handful of
corporate members. Several people took the opportunity to
announce the recent formation of new National/local Chapters
(e.g. in Denmark, Sweden, US Tri-State, US Northern California),
whilst others expressed an interest in setting up such groups (e.g.
in US Alabama/South East [Beth Micksh] and Ottawa [Ken
Holman]). Pam announced that Richard Light, an independent
consultant based in the UK, had taken over from Francis Cave as
the Treasurer of ISUG following a vote by the ISUG Committee.
Some Chapters had been extremely active in the preceding twelve
months, staging numerous well-attended events (often vendor-
oriented). The first event staged by the newly formed Swedish
Chapter attracted around 200 attendees to a special one-off SGML
conference, although they do not yet have a clear idea of the size
of their ordinary membership.
Several Chapter representatives/members reported feelings of
apathy within their groups. Some Chapters had held only one or
two events in the past year and were having difficulties attracting
active members or developing programmes which would re-
enthuse existing members to support Chapter activities. Members
from those Chapters which had organized several successful
events outlined the strategies and policies that they had used, and a
brief discussion ensued for the benefit of existing and newly forming
Chapters.
Pam spoke briefly about the software which has been released
through the ISUG (i.e. the ARCSGML Parser Materials, and the
IADS software). She wished to re-emphasize that any software so
released is not in any way endorsed, approved or checked by the
ISUG. The ISUG does not have the resources to undertake
software investigation or evaluation, but is willing to consider
facilitating the distribution of any software which might be of
interest to members of the SGML user community. Pam also
outlined the relationship that has been established between ISUG
and the SGML Open industry consortium; the ISUG has a non-
voting place on the committee of SGML Open, and has put
forward a proposal that ISUG members may be willing to
participate in a case study exercise for SGML Open.
Brian Travis reminded all present that the monthly newsletter
"<TAG>" is available at a special subscription rate to ISUG
members. Anyone interested should contact him for more details
(Phone: +1 303-680-0875 Fax: +1 303-680-4906). Daily editions
of the <TAG> newsletter were also being circulated throughout
the duration of the conference.
The next meeting of the ISUG will be the AGM, to be held at
SGML Europe '94 (Montreux, Switzerland, May 15-19th 1994).
8. "Real World Publishing with SGML" - Terry Allen
(Digital Media Group, O'Reilly & Associates, Inc.)
O'Reilly's online "Whole Internet Catalog" first appeared in
printed form, produced using a troff-based approach. The
structure of the information was very loose, with the title being the
only common structural element shared between all the entries.
Terry described how he managed to develop an online version of
the "Whole Internet Catalog" by using a combination of sed and
awk scripts to translate the source files into versions tagged with
HTML markup. HTML, the Hypertext Markup Language, is a
DTD developed for providing information on the World Wide
Web (WWW) - a global network of information servers linked
together over the Internet.
HTML was first designed as a tag set, and although it can be used
as a DTD there is no requirement that it should be. The
HTML-aware browsers which are used to access information on WWW
tend to treat HTML as if it is a procedural (rather than a
descriptive) markup language. It is left to the creator of the
information to decide whether or not to validate markup against
the HTML DTD, as current implementations of the browsers do
not parse any document they process, and are very (perhaps too)
forgiving of any markup discrepancies. HTML is fundamentally
lacking in terms of imposing any strong structural conventions (the
occurrence and ordering of many elements are often optional), and
its designers appear to have made some rather surprising decisions
- such as the use of an empty <P> element to indicate the breaks
between paragraphs.
Terry described how he has decided to filter all his files into a
more constraining but still simple DTD. The source files for the
"Whole Internet Catalog" are now more like records in a database
than narrative text files, but they can be easily converted (using
OmniMark and an awk script) into HTML. This approach has
proved successful up till now, because Terry has been working
alone to maintain and process the data. New authors for the
Catalog will need to be trained and provided with tools to support
authoring with Terry's DTD. Limited testing has shown that the
most successful approach to adopt with new authors is likely to
involve the use of SoftQuad's Author/Editor and a template file of
tags to generate the source information. Terry said that his
experiences had shown that the authoring/editing process was less
likely to be error prone if HTML attributes were actually
represented as ordinary elements in his DTD - and if good use
was made of display facilities (e.g. use of fonts and colour) to
facilitate at-a-glance structural checks by humans. He has also
adopted a macro-based approach to map the source SGML files
into gtroff for printing on paper - although developing more
robust filters may be required in the future.
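
The record-to-HTML filtering step described above can be pictured
with a small sketch. It is written in Python purely for illustration
(Terry used OmniMark and an awk script), and the field names
"title", "address" and "description", together with the
entry_to_html function, are hypothetical; none of them is taken
from his DTD.

```python
from html import escape


def entry_to_html(entry):
    """Render one catalog record (a dict of fields) as an HTML fragment."""
    lines = ["<H2>" + escape(entry["title"]) + "</H2>"]
    # Optional fields are emitted only when the record supplies them.
    if entry.get("address"):
        addr = escape(entry["address"], quote=True)
        lines.append('<A HREF="' + addr + '">' + addr + "</A>")
    for para in entry.get("description", []):
        # HTML of this period used an empty <P> as a paragraph separator.
        lines.append("<P>")
        lines.append(escape(para))
    return "\n".join(lines)


# Invented sample record, for illustration only.
record = {
    "title": "Weather & Climate",
    "address": "gopher://example.org/weather",
    "description": ["Forecasts for many cities.", "Updated hourly."],
}
print(entry_to_html(record))
```

Keeping the source as records and generating the HTML means the
loosely structured output never has to be edited by hand.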
Terry felt that the lessons he had learned were that the production
and handling of SGML source is (currently) likely to be done in an
heterogeneous environment (i.e. authoring using one set of tools,
linking and processing using another, and so on). He is still
looking for good, cheap tools which will support the use of
arbitrary DTDs, but an ideal future tool might be a sophisticated
browser (such as Xmosaic) which also provided an authoring
mode which supported user-supplied DTDs. Terry reported that
he has also been closely following the development of HTML+, a
revised version of HTML, which might provide a more
robust/constraining DTD.
9. "HyTime Concepts and Tools" - Dr. Charles
Goldfarb (IBM)
Dr. Goldfarb began by providing a very brief overview of SGML,
and the advantages to be gained from its use. He showed a simple
example of some typical SGML markup, and then discussed the
impact of SGML since its release as an ISO standard in 1986.
SGML has become the dominant tool for producing structured
documents, and has been widely adopted by industry, government
and education. However, the real impact of SGML is that
information owners, not programs, now control the format of their
data; information creators/providers are no longer at the mercy of
the proprietary solutions developed by vendors.
Widespread adoption of SGML will in turn encourage the
creation of Integrated Open Hypermedia (IOH) information and
systems. IOH information is integrated in as much as all
information is linkable, whether or not it was specially prepared
with linking in mind. It is "Open" because the addressing of the
linked location is not bound to a physical address until the link is
"traversed" and the "anchor" accessed. "Hypermedia"
represents the union of hypertext (information which can be
accessed in a random order) and multimedia (information
communicated by more than one means, e.g. text + graphics + audio
+ animation etc.).
HyTime, the Hypermedia/Time-based Structuring Language
(ISO/IEC 10744), is an application of SGML that has been
developed to enable IOH. HyTime standardizes the most useful
and general constructs and concepts to do with linking
hypermedia/time-based information. It facilitates hyperlinking to
any information object (whether or not it is SGML), has a rich
model for representing time, and supports an isomorphic
representation of time and space. The success of SGML and
HyTime is being driven by the fact that users are now demanding
products that use real Standards, rather than those which merely
offer proprietary solutions.
HyTime standardizes the use of SGML for hypertext and
multimedia. It provides sets of standardized SGML attributes
(called "architectural forms"), which convey the information used
by the useful/general hypermedia constructs mentioned in the
paragraph above. Architectural forms can be recognized by
suitable processing software, and decisions or actions taken on the
basis of the values of the attributes. HyTime extends SGML
hierarchical structures by facilitating the lexical modelling of data
content and attribute values, and also by supporting inheritance of
attribute values. HyTime extends SGML hyperlinking (i.e.
IDREF) capabilities, and adds co-ordinate structures (called Finite
Co-ordinate Spaces) to handle the alignment and synchronization
of information objects in time and space. Using HyTime means
that we no longer have to deal with a single SGML document, but
can make seamless use of whole libraries of documents or pieces
of documents.
Dr. Goldfarb then talked about the development and release of two
Public Domain HyTime tools: ObjectSGML and POEM.
ObjectSGML is an Object-Oriented SGML parser which supports
incremental parsing, entity structures as they were originally
envisaged during the development of ISO 8879, and the processing
LINK feature. It also offers native HyTime support - validating
architectural forms, handling location addressing, and processing
HyQ queries and properties. The source code will be made
publicly available for free.
POEM, the Portable Object-oriented Entity Manager, provides a
platform-independent interface to real physical storage. It
supports ISO/IEC 9070 public identifiers for universal object
identification, ISO/IEC 10744 Standard Bento (SBENTO) and ISO
9069 the SGML Document Interchange Format (SDIF). It has no
parser dependencies, so it can be used with any of the existing
SGML parsers. As with ObjectSGML, the source code will be
made publicly available for free.
ObjectSGML and POEM are the results of Project YAO,
conducted by the International consortium for free SGML
software development kits (SDK). Participants included Yuan-ze
Institute of Technology (Taiwan, ROC), IBM (in both the US and
France), Naggum Software (Norway) and TechnoTeacher Inc (in
the US). Between them, the development team had extensive
experience of developing SGML-aware tools and systems.
Implementations of SDK will use the standard C++ class library,
be entirely platform-independent, and be based on proven
products.
The architecture of ObjectSGML is built around a low-level
"parser event" API, a variable-persistence cache, a high-level
"information object" API, and uses POEM for entity management.
Most existing SGML parsers use the low-level "parser event"
approach (i.e. passing all start/end tags, attributes, data etc. when
found, whilst only retaining the current structural context). The
use of a high-level "information object" API means ObjectSGML
will provide access to both the element and entity structure of a
document; addressing can be done using the HyTime location
model ("proploc" and HyQ). The variable-persistence cache will
be maintained by the application in a proprietary format; it will
allow rapid access to information found by the parser event API,
and can be optimized for an application. Using such a cache
avoids re-parsing, since it will hold such values as "next event",
"element", "entity", "parsing context" etc.
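
The difference between the two API levels can be sketched as
follows. This is an illustration in Python under stated assumptions,
not ObjectSGML's actual interface: the event tuples, the Element
class, and the build_tree and find_all functions are all invented
for the example. The low-level view is the flat list of events; the
high-level view is the retained tree that can be queried afterwards
without re-parsing.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    attrs: dict
    children: list = field(default_factory=list)  # Element or str items


def build_tree(events):
    """Fold a flat stream of parser events into a retained element tree."""
    root = Element("#document", {})
    stack = [root]  # the current structural context
    for event in events:
        kind = event[0]
        if kind == "start":
            elem = Element(event[1], event[2])
            stack[-1].children.append(elem)
            stack.append(elem)
        elif kind == "data":
            stack[-1].children.append(event[1])
        elif kind == "end":
            if stack[-1].name != event[1]:
                raise ValueError("mismatched end tag: " + event[1])
            stack.pop()
    return root


def find_all(elem, name):
    """Walk the retained tree, yielding every element called `name`."""
    for child in elem.children:
        if isinstance(child, Element):
            if child.name == name:
                yield child
            yield from find_all(child, name)


# Hypothetical event stream, as a low-level parser might report it.
events = [
    ("start", "doc", {}),
    ("start", "para", {"id": "p1"}),
    ("data", "Hello"),
    ("end", "para"),
    ("start", "para", {"id": "p2"}),
    ("data", "World"),
    ("end", "para"),
    ("end", "doc"),
]
tree = build_tree(events)
print([p.attrs["id"] for p in find_all(tree, "para")])
```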
Alpha test versions of the software were shipped to test sites last
Thursday. The results of testing will determine what revisions
may be required. However, the current intention is that both
ObjectSGML and POEM should be publicly available no later
than the end of the first quarter of 1994.
10. "HyTime: Today's Toolset" - Dr. Charles Goldfarb
(IBM) and Erik Naggum (Naggum Software)
SGML was developed for the benefit of information owners. It
requires that information representation should be independent of
applications and systems, which means that more than one
representation is always necessary: an abstract (logical)
representation, one or more perceivable presentations, and an
internal storage representation. The real storage of information is
obviously platform dependent at some stage, but SGML liberates
information from such dependencies through the use of entities
and an entity manager.
SGML entities can take a variety of forms. External identifiers,
such as DOCTYPE, ENTITY, LINKTYPE, NOTATION and
SGML, all declare (and some also reference) entities. A system
identifier is really a "locator" in as much as it specifies the
physical location where an entity can be found; it involves the use
of a Storage Object Specification (SOS), and a storage manager.
There is no requirement that an entity should occupy all the
content of a storage object, so the manager must be able to extract
substrings (as well as handle things like record boundary insertion
and omitted system identifiers).
The use of public identifiers means that registered and formally
identified entities can be recognized by conforming SGML
systems. The practical implementation of the formal registration
procedures outlined in ISO 9070 has yet to be finally sorted out
by the selected registration authority (the GCA). Until this is
done, it is perfectly possible to formally identify public entities
using ISBNs - and this approach has already been adopted by
companies such as IBM. It should also be remembered that the
entities associated with public identifiers can also exist in different
versions; for example some SGML software requires that DTDs
are precompiled before the system can use them.
Erik Naggum spoke briefly about Formal System Identifiers
(FSIs). An FSI lists a Storage Object Specification, giving details
of the storage system type, storage object identifications, record
boundary indicators, and substring specifications. The record
boundary indicator was felt to be necessary, because many people
now move files between DOS, Unix, and Mac systems, and each of
these has a different way of indicating record boundaries. Erik
showed an example of how files on two different types of systems
could be concatenated and passed to an Entity Manager as if they
were a single unit.
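
The record boundary problem can be illustrated with a short sketch.
It is written in Python and does not use FSI syntax (which is not
reproduced in this report); the convention labels "dos", "unix" and
"mac" and the read_records and concatenate functions are
hypothetical names chosen for the example. Each storage object is
split using its own record-end convention, and the records are then
presented as a single unit.

```python
# Record-end byte sequences for each system, keyed by an invented label.
RECORD_ENDS = {"dos": b"\r\n", "unix": b"\n", "mac": b"\r"}


def read_records(data, convention):
    """Split a storage object's bytes into records using its own convention."""
    records = data.split(RECORD_ENDS[convention])
    # A trailing record end produces one empty final piece; drop it.
    if records and records[-1] == b"":
        records.pop()
    return records


def concatenate(storage_objects):
    """Present several (bytes, convention) storage objects as one entity."""
    records = []
    for data, convention in storage_objects:
        records.extend(read_records(data, convention))
    # Rejoin with a single, uniform record boundary.
    return b"\n".join(records)


entity = concatenate([
    (b"<chapter>\r\n<title>One</title>\r\n", "dos"),
    (b"<para>Text</para>\r</chapter>\r", "mac"),
])
print(entity.decode("ascii"))
```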
Interchange facilities involve a separation of the virtual entity
management from the real physical storage entity management.
SDIF (ISO 9069) details how SGML objects can be combined into
a single stream for the purposes of interchange. SBENTO is
described in the HyTime standard (ISO/IEC 10744) and whereas
conventional BENTO uses a directory-based approach to control
the packing of objects into a single stream, SBENTO uses SGML
entities to make the process simpler. It is also possible to
interchange SGML objects packaged using conventional archiving
tools (e.g. PKZIP).
Good entity management is very important to the success of any
SGML-based system. Entity structure is a "virtual storage system"
which isolates the element structure from system dependencies,
and allows storage-related properties and processes. Entity
structure is (literally) the foundation of SGML; it supports both
SGML-aware and non-SGML aware storage and programs, and
also allows the successful interworking of both types.
The soon-to-be-released POEM (Portable Object-oriented Entity
Manager), announced in Dr. Goldfarb's previous presentation,
implements all the principles of good entity management. A copy
of the POEM specification (version 1.0, alpha level 1.0) was
distributed to attendees.
11. "Charles Goldfarb, Please Come Home: SGML, the
Law, and Public Interest" - Ina Schiff (Attorney at
Law) and Allen H Renear (Brown University)
(Ina Schiff was unable to give her part of this presentation, which
was given on her behalf by another speaker.)
The conventional wisdom is that SGML is best-suited to long-
lived, structured documentation such as technical manuals.
However, the purpose of this presentation was to suggest that it
could be effectively applied to the handling of structured legal
documentation of the sort regularly produced by attorneys on
behalf of their clients.
Attorneys are used to researching and supplying the information
content that comprises the key parts of most legal documents.
However, they generally leave the structuring of the information
and the inclusion of legally required ("boilerplate") text to
paralegal and secretarial staff. Adopting an SGML-based
approach to the creation of such structured documents would mean
that attorneys would no longer have to rely on other members of
staff to correctly structure their texts and include required
elements etc. A well-featured SGML-based system could easily
provide a good authoring and editing environment in which to
create and revise these kinds of documents.
It is possible to imagine a future in which electronic multimedia
structured documents are acceptable as submissions in court. If
such documents were also archived in the main legal text
databases, it would greatly facilitate the generation, delivery,
interchange and reuse of legal information. This would represent
a better service to clients, and hopefully lessen the tremendous
amount of paperwork which is currently required as part of the
legal process.
Allen Renear strongly argued for the collaborative development of
any DTDs for the sorts of documents mentioned above. He
suggested that the legal community could usefully benefit from
modelling the approach adopted by the academic community when
developing the Text Encoding Initiative's (TEI) Guidelines.
Allen gave an entertaining account of how he had been called as
an expert witness to defend Ina Schiff's use of SGML to produce
her own structured documents, after an opponent who had lost a
case to Ina was contesting the size of her fees (having had costs
awarded against them). The case against Ina suggested that by
entering the data content herself, she had actually been performing
a secretarial task, and so the work should have been charged at an
appropriate rate; they also alleged that Ina had made substantially
less-than-average use of paralegal and secretarial assistance when
preparing her case. Allen argued that her use of an SGML-based,
structured document authoring environment had allowed her to get
on with the job of producing information content, and to do it
more efficiently; Ina won the case. Allen said he would now be
looking forward to the time when attorneys could be taken to court
for not using SGML when preparing documents for a case.
12. "Online Information Distribution" - Dave Hollander
(Hewlett Packard)
There are several current barriers to the delivery of online
information. Authors only have a limited selection of authoring
tools, they are not used to reusing information, and the information
they receive/deliver can be inconsistent. Publishers have no
standard tools to process the information they receive from
authors. Customers, meanwhile, have specific hardware and software
requirements, which may be unique to them. This means that
throughout the process of developing online information, different
convertors are required for every environment - and this creates
particular difficulties for large companies like HP, who now
produce 3-4 gigabytes of information each month.
The kinds of information that might be distributed online vary
considerably. A typical list might include such things as: context-
sensitive and task-oriented application help, online reference
manuals, multimedia (graphics, audio, video), hypertext,
information that is conditional on the current environment, history
or other factors, and so on.
HP have come up with a short term solution, which is not ideal in
SGML terms, but fits their purpose. The Semantic Delivery
Language (SDL) is a delivery format defined by a DTD; it
provides an intermediate language/format in which to deliver
information, and it also facilitates tool development and
information reuse. SDL's development was entirely driven by
practicalities, rather than a wish to experiment with SGML-based
techniques.
Achieving good performance was the number one issue, so
documents were broken down to allow for multiple entry points.
The designers of SDL also had to give up some SGML features
(e.g. markup minimization), use pre-calculated counters for
chapters etc., use a DTD which would work with simple/cheap
parsers, and use a small number of elements. Certain parts of the
documents (e.g. the table of contents, indexes etc.) are pre-
computed before display to improve performance.
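
The idea of pre-calculating counters and the table of contents can
be sketched as below. This is a Python illustration only, and does
not reflect SDL's real element names or tools: the precompute
function and the "title", "body" and "number" fields are
assumptions made for the example. The point is that the numbering
is done once, before delivery, so the viewer never has to count.

```python
def precompute(chapters):
    """Assign chapter numbers and build a table of contents ahead of display."""
    numbered = []
    toc = []
    for number, chapter in enumerate(chapters, start=1):
        # Store the counter on the chapter itself so display code can use it directly.
        numbered.append({"number": number, "title": chapter["title"],
                         "body": chapter["body"]})
        toc.append(str(number) + ". " + chapter["title"])
    return numbered, toc


# Invented sample input.
chapters = [
    {"title": "Installation", "body": "..."},
    {"title": "Configuration", "body": "..."},
]
numbered, toc = precompute(chapters)
print("\n".join(toc))
```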
SDL had to be flexible. The system had to support font and page
specifications which are separated out from the document content
(but which allow flexible displays). A normalized set of semantics
is included, although the designers also allowed fourteen different
types of system dependencies [?]. There are a variety of link
types, but SDL is not (yet) HyTime compliant. A version id is
placed on all containing elements to support version
control/viewing if the tools being used are powerful enough.
SDL is intended to provide a structure which can help a reader get
to the right information at the right time. Source semantic
identifier attributes allow individual and/or groups of elements to
be easily identified - but this approach is not ideal, and HP may
adopt a HyTime based approach in future. SDL's designers put
a lot of thought into the development of filters; from the first, SDL
was planned with filtering in mind. The inclusion of display level
semantics facilitates filtering from SDL, and from other
procedural formats that are still being used.
SDL is a complete information model, in that it includes
formatting (DSSSL-type) information as well as structuring
information. SDL hierarchical modelling also makes it easier for
non-SGML aware programmers to develop tools quickly, and gets
them interested in the concepts of SGML. However, the real value
of SDL lies in the fact that HP's customers can now get consistent
information distributed online.
13. "Xhelp: What? Why? How? Who?" - Kent Summers
(Electronic Book Technologies)
Xhelp is a standard online help solution for the Unix/X-Windows
environment. Xhelp was not produced by EBT in isolation, but is
the result of collaboration between numerous X/Unix developers.
The current situation is that each vendor often has their own
solution to providing online help - which makes the whole
situation very complex, and makes life difficult for the end users.
Current solutions to this problem either provide less effective help,
or are more expensive.
The people involved in producing online help (writers, designers,
programmers, managers etc.) are not always able to collaborate,
and any new solution has to bear this in mind. Online help must
be consistent, and should exist as a service which is separate from
the client applications. Applications and online help often have
different release cycles, so online help should remain uncoupled
from applications in order to support incremental updating of
content. Moreover, any solution to the problems of providing
online help must be truly cross-platform for both applications and
documents.
The Xhelp solution makes use of the existing standard
communication layer in X. Information representation and display
favours using SGML, but the display system also supports
PostScript, ASCII etc. to allow for the easy inclusion of legacy
data.
Kent then spoke about the Xhelp architecture (which separates the
communication logic, data representation, formatting and display
tool layers), and the Xhelp protocol. He described the various
parameters that comprise the Xhelp client message content, and
outlined the advantages to be gained from using the Xhelp
architecture in relation to the Xhelp protocol and Xhelp.dtd.
Advantages to be gained from the Xhelp procedural approach
included such factors as: programmers no longer have to do
anything to support on-line help; time of writer/programmer
collaboration is reduced; the fallback algorithm works to provide
some help, even if the user's query cannot be answered directly;
supports context-sensitive and task-oriented help; offers "fill-in-
the-blank" templates for linking to help information. The
Xhelp.dtd means that authors gain reusable skills, since they
become familiar with a single content model and set of authoring
tools. Other advantages of Xhelp included: no installation, and
help can be distributed independently of applications; good
performance; encourages a good environment for
business/competition.
Kent talked through the online help production process from the
programmers' and writers' points of view, comparing the
traditional approach with an approach based on the use of Xhelp.
The Xhelp approach requires far fewer steps (4 as opposed to 12)
and the programmers' input is reduced to zero. Xhelp provides
numerous benefits to Unix systems vendors, independent software
vendors, and to end users. It provides a cost-effective
solution, with positive benefits for all involved.
(The Xhelp developers' forum and DTD are maintained on the
O'Reilly & Associates ftp site.)
14. "Digital Publications Standards Development (DPSD):
A Modular Approach" - Horace Layton (Computer
Sciences Corporation)
DPSD is a three phase programme to streamline and modernize the
acquisition, preparation, production and distribution of
information for the US Army. The developers hope to have the
finished standard available next summer.
The MIL-STD-361A standard is the flagship product of the DPSD
programme. It is task-oriented, and consolidates six former
technical manual standards into one. The DTD eliminates
chapters, sections, and most paragraph numbering requirements,
and focuses on structuring information content rather than style or
formatting issues. Horace talked through the evolution of MIL-
STD-361A, and the products produced by the end of DPSD phase
II (in June 1993) - which included several DTDs and a number
of FOSIs. He then looked at the programme for DPSD phase III.
The concept behind MIL-STD-361A is that the majority of
maintenance manuals, regardless of the level of maintenance,
contain similar functional requirements. Knowledge of technical
manuals and analysis of technical manual data allows one to group
functional requirements into modules of information. Creation of
DTDs for these modules allows use of the same modules wherever
the same functional requirements are imposed. There is only one
DTD requirement for each technical content volume. The
approach can be used for all levels of maintenance (operator, unit,
direct support/general support, and depot). A single DTD is used
to assemble all the required modules into a complete technical
manual (TM).
The DTDs in MIL-STD-361A are content driven, and comply
with both ISO 8879 and MIL-M-28001. The DTDs are quite
small in size (8-10 pages), and currently only seven DTDs cover
all the Army's TM requirements contained in MIL-STD-361A.
The DTDs are intended to be easy for authors to understand and
use.
Horace showed a diagram of the MIL-STD-361A technical manual
structure, and another giving an overview of the MIL-STD-361A
concept. He then briefly described the relationship between the
MIL-STD-361A and MIL-D-87269 (IETM - Interactive
Electronic Technical Manual) standards. Horace closed with a
description of the field test/proof of concept objectives, and the
test sites and schedule. Testing is due to finish by the end of April
1994, with the final document ready for submission by May/June.
The DPSD programme means that the US Army is well on its way
to achieving a full digital delivery capability.
15. "A Practical Introduction to SGML Document
Transformation" - David Sklar (Electronic Book
Technologies)
This presentation looked at the requirements for, and features of
an "SGML transformation engine" which David likened to "the
Swiss Army pocket knife of the SGML community" (i.e. an
extremely useful, multi-featured tool - although this description
somewhat loses the humour of David's remarks).
David began by proposing his personal nomenclature for
distinguishing between "conversions" and "transformations".
Conversions are "Up and Lexical" in that they involve the
(upward) conversion of non-SGML unstructured source
information into SGML, on the basis of a content-identified (i.e.
lexically-based) process; it is difficult to achieve 100% accuracy
during conversion. Transformations are "SGML and Down" in
that they involve the (downward) translation of SGML, content-
identified, validated data into a non-SGML form; it is possible to
achieve 100% accuracy during these types of transformation. This
presentation focused on SGML transformations (rather than
conversions).
A dreaded "chasm" exists between the optimal in-house DTD, and
the processor needs/distribution DTD. An optimal in-house DTD
is based on a content-driven design, is formatting-free, and
supports omission of generatable data. A "processor needs" DTD
is useful for people who want to output hardcopy or to publish
online - but this is still a relatively young industry, and the
processors have limitations (e.g. a processor like DynaText cannot
do such things as the auto-numbering of heterogeneous element
types; it is inefficient at calculating list-adornment types (e.g.
bullets) based on context-sensitive algorithms; it should not be
used to auto-generate text as this will be unknown to the search
engine). A "distribution DTD" might be something like an
industry-standard DTD (e.g. ATA, J2008 etc.) or a DTD required
by a major customer. This situation usually results in the in-house
DTD being compromised to bridge the gap between the optimum
and processor needs/distribution DTDs.
Compromising the optimal DTD is the wrong solution to adopt. It
leads to a number of additional costs (e.g. disruption of the current
authoring/conversion process, the DTD's portability/lifespan is
shortened etc.). The solution is distasteful in so far as it represents
a change in long-term strategy to compensate for temporary
deficiencies in current SGML processors. Ultimately, the aim
should be to bring the SGML consumer to the data, and not to
force the data to travel to the customer.
The correct way to bridge the chasm is to build an "Xform
Bridge" (automated SGML transformation) - to transform data
which conforms to the optimal in-house DTD into a form that
complies with an interim, "process-ready" DTD. This way, the
author/review environment remains unaffected by the
transformation process, and the possibility that the
transformation can be 100% automated means that it can be
potentially 100% accurate. Moreover, the interim instance need
not conform to an actual DTD, as this is not required by some
processors (e.g. DynaText), and validation may be deemed
optional if the transformation process is itself validated.
David then compared the main characteristics of an "Author-
Ready" DTD with those of a "Process-Ready" DTD. An Author-
Ready DTD will contain no information that can or should be
auto-generated; authors are allowed to focus on content which they
alone can produce. David called this approach the "noble but not
usually practical" form of SGML (emphasizing that it is not
riddled with "compromise" attributes designed to satisfy the
processor). A Process-Ready DTD will contain auto-generated
information that is needed by the processor - and David called
this the "real-world" form of SGML (in the sense that this is how
SGML systems are typically implemented).
Transformations can also be used for a variety of other purposes,
such as importing data from external sources (e.g. generating
tables from information held in a database). Transformations can
also be used to automate "mindless" or "formatting-based"
authoring tasks, such as calculating a list-adornment type. They
can be used to perform certain types of document analysis, for
example producing a report on some data (e.g. statistics, frequency
counts etc.). A transformation engine can also assist the semantic
validation of data, for example it could check the validity of the
value of a "date-month-year" attribute (which an ordinary SGML
validating parser will not check). Transformations could also help
to extract/duplicate data (e.g. generating an index or table of
contents), and to hide data on the basis of an intended audience.
Furthermore, a transformation engine could apply style information
during the transformation process - which would facilitate the
fast on-line display of the resulting SGML documents.
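The semantic validation mentioned above can be sketched in a few
lines. This is only an illustration in Python (not one of the
tools discussed at the conference); the DD-MM-YYYY layout of the
"date-month-year" attribute value and the function name
is_valid_dmy are assumptions, since the report does not say how
such a value would be written.

```python
from datetime import datetime


def is_valid_dmy(value: str) -> bool:
    """Return True if value is a real calendar date in DD-MM-YYYY form."""
    try:
        # strptime rejects impossible dates such as 31-02-1993,
        # which a validating parser would accept as plain CDATA.
        datetime.strptime(value, "%d-%m-%Y")
    except ValueError:
        return False
    return True
```

A transformation script would call such a check on each attribute
value as it is encountered, and report failures without stopping.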
Most of the existing transformation engines fall into what David
called the category of "Parser with a Print statement" utilities (i.e.
the current context is maintained using a simple element stack etc.).
This approach is limited in as much as it offers no lookahead
capability, and only limited "lookbehind" (typically only to
ancestors and left siblings). The output side often has no SGML
intelligence, and so it is quite easy to output invalid documents.
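A rough sketch of the "Parser with a Print statement" idea, in
Python, may make the limitation clearer. The event tuples stand in
for what a parser would report, and the name stack_transform is
made up for this illustration; no particular product works exactly
this way.

```python
def stack_transform(events):
    """Emit one output line per data event, prefixed by its ancestor path."""
    stack = []  # currently open elements: the only context available
    out = []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif kind == "data":
            # Only ancestors are visible here; nothing that follows
            # in the document can influence this output line.
            out.append("/".join(stack) + ": " + value)
    return out


events = [
    ("start", "doc"),
    ("start", "title"), ("data", "SGML '93"), ("end", "title"),
    ("start", "para"), ("data", "Hello"), ("end", "para"),
    ("end", "doc"),
]
```

Note that the output is built from plain strings, so nothing in the
sketch prevents it from emitting an invalid result.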
Another category of transformation is the "tree analyzer with a
print statement" utility. There are very few tools of this type (e.g.
EBT System Integrators Toolkit), and although they allow
unlimited access to the document tree (with arbitrary look-
ahead/behind, and re-ordering of input elements), there is still no
SGML-awareness on the output side.
There is a third category of transformations, which David called
"tree transformation" utilities (e.g. the Polypus library extensions
to Balise from AIS). To date, this would appear to be the only
product with SGML-awareness on both the input and output sides
of the transformation process.
David then proposed a number of tips that might prove useful
when comparing transformation engines. Check the software
externals, such as the range of platforms supported, speed, RAM
requirements etc. Also check the software internals i.e. whether or
not the scripting language is comfortable/easy to learn/offers good
diagnostics/is extensible, look at the script-debugging aids, the
error recovery during parsing of an input document, access to the
input tree, pattern matching, and access to external data.
David had hoped to talk about the role of GLTP (the General
Language Transformation Processor) as defined in the DSSSL
standard, but unfortunately he did not have time. (N.B. GLTP is
likely to be renamed as STTL, the SGML Tree Transformation
Language, in the next draft of DSSSL). David also wanted to talk
about "GLTPeeWee" the demonstration prototype GLTP
transformation engine that he hopes to release into the Public
Domain, but this had to be left until a special evening session.
16. "SGML Transformers: Five Ways" - Chair: Pam
Gennusa (Database Publishing Systems Limited)
This session was designed to give an overview of how various
tools would solve some typical transformation problems (e.g.
transforming document instances from one DTD to another, or
between an SGML document instance and something else). A
number of tools were represented, and the problems were specified
in advance by the speakers themselves.
The Tools:
Balise and Polypus (AIS/Berger Levrault) - Christophe LeCluse
SGML Hammer (Avalanche) - Ludo van Vooren
The Copenhagen SGML Tool [CoST] (Public Domain, University
of Copenhagen) - Klaus Harbo
GLTPeeWee (Public Domain, David Sklar) - David Sklar
[unfortunately, David was not able to present his GLTPeeWee
solutions to the problems, because his notes had disappeared].
OmniMark (Exoterica Corporation) - Andrew Kowai
TagWrite (Zandar Corporation) - Harry Summerfield.
The Problems:
(N.B. most problems were accompanied by a sample case of the
type of DTD outlined in the problem specification, and some also
included sample post-transformation output.)
1) INSTANCE NORMALIZATION: Starting from an arbitrary
source SGML instance with potential markup minimization,
generate a non-minimized result instance in which all SGML
information according to ESIS has been made explicit, and which
can still be parsed using the same DTD. The program should be
written to be DTD independent. This kind of processing is
extremely frequent. The result instance can be easily post-
processed by non-SGML systems (e.g. typesetting systems
loaders) or by minimal SGML systems. [Set by AIS]
2) DICTIONARY INSTANCE SPLIT/MERGE: Given a
dictionary made of entries, split this instance into as many files as
there are entries, while generating an SGML skeleton which keeps
track of the entry file names. In a second step, perform the inverse
operation: use this skeleton to gather all entries (stored one per
file) and re-create the original instance. This exercise is a
simplified version of a very common operation in database loading
and extraction situations. It stresses the input/output capabilities
of the application language. [Set by AIS]
3) STRUCTURAL TRANSFORMATION: Starting from a source
instance described by a hierarchical type DTD, generate a "flat"
instance described by a "flat type" DTD. (This kind of problem is
very typical of problems encountered when generating input for a
non-SGML word processor or DTP tool; "flattening" to some kind
of "transit" DTD is often needed). In a second step, do the
opposite: from the generated "flat" instance, re-generate the
original hierarchical instance. (This kind of problem is very
common in retroconversion). Note: the two sample DTDs are
designed to illustrate recursive definitions, and are not meant to be
really useful in the real world. [Set by AIS]
4) ID TYPE CHECKING / ID COMPUTATION: An SGML
parser checks uniqueness of ID attributes in the instance and
checks that IDREF attributes are resolved, but this does not
guarantee by itself correctness of the cross-references. In the
cases when cross-references are "typed" (i.e. a cross-reference to a
figure is not the same as a cross-reference to a table), checking the
type of elements associated to the ID/IDREF attribute pairs
provides an additional checking level. This problem examines
how to handle such a task. As an auxiliary task, it is asked to fill
implied ID attributes.
5) FLOATING ELEMENTS CONSISTENCY CHECKING:
Floating (empty) elements are often used in DTDs to markup
revisions or internationalization. These empty elements occur in
pairs with a "starting" and an "ending" element. When such
elements are declared floating, an SGML parser cannot check
much about the way they are used. This problem examines how to
handle that task with an application language.
6) CALS TO AAP TABLE CONVERSION: The general problem
consists in transforming SGML structured tables from the CALS
table DTD to the AAP table DTD. The transformation program
should handle: the general structure of the table, spanning, and
borders. The program should be DTD independent. That is, given
a DTD including CALS table description, the program can take
any instance of this DTD and output the given instance where all
CALS tables have been replaced by AAP tables. [Set by AIS]
7) DUPLICATION OF DESTINATION TEXT AT SOURCE OF
XREF: Consider a DTD that contains SECTREF elements that
represent cross-references to SECTION elements [DTD example
omitted]. The transformation should replace each input-DTD
SECTREF with an enhanced output-DTD SECTREF [example
omitted]. When a SECTREF is encountered during the
transformation, the engine should automatically generate this text
as the content of the output: "SECTREF: see 'XXXXXX'", where
XXXXXX is a copy of the contents of the TITLE of the
SECTION that is referenced by the SECTREF. Note that TITLE
is not atomic; a TITLE instance can contain an arbitrary number of
EMPH subelements. Also note that it is possible for a SECTREF
to appear before the SECTION to which it refers. The
transformation should perform error checking to ensure that
exactly one SECTION is referenced, report problems
appropriately in a 'non-fatal' way (i.e. should continue processing
and produce a usable document instance). [Set by AIS]
8) EXTRACTION/SORTING: Consider a DTD that allows
FIGURE elements to be placed anywhere in the body of the
document. Some figure elements have caption attributes, but not
all of them do: [DTD example omitted]. Create a transformation
that appends an element called FIGREVIEW to the end of the
ENDMATTER's child list: <!ELEMENT FIGREVIEW - -
(FIGURE+)>. FIGREVIEW simply contains copies of all the
FIGURE elements found in the body of the document, sorted
lexicographically based on the figure's caption. (Assume all text
is ISO 8859-1). Uncaptioned figures should not appear at all in
the FIGREVIEW section. Note that each FIGURE element is
actually the root of a subtree of arbitrary size (I intentionally don't
show the content model for HOTSPOT); make sure the entire
subtree of each captioned FIGURE is duplicated in the
FIGREVIEW section. (I am interested in knowing if that is
possible, in your technology, without full knowledge of the
content model of HOTSPOT). [Set by EBT]
9) ROW COUNTING: While converting an SGML marked-up
table to some other form (e.g. a typesetting language), count the
number of rows and columns in the table and put the total in a
specified position at the start of the table. [Set by Exoterica]
10) LIST MARK FORMATTING: Given a <list> element, that
can have other <list> elements within its content, which has an
optional attribute whose presence determines the manner in which
items in each list are marked or numbered, and whose absence
indicates that the mark or numbering form is to be deduced from
the ancestry of the <list> (e.g. a <list> within a decimal-numbered
list is to use lower-case letters), correctly mark or number each list
item. Note that alignment of the text following the marks/numbers
can be dealt with as another problem. [Set by Exoterica]
11) DATE: Output the current date in a form determined by an
attribute value in the SGML document. [Set by Exoterica]
12) LINE BREAKING: In the process of converting an SGML
document to some other form (or even SGML), produce output
that has a given number of text characters on each line, not
counting any commands or tags in that other form in the total, and
adjusting the length of each line so that breaks only occur at
"allowed" points (such as at spaces and existing line breaks).
Require that there be no trailing spaces at the points of the breaks
(i.e. none of the preceding lines have trailing spaces).
13) RESOLVE EXTERNAL CROSS REFERENCES: Given two
or more SGML documents each of which contain references to
objects and locations in themselves and other documents in the set,
replace each reference by an identification of the referenced object
or location, the identification having been defined by the
processing of the target, whether it be in the document doing the
referencing or some other. [Set by Exoterica]
14) ICADD TRANSFORMATION: This transformation exercise
gives all of the participants the opportunity to implement (and if
necessary make suggestions for) the mapping techniques which
have been designed to allow any DTD to carry with it the
information needed to be turned into the input (with an instance)
to an automated Braille translation process. This is, admittedly,
quite a complicated exercise by the end, and people may wish to
build processors only for the earlier attributes. [Yuri
Rubinsky/ICADD]
The Solutions:
1) INSTANCE NORMALIZATION: Christophe said that
Balise/Polypus could solve this problem in about three lines of
code (but this was not shown). Ludo devised a simple SGML
Hammer module to handle any starts (and attributes), data content,
and element ends; a couple of other simple, very short modules
were also required to provide a complete solution. Klaus said that
CoST could also cope with these requirements, and summarized
his solution. Andrew said that OmniMark could solve the main
bulk of the problem, but could not provide a solution that would be
truly DTD independent (however this would only require a simple
fix, and it will be corrected in future versions of OmniMark).
Harry said that TagWrite could solve the general problem using
two or three simple rules and its notion of supertokens, however
the solution would probably not be completely DTD independent.
David Sklar suggested that the problems for GLTP will come
when trying to handle empty elements.
2) DICTIONARY INSTANCE SPLIT/MERGE: The panel agreed
that the key to any solution is handling the fact that the application
will need to address several external files. Christophe said that
this problem would be quite simple for Balise. SGML Hammer
cannot do multiple file output, so Ludo suggested that this case
might cause difficulties for the product. CoST could handle the
problem fairly easily, although some thought would need to be
given as to how to make the filenames unique, and making the
filenames out of PCDATA content might also be a bit tricky.
OmniMark would find the problem simple to solve, whilst
TagWrite could provide only a partial solution (the remainder
could be solved via the use of simple WordPerfect/Word macros).
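As an illustration of the split/merge round trip (not any
panellist's solution), here is a minimal Python sketch. It uses
the standard library XML parser as a stand-in for an SGML parser,
and the element names ENTRY and ENTRYFILE, the file naming scheme
and the helper round_trip are all assumptions. Sequence numbers
are used for file names, which sidesteps the uniqueness issue
raised for CoST.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def split_dictionary(source, outdir):
    """Write each ENTRY to its own file; return a skeleton naming the files."""
    root = ET.fromstring(source)
    skeleton = ET.Element(root.tag)
    for n, entry in enumerate(root.findall("ENTRY"), start=1):
        name = f"entry{n:04d}.xml"  # sequence numbers keep names unique
        with open(os.path.join(outdir, name), "w", encoding="utf-8") as f:
            f.write(ET.tostring(entry, encoding="unicode"))
        ET.SubElement(skeleton, "ENTRYFILE", {"name": name})
    return ET.tostring(skeleton, encoding="unicode")


def merge_dictionary(skeleton, indir):
    """Rebuild the instance by reading the files listed in the skeleton."""
    root = ET.fromstring(skeleton)
    merged = ET.Element(root.tag)
    for ref in root.findall("ENTRYFILE"):
        path = os.path.join(indir, ref.get("name"))
        with open(path, encoding="utf-8") as f:
            merged.append(ET.fromstring(f.read()))
    return ET.tostring(merged, encoding="unicode")


def round_trip(source):
    """Split into a temporary directory, then merge back again."""
    with tempfile.TemporaryDirectory() as tmp:
        return merge_dictionary(split_dictionary(source, tmp), tmp)
```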
3) STRUCTURAL TRANSFORMATION & 10) LIST MARK
FORMATTING: These problems were combined, because they
shared fundamental similarities. Balise/Polypus can solve the list
mark problem, however the nesting and un-nesting might be a
difficult area. A Polypus library function could be used to do the
structural transformation. OmniMark could solve the list mark
formatting issue by using pattern matching and text rebuilding,
and solving the problem of structural transformations can also be
done (in each case the code used in the modules was shown).
SGML Hammer (and the Louise scripting language) could cope
with the list mark formatting problem (by using arrays, and acting
on the basis of context). TagWrite can solve the list problem
(although Harry admitted that it entailed using hideous SGML);
the structural transformation problem is quite familiar and solvable
with TagWrite.
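The list mark problem can be sketched as a recursive walk in which
each nested list inherits a style deduced from its parent. The
Python below is only an illustration: the attribute name "marks",
the element name "item" and the style sequence after lower-case
letters are assumptions (the problem statement only says that a
list inside a decimal-numbered list uses lower-case letters).

```python
import xml.etree.ElementTree as ET

# Assumed deduction table: the style a nested list gets from its parent.
NEXT_STYLE = {"decimal": "alpha", "alpha": "bullet", "bullet": "bullet"}


def mark(style, n):
    """Return the mark for the n-th item (1-based) in the given style."""
    if style == "decimal":
        return f"{n}."
    if style == "alpha":
        return f"{chr(ord('a') + n - 1)})"  # sketch: handles 26 items only
    return "*"


def format_list(lst, inherited="decimal", depth=0, out=None):
    """Return indented, marked lines for a list and its nested lists."""
    out = [] if out is None else out
    # An explicit attribute wins; otherwise use the deduced style.
    style = lst.get("marks", inherited)
    for n, item in enumerate(lst.findall("item"), start=1):
        text = (item.text or "").strip()
        out.append("  " * depth + mark(style, n) + " " + text)
        for sub in item.findall("list"):
            format_list(sub, NEXT_STYLE.get(style, "bullet"), depth + 1, out)
    return out


doc = ET.fromstring(
    "<list><item>one<list><item>sub</item></list></item>"
    "<item>two<list marks='bullet'><item>x</item></list></item></list>"
)
```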
4) ID TYPE CHECKING / ID COMPUTATION: Balise can solve
the type checking of ID/IDREF elements - by going through the
document building a data structure of IDs and IDREFs and then
checking that everything matches up and resolves correctly (if not,
it reports an error). The Louise/SGML Hammer solution would
be similar to that adopted by Balise. Using an array-based
approach, it would first read the whole file into the array (to avoid
having to perform any lookahead), then pass through the array to
check all ID/IDREFs resolve. At the end of the process,
Louise/SGML Hammer will output a report identifying any
problems (e.g. any footnotes that have been used but not
referenced).
CoST offers two ways to solve this problem. One is the
Balise/Louise approach. The other uses CoST's "tree mode",
making use of object-oriented techniques to build a parse tree
which it can then use to check ID/IDREFs. This would also be a
two pass process, in which the first pass builds the parse tree and
the second does the checking. OmniMark would also solve the
problem by first constructing associative arrays, which are then
used to check ID/IDREFs. Any errors would be output to an error
log.
TagWrite does not build such data structures, but instead uses
counters to keep track of things within the document. The
counter-based approach can be
used to place IDs in a document that has no IDs, but if some IDs
are present/missing, then TagWrite would not be able to cope (and
it would be necessary to develop some supporting solutions).
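The two-pass, table-driven approach that several panellists
described can be sketched as follows. This is an illustration in
Python using the standard library XML parser as a stand-in for an
SGML parser; the element names, the attribute names "id" and
"refid", and the EXPECTED_TARGET table are assumptions, not taken
from the sample DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: which element type each reference must point at.
EXPECTED_TARGET = {"FIGREF": "FIGURE", "TABREF": "TABLE"}


def check_typed_refs(source):
    """Return a list of messages for unresolved or wrongly typed references."""
    root = ET.fromstring(source)
    # Pass 1: record the element type that owns each ID.
    id_types = {el.get("id"): el.tag for el in root.iter() if el.get("id")}
    # Pass 2: check every reference against the table.
    errors = []
    for el in root.iter():
        want = EXPECTED_TARGET.get(el.tag)
        if want is None:
            continue  # not a reference element
        target = el.get("refid")
        got = id_types.get(target)
        if got is None:
            errors.append(f"{el.tag} refers to unknown ID {target!r}")
        elif got != want:
            errors.append(
                f"{el.tag} refers to {target!r}, which is a {got}, not a {want}"
            )
    return errors
```

Collecting messages rather than raising an exception mirrors the
report-at-the-end behaviour described for Louise/SGML Hammer.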
5) FLOATING ELEMENTS CONSISTENCY CHECKING: [I do
not appear to have any notes on the answers to this problem, other
than what appeared in subsequent handouts. I believe that most of
the panel considered that the solution to this problem would be
essentially similar in technique to that described for problem
number 4.]
6) CALS TO AAP TABLE CONVERSION: Christophe said that
AIS did not really expect anyone to be able to provide a simple
solution to this common problem, as the real issue is one of tree
building.
Balise/Polypus solves the problem by using the Polypus library.
Ludo said that although the problem appears complicated, it is not
actually very difficult because all the data stays in the same order
(therefore, SGML Hammer could provide a linear rather than a
tree-based solution). Klaus said that he did not even attempt to
provide a solution using CoST, because he was not familiar with
the DTDs and it would have taken him too long to understand
them. Andrew offered a skeletal OmniMark solution, which relied
on the use of OmniMark's macro facilities rather than a wholly
SGML-based approach.
7) DUPLICATION OF DESTINATION TEXT AT SOURCE OF
XREF & 13) RESOLVE EXTERNAL CROSS REFERENCES:
In setting his problem, David Sklar (of EBT) wanted to see if the
Xrefs could be resolved in a single pass. Exoterica also wanted to
see if it could be done in a single (or more) pass. (Where "pass"
means taking in the data stream only once.)
OmniMark can solve the second problem using two passes (by
building an intermediate file) and then working from that to resolve
the external cross references. The first problem can be solved
using a single pass.
The solutions to both problems offered by Balise/Polypus make
use of several functions defined in the Polypus library. Essentially
the problems are solved by the building and manipulation of a
parse tree.
SGML Hammer, like OmniMark, also has to do multiple passes to
solve the second problem. The first problem could be solved in a
single pass if it has been written using entity references - where
the entity references will be resolved during the parsing process[?]
- otherwise this would also require multiple passes.
TagWrite - the duplication issue is easy to solve using TagWrite
supertokens. The second problem is non-trivial, and TagWrite
could not resolve it.
Using CoST, the first problem can be solved in a single "pass",
because it can build and subsequently interrogate a parse tree.
Klaus suggested that the second problem is really one of co-
ordination, which could be resolved in various ways with CoST.
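A tree-based answer to the first of these problems (duplicating
the destination title at the source of the reference) might look
like the Python sketch below. It is not one of the solutions
presented; the attribute names "id" and "refid" and the "???"
placeholder are assumptions, and the standard library XML parser
stands in for an SGML parser. Because the whole tree is available,
forward references need no special handling.

```python
import xml.etree.ElementTree as ET


def expand_sectrefs(source):
    """Fill each SECTREF with the referenced title; return (output, problems)."""
    root = ET.fromstring(source)
    # Collect the flattened title text of every SECTION, keyed by ID.
    titles = {}
    for section in root.iter("SECTION"):
        title = section.find("TITLE")
        if title is not None:
            # itertext() also picks up text inside EMPH subelements.
            text = "".join(title.itertext())
            titles.setdefault(section.get("id"), []).append(text)
    problems = []
    for ref in root.iter("SECTREF"):
        target = ref.get("refid")
        found = titles.get(target, [])
        if len(found) == 1:
            ref.text = f"SECTREF: see '{found[0]}'"
        else:
            # Non-fatal: note the problem and keep going.
            problems.append(
                f"SECTREF to {target!r} matches {len(found)} sections"
            )
            ref.text = "SECTREF: see '???'"
    return ET.tostring(root, encoding="unicode"), problems
```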
8) EXTRACTION/SORTING: David Sklar said that the GLTP
solution to this problem is very elegant and simple, but was unable
to demonstrate this for the reasons mentioned above.
Balise/Polypus can solve this problem using the Polypus library
(i.e. using trees). Christophe showed some code, including the
function called after the document tree has been built which is
required to perform some of the global actions.
This problem is not very difficult to solve with SGML Hammer.
It collects the subtrees during the parse phase, and handles them
accordingly at the end. Ludo suggested that interesting solutions
could also be created by using different types of DTD for
authoring and processing the document.
CoST handles the problem by processing the figures in CoST's
tree mode (ensuring that they will not be deleted), then extracting
them from the tree and moving them to the appropriate place.
OmniMark also makes this problem simple to solve. It puts the
figures in an associative array at the end of the document, which
can then be subsequently sorted.
TagWrite could not do this because it does not store associative
arrays (nor build parse trees). Harry suggested that this was really
a question of how to use SGML not how to create/transform
SGML, and therefore TagWrite was never designed to handle this
kind of problem.
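For comparison, the collect-then-sort strategy can be sketched in
Python with the standard library XML parser standing in for an
SGML parser. The element names BODY and ENDMATTER directly under
the root are assumptions about the omitted DTD. Whole subtrees are
copied, so no knowledge of the HOTSPOT content model is needed,
and Python's string ordering by code point matches ISO 8859-1
order.

```python
import copy
import xml.etree.ElementTree as ET


def add_figreview(source):
    """Append a FIGREVIEW of captioned figures, sorted by caption."""
    root = ET.fromstring(source)
    body = root.find("BODY")
    endmatter = root.find("ENDMATTER")
    # Uncaptioned figures are left out of the review entirely.
    captioned = [f for f in body.iter("FIGURE") if f.get("caption") is not None]
    if captioned:  # FIGREVIEW requires at least one FIGURE
        review = ET.SubElement(endmatter, "FIGREVIEW")
        for fig in sorted(captioned, key=lambda f: f.get("caption")):
            dup = copy.deepcopy(fig)  # duplicates the entire subtree
            dup.tail = None           # drop text that followed the original
            review.append(dup)
    return ET.tostring(root, encoding="unicode")
```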
9) ROW COUNTING: All the tools seemed able to cope with this
problem (although a TagWrite solution was not offered as Harry
was speaking elsewhere). OmniMark solves the problem quite
easily in one pass of the document (the code of the solution was
shown). Balise can also solve the problem in a single pass
(without recourse to using the Polypus library), using an approach
similar in spirit to that offered for the ID/IDREF checking
problem. SGML Hammer would simply store the relevant
information in an array, then output it as appropriate. CoST would
use a two pass solution whilst in tree mode.
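The store-then-output approach can be illustrated with a short
Python sketch. The element names ROW and CELL and the ".TABLE"
command are invented for the example (the report does not name a
target typesetting language), and the standard library XML parser
stands in for an SGML parser.

```python
import xml.etree.ElementTree as ET


def table_to_text(source):
    """Convert a table to tab-separated lines, with the counts placed first."""
    table = ET.fromstring(source)
    # Buffer every row so the totals are known before any output is written.
    rows = [
        [(cell.text or "").strip() for cell in row.findall("CELL")]
        for row in table.findall("ROW")
    ]
    ncols = max((len(row) for row in rows), default=0)
    lines = [f".TABLE rows={len(rows)} cols={ncols}"]  # made-up command
    lines += ["\t".join(row) for row in rows]
    return "\n".join(lines)
```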
11) DATE: OmniMark is able to solve this in a single pass, taking
advantage of OmniMark's built-in function to get the date. Balise
would use a system call function, the results of which would be
processed accordingly. SGML Hammer follows the same
approach as Balise, but also uses a string processing function. For
CoST the problem lies in parsing the attribute value; obtaining the
date information and handling the formatting can be done
internally.
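In a general-purpose language the same task reduces to a lookup
from the attribute value to a format string. The Python sketch
below is only illustrative; the style names in FORMATS are
assumptions, as the problem does not list the permitted attribute
values.

```python
from datetime import date

# Assumed attribute values and the output layout each one selects.
FORMATS = {"iso": "%Y-%m-%d", "us": "%m/%d/%Y", "long": "%d %B %Y"}


def render_date(style, today=None):
    """Format today's date (or the given date) in the requested style."""
    today = today or date.today()
    # An unknown style raises KeyError, which the caller can report.
    return today.strftime(FORMATS[style])
```

Passing the date in explicitly, as the optional second argument
allows, makes the function easy to test.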
14) ICADD TRANSFORMATION: Yuri suggested that this is
quite a complicated problem, but this is primarily due to the
difficulties that arise from handling the multiplicity of possible
inputs in the specification attributes.
A solution is possible using Balise, but Christophe acknowledged
that handling all the possible things that could occur in the
attributes would be quite hard. Ludo suggested that this problem
is a good demonstration of the use of architectural forms. Since
processing is attached to specific attributes defined in the
architectural forms, SGML Hammer would be well-suited to
handling this kind of thing. Klaus said that CoST, like SGML
Hammer, was very proficient at handling architectural forms and
so a solution would be possible. Andrew stated that a real-world
solution (using OmniMark) to the problems of handling ICADD
transformations was currently being developed at California
University.
At the end of the session, all the panel agreed that they would post
their solutions in electronic form to the newsgroup
comp.text.sgml.
17. "The Scribner Writers Series on CD-ROM: From a
Great Pile of Paper to SGML and Hypertext on a
Platter" - Harry I. Summerfield (Zandar
Corporation), Anna Sabasteanski (Macmillan New Media)
Anna Sabasteanski works for the electronic publishing division of
Macmillan New Media, which currently publishes about 50
electronic titles, mostly to do with medicine. The Scribner
Writers Series represents a mix of writers in English - American,
British etc. - who are also classed into numerous genres (such as
Children's authors etc.). The decision of which authors to put on
the CD-ROM was based on a survey of the texts most used in
schools (about 550 authors are represented on the CD-ROM).
Anna talked generally about the development process. Many of
the texts are only available in hot-metal/printed text, rather than in
electronic form - therefore the initial data conversion costs were
quite high. The developers had to decide how to differentiate and
provide added benefits to encourage use of the electronic form of a
text over the existing paper version. Copies of the paper books
were physically broken up, so that the pages could be easily
scanned.
A specialist company was used to handle the conversion from
scanned pages to SGML tagged and validated files. The company
guaranteed an accuracy rate of 99.5% (which was equivalent to
about two errors per page). Markup errors were fairly easy to find
and correct (using SGML-aware validation tools), although
correcting these required human editorial intervention. The
markup conformed to the AAP Book DTD, with a few
corrections/amendments (e.g. to allow for extra entities necessary
for ancient texts). Attributes were also added to indicate genre,
language, nationality, race, sex etc.
One of the main aims of the project was to use SGML markup to
tag documents for inclusion in what is effectively a database (i.e.
the header information of each text is used to facilitate organizing
and searching). Microsoft's Multimedia Viewer was used to
present the information, with the SGML tagged files converted
into RTF (the format recognized by Multimedia Viewer) by
Zandar Corporation.
A number of problems were encountered when developing the
CD-ROM, for example handling certain special characters (which
could not be represented in RTF), and deciding how to represent
links to other texts, in whole or in part. A particular headache
arose when converting the original bibliographic sections - the
designers of the CD-ROM version wanted all bibliographies to
follow the same conventions, but senior editors at Macmillan also
imposed the requirement that the bibliographies could be re-
presented in the style used in the original source text. A final
quality control procedure was necessary to check the end product,
from the point of view of both software and content. The testers
found several bugs in the beta version of Multimedia Viewer
software which took some time to get corrected.
Harry Summerfield then described how his company had
approached the project. As described above, Zandar were
contracted to carry out the transformation from scanned, then
hand-edited SGML files, into RTF. However, they first carried
out an important feasibility study to ensure that they were capable
of doing the job and meeting Macmillan's deadlines.
When they began to design the transformation process, Zandar did
not just want to see the DTD being used, but also examples of live
documents containing real tags. Zandar was aware that DTDs
change, and that the tagging actually used in files may or may not
always match up with the current version of the DTD.
The conversion had to handle 50Mb of data in 510 files. The
actual conversion process was done in five passes, because this
approach was cheaper to develop. The first pass had to find any
(SGML) invalid characters, and convert them to SGML entities.
The second pass was to make the first letter of the text of every
file a dropped cap (in RTF) - this was made possible by having
used special SGML markup. The third pass was to do the
conversion of all special characters into RTF. The last two passes
[?] had to strip out SGML tagging in the main texts and
bibliographies, and format them appropriately for RTF (converting
the different kinds of bibliographic entries to a uniform structure
was tricky, and intentionally omitted end-tags in the entries made
things even harder).
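The multi-pass design can be pictured as a chain of small
functions, each taking text and returning text. The Python sketch
below illustrates only the first pass (replacing characters that
cannot appear directly with entity references) and is not Zandar's
code. The three-entry ENTITY_NAMES table uses real ISO Latin 1
entity names but is deliberately incomplete, and falling back to a
numeric character reference is an assumption.

```python
# Deliberately tiny table; a real one would cover the full entity sets.
ENTITY_NAMES = {"é": "eacute", "ü": "uuml", "ñ": "ntilde"}


def to_entities(text):
    """Replace non-ASCII characters with entity references."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ENTITY_NAMES:
            out.append(f"&{ENTITY_NAMES[ch]};")
        else:
            out.append(f"&#{ord(ch)};")  # assumed fallback
    return "".join(out)


def run_passes(text, passes):
    """Apply each pass in order, feeding the output of one into the next."""
    for single_pass in passes:
        text = single_pass(text)
    return text
```

Keeping each pass independent is what makes such a pipeline cheap
to develop, at the cost of reading the data several times.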
The entire conversion process (excluding building hyperlinks)
took six hours of cpu time on a 486 PC. As part of the project,
Zandar developed a separate tool called HyperTagWrite to handle
the creation of the hyperlinking markup, which could be converted
in a subsequent pass into the format used by Multimedia Viewer.
New writers will be added into future versions of the CD-ROM.
Changes will be made in the SGML database, from which the RTF
(or whatever future target formats might be required) can be
generated. Using an SGML-based approach should greatly
facilitate the production of future editions.
In the subsequent questions and answers session, a number of
points were raised. Proper names in the original texts were
identified and processed on the basis of the punctuation in the data
content. The conversion process was relatively cheap in terms of
man months (i.e. Macmillan put one person on the project full-
time for only three months). The quality control checking took
seven people six months, and every hyperlink was hand-tested.
When proofing uncovered errors, all corrections were made to the
SGML source files. The retrieval engine indexes every word, but
they also built several specialist indexes on the basis of markup
(e.g. index of authors, titles, genre etc.).
18. "The Attachment of Processing Information to SGML
Data in Large Systems" - Lloyd Harding (Mead Data Central)
Mead Data Central collects information from over 2000 sources,
but has no control over the received format; they currently have
about 200 million documents available on-line.
Conversion/handling on this scale has to be automated as much as
possible, bearing in mind that the target is to produce an on-line
information system and that they are not concerned about delivery
on paper or other media.
Lloyd compared the handling of (electronic) information to the
process of industrial manufacturing - especially in its infancy.
Standardized solutions have resolved many of the problems that
faced early manufacturing industry, but can the same be done for
information fabrication systems? He proposed a new paradigm in
which the author marks up the information content, an
"Information Fabrication System" adds value useful for the target
system (publishing, linguistic analysis etc.), and the target systems
use the information. The new middle process extends the
traditional paradigm.
Traditional SGML techniques may provide solutions within this
new Information Fabrication paradigm. Markup standardization,
that is agreed common DTD development along the lines of the
work of the AAP, TEI and existing efforts at Mead Data Central,
might help to provide markup relevant to specific target
applications (publishing etc.) but will only ever be a partial
solution. The use of architectural forms may also provide some
benefits, but adding them to existing DTDs requires skill. Link
Process Definitions are another possibility, but they also require
skill, and they are not supported by many existing tools. FOSIs
are fairly straightforward to use but may not be generalizable
enough to use as part of an Information Fabrication process.
DSSSL's Association Specification and Output DTD, combined
with GLTP, appear to offer the greatest promise of a solution, but
implementing them will require programmer expertise, and
DSSSL is still "shimmerware". As none of these traditional
approaches really provide a complete and/or ideal solution, Lloyd
proposed his own Information Fabrication System Architecture.
For the Information Fabrication Paradigm, the most viable concept
is a generalization of the FOSI to accommodate any type of
fabrication process. The Architecture required to do this consists
of two components: the Application Interface DTD (AID), and the
Processing Output Specification Instance (POSI). AID provides
the syntax for specifying the attachment and association
information. POSI specifies the attachment and association
information for a specific application and raw material. Lloyd
then talked through two examples of the steps required to develop
and use an application using this Architecture.
The Architecture-based approach solves the attachment and
association problems, alleviates some of the expense issues, and
reduces the skill set requirements involved. The goal of
developing an Information Fabrication System is to free the author
from target system constraints, thereby permitting him/her to focus
on content (e.g. authors should not have to worry too much about
authoring links). This requires Information Fabrication Systems
that can accept any author's creation and cost effectively prepare
that creation for a target system. AIDs and POSIs can provide the
basic underpinnings for such Information Fabrication Systems.
During questions and answers, Lloyd said that his was not
necessarily the ideal/only/whole solution, but he would like to see
people talking about "Information Fabrication Assembly Lines" -
where as much as possible of the process of generating marked up
information for target systems could be automated. His approach
has not yet been adopted at Mead Data Central, but it will be.
19. "ISO 12083 Announcement" - Beth Micksch (Intergraph Corp.)
This presentation was intended to provide a brief history and
update on ISO 12083 "The Electronic Manuscript Preparation and
Markup" Standard. Formerly an ANSI standard (Z39.59, but
generally referred to as the "AAP"), ISO 12083 is now being fast-
tracked through the ISO standards process. The first ballot on the
Draft International Standard (DIS) was in November 1992, and the
voting went as follows: 14 positive, 5 negative, and one
abstention.
Eric van Herwijnen was asked to be the editor and to set up a
small technical committee. Eric was required to resolve all the
comments received on the DIS into the Standard, as fast-tracking
means that a second vote would not be needed before the Standard
is approved.
The Standard is intended to facilitate the creation and interchange
of books, articles and serials in electronic form. It is meant to
provide a basic toolkit which users can pick up and modify
according to their needs. The Standard is meant for use by
authors, publishers, libraries, library users, and database vendors.
Use of the Standard is indicated by its public identifier (e.g. ISO
12083:1993//DTD Book//EN - for the Book DTD). Elements or
entity references may be removed or modified as needed. Users
can declare their own elements in external parameter entities, and
the parameter entities defined in ISO 12083 can be overridden to
modify order and occurrence or to specify user defined
elements/attributes; alias elements are not permitted. The
Standard allows SHORTTAG and OMITTAG, although the
revised usage examples will be fully normalized. The application
must conform to ISO 8879:1986.
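To make the structure of such a public identifier concrete, here
is a minimal Python sketch that splits one into its parts. The
function name and the dictionary keys are assumptions for this
example, and the sketch only handles the simple three-part form
(owner, text class plus description, language), with or without
a leading "+//" or "-//" owner prefix.

```python
def parse_public_identifier(fpi):
    """Split a formal public identifier into its main components."""
    parts = fpi.split("//")
    # A leading "+" or "-" marks a registered or unregistered owner;
    # fold it back into the owner field.
    if parts[0] in ("+", "-") and len(parts) > 1:
        parts = [parts[0] + "//" + parts[1]] + parts[2:]
    if len(parts) != 3:
        raise ValueError("unexpected public identifier: %r" % fpi)
    owner, text, language = parts
    # The text identifier is a class keyword, a space, then a description.
    public_class, _, description = text.partition(" ")
    return {
        "owner": owner,
        "class": public_class,
        "description": description,
        "language": language,
    }
```

Applied to the Book DTD identifier quoted above, this yields the
owner "ISO 12083:1993", the class "DTD", the description "Book"
and the language "EN".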
ISO 12083 contains four DTDs: Book, Article, Serial, and
Mathematics. It has a very large Annex (A) which comments on
the DTDs and covers such things as design philosophy, structure
descriptions, special characters, electronic review, mathematics,
tables, braille/large print/computer voice facilities, and HyTime
facilities. Annex B contains descriptions of the elements, and
indicates how all the elements relate to one another. Annex C
contains examples, some of which are normalized versions of the
examples which first appeared in the ANSI standard.
Numerous changes have been made to ANSI Z39.59. Element
names have been changed and additions made, to make them less
cryptic than in the original; there are new elements for things like
poetry. The Mathematics DTD is based on the work of the AAP
update committee (which has met at a number of SGML
conferences, and corresponded over the internet). ISO 12083
currently offers minimal HyTime capability, but this should be
enough to get people started. The Standard also supports the use
of ICADD's Structured Document Access (SDA) attributes, to
facilitate mapping to braille, large-print or voice-synthesizing
systems. The use of SHORTREFs is deprecated but still possible.
An alphabet attribute has been added to title, p (paragraph) and q
(quotes) - to allow the use of special characters in these
elements. Electronic review is also supported, and this was
achieved by incorporating the CALS Electronic review
declaration subset. The names and descriptions of the elements
and attributes are now more explicit and meaningful to make the
DTDs more "user-friendly"; the number of illustrative examples
has also increased.
The Standard will be published very shortly. It will be available
from ANSI and NISO (and reserving a copy before January 15th
can save $10). NISO will email out electronic copies of the DTDs
to anyone that wants them. To get a copy contact:
National Information Standards Organization
P.O. Box 1056
Bethesda, MD 20827
Phone: (301) 975-2814
Fax: (301) 869-8071
Email: niso@ehn.nist.gov
[This information probably only applies to people in the United
States. Elsewhere, people should first try contacting their own
National standards body].
Beth closed by remarking that the second edition of Eric van
Herwijnen's book "Practical SGML" has been produced using the
ISO 12083 Standard (including the HyTime capabilities), and
things seem to have worked pretty well. The indications from
other tests which are currently underway have been equally
positive.
20. Reports from the SGML Open Technical Committees
Paul Grosso (Chair of SGML Open's Technical Committee),
reported that they had not yet met - although the first meeting would
be held on the Friday immediately after this conference. On this
one occasion, anyone who wished to attend would be allowed to
do so; future attendance at such meetings would be restricted to
people connected with SGML Open.
Paul suggested that the general role of the Technical Committees
will be to look at interoperability issues and make sure that SGML
solutions work (i.e. that SGML applications can successfully
interact). The main Technical Committee will form specifically-
tasked/short-lived sub-committees as necessary. The Technical
Committee will need to get input from all the SGML Open
member companies, and no-one should expect to be able to "piggy
back" on the efforts of a few dedicated companies or experts.
Particular problems which may be considered by the Technical
Committee include things like entity management, how to package
together and exchange SGML files (cf. SDIF, SBENTO etc.),
handling tables and math (where many issues go beyond the area
covered by ISO 8879), HyTime issues, and so on.
21. "A Technical Look at Authoring in SGML" - Paul
Grosso (ArborText)
There are a number of ways of authoring in SGML. Approaches
include using a standard "ASCII" editor to author both the text and
the markup, using a conversion program to add SGML markup to
existing content, using an SGML-aware (non "ASCII") editor that
provides the proper markup during the initial authoring process,
and recreating an SGML document from a repository of existing
SGML or SGML-type documents.
When discussing authoring in SGML, it is useful to distinguish the
roles of the parser and the editor. A parser turns an ASCII
character stream into tokenized SGML constructs (it also expands
any markup "minimization"). However, the parser also leaves
many things to the application for checking. A non-ASCII SGML
editor is such an application, and only it can associate meaning to
the SGML constructs returned by the parser. Such an editor is not
just an interface to the parser - it is an application optimized to
author structured documents that represent valid SGML document
instances. A non-ASCII SGML editor should provide an interface
that transcends the syntactic details; it should represent the
document using internal data structures that are isomorphic to the
basic constructs of SGML.
There are three levels of "understanding" an SGML document.
The lowest is the recognition of SGML syntax (e.g. recognizing
the individual characters in the string </para> as an end tag for a
particular element called "para"). The middle level entails
understanding and providing an interface to SGML semantics, for
example what it means to be an element, an attribute, an entity, or
a marked section, and what it means to have a content model, a
declared value, a default etc. The top-most level of understanding
performs the attachment of application-specific semantics, for
example in a composition system it determines how to format a
paragraph element.
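The lowest of these levels can be illustrated with a short Python
sketch that recognizes start tags, end tags and data in a
character stream. This is only an assumption-laden toy, not a
parser: it assumes the reference concrete syntax, discards
attributes, and ignores minimization, comments, entities and
marked sections. The names TAG and tokenize are invented for the
example.

```python
import re

# Matches a start tag such as <para id="p1"> or an end tag such as </para>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.\-]*)([^<>]*)>")


def tokenize(text):
    """Yield (kind, value) pairs where kind is 'start', 'end' or 'data'."""
    pos = 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            # Characters between two tags are passed through as data.
            yield ("data", text[pos:m.start()])
        kind = "end" if m.group(1) else "start"
        yield (kind, m.group(2))   # attribute text in group 3 is dropped
        pos = m.end()
    if pos < len(text):
        yield ("data", text[pos:])
```

Everything above this level (content models, declared values,
formatting) needs knowledge that a purely syntactic scan like
this does not have.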
All non-ASCII SGML editors must convert the ASCII SGML
input into an internal representation, but an SGML editor that
inherently understands SGML semantics can provide much greater
benefit to the end user than an editor - even a structured one -
that "imports" and "exports" SGML by converting to an alternate
view that does not maintain a real-time comprehension of and
compliance to SGML semantics. When it comes to measuring an
SGML editor's performance, the parsing component of an SGML
system is defined in Annex G of ISO 8879. Conformance of an
editor application is often measured by examining what it can
import and export; it should be able to read/output a wide range of
valid SGML. However, real-time context checking is also
important in an SGML-aware editing system. The system should
guide the author whilst creating a valid SGML document, as
continual validation and checking will make life easier for the
user. However, it must be remembered that during the authoring
process it is possible to have "incomplete" as opposed to "invalid"
documents - the incomplete document contains nothing that is
actually wrong, it just does not yet contain everything that is
required.
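The distinction between "incomplete" and "invalid" can be shown
with a small Python sketch. It assumes a hypothetical content
model (title, para+) for some container element, hand-encoded as
a state machine; the element names, the table and the function
name are all inventions for this example rather than anything
shown in the talk.

```python
# Hypothetical content model (title, para+), encoded as
# state -> {child element name -> next state}.
TRANSITIONS = {
    "start": {"title": "after_title"},
    "after_title": {"para": "has_para"},
    "has_para": {"para": "has_para"},
}
ACCEPTING = {"has_para"}   # states in which the content model is satisfied


def classify(children):
    """Return 'valid', 'incomplete' or 'invalid' for a list of child names."""
    state = "start"
    for name in children:
        next_state = TRANSITIONS.get(state, {}).get(name)
        if next_state is None:
            return "invalid"       # this child is not allowed at this point
        state = next_state
    # Nothing wrong so far; the question is whether anything is still missing.
    return "valid" if state in ACCEPTING else "incomplete"
```

A container holding only a title is classified as incomplete
(a para is still required), whereas one that begins with a para
is invalid, since no later edit that merely adds content at the
end can repair it.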
The SGML Conformance Standard uses the concept of ESIS
(Element Structure Information Set) to define conformance for a
parser. However, the definition of ESIS is not inclusive enough
to describe all that an SGML editor must do (and this has led to
the notion of "ESIS+"). ESIS+ suggests that things of importance
to an SGML editor could include comments, the existence of
ignored marked sections, and the existence and name of internal
general text entities. An SGML editor's view of SGML is really
dependent on its view of ESIS+. Therefore, an SGML editor
could/should be evaluated on the basis of what it recognizes as the
scope of ESIS and ESIS+.
The interfaces to complex constructs can cause problems for
SGML editors; for example handling such things as marked
sections should be done properly, even when they are specified
using a parameter entity reference. The editor should allow for
marked sections with unbalanced markup (i.e. which include the
start tag of an element but not its corresponding end tag); it should
also allow the synchronous changing of the values of parameter
entities so that the final result is valid even though an intermediate
state may be invalid.
Subdocs are another complex structure to be considered. A
subdoc is basically an external SGML entity with its own DTD.
The authoring interface to a subdoc can be similar to that for a
regular external SGML entity but there are additional
issues to be considered, such as the different ID name/entity name
space, a need for a potentially different presentation style for the
subdoc, to say nothing of what it might mean to actually compose
a document containing a subdoc.
A third type of complex structure involves the use of the
CONCUR feature. Using concur allows a document instance to
contain completely independent and orthogonal structural
hierarchies in the same ASCII SGML file. At any given time, the
document must be parsed according to the currently active DTD.
When a different DTD is made active, it is equivalent to reading in a
different document from a different document type. Although the
character data of the different views of the document remains the
same in both cases, some thought needs to be given as to which tag
sets should be displayed to the user - only the active DTD, or one
or more of the others that apply.
22. "Implementing an Interactive Electronic Technical
Manual" - Geoffrey von Limbach (InfoAccess Inc.)
There are two specifications which relate to IETMs. The first is
GCSFUI (General Content, Style, Format, and User Interaction
Requirement - MIL-M-87268), which specifies the on screen
layout and behaviour of an IETM (e.g. how an IETM should handle
warnings - their duration, iconization etc.) The second is
IETMDB (MIL-M-87269), but although this mentions "Data Base"
in the title, it is primarily a set of architectural forms or templates
for IETMDB compliant DTDs. IETMDB also specifies a linking
mechanism for sharing content between document instances, based
on the use of HyTime ILINKs.
Several classes of ETMs and IETMs have been identified in a
recent paper by Eric Jorgeson ("Classes of Automated TM
Systems" Eric L Jorgeson, Carderock Division, Naval Surface
Warfare Center, 11 August 1993). In summary, these are as
follows:
Class 1: stored page images (+ display mechanism and
access via an index)
Class 2: as 1, but adds hypertext links
Class 3: real IETMs (display conforms to GCSFUI, file is
tagged in accordance with IETMDB)
Class 4: as 3, but authored with a relational or Object-
oriented underlying database
Class 5: as 4, but the display is integrated with other tools
(such as an expert system to assist diagnostics).
Geoffrey then described the implementation of a prototype IETM
at the David Taylor Research Center (DTRC). The DTRC
prototype was implemented using Guide Professional Publisher
(GPP). GPP includes a scripting language which enabled the flow
control required by IETMDB as well as a flexible user interface
which can be adapted to meet the requirements of GCSFUI. The
DTRC provided a DTD and document instance derived from the
architectural forms specified in IETMDB. DTRC also specified
the screen layout beyond the requirements of GCSFUI. Geoffrey
then showed an example of some sample warning source text and
markup.
It became clear during the DTRC project that IETM production
requires flexible software. The user interface must be adaptable to
user requirements. The DTD involved does not remain constant,
and is likely to be frequently revised. There is also a need for
tools which can handle things like HyTime's ILINKs. Geoffrey
recommended that anyone working on a similar project should try
to use inherently flexible tools (such as an adaptable user
interface, and a good scripting language). Other tools which offer
an API which can be called from a development language such as C
or C++ are also worth considering, although they can lead to
higher development costs and greater overheads as requirements
change.
23. "The Conversion of Legacy Technical Documents into
Interactive Electronic Technical Manuals: A NAVAIR
Phase II SBIR Status Report" - Timothy Billington,
Robert F. Fye (Aquidneck Management Associates, Ltd.)
[This presentation followed on closely from the previous one. It
involved a large number of slides (all of which are included in the
proceedings), so I shall only attempt to summarize the main
points.]
The Navy's ETM strategy depended upon a comparison of the
Class 2 and Class 4 ETMs/IETMs outlined above. Class 2 ETMs
are typically linear/sequential in nature. An SGML instance and
document DTD are fed into an indexing tool, and the resulting
files are fed into an ETM SGML and graphics browser (with style
declarations being applied to control formatted display). In Class
4 IETMs, the logic sequence is much more complex - since the
underlying SGML tagged information has to be displayed on the
basis of interactive inputs from the user (e.g. branching on the
basis of user responses to dialogue boxes).
Aquidneck Management Associates were awarded a second phase
Small Business Innovative Research (SBIR) contract to develop
processes and procedures for the transition of legacy, paper-based
NAVAIR Work Packages into Class 4 IETMs in accordance with
Tri-Service specifications.
Timothy described the migration process. This involves scanning
massive amounts of legacy data, and particular attention needs to
be given to potential problem areas - such as how to mark up
tables to support interactive decision making. Migration has to be
a phased process, with increasingly sophisticated markup being
added at each phase.
Timothy talked briefly about data enhancement and authoring, as
well as data storage and maintenance. He stressed that the
subsequent data quality assurance process is very important to
users, and should not be neglected or played down. He identified
some of the features of an IETM presentation system (e.g. tracking
user interactions, setting and clearing logic states, navigating to
the next logical node etc.), and showed some diagrams illustrating
the operation of a frame oriented IETM.
Looking to the future of IETMs, Timothy said that there is a need
to find a cost effective automated means of converting paper-
based legacy data to Class 4 SGML-based IETMs; this is a pre-
requisite for the widespread implementation of IETM technology.
He noted with regret that this is something of a circular argument,
since the IETMs will not appear without the cost-effective
conversion technologies, but they will not be developed until there
is sufficient demand for IETMs.
24. New Product Announcements and Product Table Top
Demonstrations
[These announcements were made in quick succession, so I
apologize in advance for anything I missed or mis-heard.]
Xyvision Publishing Systems announced that they have built an
SGML publishing solution around their document management
system (and also around FrameMaker and Ventura).
Incontext announced the release of their Windows 3.1 based
SGML editor (which uses Excel to handle tables).
Electronic Book Technologies (EBT) announced that DynaText
now has a new multibyte core display language, which means that
it can display Asian languages (e.g. Kanji). The Rainbow DTD
being shown at the Poster Sessions will be made publicly available
to facilitate translations from word processors to SGML.
SoftQuad announced a new version of Application Builder which
allows Author/Editor to be used as an interface to document
management systems. They would also be showing the latest
version of Author/Editor (v.3.0).
Recording for the Blind announced the creation of their Etags
product, to assist the production of electronic texts that can be
made accessible to the print-disabled.
OpenText Corporation announced the creation of new client/server
extensions to support easy combination of hardware. Their
product has also been extended to support multibyte char sets (e.g.
Kanji).
Datalogics announced that their WriterStation P/M product has
now been ported to Windows 3.1.
Schaeffer [?] consultants announced that they would be showing
some of the integrations they have done, (based on OpenText), to
facilitate data management.
Grif SA announced that they would be showing GATE [tools to
integrate Grif into your system?], Grif SGML Editor for Windows,
and CAT (Corporate Authoring Tool) for authoring in BookMaster
(GML to SGML).
Oracle announced the release of OracleBook, an online
multimedia document delivery system. Version 2 will be SGML-
aware, and will be demonstrated at this conference.
Texcel announced the release of Information Manager, a package
for building document management/collaborative authoring SGML
systems.
Tachyon Data Services announced that they would be showing the
customizable Tagger software that they have developed to convert
files to SGML.
Synex Information [?] announced the release of SGML Darc, an
SGML-based document management and archiving system for
Windows.
ArborText announced the recent release of version 5 of the
ADEPT Series of software (now including a FOSI editor). They
were also demonstrating Powerpaste (a set of conversion tools),
and the Windows version of the ADEPT SGML Editor, which is
due to be released in the second quarter of next year.
Interleaf announced the recent release of the Interleaf 5 <SGML>
Editor, and the Interleaf 5 <SGML> Toolkit to develop
translations.
Exoterica announced that they would be demonstrating a new
release of OmniMark for use on Macintosh machines.
Passage systems announced the release of PassagePro, a document
management and production system. Currently available on SGI
machines, and soon on Suns, they hope to have it running under
Windows by next year.
Frame Technology announced that they would be demonstrating
FrameBuilder, and announcing/demonstrating their SGML Toolkit
(which facilitates mapping between SGML and Frame's
WYSIWYG environment).
Zandar Corporation announced that they would be demonstrating
the latest version of their TagWrite conversion tools (currently at
v.3.1).
The following companies demonstrated/exhibited their products
[this list is taken from the proceedings, so some late entries may
have been omitted]:
ArborText, Inc.; AIS/Berger-Levrault; CTMG/Officesmiths; Data
Conversion Laboratory; Datalogics Inc.; Electronic Book
Technologies; Exoterica Corporation; InContext Corporation;
InfoAccess; Information Strategies Inc.; Information Dimensions
Inc.; Intergraph Corporation; ISSC Publishing Consulting;
Microstar Software Ltd; Ntergaid; Open Text Corporation;
Recording for the Blind; Saztec International Inc.; SoftQuad Inc.;
STEP Sturtz Electronic Publishing GmbH; Synex Information;
Tachyon Inc.; Texcel; WordPerfect Corporation; Xyvision Inc.;
Zandar Corporation.
25. Poster Session
As above
26. "Implementing SGML Structures in the Real World"
- Tim Bray (Open Text Corp.)
Tim began by remarking that the number of people attending the
conference indicates that there is a vast amount of SGML-tagged
information being created, but where is it all being put? Some of
the possible technologies and systems have been shown at the
vendor demonstrations at this conference.
At SGML'92, Neil Shapiro said "I can model SGML with a slight
extension to SFQL and store SGML in a relational database". In
response, Charles Goldfarb said "SGML should be stored in a flat
file in a native form". This presentation looked at some of the
different possible approaches.
Computer systems, at the operating system level, have an extremely
linear view of the world (i.e. they see everything as a row of
bytes). The application program, which actually uses files, has a
different view of the world; it sees SGML etc. as a sequence of
characters, although it still goes through the file in a linear fashion.
However, you might want to jump right into the middle of an
SGML file (for example to pick up a particular entity), accessing
information in the same way that humans do.
The design of a system that needs to access SGML-tagged
information in this very direct way, must incorporate a number of
goals. The design must be open, that is it must provide full SGML
text to any application (with or without entity management? -
Tim commented that it is good to see that relevant PD software
tools will soon appear, such as the POEM software announced
earlier in this conference). The design must make information
highly accessible, retrievable on the basis of structure, content,
and linkage. It should also be "alive", allowing information to be
updated quickly and safely. The design should support
"Document Management" (i.e. versioning and workflow), and it
should be able to do all of the above quickly.
What does "Open" mean in this context? Flat filesystem access
(real or emulated) is the lowest common denominator shared by all
open systems, but being able to pass SGML files between systems
requires additional sophistication. You really need to develop an
"element server" (effectively an element API) to have truly open
SGML.
There are four possible strategies which can be adopted when
developing a solution. The first involves the use of standalone flat
files (where the SGML tagged information sits in separate files);
this approach offers complete SGML fidelity, maximal openness,
and relative ease of update - however, retrieval can be hampered
by poor tools and performance, and it is difficult to perform
updates safely and securely. From the standpoint of data
management, this approach appears neutral (neither especially
good nor bad) because although there are no tools, there are also no
barriers.
The second strategy involves the use of indexed flat files; this is
the approach adopted by the Opentext product (i.e. it builds
structure and content indexes which can be then used to
access/retrieve information, support updating etc.; the indexes
relational database. The arguments in favour of this approach are
that it allows complete SGML fidelity, excellent openness, and
excellent retrieval. However, update implementation is complex,
as it is difficult to insert bytes into the middle of a large
indexed flat file without creating problems.
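The idea behind an indexed flat file can be sketched in a few
lines of Python. This is an illustration under stated
assumptions, not a description of how the Open Text product
works: the function names are invented, the index is a plain
dictionary of byte offsets, and read_element only copes with
elements that do not nest inside themselves.

```python
import re

# Matches a start tag in the raw bytes; end tags begin "</" and are skipped.
START_TAG = re.compile(rb"<([A-Za-z][A-Za-z0-9.\-]*)[^<>]*>")


def build_index(data):
    """Map each element name to the byte offsets of its start tags."""
    index = {}
    for m in START_TAG.finditer(data):
        name = m.group(1).decode("ascii")
        index.setdefault(name, []).append(m.start())
    return index


def read_element(data, offset, name):
    """Return the bytes from the start tag at offset through its end tag.

    Assumes elements of this name are not nested within each other.
    """
    end_tag = b"</" + name.encode("ascii") + b">"
    end = data.index(end_tag, offset) + len(end_tag)
    return data[offset:end]
```

With a real file the same offsets would be passed to seek(), so
an element can be fetched without scanning from the start. The
sketch also shows why updates are awkward: inserting bytes ahead
of an indexed element shifts every later offset, so the index
has to be rebuilt or patched.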
The third strategy requires the use of a relational database (and
Tim pointed out that 90% of the world's existing business
information is already stored in relational databases - so there is
a great deal of expertise with this kind of database). In a relational
database, SGML elements or entities are mapped into relations,
and extra information is included to model the SGML element
hierarchy. Some extensions to SQL may be provided to support
this approach.
Tim gave some examples of how relational databases have been
used to handle SGML in some of the products demonstrated at
this conference. The first used a single table of
three columns (Context representation, Properties, and SGML
Text); record boundaries are inserted at the starts of a subset of
"distinguished" elements. The hierarchy is stored and rebuilt via
Context Rep [?], whilst metadata (versioning and presentation) is
stored in properties. This means that the SGML can be
reconstructed on demand, and the product can perform structure
and/or content queries on the distinguished elements. In his
second example, the relational database was built from a simple
decomposition process - breaking the documents down into
distinguished elements (e.g. paragraphs); each table record is one
such element, with one field of text, and the rest of attributes. In
his final example, the SGML text is stored in BLOBs (Binary
Large Objects) divided purely for implementation reasons. In this
case, elements are stored as table entries with the fields: BLOB id,
parent ID, sibling ID, first child ID, and attributes.
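The last of these layouts can be sketched with Python's built-in
sqlite3 module. The table and column names, the extra "id" and
"name" columns, and the choice to give each leaf element one
block of text are assumptions made for this example; the talk
listed only the fields named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blob (id INTEGER PRIMARY KEY, content TEXT)")
conn.execute("""
    CREATE TABLE element (
        id INTEGER PRIMARY KEY,
        name TEXT,
        blob_id INTEGER,
        parent_id INTEGER,
        sibling_id INTEGER,
        first_child_id INTEGER,
        attributes TEXT
    )""")
conn.executemany("INSERT INTO blob VALUES (?, ?)",
                 [(1, "Report"), (2, "Some text.")])
conn.executemany("INSERT INTO element VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "doc", None, None, None, 2, ""),
    (2, "title", 1, 1, 3, None, ""),
    (3, "para", 2, 1, None, None, 'id="p1"'),
])


def serialize(conn, element_id):
    """Rebuild the tagged text of an element by following the ID links."""
    name, blob_id, first_child, attrs = conn.execute(
        "SELECT name, blob_id, first_child_id, attributes "
        "FROM element WHERE id = ?", (element_id,)).fetchone()
    parts = ["<" + name + (" " + attrs if attrs else "") + ">"]
    if blob_id is not None:
        parts.append(conn.execute(
            "SELECT content FROM blob WHERE id = ?", (blob_id,)).fetchone()[0])
    child = first_child
    while child is not None:
        parts.append(serialize(conn, child))
        # Move along the chain of siblings under the same parent.
        child = conn.execute(
            "SELECT sibling_id FROM element WHERE id = ?", (child,)).fetchone()[0]
    parts.append("</" + name + ">")
    return "".join(parts)
```

Calling serialize(conn, 1) walks the first-child and sibling
links to rebuild the document. Note that this toy layout cannot
represent text interleaved with child elements, which hints at
the fidelity problems discussed below.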
There are advantages to using the relational database approach.
SQL is a world-standard tool; the theory and practice for safe and
efficient updating of relational databases is well-understood. It
also offers the possibility of excellent integration with existing
document management techniques. However, using relational
databases offers poor SGML fidelity, the openness of the
information is compromised, and retrieval performance can also be
poor.
With this in mind, Tim proposed his fourth strategy for developing
an open system for handling SGML information - using a "native
SGML" (object-oriented type) database. In this case the Database
Definition Language (DDL) is SGML, and the Data Manipulation
Language (DML) is also SGML-oriented; there is no hidden or
relational machinery. As an example of a product that has
implemented this approach, Tim talked about SGML/Store; the
input is via an SGML parser, and the database essentially stores
the resulting ESIS. The software has API primitives for tree
navigation and content queries, and treats both elements and
attributes as nodes. It uses a transaction-oriented, versioned-object
model, and supports multiple instances and DTDs per database.
The advantages to using a native SGML database are that it allows
for complete SGML fidelity, it gives an opportunity to implement
commercial database security and integrity features, and it also
makes it possible to optimize for performance. The disadvantages
are that it involves a proprietary implementation, and the use of
proprietary API and/or DML.
In conclusion, Tim recommended that if it is possible to get away
with storing SGML documents in flat files, then this is the
preferred solution as it is simple and thus safe. He felt that there is
still a major requirement for the development of relevant standards
(e.g. an SQL for handling SGML documents), and hoped that this
need would be met sooner rather than later.
27. "User Requirements for SGML Data Management" -
Paula Angerstein (Texcel)
[Paula suggested that an appropriate alternative title to this
presentation could well have been the question, "Why do we want
to have SGML-based approaches to data management in the first
place?".]
The current business trend shows a growing awareness of
documents. Documents are at last being recognized as a corporate
asset, although they are often not managed as well as other types
of corporate information (such as financial data). There also
appears to be a gap in the corporate information infrastructure (for
example, many large companies cannot share documents
effectively and efficiently internally) - so strategies are needed to
track and share information.
There are a number of common document-related problems in
business: finding the right version of the right document,
keeping related documents compatible with each other, and
synchronizing documents with the latest data. Other typical
problems include getting a document customized for a particular
job, reusing appropriate existing document parts, and coordinating
the multiple access and update of documents.
An effective document repository management system should
provide a number of benefits: it should support automated quality
control, maintain document security, account for and trace any
amendments, facilitate document reuse, and assist worker
coordination. The successful solution to providing such a system
must offer facilities for the automation of the main business
processes, provide for the collaborative reuse of information, and
be easy to integrate with existing data management practices
(although this is often more of a cultural than a technical problem).
The key to a successful information management strategy is a
centralized information repository. It represents a "logical vault"
for all the documents and information relevant to a workgroup. As
a managed collection, documents can be browsed, queried, and
used by all members of a workgroup to collaborate on projects.
Versions, configurations, and status of information can be
centrally kept up-to-date, providing automated quality control,
accountability, and traceability. Information can be shared and
reused, guaranteeing integrity of data and boosting productivity.
SGML-based repository management goes beyond document
image and/or traditional document management. It makes it
possible to use the rich set of information in a document -
namely the markup as well as the content. Document components
can be shared, reused, and subject to configuration management.
Moreover, it means that automated processes can be driven by the
document contents (i.e. by the data), and so do not have to be left
to other tools.
This approach is different from traditional product configuration
management, in so far as documents are structured but have
"unpredictable" data (i.e. they contain elements which have no
fixed size, order or content type). The typically hierarchical
structure of documents does not model naturally as relational
tables, and the level of granularity required to track changes etc.
probably does not correspond to a document as it is presented to
the end user. Similarly, most people probably would not wish to
adopt a forms-based document editing environment.
Although SGML is often thought of as an interchange format only,
it also provides a small but powerful set of semantics for document
data modelling. Element and attribute definitions provide a
"schema" and attributes for objects in a repository. Entity
definitions provide a way to share units of information, and
IDs/IDREFs (together with appropriate external identifier
resolution or HyTime mechanisms) provide repository-wide
linking. The benefits of using SGML modelling in a data
repository stem from the fact that SGML is optimized for
documents. Document markup and structure contribute to the
process of retrieval. It also means that you need only one model
and language for information creation, management, and
interchange.
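As a rough illustration of repository-wide linking through
IDs/IDREFs, the sketch below keeps one index of declared IDs across
all documents and resolves a reference either within the referring
document or, when a target document is named, across documents. The
Repository class and its register and resolve methods are
assumptions made for this example; the presentation did not describe
a particular implementation, and no HyTime mechanism is modelled
here.

```python
class Repository:
    """Hypothetical repository index mapping (document, ID) pairs to elements."""

    def __init__(self):
        self._ids = {}

    def register(self, doc_name, element_id, element):
        """Record an element's ID; IDs must be unique within a document."""
        key = (doc_name, element_id)
        if key in self._ids:
            raise ValueError(f"duplicate ID {element_id!r} in {doc_name!r}")
        self._ids[key] = element

    def resolve(self, from_doc, idref, target_doc=None):
        """Resolve an IDREF locally, or in target_doc if one is given.

        Returns None when the reference cannot be resolved.
        """
        return self._ids.get((target_doc or from_doc, idref))


repo = Repository()
repo.register("manual", "sec-install", {"name": "section", "title": "Installation"})
repo.register("guide", "fig-1", {"name": "figure", "title": "System overview"})

local = repo.resolve("manual", "sec-install")
cross = repo.resolve("manual", "fig-1", target_doc="guide")
missing = repo.resolve("manual", "fig-1")
```

Keeping the document name in the key preserves the SGML rule that an
ID need only be unique within its own document, while still allowing
links that span the repository.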
SGML repository management enables new business processes. It
becomes possible to have on-demand generation of new
information products, through dynamic document assembly and
"data analysis". Element-based access control makes it easier to
share document components amongst the members of a
collaborative workgroup. It also becomes possible to track the
life-cycle of document components through element-based
configuration control. The use of structure-based queries
allows dynamic online viewing and navigation around document
components within the repository.
SGML repository management also facilitates existing business
processes such as storage, filing, browsing, query, retrieval, access
control, auditing, routing, versioning, job tracking and reporting,
archiving and backup. For many of these functions, an SGML
repository enables operation on individual elements in addition to
documents.
Documents are being recognized as an increasingly important part
of the information infrastructure of a company. With the
introduction of SGML-based approaches, we should witness a
gradual movement from the current notion of "document
management" to that of "information management".
28. "A Document Query Language for SGML Databases"
- Ping-Li Pang, Bernd Nordhausen, Lim Jyh Jang,
Desai Narasimhalu (Institute of Systems Science,
National University of Singapore)
[This presentation was delivered by Ms Ping-Li Pang.]
The Institute of Systems Science is one of the main research
departments at the National University of Singapore. Recently,
they have been looking at managing documents, especially using
SGML-based approaches, and this has led to the development of
DQL, a document query language for SGML databases.
The main requirements for a language to query a database of
SGML structured documents are that it must support queries on
the basis of complex structural information, facilitate information
retrieval, and assist document development. Ping-Li talked
through some examples of query expression in DQL, showing the
typical form of a query (select ... from ... where ...), and the use of
path expressions (for elements, database attributes, and SGML
attributes). She then discussed the DQL method for expressing the
following types of query: a DTD structural query, a document
instance structural query, a document instance content query, a
query on versioning information, and a query on links. [Readers
who would like to see examples of these queries should probably
contact the DQL team at the Institute of Systems Science].
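The general shape of a structural and content query can be sketched
without reproducing DQL itself. The following is not DQL syntax and
not the Institute's implementation; it is a minimal Python analogue
of a "select ... from ... where ..." query in which a dotted path
expression picks out elements and a predicate tests their content or
attributes. The functions select, match_path and text, and the
dictionary layout used for elements, are assumptions made for this
example.

```python
def match_path(element, steps):
    """Yield elements reached by following the element names in `steps`."""
    if not steps or element["name"] != steps[0]:
        return
    if len(steps) == 1:
        yield element
        return
    for child in element["children"]:
        if isinstance(child, dict):          # skip character data
            yield from match_path(child, steps[1:])


def text(element):
    """Concatenate all character data inside an element."""
    return "".join(c if isinstance(c, str) else text(c)
                   for c in element["children"])


def select(documents, path, where=lambda e: True):
    """Return elements matching the dotted `path` in any document that satisfy `where`."""
    steps = path.split(".")
    return [e for doc in documents
            for e in match_path(doc, steps) if where(e)]


def el(name, children, **attrs):
    """Helper for building example elements."""
    return {"name": name, "attrs": attrs, "children": children}


docs = [el("report", [
    el("section", [el("title", ["SGML Databases"])], id="s1"),
    el("section", [el("title", ["Closing Keynote"])], id="s2"),
])]

# Content query: titles whose text mentions "SGML".
titles = [text(t) for t in
          select(docs, "report.section.title", where=lambda e: "SGML" in text(e))]

# Structural query on an SGML attribute: the section with id "s2".
sections = select(docs, "report.section", where=lambda e: e["attrs"].get("id") == "s2")
```

Queries on DTD structure, versioning information and links, which
the talk also covered, are not represented in this sketch.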
DQL was an attempt to implement an SQL-like language for use
with SGML. It has the expressive power to query on structural
and/or content information at any granularity. DQL is being
implemented at the Institute of Systems Science and an initial
prototype that has all the features of DQL will be ready in March,
1995.
During questions, Ping-Li stated that the "database attributes"
mentioned in her examples are defined when the database model is
developed (i.e. the database attribute "title" maps from the content
of the SGML element "title" in the relevant DTD[?]). The DQL
development team have not looked at HyQ for querying SGML
documents.
29. Closing Keynote - Michael Sperberg-McQueen (Text
Encoding Initiative)
Following the success of the presentation he gave last year,
Michael Sperberg-McQueen had again been invited to deliver the
closing keynote speech. The full text of his address will be posted
to the comp.text.sgml newsgroup.
Michael was really posing the same question that he asked last
year, "What will happen to SGML and the market for SGML
information in the future?" He emphasised that he would be
stating his personal opinions only, and they should not be taken as
representative of the TEI or any other institution.
Michael noted that some progress had been made on several of the
issues that he raised in his closing address at SGML '92. The
growing expertise in SGML has meant that improved styles of
DTD design are being adopted. DTDs are being developed to
meet the users' information handling needs, not the requirements
of a system or application (often evidenced in earlier DTDs by the
presence of processing-motivated "tweaks"). The HyTime engines
which are now approaching should be able to close some of the
"gaps" of using SGML, since they will make it possible to use
architectural forms to convey some sense of system and/or
application awareness without compromising the SGML source.
SGML promises data portability, but it may also lead to
application portability. The HyTime and DSSSL standards should
facilitate this process, and developers will need to know about
them.
The biggest change since SGML '92 is the amount by which the
volume knob on the so-called "Quiet revolution" has been turned
up. SGML has already begun its entry into mainstream
information processing. It is already a standard that is being
adopted world-wide.
Whilst the use of HTML (the Hypertext Markup Language) on the
World Wide Web is perhaps not an ideal demonstration of what
SGML-based approaches can achieve, it is doing a very good
job of selling SGML to people who were previously ignorant or
resistant.
Michael said that he would like to see SGML-awareness
embedded into as many mainstream applications as possible as a
matter of course. When SGML-awareness is embedded almost
incidentally in an application, then SGML will be able to realize
its full potential, and it could be argued that this is already
beginning to happen. The future is perhaps not only closer than
we think, it may already be here now; the products demonstrated
at this conference, and the public domain tools such as
ObjectSGML, POEM and CoST are perhaps examples of this.
Michael predicted that it might not be too long before we see
SGML being used in the area of literate programming. It is clearly
an ideal case for an SGML application.
SGML to SGML transformations have been a key topic at this
conference. This is important, because in future only legacy data
that is being converted to SGML, and outputs from SGML
systems (e.g. for printing a document), will exist in non-SGML
form. All other information interchange will be done using SGML
and thus DTD to DTD transformations will be a fundamental
issue. The points raised by Dave Sklar in his presentation on
GLTP (now to be re-named STTL) will become highly pertinent,
as will the requirement to use GLTP and DSSSL in general.
Things are going to become more complicated, therefore we will
all need new/better tools for things like DTD design and
development, SGML systems development, DTD transformation
design, and the actual transformations themselves.
There is clearly still much more work to do. Mainstream vendors
will need to understand more about SGML, and if they do not
change their products then, in the long run, they will lose out as
customers come to expect/demand embedded SGML-support. It is
certain that technology will keep on changing, and superficially
attractive (non-SGML-based) solutions to managing information
will always ultimately fail. When this situation comes about,
SGML will not just be in the mainstream, it will be the
mainstream!
For further details of any of the speakers or presentations,
please contact the conference organizers:
Graphics Communications Association
100 Daingerfield Road 4th Fl
Alexandria VA 22314-2888
United States of America
Phone: +1 (703) 519-8160
Fax: +1 (703) 548-2867
-------------------------------------------------------------------------------
You are free to distribute this material in any form, provided
that you acknowledge the source and provide details of how to
contact The SGML Project. None of the remarks in this
report should necessarily be taken as an accurate reflection of
the speaker's opinions, or in any way representative of their
employer's policies. Before citing from this report, please
confirm that the original speaker has no objections and has
given permission.
-------------------------------------------------------------------------------
Michael G Popham
SGML Project - Computer Development Officer
Computer Unit - Laver Building
North Park Road, University of Exeter
Exeter EX4 4QE, United Kingdom
email: sgml@uk.ac.exeter M.G.Popham@uk.ac.exeter (INTERNET)
Phone: +44 392 263946 Fax: +44 392 211630
-----------------------------------------------------------------------------