Re: Generating a DTD from XML files?

From: "Ernest G. Allen" <ernestgallen@earthlink.net>

To: Rob Lawson <rob.lawson@ktiworld.com>, xml-dev@lists.xml.org

Date: Thu, 17 May 2001 08:33:11 -0700

At 7:21 AM -0700 2001-05-16, Rob Lawson wrote:
>Hi,
>
>Does anyone know a utility or package to generate a basic
>DTD from XML files?
>
>Second question, does anyone have a link to useful XML FAQs,
>so I don't ask anymore possibly silly questions?
>
>Many thanks for any help,
>
>-------
>Robert G. Lawson (rob.lawson@ktiworld.com)
>KPM Consultant
>Knowledge Technologies International Ltd.
>Phone: +44 (0) 7866 610409
>Fax: +44 (0) 7970 030914
>http://www.ktiworld.com
>
There was a good paper at SGML '95 on this. See "Creating DTDs
via the GB-Engine and Fred" by Keith E. Shafer at:
http://www.oclc.org/fred/docs/sgml95.html
See especially sections 4, "Automatic DTD Creation Process"
and 5, "Reductions". It should help a lot.
I'll include an Awk program that I use as the first step when
creating a DTD from a collection of tagged documents. It might
help you get started.
It reads the ESIS output from SGMLS and writes out the full
"path" for each element like this:
doc (
doc chapter (
doc chapter section (
doc chapter section para (
doc chapter section para )
doc chapter section para (
doc chapter section para xref (
doc chapter section para xref )
doc chapter section para )
.
.
.
The same approach could be used with SAX events.
You can then write other utilities that use this output to
count the various nestings, which gives you a better chance of
tightening the cardinality constraints instead of just using
* or + for each element. At the least, it makes it easy to build
loose content models such as (X | Y | Z)* in order to get started.
/s/ Ernest G. Allen
//----------------------------------------------------------
## GI_path.awk -- accepts ESIS input, writes the full path
# from the root element to each element start and end tag.
#
# Uses "(" and ")" at the end of each line of output to
# indicate that the last GI on the line is a start tag or
# end tag, respectively.
#
# by Ernest G. Allen, 1995-2001
#
# No copyrights held, placed into the Public Domain.
#
/^\(/ { GI = $1; GI = substr(GI, 2); push(GI); next; }
/^\)/ { pop(); }
/^\?/ { print; }
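# Push a GI onto the stack and report the start tag.
function push(s) {
    stack_ptr++;
    stack[stack_ptr] = s;    # store the argument, not the global GI
    print_stack();
    print "(";
    return;
}
# Report the end tag for the current element, then pop it.
function pop() {
    print_stack();
    print ")";
    stack_ptr--;
    return;
}
# Print the GIs from the root down to the current element.
# The stack is 1-based (push increments before storing), so start at 1
# to avoid printing an empty stack[0]. The extra parameter keeps i local.
function print_stack(    i) {
    for (i = 1; i <= stack_ptr; i++) {
        printf("%s ", stack[i]);
    }
}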
\\----------------------------------------------------------