ODD by example utility

I have entertained myself recently by writing a utility
which attempts to work out the minimal TEI customization
needed to validate a collection of files.

What I have done is create an XSLT (version 2) stylesheet which
traverses a nominated directory tree looking for
*.xml files which have <TEI> or <teiCorpus> root
elements. It analyzes the collection of elements
and attributes in the resulting corpus, and compares
that to the whole of TEI P5. An ODD file is generated
which

* loads the required modules
* deletes any elements which are not used
* deletes any attributes (including class attributes)
which are not used by each element
* for every attribute which has a TEI "data.enumerated" datatype,
constructs a closed <valList> enumerating the values actually used.

From this you can construct a target schema, obviously.
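To make the collection step above concrete, here is a minimal sketch in
Python rather than XSLT. It is not the oddbyexample.xsl code; the names
collect_usage and local are invented for illustration, and it only
gathers the raw usage data (which elements occur, with which attributes
and values) without comparing against TEI P5 or writing an ODD.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"
ROOTS = {f"{{{TEI_NS}}}TEI", f"{{{TEI_NS}}}teiCorpus"}


def local(name):
    """Strip a '{namespace}' prefix from an ElementTree tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def collect_usage(corpus_dir):
    """Return {element name: {attribute name: set of values}} for a corpus.

    Hypothetical helper: walks corpus_dir, keeps only *.xml files whose
    root is <TEI> or <teiCorpus>, and records what is actually used.
    """
    usage = defaultdict(lambda: defaultdict(set))
    for folder, _subdirs, files in os.walk(corpus_dir):
        for filename in files:
            if not filename.endswith(".xml"):
                continue
            try:
                root = ET.parse(os.path.join(folder, filename)).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed
            if root.tag not in ROOTS:
                continue  # not a TEI document
            for el in root.iter():
                if not el.tag.startswith(f"{{{TEI_NS}}}"):
                    continue  # this sketch ignores non-TEI namespaces
                attrs = usage[local(el.tag)]
                for name, value in el.attrib.items():
                    # Simplification: xml:id and a plain id would both be
                    # recorded as "id" here.
                    attrs[local(name)].add(value)
    return usage
```

Calling collect_usage("/wherever/you/have/yourfiles/") would give a
dictionary whose keys are the elements to keep; anything in TEI P5 that
is not in it is a candidate for deletion.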

Is this of interest to anyone apart from me? If so,
I could do with some testing and feedback.[1]

Memory capacity is an issue, obviously. My test set
is the XML files in the TEI P5 Guidelines "Test" directory,
and it can run over all the Shakespeare plays in a few seconds,
but it's not going to read a giant corpus unless you have
plenty of memory to assign to Java. Caveat emptor.[2]

Re: ODD by example utility

I shall be interested to hear whether it flies for you.
As I hope I indicated, I have not really paid
much attention to memory usage. The thing is relatively
easy if you have all the docs in memory at once, but
doing it in a scalable way to allow for multi-gigabyte corpora
would require a lot more care.
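As a rough illustration of what "more care" might look like, here is a
sketch in Python (not part of the XSLT utility) that processes one file
at a time with a streaming parser and keeps only the accumulated usage
table in memory. The function name add_file_usage is made up, and the
sketch assumes the caller has already chosen which files are TEI.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def add_file_usage(path, usage):
    """Merge one file's element/attribute usage into `usage`, streaming the parse."""
    for _event, el in ET.iterparse(path, events=("end",)):
        if el.tag.startswith(TEI_NS):
            attrs = usage[el.tag[len(TEI_NS):]]
            for name, value in el.attrib.items():
                attrs[name.rsplit("}", 1)[-1]].add(value)
        # Discard the text, attributes and children already counted, so a
        # large document is not held in full while it is being read.
        el.clear()
    return usage


# Usage: one shared table, filled file by file.
# usage = defaultdict(lambda: defaultdict(set))
# for path in list_of_tei_files:      # hypothetical list of file paths
#     add_file_usage(path, usage)
```

The usage table grows with the number of distinct elements, attributes
and values, not with the size of the corpus, which is the point of the
exercise.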

>> * deriving simplified content models (beyond what Roma already
>> does)
>
> IIRC, there have been several papers written about proof-of-concept
> projects that do this kind of stuff.

My inclination is to improve what Roma does in this
area, rather than implement it in this utility, if
there is a need. But I guess that's a job for another
day :-}

Re: ODD by example utility

This is a wonderful idea. I'll give it a good workout next week -- I
have several projects that can really make use of it, and one in
particular has several thousand TEI files, so it'll be a serious stress
test. I can throw 6 or 7 GB of memory at Java if necessary.

Cheers,
Martin

Sebastian Rahtz wrote:

> I have entertained myself recently by writing a utility
> which attempts to work out the minimal TEI customization
> needed to validate a collection of files.
>
> What I have done is create an XSLT (version 2) stylesheet which
> traverses a nominated directory tree looking for
> *.xml files which have <TEI> or <teiCorpus> root
> elements. It analyzes the collection of elements
> and attributes in the resulting corpus, and compares
> that to the whole of TEI P5. An ODD file is generated
> which
>
> * loads the required modules
> * deletes any elements which are not used
> * deletes any attributes (including class attributes)
> which are not used by each element
> * for every attribute which has a TEI "data.enumerated" datatype,
> constructs a closed <valList> enumerating the values actually used.
>
> From this you can construct a target schema, obviously.
>
> Is this of interest to anyone apart from me? If so,
> I could do with some testing and feedback.[1]
>
> Memory capacity is an issue, obviously. My test set
> is the XML files in the TEI P5 Guidelines "Test" directory,
> and it can run over all the Shakespeare plays in a few seconds,
> but it's not going to read a giant corpus unless you have
> plenty of memory to assign to Java. Caveat emptor.[2]
>
> Want to try? grab getfiles.xsl and oddbyexample.xsl from Sourceforge
> (http://tei.svn.sourceforge.net/viewvc/tei/trunk/Stylesheets2/tools2/)
> and run it something like this:
>
>   saxon -o my.odd oddbyexample.xsl oddbyexample.xsl corpus=/wherever/you/have/yourfiles/
>
> The script assumes you have the TEI package which has a file
> called "/usr/share/xml/tei/odd/p5subset.xml". If you don't
> have that, grab http://www.tei-c.org/release/xml/tei/odd/p5subset.xml,
> put the file somewhere, and add a "tei" parameter to point
> at it.
>
>
> [1] Warning: I don't think I can face
> adding the code to handle any or all of
>
> * deriving simplified content models (beyond what Roma already does)
> * adding new elements and deriving a content model
> * dealing with non-TEI namespaces
> * generating attribute datatypes with complex regexps
> * working out Schematron constraints etc
>
> but of course you are welcome to try yourself :-}
>
> [2] no, not literally! it's open source, free etc
>

Re: ODD by example utility

I think this would be an incredibly useful tool. I personally find it
much easier to arrive at a desirable set of constraints on the encoding
of a particular set of files by actually doing some encoding first,
rather than sitting with Roma open and thinking "now what set of values
should I allow for this attribute?" If I could work on a small sample,
say ten or so files (if they're short), and then, when I was happy with
them, generate an ODD to use on the next couple of hundred files, why
then I would be a happy man.

Re: ODD by example utility

> If I could work on a small sample,
> say ten or so files (if they're short), and then, when I was happy with
> them, generate an ODD to use on the next couple of hundred
> files, why then I would be a happy man.

then I think you are but a download away from achieving
that much-desired state :-}

Re: ODD by example utility

Sebastian Rahtz wrote:
> Is this of interest to anyone apart from me? If so,
> I could do with some testing and feedback.[1]

Hey, this is just great. Just tried it with our corpus and it works like
a charm. The resulting schema fits our corpus format much more neatly
than the previous hand-crafted ODD. The only thing one really has to
take care of is checking whether it restricts things that should be
possible but do not yet occur.
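For readers wondering what such a restriction looks like, here is a
small Python sketch that builds the kind of ODD fragment described in
the original announcement: an <elementSpec> in change mode whose
attribute carries a <valList> of the values seen. The function name
restrict_attribute is invented, the mode="replace" on <valList> is my
assumption about a sensible setting rather than a statement of what
oddbyexample emits, and the list_type parameter is there so that a list
can be made "semi" or "open" where more values are expected later.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def restrict_attribute(element, attribute, values, list_type="closed"):
    """Build an <elementSpec> limiting `attribute` on `element` to `values`."""
    def tei(name):
        return f"{{{TEI_NS}}}{name}"

    spec = ET.Element(tei("elementSpec"), {"ident": element, "mode": "change"})
    att_list = ET.SubElement(spec, tei("attList"))
    att_def = ET.SubElement(
        att_list, tei("attDef"), {"ident": attribute, "mode": "change"}
    )
    # Assumption: replace the inherited value list outright.
    val_list = ET.SubElement(
        att_def, tei("valList"), {"type": list_type, "mode": "replace"}
    )
    for value in sorted(values):
        ET.SubElement(val_list, tei("valItem"), {"ident": value})
    return spec


# Usage: serialize a fragment for pasting into a <schemaSpec>.
# ET.register_namespace("", TEI_NS)
# print(ET.tostring(restrict_attribute("div", "type", {"act", "scene"}),
#                   encoding="unicode"))
```

Reviewing the generated ODD for lists like this, and loosening the ones
that are too tight for files yet to be encoded, is the check described
above.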

Just a thought: we use some attributes and elements from a custom
namespace. Previously I defined them in my own customisation, now I just
dropped them into the output of oddbyexample. Would it be possible to
use my previous odd as starting point for oddbyexample?

Re: ODD by example utility

Stefan Majewski wrote:
>
> Just a thought: we use some attributes and elements from a custom
> namespace. Previously I defined them in my own customisation, now I
> just dropped them into the output of oddbyexample. Would it be
> possible to use my previous odd as starting point for oddbyexample?

Hmm. I am not sure how this might work. Can you send me your ODD so that
I can have a try?