Text Class Collection Implementation

This portion of the DLXS Workshop focuses on
implementing a collection in the Text Class. It is organized as a hands-on
lesson, with the entire process outlined in detail. All of the steps are
included so that it can be repeated or used as a guide later. Links to the
detailed Text Class documentation are included.

Workshop Day 2 -- Tuesday Afternoon

Text Class Content Prep

In Text Class Content Prep we discuss the elements
and attributes required for Text Class delivery, the necessary architecture for
storing texts and collections, strategies and methods for converting texts to
conform to the Text Class DTD, and normalization.

Text Class DTD Overview

It is assumed that any texts to be converted to Text
Class already validate against another DTD for encoding monographic materials,
such as TEI Lite, that represents corresponding structures (chapters, pages,
etc.). Because of its extremely lax content modelling (almost every element is
defined to permit ANY of the named elements), the Text Class DTD is useful only
to ensure that the nomenclatures have been changed appropriately.

(Note: the default value for N is *; IDs will be automatically assigned
for elements without them.)

The Text Class DTD is a fluid document; more
attributes, and occasionally elements, are added as the need arises in
processing new collections. These basic requirements are unlikely to change,
however.

Text Conversion Strategies

DLPS does not have any preferred methods or quick
and easy tools for this stage of the process. Only you, looking at your texts
and your encoding practices, can do the intellectual work required to convert
the texts. You should do this with the tools you are most comfortable using,
whether they be macros in your favorite editor, perl scripts if you have strong
programming skills, OmniMark if you like that, or XSLT (my personal choice). We
have a fairly detailed XSLT strategy on the documentation website, which uses
freely-available or ubiquitous tools, and if you are creating XML documents
anyway, this might be a reasonable route to pursue.

We have also used perl scripts to convert
TEI Lite-encoded SGML into Text Class SGML, and are willing to make these
(largely undocumented) scripts available. We are happy to offer suggestions and
our historical experience in converting collections, but cannot really support
you with specific tools or methods in your conversion, as it is particular to
the encoding of your texts.

For today, we are going to be working with some
texts that are already in Text Class, and one file that is in a DTD based on
TEI Lite. We will be building them into a collection we are going to call
workshoptc.

This documentation will make use of the concept of
the $DLXSROOT, which is the place at which your DLXS directory structure
starts. We generally use /l1/, but for the workshop, we each have our own
$DLXSROOT in the form of /l1/workshop/userX. To determine what your $DLXSROOT
is, type the following commands at the command prompt:

cd $DLXSROOT
pwd

Create the directory $DLXSROOT/prep/w/workshoptc and
its data subdirectory with the following command:

mkdir -p $DLXSROOT/prep/w/workshoptc/data

Move into that directory with the following command:

cd $DLXSROOT/prep/w/workshoptc/data

This will be your staging area for all the things
you will be doing to your texts, and ultimately to your collection. At present,
all it contains is the data subdirectory you created a moment ago. We will be
populating it further over the course of the next two days. Unlike the contents
of other directories, everything in prep should be ultimately expendable in the
production environment.

Copy the necessary files into your data directory
with the following commands:
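A sketch of what those copy commands might look like; the source path here
is a placeholder, so substitute wherever the workshop texts are actually
staged on your system:

cp /path/to/workshop/texts/*.sgm $DLXSROOT/prep/w/workshoptc/data/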

The first file, bab3633.0001.001.sgm, is not yet in
the Text Class DTD. However, since it is a very simple text, a few changes will
make it so:

TEI.2 to DLPSTEXTCLASS

TEIHEADER to HEADER

DIV to DIV1

We'll also change the N attribute value in the
EDITORIALDECL to 4, as the text is pretty fully encoded for its size. If you
feel confident in your file editing skills in the unix environment, you can
make these changes now (don't forget the end tags!). Otherwise, copy the
following script and use it to change your file (this is just for this file --
for real document conversion, see Converting Documents to Text Class below):
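The following sed sketch makes the changes listed above; it assumes
uppercase tags, double-quoted attribute values, and an existing N attribute
on the EDITORIALDECL (inspect the output before replacing your original):

# A sketch only; handles both start and end tags for each rename.
sed -e 's/<TEI\.2>/<DLPSTEXTCLASS>/g' \
    -e 's|</TEI\.2>|</DLPSTEXTCLASS>|g' \
    -e 's/<TEIHEADER/<HEADER/g' \
    -e 's|</TEIHEADER>|</HEADER>|g' \
    -e 's/<DIV\([ >]\)/<DIV1\1/g' \
    -e 's|</DIV>|</DIV1>|g' \
    -e 's/<EDITORIALDECL N="[^"]*"/<EDITORIALDECL N="4"/' \
    bab3633.0001.001.sgm > bab3633.0001.001.sgm.converted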

Other Text Modifications

You need to decide whether you wish to keep
character entities (for example, &eacute;) in your text files or replace
them with their 8-bit ISO Latin 1 equivalents (for example, é). If you
choose to replace them, you will be able to search for blessed, for example,
and retrieve both blesséd and blessed, because the indexing process maps
both é and e to just e. Otherwise, you would have to search for
bless&eacute;d to retrieve the word with the diacritic. If you want to do
this (and this process is not necessarily valid for XML, only SGML!), use the
following command:
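A perl sketch of such a command; it covers only a handful of common
entities, so extend the substitution list to match the entities that
actually occur in your texts (-i.bak keeps a backup of each original file):

perl -i.bak -pe 's/&eacute;/é/g; s/&egrave;/è/g; s/&agrave;/à/g; s/&ccedil;/ç/g' *.sgm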

One way to help the cgi identify specific
text structures, like divisions, exactly is to insert unique attributes based
on a combination of the IDNO and the sequence of the division in the text. This
is an expendable ID and is not meant to permanently identify a structure -- use
your own thoughtfully assigned and permanent ID attributes for that. Before
indexing, check to see whether node attributes were applied when the documents
were converted to Text Class -- they will appear in the DIV tags for each
division and will look like this: <DIV1 NODE="AAN8938.0001.001:1">. If
they have not, use the following command to insert them:
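A perl sketch of such a command; the DLPS ID is hard-coded for our new text
(substitute your own), it numbers DIV1 elements only, and it leaves alone
any DIV1 that already has a NODE attribute:

# Stamp each DIV1 lacking a NODE attribute with IDNO:sequence.
perl -i.bak -pe '
  BEGIN { $id = "bab3633.0001.001"; $n = 0 }
  s{<DIV1\b(?![^>]*NODE=)}{"<DIV1 NODE=\"$id:" . ++$n . "\""}ge;
' bab3633.0001.001.sgm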

Validate and Normalize SGML

This step validates the SGML against the Text Class
DTD. It also normalizes the SGML, adjusting the tagging where necessary so
that it is consistent in the case and order of element attributes.
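A sketch of this step using the OpenSP tools (onsgmls to validate,
osgmlnorm to normalize); it assumes the DOCTYPE declarations in your files
resolve to the Text Class DTD, for example through an SGML catalog:

for f in *.sgm; do
  onsgmls -s "$f"            # validate; prints errors only
  osgmlnorm "$f" > "$f.norm" # write a normalized copy
done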

There are not likely to be any errors with the
workshoptc data, but tell the instructor if there are.

This will normalize the texts and result in new
texts with a .norm extension added. These are the files we will use to build
our new collection tomorrow morning.

Storing Texts and Page Images

As you may have noticed from our file copying steps
earlier, we store each digitized text in its own directory, based on its DLPS
ID, along with the related page images. The DLPS ID is a unique ID for each
text, based on the ID assigned to its MARC record by the OPAC. Directories are
created in the form $DLXSROOT/obj/d/l/p/dlpsid, where d, l, and p are the first
three characters of the DLPS ID (the ID can consist of a mix of number and
letter characters); for example, bab3633.0001.001 would be stored in
$DLXSROOT/obj/b/a/b/bab3633.0001.001. Pageviewer defaults to searching for
page images in a directory of this form, although there is a method that can
be overridden.

To facilitate links between the texts and the images
stored in the $DLXSROOT/obj directories, the middleware is configured to read a
four-million row table on our MySQL server containing page image metadata. In
$DLXSROOT/misc/bb there's a file called Pageview that is a CSV version of that
table, containing only the rows for the pages in the sample collection. We
are not using this during the workshop. During DLXS installation yesterday,
Alan and Phil could have chosen CSV as the database format, and you can always
look at this table as an example of the necessary metadata fields. However, we
have found that CSV does not scale: while it is feasible for the fewer than
2000 pages in the sample collection, it was not adequate in our production
environment. The most recent release of the DLXS middleware does not support
pageview.dat files. If you have created pageview.dat files in the past and
would like to upgrade to the new middleware, we are delivering a program
($DLXSROOT/bin/t/text/importpageviewdata.pl) that will convert pageview.dat
files into MySQL rows. Invocation is simple (don't do it -- just FYI):

$DLXSROOT/bin/t/text/importpageviewdata.pl [-f] -d "$DLXSROOT/obj"

The -f flag indicates a "full run", i.e., process
all files regardless of whether they have changed since the last run
(otherwise, a timestamp file determines which files have changed since the
last run). Based on the database format you chose during DLXS installation,
this process will populate the database with the information from any
pageview.dat files it encounters as it recursively runs through the directory
you specified.

More Documentation

Workshop Day 3 -- Wednesday Morning

Text Class Index Building with XPat

In this section the workshoptc SGML will be
concatenated and indexed with the XPat search engine, preparing it for use with
the DLXS middleware.

Set Up Directories and Files for XPat Indexing

Yesterday, we did what we needed to do with our
materials "by hand" -- today, we will work with the materials packaged in the
sampletc collection and adapt them for use with workshoptc. This
should parallel what you'll be doing back at your institutions. First, we need
to create the rest of the directories in the workshoptc environment with
the following commands:
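A sketch of those commands, one for each of the directories described below:

mkdir -p $DLXSROOT/bin/w/workshoptc
mkdir -p $DLXSROOT/obj/w/workshoptc
mkdir -p $DLXSROOT/idx/w/workshoptc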

The bin directory holds any scripts or tools used
specifically for the collection; obj holds the "object," or SGML/XML file, for
the collection; and idx holds the XPat indexes. Now we need to populate the
directories. First, change directories into $DLXSROOT/prep/w/workshoptc/data
and concatenate the texts into one collection with the following command:
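A sketch of this step; the name of the concatenated file and the sampletc
source path follow DLXS conventions but are assumptions:

cd $DLXSROOT/prep/w/workshoptc/data
cat *.norm > $DLXSROOT/obj/w/workshoptc/workshoptc.sgm

# Copy the index-building support files from the sample collection,
# renaming them for the new collection (these are the files edited
# in the next step):
cd $DLXSROOT/bin/s/sampletc
for f in *; do
  cp "$f" $DLXSROOT/bin/w/workshoptc/$(echo "$f" | sed 's/sampletc/workshoptc/')
done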

Four of these files need to be edited to reflect the
new collection name and the paths to your particular directories. This will be
true when you use these at your home institution as well, even if you use the
same directory architecture as we do, because they will always need to reflect
the unique name of each collection. Failure to change even one file can result
in puzzling errors, because the scripts are working, just not necessarily
in the directories you are looking at.

If you are comfortable editing in the unix
environment, change all references in the Makefile,
workshoptc.single.blank.dd, workshoptc.extra.srch, and workshoptc.inp from
/l1/ to your $DLXSROOT value, /s/ to /w/, and sampletc to workshoptc.
Otherwise, run the following command:

sh $DLXSROOT/paths

Build the Collection Specific Text Class DTD

Before indexing your collection, you will need to
create a collection-specific Text Class DTD. Because the class DTD supports any
element having any of the declared attributes (for convenience of DTD
creation), indexing "as-is" with XPat will cause your index to be unnecessarily
large. This may also cause problems for SGML/XML validation tools. A copy of
the textclass.stripped.dtd is included in the sample collection; you can create
your own from more recent versions of the Text Class DTD by running the
following command: (don't do it -- just FYI)

egrep -i "<\!ELEMENT" textclass.dtd > textclass.stripped.dtd

Next, use the "make dtd" command from the Makefile
to determine which attributes are used in your collection and build a custom
DTD. Using the "make validate" command will then validate your collection
against the new DTD. If the individual texts validated before, they should
validate as a concatenated collection now.

cd $DLXSROOT/bin/w/workshoptc
make dtd
make validate

Build the XPat Index

Everything is now set up to build the XPat index.
The Makefile in the bin directory contains the commands necessary to build the
index, and can be executed easily.

To create an index for use with the Text Class
interface, you will need to index the words in the collection, then index the
SGML/XML (the structural metadata, if you will), and then finally "fabricate"
structures based on a combination of elements (for example, defining what the
"main entry" is, without adding a <MAINENTRY> tag around the appropriate
<AUTHOR> or <TITLE> element). The following commands can be used to
make the index, alone or in combination. We will be using "make singledd,"
"make sgml," and "make post" (see the sketch after the list below).

make singledd indexes words for texts that
have been concatenated into one large file for a collection. This is the
recommended process.

make sgml indexes the SGML structure by
reading the DTD. It validates as it indexes, and is slower than multiregion
indexing (see below) for this reason. However, it is necessary for collections
that have nested elements of the same name (for example, a P within a NOTE1
within a P).

make multi (multiregion structure indexing)
indexes the SGML structure and relies on a "tags file" (included in the sample
collection) to know what SGML elements and attributes to index. It is rarely
used with fully-encoded full-text collections because of the nesting problem
mentioned above. If you'd like to try this on your own, index only the new
text (bab3633.0001.001.sgm.norm).

make mfsdd (multi-file system indexing)
indexes words and structure for each SGML text listed in the data dictionary
(dd) individually. Seems like a good idea -- no redundant copies of files! --
but searching is slower than an index built of concatenated files. Also, if any
one of the files referenced changes in any way, the entire index fails. We no
longer use MFS indexes ourselves for this reason. If you'd like to try this on
your own, note that this is pointing to the obj directories for the individual
texts, and does not include the fifth file we edited yesterday. You'd want to
point to your normalized texts in $DLXSROOT/prep/w/workshoptc/data or
rename those and copy them out to their individual $DLXSROOT/obj/x/y/z
directories.

make post builds and indexes fabricated
regions based on the XPat queries stored in the workshoptc.extra.srch file.
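A sketch of the sequence we will run, from the collection's bin directory:

cd $DLXSROOT/bin/w/workshoptc
make singledd   # index the words in the concatenated collection file
make sgml       # index the SGML structure, validating as it goes
make post       # build fabricated regions from workshoptc.extra.srch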

Workshop Day 3 -- Wednesday Afternoon

Text Class Collection to Web

These are the final steps in deploying a Text Class
collection online. Here the Collection Manager will be used to review
the Collection Database entry for workshoptc. The Collection
Manager will also be used to check the Group Database. Finally, we
need to work with the collection map and set up the collection's web
directory.

Review the Collection Database Entry with CollMgr

Each collection has a record in the collection
database that holds collection-specific configurations for the middleware.
CollMgr (Collection Manager) is a web-based interface to the collection
database that provides functionality for editing each collection's record.
Collections can be checked out for editing, checked in for testing, and
released to production. A collection database record for workshoptc has
already been created, and we will edit it. In general, a new collection
needs to have a CollMgr record created from scratch before the middleware can
be used. Take a look at the record to become familiar with it.

http://yourworkshopvirtualhost/cgi/c/collmgr/collmgr

Notice that it thinks it's the sampletc
collection. Change references to s/sampletc to w/workshoptc. Since we are not
building word wheels, remove the data in that field. Let's change the name as
well -- remove the reference to graphic:most-logo3bd3.gif and change it to
text:whatever you want to call it.

More Documentation

Review the Groups Database Entry with CollMgr

Another function of CollMgr allows the grouping of
collections for cross-collection searching. Any number of collection groups may
be created for Text Class. Text Class supports a group with the groupid "all".
It is not a requirement that all collections be in this group, though that's
the basic idea. Groups are created and modified using CollMgr. For this
workshop, the group "all" record has already been edited to include the
workshoptc collection. Take a look at the record to become familiar with
it.

http://yourworkshopvirtualhost/cgi/c/collmgr/collmgr

We won't be doing anything with groups; I'm sure you
will in Image Class.

Make Collection Map

Collection mapper files exist to identify the
regions and operators used by the middleware when interacting with the search
forms. Each collection will need one, but most collections can use a fairly
standard map file, such as the one in the sampletc collection. The map
files for all Text Class collections are stored in
$DLXSROOT/misc/t/text/maps.

Map files take language that is used in the forms
and translate it into language for the cgi and for XPat. For example, if you
want your users to be able to search within chapters, you would need to add a
mapping for how you want it to appear in the search interface (case is
important, as is pluralization!), how the cgi variable should be set (usually
all caps, and not stepping on an existing variable), and how XPat will identify
and retrieve this natively.

The first part of the file is operator mapping, for
the form, the cgi, and XPat. The second part is for region mapping, as in the
example above. There is an optional third part for collections with metadata
applied bibliographically, such as genre categories.

cd $DLXSROOT/misc/t/text/maps
cp sampletc.map workshoptc.map

More Documentation

Set Up the Collection's Web Directory

Each collection may have a web directory with custom
Cascading Style Sheets, interface templates, graphics, and JavaScript. The
default is for a collection to use the web templates at $DLXSROOT/web/t/text. A
collection-specific web directory may be created, and it is necessary if you
have any customization at all. For a minimal collection, you will want three
files: index.html, home.tpl, and textclass-specific.css.
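A sketch for setting one up from the sampletc defaults; the source path is
an assumption:

mkdir -p $DLXSROOT/web/w/workshoptc
cd $DLXSROOT/web/s/sampletc
cp index.html home.tpl textclass-specific.css $DLXSROOT/web/w/workshoptc/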