L2/04-333

Proposed distribution format for the UCD

Eric Muller, Adobe Systems Inc.
August 5, 2004

the UNIDATA directory, which contains the
complete set of files for the current version.

one x.y-Updaten directory for each version, which contains an
incremental update over its predecessor version, i.e. only
those files which have changed. Furthermore, the names of the
files include a version number, which is not present in the
UNIDATA directory.

The main reason for this organization was to minimize the size
of the UCD data, in part to make downloads easier. However, this
approach also has a number of problems:

Reconstructing an old version of the UCD is a non-trivial
task. Besides having to go to previous versions to find some
files, the user must also know to ignore
some older files which are not propagated (e.g. Props.txt in
2.1.9 is not part of 3.0), or which have been renamed
(e.g. DerivedNormalizationProperties.txt in 3.1 renamed to
DerivedNormalizationProps.txt in 3.2).

Since an Update directory does not necessarily include
all the files of that release, and those present have
different names than they have in the UNIDATA directory, we
cannot make hyperlinks between the files. For example,
UCD.html mentions StandardizedVariant.html, but does not link
to it. Hyperlinks would force us to go across Update directories,
and/or make some files different between the UNIDATA and the
Update directories.

Size is no longer a constraint, so the space saving afforded by
this organization matters little. A complete UCD 4.0, including
Unihan.txt, is less than 33 megabytes, and a compressed ZIP is less than 7
megabytes. Furthermore, each release tends to touch most of the files,
thereby defeating the incremental organization.
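
To illustrate why reconstructing an old version is non-trivial under the
current organization, here is a minimal sketch of the overlay a user has to
perform. It is not an existing tool. The function names, the version-suffix
pattern (a "-x.y.z" just before the extension) and the shape of the
dropped/renamed tables are all assumptions made for illustration; the user
has to supply those tables by hand, which is precisely the difficulty.

```python
import re
from typing import Dict, Iterable, Mapping, Set, Tuple

# Assumed naming convention: "-<version>" just before the extension,
# e.g. "UnicodeData-3.0.0.txt" -> "UnicodeData.txt".
_VERSION_SUFFIX = re.compile(r"-\d+(?:\.\d+)*(?=\.[A-Za-z]+$)")


def strip_version(filename: str) -> str:
    """Return the unversioned name of a file from an Update directory."""
    return _VERSION_SUFFIX.sub("", filename)


def reconstruct(
    updates: Iterable[Tuple[str, Iterable[str]]],
    dropped: Mapping[str, Set[str]],
    renamed: Mapping[str, Mapping[str, str]],
) -> Dict[str, Tuple[str, str]]:
    """Overlay incremental updates, oldest first.

    updates: (directory, filenames) pairs, up to the wanted version.
    dropped: directory -> unversioned names no longer propagated there.
    renamed: directory -> {old unversioned name: new unversioned name}.
    Returns: unversioned name -> (directory, actual filename).
    """
    current: Dict[str, Tuple[str, str]] = {}
    for directory, filenames in updates:
        # Forget files that this release stops propagating.
        for name in dropped.get(directory, ()):
            current.pop(name, None)
        # Carry renamed files forward under their new name.
        for old, new in renamed.get(directory, {}).items():
            if old in current:
                current[new] = current.pop(old)
        # Files present in this update replace any older copy.
        for filename in filenames:
            current[strip_version(filename)] = (directory, filename)
    return current
```

With a self-contained directory per release, none of this bookkeeping is
needed: the content of a version is simply the content of its directory.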

The overall proposal is to stop publishing new Update directories,
and instead to publish each version of the UCD as a self-contained
set of files.

Here are more specific details of this proposal:

The proposed layout is to have one subdirectory in
http://www.unicode.org/Public for each release:

The ucd directories would contain all
the UCD files for the corresponding releases, and hyperlinks
between the files (represented as relative links) would be
allowed.

The purpose of the new intermediate ucd directory is to
provide a home for other data that is part of a release, such as
specific versions of the UAXes or the code charts. Ultimately, the
last published book plus the content of the directory for a
release would form a complete definition of the corresponding version
of the standard. However, adding those components is not part of
this proposal.

The UNIDATA entry would be retained, and be made to have the
same content as the directory of the latest version (either by
some linking/redirecting magic, or simply by having a copy of the
same content).
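
As an illustration of the "copy" option, the following is a minimal sketch
that refreshes UNIDATA from the ucd directory of the latest release. It
assumes a local mirror of the Public tree in which release directories are
named with dotted numbers (e.g. 4.0.1); the function names are hypothetical,
and this is not a description of how the site is actually maintained.

```python
import re
import shutil
from pathlib import Path
from typing import Tuple


def version_key(name: str) -> Tuple[int, ...]:
    """Order release names numerically, so that 4.0.1 sorts after 4.0.0."""
    return tuple(int(part) for part in name.split("."))


def latest_release(public_root: Path) -> str:
    """Return the name of the highest-numbered release directory."""
    releases = [
        entry.name
        for entry in public_root.iterdir()
        if entry.is_dir() and re.fullmatch(r"\d+(?:\.\d+)*", entry.name)
    ]
    if not releases:
        raise ValueError(f"no release directories under {public_root}")
    return max(releases, key=version_key)


def refresh_unidata(public_root: Path) -> str:
    """Make UNIDATA a copy of <latest release>/ucd; return that release."""
    latest = latest_release(public_root)
    source = public_root / latest / "ucd"
    target = public_root / "UNIDATA"
    # Remove the previous content, whether it is a link or a real directory.
    if target.is_symlink():
        target.unlink()
    elif target.exists():
        shutil.rmtree(target)
    shutil.copytree(source, target)
    return latest
```

The linking/redirecting option would avoid the duplicate copy, but depends
on what the server and file system support.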

We should also rebuild the directories corresponding to
earlier releases, starting with 2.0.0 (it is just not worth going
further back in time, and the data is not available in electronic form
for all the 1.x releases).

We should also provide a ZIP file for each release, simply
to facilitate http-based access. The proposal is to have one ZIP
file named ucd-release.zip per release,
placed in the directory for that release (that is, next to the
ucd directory it contains). Filenames in that
ZIP file would include directories, starting with the directory of
the release; in other words, the file 4.0.1/ucd-4.0.1.zip would
contain: