Beginning 9/30/11, we capture snapshots of our MODS records for our archive on a quarterly basis.

We do not capture snapshots more often because of the overhead cost of long-term storage (LOCKSS) for each new version of a
file, and also for each new version of the Manifest from which the files are linked. Each collection has a Manifest, which
tells the LOCKSS partners which files to copy. Every change to a Manifest creates a new version of that Manifest.
For a small collection the Manifest is a small file; for a large collection it can be a large one. The more Manifest versions we
have, the more space we have to pay for in LOCKSS on an annual basis (above and beyond storage and local backup costs).

Metadata is also constantly changing. While we provide a minimal MODS record for each item on upload, the Metadata Unit remediates these as time permits, overwriting them with improved versions that contain, for example, LCSH (Library of Congress Subject Headings).

The snapshot-capturing software is a Perl script, currently called "getMODS", that resides in /srv/scripts/storing/MODS/ on libcontent1.
When run, it explores the Acumen web directories in /srv/www/htdocs/content/ on the same server, below the collection level, looking for
MODS files. When it finds one, it checks the following:

- Is there a MODS file for this thing in the archive? (/srv/archive/ area)
  - If not:
    - this one is copied over,
    - versioned to version 1, and
    - linked into the Manifest
  - If so, does this MODS match that one?
    - If so: no action is taken. This version is already in the archive
    - If not, is there a version 2 already in the archive?
      - If not:
        - This one is copied over,
        - versioned to version 2, and
        - linked into the Manifest
      - If so:
        - The existing MODS is backed up
        - This MODS is copied over
        - This MODS is versioned to version 2, and
        - NOT linked in (version 2 is already linked, because it exists)
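
The decision steps above can be sketched in a few lines of Python. This is only an illustration, not the getMODS Perl code: the function name, the ".v1"/".v2" file naming, the use of a plain set to stand in for the Manifest, and the choice to compare against the newest archived version are all assumptions made for the example.

```python
import shutil
from pathlib import Path


def archive_mods(web_mods: Path, archive_dir: Path,
                 manifest_links: set, backup_suffix: str) -> str:
    """Apply the snapshot decision steps to one MODS file; return the action taken."""
    # Hypothetical naming scheme: <stem>.v1.xml and <stem>.v2.xml in archive_dir.
    v1 = archive_dir / f"{web_mods.stem}.v1.xml"
    v2 = archive_dir / f"{web_mods.stem}.v2.xml"

    # No MODS in the archive yet: copy, version to 1, link into the Manifest.
    if not v1.exists():
        shutil.copy2(web_mods, v1)
        manifest_links.add(v1.name)
        return "archived as version 1"

    # Assumption: "match" means byte-identical to the newest archived version.
    latest = v2 if v2.exists() else v1
    if web_mods.read_bytes() == latest.read_bytes():
        return "no action"

    # Changed, and no version 2 yet: copy, version to 2, link into the Manifest.
    if not v2.exists():
        shutil.copy2(web_mods, v2)
        manifest_links.add(v2.name)
        return "archived as version 2"

    # Changed, and version 2 exists: back it up, overwrite it, do NOT relink.
    shutil.copy2(v2, v2.with_name(v2.name + backup_suffix))
    shutil.copy2(web_mods, v2)
    return "version 2 replaced"
```

Because version 2 is overwritten in place once it exists, the Manifest gains at most two links per MODS file no matter how many times the record is remediated.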

Backups are made of all changed Manifests. Backups of Manifests and of MODS files have the suffix "_LOCKSS_yyyy-mm-dd" if
the collection has been harvested into LOCKSS, and the suffix "_yyyy-mm-dd" if not.

Those with "_LOCKSS_" in the suffix are counted when we estimate our LOCKSS storage for yearly fees.
Those without it may be discarded in the future if we determine we don't need them.
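
As a small illustration of the naming rule, the sketch below builds a backup name and tests whether a backup counts toward the LOCKSS estimate. The function names are hypothetical, and appending the suffix to the end of the full filename is an assumption; the scripts themselves may place it differently.

```python
from datetime import date


def backup_name(filename: str, harvested_into_lockss: bool, on: date) -> str:
    """Append the dated backup suffix described above to a filename."""
    stamp = on.isoformat()  # isoformat() yields yyyy-mm-dd
    marker = "_LOCKSS_" if harvested_into_lockss else "_"
    return f"{filename}{marker}{stamp}"


def counts_toward_lockss(backup_filename: str) -> bool:
    """True if the backup carries the _LOCKSS_ marker used for fee estimates."""
    return "_LOCKSS_" in backup_filename
```

For example, `backup_name("Manifest.html", True, date(2011, 9, 30))` gives `Manifest.html_LOCKSS_2011-09-30`, where "Manifest.html" is only a placeholder filename.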

Since getMODS manages a huge quantity of files, one must be prepared for potential blips in the process. After running
getMODS (and after all storage procedures), always run "checkArchive" in /srv/scripts/storing/ (precede it with "nohup", as it
may take several hours to run). This verifies that everything that *should* be in the Manifests is linked somewhere, and that everything
in the Manifests exists in the directories at the place linked. The output is /srv/scripts/storing/ArchiveERRORS, which
must be checked after running "checkArchive".
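
The two-way verification can be pictured as a pair of set differences. The sketch below is not the checkArchive script: it assumes the Manifest links have already been read into a set of paths relative to the archive root, and the function name and the "*.xml" pattern are hypothetical.

```python
from pathlib import Path


def check_archive(linked: set, archive_root: Path, pattern: str = "*.xml"):
    """Return (files on disk but not linked, links whose file is missing)."""
    # Collect every matching file under the archive root as a relative path.
    on_disk = {
        p.relative_to(archive_root).as_posix()
        for p in archive_root.rglob(pattern)
        if p.is_file()
    }
    unlinked = sorted(on_disk - linked)  # should be in a Manifest but is not
    missing = sorted(linked - on_disk)   # in a Manifest but not where linked
    return unlinked, missing
```

Two empty lists would correspond to a clean run; anything in either list is the kind of entry one would expect to see reported in ArchiveERRORS.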

If MODS files are listed in ArchiveERRORS, return to the /MODS subdirectory and run "getMissingMODS". It reads ArchiveERRORS, pulls out the MODS listings, and adds them to the correct Manifests. I do not yet know why there are blips in this process; they appear to happen only with the really huge collections, so it may be a memory error. This should be addressed in the next quarterly backup if it continues to be a problem. It may be that it was only a problem the first time, as our first capture was so huge (many thousands of files).
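
The first half of that repair step, pulling the MODS listings out of the error report, might look like the sketch below. The format of ArchiveERRORS is not described here, so the example assumes one file path per line and assumes that MODS filenames contain ".mods."; both are guesses, as is the function name. Grouping by parent directory is only a stand-in for finding the correct Manifest.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def missing_mods_by_dir(error_lines, marker: str = ".mods."):
    """Group MODS entries from an error report by their parent directory."""
    grouped = defaultdict(list)
    for raw in error_lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        path = PurePosixPath(line)
        if marker in path.name:  # keep only entries that look like MODS files
            grouped[str(path.parent)].append(path.name)
    return dict(grouped)
```

Each group could then be added to the Manifest for that directory's collection in a single pass, rather than reopening the Manifest once per file.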

Currently, the schedule for MODS backups is September, December, March and June of each year.