Tracking for the long term

Revision as of 09:09, 6 August 2013

We're developing tracking databases to manage content over time. Because this approach is still software-dependent, and relational databases are inherently unstable, the plan is to periodically print reports from the database and store them as flat files in the top levels of the storage system, as a form of manifest.
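
The periodic flat-file report could be sketched roughly as follows. This is a minimal illustration rather than the script actually in use: it assumes Python with an in-memory SQLite database standing in for the real server, CSV as the flat-file format, and a hypothetical helper named dump_table_as_csv. The example.org URL is a placeholder.

```python
import csv
import io
import sqlite3


def dump_table_as_csv(conn, table, out):
    """Write every row of `table` to `out` as CSV, header row first."""
    # Table names cannot be bound as SQL parameters, so accept only simple identifiers.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


# Stand-in database (assumption: the real database runs on a different engine).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (purlnum INTEGER, realurl TEXT)")
conn.execute("INSERT INTO lookup VALUES (3234, 'http://example.org/item')")

report = io.StringIO()  # in practice this would be a file in the storage system
row_count = dump_table_as_csv(conn, "lookup", report)
```

A plain CSV file can be read without the database software, which is the point of keeping the manifest outside the database.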

InfoTrack and md5sums are located on libcontent.lib.ua.edu. The checkscripts database, which documents any errors encountered and when each script runs (and thus when the MD5 sums are verified), is described in Tracking_automated_scripts.

The InfoTrack database

allColls

This table contains enough information about each collection for a PHP script to dynamically deliver an alphabetized list of current online collections and EAD finding aids, with descriptions, icons, and links to the content (and, where available, the associated finding aid and a link to it online). Most of this information comes from the Collection_Information file that Digital Services personnel create when they begin to digitize a collection, drawing on the spreadsheets filled out by archivists. It is uploaded by the collToDbase script described in step 10 of Moving_Content_To_Long-Term_Storage, or by the moveContent script described in the third section of Most Content. These scripts also add the expected canned link for retrieving content and specify whether the collection is online yet. More information about this table is available here: allColls

inLOCKSS

Once a collection has been released for harvesting into LOCKSS, we must monitor the size of that content and any additions to it, and avoid any changes. As content is communicated over the network, we log here the identifier, manifest number, and date (and, for collections that are subcollections of others, such as rare books, the subcollection title and parent identifier).

lookup

This table was created to support persistent identifiers. It's a lookup table to provide redirects. Each item for which we would like to provide this support is assigned a number. The item is retrieved with a URL of the form http://purl.lib.ua.edu/3234, where the number following the last forward slash is the number of the item. The actual URL is stored in this table (realurl), along with the assigned number (purlnum), an original identifier (id_2009), a datestamp, and a history of any changes (history). This table is accessed by the script redirect.pl, which lives in the cgi-bin of libcontent.lib.ua.edu. A URL rewrite and a virtual host configuration, along with the DNS registration of purl.lib.ua.edu, were the only other support necessary for this to work. Whenever a file must be moved, the database is updated, and the persistent URL continues to work. In this fashion, we may enter URLs into metadata, webpages, online catalogs, etc., and never have to change them.
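
The redirect lookup and the update made when a file moves might look something like the sketch below. It is not redirect.pl itself: it assumes Python and an in-memory SQLite stand-in for the real database. The column names purlnum, realurl, and history come from the description above; the functions resolve and move_item, the format of the history note, and the example.org URLs are hypothetical.

```python
import sqlite3


def resolve(conn, purlnum):
    """Return the current real URL for a persistent number, or None if unassigned."""
    row = conn.execute(
        "SELECT realurl FROM lookup WHERE purlnum = ?", (purlnum,)
    ).fetchone()
    return row[0] if row else None


def move_item(conn, purlnum, new_url, moved_on):
    """Point purlnum at new_url and append the previous URL to the history field."""
    old_url = resolve(conn, purlnum)
    if old_url is None:
        raise KeyError(purlnum)
    note = f"{moved_on}: was {old_url}; "  # assumed note format
    conn.execute(
        "UPDATE lookup SET realurl = ?, history = COALESCE(history, '') || ? "
        "WHERE purlnum = ?",
        (new_url, note, purlnum),
    )


# Stand-in table holding one item that is then moved.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lookup (purlnum INTEGER PRIMARY KEY, realurl TEXT, history TEXT)"
)
conn.execute(
    "INSERT INTO lookup (purlnum, realurl) VALUES (3234, 'http://example.org/old/item')"
)
move_item(conn, 3234, "http://example.org/new/item", "2013-08-06")
```

Because only the realurl value changes, anything that cites the persistent number keeps resolving after the move.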

bornDigital

This table tracks born-digital content such as Electronic Theses and Dissertations, which may have embargoes on web delivery that must be tracked and supported. Fields here include the identifier (id_2009), first and last name of the primary author, collection number (collnum), datestamp entered, title, the date the content should be made available (dateAvailable, in the form yyyy-mm-dd), and exceptions.
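
An embargo check against dateAvailable could be sketched as below, assuming Python and an in-memory SQLite stand-in. The fields id_2009 and dateAvailable are taken from the description above; the function released_items and the sample rows are hypothetical, and the exceptions field is ignored here. Because the dates are stored as yyyy-mm-dd, comparing them as text gives the same order as comparing them as dates.

```python
import sqlite3


def released_items(conn, today):
    """Return identifiers whose dateAvailable is on or before `today` (yyyy-mm-dd)."""
    rows = conn.execute(
        "SELECT id_2009 FROM bornDigital WHERE dateAvailable <= ? ORDER BY id_2009",
        (today,),
    )
    return [row[0] for row in rows]


# Stand-in table with one released and one still-embargoed item (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bornDigital (id_2009 TEXT, title TEXT, dateAvailable TEXT)")
conn.executemany(
    "INSERT INTO bornDigital VALUES (?, ?, ?)",
    [
        ("sample_0001", "Open thesis", "2012-05-01"),
        ("sample_0002", "Embargoed thesis", "2015-05-01"),
    ],
)
```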

archival_formats

This table is designed to capture the type of format a file is in, a URL to information on that format, the quality of the capture, and the version of the format. For example, a TIFF version 6.0 file captured at 600 dpi would have a URL entered from the Unified Digital Format Registry ([[1]]), similar to this one now available via the PRONOM Digital Format Registry: [[2]].

By maintaining a record of the current format, version, and quality of each file in a database, we can easily identify and locate all files in need of migration or emulation in the face of approaching obsolescence.

There is a good chance that this table will be combined with the md5sums database tables.
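
A query for files in a format facing obsolescence might resemble the sketch below. The column names (item_id, format, version, quality), the function files_in_format, and the sample rows are all assumptions made for illustration, since the actual schema is not given here; an in-memory SQLite database stands in for the real one.

```python
import sqlite3


def files_in_format(conn, format_name, version):
    """Return identifiers of files recorded in the given format and version."""
    rows = conn.execute(
        "SELECT item_id FROM archival_formats "
        "WHERE format = ? AND version = ? ORDER BY item_id",
        (format_name, version),
    )
    return [row[0] for row in rows]


# Stand-in table with hypothetical columns and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE archival_formats (item_id TEXT, format TEXT, version TEXT, quality TEXT)"
)
conn.executemany(
    "INSERT INTO archival_formats VALUES (?, ?, ?, ?)",
    [
        ("sample_0001", "TIFF", "6.0", "600 dpi"),
        ("sample_0002", "TIFF", "6.0", "600 dpi"),
        ("sample_0003", "WAV", "1.0", "96 kHz"),
    ],
)
```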

dates_tagged

Here we track when content was rotated through the tagging software. For more on this, see User_tagging.

dates_transcription

Here we track when content was rotated through the transcription software. For more on this, see User_transcription.

userTagged

Here we track what was tagged, so it need not be rotated back through. For more on this, see User_tagging.

userTranscribed

Here we track what was successfully transcribed, so it need not be rotated back through the transcription software. For more on this, see User_transcription.

tags

This is the list of tags, what they were applied to, and how many times a particular tag was used for a particular item.

numItemPages

To judge how meaningful it is to have 3 transcriptions for an item, we need to know how many pages that item has. If it has 600 pages, only 3 transcriptions (one per page) is not very successful.
Here we collect information on how many pages items have within u0002 and u0003. For more on this, see User_transcription.
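
The comparison described above amounts to a simple ratio of transcribed pages to total pages. A minimal sketch in Python, where the function name transcription_coverage is hypothetical:

```python
def transcription_coverage(pages_transcribed, total_pages):
    """Return the fraction of an item's pages that have a transcription."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return pages_transcribed / total_pages


# The example from the text: 3 transcribed pages out of 600.
coverage = transcription_coverage(3, 600)
```

For the 600-page item above, this gives a coverage of half of one percent, whereas the same 3 transcriptions on a 3-page item would be complete coverage.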

geocode

Information extracted using the Google API is stored here for particular item locations. This enables us to apply the same lat/long to other items with the same location without calling the API repeatedly (there are limits on calls, as well as server overhead).

itemLocation

This works with the geocode table, indicating the item location for each of the items in the database. The locationID here corresponds to the locationID in the geocode table.
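
The caching role of the geocode table can be sketched as follows: look up a location locally first, and call the external service only on a miss. This assumes Python and an in-memory SQLite stand-in. Apart from locationID, the column names (place, lat, lng) are hypothetical, as is the function coordinates_for; fake_geocoder is a placeholder for the real API call and returns made-up coordinates.

```python
import sqlite3


def coordinates_for(conn, place, geocoder):
    """Return (locationID, lat, lng), calling `geocoder` only if `place` is not cached."""
    row = conn.execute(
        "SELECT locationID, lat, lng FROM geocode WHERE place = ?", (place,)
    ).fetchone()
    if row is not None:
        return row
    lat, lng = geocoder(place)
    cursor = conn.execute(
        "INSERT INTO geocode (place, lat, lng) VALUES (?, ?, ?)", (place, lat, lng)
    )
    # With an INTEGER PRIMARY KEY, lastrowid is the newly assigned locationID.
    return (cursor.lastrowid, lat, lng)


calls = []


def fake_geocoder(place):
    """Placeholder for the external API; records each call it receives."""
    calls.append(place)
    return (1.0, 2.0)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE geocode (locationID INTEGER PRIMARY KEY, place TEXT UNIQUE, lat REAL, lng REAL)"
)
first = coordinates_for(conn, "Sample Town", fake_geocoder)
second = coordinates_for(conn, "Sample Town", fake_geocoder)
```

The second lookup is answered from the table, so the placeholder service is called only once for the shared location.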

Donors

This table is not yet in use, and will likely expand. Its purpose is to track which donors have provided support for digitization or processing, or have donated content, so that we can link to their websites and use their logos in the display of content.

md5sums

itemSums

This table contains the identifier, file name, file path, current MD5 checksum, byte size, number of times modified, date of first entry, and a notes field.

modified

If a file was indeed modified (in which case that would have been indicated in the itemSums table), then there will be an entry here for each modification. The timestamp of the change, the identifier, filename, previous MD5 checksum, the byte size of the original file, the reason modified, and by whom are recorded here.
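
Verifying files against the stored sums could be sketched roughly as below. This is an illustration under stated assumptions, not the verification script in use: it assumes Python, an itemSums table reachable through a DB-API connection, and hypothetical column names (identifier, filepath, md5). The functions md5_of_file and find_mismatches are likewise hypothetical.

```python
import hashlib
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(conn):
    """Return identifiers whose file on disk no longer matches the stored checksum."""
    rows = conn.execute(
        "SELECT identifier, filepath, md5 FROM itemSums ORDER BY identifier"
    ).fetchall()
    return [
        identifier
        for identifier, path, expected in rows
        if md5_of_file(path) != expected
    ]
```

Any identifier this returns would either have a corresponding entry in the modified table, if the change was intentional, or point to a file that needs investigation.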