Type of content

Type of access

The difference between these applications mainly resides in the configuration parameters of the LOCKSS software.

This manual presents the most important parameters and describes concretely how to set up a Private LOCKSS Network.

Basic Private LOCKSS Network infrastructure

Basic PLN infrastructure

A PLN is a generic framework. For the sake of illustration, the figure represents one specific implementation of a PLN (MetaArchive).

LOCKSS caches: The LOCKSS caches are the nodes of the network in which digital objects are preserved and monitored as archival units (AUs). An archival unit is typically a one-year collection (size: 1 GB to 20 GB). The size of an AU results from a trade-off between the processing overhead required by large AUs and the ability to regularly check the integrity of every AU when there is a multitude of small AUs.

The daemon is a Java application which collects digital objects through HTTP requests to the original website, stores them inside the cache as an archival unit, computes an SHA-1 checksum, and regularly monitors their integrity by comparing the preserved content with the other caches in the network using a dedicated voting protocol (LCAP). The AUs are collected at different moments in time by the different nodes in the network to reduce the risk of communication issues. The content is regularly recrawled from the original website as long as it is still available. If a new version of the AU is available, the previous version is kept, but only the most recent AU is checked for fixity. A LOCKSS cache can also be configured as a proxy to deliver the preserved content.

The LOCKSS cache's local configuration can be managed through a web-based user interface. This interface allows local administrators to add or remove AUs from the cache, to view the daemon status and to modify the local cache configuration.
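The integrity-monitoring idea can be sketched as follows. This is a toy model, not the daemon's actual LCAP implementation: real polls involve nonces, proofs of effort and per-URL vote blocks, and the quorum handling is far more elaborate.

```python
import hashlib
from collections import Counter

def sha1(data: bytes) -> str:
    """Hex SHA-1 digest, as LOCKSS uses for content fixity."""
    return hashlib.sha1(data).hexdigest()

def audit(local: bytes, peer_hashes: list, quorum: int) -> str:
    """Toy model of a poll: compare our SHA-1 against peers' votes.
    If a quorum of peers agrees on a hash that differs from ours,
    our copy is considered damaged and a repair is triggered."""
    winner, votes = Counter(peer_hashes).most_common(1)[0]
    if votes < quorum:
        return "inconclusive"
    return "agree" if sha1(local) == winner else "damaged: repair needed"

good = b"archival unit payload"
peers = [sha1(good)] * 4          # four peers voting with intact copies

print(audit(good, peers, quorum=3))        # intact local copy
print(audit(b"bit rot!", peers, quorum=3)) # corrupted local copy
```

The point of the model is that no single copy is authoritative: damage is detected by disagreement with a quorum of peers, not by comparison with a master copy.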

LOCKSS cache web-based user interface

The administrative server: The administrative server is a simple web server delivering the global configuration (lockss.xml) to the caches composing the PLN. Local cache settings can, however, supersede the global configuration. The server generally also hosts the plugin repository and the title databases. In the specific case of MetaArchive, a conspectus tool manages the creation of title databases. The administrative server must be accessible to all PLN nodes so that they can retrieve their global configuration.

Title database: One or more XML files describing the content of the archival units: which plugin to use to collect each AU, and the AU's base URL.
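For orientation, a minimal title database entry might look like the following sketch. All names, plugin identifiers and URLs here are illustrative placeholders, not real values from any network:

```xml
<lockss-config>
  <property name="org.lockss.title">
    <!-- One entry per archival unit -->
    <property name="ExampleCollection2010">
      <property name="title" value="Example Collection 2010"/>
      <property name="plugin" value="edu.example.plugin.ExamplePlugin"/>
      <!-- Plugin parameters: here, the AU base URL and the year -->
      <property name="param.1">
        <property name="key" value="base_url"/>
        <property name="value" value="http://staging.example.edu/"/>
      </property>
      <property name="param.2">
        <property name="key" value="year"/>
        <property name="value" value="2010"/>
      </property>
    </property>
  </property>
</lockss-config>
```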

Plugins repository: The repository is typically a version-control server that stores the plugins. A plugin tells the LOCKSS box how to crawl the content (filtering rules), how to audit it, and where to find the manifest page. Only signed plugins are accepted (verified against a keystore).
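A LOCKSS plugin is itself an XML map of entries. The sketch below is purely illustrative: the identifier and rules are made up, and the crawl-rule action codes used here (1 for include, 2 for exclude) should be checked against the plugin documentation before use:

```xml
<map>
  <entry>
    <string>plugin_identifier</string>
    <string>edu.example.plugin.ExamplePlugin</string>
  </entry>
  <entry>
    <!-- Where the crawler starts: the manifest page -->
    <string>au_start_url</string>
    <string>"%slockss/manifest.html", base_url</string>
  </entry>
  <entry>
    <!-- Crawl rules: include everything below the base URL -->
    <string>au_crawlrules</string>
    <list>
      <string>1,"^%s", base_url</string>
    </list>
  </entry>
</map>
```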

Website manifest pages: Provide permission for LOCKSS caches to crawl and harvest an archival unit from the web server.
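A manifest page is an ordinary HTML page served by the publisher: it carries the LOCKSS permission statement and links into the content to be harvested. A minimal illustrative example (the URL and collection name are made up):

```html
<html>
  <head><title>Example Collection 2010 - LOCKSS Manifest</title></head>
  <body>
    <!-- The permission statement the crawler looks for -->
    <p>LOCKSS system has permission to collect, preserve, and serve this Archival Unit.</p>
    <!-- Entry point into the staged content -->
    <a href="content/2010/">Example Collection 2010</a>
  </body>
</html>
```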

How to set up a PLN admin server properly (what are the important parameters?)

A complete list of parameters (with descriptions) is available, but it would be useful to have a shortlist of the most important ones (roughly 100 parameters, according to Tom) with more detailed descriptions (typical values, range of acceptable values).

Insert the new cache IP address [your_lockss_box_ip_address] in <property name="id.initialV3PeerList">
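For illustration, the relevant fragment of lockss.xml might look like the sketch below. The peer addresses are placeholders; V3 peer identities take the form TCP:[ip]:port, with 9729 as the customary LCAP port:

```xml
<property name="id.initialV3PeerList">
  <list>
    <value>TCP:[10.0.0.11]:9729</value>
    <value>TCP:[10.0.0.12]:9729</value>
    <!-- Add the new cache here -->
    <value>TCP:[your_lockss_box_ip_address]:9729</value>
  </list>
</property>
```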

How to create a title database?

Placeholder for MetaArchive title db configuration (forthcoming)

How do you get your content into a LOCKSS box?

This can vary but typically, content is 'staged' on a web server so it can be harvested by the LOCKSS boxes in a PLN. An example of content staged for this purpose from the Simon Fraser University Library's Editorial Cartoon Collection is available.

LOCKSS content crawlers

LOCKSS preserves collections of URLs. An archival unit in LOCKSS terminology is really a set of URLs (UrlSet) with captured response data and HTTP headers. LOCKSS stores the URL payload bytes (current) and the HTTP headers (current.props) in discrete files. A vote is composed of a SHA-1 hash of the current file and the access URL (see VoteBlock). This type-agnostic method of storing and peer-polling content makes LOCKSS suitable for preserving practically every type of static file accessible over HTTP. Since the poller compares a vote's access URL before the content hash, potential SHA-1 collisions are effectively mitigated. Examine the LOCKSS file system to understand the relationship between the crawl URL and the file-system organization.
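The URL-to-filesystem mapping can be sketched as follows. The layout shown is a simplification for illustration (the real repository adds AU-level directories above the host, and escapes special characters); the point is that the directory tree mirrors the segments of the access URL, with the payload in a current file and the headers in current.props:

```python
import hashlib
from urllib.parse import urlparse

def au_cache_path(root: str, url: str) -> str:
    """Simplified sketch of how a crawl URL maps to a directory tree
    in the LOCKSS cache: host, then scheme, then each path segment,
    with the payload bytes stored in '#content/current'."""
    p = urlparse(url)
    segments = [s for s in p.path.split("/") if s]
    return "/".join([root, p.netloc, p.scheme] + segments
                    + ["#content", "current"])

def vote_hash(content: bytes) -> str:
    """A vote pairs the access URL with a SHA-1 hash of the current file."""
    return hashlib.sha1(content).hexdigest()

url = "http://staging.example.edu/collection/2010/page1.html"
print(au_cache_path("/cache0/gamma/cache", url))
print(vote_hash(b"example payload"))
```

This also illustrates why an access-URL change produces a new file in a new directory tree: the path is derived from the URL, not from the content.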

There are two basic types of content crawler in LOCKSS: the new content crawler and the repair crawler. The new content crawler is given a start URL and crawl scope rules (e.g. only descend below the start URL); it then follows links accordingly and discovers new content. In typical usage, the new content crawler only looks at the publisher's 'content staging area'. Both the new content crawler and the OAI crawler extend FollowLinkCrawler. Modified files captured by the new content crawler are saved as revisions.

The repair crawler is given specific URLs. The repair crawler does not follow links or discover new content. The repair crawler can crawl either publisher's content staging area or request repair data from PLN peers with V3 LCAP messaging (see requestRepairFromPeer in V3Poller.java). Received repairs are not saved as revisions.

Okay, so what?

A PLN is not required to maintain a content staging area with a copy of all content. However, the only method to populate a node is the new content crawler, so a PLN should be aware of the necessity of a content staging area: any sort of data recovery that bypasses the content staging area requires an already populated LOCKSS node. Also, LOCKSS stores the URL response headers and bytes in a file-system directory tree analogous to the segments of the access URL, so any sort of ephemeral URL scheme used to provide initial access may be inappropriate for LOCKSS in the long term. Revisions and duplication control in the LOCKSS cache depend on two factors: the file and the URL. A modification of the access URL that does not violate the crawl spec will result in a new file in a new directory tree in the LOCKSS cache. That being said, once the nodes of a PLN have cached the URL and established something resembling a quorum, the content staging area is not part of the routine polling and voting methods (unless it is configured as a repair source, or the repair source, for the repair crawler).

How to test a plugin (or test other aspects of a PLN)

How to ingest content with the UI?

How to monitor your AUs status?

How to replace, upgrade or insert a new node in a PLN already in production?

When a new cache is inserted into the PLN (because a new institution joins the PLN, a cache is repaired, or a cache disk reaches its regular 3-year replacement) and this new cache presents a valid security certificate, it should collect AUs not from the original source (the AU publisher) but from another cache in the PLN configured as a proxy. The idea is that the PLN should survive the institutions: the safe source of information is supposed to be the PLN itself, not the original source, which is presumed to be more prone to attacks and less safe than our PLN. This is a different point of view from "a LOCKSS cache can be used as a proxy for the original server at any time", which assumes that the original source is always the reference and which, in my understanding, should only apply to the GLN, not to PLNs. If I understand correctly, this behavior should be easy to configure by setting the AU parameter org.lockss.crawler.fetch_from_other_caches_only to true, forcing the cache to collect the AU from other caches.

Need Clarification

Can be deleted

The parameter org.lockss.crawler.fetch_from_other_caches_only is only used in the RepairCrawler. While repair data can be sent via the V3 LCAP messaging protocol (peer to peer; see in particular requestRepairFromPeer() in V3Poller.java), it appears the requisite enumerable list of URLs needing repair, as required by the repair crawler, is only produced by the V3 poll vote-tallying methods. I don't think a new node with no AUs and a vacated publisher staging area can participate in a poll. See Repair Crawler for more.

In terms of the GLN and access rights, I can see the applicability of this logic; otherwise a random network node could populate its own cache without proof of possession of, or rights to possess, the data.

Data transfer options

1. Populate the original content staging area.

The content staging area should maintain the original URL access scheme to provide seamless integration of a new node into an existing network.

The data on a publisher's content staging area is authoritative. However, data on the content staging area is not likely to be monitored for corruption, degradation or accidental change. While corrupted data will likely be repaired in a node after a poll (if sufficient nodes agree over the quorum and the too_close limit is met), modifications that change a file's last-write time will result in all nodes ingesting the modified content and pushing the original unmodified data out of polling scope. (Optional) input validation in the daemon would alleviate some of these concerns; this is a wish-list item for PLNs that use the standard RPM release.

2. Modify AU title list parameters to specify a LOCKSS cache as a crawl proxy (see Crawl Proxy).

This method proxies an AU's start URL through the crawl_proxy parameter and enables FollowLinkCrawler to gather data from a peer node.
Title list parameter definition:
Title list parameter definition:

This approach establishes a per-AU proxy host and is different than setting the global parameters org.lockss.crawler.proxy.enabled, org.lockss.crawler.proxy.host and org.lockss.crawler.proxy.port in the Expert Config.
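A sketch of what such a per-AU entry could look like in the title database. The attributes.crawl_proxy key shown here is an assumption based on the parameter name above, and the host and port are placeholders to be replaced with a real peer cache:

```xml
<property name="ExampleCollection2010">
  <property name="title" value="Example Collection 2010"/>
  <property name="plugin" value="edu.example.plugin.ExamplePlugin"/>
  <!-- Hypothetical: route this AU's new content crawl through a peer cache -->
  <property name="attributes.crawl_proxy" value="peer-cache.example.edu:8080"/>
</property>
```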

There are a few files to expunge from the transfer, and a few to keep, in order to convince the repository manager that a new content crawl has been completed. A new content crawl must be completed before the repository manager engages in polling for an archival unit.

This goes back to the node having no knowledge of the AU; I don't see how this would work.

While this method probably won't work for populating a node with content, coupling the ICP server with a Squid proxy instance to establish a single, unified reference point for title awareness of partitioned cache data across uneven nodes would be an interesting exercise (see Partitioned Caches and Title Awareness).

Advanced configuration, fine-tuning and more

Securing your PLN

For the web interfaces, IP filtering is turned on by default, and can be configured through the UI (Admin Access Control, Content Access Control). Stronger security is available (SSL, form authentication with inactivity timeouts, password rotation and strength requirements, etc.) for everything but the proxy. The easiest way to enable this is to choose one of the predefined policies by setting org.lockss.accounts.policy to one of {BASIC, FORM, SSL, LC}. BASIC just enables the user-accounting subsystem and allows you to create additional user accounts; FORM replaces the basic auth method (browser popup) with an HTML form; SSL replaces HTTP with HTTPS; and LC enables the strict Library of Congress security policies (strong passwords, password rotation, session timeouts, auditing of user-initiated events, etc.). When you enable SSL you can either give it a server certificate or let it generate one itself (in which case browsers will produce a scary warning about a self-signed certificate, which users will have to accept). I recommend you use the defaults at first and turn on additional security once you have things working.
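For example, switching on the form-based policy takes a single parameter line in the Expert Config (the value is one of the four policies described above; everything else here is just the standard key=value syntax of that screen):

```
# Predefined UI security policy: BASIC, FORM, SSL or LC
org.lockss.accounts.policy=FORM
```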

LCAP communication is unencrypted by default. If you put the appropriate keystores in /etc/lockss/keys, the startup scripts will automatically enable v3overSSL and sslClientAuth, which will cause all communication to be encrypted and all connections to be verified as coming from a node with a valid key. It's best to leave this until all or most of the nodes are up and running, as generating and distributing the key files adds a nuisance factor.