The objective of collecting these thoughts is to organize a rationale for how to use the various available tools on a collaborative scientific project which is expected to generate information products of lasting value. In the near term, a specific project will be driving the development of this solution. However, it is hoped that the salient features of both the problem and the solution can be distilled into a more generalized understanding which will aid projects of this type.

For the purposes of this effort, modern information systems are defined as being comprised of both human and machine. Humans are involved at the start, and are ultimately the audience for any information. It is ultimately the human's job to understand the information asset in order to effectively discharge their roles as producers and consumers of information. Between the producers and consumers are the custodians, which are often machines.

In the above figure, the rectangles represent individuals involved with the understanding of the material. The circle represents the storage system, which is capable of storing, preserving, indexing and retrieving the material without attempting to understand it.

Information management, in the context of this producer-custodian-consumer model, involves defining the responsibilities and expectations of each role as well as how the roles interact.

all nodes which are the same distance from the root are the same "type"

Two methods to implement hierarchical classification:

tree of nodes

common set of properties

Differences in searching/forming groups:

locating subtrees

searching for a value (or combination of values) in the correct property(ies)

Typically in a hierarchical classification tree, each level (in "distance from the root of the system") has a specific meaning, and would always map to the same property. See the following.

If the above were:

implemented as a "tree",

a set of categories would be created to reflect the accepted "Domains", "Kingdoms", "Phylums"

Each category/subgroup explicitly claims membership in the appropriate superior node.

Each resource explicitly claims membership in the appropriate category in the tree.

Resources which participate in this classification scheme are identified because they claim membership in something which can trace it's lineage back to the root node of the scheme...in this case, "life".

implemented as a set of properties,

resources which participate in the classification scheme are identified by the presence of properties without regard for their value. The presence of subordinate properties is optional, but if one of the properties is set, all superior properties must also be set.

resources are located within the classification scheme by the values stored in the relevant properties.

the hierarchy can be derived by selecting all participants in the scheme and analyzing the property values.

Consistency is expected from members of the same collection. There are three levels of consistency:

the manner or system in which properties get their values and/or membership is asserted

the presence of properties

the meaning of property values

Automated extraction of information related to characteristic properties is a requirement for large holdings of information resources, and this requires a consistent, deterministic expression. Relying on a human reader to interpret information on a case by case basis is simple to set up but requires a high level of tedious and error-prone effort to maintain. Additionally, the properties which are only accessible to a human reader are not available for searching and filtering.

The criteria for collection membership must be consistently applied to all members of the collection. If criteria are to change or the collections are to evolve, the collection system itself should be versioned, and each member must be updated such that it is consistent with the new definition.

In the first case, a whitespace separated list is used and in the second case, HTML table markup is used. Other inconsistencies include the designation of the name of the property to which values are being assigned: sometimes a boldface string is used, sometimes a third-level heading is used. The boldfaced string is especially problematic for automatic processing because it is not guaranteed that all bold text signifies a property name. It is only that the bold text is on its own line with a colon after it which causes the human reader to infer that a list of states follows. Likewise, the human will immediately recognize that the states form a "list" and the table markup is merely a device to make the list more compact. There is no significance to states being in the same row or the same column.

In the situation where a human consumer is intended to interpret and process the information, authors are free to use any expression which effectively communicates to the human. When the intent is to supply information to a computer, authors must effectively communicate meaning to the computer. Once the computer understands the content, it can present the information to the human in a variety of ways.

Ideally, there is a one-to-one relationship between a concept and the property value which represents that concept. This ensures that the Producers who supply property values are following the same rules used by the Consumers who construct searches. Falling short of this ideal leads to the situation where producers could tag an article with the string "CO2", and Consumers search for "carbon dioxide". The Custodian will not return the desired information resources because Producers and Consumers used alternate means of expressing the same thing.

To solve this problem in the field of chemistry, the International Union of Pure and Applied Chemistry (IUPAC) has made an attempt to standardize the means of referring to chemical substances and compounds. This standard method is called the International Chemical Identifier (InChI). The fixed length equivalent of this identifier, InChIKey, looks much like a random string of characters. Both identifiers may be automatically generated from a description of the molecular structure itself. In addition, there is a human readable standard name. One public database of many chemical compounds and substances is Pubchem.

For the purposes of having the Producers and Consumers follow the same rules, it is clear that an algorithm which deterministically generates a text representation of a chemical given the structural information is preferable to a human readable name. Human nomenclature tends to evolve over time, and alternate names are often widely recognized by practitioners. Successfully searching by a chemical name may depend on knowing the person who entered the information into the collection of holdings, or knowing when it was entered.

Identification of chemical substances to humans and the computer may require "related properties." These properties should be synchronized to all refer to the same item or concept. In this example:

For the case where the number of allowable property values is manageably small, it is possible to enumerate the identifiers. Enumeration effectively controls the representation of a unique concept to the computer, but it does not communicate the concept itself to the human. For instance, setting the "Instrument" property to the value "1" effectively distinguishes instrument number one from all the other instruments. However, it does nothing to communicate anything about that instrument: what does it measure; what is the make, model, serial number of the instrument, when was its most recent calibration?

For each property, it is important to define the list of permissible meanings and the list of permissible representations for those meanings. The representation must be machine readable and distinct from all other representations of the same type. The meaning may be automatically comprehensible (as in a table) or written for human consumption (as in prose).

There are times when it is insufficient to simply define an enumeration of values. At times it is necessary to provide a more complete description about what the enumeration refers to. This strategy collects the descriptive information into one place and allows other managed objects simply to refer to it. When the reference is used as the value of a property, specific meaning can be conveyed: i.e., an "instruments used" property could refer to another object within the system; an "analytes" property could contain a link to a substance description on pubchem. In both cases, the target of the reference explains the meaning of the property value.

When it is necessary to describe specific instances, it is probable that a common "type" has already been defined. Similar if not identical information will be provided about each instance of the type. In the case of an "Instrument", a good, common set of properties might be: "Manufacturer", "Model", "Serial Number", "Last calibration date". It could also be that each instrument has additional/unique information which can be filled out, or which is not machine-interpretable.

There are two options for defining instances of a common type:

If the type is completely machine-comprehensible, the instances could be defined in a single table.

For the case where a tabular structure does not have the required flexibility, each instance must be described separately.

An automated custodian becomes more capable when it understands more about the content it is managing. The degree of comprehension can be visualized as a spectrum from "black box" to "completely described"

Black box: the system knows nothing about the contents of the item. The item can be stored and retrieved, but nothing else can be done. This is analogous to throwing all files in a single directory with cryptic names. The only method of searching for content is to perform a full text search for likely keywords.

Classified black box: the system knows very little about the contents of the item. Predefined groups of items are explicitly constructed. This is analogous to devising a directory structure or filing scheme, into which all items are filed. Full text searches may be combined with group. It is not possible for the system to form new groups which have an unanticipated organizing principle.

Tagged items: in addition to a predefined filing scheme, a free-text keywording system associates descriptive phrases with information resources.

Property-value associations: a generalization of "tagging", each item can be associated with a set of properties; each of which has it's own set of values. A property imposes an interpretation on it's value. The text of the "Author" property is expected to be the name of an author. Likewise, properties may have a "type": like a number or a date. Properties help to make searches and filters much more precise and narrow down the results. Properties serve as a summary of the salient characteristics of the managed item. A well-designed set of properties makes it possible to quickly locate information resources which are relevant in a context not anticipated by the Producers.

References to/associations of managed items: the system provides a way for one item to refer to another in a semantically meaningful way. The reference may be bidirectional or unidirectional, and may have arbitrary multiplicities on each end. Simple links are not considered to fall in this category. Named links, where the name refers to a specific relationship such as "related resources", "children", or "instruments used", are what is described here.

Types/interfaces/aspects: collections of property-value associations occur in combination to fully describe an aggregate concept. Information resources would be considered to implement the "interface" or possess the "aspect" if they had values for all required properties belonging to the interface or aspect definition. For instance, a temperature range would specify both a minimum and a maximum temperature; and an information resource would be considered to have the "temperature range" interface if it had both the required minimum and maximum temperature properties. Types, interfaces, and aspects could all be used to limit the legitimate targets of references and associations.

Database/Datastore: the information resource is completely described in a machine-comprehensible framework. Typically, much work has been invested in defining a common data structure and importing data into that structure. Searches and filters can be precise to the point of selecting a portion of data from a particular source. The information itself is represented in a tabular, XML, object oriented, or similar machine-comprehensible model.

Three main theorems present themselves:

The benefit of an information management system is directly proportional to the level of automated comprehension.

The level of automated comprehension is directly proportional to the amount of effort invested in furnishing information to the machine.

Benefit is largely realized by Consumers, but effort is largely expended by Producers. Custodians need only supply an appropriate tool.

Accessibility, in this case, is defined in terms of efficiently retrieving and operating on the data. In most cases, processes which are human-bound (e.g., spend the majority of the time waiting on human input, like word processing) will perform acceptably well with any form of online storage. Processes which are IO bound benefit primarily from increasing the bandwidth which can be sustained between the storage device and the processing unit.