Description

Cheshire3 is a fast XML search engine, written in Python for
extensibility and using C libraries for speed. Cheshire3 is
feature-rich, including support for XML namespaces, Unicode, a distributable
object-oriented model and all the features expected of a digital library
system.

Standards are foremost, including SRU and CQL, as well as Z39.50 and
OAI. It is highly modular and configurable, enabling very specific needs
to be addressed with a minimum of effort. The API is stable and fully
documented, allowing easy third party development of components.

Given a set of documents or records, Cheshire3 can extract data into one or
more indexes after processing with configurable workflows to add extra
normalization and processing. Once the indexes have been constructed, it
supports such operations as search, retrieve, browse and sort.

The abstract protocolHandler allows integration of Cheshire3 into any
environment that will support Python. For example, using Apache handlers
or WSGI applications, any interface from standard APIs like SRU, Z39.50
and OAI (all included by default in the cheshire3.web sub-package), to
an online shop front can be provided.

Previously, source code was available from our own Subversion server. The SVN
repository is being kept alive for the time being as read-only, and best
efforts will be made to keep it up-to-date with the master (i.e.
stable/production) branch from the Cheshire3 Git repository. It is available
at:

While step 4 should theoretically resolve dependencies, we’ve found it
more reliable to run this explicitly.

Requirements / Dependencies

Cheshire3 requires Python 2.6.0 or later. It has not yet been verified
as Python 3 compliant.

As of the version 1.0 release, Cheshire3’s Python dependencies should be
resolved automatically by the standard Python package management
mechanisms (e.g. pip, easy_install, distribute/setuptools).

However, on some systems, for example if installing on a machine without
network access, it may be necessary to manually install some third-party
dependencies. In such cases we would encourage you to download the
necessary Cheshire3 bundles from the Cheshire3 download site and install
them using the automated build scripts included. If the automated scripts
fail on your system, they should at least provide hints on how to resolve
the situation.

If you experience problems with dependencies, please get in touch via
the GitHub issue tracker or wiki, and we’ll do our best to help.

Additional / Optional Features

Certain features within the Cheshire3 Information Framework will have
additional dependencies (e.g. web APIs will require a web application
server). We’ll try to maintain an accurate list of these in the module
docstring of the __init__.py file in each sub-package.

The bundles available from the Cheshire3 download site should
continue to be a useful place to get hold of the source code for these
pre-requisites.

Documentation

If you downloaded the source code, either as a tarball or by checking
out the repository, you’ll find a copy of the Sphinx-based documentation in
the local docs directory.

There is additional documentation for the source code in the form of
comments and docstrings. Documentation for most default object
configurations can be found within the <docs> tag in the config XML
for each object. We would encourage users to take advantage of this tag
to provide documentation for their own custom object configurations.

Development

This section is for those who intend to develop code to contribute back
to Cheshire3.

Fix bugs in the develop branch, or develop new features in your own
feature branch and merge back into the develop branch.

Push your changes back to your GitHub fork.

Issue a pull request.

Developed code intended to be contributed back to Cheshire3 should
follow the recommendations made by the standard Style Guide for Python
Code (which includes the provision that guidelines may be ignored in
situations where following them would make the code less readable).

Particular attention should be paid to documentation and source code
annotation (comments). All developed modules, functions, classes, and
methods should be documented in the source code. Newly configured
objects at the server level should be documented using the <docs>
tag. Comments and documentation should be accurate and up-to-date, and
should never contradict the code itself.

Licensing

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.

Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.

Neither the name of the University of Liverpool nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS
IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

MARC Utilities

The following licensing conditions apply to the marc_utils module
included in the Cheshire3 package. In the following statements, “This
file” and “the Software” should be understood to mean marc_utils.py.

Permission is hereby granted, free of charge, to any person
obtaining a copy of this software and associated documentation files
(the “Software”), to deal in the Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, provided
that the above copyright notice(s) and this permission notice appear
in all copies of the Software and that both the above copyright
notice(s) and this permission notice appear in supporting
documentation.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE
COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR
ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY
DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,
WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE
OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder
shall not be used in advertising or otherwise to promote the sale,
use or other dealings in this Software without prior written
authorization of the copyright holder.

Examples

Command-line UI

Cheshire3 provides a number of command-line utilities to enable you to
get started creating databases, indexing and searching your data quickly.
All of these commands have full help available, including lists
of available options, which can be accessed using the --help option,
e.g.:

``cheshire3 --help``

Creating a new Database

cheshire3-init [database-directory]

Initialize a database with some generic configurations in the given
directory, or in the current directory if none is given.

Example 1: create database in a new sub-directory:

$ cheshire3-init mydb

Example 2: create database in an existing directory:

$ mkdir -p ~/dbs/mydb
$ cheshire3-init ~/dbs/mydb

Example 3: create database in current working directory:

$ mkdir -p ~/dbs/mydb
$ cd ~/dbs/mydb
$ cheshire3-init

Example 4: create database with descriptive information in a new
sub-directory:

Python API

This section contains examples of using the Cheshire3 API from within
Python, for embedding Cheshire3 services within a Python-enabled web
application framework, such as Django, CherryPy, mod_wsgi etc. or when
the command-line interface is simply insufficient.

Initializing Cheshire3 Architecture

Initializing the Cheshire3 Architecture consists primarily of creating
instances of the following types within the Cheshire3 Object Model:

Session

An object representing the user session. It will be passed around amongst
the processing objects to maintain details of the current environment.
It stores, for example, the user and the identifier of the database
currently in use.

Server

A protocol neutral collection of databases, users and their dependent
objects. It acts as an initial entry point for all requests and handles
such things as user authentication, and global object configuration.

The first thing that we need to do is create a Session and build a Server:

>>> from cheshire3.baseObjects import Session
>>> session = Session()

The Server looks after all of our objects, databases, indexes …
everything. Its constructor takes session and one argument, the filename
of the top level configuration file. You could supply your own, or you can
find the filename of the default server configuration dynamically as
follows::

Database

A virtual collection of Records which may be interacted with. A Database
includes Indexes, which contain data extracted from the Records as well
as configuration details. The Database is responsible for handling
queries which come to it, distributing the query amongst its component
Indexes and returning a ResultSet. The Database is also responsible for
maintaining summary metadata (e.g. number of items, total word count etc.)
that may be needed for relevance ranking etc.

Using the cheshire3 command

One way to ensure that the Cheshire3 architecture is initialized is to use the
Cheshire3 interpreter, which wraps the main Python interpreter, to run your
script or just drop you into the interactive console.

cheshire3 [script]

Run the commands in the script inside the current cheshire3
environment. If script is not provided, it will drop you into an interactive
console (very similar to the native Python interpreter). You can also tell
it to drop into interactive mode after executing your script using the
--interactive option.

When initializing the architecture in this way, session and server
variables will be created corresponding to instances of Session and Server
respectively.

Additionally, if you ran the script from inside a Cheshire3 Database
directory, or provided the Database identifier using the --database option,
the Database will be available as db. The default RecordStore will also be
available as recordStore if it was possible to discover from the Database.

Loading Data

In order to load data into your database you’ll need a document factory
to find your documents, a parser to parse the XML and a record store to
put the parsed XML into. The most commonly used are
defaultDocumentFactory and LxmlParser. Each database needs its own
record store:

data

This could be a filename, a directory name, the data as a string, a URL to
the data and so forth.

If data ends in [(numA):(numB)], and the preceding string is a filename,
then the data will be extracted from bytes numA through to numB (this is
pretty advanced though - you’ll probably never need it!)

cache

Setting for how to cache documents in memory when reading them in.
This will depend greatly on use case, e.g. if loading 3GB of documents on a
machine with 2GB of memory, full caching will obviously not work very well. On
the other hand, if loading a reasonably small quantity of data over HTTP,
full caching would read all of the data in one shot, closing the HTTP
connection and avoiding potential timeouts. Possible values:

0

No document caching. Just locate the data and get ready to discover
and yield documents when they’re requested from the documentFactory.
This is probably the option you’re most likely to want.

1

Cache location of documents within the data stream by byte offset.

2

Cache full documents.

format

The format of the data parameter. Many options, the most common are:

xml:

xml file. Can have multiple records in single file.

dir:

a directory containing files to load

tar:

a tar file containing files to load

zip:

a zip file containing files to load

marc:

a file with MARC records (library catalogue data)

http:

a base HTTP URL to retrieve

tagName

the name of the tag which starts (and ends!) a record. This is useful for
extracting sections of documents and ignoring the rest of the XML in the
file.

codec

the name of the codec in which the data is encoded. Normally ‘ascii’ or
‘utf-8’.
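Two of the parameters described above, the byte-range suffix on data and the tagName setting, can be pictured with a short standard-library sketch. This is not Cheshire3's implementation: split_byte_range and iter_records are hypothetical helpers introduced here purely for illustration, and the sample XML is made up.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: "file.xml[10:250]" -> name "file.xml", bytes 10 to 250.
_RANGE_RE = re.compile(r"^(?P<name>.+)\[(?P<start>\d+):(?P<end>\d+)\]$")


def split_byte_range(data):
    """Return (name, start, end); start and end are None without a suffix."""
    match = _RANGE_RE.match(data)
    if match is None:
        return data, None, None
    return (match.group("name"),
            int(match.group("start")),
            int(match.group("end")))


def iter_records(stream, tag_name):
    """Yield the serialized XML of each <tag_name> element in the stream.

    Everything outside the matching elements is ignored.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag_name:
            yield ET.tostring(elem).decode("ascii").strip()
            elem.clear()  # release the element once it has been yielded


# Made-up sample data: two records plus a header that should be skipped.
xml_bytes = (b"<collection><header>ignored</header>"
             b"<record><title>A</title></record>"
             b"<record><title>B</title></record></collection>")
records = list(iter_records(io.BytesIO(xml_bytes), "record"))
```

Whether the end of the byte range is inclusive is not stated above, so the sketch only parses the suffix and leaves the actual read to the caller.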

You’ll note above that the call to load returns itself. This is because
the document factory acts as an iterator. The easiest way to get to your
documents is to loop through the document factory::

Store the record in the recordStore. This assigns an identifier to it, by
default a sequential integer.

Add the record to the database. This stores database level metadata such
as how many words in total, how many records, average number of words per
record, average number of bytes per record and so forth.

Index the record against all indexes known to the database - typically all
indexes in the indexStore in the database’s ‘indexStore’ path setting.
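The store, add and index steps described above can be pictured with a toy in-memory model. None of the classes below are Cheshire3 classes: ToyRecordStore and ToyDatabase are hypothetical stand-ins that only mirror the responsibilities described (sequential identifiers, database-level metadata, and an index), and records are plain strings rather than parsed XML.

```python
class ToyRecordStore(object):
    """Stores records and assigns sequential integer identifiers."""

    def __init__(self):
        self.records = {}
        self.next_id = 0

    def create_record(self, text):
        rec_id = self.next_id
        self.records[rec_id] = text
        self.next_id += 1
        return rec_id


class ToyDatabase(object):
    """Keeps summary metadata and a single keyword index."""

    def __init__(self):
        self.total_records = 0
        self.total_words = 0
        self.index = {}  # term -> set of record identifiers

    def add_record(self, text):
        # Database-level metadata only; nothing is indexed here.
        self.total_records += 1
        self.total_words += len(text.split())

    def index_record(self, rec_id, text):
        for word in text.lower().split():
            self.index.setdefault(word, set()).add(rec_id)

    def mean_words_per_record(self):
        if not self.total_records:
            return 0.0
        return float(self.total_words) / self.total_records


store = ToyRecordStore()
toy_db = ToyDatabase()
for text in ["the cat sat", "the dog barked loudly"]:
    rec_id = store.create_record(text)   # store: assigns an identifier
    toy_db.add_record(text)              # add: updates summary metadata
    toy_db.index_record(rec_id, text)    # index: makes the record searchable
```

Keeping the three steps separate is what allows the summary metadata (used for relevance ranking) to be maintained independently of any particular index.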

Pre-Processing (PreParsing)

More often than not, documents will require some sort of pre-processing
step in order to ensure that they’re valid XML in the schema that you
want them in. To do this, there are PreParser objects which take a
document and transform it into another document.

The simplest preParser takes raw text, escapes the entities and wraps it
in a <data> element::

>>> from cheshire3.document import StringDocument
>>> doc = StringDocument("This is some raw text with an & and a < and a >.")
>>> pp = db.get_object(session, 'TxtToXmlPreParser')
>>> doc2 = pp.process_document(session, doc)
>>> doc2.get_raw(session)
'<data>This is some raw text with an &amp; and a &lt; and a &gt;.</data>'
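For comparison, the same escape-and-wrap transformation can be sketched with the Python standard library alone. This is only an illustration of what the result looks like, not how TxtToXmlPreParser is implemented; text_to_xml is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def text_to_xml(text, tag="data"):
    """Escape &, < and > in text and wrap the result in <tag>...</tag>."""
    return "<{0}>{1}</{0}>".format(tag, escape(text))


wrapped = text_to_xml("This is some raw text with an & and a < and a >.")
print(wrapped)
```

The printed string matches the output of the doctest above.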

Searching

In order to allow for translation between query languages (if possible)
we have a query factory, which defaults to CQL (SRU’s query language,
and our internal language):

This transformer uses XSLT, which is common, but other transformers are
equally possible.

Indexes

While Searching is the primary use of an Index, there are other API methods
that can be used to get information from an Index in slightly different forms
that can be useful when developing a user interface. This section describes
those API methods and then shows how to really get your hands dirty by
Looking Under the Hood and getting direct access to some of the object types
that are used to process data within an Index.

Browsing

It is possible to browse through all terms in an index, just like reading the
index in a book. This is usually done through the scan method of a Database
object, so as to make use of the normal Index resolution machinery:

terms will be a list of no more than 25 items representing the terms
from the start of the Index that was resolved from the context dc.title
(by convention the Dublin-Core definition of “title”; the title of a piece of
work.) Each item in terms is a 2-item list:

The unicode representation of the term

A 3-item list:
0. internal numeric term id
1. number of records the term appears in
2. total number of occurrences of the term across the database

e.g.:

[u"zen and the art of motorcycle maintenance", [12345, 2, 3]]
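A minimal sketch of consuming a list in this shape, for example to build a browse display. The terms list is typed in by hand here (the second entry is made up), and summarize_terms is a hypothetical helper, not part of the Cheshire3 API.

```python
# Hand-written data in the shape described above:
# [term, [term id, number of records, total occurrences]]
terms = [
    [u"zen and the art of motorcycle maintenance", [12345, 2, 3]],
    [u"zoology for beginners", [12346, 1, 1]],  # made-up entry
]


def summarize_terms(terms):
    """Return one display line per term."""
    lines = []
    for term, (term_id, n_records, n_occurrences) in terms:
        lines.append(u"{0} ({1} records, {2} occurrences)".format(
            term, n_records, n_occurrences))
    return lines


for line in summarize_terms(terms):
    print(line)
```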

It is also possible to use the scan method of an Index object directly:

The resulting terms will be the same as when obtained through the scan
method of the Database object.

Facets and Filtering

Assuming that you have configured your Index with the setting vectors set to
1, it is possible to obtain search facets for the Index. That is to say that
given a ResultSet obtained from a search, one can obtain a list of the terms
that occur within the Records in that ResultSet. This list can be used to
present a search user with options for refining their search:

The resulting facets will be a list representing the 5 terms that occur in
the highest number of Records within the ResultSet. Setting nTerms to 0
(or omitting it) will return all terms within the Index for the Records within
the ResultSet. Each item in facets is a 2-item list:

The unicode representation of the term

A 3-item list:
0. internal numeric term id
1. number of records the term appears in
2. total number of occurrences of the term across the database

e.g.:

[u"Crichton, Michael", [54321, 3, 24]]
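The idea behind facets can be sketched in plain Python, assuming each Record in a result set is represented by a term vector (a mapping of term to number of occurrences). This is an illustration of the counting only, not Cheshire3's implementation: facets_from_vectors and the sample vectors are hypothetical, and the internal term id is omitted from the output.

```python
from collections import Counter


def facets_from_vectors(vectors, n_terms=0):
    """Rank terms by the number of records they occur in.

    Returns [term, [number of records, total occurrences]] items.
    An n_terms of 0 returns every term.
    """
    record_counts = Counter()
    occurrence_counts = Counter()
    for vector in vectors:
        for term, occurrences in vector.items():
            record_counts[term] += 1
            occurrence_counts[term] += occurrences
    # Most records first; ties are broken alphabetically for stable output.
    ranked = sorted(record_counts, key=lambda t: (-record_counts[t], t))
    if n_terms:
        ranked = ranked[:n_terms]
    return [[t, [record_counts[t], occurrence_counts[t]]] for t in ranked]


# Made-up term vectors for three records in a result set.
vectors = [
    {u"Crichton, Michael": 8, u"Thrillers": 1},
    {u"Crichton, Michael": 9},
    {u"Crichton, Michael": 7, u"Dinosaurs": 2},
]
facets = facets_from_vectors(vectors, n_terms=2)
```

Ranking by record count rather than by total occurrences mirrors the description above, where the top terms are those that occur in the highest number of Records.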

Looking Under the Hood

Configuring Indexes, and the processing required to populate them,
requires some further object types, such as Selectors, Extractors,
Tokenizers and TokenMergers. Of course, one would normally configure
these for each index in the database and the code in the examples below
would normally be executed automatically. However it can sometimes be
useful to get at the objects and play around with them manually,
particularly when starting out to find out what they do, or figure out
why things didn’t work as expected, and Cheshire3 makes this possible.

Selector objects are configured with one or more locations from which
data should be selected from the Record. Most commonly (for XML data at
least) these will use XPaths. A selector returns a list of lists, one
for each configured location:
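The "list of lists" shape can be illustrated with the standard library's ElementTree, which supports a limited subset of XPath. This is a sketch under that assumption rather than a Cheshire3 Selector: select is a hypothetical helper and the sample record is made up.

```python
import xml.etree.ElementTree as ET


def select(root, paths):
    """Return one list of matching elements per configured path."""
    return [root.findall(path) for path in paths]


# Made-up record with one title and two authors.
root = ET.fromstring(
    "<record><title>Zen</title>"
    "<author>Pirsig</author><author>Anon</author></record>"
)
selected = select(root, ["title", "author"])

# One inner list per path, in the order the paths were given.
texts = [[elem.text for elem in group] for group in selected]
print(texts)
```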

However we need the text from the matching elements rather than the XML
elements themselves. This is achieved using an Extractor, which
processes the list of lists returned by a Selector and returns a
dictionary (a.k.a. an associative array or hash)::

Although the key at the beginning looks the same, the value is now a
list of tokens from the key, in order. We then have to merge those
tokens together, such that we have ‘the’ as the key, and the value has
the locations of that type:
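The tokenize-then-merge idea can be sketched in plain Python. The functions and the dictionary keys below ("occurrences", "positions") are hypothetical and chosen for readability; the structures produced by Cheshire3's own Tokenizer and TokenMerger objects may differ.

```python
def tokenize(extracted):
    """Map each extracted string to its list of tokens, in order."""
    return dict((text, text.lower().split()) for text in extracted)


def merge_tokens(tokenized):
    """Turn token lists into a token -> occurrences/positions mapping."""
    merged = {}
    for tokens in tokenized.values():
        for position, token in enumerate(tokens):
            entry = merged.setdefault(
                token, {"occurrences": 0, "positions": []})
            entry["occurrences"] += 1
            entry["positions"].append(position)
    return merged


tokenized = tokenize(["The cat chased the other cat"])
merged = merge_tokens(tokenized)
print(merged["the"])
```

After merging, each distinct token (such as "the") appears once as a key, with its count and the word positions at which it occurred.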

This example will have the effect of ‘touching’ each Record, as if it had
been updated. This might be useful if, for example, you knew that your Database
was being harvested periodically using OAI-PMH, and you wanted to indicate that
all Records should be reharvested next time.