NAME

Catmandu::Introduction - An introduction to Catmandu

INTRODUCTION

Importing, transforming, storing and indexing data should be easy.

Catmandu provides a suite of Perl modules to ease the import, storage, retrieval, export and transformation of metadata records. Combine Catmandu modules with web application frameworks such as PSGI/Plack, document stores such as MongoDB and full-text indexes such as Solr to create a rapid development environment for digital library services such as institutional repositories and search engines.

WHERE DO WE USE IT?

The goal of the LibreCat project is to provide an open source set of programming components for building digital library services suited to your local needs. Here are some examples of what we are working on:

Extract, Transform and Load

To create a search engine, one of your first tasks will be to import data from various sources, map the fields to a common data model and post the result to a full-text search engine. Perl modules such as WebService::Solr or ElasticSearch provide easy access to your favorite document stores, but you keep writing a lot of boilerplate code to create the connections, massage the incoming data into the correct format, and validate, upload and index the data in the database. The next morning you are asked to provide a fast dump of records into an Excel worksheet. After some fixes are applied, you are asked to upload it into your database. Again you open Emacs or Vi and write an ad hoc script.

In our LibreCat group we saw this workflow over and over. We tried to abstract this problem into a set of Perl tools that work with library data formats such as MARC, Dublin Core and EndNote, protocols such as OAI-PMH and SRU, and repositories such as DSpace and Fedora. In data warehousing these processes are called ETL: Extract, Transform, Load. Many ETL tools currently exist, but none address typical library data models and services.

Copy and Paste

As programmers, we would like to reuse our code and algorithms as easily as possible. In fast application development you typically want to copy and paste parts of existing code into a new project. In Catmandu we use a functional style of programming to keep our code tight, clean and suitable for copying and pasting. When working with library data models we use native Perl hashes and arrays to pass data around. In this way we adhere to the rationale of Alan J. Perlis: "It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures." Our code is based on a few primary data structures on which we define many functions (map, count, each, first, take, ...).
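As a minimal sketch of this style, the following script wraps a plain Perl array of hashes in a Catmandu::ArrayIterator and applies a few of the functions named above. The sample records and variable names are hypothetical and serve only as illustration.

```perl
use strict;
use warnings;
use Catmandu::ArrayIterator;

# Hypothetical sample records: plain Perl hashes in a plain Perl array
my $records = Catmandu::ArrayIterator->new([
    { title => 'Relativity',    year => 1916 },
    { title => 'Radioactivity', year => 1910 },
    { title => 'Cybernetics',   year => 1948 },
]);

# Many small functions operating on the same data structure
my $count  = $records->count;    # number of records
my $first  = $records->first;    # first record (a hash reference)
my $titles = $records->map(sub { $_[0]->{title} })    # extract the titles
                     ->take(2)                        # keep the first two
                     ->to_array;                      # collect into an array reference

printf "%d records, first title: %s\n", $count, $first->{title};
print "Title: $_\n" for @$titles;
```

Because every step takes and returns the same kind of object, the calls can be chained or copied into another script without adaptation.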

Schemaless databases

Working with native Perl hashes and arrays, we would like an easy mechanism to store and index this data in a database of choice. In the past it was a nuisance to create database schemas and indexes to store and search your data. In institutional repositories in particular this can be an ongoing job for a programmer, because the metadata schemas are not fixed in time. Any new report will require you to add new data fields and new relations, for which you need to change your database schema. With the introduction of schemaless databases, storing complex records has become really easy. Create a Perl hash, execute the add function, and your record is stored in the database. Execute get to load a Perl hash from the database into memory. With our ElasticSearch plugin we can even provide a CQL-style query language for retrieval.

Importer

Importers are Catmandu packages that read data into an application. We provide importers for MARC, JSON, YAML, CSV and Excel, but also for Atom and OAI-PMH endpoints. As an example, let's create a Perl script to read a YAML file containing an array of values. We use the each function to loop through all the items.
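A minimal sketch of such a script follows. To keep it self-contained it first writes a small temporary YAML file. The sample names, the temporary file and the layout of one YAML document per record are assumptions made for illustration, so adjust them to whatever your installed YAML importer expects.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Catmandu::Importer::YAML;

# Hypothetical sample input: one YAML document per record
my ($fh, $file) = tempfile(SUFFIX => '.yaml', UNLINK => 1);
print {$fh} <<'YAML';
---
name: Albert Einstein
---
name: Marie Curie
...
YAML
close $fh;

my $importer = Catmandu::Importer::YAML->new(file => $file);

# each calls the callback once per record and returns the number of records seen
my $count = $importer->each(sub {
    my $record = shift;
    print "Name: $record->{name}\n";
});

print "Read $count YAML items\n";
```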

Iterable

The Iterable package provides many list methods to process large streams of records. Most of the methods are lazy if the underlying data stream supports it. Although all data in Catmandu are native Perl hashes and arrays, it can be impractical to load a result set of thousands of records into memory. Most Catmandu packages, such as Importer, Exporter and Store, therefore provide an Iterable implementation.

Using a Mock importer we can generate some Perl hashes on-the-fly and show the functionality provided by Iterable:

    use Catmandu::Importer::Mock;
    my $it = Catmandu::Importer::Mock->new(size => 10);

With each you can loop over all the items in an iterator:

    $it->each(sub {
        printf "My n is %d\n", shift->{n};
    });

Using any, many and all you can test for the existence of items in an iterator:
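The sketch below shows one way to call these three methods on the Mock importer. The helper fresh is hypothetical: it builds a new iterator for every test, on the assumption that an importer may only be consumed once in the installed Catmandu release.

```perl
use strict;
use warnings;
use Catmandu::Importer::Mock;

# Hypothetical helper: a fresh 10-record iterator for every test
# (assumption: an importer may only be consumed once)
sub fresh { Catmandu::Importer::Mock->new(size => 10) }

# any: true if at least one item satisfies the callback
my $any = fresh()->any(sub { shift->{n} > 4 });

# many: true if more than one item satisfies the callback
my $many = fresh()->many(sub { shift->{n} > 2 });

# all: true only if every item satisfies the callback
my $all = fresh()->all(sub { shift->{n} % 2 == 0 });

printf "any  n > 4   : %s\n", $any  ? 'TRUE' : 'FALSE';
printf "many n > 2   : %s\n", $many ? 'TRUE' : 'FALSE';
printf "all  n even  : %s\n", $all  ? 'TRUE' : 'FALSE';
```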

Fix

Fixes can be used for easy data manipulation by non-programmers. Using a small Perl DSL, librarians can use Fix routines to manipulate data objects. A plain text file of fixes can be created to specify all the data manipulations that need to be executed to 'massage' the data into the desired format.

As an example we will import data from a MARC file and change some metadata fields using Fix routines. Here is the code to run the example:
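A minimal sketch of such a script is given here. The file names marc.txt and marc.fix come from this section; the ALEPHSEQ type is an assumption about the format of marc.txt, and the script expects both files in the current directory.

```perl
use strict;
use warnings;
use Catmandu::Fix;
use Catmandu::Importer::MARC;
use Data::Dumper;

# Load the fixes from the plain text file marc.fix
my $fixer = Catmandu::Fix->new(fixes => ['marc.fix']);

# Assumption: marc.txt is in Aleph sequential format; change type if it is not
my $importer = Catmandu::Importer::MARC->new(file => 'marc.txt', type => 'ALEPHSEQ');

# Apply the fixes to every imported record and dump the result
$fixer->fix($importer)->each(sub {
    my $record = shift;
    print Dumper($record);
});
```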

We need two files as input: marc.txt is a file containing MARC records and marc.fix contains the fixes that need to be applied to each MARC record. Let's take a look at the contents of this marc.fix file:
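The file itself is not reproduced here. One possible marc.fix that is consistent with the description that follows might look like the sketch below. Lines 2 and 5 are assumptions added for illustration, and the exact marc_map path syntax may differ between Catmandu releases.

```text
marc_map('100','authors.$append')
marc_map('710','authors.$append')
marc_map('600x','subjects.$append')
marc_map('008_/35-37','language')
remove_field('record')
```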

The fixes in this file are specialized in MARC processing. In line 1 we map the contents of the MARC-100 field into a deeply nested Perl hash with key 'authors'. In line 3 we map the contents of the MARC-600 x-subfield into the 'subjects' field. In line 4 we read characters 35 to 37 from the MARC-008 control field into the 'language' key.

Store

As explained in the introduction, one of the rationales for creating Catmandu is to ease the serialization of records in our database of choice. The introduction of schemaless databases made the storage of complex records quite easy. Before we delve into this type of database we need to show you what syntax Catmandu is using to store data.

As an example, let's create the simplest storage mechanism possible: an in-memory hash. We use this mock 'database' to show some of the features that any Catmandu::Store has. First we will create a YAML importer as shown above to import records into an in-memory hash store:
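A minimal, self-contained sketch of this step follows. The temporary YAML file and the sample names are hypothetical, and the layout of one YAML document per record is an assumption about what the installed importer accepts.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Catmandu::Importer::YAML;
use Catmandu::Store::Hash;

# Hypothetical sample input: one YAML document per record
my ($fh, $file) = tempfile(SUFFIX => '.yaml', UNLINK => 1);
print {$fh} "---\nname: Albert Einstein\n---\nname: Marie Curie\n...\n";
close $fh;

my $importer = Catmandu::Importer::YAML->new(file => $file);
my $store    = Catmandu::Store::Hash->new;

# Store every item produced by an iterable
$store->bag->add_many($importer);

# Store an array of Perl hashes
$store->bag->add_many([ { name => 'John' }, { name => 'Peter' } ]);

# Store a single Perl hash
$store->bag->add({ name => 'Patrick' });

my $total = $store->bag->count;
print "Stored $total records\n";
```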

Each Catmandu::Store has one or more compartments (comparable to tables) in which data is stored; such a compartment is called a bag. We use the function add_many to store each item of the importer Iterable in the Store. We can also store an array of Perl hashes with the same command, or store a single hash with the add method.

Each bag is an Iterator, so you can apply any of the methods shown above (each, any, all, ...) to read data from a bag.

    $store->bag->take(3)->each(sub {
        my $obj = shift;
        # .. your code
    });

When you store a Perl hash in a Catmandu::Store, an identifier field '_id' is added to the hash. This identifier can be used to retrieve the item at a later stage. Let's take a look at the identifier and how it can be used.

    # First store a Perl hash; add returns the stored item, including its identifier
    my $item = $store->bag->add({ name => 'Patrick' });

    # This will show you a UUID like '414003DC-9AD0-11E1-A3AD-D6BEE5345D14'
    print $item->{_id}, "\n";

    # Now you can use this identifier to retrieve the object from the store
    my $item2 = $store->bag->get($item->{_id});
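As a self-contained variation on the snippet above, this sketch performs a full round trip against an in-memory hash store and also removes the record again. The variable names are hypothetical, and the format of the generated identifier may differ between Catmandu releases.

```perl
use strict;
use warnings;
use Catmandu::Store::Hash;

my $bag = Catmandu::Store::Hash->new->bag;

# add returns the stored hash, now carrying a generated _id
my $item = $bag->add({ name => 'Patrick' });
my $id   = $item->{_id};
print "Stored with identifier $id\n";

# get looks the record up by its identifier
my $copy = $bag->get($id);
print "Retrieved: $copy->{name}\n";

# delete removes the record; a later get returns undef
$bag->delete($id);
my $gone = $bag->get($id);
print "Record deleted\n" unless defined $gone;
```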

And that is how it works. Catmandu::Store has some more functionality to delete items and query the store (if the backend supports it), but this is how you can store very complex Perl structures in memory or on disk with just a few lines of code. As a complete example we can show how easy it is to store data in a full-text search engine like ElasticSearch.

In this example we will download ElasticSearch version 0.19.3 and install it on our system:

This last example will print 'Albert Einstein' as its result. Clinton Gormley did some great work in providing a Perl client for ElasticSearch. Searching complex objects can be done by using a dot syntax, e.g. record.titles.0.subtitle:"My Funny Valentine". The beauty of ElasticSearch is that it is completely painless to set up and requires no schema: indexing data is simply done by using JSON over HTTP. All your fields are indexed automatically.

