2.
Welcome
During the next few hours, we will create a small aggregated index
from scratch.
You can code along if you like. Code, data and slides are distributed
in a VM (on a USB stick).

3.
Why
At Leipzig University Library we built a version that serves as a
successor to a commercial product.
Index includes data from Crossref, DOAJ, JSTOR, Elsevier, Genios,
Thieme, DeGruyter among others.
About 55% of our holdings are covered; potentially growable in
breadth and depth.

4.
Format
We will use a combination of
slides to motivate concepts and
live coding and experimentation
–
We will not use a product, we will build it.
Goals
a running VuFind 3 with a small aggregated index
learn about a batch processing framework

7.
Import Appliance
On the USB stick you can find an OVA file that you can import
into VirtualBox (or try to download it from https://goo.gl/J7hcYC).
This VM contains:
a VuFind 3 installation – /usr/local/vufind
raw metadata (around 3M records) – ~/Bootcamp/input
scripts and stubs for processing – ~/Bootcamp/code
these slides – ~/Bootcamp/slides.pdf

13.
Intro: Immutability
immutability = data is not modified after it is created
immutable data has some advantages, e.g.
“human fault tolerance”
performance
our use case: recompute everything from raw data
tradeoff: more computation, but less to think about
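The recompute-from-raw idea can be sketched in a few lines of plain Python (this is an illustration of the principle, not the bootcamp code; file names and the cleaning step are invented):

```python
import json
from pathlib import Path

def normalize(raw: Path, out: Path) -> None:
    """Derive a cleaned artifact from raw data.

    The raw file is treated as read-only; we only ever write new
    artifacts. If the cleaning logic changes, we simply recompute
    `out` from `raw` -- there is no in-place state to reason about.
    """
    records = [json.loads(line) for line in raw.read_text().splitlines()]
    cleaned = [{**r, "title": r.get("title", "").strip()} for r in records]
    out.write_text("\n".join(json.dumps(r) for r in cleaned) + "\n")
```

The cost is recomputation; the gain is that any artifact can be reproduced from the raw data at any time.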

14.
Intro: Frameworks
many libraries and frameworks for batch processing and
scheduling, e.g. Oozie, Azkaban, Airflow, luigi, . . .
even more tools when working with streams, e.g. Kafka, various
queues, . . .
luigi is nice, because it has only a few prerequisites
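luigi's core model is a task with requires(), output(), and run(); a task whose output already exists counts as complete and is skipped. The idea can be mimicked in a few lines of plain Python (a sketch of the concept, not luigi's actual implementation):

```python
import os

class Task:
    """Minimal stand-in for the task model luigi provides."""
    def requires(self):   # upstream tasks this task depends on
        return []
    def output(self):     # path of the artifact this task produces
        raise NotImplementedError
    def run(self):        # produce the artifact
        raise NotImplementedError
    def complete(self):   # a task is done iff its output exists
        return os.path.exists(self.output())

def build(task):
    """Run dependencies first, then the task itself, skipping completed ones."""
    for dep in task.requires():
        build(dep)
    if not task.complete():
        task.run()
```

In luigi, outputs are Targets and a scheduler handles parallelism and failure; the dependency-then-run logic above is the essence.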

18.
Intro: Incremental Development
when we work with unknown data sources, we have to
gradually move forward

19.
Intro: Wrap up
many approaches to data processing
we will focus on one library only here
concepts like immutability, recomputation and incremental
development are more general
–
now back to the code

22.
Setup wrap-up
You can now edit Python files on your guest (or host) and run them
inside the VM. You can start and stop VuFind inside the VM and
access it through a browser on your host.
We are all set to start exploring the data and to write some code.

23.
Bootcamp outline
parts 0 to 6: intro, crossref, doaj, combination, licensing, export
each part is self-contained, although we will reuse some
artifacts

24.
Bootcamp outline
you can use the scaffold{0-6}_. . . files, if you want to code along
the part{0-6}_. . . files contain the target code
code/part{0-6}_....py
code/scaffold{0-6}_....py

32.
Coding: Part 2 Recap
used command line tools (fast, simple interface)
chained three tasks together
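Shelling out to fast command-line tools from Python is a pattern worth showing on its own. The sketch below pipes a file through `sort | uniq -c` (the actual part 2 tasks use different tools; this only illustrates the pattern):

```python
import shlex
import subprocess

def count_unique_lines(path: str) -> str:
    """Let the shell chain fast CLI tools: sort the file, count duplicates."""
    pipeline = "sort {} | uniq -c".format(shlex.quote(path))
    return subprocess.check_output(pipeline, shell=True, text=True)
```

The shell handles the piping and streaming; Python only orchestrates, which keeps the simple interface and the performance of the tools.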

33.
Excursion: Normalization
suggested and designed by our system librarian
internal name: intermediate schema –
https://github.com/ubleipzig/intermediateschema
enough fields to accommodate various inputs
can be extended carefully, if necessary
tooling (licensing, export, quality checks) only for a single
format
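The idea is one target format that every input maps into, so downstream tooling is written once. A sketch of such a mapping (the field names below are illustrative only; the real intermediate schema is defined in the repository linked above):

```python
def to_intermediate(record: dict, source_id: str) -> dict:
    """Map a source-specific record to a common target format.

    Field names are made up for illustration; consult the
    intermediateschema repository for the actual definition.
    """
    return {
        "finc.source_id": source_id,
        "finc.record_id": record["id"],
        "rft.atitle": record.get("title", ""),
        "rft.issn": record.get("issn", []),
    }
```

Each new data source only needs one such converter; licensing, export and quality checks stay untouched.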

36.
Coding: Part 3
This source is not batched and comes in a single file, so it is a bit
simpler:
locate file
convert to intermediate schema
$ python part3_require.py

37.
Coding: Part 3 Recap
it is easy to start with static data
business logic in python, can reuse any existing python library

38.
Coding: Part 4
after normalization, we can merge the two data sources
$ python part4_combine.py
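A combination step can be as simple as concatenating the normalized files while keeping the first record per id. A sketch of that idea (part4_combine.py may well do this differently):

```python
import json

def combine(paths, out_path):
    """Concatenate line-delimited JSON files, keeping the first record per id."""
    seen = set()
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    record = json.loads(line)
                    if record["id"] in seen:
                        continue  # duplicate across sources, skip
                    seen.add(record["id"])
                    out.write(line)
```

Because both inputs share one format, merging needs no per-source logic.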

39.
Coding: Part 4 Recap
a list of dependencies
python helps with modularization
using the shell for performance and to reuse existing tools

40.
Coding: Part 5
licensing turned out to be an important issue
a complex topic
we need to look at every record, so it is performance critical
we use AMSL for ERM, and are on the way to a self-service
interface
AMSL has great APIs
we convert collection information to an expression-tree-ish
format – https://is.gd/Fxx0IU, https://is.gd/ZTqLqB

42.
Coding: Part 5
boolean expression trees allow us to specify complex licensing
rules
the result is a file in which each record is annotated with an ISIL
at Leipzig University Library we currently do this for about 20
ISILs
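The shape of such an expression tree can be sketched as nested and/or nodes over simple field predicates, evaluated once per record (the actual tree format is in the gists linked above; the field names and the rule below are invented):

```python
def holds(expr: dict, record: dict) -> bool:
    """Evaluate a boolean expression tree against a single record."""
    if "and" in expr:
        return all(holds(e, record) for e in expr["and"])
    if "or" in expr:
        return any(holds(e, record) for e in expr["or"])
    # leaf predicate: does a record field have a given value?
    return record.get(expr["field"]) == expr["value"]

def annotate(record: dict, rules: dict) -> dict:
    """Attach every ISIL whose licensing expression matches the record."""
    record["isils"] = [isil for isil, expr in rules.items() if holds(expr, record)]
    return record
```

Since this runs over every record, the per-record evaluation must stay cheap; trees of predicates keep the rules declarative while remaining fast to check.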