Case Study: Data Mining with Criminal Intent

Project definition: text mining is the derivation of
structured, meaningful data from a large body of unstructured
data, using automated, analytical methods.

Some writers, such as Hearst (2003; http://people.ischool.berkeley.edu/~hearst/text-mining.html),
distinguish text mining from data mining: the latter refers to
structured data (such as databases) and the former to
unstructured text files. The Data Mining With Criminal Intent
project team does not seem to observe this distinction. As there
is no clear consensus on the terminology, we will not
pursue it further.

Text mining involves a variety of techniques for analysing data.
Once a corpus has been assembled (the assembly may itself be
considered part of text mining), the texts are normally prepared.
This may involve tokenisation (dividing the text into the required
elements, such as words: for example, deciding whether
book-end is one token or two, book and
end) or stemming and lemmatisation (treating different forms
of a word as one entity, such as the verb forms fight, fought
and fighting).
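
The two preparation steps described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: the function names `tokenise` and `normalise` and the tiny lookup table `VERB_FORMS` are hypothetical, and are not taken from any tool used by the project. Real stemmers and lemmatisers rely on much larger rule sets or dictionaries.

```python
import re

def tokenise(text, split_hyphens=False):
    """Split text into lowercase word tokens.

    With split_hyphens=False, 'book-end' stays one token;
    with split_hyphens=True it becomes 'book' and 'end'.
    """
    pattern = r"[a-z]+" if split_hyphens else r"[a-z]+(?:-[a-z]+)*"
    return re.findall(pattern, text.lower())

# Hypothetical lookup table mapping inflected forms to one base form.
VERB_FORMS = {"fought": "fight", "fighting": "fight", "fights": "fight"}

def normalise(token):
    """Return the base form of a token if it is in the lookup table."""
    return VERB_FORMS.get(token, token)

one_token = tokenise("A book-end fell")
two_tokens = tokenise("A book-end fell", split_hyphens=True)
base_forms = [normalise(t) for t in tokenise("They fought, and are fighting still")]
```

The choice between the two tokenisation modes matters later on: word counts for "book" and "end" differ depending on which mode was used when the corpus was prepared.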

Statistical techniques are used to extract grammatical
information (for example, by automatic parsing to determine parts
of speech) or semantic information (for example, through sentiment
analysis or spam filtering). Text mining can also discover
associations between texts, or between the entities within texts,
such as people and places.

The UK’s National Centre for Text Mining (NaCTeM), based at the
University of Manchester, is a good resource for information
about text mining and also provides text mining tools freely to
UK Higher Education institutions: http://www.nactem.ac.uk/.

Project description

Data Mining With Criminal Intent is a project to produce a
research environment for the Old Bailey Online (http://www.oldbaileyonline.org/),
an online edition of the texts of criminal trials held at the
central London court between 1674 and 1913. Since the Old Bailey
Online contains records of almost 200,000 trials (amounting to
127 million words), it is a good candidate for automated
procedures that interrogate the corpus and obtain more nuanced
results than would be possible using the available search
facilities alone.

Data Mining With Criminal Intent is an international
project in which collaborators from different institutions
worked on different aspects. The project was funded by JISC,
NEH and SSHRC (higher education funding bodies for the UK, US
and Canada, respectively) and involved team members from each
of those countries.

The declared aims of the research environment are that it will
allow a user to:

Query the records of the Old Bailey Online.

Save the results of the queries to a Zotero account.

Send selected results and texts to Voyant Tools for analysis.

A further key theme of the project is that it should allow what
is referred to as the ‘ordinary working historian’ to incorporate
text mining into their work. The project considers that, if
digital research tools are to transform the humanities, it is
essential that the ordinary working historian is empowered to use
them.

The project did not develop any new tools, but worked on existing
ones, with the focus on making them work seamlessly together.

Use of tool

The heart of the text mining part of the project is the API that
was developed to increase the flexibility of queries and allow
for subsequent processing to be applied to the results. The OBAPI
(http://www.oldbaileyonline.org/obapi/)
is available as a demonstrator webpage, where users can
construct advanced searches; the advantage over a conventional
advanced search is that the demonstrator facilitates export of
data to Zotero and Voyant.

The Web API can be used in the conventional way, by passing
arguments in a URL; it returns data in JSON (JavaScript Object
Notation) format. For example (slightly modifying an example given at
http://www.oldbaileyonline.org/static/DocAPI.jsp),
this URL:

returns results 4 to 13 of the trial texts that contain the word
‘Sheffield’ and fall under the offence category ‘deception’,
together with a breakdown by offence subcategory. The URL
arguments begin after the ? and are separated by &, so we can
break this URL query down as follows:

| part of URL | meaning | notes |
|---|---|---|
| term0=trialtext_sheffield | the first term, ‘Sheffield’, should appear anywhere in a trial text | term indexing begins at zero, so the first term is term 0 |
| term1=offcat | the second term should appear in the offence category field | clearly it is not possible to guess the naming conventions of these categories; they have to be derived from the API documentation |
| breakdown=offsubcat | include offence subcategories | |
| count=10 | give 10 results | the default is to return all results if this argument is omitted |
| start=3 | start from the fourth result | defaults to 0 if this argument is omitted |
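
The way these arguments combine into a query URL can be sketched in Python. This is an illustration only, under two stated assumptions: the endpoint path in `BASE_URL` is not given in the text above and should be checked against the API documentation, and the value `deception` is attached to `offcat` by analogy with the `field_value` pattern of `term0`. The helper `build_query` is a hypothetical name; it only builds the URL string and makes no network request.

```python
from urllib.parse import urlencode

# Assumed endpoint: not stated in the text above, check the API documentation.
BASE_URL = "http://www.oldbaileyonline.org/obapi/ob"

def build_query(terms, breakdown=None, count=None, start=None):
    """Build a query URL from a list of (field, value) search terms."""
    params = []
    # Terms are numbered from zero: term0, term1, ...
    for index, (field, value) in enumerate(terms):
        # Assumed pattern: field name and value joined by an underscore.
        params.append((f"term{index}", f"{field}_{value}"))
    if breakdown is not None:
        params.append(("breakdown", breakdown))
    if count is not None:
        params.append(("count", count))
    if start is not None:
        params.append(("start", start))
    return BASE_URL + "?" + urlencode(params)

url = build_query(
    [("trialtext", "sheffield"), ("offcat", "deception")],
    breakdown="offsubcat",
    count=10,
    start=3,
)
```

Because `start` is zero-based, requesting the next page of ten results would mean calling the helper again with `start=13`.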

The results from this query look like this (although the use of
JSON format may sound off-putting, the results are actually very
easy to read):

{ "total" : 84,
  "breakdown" : [
    { "term" : "fraud", "total" : 44 },
    { "term" : "forgery", "total" : 31 },
    { "term" : "bankrupcy", "total" : 7 },
    { "term" : "perjury", "total" : 4 },
    { "term" : "receiving", "total" : 2 },
    { "term" : "simpleLarceny", "total" : 1 },
    { "term" : "stealingFromMaster", "total" : 1 }
  ],
  "highlight" : [
    "sheffield"
  ],
  "hits" : [
    "t18110529-18",
    "t18171029-84",
    "t18300527-110",
    "t18310106-4",
    "t18381022-2382",
    "t18390617-1733",
    "t18440101-380",
    "t18470614-1435",
    "t18510407-820",
    "t18511027-1888"
  ]
}
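
A response of this shape is also easy to process programmatically. The following Python sketch parses an abridged copy of the sample above with the standard `json` module. The variable names are illustrative, and reading the four digits after the leading `t` of a trial identifier as a year is an assumption based on the shape of the identifiers, not something stated in the text.

```python
import json

# Abridged copy of the sample response shown above.
response_text = """
{ "total": 84,
  "breakdown": [
    {"term": "fraud", "total": 44},
    {"term": "forgery", "total": 31}
  ],
  "hits": ["t18110529-18", "t18171029-84"]
}
"""

data = json.loads(response_text)

# Map each offence subcategory to its number of matching trials.
subcategory_counts = {item["term"]: item["total"] for item in data["breakdown"]}

# The most frequent subcategory among those listed.
top_subcategory = max(subcategory_counts, key=subcategory_counts.get)

# Assumption: the identifier encodes a date after the leading 't',
# so characters 1 to 4 are read as the year (t18110529-18 -> 1811).
years = [int(hit[1:5]) for hit in data["hits"]]
```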

Results can then be exported to Zotero (http://www.zotero.org/), a free
reference manager that is used in conjunction with the Firefox
browser. Zotero can be used for collecting citations from web
pages (it detects when page content is Zotero compatible), as
well as storing many other kinds of information. A Zotero account
allows users to sync their Zotero data across multiple devices or
machines.

The final stage of the pipeline is analysis of the results in
Voyant Tools: http://voyant-tools.org/. Here
users can upload text in several formats, or point the tool
at URLs, and perform simple linguistic analyses such as
word counts, frequency, and relative
frequency across documents. Stop words can be specified, and the
results can be compared to another corpus or exported in a
variety of formats.
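
To make these measures concrete, the Python sketch below computes word counts and relative frequencies with a stop-word list. It is not how Voyant Tools is implemented; the function names, the stop-word list and the two short sample texts are all invented for illustration.

```python
import re
from collections import Counter

# Illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "was", "to", "in"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count the words in a text, ignoring stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)

def relative_frequency(counts, word):
    """Share of all counted words accounted for by `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Two invented snippets standing in for trial texts.
documents = {
    "doc_a": "The prisoner was indicted for stealing a watch and a chain",
    "doc_b": "The watch was found in the lodgings of the prisoner",
}
counts = {name: word_frequencies(text) for name, text in documents.items()}
```

Relative frequency makes documents of different lengths comparable: "watch" occurs once in each snippet, but it accounts for a larger share of the counted words in the shorter one.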

Further possibilities

There seem to be two potential developments of Data Mining With
Criminal Intent. The first would be to integrate more tools,
allowing users to, for example, use more advanced natural
language processing techniques on the data (which at present
Voyant Tools cannot do). The limitation with this approach is
that the project’s aim of serving the ordinary working historian
would be compromised if techniques were introduced that require a
steep learning curve.

A second possibility is the more general one that other projects
could adopt this pipeline-like approach. Only a minority of
historians will want to access the Old Bailey Online for research
purposes, but in principle it does not seem too difficult to
apply the approach to other large datasets, now that this project
has provided a proof of concept. However, one limitation may
be that the Old Bailey Online is somewhat unusual as a historical
dataset: it is both very regularly structured and available
in a high-quality transcription.

Conclusion

The project’s aim of allowing the ordinary working historian to
use digital research techniques with a minimum of effort is a
laudable one. It also stands out in a field in which digital
projects are often led by enthusiastic power users who tend to
lose sight of the technical demands that the project outcomes may
make on interested academics. A side benefit of the project is
that it may convince some researchers that they can use digital
tools effectively in their work, and perhaps even enthuse them
sufficiently to learn more digital research skills.