Resources for (Legal) Deep Learning

This post sets out a number of resources to get you started with deep learning, with a focus on natural language processing for legal applications.

A Bit of Background

Deep learning is a bit of a buzz word. Basically, it relates to recent advances in neural networks. In particular, it relates to the number of layers that can be used in these networks. Each layer can be thought of as a mathematical operation. In many cases, it involves a multidimensional extension of drawing a line, y = ax + b, to separate a space into multiple parts.

I find it strange that when I studied machine learning in 2003/4, neural networks had gone out of fashion. The craze then was for support vector machines. Neural networks were seen as a bit of a dead end. While there was nothing wrong theoretically, in practice it wasn’t possible to train a network with more than a couple of layers. This limited their application.

What changed?

Computers and software improved. Memory increased. Researchers realised they could co-opt the graphical processing units of beefy graphics cards of hardcore gamers to perform matrix and vector multiplication. The Internet improved access to large scale data sets and enabled the fast propagation of results. Software tool kits and standard libraries arrived. You could now program in Python for free rather than pay large licence fees for Matlab. Python made it easy to combine functionality from many different areas. Software became good at differentiating and incorporating advanced mathematic optimisation techniques. Google and Facebook poured money into the field. Etc.

This all led to researchers being able to build neural networks with more and more layers that could be trained efficiently. Hence, “deep” means more than two layers and “learning” refers to neural network approaches.

Deep Natural Language Processing

Deep learning has a number of different application areas. One big split is between image processing and natural language processing. The former has seen big success with the use of convolutional neural networks (CNNs), while natural language processing has tended to focus on recurrent neural networks (RNNs), which operate on sequences within time.

Image processing has also typically considered supervised learning problems. These are problems where you have a corpus of labelled data (e.g. ‘ImageX’ – ‘cat’) and you want a neural network to learn the classifications.

Natural language processing on the other hand tends to work with unsupervised learning problems. In this case, we have a large body of unlabelled data (see the data sources below) and we want to build models that provide some understanding of the data, e.g. that model in some way syntactic or semantic properties of text.

Saying this there are cross overs – there are several highly-cited papers that apply CNNs to sentence structures, and document classification can be performed on the basis of a corpus of labelled documents.

Introductory Blog Posts

A good place to start are these blog posts and tutorials. I’m rather envious of the ability of these folks to write so clearly about such a complex topic.

Courses

After you’ve read those blog articles a next step is to dive into the Udacity free Deep Learning course. This is taught in collaboration with Google Brain and is a great introduction to Logical Regression, Neural Networks, Data Wrangling, CNNs and a form of RNNs called Long Short Term Memory (LSTMs). It includes a number of interactive Jupyter/IPython Notebooks, which follow a similar path to the Tensorflow tutorials.

Data Sources

Once you’ve got your head around the theory, and have played around with some simple examples, the next step is to get building on some legal data. Here’s a selection of useful text sources with a patent slant:

The file you probably want here is enwiki-latest-pages-articles.xml.bz2. This clocks in at 13 GB compressed and ~58 GB uncompressed. It is supplied as a single XML file. Again I need to work on some Python helper functions to access the XML and return text.

(Note: this is the same format as recent USPTO grant data – a good XML parser that doesn’t read the whole file into memory would be useful.)

Although there is no API or bulk download option as of yet, it is possible to set up an RSS feed link based on search parameters. This RSS feed link can be processed to access links to each decision page. These pages can then be accessed and converted into text using a few Python functions (I have some scripts to do this I will share soon).

Again a human accessible resource. However, the decisions are accessible by year in fairly easy to parse tables of data (I again have some scripts to do this that I will share with you soon).

Your Document / Case Management System.

Many law firms use some kind of document and/or case management system. If available online, there may be an API to access documents and data stored in these systems. Tools like Textract (see below) can be used to extract text from these documents. If available as some form of SQL database, you can often access the data using ODBC drivers.

Tools

Once you have some data the hard work begins. Ideally what you want is a nice text string per document or article. However, none of the data sources listed above enable you to access this easily. Hence, you need to start building some wrappers in Python to access and parse the data and return an output that can be easily processed by machine learning libraries. Here are some tools for doing this, and then to build your deep learning networks. For more details just Google the name.

NLTK

– brilliant for many natural language processing functions such as stemming, tokenisation, part of speech tagging and many more.

SpaCy

– an advanced set of NLP functions.

Gensim

– another brilliant library for processing big document libraries – particularly good for lazy functions that do not store all the data in memory.

Tensorflow

– for building your neural networks.

Keras

– a wrapper for Tensorflow or Theano that allows rapid prototyping.

Scikit-Learn

– provides implementations for most of the major machine learning techniques, such as Bayesian inference, clustering, regression and more.

Beautiful Soup

– great for easy parsing of semi-structured data such as websites (HTML) or patent documents (XML).

Textract

– a very simple wrapper over a number of different Linux libraries to extract text from a large variety of files.

Pandas

– think of this as a command line Excel, great for manipulating large lists of data.