At Free Law Project, we have gathered millions of court documents over the years, but it’s with distinct pride that we announce that we have now completed our biggest crawl ever. After nearly a year of work, and with support from the U.S. Department of Labor and Georgia State University, we have collected every free written order and opinion that is available in PACER. To accomplish this we used PACER’s “Written Opinion Report,” which provides many opinions for free.

This collection contains approximately 3.4 million orders and opinions from roughly 1.5 million federal district and bankruptcy court cases dating back to 1960. More than four hundred thousand of these documents were scanned and required OCR, amounting to nearly two million pages of text extraction that we completed for this project.

Today we are launching a new project to download all of the free opinions and orders that are available on PACER. Since we do not want to unduly impact PACER, we are doing this process slowly, giving it several weeks or months to complete, and slowing down if any PACER administrators get in touch with issues.

In this project, we expect to download millions of PDFs, all of which we will add both to the RECAP Archive that we host and to the Internet Archive, which will serve as a publicly available backup. In the RECAP Archive, we will parse the contents of all the PDFs as we download them. Once that is complete, we will extract the content of scanned documents, as we have done for the rest of the collection.

This project will create an ongoing expense for Free Law Project—hosting this many files costs real money—and so we want to explain two major reasons why we believe this is an important project. The first reason is that these documents have monumental value, and until now they have not been easily available to the public. These documents are a critical …

Today we’re extremely proud and excited to be launching a comprehensive database of judges and the judiciary, to be linked to CourtListener’s corpus of legal opinions authored by those judges. We hope that this database, its APIs, and its bulk data will become a valuable tool for attorneys and researchers across the country. This new database has been developed with support from the National Science Foundation and the John S. and James L. Knight Foundation, in conjunction with Elliott Ash of Princeton University and Bentley MacLeod of Columbia University.

This post is one with mixed news, so I’ll start with the good news, which is that version 3.0 of the CourtListener API is now available. It’s a huge improvement over versions 1 and 2:

It is now browsable. Go check it out. You can click around the API and peruse the data without doing any programming. At the top of every page there is a button that says Options. Click that button to see all the filtering and complexity that lies behind an API endpoint.

It can be sampled without authentication. Previously, if you wanted to use the API, you had to log in. No more. In the new version, you can sample the API and click around. If you want to use it programmatically, you’ll still need to authenticate.

It conforms with the new CourtListener database. More on this in a moment, but the important part is that version 3 of the API supports Dockets, Opinion Clusters and Sub-Opinions, linking them neatly to Judges.

The search API supports Citation Searching. Our new Citation Search is a powerful feature that’s now available in the API.
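As a rough sketch of what programmatic use looks like, here is a minimal Python example that builds a query URL for the v3 search endpoint. The token header is a placeholder, and the exact parameters you need are best discovered by clicking the Options button in the browsable API described above.

```python
# Sketch of querying the version 3 REST API. The endpoint path and the
# "q"/"court" parameters mirror what the browsable API exposes; the token
# below is a placeholder you would replace with your own.
from urllib.parse import urlencode
import urllib.request

BASE = "https://www.courtlistener.com/api/rest/v3/search/"

def build_search_url(**params):
    """Return a search URL for the v3 search endpoint."""
    return BASE + "?" + urlencode(params)

url = build_search_url(q="obamacare", court="scotus")

# Anonymous sampling works for browsing; programmatic use needs a token:
req = urllib.request.Request(url, headers={"Authorization": "Token <your-token>"})
```

Sending `req` with `urllib.request.urlopen` then returns JSON that you can parse with the standard `json` module.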

While working on a soon-to-be-released feature of CourtListener, we needed to create “short form” case names for all the cases that we could. We’re happy to share that we’ve created about 1.8M short form case names, including complete coverage for all Supreme Court cases going back to 1947, when the Supreme Court Database begins.

If you’re not familiar with the term, short form case names are the ones you might use in a later citation to an authority you’ve already discussed in a document. For example, the first time you mention a case you might say:

Kellogg Brown & Root Services, Inc. v. United States Ex Rel. Carter

But later references might just be:

Kellogg Brown at 22

The Bluebook doesn’t have a lot to say about this format, but does say the short form must make it “clear to the reader…what is being referenced.” Also:

When using only one party name in a short form citation, use the name
of the first party, unless that party is a geographical or
governmental unit or other common litigant.

With these rules in mind, we made an algorithm that attempts to generate good short form …
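As a rough illustration of those rules (not our actual algorithm), a minimal sketch might look like the following; the party-splitting logic and the list of common litigants are simplified assumptions for the example.

```python
# Simplified sketch: use the first party's name unless it is a governmental
# unit or other common litigant, in which case fall back to the second
# party. The word lists and suffixes here are illustrative only.
COMMON_LITIGANTS = {"united states", "state", "commonwealth", "people"}

def short_form(case_name):
    parts = [p.strip() for p in case_name.split(" v. ")]
    if len(parts) != 2:
        return case_name  # not a simple two-party caption; punt
    first, second = parts
    if first.lower() in COMMON_LITIGANTS:
        return second
    # Trim trailing corporate designators for brevity.
    for suffix in (", Inc.", ", LLC", " Co."):
        if first.endswith(suffix):
            first = first[: -len(suffix)]
    return first
```

The real task is much messier than this: multi-party captions, “In re” matters, and abbreviation conventions all need handling that this sketch ignores.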

A long time ago in a courthouse not too far away, people started making books of every important decision made by the courts. These books became known as reporters and were generally created by librarian-types of yore such as Mr. William Cranch and Alex Dallas.

These men—for they were all men—were busy for the next few centuries and created thousands of these books, culminating in what we know today as West’s reporters or as regional reporters like the “Dakota Reports” or the thoroughly-named, “Synopses of the Decisions of the Supreme Court of Texas Arising from Restraints by Conscript and Other Military Authorities (Robards).”

Motivated by our need to identify citations to these reporters, we’ve
taken a stab at aggregating a few facts about them, such as variations in their names and abbreviations or the years they were published, and put all
that information into our reporters database. Until recently, this
database lived deep inside CourtListener and was only discovered by
intrepid hackers rooting around, but a few months ago we pulled it out,
put it in its own repository, and converted it to better formats so
anyone could more easily re-use it.

The Supreme Court Database includes data for about 8,500 Supreme Court
opinions from 1946 to 2013 and this first pass merges that data with
CourtListener so that:

Our copy of these opinions is enhanced with better parallel citations. You can now look these items up by United States Reports (U.S.), Supreme Court Reporter (S.Ct.), Lawyers’ Edition (L.Ed.), or even LEXIS citation (U.S.LEXIS). This should make our citation graph much more robust and should help people like Colin Starger at the University of Baltimore who are doing great analyses with this data. Many of these items were screen scraped directly from the Supreme Court website, meaning that for these items, this is the first time they have had proper citations. Here’s an example of the many parallel citations items now have:

We’re very excited to announce that CourtListener is currently in the
process of rolling out support for Oral Argument audio. This is a
feature that we’ve wanted for at least four years — our
name is CourtListener, after all — and one that will bring a raft
of new features to the project. We already have about 500 oral arguments
on the site, and we’ve got many more we’ll be adding over the coming weeks.

For now we are getting oral argument audio in real time from ten federal appellate courts. As we get this audio, we are using it to power a number of features:

Oral Argument files become immediately available in our search results.

A podcast is automatically available for every jurisdiction we
support and for any query that you can dream up. Want a custom
podcast containing all of the 9th circuit arguments for a particular
litigant? You got it.

You can now get alerts for oral arguments so you can be sure that
you keep up with the latest coming out of the courts.

The Burning of the Library of Alexandria, an illustration from ‘Hutchinson’s History of the Nations’, c. 1910.

At least since the destruction of the Ancient Library of
Alexandria, the world has
known the importance of having a backup. The
RECAP archive of documents from PACER is a
partial backup of documents taken offline by five federal
courts. It
is impossible to determine how complete a backup we have, because the
problem with missing documents is that you cannot even determine that
they are missing without a complete list of what used to be available.
No such lists exist for the documents from these five courts.

The BBC mentions the case Ricci v. DeStefano, which was decided at the
Second Circuit while Sonia Sotomayor was a Circuit Judge. Sotomayor,
now a Supreme Court Justice, had her role in deciding the case closely
scrutinized during her Supreme Court confirmation hearings. Many who dug
in to Sotomayor’s background during those hearings …

A recent announcement on the federal PACER website indicated that PACER documents from five courts prior to certain dates (pre-2010 for two courts, pre-2012 for one court, etc.) would no longer be available on PACER. The announcement was reported widely by news organizations, including the Washington Post and Ars Technica.
The announcement has now been changed to explain, “As a result of these
architectural changes, the locally developed legacy case management
systems in the five courts listed below are now incompatible with PACER;
therefore, the judiciary is no longer able to provide electronic access
to the closed cases on those systems.” See a screenshot of the earlier
announcement without this explanation:

Original PACER announcement

This morning, Free Law Project signed on to five letters from the non-profit Public.Resource.Org, headed by Carl Malamud, asking the Chief Judge of each of these five courts to provide us with access to these newly offline documents. The letter proposes that we be provided access in order to conduct privacy research, particularly with respect to the presence of social security numbers in court records, as Public.Resource.Org has done previously in several contexts. In addition, we offer to host all the documents …

Today Free Law Project announced that
it is partnering with Princeton University’s Center for Information
Technology Policy to manage the operation
and development of the RECAP platform.
Most readers here will know that the RECAP platform utilizes free
browser extensions to improve the experience of using PACER, the
electronic public access system for U.S. federal courts, and
crowdsources the creation of a free and open archive of public court records.

I have been frustrated with PACER for a long time: as a member of the
public, as a law student, as a litigator, as an academic, and as one
trying to build systems for public access to court documents. I’ve been
frustrated by the price per page, by the price for searches with no
results, by the shocking price for inadvertent searches with thousands
of results, by the occasional price for judicial opinions that are
supposed to be free, by the price in light of the fact that Congress
made clear that the Judicial Conference “may, only to the extent
necessary, prescribe reasonable fees… for access to information
available through automatic data processing equipment” when it has been
demonstrated time and again that PACER revenues grossly exceed …

The citation graph is made into a network to compute CiteGeist scores.

We’re excited to announce that beginning today our relevancy engine will
provide significantly better results than it has in the past. Starting
today, whenever you place a query we will analyze which opinions are the
most cited, and we will use that to provide the best results possible.
We’re calling this the CiteGeist score because it finds the spirit
of your query (“Geist”) and gives you the best possible results. This is
currently enabled for our corpus from the 1750s through about 1985, and the remaining years will get the CiteGeist treatment as well over the next few days.

The details of how CiteGeist works are in our code, but the basic idea
is to give a high CiteGeist score to opinions that are cited many times
by other important opinions, and to give a lower CiteGeist to opinions
that have not been cited or that have only been cited by unimportant
opinions. Once we’ve established the CiteGeist score, we combine it with
a query’s keyword-based
(TF/IDF) relevancy.
Together, we get a combined score which is a measure of how …
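For the curious, here is a toy Python sketch of that idea: a PageRank-style iteration over the citation graph, blended with a keyword score. The damping factor and the even 50/50 blend are illustrative choices, not our production tuning.

```python
# Toy version of the idea: an opinion is important if important opinions
# cite it. We iterate a PageRank-style update over the citation graph,
# then blend the result with a keyword relevancy score.
def citegeist(cites, damping=0.85, iters=50):
    """cites maps each opinion id -> list of opinion ids it cites."""
    nodes = set(cites) | {c for targets in cites.values() for c in targets}
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in cites.items():
            if targets:
                share = damping * score[n] / len(targets)
                for t in targets:
                    new[t] += share
        # Mass from dangling nodes (opinions citing nothing) spreads evenly.
        dangling = sum(score[n] for n in nodes if not cites.get(n))
        for n in nodes:
            new[n] += damping * dangling / len(nodes)
        score = new
    return score

def combined(keyword_score, cite_score):
    """Blend TF/IDF-style relevancy with the citation-based score."""
    return 0.5 * keyword_score + 0.5 * cite_score
```

In a graph where two opinions both cite a third, the cited opinion ends up with the highest score, which is exactly the behavior described above.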

Note: This is the third in the series of posts explaining the work
that we did to release the data donation from Lawbox LLC. This is a
very technical post exploring and documenting the process we use for
extracting metadata and merging it with our current collection. If
you’re not technically-inclined (or at least curious), you may want to
scoot along.

Working with legal data is hard. We all know that, but this post serves
to document the many reasons why that’s the case and then delves deeply
into the ways we dealt with the problems we encountered while importing
the Lawbox donation. The data we received from Lawbox contains about
1.6M HTML files and we’ve spent the past several months working with
them to extract good metadata and then merge it with our current
corpus. This post is a long and technical one and below I’ve broken it
into two sections explaining this process: Extraction and
Merging.

Extraction

Extraction is a difficult process when working with legal data because
it’s inevitably quite dirty: Terms aren’t used consistently, there are
no reliable identifiers, formats vary across jurisdictions, and the data
was …

After many years of collecting and curating data, today CourtListener passed some incredible milestones. Thanks to a generous data donation from Lawbox LLC, our computers are currently adding more
than 1.5M new opinions to CourtListener, expanding our coverage to a
total of more than 350 jurisdictions. This new data enables legal
professionals and researchers insight into data that has never before
been available in bulk and greatly enhances the data we previously had.
This data will be slowly rolling out in our front end, and will soon be
available in bulk from our bulk downloads
page. A new version of our
coverage page was developed,
and, as always, you can see our current coverage for any jurisdiction we support.

It’s difficult to overstate the importance of this new data. In addition
to being a massive expansion of our coverage, it also brings some
notable improvements to the project:

For all of the new data and much of our old data, we have added star
pagination throughout. For the first time, this will make pinpoint
citations possible using the CourtListener platform.

We’ve re-organized our database for more accurate citations enabling
for the first …

A goal of the Free Law Project is to make development of legal tools as
easy as possible. In that vein, we’re excited to share that as of today
we’re officially taking the wraps off what we’re calling the Free Law
Virtual Machine.

For those not familiar with the term, a virtual machine is a snapshot of a computer that can be run by anybody, anywhere. With this release, we’ve created a computer running Ubuntu Linux that developers and academics can download, and which has all of the Free Law Project’s efforts pre-loaded and ready to go.

In addition to a number of minor improvements, the following are
installed and configured:

CourtListener

Juriscraper

Development tools such as IntelliJ, Meld, vim, and Kiki

Bookmarks of all American courts

In addition to providing a simple virtual machine that you can install, we’re also releasing sample data that can easily be imported into the CourtListener platform. This data is available in groups of 50, 500, 5,000 or 50,000 records so that anybody can easily begin working or experimenting with our platform …

We’re updating our code in a number of ways today and that is resulting
in a number of changes to the format of our data dumps. If you use them
in an automated fashion, please note the following changes:

dateFiled is now date_filed

precedentialStatus is now precedential_status

docketNumber is now docket_number

westCite is now west_cite

lexisCite is now lexis_cite

Additionally, a new field, west_state_cite, has been added, which will
have any citations to West’s state reporters.
We’ve made these changes in preparation for a proper API that will
return XML and JSON. Before we release that API, we needed to clean up
some old field values so they would be more consistent. From this point on, we
expect better consistency in the fields of our XML.
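If you parse the dumps into dictionaries, a small mapping is enough to migrate records that still use the old camelCase names. The helper below is just an illustration, not part of our code; the field list mirrors the changes announced above.

```python
# Mapping of the old camelCase dump fields to the new snake_case names.
RENAMES = {
    "dateFiled": "date_filed",
    "precedentialStatus": "precedential_status",
    "docketNumber": "docket_number",
    "westCite": "west_cite",
    "lexisCite": "lexis_cite",
}

def migrate(record):
    """Return a copy of a parsed record with old keys renamed."""
    return {RENAMES.get(key, key): value for key, value in record.items()}
```

Keys that are not in the mapping (including the new `west_state_cite` field) pass through unchanged, so the helper is safe to run on records in either format.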

If this causes any inconvenience or if you need any help with these
changes, please let us know.

I mentioned in my last post that we’ve added some new courts to the
site. Today we’ve added the historical data for these courts that was
available on their website.

This amounts to about 1,500 new cases on CourtListener:

112 from November 2003 to today at the Court of Appeals for the
Armed Forces.

764 from January 2000 to today at the Court of Veterans Claims.

600 from January 2008 to today at the Court of International Trade.

All of these documents are immediately available via search, RSS, or our dump API, and will be in our dump of all our cases when it is regenerated at the end of the month.

This also marks an important achievement for the Juriscraper library.
Since CourtListener now has scrapers for all federal courts of special
jurisdiction, we’re officially moving it to version 0.2. It’s taken
longer than we wanted to get it here, but this is a huge step for the library.

Following on Friday’s big announcement about our new citator, today I’m
excited to share that we’ve completed incorporating volumes 1 to 491 of
the third series of the Federal Reporter (F.3d). This has been a
monumental task over the past six months. Since we already have many
cases that were from the same time period and jurisdiction, we had to
work very hard on our duplicate merging algorithm. In the end, we
were able to achieve upwards of 99% accuracy with our merging code, and any
cases that could not be merged automatically were handled by human
review. The outcome of this work is an improved dataset beyond any that
has been available previously: In tens of thousands of cases, we have
been able to merge the metadata on Resource.org with data that we
obtained directly from the court websites.
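The real merging code lives in our repository, but the core test can be sketched in a few lines: two records are duplicate candidates when their dates match and their case names are sufficiently similar. The 0.8 threshold and the field names here are illustrative, not our actual tuning.

```python
# Sketch of the duplicate-detection idea: exact match on the filing date
# plus fuzzy string similarity on the case name.
from difflib import SequenceMatcher

def looks_like_duplicate(a, b, threshold=0.8):
    """a and b are dicts with 'name' and 'date_filed' keys."""
    if a["date_filed"] != b["date_filed"]:
        return False
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold
```

Records that pass this kind of check can be merged automatically; anything ambiguous falls through to human review, as described above.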

These new cases bring our total number of cases up to 756,713, and we
hope to hit a million by the end of the year. With this done, our next
task is to begin incorporating data from all of the appellate-level
state courts. We will be working on this in a …

I’m incredibly excited today to announce that over the past few weeks we
have successfully rolled out a Citator on CourtListener. This
feature was developed by UC Berkeley School of Information students
Karen Rustad and Rowyn McDonald after a thorough design and development
cycle which included everything from user interviews to performance
optimizations of our citation finding algorithm.

As you’re browsing the site, you’ll immediately see three big new
features. First, all Federal citations to documents that we have in our
collection are now links. So as you’re reading, if there’s a reference
to a prior case that you feel might be useful to your research, you can
just click the link to that case and continue your research there. This
allows you to go upstream in your research, looking at the important
cases that came before.

The second big change you’ll see is a new sidebar on all case pages that
lists the top five cases that reference the one you’re reading. This
allows you to go downstream from the case you’re reading, where you’ll
be able to identify how the case was later interpreted by other courts.
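The citation-finding step behind these links can be sketched with a simple regular expression. The real algorithm recognizes far more reporters and handles many edge cases this toy version ignores.

```python
# Toy citation finder covering a few federal reporters; a production
# version needs a full reporters database and many normalization rules.
import re

CITE_RE = re.compile(r"\b(\d+)\s+(U\.S\.|S\.\s?Ct\.|F\.(?:2d|3d)?)\s+(\d+)\b")

def find_citations(text):
    """Return (volume, reporter, page) tuples found in an opinion's text."""
    return [(int(m.group(1)), m.group(2), int(m.group(3)))
            for m in CITE_RE.finditer(text)]
```

Each match found this way can be resolved against our collection and turned into a link; the downstream sidebar is just the same lookup run in reverse.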

Just a quick note today to share some exciting news and updates about CourtListener.

First, I am elated to announce that the CourtListener project is now
supported in part by a grant from Public.Resource.Org. With this
support, we are now able to develop much more ambitious improvements
to the site that would otherwise not be possible. Over the next few
months, the site should be changing greatly thanks to this support,
and I’d like to take a moment to share both what we’ve already been
able to do, and the coming changes we have planned.

One feature that we added earlier this week is a single location where
you can download the entire CourtListener corpus. With a single click,
you can download 2.2GB of court cases in XML format. Check out the
information on the dump page for more details about when the dump is
generated, and how you can get it: http://courtlistener.com/dump-info/

The second exciting feature that we’ve been working on is a platform
change that enables CourtListener to support a much larger corpus. In
the past, we’ve had difficulty with jobs being performed synchronously
with the court scrapers …