A goal of the Free Law Project is to make legal research easier and
faster. One way we do that is by scraping court websites, downloading
the opinions they post, making them searchable, and finding the
citation relationships among them. For many jurisdictions, we download
all the opinions they host, while for others we simply start downloading
their opinions on a given day and use those as fuel for our awareness
project whenever new material is published.

Today we are happy to share that thanks to several volunteer
contributors, we’re adding a number of new jurisdictions to the project:

Combined, these new jurisdictions already add nearly 50,000 new
opinions
to our collection, and as always, these are immediately available for
free via our bulk downloads.
As these jurisdictions publish more opinions, we will add them
automatically, usually within 30 minutes of when they are posted.

We will continue adding more and more jurisdictions and opinions. This
is only the beginning.

Brian W. Carver and Michael Lissner, creators of the CourtListener
platform and associated technology, are
pleased to announce that after four years developing free and open legal
technologies, they are launching a non-profit umbrella organization for
their work: Free Law Project. Free Law Project will serve to bring legal
materials and research to the public for free, formalizing the work that
they have been doing, and providing a long-term home for similar projects.

“Since the birth of this country, legal materials have been in the hands
of the few, denying legal justice to the many,” said Michael Lissner,
co-founder of the new non-profit. “It is appalling that the public does
not have free online access to the entirety of United States case law,”
said Brian Carver, UC Berkeley professor and Free Law Project
co-founder. “We are working to change this situation. We also provide a
platform for developing technologies that can make legal research easier
for both professionals and the general public.”

The official goals for the non-profit are:

To provide free, public, and permanent access to primary legal
materials on the Internet for educational, charitable, and
scientific purposes;

We’re proud to announce a big new feature today that we’ve been planning
for a long time. Starting today, you can make citation queries against
the CourtListener corpus. If you look in the bottom of the left hand
column, you’ll see a new slider:

Sliding the handles around, you can easily filter out any documents that
are too popular or not popular enough — or both. In addition to this,
we’ve added citation counts to our results list, and you can now order
your results by most cited or least cited, depending on the kind of
work you’re doing.

In addition, we’re also announcing two new fields that you can query:
Judges and Nature of Suit. Both of these fields are currently very
limited in our corpus, but as we add more documents, we want to expose
these to our users. To query by judge name, you can either type the name
directly into the judge text box on the left, or you can use the
“judge” operator in a query like [ judge:smith ]. For the
Nature of Suit, the data is both incomplete …

We’re updating our code in a number of ways today and that is resulting
in a number of changes to the format of our data dumps. If you use them
in an automated fashion, please note the following changes:

dateFiled is now date_filed

precedentialStatus is now precedential_status

docketNumber is now docket_number

westCite is now west_cite

lexisCite is now lexis_cite

Additionally, a new field, west_state_cite, has been added, which will
have any citations to West’s state reporters.
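For anyone consuming the dumps programmatically, the renames above are mechanical, so a small shim can translate old-style records while you transition. A minimal sketch in Python (the dict mirrors the renames listed above; the function name is our own illustration, not part of any released API):

```python
# Mapping from the old camelCase field names to the new snake_case
# names, exactly as listed in the changes above.
FIELD_RENAMES = {
    "dateFiled": "date_filed",
    "precedentialStatus": "precedential_status",
    "docketNumber": "docket_number",
    "westCite": "west_cite",
    "lexisCite": "lexis_cite",
}

def modernize_record(record):
    """Return a copy of a dump record with old field names translated."""
    return {FIELD_RENAMES.get(key, key): value for key, value in record.items()}

old = {"dateFiled": "2012-01-05", "docketNumber": "10-1234", "court": "scotus"}
print(modernize_record(old))
# {'date_filed': '2012-01-05', 'docket_number': '10-1234', 'court': 'scotus'}
```

Fields that were not renamed (like `court` above) pass through untouched.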
We’ve made these changes in preparation for a proper API that will
return XML and JSON. Before releasing that API, we needed to clean up
some old field values so they would be more consistent. From this point
on, we expect better consistency in the fields of our XML.

If this causes any inconvenience or if you need any help with these
changes, please let us know.

In memory of Internet activist Aaron Swartz, Think Computer Foundation
(http://www.thinkcomputer.org) and the Center for Information
Technology Policy (CITP) at Princeton University
(http://citp.princeton.edu) are announcing the winners of two $5,000
grant awards for improving RECAP.

Since 2009, a team of researchers at Princeton has worked on a web
browser-based system known as RECAP (https://free.law/recap/) that
allows citizens to recapture public court records from the federal
government’s official PACER database. The Administrative Office of the
Courts charges per-page user fees for
PACER documents, which makes it expensive to access these public
records. RECAP allows users to easily share the records that they
purchase and to freely access documents that others have already purchased.

Shortly after the unexpected death of Mr. Swartz, Think Computer
Foundation announced that
it would fund grants worth $5,000 each to extend RECAP and make use of
data contained in Think Computer Foundation’s PlainSite database of
legal information.

Two of these grants are being awarded today.

Ka-Ping Yee, a Canadian software developer living in Northern
California, has created a version of RECAP for Google’s Chrome browser.
This gives RECAP a much larger base of …

We’re excited to announce today that we’ve added five new courts to the
list of courts we support.

Today we add the Supreme Courts of

California

Indiana

West Virginia

Wisconsin

Wyoming

These are the first state courts that we support, and over the next few
days we’ll be adding more as the Juriscraper library supports them. We
already have another seven state courts in the wings!

By launching these courts today, we’re making a small change in our
plans. We were previously working towards having all 50 supreme courts
ready to go so we could add them in one big push, but since developing
these scrapers is taking longer than we would like, we’re going to
start adding state courts as they’re ready, one by one.

Today’s launch adds five courts and about 1,200 more cases to the
project. We need help getting the remaining courts ready. If you’re a
developer and want to help, get in touch via our contact form and we’ll
get you up and coding in no time.

Thanks to a great volunteer contribution, we now have amazing graphs on
our coverage page instead of simply static numbers.

The old version simply stated the total number of documents we had for
a court, leaving you scratching your head. The new version shows you
a timeline indicating how many documents we have in each court for each
year. It’s a great improvement that brings a lot more transparency into
the coverage we have on the site.

Today, teams across the country are hard at work on the Aaron Swartz
Memorial
Grants.
These grants, offered by the Think Computer Foundation, provide $5,000
awards for three different projects related to RECAP.

We are delighted to announce additional awards. The generous
folks over at Google’s Open
Source Programs team have pledged to support two more RECAP-related
project awards — at $5,000 each. These are open to anyone who wishes
to submit a proposal for a significant improvement to the RECAP system.
We will work with the proposers to scope the project and define what
qualifies for the award. All projects must be open source.

There are several potential ideas. For instance, someone might propose
adding support to RECAP for displaying the user’s current balance and
prompting the user to liberate up to their free quarterly $15
allocation as the end of the quarter approaches (inspired by Operation
Asymptote). Someone
might propose to improve the
https://www.courtlistener.com/recap/ interface, and
to improve detection and removal of private information. Someone might
propose some other idea that we haven’t thought of. You may wish to
watch the discussion of a few of these initial
ideas …

I mentioned in my last post that we’ve added some new courts to the
site. Today we’ve added the historical data for these courts that was
available on their website.

This amounts to about 1,500 new cases on CourtListener:

112 from November 2003 to today at the Court of Appeals for the
Armed Forces

764 from January 2000 to today at the Court of Veterans Claims

600 from January 2008 to today at the Court of International Trade

All of these documents are immediately available via search, RSS, or
our dump API, and will be included in our dump of all our cases when it
is regenerated at the end of the month.

This also marks an important achievement for the Juriscraper library.
Since CourtListener now has scrapers for all federal courts of special
jurisdiction, we’re officially moving it to version 0.2. It’s taken
longer than we wanted to get it here, but this is a huge step for the library.

It’s been quiet around here for a little while, so it’s about time I
share what’s been going on behind the scenes. As you might imagine, just
because we haven’t had a lot of news doesn’t mean that we haven’t been busy.

The biggest thing I have to share today is that we’ve moved our
CourtListener infrastructure to new and bigger hardware. This task has
taken months to complete and involved applying many updates to the code
and infrastructure. For developers, this upgrade comes with a few changes:

Our default database for CourtListener is now Postgres rather than
MySQL. This is something that’s been planned for a while, but wasn’t
really possible until a big upgrade like this one. The big changes
that come out of this are non-locking queries for our database
dumps, and better performance for many of our queries. Since
Postgres is a transactional, stricter, and more featureful database,
we’re convinced that it is a better way forward than MySQL. Oracle
hasn’t been a great steward of MySQL lately, so it was a good time
to jump ship. As a bonus, Postgres was started in Berkeley …

Last week, our community lost Aaron Swartz. We are still reeling.
Aaron was a fighter for openness and freedom, and many people have
been channeling their grief into positive actions for causes that were
close to Aaron’s heart. One of these people is Aaron Greenspan,
creator of the open-data site PlainSite and the Think Computer
Foundation. He has established a generous set of grants to be awarded
to the first person (or group) that develops the following upgrades to
RECAP, our court record liberation system. RECAP would not exist
without the work of Aaron Swartz.

Three grants are being made available related to RECAP. Each grant is
worth $5,000.00:

Grant 1: Develop and release a version of RECAP for the Google
Chrome browser that matches the current Firefox browser extension functionality

Grant 2: Develop and release a version of RECAP for Internet
Explorer that matches the current Firefox browser extension functionality


I got a bit frustrated today, and decided that I should build a tool to
fix my frustration. The problem was that we’re using a lot of XPath
queries to scrape various court websites, but there was no tool that
could be used to test XPath expressions efficiently.

There are a couple of tools that are quite similar to what I just
built: there’s one called Xacobeo, Eclipse has one built in, and even
Firebug has a tool that does something similar. Unfortunately, these
each operate on a different DOM interpretation than the one that lxml builds.

So while these tools helped, I consistently found that when the HTML
got nasty, they’d start falling over.

No more! Today I built a quick Django
app that can be run
locally or on a server. It’s quite simple. You input some HTML and an
XPath expression, and it will tell you the matches for that expression.
It has syntax highlighting, and a few other tricks up its sleeve, but
it’s pretty basic on the whole.

Following on Friday’s big announcement about our new citator, today I’m
excited to share that we’ve completed incorporating volumes 1 to 491 of
the third series of the Federal Reporter (F.3d). This has been a
monumental task over the past six months. Since we already have many
cases that were from the same time period and jurisdiction, we had to
work very hard on our duplicate merging algorithm. In the end, we
were able to get upwards of 99% accuracy with our merging code, and any
cases that could not be merged automatically were handled by human
review. The outcome of this work is an improved dataset beyond any that
has been available previously: In tens of thousands of cases, we have
been able to merge the metadata on Resource.org with data that we
obtained directly from the court websites.

These new cases bring our total number of cases up to 756,713, and we
hope to hit a million by the end of the year. With this done, our next
task is to begin incorporating data from all of the appellate-level
state courts. We will be working on this in a …

I’m incredibly excited today to announce that over the past few weeks we
have successfully rolled out a citator on CourtListener. This
feature was developed by UC Berkeley School of Information students
Karen Rustad and Rowyn McDonald after a thorough design and development
cycle which included everything from user interviews to performance
optimizations of our citation finding algorithm.

As you’re browsing the site, you’ll immediately see three big new
features. First, all Federal citations to documents that we have in our
collection are now links. So as you’re reading, if there’s a reference
to a prior case that you feel might be useful to your research, you can
just click the link to that case and continue your research there. This
allows you to go upstream in your research, looking at the important
cases that came before.

The second big change you’ll see is a new sidebar on all case pages that
lists the top five cases that reference the one you’re reading. This
allows you to go downstream from the case you’re reading, where you’ll
be able to identify how the case was later interpreted by other courts.
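The heart of a citator is spotting reporter citations in the text of each opinion. To give a flavor of what the citation-finding step involves, here is a toy regex of our own for illustration (the production algorithm handles far more reporters and edge cases than this):

```python
import re

# Toy pattern: a volume number, a reporter abbreviation, and a page
# number, e.g. "410 U.S. 113" or "505 F.3d 42". A real citator needs a
# much larger, curated list of reporter abbreviations.
CITATION_RE = re.compile(r"\b(\d{1,4})\s+(U\.S\.|F\.2d|F\.3d|S\.Ct\.)\s+(\d{1,5})\b")

def find_citations(text):
    """Return (volume, reporter, page) tuples for each citation found."""
    return CITATION_RE.findall(text)

text = "See Roe v. Wade, 410 U.S. 113 (1973), and later cases at 505 F.3d 42."
print(find_citations(text))
# [('410', 'U.S.', '113'), ('505', 'F.3d', '42')]
```

Once extracted, each citation is resolved to a document in the collection, which is what powers both the in-text links and the "top citing cases" sidebar.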

I’ve written
previously
about the lengths we go to at CourtListener to protect people’s privacy,
and today we completed one more privacy enhancement.

After my last post on this topic, we discovered that although we had
already blocked cases from appearing in the search results of all major
search engines, we had a privacy leak in the form of our
computer-readable sitemaps. These sitemaps contain links to every page
within a website, and since those links contain the names of the parties
in a case, it’s possible that a Google search for the party name could
turn up results that should be hidden.

This was problematic, and as of now we have changed the way we serve
sitemaps so that they use the noindex X-Robots-Tag HTTP header. This
tells search crawlers that they are welcome to read our sitemaps, but
that they should avoid serving them or indexing them.
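As a concrete illustration, the header itself is simple to emit. Assuming an nginx front end (an assumption for this sketch; any web server or framework that can set response headers works the same way), the relevant directive might look like:

```nginx
# Serve sitemap responses with a noindex directive so crawlers may
# still read the URLs but will not index or display the sitemaps.
location /sitemap {
    add_header X-Robots-Tag "noindex";
}
```

Because the directive applies at the HTTP layer, it works even for non-HTML files like XML sitemaps, where a meta robots tag is impossible.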

The Law Via the Internet
conference is celebrating its 20th anniversary at Cornell University on
October 7-9th. I will be attending, and with any luck, I’ll be
presenting on the topic proposed below.

Wrangling Court Data on a National Level

Access to case law has recently become easier than ever: By simply
visiting a court’s website it is now possible to find and read thousands
of cases without ever leaving your home. At the same time, there are
nearly a hundred court websites, many of which suffer from poor
funding or prioritization, and gaining a higher-level view of the law
can be challenging.
“Juriscraper” is a
new project designed to ease these problems for all who wish to
collect these court opinions daily. The project is under active
development, and we are looking for others to get involved.

Juriscraper is a liberally-licensed open source library that can be
picked up and used by any organization to scrape the case data from
court websites. In addition to simply scraping the websites and
extracting metadata from them, Juriscraper has a number of other design goals:

Extensibility to support video, oral argument audio, and other media types

For the past few months, we have been blogging about our research into
how to handle scanned documents at CourtListener since a number of
courts have a habit of releasing their opinions in this manner.
Previously when this happened, it meant that we couldn’t get the text
out of the document, and as a result, it was impossible for anybody to
find these cases on the site.

Obviously, this is a bad situation for our users, so we are excited to
announce that as of today we have a new Optical Character Recognition
(OCR) system for extracting the text from scanned documents. We’re
currently extracting the text from an additional 10,000 opinions that
were previously unsearchable, and going forward we’ll do this
automatically as we get cases from the courts.
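The trigger for OCR is deciding that a document's embedded text layer is missing or useless. A simplified sketch of that decision follows (the threshold and function name are our own illustration, not the exact production logic):

```python
def needs_ocr(extracted_text, min_chars=100):
    """Heuristic: if a PDF's embedded text layer yields almost no real
    characters, the document is probably a scanned image and should be
    routed through OCR instead."""
    content = "".join(extracted_text.split())  # drop all whitespace
    return len(content) < min_chars

print(needs_ocr(""))                  # True: empty text layer, a scan
print(needs_ocr("\n\n \x0c \n"))      # True: only whitespace/form feeds
print(needs_ocr("IN THE UNITED STATES COURT OF APPEALS " * 10))  # False
```

Documents that trip the heuristic get queued for OCR; the rest keep their extracted text as-is.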

This change further expands the breadth of our coverage, and we hope you
find it to be a useful change!

For the past two years at CourtListener we used a mess of code to scrape
the Federal Court system. This worked remarkably well, but as we
recently began expanding our coverage, it became clear that a rewrite
was needed. For
the past several weeks, we’ve been building a replacement called
Juriscraper that is more reliable, understandable, flexible, and expandable.

Unlike our old scrapers, Juriscraper is a library that anybody can pick
up and use, and which allows your project to easily scrape court
websites. It is currently at version 0.1, which supports all of the
courts on CourtListener, and over the next few weeks we’ll be adding
many more courts until we have all of the available courts in the United States.
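To illustrate the kind of interface such a library exposes, here is a hypothetical sketch of the pattern (our own illustration, not Juriscraper's actual API): each court gets a small subclass that knows its URL and how to extract case metadata, while a shared base class handles everything common.

```python
from abc import ABC, abstractmethod

class CourtSite(ABC):
    """Hypothetical base class: one subclass per court website."""
    url: str  # the court's opinion listing page

    @abstractmethod
    def extract_cases(self, html):
        """Parse the court's page into uniform metadata dicts."""

class ExampleCourt(CourtSite):
    url = "https://example.gov/opinions"  # placeholder URL

    def extract_cases(self, html):
        # A real scraper would use XPath queries here; this placeholder
        # just shows the uniform shape every scraper hands back.
        return [{"name": line.strip(), "court": "example"}
                for line in html.splitlines() if line.strip()]

site = ExampleCourt()
print(site.extract_cases("Smith v. Jones\nDoe v. Roe\n"))
# [{'name': 'Smith v. Jones', 'court': 'example'},
#  {'name': 'Doe v. Roe', 'court': 'example'}]
```

Because every court returns the same shape of data, downstream code like CourtListener's ingestion pipeline can treat all jurisdictions identically.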

We hope that this project will be something that others will use, and
that we can thus centralize our scraping efforts. There are many
organizations that are currently scraping court websites, each with
their own implementations that they build and maintain. This creates a
lot of duplicated work and slows down maintenance for everybody. By
finally creating a liberally licensed shared scraper, we hope to bring
everybody under the same scraping roof so we can share …