Update: I’ve turned off commenting on this article because it was just
a bunch of people asking for help and never getting any. If you need
help with these instructions, go to Stack Overflow and ask there. If you
have corrections to the article, please send them directly to me using
the Contact form.

Tesseract is a great and powerful OCR engine, but its instructions for
adding a new font are incredibly long and complicated. At CourtListener
we have to handle several unusual blackletter fonts, so we had to go
through this process a few times. Below I’ve explained the process so
others may more easily add fonts to their system.

Create training documents

To create training documents, open up MS Word or LibreOffice and paste in
the contents of the attached file named ‘standard-training-text.txt’.
This file contains the training text that is used by Tesseract for the
included fonts.

Set your line spacing to at least 1.5, and space out the letters by
about 1pt. using character spacing. I’ve attached a sample doc too, if
that helps. Set the text …

At CourtListener, we’re developing a new system to convert scanned court
documents to text. As part of our development we’ve analyzed more than
1,000 court opinions to determine what fonts courts are using.

Now that we have this information, our next step is to create training
data for our OCR system so that it specializes in these fonts. For now,
we’ve attached a spreadsheet with our findings and a script that others
can use to extract font metadata from PDFs.

Unsurprisingly, the top font — drumroll please — is Times New Roman.

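| Font | Regular | Bold | Italic | Bold Italic | Total |
|---|---|---|---|---|---|
| Times | 1454 | 953 | 867 | 47 | **3321** |
| Courier | 369 | 333 | 209 | 131 | **1042** |
| Arial | 364 | 39 | 11 | 41 | **455** |
| Symbol | 212 | 0 | 0 | 0 | **212** |
| Helvetica | 24 | 161 | 2 | 2 | **189** |
| Century Schoolbook | 58 | 54 | 52 | 9 | **173** |
| Garamond | 44 | 42 | 41 | 0 | **127** |
| Palatino Linotype | 36 | 24 | 24 | 1 | **85** |
| Old English | 42 | 0 | 0 | 0 | **42** |
| Lincoln | 27 | 0 | 0 | 0 | **27** |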
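For readers who want to build a similar tally, here is a minimal sketch of how raw PDF font names might be grouped into the Regular, Bold, Italic and Bold Italic columns. It is not the attached script. It assumes the font names have already been extracted from the PDFs by some other tool, and the function names (`classify_style`, `tally_fonts`) and the naive family-splitting rule are hypothetical.

```python
import re
from collections import Counter

# Embedded font subsets are named with six uppercase letters and a plus
# sign in front of the real name, e.g. "ABCDEF+Times-Bold".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def classify_style(font_name):
    """Guess the style column from keywords in the font name."""
    lowered = font_name.lower()
    is_bold = "bold" in lowered
    # Slanted faces are commonly labeled either "Italic" or "Oblique".
    is_italic = "italic" in lowered or "oblique" in lowered
    if is_bold and is_italic:
        return "Bold Italic"
    if is_bold:
        return "Bold"
    if is_italic:
        return "Italic"
    return "Regular"


def tally_fonts(font_names):
    """Count (family, style) pairs for an iterable of raw font names."""
    counts = Counter()
    for raw_name in font_names:
        name = SUBSET_PREFIX.sub("", raw_name)
        # Assumption: the family is whatever precedes the first "-" or ",".
        family = re.split(r"[-,]", name, maxsplit=1)[0]
        counts[(family, classify_style(name))] += 1
    return counts
```

A real tally would also need to merge aliases of the same family (the split above treats "TimesNewRoman" and "Times" as different families), which this sketch leaves out.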

Attachments

As part of our research for our
post
on how we block search engines, we looked into which search engines
support which privacy standards. This information doesn’t seem to exist
anywhere else on the Internet, so below are our findings, starting with
the big guys, and moving towards more obscure or foreign search engines.

Google, Bing

Yahoo, AOL

Yahoo!’s search engine is provided by Bing. AOL’s is provided by Google.
These are easy ones.

Ask, Yandex, Nutch

Ask (known as teoma) and Yandex (Russia’s search engine, known as
yandex) support the robots meta tag, but do not appear to support the
x-robots-tag header. Ask’s page on the topic is here, and Yandex’s is
here. The popular open source crawler, Nutch, also supports the robots
HTML tag, but not the x-robots-tag header.
Update: Newer versions of Nutch now support x-robots-tag!
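Because some crawlers read only the HTML tag and others also read the HTTP header, one way to cover both is to emit the same directives in both places. The sketch below shows what the two forms look like. The helper names are hypothetical and the code is not tied to any particular web framework.

```python
def robots_meta_tag(directives=("noindex",)):
    """Build the HTML robots meta tag, placed in the page's <head>."""
    return '<meta name="robots" content="{}">'.format(", ".join(directives))


def robots_header(directives=("noindex",)):
    """Build the equivalent HTTP response header as a (name, value) pair."""
    return ("X-Robots-Tag", ", ".join(directives))
```

The meta tag only works for HTML pages, while the header can also be sent with non-HTML files such as PDFs, which is why support for both matters.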

The Internet Archive, Alexa

The Internet Archive uses Alexa’s crawler, which is known as
ia_archiver. This crawler does not seem …

At CourtListener, we have always taken privacy very seriously. We have
over 600,000 cases currently, most
of which are available on Google and other search engines. But in the
interest of privacy, we make two broad exceptions to what’s available on
search engines:

As is stated in our removal
policy, if someone gets in touch
with us in writing and requests that we block search engines from
indexing a document, we generally attempt to do so within a few hours.

If we discover a privacy problem within a case, we proactively block
search engines from indexing it.

Each of these exceptions presents interesting problems. In the case of
requests to prevent indexing by search engines, we’re often faced with
an ethical dilemma, since in many instances, the party making the
request is merely displeased that their involvement in the case is easy
to discover and/or they are simply embarrassed by their past. In this
case, the question we have to ask ourselves is: Where is the balance
between the person’s right to privacy and the public’s need to access
court records, and to what extent do changes in practical
obscurity
compel action on our …

After three months of hard development, I’m pleased to announce that the
new version of CourtListener is going live at this very moment. In this
version, we’ve completely rewritten vast swaths of the underlying code,
and we’ve switched to a far more powerful architecture.

The new site comes with some significant improvements:

You can now search by case name, date, court, precedential status or citation

Results can be ordered by date or by relevance

New Boolean operators are supported, and our syntax is much more
intuitive (see here for many more details)

If you want, you can now dig very deeply into the results.
Previously, we had a cap at 1,000 results for a query. Not any more.

Court documents will now show up in our search results within
milliseconds of being found on the court’s website. In the future,
if there’s demand, we may use this to offer real-time alerts.

We now have snippets and highlighting on our results page.

Finally, some polish everywhere to make things prettier.

Huge performance improvements.

Better support for mobile devices and tablets.

Better support for people with disabilities and for users who prefer not to use JavaScript.

XRDS Magazine recently ran an article by Steve Schultze and Harlan Yu
entitled Using Software to Liberate U.S. Case
Law. The article describes the
motivation behind RECAP, and outlines the state of public access to
electronic court records.

Using PACER is the only way for citizens to obtain electronic records
from the Courts. Ideally, the Courts would publish all of their
records online, in bulk, in order to allow any private party to index
and re-host all of the documents, or to build new innovative services
on top of the data. But while this would be relatively cheap for the
Courts to do, they haven’t done so, instead choosing to limit “open” access.

[…]

Since the first release, RECAP has gained thousands of users, and the
central repository contains more than 2.3 million documents across
400,000 federal cases. If you were to purchase these documents from
scratch from PACER, it would cost you nearly $1.5 million. And while
our collection still pales in comparison to the 500 million documents
purportedly in the PACER system, it contains many of the
most-frequently accessed documents the public is searching for.

Just a quick note today to share some exciting news and updates about CourtListener.

First, I am elated to announce that the CourtListener project is now
supported in part by a grant from Public.Resource.Org. With this
support, we are now able to develop much more ambitious improvements
to the site that would otherwise not be possible. Over the next few
months, the site should be changing greatly thanks to this support,
and I’d like to take a moment to share both what we’ve already been
able to do, and the coming changes we have planned.

One feature that we added earlier this week is a single location where
you can download the entire CourtListener corpus. With a single click,
you can download 2.2GB of court cases in XML format. Check out the
information on the dump page for more details about when the dump is
generated, and how you can get it: http://courtlistener.com/dump-info/

The second exciting feature that we’ve been working on is a platform
change that enables CourtListener to support a much larger corpus. In
the past, we’ve had difficulty with jobs being performed synchronously
with the court scrapers …

Over the past few months we have been working on cleaning and importing
the second series of the Federal Reporter (F.2d) from
http://law.resource.org. Today we’re excited to share that we’ve made
over 12,000 metadata additions, corrections or categorizations, and
that we’ve finally added F.2d to our corpus.

This expands our coverage to nearly 600,000 searchable cases, and
improves the quality of bulk data that is available for free on the Web.

We’re very excited by these new features, and we hope to import the
third series next. If you’re interested in contributing to this work,
please drop us a line. Cleaning and importing this information is a huge
task, and we can use all the help we can get!

As mentioned in a previous post, we are currently making some changes to
our back end to allow better citation metadata and searching
granularity. As part of these changes, we have made two small changes to
our dump formats.

The first change is to list docketNumber, westCite and lexisCite instead
of caseNumber and westCitation. We previously had many West-style
citations listed as generic case numbers. This wasn’t very accurate, so
we’ve re-organized this to have better granularity.

The second change we’ve made is to how we handle missing or incomplete
data. Previously, if a case was missing data, we would simply not
include it in a dump. This was not the best solution, so we’re now
including any information we have about a case in every dump we create.
In some instances, this can produce partial case records that lack vital metadata.
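If you consume the dumps, the practical effect is that any of the citation fields may be absent. Here is a minimal sketch of reading them defensively. The field names docketNumber, westCite and lexisCite come from the change described above, but the layout is an assumption: the sketch pretends each case is an `<opinion>` element carrying those fields as attributes, which may not match the real dump schema.

```python
import xml.etree.ElementTree as ET

# Field names from the new dump format.
CITATION_FIELDS = ("docketNumber", "westCite", "lexisCite")


def read_citations(xml_text):
    """Return one dict per case, with None for any missing or empty field.

    Assumption: cases are <opinion> elements with the citation fields
    stored as attributes. Adjust the lookup to the actual dump layout.
    """
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter("opinion"):
        # element.get() returns None for a missing attribute; "or None"
        # also normalizes empty strings to None.
        record = {field: element.get(field) or None for field in CITATION_FIELDS}
        records.append(record)
    return records
```

Treating missing and empty values the same way keeps downstream code to a single check per field.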

We hope these changes will be easy to work with, and that they’ll cause
no disruption.

One of the coming features at CourtListener
is an API for the law. Part of that feature is going to be some basic
information about the courts themselves, so I spent some time over the
weekend researching courts that served a special purpose but have since been abolished.

One such court was the Emergency Court of
Appeals.
It was created during World War II to set prices, and, naturally, was
the court of appeals for many cases. The creation date of the court is
prominently published in various places on the Internet, but the
abolishment history of the court was very difficult to find. After
researching online for some time, and learning that my library card had
expired (sigh), I put in a query with the Library of
Congress, which provides free research of these
types of things.

Within a couple of days, they provided me with this amazing response,
which I’m sharing here, and on the above Wikipedia article:

As stated in the Legislative Notes to 50 U.S. Code Appendix §§ 921
to 926, as posted at

http://www.law.cornell.edu/uscode/html/uscode50a/usc_sec_50a_00000921----000-notes.html,
the following explanation is given regarding the amendment and repeal
of Act …

After many months of work and about 100 revisions to the code, today
we’ve rolled out the latest version of the site. This version comes with
some great enhancements:

We rolled this out to our Twitter stream a few months ago, but we
finally have proper branding and a proper logo. We’re still keeping
things simple, but this should make things a little prettier.

We’ve added the search box to all pages so searches are easier to
make and so you can see what search brought you to the document
you’re looking at.

A new favorites feature has been added that allows you to make notes
about cases that interest you, and to see all of your notes in your profile.

The sidebar has been moved to the left in preparation for faceted
searching and browsing

Lots of code cleanup, lots of aesthetic fixes and dozens of small
fixes here, there and everywhere.

We’re really happy with this refresh and the new features that are
coming along with it. If you notice anything that’s not working properly
or that could be better, we’re always happy to hear your feedback.

A few years ago, the Library of Congress released a PDF that listed the exact dates that the early Supreme Court Cases were decided. Since the written record only contained the month and year of the decision, this list served as the official record for the cases.

While it was great for the Library of Congress to publish this report, unfortunately they did so in a large PDF rather than a more useful format that could be used by projects such as CourtListener. Our attempts to obtain the original version of the document from the Library of Congress were unsuccessful, so we converted the PDF into both a CSV and an ODS spreadsheet so that the data can be easily read by a computer. I’m happy to be releasing these files today so that they can be used by others.

The second project we have been working on at Free Law Project was to import this data into our system. Because citations in the file are not always unique, we had to devise a heuristic algorithm to link up the data in the CSV with the data in our system. Today, we’re happy to share that we did …

A few weeks ago, we made a fairly major change at CourtListener.com to
include ID numbers in all of our case URLs. This change meant that links
that were previously like this:

http://courtlistener.com/scotus/Wong-v.-Smith/

Are now like this:

http://courtlistener.com/scotus/V5o/wong-v-smith/

Most of the old links should continue to work, but using the new links
should be much faster and more reliable. The major difference between
the two is the ID number, which is encoded as a short string of letters
and digits (in this case V5o). This ID corresponds directly with the ID
number in our database, aiding us greatly in serving up cases quickly
and accurately.
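To illustrate the general idea of packing a numeric database ID into a short URL token, here is a minimal sketch using base 62. The alphabet and its ordering are assumptions made for the example, not the scheme the site actually uses, so it should not be used to decode real tokens such as V5o.

```python
import string

# Assumed alphabet: digits, then lowercase, then uppercase (62 symbols).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)


def encode_id(number):
    """Turn a non-negative integer ID into a short alphanumeric token."""
    if number < 0:
        raise ValueError("IDs must be non-negative")
    if number == 0:
        return ALPHABET[0]
    symbols = []
    while number:
        number, remainder = divmod(number, BASE)
        symbols.append(ALPHABET[remainder])
    # Remainders come out least-significant first, so reverse them.
    return "".join(reversed(symbols))


def decode_id(token):
    """Turn a token back into the integer ID it encodes."""
    number = 0
    for symbol in token:
        number = number * BASE + ALPHABET.index(symbol)
    return number
```

Because the token maps straight back to the primary key, a lookup can skip matching on the case name in the URL entirely, which is the speed and reliability gain described above.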

Around the same time as this change, we added social networking links to
all of our case pages to make them easier to share with friends and
colleagues. These links use our new tiny domain, http://crt.li/, and
should thus be ideal for websites like Twitter or Reddit.

In the next few months we will be getting a major new server, and will
be migrating our data to it. This will allow us to serve more data,
and—drum roll please—will allow us to begin …

This release of RECAP fixes an issue introduced by the newest version of
PACER, which has been deployed to several district courts. We’d like to
thank the users who brought this issue to our attention and also
encourage all RECAP users to contact us if they notice any
irregularities in the future. Each district court operates its own
version of PACER, so there are often small differences in code which can
affect the way that RECAP operates.

In addition, we’ve added a feature that will allow CM/ECF users to more
conveniently contribute documents to the RECAP archive. A substantial
number of our users are attorneys who have a separate “ECF” login as
well as a standard PACER account. Many of these users find it easy to
download and pay for PACER documents while logged into the ECF system,
but previous versions of RECAP would not upload these documents to the
shared archive. Version 0.8 changes this behavior, allowing ECF users to
contribute these documents to the RECAP archive.

When we released RECAP over a year ago, we intentionally
disabled the extension when it detected an …

We are proud to announce beta version
0.7 of RECAP. This release adds
support for Firefox 4 beta, for those of you living on the cutting edge.

We’ve also added a feature requested by our
users.
Before this release, the only way to see if RECAP had any free documents
for a particular case was to purchase and examine the docket report for
that case. In version 0.7, RECAP will notify you before you run a docket
report if there is already a free archived docket available. On the docket
query page for a case that has archived information, you should see a
box appear at the bottom of your screen. Clicking on that link will take
you to RECAP’s summary page, which includes any docket information we
have on the case as well as links to any documents we may have. Here’s
an example of what you should see:

Version 0.7 also fixes a number of bugs, both minor and major. Thanks to
a few extremely helpful users, we were able to fix a problem that
prevented RECAP from working correctly behind certain types of proxy
servers. Users behind a corporate proxy or firewall …

One of the ideas behind the RECAP project is that once government data
is made accessible in a free and open format, people will find useful
new ways to search and process that data. We have heard from many folks
looking to do interesting things with the documents archived by RECAP,
and last year a group of students built the searchable web-based RECAP
Archive. Today, Brian Carver shared a
simple tool he built on top of that — a Firefox RECAP search
plugin.
You know that little search box in the top-right corner of Firefox? If
you install his plugin you can choose the RECAP Archive as one of the
search engines in the drop-down menu, so that finding free federal court
documents is even easier.

The U.S. Courts recently conducted a year-long assessment of their
Electronic Public Access program which included a survey of PACER users.
While the results of the assessment haven’t been formally published, the
Third Branch Newsletter has an
interview
with Bankruptcy Judge J. Rich Leonard that discusses a few high-level
findings of the survey. Judge Leonard has been heavily involved in
shaping the evolution of PACER since its inception twenty years ago and
continues to lead today.

The survey covered a wide range of PACER users—“the courts, the media,
litigants, attorneys, researchers, and bulk data collectors”—and Judge
Leonard claims they found “a remarkably high level of satisfaction”:
around 80% of those surveyed were “satisfied” or “very satisfied” with
the service.

If we compare public access before we had PACER to where we are now,
there is clearly much success to celebrate. But the key question is not
only whether current users are satisfied with the service but also
whether PACER is reaching its entire audience of potential users. Are
there artificial obstacles preventing potential PACER users—who
admittedly would be difficult to poll—from using the service? The
satisfaction statistic may be fine at face value, assuming …

One of the most-requested RECAP features is a better web interface to
the archive. Today we’re releasing an experimental system for searching
and browsing, at
archive.recapthelaw.org. There are
also a couple of extra features that we’re eager to get feedback on. For
example, you can subscribe to an RSS feed for any case in order to get
updates when new documents are added to the archive. We’ve also included
some basic tagging features that let anybody add tags to any case.
We’re sure that there will be bugs to be fixed or improvements that can
be made.

The first version of the system was built by an enterprising team of
students in Professor Ed Felten’s “Civic Technologies”
course:
Jen King, Brett Lullo, Sajid Mehmood, and Daniel Mattos Roberts. Dhruv
Kapadia has done many of the subsequent updates. The links from the
RECAP Archive pages point to files on our gracious host, the Internet
Archive.