Bulk Data Downloads: A Breakthrough in Government Transparency

As those of you who follow my tweets know, I spent last week out in Washington, D.C. meeting
with various folks, attending Transparency Camp, and giving a couple of talks. One of the more interesting meetings was with staffers from Congressman Mike Honda’s
office. He represents the area around San Jose, including much of Silicon Valley.

Honda sits on a very interesting subcommittee of Appropriations, the one responsible for the
legislative branch, an enormously powerful place to be since they hold the purse strings for
everything from how much money members get for office supplies to how much money employees get
paid. Honda’s staff told me about an interesting rider the subcommittee was working on which
would require the agencies that serve the U.S. Congress to distribute their data in bulk.

John Wonderlich from the Sunlight Foundation wrote to me this morning to tell me the provision made it
into the Omnibus Appropriations Bill. This is big news. Honda’s staff told me that the
Congressman had been working on this for a year.
Here’s a link to the appropriations
bill. (I sure wish they gave us the ability to pull these things up in HTML, go directly to
a bookmarked section, and use change control
to see what has changed, but that’s another post.)

The money quote is this paragraph right here:

*Public Access to Legislative Data* – There is support for enhancing public
access to legislative documents, bill status, summary information, and other
legislative data through more direct methods such as bulk data downloads and
other means of no-charge digital access to legislative databases. The
Library of Congress, Congressional Research Service, and Government Printing
Office and the appropriate entities of the House of Representatives are
directed to prepare a report on the feasibility of providing advanced search
capabilities. This report is to be provided to the Committees on
Appropriations of the House and Senate within 120 days of the release of Legislative
Information System 2.0.

Advanced search is great, and the Legislative Information System 2.0 thing sounds very
good as well, but I was struck by the phrase “bulk data downloads and other means of no-charge digital access to legislative databases” and the specific
reference to agencies. What would it mean if all the bulk data from the Library of
Congress, Congressional Research Service, Government Printing Office, and “the
appropriate entities of the House of Representatives” were made available? I asked Carl Malamud, who has worked with many of these databases, if this looked like
something real or just another report.

Carl replied:

Wow! This is huge. The language only requires a report, but a report to an Appropriations subcommittee
means a whole bunch, because if they don’t like your report, you don’t get money. (Appropriations
was where the action occurred when we took on the Smithsonian over the Showtime deal. Once
they cut the budget $28 million, they had their attention.) Here’s what this means in
practice:

The Library of Congress sells a series of expensive bulk data products
including the Copyright Database, card catalog information in XML, and what are
known as “authority files” which are lists of names, subjects, and other classification
headings so that all libraries can call things by consistent names. Even though the
data is public, it is very expensive today. The Copyright Database, for example, costs $86,625 for
the retrospective and a one-year feed (we
harvested this in 2007 as you reported, but
this would be much easier if they simply provided an FTP server and rsync!)

The Government Printing Office sells the Official Journals of Government, which
we’ve been working very hard on harvesting and purchasing.
If we had $100,000, we would have bought
one of each long ago. This stuff is the official record of the United States. Here’s the full list of
databases, including the Congressional Record, the Compilation of Presidential Documents,
the United States Code, and much more. [Editorial note: The NY Times blog just did a piece earlier today about Carl’s quest to reinvent the mission of the Government Printing Office for the 21st Century.]

The Congressional Research Service is such a no-brainer. With the exception of
classified information, who can afford the luxury of paying for some of the best
research in the world and then just burying it? Taxpayer dollars paid for CRS reports
and they need to be available. (More on this at the
Sunlight Foundation.)

Other Entities of the House is the most impressive clause in that whole paragraph.
My reading is that this clause includes bulk access to broadcast quality video from every
congressional hearing. And if it doesn’t include that, I wish they’d make it clear
in report language. In this day and age, you can’t say a committee hearing is public
if you can’t access it on the Internet. Itty-bitty streaming video using some
proprietary client/format just doesn’t cut it any more. We ran a pilot
with 4 House committees to show that this is very doable and makes a huge difference (check out the before and after
shots on this video of Chad Hurley testifying before Congress).

On video, I want to add
one more note. Policy on what gets archived and distributed from a committee hearing is
up to committee chairmen. It’s a very decentralized system. So, if we’re serious about
putting broadcast quality video from congressional hearings on-line, the Legislative Branch Subcommittee
of Appropriations
would be a wonderful place to start. Happy to help
if they need a hand!

And, if we’re going to do video, there is one more administrative entity in the House
that we should call out. The House Broadcast Studio has a huge archive of prior hearings.
We asked Speaker Pelosi if we could run FedFlix on that archive and her staff sounded
very supportive.
(FedFlix is our program to help government agencies: they send us
video, we digitize it, send it back to them. No cost to the government, more data
for the public domain!) It would be great if the report from the House Broadcast Studio
specifically
dealt with how they’re going to make their archive of several thousand hearings
available as high-resolution, downloadable video.
Again, happy to help if they need a hand.

Bottom line? This is really great if they can pull this off. Congratulations to
Congressman Honda, as well as to the Sunlight Foundation which I know did some heavy
lifting on this issue. (Sunlight has turned into a remarkably effective lobbyist
in favor of transparency. They’re outgunned by K-Street, but they’re definitely
holding their own!)

When Carl Malamud convened a group of 30 open government advocates at O’Reilly’s offices late in 2007, a lot of the discussion focused on this very topic. The group came up with eight guiding principles on the subject of open data. One of the key points was that when government agencies release bulk data, they should do so in the lowest-level format possible. For example, for the Congressional Record and other official journals of government, we want XML plus images, as opposed to just PDF files or other final-form data.

I’d love your thoughts on what government data should be made available, what formats it should be available in, and what you’d do with it if you had access to it. When I spoke with Congressman Honda’s staff, they made clear that they’d love Silicon Valley’s best ideas for other technological reforms that they can include in future legislation. When you’ve got a Congressman who’s paying attention, that’s a great opportunity! I’m fairly sure that the Congressman will be checking the comments on this post, so it’s your chance to let him know what you think.

P.S. Rob Pierson, Congressman Honda’s Online Communications Director, actually gave a Q&A session at Transparency Camp in which he asked for ideas about how to redesign the Congressman’s web site. He got lots of suggestions, including ways to incorporate Twitter and Facebook feedback, but I’m sure that there are many more ideas. So in addition to responding with ideas about the bulk data provision in the legislation, this is a great chance to give the Congressman feedback on how he can do a better job listening to you.

Update: While it isn’t clear whether CRS reports would be covered under this provision, Senator Lieberman yesterday wrote a letter to the Senate Rules Committee Chairman asking for greater public access to CRS reports. With the new Democratic majority in Congress and the White House, transparency measures are sprouting up everywhere.


Martin Haeberli

Tim,

I would like to see at least the following from the US Patent and Trademark Office’s patent databases:

Frankly, it would be cool and helpful to have a wiki overlay on the full-text of patents and applications where one could comment on pending applications, fix typos, add clarifications, suggest prior art for consideration, etc.

-Public PAIR used to have stable links on a per-patent / patent application basis (so that you could set a web server to watch for key changes), and open access. But a few years ago they closed it down so that you have to use a CAPTCHA to log in every time, and there are no longer stable links.

This makes tracking events on key pending applications and other automation much harder.
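
To illustrate the kind of automation that stable, openly accessible links make possible, here is a minimal sketch of a change watcher. It is not based on any actual USPTO interface: the function names and the example URL are hypothetical, and the `fetch` callable is an assumption standing in for whatever retrieves a page (for instance a thin wrapper around `urllib.request.urlopen`).

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(body: bytes) -> str:
    """Return a stable digest of a page body so two versions can be compared."""
    return hashlib.sha256(body).hexdigest()


def changed_urls(
    urls: List[str],
    fetch: Callable[[str], bytes],
    seen: Dict[str, str],
) -> List[str]:
    """Return the URLs whose content differs from the digest stored in `seen`.

    `seen` is updated in place, so the next call only reports newer changes.
    A URL that has never been fetched before counts as changed.
    """
    changed = []
    for url in urls:
        digest = fingerprint(fetch(url))
        if seen.get(url) != digest:
            changed.append(url)
            seen[url] = digest
    return changed


# Usage with an in-memory stand-in for the network (hypothetical URL).
pages = {"https://example.invalid/application/1": b"status: pending"}
seen: Dict[str, str] = {}
first_pass = changed_urls(list(pages), pages.__getitem__, seen)
second_pass = changed_urls(list(pages), pages.__getitem__, seen)
```

A watcher like this only works when each application has one unchanging address and no login step, which is exactly what the CAPTCHA took away.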

Even if the USPTO doesn’t want to add all this open value because of load / performance reasons, other communities likely would bulk download the data and add the value.

On another front, I’d like to see the current federal court PACER system opened up for free public access, full text searching, etc.

Thanks for your leadership!

Martin

http://www.legalsolo.com Angel Maldonado

Writing from Spain, where we too are pushing for the Spanish government to open its data.

We are following Carl Malamud’s actions and found great guidelines on the Open Data Principles.

Over here the government body Red.es started up the project Aporta.es (“contribute”), which sets guidelines for government bodies on how to make all their public content available, to comply with EU Directive 2003/98/CE on the re-use of public data and its Spanish implementation, Law 37/2007.

I am certain that Carl Malamud and friends are achieving comparatively a lot more in the US, and with far less money, than Aporta.es in Spain, but we are still hopeful about this initiative. Some well-regarded bloggers and ourselves at blog.legalsolo.com are pushing to articulate the non-intuitive value of publishing for free and in re-usable formats.

One particular difference to note between the EU and the US with regard to publishing government data is that the EU, because of the various data protection acts in each country, can’t publish government data that contains individuals’ data such as names, addresses, etc.

This is resolved by detecting, emptying, or replacing the sensitive data, and hence it delays publication. Some of these processes are automated, but others are not fully automated.
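
As a rough sketch of the "detecting and replacing" step, the snippet below masks two kinds of personal data with regular expressions. The two patterns (an email address and an eight-digit number followed by a letter) are illustrative assumptions, not the rules any Spanish agency actually applies, and names or street addresses would need far more than a regular expression.

```python
import re

# Illustrative patterns only; real redaction rules would be broader.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ID": re.compile(r"\b\d{8}[A-Za-z]\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact ana@example.org, ID 12345678Z."
cleaned = redact(sample)
```

Pattern matching of this kind is the part that automates easily; the cases it misses are what still need manual review and slow publication down.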

Thanks for all the inspiration!

http://textiplication.com Skott Klebe

Unfortunately, the USPTO is part of the Department of Commerce, in the executive branch, and wouldn’t be covered by this measure. Fully agreed that I’d like that data to be free, though.

Rajat Mittal

We hear about these new transparency acts every now and then in various areas, and although this is huge, I would really like to see a data standard created for any such data access acts, and this group may be the best group to create such a standard.

http://structuralknowledge.com Kevin Webb

Martin,

I second your concerns about the USPTO. I spent many years working at a startup trying to get around the pricing/access restrictions put on PTO data sets. It was a major problem for us.

There’s another layer to this story, one that I actually brought up at TransparencyCamp last weekend:

Based on a conversation I had with the Patent Commissioner back in 2005, I believe the PTO would like to share more of its data. The commissioner, John Doll, actually has won a medal of service for his work on opening up access to the image file wrapper data and is a committed public servant in every sense of the word. However, the problem as he explained it to me is that the OMB issued a statement in 1996 (as part of Cir. A-130) stating that government agencies have to be careful not to “waste taxpayer money” competing with or duplicating services from existing data providers, be they public or private.

Unfortunately, in the case of patents there was such a strong need for data access that the private sector beat the government to the punch by building private services. So this actually blocked the PTO from building its own, more efficient/effective data interfaces, at least as far as the wording of A-130 is concerned.

I confirmed this fact by talking with some of the folks from the former MCNC (North Carolina’s supercomputing center) who helped build the public search interface that the PTO currently provides. It sounded like their work was very politically charged for the reasons Commissioner Doll outlined, and ultimately they had to dial back the extent of the service they delivered to the PTO.

It’s a tragedy for sure and one that, from my own personal experience, has made improving our IP system through outside innovation that much harder.

I’m excited to see all the amazing change that’s occurring in terms of data access and I only hope that it has similar impact in overturning the rules set out in OMB A-130, if it has not already.

Some folks wanted to know about Speaker Pelosi’s support of the FedFlix program. That letter is here.

I was also discussing Tim’s post with Congressional staff, and they indicated that it wasn’t clear that CRS reports will be covered in the House Omnibus, but they pointed out that the Senate seems to be working that issue pretty hard. In particular, here is a letter from Senator Lieberman on the subject.

Seems like transparency may be bicameral!

http://tim.oreilly.com Tim O'Reilly

In a press conference today, Vivek Kundra, the new Federal CIO appointee, gave two great examples of how release of government data can be an aid not just to government transparency, but can spark entire industries. First, he cited the NIH-supported Human Genome Project, which made possible the entire field of personalized medicine. He noted that there are more than 500 new drugs in the FDA pipeline that were enabled by this data release. Second, he cited the release of GPS data (which many people forget was once an exclusive military asset) as the basis for the entire Geospatial industry. (He didn’t mention all of the other government data sets that provide important layers in the geospatial ecosystem; GPS is only one of many.)

Government transparency is really important, but let’s not think that this stuff is just for policy wonks and open government advocates. Entrepreneurs: PAY ATTENTION!

Mark Meehan

Excuse me, but government transparency is not an excuse to continue wasteful spending. Throw enough money on the wall and some will stick. Get this money into the private sector and out of government. The Billions if not Trillions of dollars spent on NASA also gave the private sector some wealth: Tang, Teflon, the safest aircraft in the world. If you have no private sector to utilize the innovations it’s all in vain anyway. Obama needs to STOP creating Panic and realize that Government is NOT the answer. Smart innovators like yourself ARE the answer. Get the money into motivated people like you and we all will prosper.

Behind the search database here is a bunch of interesting information about FCC registered devices, such as wireless access points, wireless phones, etc.

Many of these are available, once found, as .pdfs

However,
a) the search engine is a bit flaky – searching for specific known entries fails, while broader searches can find the same entries.
b) this content has not been indexed by search engines such as Google.

It would be great if this were also accessible. I understand Kevin’s point above that this isn’t really in the scope of the legislative initiative which Tim describes, but it would still be nice if one could bulk download this data and/or if it were more accessible.

Thanks,

Martin

http://www.kevinbondelli.com Kevin Bondelli

The appointment of Vivek Kundra gives me hope that a lot of these ideas about opening up government data will become a reality. I was especially pleased by his announcement that he is going to create data.gov for this purpose. Now we’ll see exactly what and how much data will be available there and in what formats.

Jason

Martin, Kevin,

I would like to third your desire for the USPTO to open up their patent data to the world. There is so much information held in their databases that would likely be a treasure trove for researchers and startups.

I know the USPTO can’t compete with commercial providers, but why does that mean they can’t freely publish the raw data? Others can figure out what they want to do with it, but there’s no reason why the raw patent data should be so tremendously expensive.

Additionally, while the USPTO seems to have improved its transparency lately, they are very much concerned with providing the user with web access to individual patents. They then complain when people write robots to rip the data from their website. Clearly there’s a strong need here.

I really hope the Obama administration forces a change in policy at the USPTO in favour of open access to the data.

http://structuralknowledge.com Kevin Webb

Jason,

I agree – the raw data access is the big issue.

In some cases the problem is pricing which can be exorbitant ($30-60K for a set of DVDs or more for real-time FTP access). The pricing does not reflect the actual cost of providing the data.

In other cases the data isn’t available at all. For example, bulk access to PublicPAIR (which I would argue is far more valuable from a business standpoint) isn’t even available for purchase. This lack of availability caused such a problem that the PTO locked down the PublicPAIR with a captcha to prevent crawlers from scraping the data.

However, I feel like this gets at a larger issue about government data. There are really two kinds: the data that’s already free but just isn’t well presented in bulk data feeds, and the kind that’s not free but is well presented for those willing to pay. In the end the latter kind of data isn’t free (and probably won’t become free, even under the new rules) because it already has some intrinsic monetary value and a market.

Knowing about patents is important to businesses, so there’s a market for the data and many organizations gladly pay the tens of thousands it costs to purchase. Meanwhile others (like Thomson/Elsevier) benefit from the restrictions on access, which create roadblocks for startups building new and better retrieval/analysis tools. So while it might not be a direct “competition” issue regarding bulk access, there are clearly private sector interests in not expanding access.

Another example of this, at the local government level, is access to land parcel data. I think judicial records might also fall into this category. They have intrinsic value and therefore they have a cost for acquisition, sometimes a substantial cost, even if all you’re after is the bulk data.

I’m excited about and appreciate the importance of expanded access for legislative data. However, I hope the conversation about open access can/will also address the questions about these already existing data sets that are available but not truly open.

P.S. There’s another important point here about data quality and the private sector. This is something that is particularly true with patents.

As I understand it the canonical digital copy of patents is considered by many, including one PTO staffer I talked with, to be held by Thomson/MicroPatent. The problem is that the PTO doesn’t have the needed resources to maintain corrections in the digital version and is lagging behind Thomson in incorporating changes. When I talked with him in 2005 this lag was as much as three years.

So is the quality/“up-to-dateness” of the data also part of the question about open access? Is the data really open if you can only get a stale/inaccurate copy?

http://tim.oreilly.com Tim O'Reilly

Kevin,

I think you’ve hit on a really important point, which is also reflected in my comments about Vivek Kundra’s press conference yesterday. The government needs to consider, in its data release policies, which kind of data release creates more value for society:

- a tie up with a single provider (your Thomson examples) who may add value but also charges the public a very high price for access to the data, and who creates barriers to other players in the market who might exploit the same data

- an ecosystem approach, in which data is released as a common good, to be exploited by industry without exclusive agreements.

It seems to me, based on Vivek’s examples of the Human Genome Project and GPS, that more public and economic value is provided by the second approach.

I’d love to see discussion of other data sets that, if turned loose, would have not just transparency value, but could create whole new business ecosystems and industries.

Jason

To clarify, Thomson does not hold any monopolies over any public patent data. The USPTO is the sole publisher of this data, and it is available for anyone to purchase. Once published, patents are not updated.

That said, the price is set artificially high, well above the cost of reproducing the data. For a subscription to the current year’s patents and patent applications (full text & PDFs), you would have to shell out $82,450.

Imagine the barrier to entry this sets for a basement inventor or start-up. It’s peanuts for Thomson, and they like it that way.

I’d also like to hear about other data sets that could have additional benefits. Personally, I’d like to see municipal public transit schedules and routes made available, including live GPS locations when possible.

http://structuralknowledge.com Kevin Webb

Jason,

Not to digress too much as my understanding of the situation with data quality is a bit dated – I had the conversation with folks at the PTO in 2005.

However, I’m pretty certain that it is or at least was a valid concern, but only as far as the full-text product was concerned. And no monopoly with MicroPatent implied. As I understand it there are corrections made over time, I assume due to technical/publishing errors (names being left off/incorrect/etc). The explanation I got from the PTO staffer was that the PTO was able to make updates to the TIFF page images but was not merging all the changes back into the full-text version on the website. Thomson, however, was making updates and as a result there was some divergence in terms of the quality of the data provided from the PTO (at least as far as the full-text product was concerned – the TIFFs were correct) and what could be received from the MicroPatent copy. Again the USPTO had/knew the correct version, it just wasn’t always reflected in the digital full-text on the website.

Again this was just my understanding from this conversation. Perhaps (and hopefully!) the change backlog issue has been addressed since. If I’m incorrect about it, my apologies. I found this to be a fascinating point about the internal work flow at the PTO and I’d love to know if anyone else can help verify and/or update this story.

http://www.kirix.com/blog/ Ken Kaczmarek

>> I’d also like to hear about other data sets that could have additional benefits. Personally, I’d like to see municipal public transit schedules and routes made available, including live GPS locations when possible.

I couldn’t agree more. Transit schedules seem to be a fairly sticky issue, since each authority has its own formats. Google has been trying to sort this out with its GTFS specification:

It’s madness that this stuff isn’t easier to get at; hopefully the general movement toward open data will eventually encompass state/local municipalities as well.
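
To give a sense of what a shared format buys you, here is a minimal sketch that reads a GTFS `stops.txt` file, which is a plain CSV with columns such as `stop_id`, `stop_name`, `stop_lat`, and `stop_lon`. The sample rows and the `read_stops` helper are made up for illustration; a real feed has more files and more columns.

```python
import csv
import io
from typing import Dict, List, Union

# Invented sample data in the stops.txt layout.
SAMPLE_STOPS = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "S1,Main St,37.0,-122.0\n"
    "S2,Oak Ave,37.5,-121.5\n"
)


def read_stops(stops_txt: str) -> List[Dict[str, Union[str, float]]]:
    """Parse the text of a stops.txt file into a list of simple dicts."""
    reader = csv.DictReader(io.StringIO(stops_txt))
    return [
        {
            "id": row["stop_id"],
            "name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
        for row in reader
    ]


stops = read_stops(SAMPLE_STOPS)
```

Because every participating agency uses the same column names, the same few lines work for any of them, which is the whole argument for a common specification.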

http://www.webappwednesday.com Michael R. Bernstein

Other beneficial bulk data sets:

USDA Plant Hardiness Zone data

SBA loan data

Bayh-Dole Act disclosures (be nice to cross-reference these with the patents, too)

Raw FDA clinical trial data

OSHA inspection and accident investigation data

http://www.webappwednesday.com Michael R. Bernstein

As a trivial illustration of what could be done with the OSHA data for example, imagine employers in your area being ranked by their safety records.

http://www.abielr.com/blog/ Abiel Reinhart

I would like to see better distribution of government economic data. In some respects economic data is already distributed well. For instance, most of it is free, and there are often ways to obtain large, machine-readable data sets. Unfortunately there is great inconsistency across data sources, certain data is particularly hard to collect, and web-based browsing interfaces are rudimentary.

What would an ideal government system for economic data look like? Here are a few basic principles that I think are important:

1. All data is presented in machine-readable format. For the most part this is already the case, but there continue to be some reports (notably at Census) where I have been unable to find flat files or even spreadsheets.

2. Machine readable data is presented in a consistent manner, where possible, both within agencies, and across agencies. Right now there is no consistency in formats between different economic statistics agencies. For instance, both the Bureau of Labor Statistics and the Bureau of Economic Analysis have database systems that are internally consistent, but they are not consistent between the two agencies. Thus if you want to write a parser to capture the data, you are going to have to write two separate parsers. Meanwhile over at Census, every section seems to run its own show, and some sections aren’t even bothering to put on much of a show at all. The Fed used to have some of the same problems that Census did, although they have recently created more consistency.

3. Good user interfaces for browsing data. In addition to having machine readable data, it is also nice to have an interface that casual users can use to capture a few series at a time. Some agencies already have a basic browsing interface, but every site I have seen has extensive limitations. Ideally you want a responsive, well-documented system that you can use to rapidly jump between series, view whole tables of data at once, export to CSV or Excel, and visualize using a graphing system at least as nice as the one at Google Finance. Wading through five screens of options or dealing with a Java applet will not cut it.

One of the principles I mentioned above, consistency of data within and between agencies, will make accessing and aggregating all government economic data much easier. However, there is a chance that it could actually prove vastly more beneficial. The reason is that the size of the government’s data library means it has the ability to promote data format standards in a way that smaller private sources cannot. For instance, suppose that the government released all time series data in a consistent manner. You would naturally have a series of tools spring up that would be designed to manipulate that data. Then other organizations could have an incentive to publish their data in this format, so that the set of tools designed to work with government data could also work with their data (tools could include visualization software, automatic downloading into statistical software, etc).
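
To make the "two separate parsers" problem concrete, here is a small sketch. Both input layouts are hypothetical and are not the actual file formats of the Bureau of Labor Statistics or the Bureau of Economic Analysis; the point is only that each layout needs its own parser before the data can land in one common record type.

```python
import csv
import io
from typing import List, NamedTuple


class Observation(NamedTuple):
    """One common record shape that downstream tools could rely on."""
    series: str
    period: str
    value: float


def parse_long(text: str) -> List[Observation]:
    """Hypothetical 'long' layout: tab-separated series_id, year, value."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [Observation(r["series_id"], r["year"], float(r["value"])) for r in rows]


def parse_wide(text: str) -> List[Observation]:
    """Hypothetical 'wide' layout: one row per series, one column per year."""
    out: List[Observation] = []
    for row in csv.DictReader(io.StringIO(text)):
        series = row.pop("series")
        out.extend(Observation(series, year, float(v)) for year, v in row.items())
    return out


# Invented sample data expressing the same two observations in both layouts.
LONG_SAMPLE = "series_id\tyear\tvalue\nGDP\t2007\t1.5\nGDP\t2008\t2.5\n"
WIDE_SAMPLE = "series,2007,2008\nGDP,1.5,2.5\n"

from_long = parse_long(LONG_SAMPLE)
from_wide = parse_wide(WIDE_SAMPLE)
```

If agencies published in one layout to begin with, only one of these functions would ever need to be written, and visualization or statistics tools could target `Observation`-style records directly.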

@alanestes mentioned Statistics Canada as a possible model for the US in an earlier post. StatCan is a nice system, but it has a huge downside: you have to pay to get the historical data. Were you to go to their site right now and try to download a full time series of GDP, you would have to pay $3 Canadian. Want to get both GDP and consumption? You’ve just spent $6. There may be cheaper ways to get this data in bulk, but in my mind this is not the way to go.

http://tim.oreilly.com Tim O'Reilly

This New York Times article about a global sensing initiative – a cooperation between NASA and Cisco – highlights another important area where government data will become increasingly important:

Another set of data that would be useful: federal grant awards, especially for research. The folks at NSF now have some of that data available by query on Research.gov, apparently as part of the reporting requirements associated with the Transparency Act of 2006. So far, they are only reporting on NSF and NASA awards. Whether this data is available via other means is not clear.

Jason

@Abiel: “Statistics Canada as a possible model for the US in an earlier post. StatCan is a nice system, but it has a huge downside: you have to pay to get the historical data. Were you to go to their site right now and try to download a full time series of GDP, you would have to pay $3 Canadian. Want to get both GDP and consumption? You’ve just spent $6. There may be cheaper ways to get this data in bulk, but in my mind this is not the way to go.”

There is a significant difference here between Canadian and US copyright laws. In the US, as I understand it, the US government holds no copyright over any content it creates. In Canada, we have the archaic concept of ‘Crown copyright’ – meaning the government does hold copyright over all its content. That means the gov’t is free to charge for it, have restrictive license agreements, and all that. The US government would not have a similar option.

Also, this is not how StatsCan receives its funding. It is funded from the federal government, and also competes in the private sector internationally.

Sasha

We need to keep in mind that opening up these lines of information sharing would inevitably open up channels for attacks on this data. In making this information available by these means, the agencies involved would have to seriously revamp their data protection software and strategies to plan for the incoming flood of attacks.
