Public access to legislative information could get a boost this Friday at a House subcommittee hearing. The Legislative Branch Appropriations subcommittee will be marking up Congress’ budget for FY 2013, which will present the opportunity to require that the data behind THOMAS be made available to the public in a better format.

Why does this matter? Simply speaking, our democracy is founded upon an informed public acting through its elected officials to make policy. THOMAS makes this possible, but its limitations make it difficult.

Developers and programmers have worked to overcome THOMAS’s limitations. They have created websites like OpenCongress and GovTrack.us that together have nearly twice as many visitors as THOMAS, built mobile apps like Sunlight’s “Congress” Android app, which has been downloaded 400,000 times, and integrated the data into news coverage (as at the New York Times) and special-purpose sites like WashingtonWatch.com.

Unfortunately, weaknesses in how THOMAS makes the data available limit what even the most talented developer can accomplish. No one expects THOMAS to do everything, but it suffers from basic problems. Its web page addresses break after 15 minutes, it doesn’t provide redlines of bills, you can’t get alerts when legislation is moving, and it does a poor job of integrating relevant legislative data. There’s a laundry list of improvements here. In addition, there are services that THOMAS shouldn’t provide itself but that should exist, whether as simple as connecting relevant CRS reports to legislation or as dynamic as adding an interactive social media layer.

These are examples of the benefits of opening up the data that drives THOMAS. Beneath the 1990s web interface is an up-to-date database of bills, bill status information, legislative summaries, and much more. Releasing the data in a developer-friendly format (i.e. structured data made available in bulk) would empower innovators to improve upon the services THOMAS provides, and to go in entirely new directions, all at no cost to the public.

When the THOMAS website went live on January 5, 1995, it was the result of a bipartisan effort to grant “citizens across the country and around the world … access, via the Internet, to congressional information.” THOMAS significantly improved how legislative information was made available online — it provided additional materials in a centralized location, and did not charge the public for access — with a pledge that over time “enhancement[s] will be made to THOMAS to upgrade its features.”

While citizens around the world gained access to some congressional information, enhancements to THOMAS’s capabilities have been limited in scope. Its limitations kindled a desire in users to be able to build their own tools to make use of legislative data. These efforts have been severely hampered because THOMAS doesn’t give the public access to its underlying database, instead releasing its information piecemeal through thousands of webpages.

This challenge was partially overcome by technologists like Josh Tauberer, who in 2004 launched GovTrack.us, which he describes in his great new book Open Government Data as “one of the first websites world-wide to offer comprehensive parliamentary tracking for free and with the intention to be used by everyday citizens.” But there’s a catch. The unstructured way the THOMAS data was released required him to find some way to gather and organize the data.

He turned to screen scraping, which involves “programmatically loading up web pages, looking at their HTML source, and extracting information using simple pattern matching.” Jim Harper at Washington Watch, which tracks bills and government spending, also uses screen scraping. They’ve run into similar problems: screen scrapers don’t catch all the data, they’re a pain to build, they easily break, and can suffer from a time lag. All of this could easily be fixed by publicly releasing the structured database behind THOMAS.
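To make that fragility concrete, here is a minimal Python sketch of scraping by pattern matching. The HTML snippets, the date, and the function name are invented for illustration; this is not the actual THOMAS markup or anyone’s production scraper.

```python
import re

# Hypothetical markup for a bill status line; NOT the real THOMAS HTML.
PAGE_V1 = '<b>Latest Major Action:</b> 5/18/2012 Referred to House committee.<br>'
# The same information after an assumed cosmetic redesign of the page.
PAGE_V2 = '<span class="label">Latest Major Action:</span> 5/18/2012 Referred to House committee.<br>'

# The pattern is tied to the exact tags that surround the text we want.
ACTION_PATTERN = re.compile(
    r'<b>Latest Major Action:</b>\s*(\d+/\d+/\d+)\s+(.*?)<br>'
)

def scrape_latest_action(html):
    """Return (date, description), or None if the pattern no longer matches."""
    match = ACTION_PATTERN.search(html)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()

print(scrape_latest_action(PAGE_V1))  # the scraper finds the action
print(scrape_latest_action(PAGE_V2))  # the redesign silently breaks it
```

The data on the two pages is identical; only the presentation changed. A scraper cannot tell the difference between “the bill has no status” and “the page layout moved,” which is why these tools need constant maintenance.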

In fact, releasing the database — often referred to as providing “bulk access to data” — is a longstanding open data principle that has been called for by many people over the years.

In May 2007, a coalition of organizations and experts released the Open House Report, which recommended (among other things) the creation of a “Legislation Database.”

“Congress should make available to the public a well-supported database of all bill status and summary information currently accessible through the Library of Congress. This database, as well as its supporting files, should be in a structured, non-proprietary format such as XML.”
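What a structured format such as XML would mean for developers can be sketched in a few lines of Python. The element names, attributes, and sample bills below are assumptions made up for this example; no such official schema is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical bulk record; element and attribute names are assumptions.
BULK_XML = """
<bills congress="112">
  <bill type="hr" number="1">
    <title>Example Act</title>
    <status date="2012-05-18">Referred to committee</status>
  </bill>
  <bill type="s" number="2">
    <title>Another Example Act</title>
    <status date="2012-05-20">Introduced</status>
  </bill>
</bills>
"""

def load_bills(xml_text):
    """Parse the hypothetical bulk file into a list of plain dictionaries."""
    root = ET.fromstring(xml_text.strip())
    bills = []
    for bill in root.findall("bill"):
        status = bill.find("status")
        bills.append({
            "id": f'{bill.get("type")}{bill.get("number")}',
            "title": bill.findtext("title"),
            "status": status.text,
            "status_date": status.get("date"),
        })
    return bills

for record in load_bills(BULK_XML):
    print(record["id"], record["status_date"], record["status"])
```

Because every field is labeled in the data itself, nothing here depends on how a web page happens to look, and the same file can feed a website, a mobile app, or an alert service.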

Around the same time, legislative language was inserted into an explanatory statement accompanying the Omnibus Appropriations Act of 2009 (P.L. 111-8) that declared “There is support for enhancing public access to legislative documents, bill status, summary information, and other legislative data through more direct methods such as bulk data downloads and other means of no-charge digital access to legislative databases.”

This direct endorsement of bulk access to legislative data did not yield measurable results from the Library of Congress, which is responsible for the THOMAS database. Nor did the myriad meetings, phone calls, and letters from congressional staff to the Library.

Over time, responsibility for THOMAS has shifted to the Law Library from other parts of the Library of Congress, as announced in their January 5, 2010 holiday newsletter. Although the newsletter raised hopes that the “analysis of the system’s functionality and content based on user feedback” would lead to improvements in access to the underlying data, no movement on this issue was forthcoming. Even so, the public and members of Congress have continued to press forward on the issue.

For example, in May 2010, I had the opportunity to testify on behalf of Sunlight before the House Legislative Branch Appropriations subcommittee. We called on Congress to:

Grant the public access to legislative documents, bill status and summary information, and other legislative data no later than 120 days after the start of FY 2012. We also ask for the immediate creation of an advisory committee, composed of relevant legislative agency employees and members of the public, that will meet regularly to address the public’s need for access to this information, and the means by which it is provided.

In September 2010, Rep. Foster introduced legislation to improve public access to THOMAS. The bill would have provided bulk access to bill summary and other THOMAS data, created an advisory committee to make recommendations on improving THOMAS, and urged the Library to work towards providing bulk access to the full text of the legislation. The session ended before there was an opportunity for action.

Even though the 112th Congress brought a change in leadership in the House, bipartisan interest in making this information available to the public continued. Indeed, over the years appropriators, overseers, and leadership have pushed the ball forward. In June of 2011, the Committee on House Administration held a hearing on making congressional documents available electronically as a transparency and cost-savings measure. One of the panelists, Cornell’s Tom Bruce, advocated that the House focus on providing legislative data in bulk and in a timely fashion.

In December, Reps. Cantor and Hoyer co-hosted a Congressional Hackathon, which brought together nearly 300 developers and policy wonks to discuss how to use technology to make the legislative branch more open. Out of that meeting came three action items, the first of which was “providing legislative data in a bulk format to enable third-party developers to create more dynamic interfaces for legislative information.”

By the middle of the month, the Committee on House Administration set forth standards for the electronic posting of House and committee documents and data. In January, the House launched a groundbreaking transparency portal. It provides a one-stop website where the public can access all House bills, amendments, resolutions for floor consideration, and conference reports in XML, as well as information on floor proceedings and more. Information will ultimately be published online in real time and archived in perpetuity. So far, only documents considered by the full House are available online, but it’s expected that Committee documents will be available by the beginning of 2013.

The House transparency portal is a tremendous breakthrough, but it does have significant limitations. Because it came online in 2012, it doesn’t capture the historical information contained in the THOMAS database. As a House resource, it doesn’t have Senate records. And it doesn’t contain bill summaries, related bills, and other information prepared by the Library of Congress and GPO that are made available through THOMAS. These limitations can be overcome in time, and the portal clearly points the way to the future, especially if the Library of Congress doesn’t act.

On February 2, the House held a full-day Legislative Data and Transparency Conference, which brought together nearly all of the key players in making congressional information available to the public. On behalf of Sunlight, I delivered a talk on benchmarks for measuring success for legislative data transparency, which clearly included a call for THOMAS data to be made available in bulk. Surprisingly, the Library of Congress’ representative, when directly asked about THOMAS, indicated the issue wasn’t even on the radar. Three days later, the Sunlight Foundation submitted comments to the House Legislative Branch Appropriations Committee on the importance of making legislative data available to the public, as did Josh Tauberer and OpenCongress.

We estimate that for every person who goes directly to the THOMAS website, at least two people visit a third-party website. But even these sites must rely on legislative information generated and maintained by Congress, which is only available through the difficult-to-use THOMAS website. There will always be a need for a congressionally-mandated website, but Congress should ensure that the innovative and transformative uses of legislative information by third parties are grounded upon accurate and timely data. And that means providing bulk access to everyone.

So here we are in May. The three best legislative opportunities to require bulk access to THOMAS this legislative year, in increasing order of difficulty, are in the Leg Branch Approps Subcommittee mark-up on Friday, the full committee mark-up, and in the final vote on the House floor. (The Senate also provides an opportunity, but the House traditionally has led on these issues.)

It’s time to fulfill the promise of citizen access to legislative information. Congress should require bulk access to THOMAS legislative data no later than 120 days after passage of the appropriations bill, and create an advisory committee, composed of people inside and outside of government, that meets regularly to look at public access to legislative information. Doing so would make information that’s already required to be publicly available much more useful to everyone, and impose (at most) a minimal cost.

THOMAS was created by Congress to make legislative information freely available to the public, but the Library has not kept up with best practices. Congress should break the logjam and keep the promise of making free legislative information available to everyone in a way that encourages the public to make the most of it.