When Pfizer posted an online database of its doctor payments earlier this year, I wrote a tutorial for aspiring journalist-programmers on how to write software to collect data from sites like Pfizer’s, which was difficult to use for anything beyond looking up individual names.

As far as we know, there had not yet been a freely available database of all of the disclosed doctor payments. The federal government will create one in 2013, one of the lesser-known mandates of the health care reform bill passed this year. But until then, the companies haven’t made it easy to comb through their disclosures.

Carole Puls, a Lilly spokeswoman, said the company purposely made its report impossible to download "to protect the integrity of the data." Lilly was concerned someone could change numbers and create a false report outside the company’s Web site, Ms. Puls said.

It didn’t seem right for companies to tout their commitment to transparency while making their records cumbersome for all but the most cursory of examinations. So we decided to create a truly transparent database, both to aid our investigation and so that readers could use it to see what payments, if any, were made to their doctors.

Scrape and Share

There is no data on the Internet that is actually impossible to download. PharmaShine, a company that has built a business around collecting and selling access to the doctor payments data, told the Times that it manually retyped Lilly’s records.

We decided to write some code to automatically copy Lilly’s online data, which probably wasn’t much faster than just hiring a manual transcription service like Amazon’s Mechanical Turk. But it was an interesting challenge, and the solution is one we could reuse and share with everyone else who comes across difficult-to-parse documents.

Lilly ended up releasing its list as a downloadable PDF and so we focused our efforts on examining the combined payments list and cross-checking it with federal and state records.

Building a database-backed site was one of our main goals for this project, both so that readers could freely search it and to start a conversation about what might go into the federal database scheduled for 2013. Since our database's launch in October, readers have searched for payments to their doctors more than a million times. Dozens of news outlets have used our data to conduct investigations of their own.

We also want to share how we gathered the data from a variety of formats, because the methods apply to many other kinds of public records plagued by digital hurdles and inscrutable formats.