VozData: collaborating to free data from PDFs – The Senate Expenses part II

VozData is a collaborative tool for converting public documents trapped in PDFs into a structured database that all citizens can understand and journalists can report from.

The application is inspired by “Free the Files” by ProPublica and “MP's Expenses” by The Guardian. It lets readers and users take part in checking documentation that is vital for citizen participation and for oversight of the information governments produce.

The first project of this initiative focused on expenses by the National Senate and included 6500 renditions of accounts issued by the General Accounting Office (Dirección General de Contaduría) of that chamber for the years 2010, 2011, and 2012, published on the official site of the institution.

Senate Expenses is a project that continues the investigation of the same name, winner of the 2013 DJA. LA NACION had a large number of public documents in PDF format that were enormously difficult to process: even after applying six different OCR engines, valuable information was still being lost.

For this reason LA NACION uploaded all those documents to Documentcloud.org and developed VozData, an application that calls for open collaboration for all citizens to help us review each PDF document and enter the data in a structured way.

The results of this collaborative work, carried out by more than 500 volunteers in 2 months, are published on the site in real time as rankings of recipients and types of expense. There is also a ranking of the users who reviewed and classified the most documents.

At the end of each project, after the LNData team checks a representative sample, the dataset will be available for download in open data formats (CSV, XLS, etc.). The code behind VozData was open sourced by OpenNews Fellows and named Crowdata.

VozData is a new experience in Argentina, a country without a FOIA law, and also in Latin America.

LA NACION aims to stimulate demand for more and better public information from governments. Whatever format it comes in, it is better to have these documents published online than hidden or disappearing, as many cases have already demonstrated.

Promoting citizen participation, moving data one step closer to citizens and letting them interact with public documents through collaborative technology is something we think must be part of every media organization's mission today.

The team's first task was to explore the information the documents contained and to decide which data was important to capture and turn into information people can understand. We built a matrix in Google Spreadsheets and entered more than 150 documents manually as a sample. This work allowed us to understand the structure of the documents and to detect the common fields and the challenges they presented.

Once the data model for structuring the information in the renditions was built, a form was created for users to enter the required data on the site.

One week after launching we had to check data consistency and integrity. We split into three groups and queried the database, retrieving the documents whose amounts exceeded $50,000. According to the legislation, the General Accounting Office cannot pay more than $50,000, yet we found more than 50 cases in which this limit was exceeded.
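The threshold check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`document_id`, `amount`), the sample values and the helper name `over_limit` are hypothetical and are not taken from the VozData code base, which ran its queries against the project database.

```python
from decimal import Decimal

# Payment limit cited in the text, in pesos.
PAYMENT_LIMIT = Decimal("50000")

# Hypothetical sample of crowd-entered records (not real VozData data).
entries = [
    {"document_id": 101, "amount": Decimal("1250.00")},
    {"document_id": 102, "amount": Decimal("4830000")},
    {"document_id": 103, "amount": Decimal("50000")},
]


def over_limit(records, limit=PAYMENT_LIMIT):
    """Return the records whose amount is strictly above the limit."""
    return [r for r in records if r["amount"] > limit]


suspicious = over_limit(entries)
print([r["document_id"] for r in suspicious])
```

An amount exactly equal to the limit is not flagged, since the rule as described only forbids paying more than the limit.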

We opened the cases one by one and found some strange ones in which the comma or dot had not been recorded correctly in the database, so many amounts were inflated by two extra zeros.

We reacted by building a checking process and reviewed more than 900 cases one by one to find and repair the problem in those first cases.
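To illustrate how a lost decimal separator can inflate an amount by a factor of 100, and how a checking process might flag candidates for manual review, here is a minimal Python sketch. It assumes amounts were typed in the Argentine convention (dot for thousands, comma for decimals); the function names and the flagging heuristic are hypothetical, and the source does not describe how the actual bug or the actual check were implemented.

```python
from decimal import Decimal


def parse_amount_ar(text):
    """Parse an amount such as '$ 1.234,56' (dot = thousands, comma = decimals)."""
    cleaned = text.strip().replace("$", "").replace(" ", "")
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


def parse_amount_naive(text):
    """Buggy variant for comparison: keeps only digits, so the decimal
    separator is lost and the cents become part of the integer amount."""
    digits = "".join(ch for ch in text if ch.isdigit())
    return Decimal(digits)


def likely_inflated(amount, limit=Decimal("50000")):
    """Heuristic (an assumption, not the team's rule): the amount exceeds the
    limit but would fit under it if divided by 100."""
    return amount > limit and amount / 100 <= limit


raw = "$ 1.234,56"
print(parse_amount_ar(raw))     # amount with the separator handled
print(parse_amount_naive(raw))  # same digits, 100 times larger
print(likely_inflated(parse_amount_naive(raw)))
```

A flag from `likely_inflated` only marks a record as worth opening; the correction itself still needs a person to compare the stored value with the PDF.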

We also did some manual checking, see photo:

3. Technology

To make it easier to classify the types of expenses entered by users, the database was pruned as the documents were verified. The aim was, on the one hand, to speed up filling in the forms and, on the other, to establish classification categories and build a database of organized information that makes the final interpretation of the entered data easier.

The database generated for each project will be available for downloading once the revision and pruning of the data is finished.

VozData was programmed in Python with the Django web framework. The database is PostgreSQL, chosen so that we could apply similarity algorithms (using the pg_trgm extension) to the values we want to compare when verifying several entries for the same document. Free software tools were chosen so that the resulting code could be open. The documents are hosted in DocumentCloud, and its “text” function is used to extract text where DocumentCloud's OCR can read it and to add optional detailed information per rendition (this is explained to users in a step-by-step video tutorial).
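The idea behind comparing several entries for the same document with trigram similarity can be sketched in pure Python. The code below approximates the measure that pg_trgm documents (lowercased words padded with two leading spaces and one trailing space, split into three-character sequences, then shared trigrams divided by total distinct trigrams). It is an illustration only, not the extension's implementation and not the Crowdata code; the function names are hypothetical.

```python
import re


def trigrams(text):
    """Return the set of trigrams of a string, word by word."""
    result = set()
    # Treat runs of letters and digits as words; everything else separates.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            result.add(padded[i:i + 3])
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Two volunteers typing the same recipient with small differences
# (hypothetical values) score high; unrelated values score low.
print(similarity("Libreria El Ateneo", "Librería El Ateneo"))
print(similarity("Libreria El Ateneo", "Aerolineas Argentinas"))
```

In PostgreSQL the comparison would run inside the database through the extension's own functions; a Python version like this is mainly useful for understanding or unit-testing the matching rule.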

The architecture of the application is flexible and is not tied to the specific case of the Senate expenses, so anyone with administrator privileges can customize it easily, even without programming knowledge.

The software is available under a free license under the name Crowdata. To organize the LNData team's work, every issue detected was logged on GitHub, with a priority and a person in charge of solving it.

4. OPEN COLLABORATION: The long term view for changing to open data culture in Argentina. Partnering with NGOs and Universities, and citizens!

On March 21 a pre-launch event for VozData was held in the LA NACION building to test the application and gather user feedback on the usability of the tool.

For this reason LNData called on NGOs, civil institutions focused on transparency, open data and access to public information, and the university community engaged with these issues, to make the initiative for the collaborative opening of data in Argentina sustainable over time. Our aim is to produce change in the long run, build understanding and spread the culture of open data and transparency throughout the community.

In May 2014 we started a campaign to try to finish “the pile” during #SemanadeMayo, a patriotic week that culminates on May 25th. That historic day in 1810 is associated with the phrase “El Pueblo Quiere Saber” (“The people want to know”).

By 11 AM on May 25, the national holiday, we had finished reviewing the 6557 documents and shouted from different places ¡Viva la Patria!, a salute to our homeland.

As we submit this application, the team is checking the data before releasing it in machine-readable formats, and we are launching project 2, “Senate Expenses 2013”, to open new public data and keep developing the spirit of collaborative open data among citizens, NGOs and universities across Argentina.

VozData is currently being launched with an information campaign. It includes sending information about the project to print media, radio and TV, interviews with the VozData team, stories in the print and online editions of LA NACIÓN, posts on the LNData blog, ads in the print edition of LA NACIÓN and a social media campaign.