Background

The source data was available in PDF and HTML formats, so it required additional processing before becoming usable.

Data Engineering

Seeking machine-readable versions of the NCSL data, I found an open source GitHub repository, but it hadn’t been updated in a few years. Happy to contribute to an open source project, I sought to update the repository with the most recent years’ data.

Through a Twitter conversation with the repository owner, I learned the existing conversion process involved software called Tabula. When I used Tabula to convert the most recent two years of party composition data, I ran into issues.

After writing a Ruby script leveraging the pdftotext library to convert the PDF files to TXT format, I wrote scripts to convert the TXT files to CSV and JSON formats. I then wrote scripts to convert the gender composition HTML tables to CSV and JSON formats.
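
The TXT-to-CSV and TXT-to-JSON step can be sketched roughly as follows. This is a minimal illustration and not the actual conversion script: the column names in HEADERS, the helper names, and the assumption that fields are separated by runs of two or more spaces are all hypothetical, and the sample row in the usage note is invented.

```ruby
require "csv"
require "json"

# Hypothetical column names; the real NCSL tables have their own layout.
HEADERS = %w[state total_seats democrats republicans].freeze

# Split one line of extracted text on runs of two or more spaces,
# so multi-word state names stay in a single field.
def parse_line(line)
  fields = line.strip.split(/\s{2,}/)
  return nil unless fields.length == HEADERS.length

  state, *counts = fields
  # Skip lines whose numeric columns are not plain integers (titles, footers).
  return nil unless counts.all? { |count| count.match?(/\A\d+\z/) }

  HEADERS.zip([state, *counts.map(&:to_i)]).to_h
end

# Keep only the lines that parse as data rows.
def parse_text(text)
  text.each_line.map { |line| parse_line(line) }.compact
end

def rows_to_csv(rows)
  CSV.generate do |csv|
    csv << HEADERS
    rows.each { |row| csv << row.values_at(*HEADERS) }
  end
end

def rows_to_json(rows)
  JSON.pretty_generate(rows)
end
```

Calling parse_text on a made-up line such as "Example State    100   40   60" yields one row hash, which rows_to_csv and rows_to_json then serialize. Parsing once into an array of hashes keeps the two output formats in sync, since both are generated from the same rows.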

After initial satisfaction with the conversion results, I introduced validation checks into the process. These validations uncovered a few errors in the source data, which I communicated to NCSL via Twitter and remediated by updating the conversion scripts.
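
As an illustration of what such a validation check might look like, here is a small sketch. The specific rule (party seat counts should not exceed the chamber total) and the hash keys are assumptions chosen for the example; the checks actually used in the project may differ.

```ruby
# Hypothetical check: the party seat counts in a row should never
# add up to more than the total number of seats in the chamber.
# Each row is assumed to be a hash with the (made-up) keys used below.
def validation_errors(rows)
  rows.each_with_object([]) do |row, errors|
    counted = row["democrats"] + row["republicans"]
    next unless counted > row["total_seats"]

    errors << "#{row["state"]}: #{counted} party seats exceed #{row["total_seats"]} total"
  end
end
```

Returning a list of messages rather than raising on the first failure makes it possible to review every suspicious row in one pass, which helps when deciding whether the problem lies in the conversion script or in the source data itself.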

When satisfied with the validation effort, I submitted a pull request to add the most recent data and the automated conversion scripts to the original repository.

Software Engineering

After producing a full complement of machine-readable data, I created an interactive dashboard to consume the data and aid in exploration.