Decoding Data: A View from the Trenches

This has been a busy data month for me, as I prepare zooarchaeological datasets for publication for a major data sharing project supported by the Encyclopedia of Life Computable Data Challenge award. The majority of my time has been spent decoding datasets, so I’ve had many quiet hours to mull over data publishing workflows. I’ve come up for air today to share my thoughts on what I see as some of the important issues in data decoding.

Decoding should happen ASAP. Opening a spreadsheet of 25,000 specimens all in code makes my blood pressure rise. What if the coding sheet is lost? That’s a lot of data down the drain. Even if the coding sheet isn’t lost, decoding is not a trivial task. Though much of it is a straightforward one-to-one pairing of code to term, there are often complicated rules on how to do the decoding. Though an individual with little knowledge of the field could do much of the initial decoding, one quickly arrives at a point where specialist knowledge is needed to make judgment calls about what the data mean. Furthermore, there are almost certainly going to be typos or misused codes that only the original analyst can correct. Decoding should be done by the original analyst whenever possible. If not, it should be done (or at least supervised) by someone with specialist knowledge.

Decoding is expensive. In fact, it is one of the biggest costs in the data publishing process. I’ve decoded five very large datasets over the past few weeks, and they required about five to ten times more work than datasets that authors submitted already decoded. The size of the dataset doesn’t matter: whether you have 800 records or 100,000 records, data decoding takes time. For example, one of the datasets I edited for the EOL project had over 125,000 specimens. It was decoded by the author before submission. Editing and preparing this dataset for publication in Open Context took about four hours. In comparison, another dataset of 15,000 specimens was in full code and took over 30 hours to translate and finalize for publication. This is something critical for those in the business of data dissemination to consider when estimating the cost of data management. Datasets need to be decoded to be useful, but decoding takes time. Should data authors be required to do that work as part of “good practice” for data management?

Coding sheet formats matter. Ask for coding sheets in a machine-readable format so you can easily automate some of the decoding. Though PDFs are pretty, they’re not great for decoding.
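To make the point concrete, here is a minimal sketch of the kind of automation a machine-readable coding sheet allows. It assumes the coding sheet arrives as a two-column CSV with headers "code" and "term"; those headers, the sample codes, and the function names are all hypothetical and would need adapting to a real coding sheet.

```python
import csv
import io

# Hypothetical coding sheet. In practice this would be a CSV file
# supplied by the data author rather than an inline string.
CODING_SHEET_CSV = """code,term
1,mandible
2,humerus
3,femur
"""


def load_coding_sheet(text):
    """Read a two-column CSV (code, term) into a lookup dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["code"].strip(): row["term"].strip() for row in reader}


def decode(codes, lookup):
    """Translate codes one-to-one; unknown codes become None for later review."""
    return [lookup.get(code.strip()) for code in codes]


lookup = load_coding_sheet(CODING_SHEET_CSV)
decoded = decode(["1", "3", "2", "9"], lookup)
print(decoded)
```

Here the code "9" is not on the sheet, so it comes back as None instead of being guessed at. A PDF coding sheet would first have to be retyped or scraped into this form before any of this could run.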

Decoding often has complicated (and sometimes implicit!) rules. Keep all the original codes until you are sure you have finished decoding. Otherwise, you may find you need a code from one field to interpret another field. For example, one researcher used four different codes that all translated to “mandible.” It turns out each code was associated with a certain set of measurements on the mandible. If you decode the elements first (as you would) and make all the mandibles just “mandible,” then you reach the measurements section and realize you still need that original code distinction.
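A rough sketch of the "keep the original codes" habit, using the mandible case as a model: the decoded term goes into a new field, and the original code stays in place so a later step can still use it. The codes ("M1" to "M4", "H1"), the measurement set names, and the field names are invented for illustration and do not come from any real dataset.

```python
# Hypothetical codes: four different codes all decode to "mandible",
# but each one implies a different set of measurements.
ELEMENT_TERMS = {
    "M1": "mandible",
    "M2": "mandible",
    "M3": "mandible",
    "M4": "mandible",
    "H1": "humerus",
}
MEASUREMENT_SETS = {
    "M1": "set A",
    "M2": "set B",
    "M3": "set C",
    "M4": "set D",
}


def decode_record(record):
    """Add decoded fields next to the original code instead of overwriting it."""
    decoded = dict(record)  # copy, so the original code field is preserved
    code = record["element_code"]
    decoded["element"] = ELEMENT_TERMS.get(code)
    # This second lookup only works because the original code was kept.
    decoded["measurement_set"] = MEASUREMENT_SETS.get(code)
    return decoded


records = [
    {"specimen": 1, "element_code": "M1"},
    {"specimen": 2, "element_code": "M3"},
]
decoded_records = [decode_record(r) for r in records]
print(decoded_records)
```

Both specimens end up labeled "mandible," but because "element_code" is still present, the distinction between measurement sets survives the decoding.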

Because of all this complexity, it is hard to fully automate decoding, even if you are lucky enough to have machine-readable “look-up” tables that relate specific codes to their meanings. In practice, codes may be inconsistently applied, or applied according to some tacit set of rules that makes them hard to understand. Mistakes happen when unpacking complicated coding schemes. It really helps to use tools like Google Refine / OpenRefine that record and track all edits and changes and allow for the rollback of mistakes.
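One small check that can be scripted before any decoding is a tally of values that do not appear on the coding sheet at all, so they can be sent back to the original analyst rather than guessed at. This is only a sketch under simple assumptions: the lookup table and the sample column are hypothetical, and it catches typos and stray codes, not codes that are valid but misapplied.

```python
from collections import Counter


def audit_codes(values, lookup):
    """Count values in a coded column that have no entry in the lookup table."""
    normalized = [value.strip() for value in values]
    return Counter(value for value in normalized if value not in lookup)


# Hypothetical lookup table and coded column.
lookup = {"1": "mandible", "2": "humerus"}
column = ["1", "2", " 1", "11", "2", "x", "11"]

unknown = audit_codes(column, lookup)
for code, count in unknown.most_common():
    print(f"unrecognized code {code!r}: {count} record(s)")
```

The value " 1" is accepted after trimming whitespace, while "11" and "x" are reported with their counts. Whether "11" is a typo for "1" is exactly the kind of judgment call that needs the data author.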

Finally, the issues around decoding help illustrate that treating data seriously has challenges and requires effort. One really needs to cross-check and validate the results of decoding efforts with data authors. That adds effort and expense to the whole data sharing process. It’s another illustration of why, in many cases, data sharing requires levels of effort and professionalism similar to those of other, more conventional forms of publication.

Decoding is necessary to use and understand data. Why not do it at the dissemination stage, when it only has to be done once and can be done in collaboration with the data author? Why make future researchers struggle through often complicated and incompletely documented coding systems?

Support for our research in data publishing also comes from the ACLS and the NEH. Any views, findings, conclusions, or recommendations expressed in this post do not necessarily reflect those of the funding organizations.