Discarded Hard Drives: Data Science as Debugging

As a University professor, when setting data orientated projects to Computer Science undergraduates, I used to find it difficult to get students to interact properly with the data. Students tended to write programs to process the data, produce a couple of plots, but fail to develop any understanding of what useful information was in the data, and how best to extract it.

Something that I feel really helped is an analogy that, I think, gets software engineers into the right mindset for working with data analysts.

Data science is not like programming, it is like debugging.

Imagine you are given a body of code, perhaps that was found on a discarded hard drive on the street. Imagine that you knew that the hard drive contained code that you need to make your production system run. You need to integrate some of the code on the hard drive.

Unfortunately, there is a lot more code on the hard drive than the code you need. The code is undocumented, and it’s of uncertain provenance. How do you go about integrating this codebase? The answer is: very carefully.

In the end, the new code you need to run in production may only be a couple of lines: the correctly formated API call to the relevant library on the hard drive. But the amount of thought and time placed into those lines of code will be more than you’ve placed into any line of code you’ve previously written.

The tools we use for data science: interactive notebooks like jupyter, visualization techniques, plotting libraries etc.. These tools are sophisticated ways of peering into the memory of the computer, manipulating what’s in there, making sure the information inside is extracted in a safe way. You will write many lines of code exploring the data. Just as you would write many lines of test code to understand the behaviour of the discarded hard drive does. But the final lines of code that extract value from the data may be very few, like the short API call to code on the discarded hard drive.

This perhaps also explains why it can be hard to plan and resource data science projects, it’s akin to planning and resourcing the debugging of a codebase of unknown provenance.

So what are the lessons here?

When you begin an analysis, behave as a debugger.

Write test code as you go. Document those tests and ensure they are accessible by others.

Understand the landscape of your data. Be prepared to try several different approaches to the data set.

Be constantly skeptical.

Use the best tools available, develop a deep understand how they work.

Never go straight for the goal: you’d never try and write the API call straight away on the discarded hard drive, so why are you launching your classification algorithm before visualising the data?

Ensure your analysis is documented and accessible. If your code does go wrong in production you’ll need to be able to retrace to where the error crept in.

When managing the data science process, don’t treat it as standard code development.

Don’t deploy a traditional agile development pipeline and expect it to work the same way it does for standard code development. Think about how you handle bugs, think about how you would handle very many bugs.

Don’t leave the data scientist alone to wade through the mess.

Integrate the data analysis with your other team activities. Have the software engineers work closely with the data scientists. This is vital for providing the data scientists with the technical support they need, but also managing the expectations of the engineers in terms of when and how the data will be able to deliver.