Why to Keep a Log File and How to Keep It in R

January 29, 2018

One of the first things we are taught in Programming 101 is to write well-structured, commented code. And as any newbie would, we ignore this lesson and focus on achieving the end result. Recently, I wrote an R (the R language!) script to be run on files amounting to 30 GB! This was my first professional experience after graduation and I did not want to fuck up. So I structured the code, wrote all the comments, and ran it on all the files. And what happened next?!

Not all the files were structured the same way, so my script broke on a few of them, leaving my final data set missing some very important data. Moreover, my script was deleting some rows from every file, so I was tampering with the original data set without any logical, concrete reason. It might not sound as significant as failing to achieve the final goal, but believe me, in data science, if your data is not representative of the true data set, your analysis is considered void.

A log file is a file that records the events that occur during a process. It helps you trace back through the process and discover whether anything has gone wrong.
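
As a minimal sketch of that idea in R, the helper below appends one timestamped line per event to a plain-text file. The function name log_event, the file location, and the line format are all assumptions made for illustration; base R does not prescribe any of them.

```r
# Hypothetical helper: append one timestamped event to a plain-text log file.
log_event <- function(log_path, level, message) {
  line <- sprintf("%s [%s] %s",
                  format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
                  level,
                  message)
  # append = TRUE keeps earlier events instead of overwriting them
  cat(line, "\n", sep = "", file = log_path, append = TRUE)
  invisible(line)
}

log_path <- file.path(tempdir(), "process.log")  # assumed location
unlink(log_path)                                 # start this demo from an empty log

log_event(log_path, "INFO", "Script started")
log_event(log_path, "WARN", "Column 'id' missing in input")

# Read the log back to see what was recorded
readLines(log_path)
```

Appending rather than overwriting is the important design choice here: the file then keeps the full history of a run, in order.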

Reasons to Keep a Log File

So how do you account for such cases? Maintain a log file! If you need more reasons for maintaining one, here are a few I can think of:

Large data sets follow Murphy’s Law: anything that can go wrong will go wrong. A log file is the best way to keep track.

When running a common script on multiple files, a log file gives you the gist of the whole process.

A log file serves as a future reference, both for yourself and for others who will use the script or the data set again.
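
To make these reasons concrete, here is a rough sketch of running one script over several files while logging what happened to each. Everything in it is an assumption for illustration, not the script described above: the toy CSV files, the process_file and run_with_log functions, the rule of dropping incomplete rows, and the log format. The point is only that a file with an unexpected structure gets recorded instead of silently breaking the run, and that dropped rows are counted.

```r
# Hypothetical per-file step: read a CSV and drop incomplete rows.
process_file <- function(path) {
  raw <- read.csv(path)
  if (!"value" %in% names(raw)) stop("expected column 'value' not found")
  clean <- raw[complete.cases(raw), , drop = FALSE]
  list(rows_read = nrow(raw), rows_kept = nrow(clean))
}

# Run the step on every file; log a summary line or the error, never abort.
run_with_log <- function(paths, log_path) {
  for (path in paths) {
    entry <- tryCatch({
      res <- process_file(path)
      sprintf("OK    %s rows_read=%d rows_kept=%d rows_dropped=%d",
              basename(path), res$rows_read, res$rows_kept,
              res$rows_read - res$rows_kept)
    }, error = function(e) {
      sprintf("ERROR %s %s", basename(path), conditionMessage(e))
    })
    cat(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " ", entry, "\n",
        sep = "", file = log_path, append = TRUE)
  }
  invisible(log_path)
}

# Toy data: one well-formed file, one with a different structure.
work_dir <- file.path(tempdir(), "log_demo")
dir.create(work_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, value = c(10, NA, 30)),
          file.path(work_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 1:2, amount = c(5, 6)),
          file.path(work_dir, "b.csv"), row.names = FALSE)

log_path <- file.path(work_dir, "run.log")
unlink(log_path)  # start this demo from an empty log
run_with_log(file.path(work_dir, c("a.csv", "b.csv")), log_path)
readLines(log_path)
```

With a log like this, one glance tells you which files failed and how many rows each successful file lost, which is exactly the information that was missing in the story above.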

What to Write in a Log File

So, okay! I know a log file is important, but what do I write in the log file? It depends on the use case. As someone who works with data daily, I usually maintain the following parameters in my log file:

Achyut Joshi, who in his own words is a jack of all trades but a master of none, is a recent engineering undergrad whose only motive in life right now is to figure himself out. He regularly writes about data, books, and random stuff on his personal blog.