The Essence of Data Visualisation

Last year (2016), after the KCSE results were released, my friend Chris Orwa, a data scientist, wrote a blog post analysing performance over the past years, looking to see what changed in 2016. It’s a good post and I encourage you to have a look at it. However, I had one problem with it: the visualisation he used. I told him that I thought I could do better.

Some background: KCSE stands for the Kenya Certificate of Secondary Education, an exam taken at the end of secondary school in Kenya. It is used to determine which university you’ll attend and what course you can take, so it is very high stakes and could very much determine your quality of life in the future. The first KCSE exam was held in 1989.

Chris’ heat map

It’s taken me 7 months to get round to this post :D.

I felt his choice of visualisation wasn’t the best because it doesn’t clearly demonstrate one of his key points:

Of another interest is the consistent band between C plain and D plain between 2006 and 2015. There’s no gender bias in this range, the same percentage between boys and girls (just birds of the same feather) :). In 2016, most of this band of brothers and sisters regressed further with most having a D minus. Even though there isn’t evidence of them participating in exam irregularities, the strict examination environment made their performance worse.

If you’re not used to reading heat maps like the one he used, it’s almost impossible to picture his statement: …most of this band…regressed further with most having a D minus. In the heat map, this is represented by the dark blue squares on the 2016 rows. The dramatic shift this represents in the data is hard to see there, but I found the perfect way to show it: a line graph.

But first, did you know that in 2015 the number of students doing KCSE went from 45967 to 521240? That’s a 1034% increase (more than 11 times as many students)!!! I discovered this quite by accident when I tried to plot the raw numbers:

Correcting for that, I converted the number of students at each grade in each year into a percentage of that year’s total, to make the years more comparable.
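For anyone who wants to try the same correction, here is a minimal sketch of the idea in Python. It is not the code I used for the charts; the function name `to_percentages` and the counts are made up for illustration, and the counts are not real KCSE figures. It simply divides each grade's count by that year's total, so a small cohort and a cohort ten times larger end up on the same scale.

```python
def to_percentages(counts_by_year):
    """Convert {year: {grade: count}} into {year: {grade: percent of that year}}."""
    percentages = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        if total == 0:
            # Avoid dividing by zero for a year with no recorded students.
            percentages[year] = {grade: 0.0 for grade in counts}
            continue
        percentages[year] = {
            grade: 100.0 * n / total for grade, n in counts.items()
        }
    return percentages


# Placeholder counts, invented for illustration only (NOT real KCSE data).
example_counts = {
    "small cohort": {"A": 10, "C": 60, "E": 30},
    "large cohort": {"A": 100, "C": 600, "E": 300},
}
example_percentages = to_percentages(example_counts)
```

With the raw counts, the large cohort would dwarf the small one on a shared axis. After the conversion both cohorts have the same shape, which is what makes year-to-year comparison possible.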

What the data suggests is that in previous years (2010-2015) grading was done on a curve, given how consistent the percentage of students in each band remained from year to year. In his post, Chris points out that 2015 is considered the most widely leaked exam in Kenya’s history, yet that year’s curve is not all that different from previous years. The slight difference, if you look carefully, is that more people (in absolute percentages) did better than in earlier years.

2016, in my opinion, was a radical shift in how exams were marked in that there was no standardization of grades at all. About 66% of students got grades of D+ or lower compared to the average of about 45% in previous years.
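A figure like "D+ or lower" is just the sum of the percentages for every grade from D+ down to E. Here is a rough sketch of that sum, assuming the percentages for one year are already in a dictionary keyed by grade. The helper name `share_at_or_below` and the sample distribution are hypothetical, and the numbers are placeholders rather than the real 2016 data.

```python
# KCSE grades from best to worst.
GRADES = ["A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "D-", "E"]


def share_at_or_below(percent_by_grade, cutoff="D+"):
    """Sum the percentages for the cutoff grade and every grade below it."""
    start = GRADES.index(cutoff)
    return sum(percent_by_grade.get(grade, 0) for grade in GRADES[start:])


# Placeholder distribution, invented for illustration only (NOT real KCSE data).
example_year = {
    "A": 5, "A-": 5, "B+": 10, "B": 10, "B-": 10, "C+": 10,
    "C": 10, "C-": 10, "D+": 10, "D": 10, "D-": 5, "E": 5,
}
example_share = share_at_or_below(example_year, cutoff="D+")
```

Running the same function over each year's percentages gives one number per year, which makes a shift like the one in 2016 easy to compare against the earlier average.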

The point of this post is to highlight different ways of presenting data and how they might be interpreted. The job of a data scientist is to use data to tell stories, so it’s extremely important to think about your audience and the best way to get your point across. As data scientists, it’s very easy to make assumptions about the audience and their level of familiarity with how we work. We must be cognizant of these assumptions and question them at every turn.

I would love to hear your thoughts on this: which way of presenting the data is better, mine or Chris’? I will say this for his visual: it’s able to pack in more information than mine. Let me know in the comments.