This year @InfoTimes_ won the @GENinnovate Data Journalism Award with the story "Murder in the arms of marriage". For the @journocode #ddj advent calendar, data journalist @IslamSalahuddin describes how he and his team worked on the project from start to finish.

In the story "Murder in the arms of marriage", the data team at the award-winning InfoTimes analyzed news reports on murder cases between wives and husbands in Egypt.
They worked on the project for ten months, and it helped them win the GEN Data Journalism Award this year.
Islam Salahuddin will describe for us how the team worked on the project from beginning to end, going through the details of the different steps.

InfoTimes is Egypt's only data-driven journalism agency and one of very few data-focused newsrooms in the Middle East. The team had started working on the project before I joined nine months ago. I was responsible for it from nearly the middle of the process until the story was published.

The goal was to analyze the reasons, tools and frequency of murders and attempted murders between wives and husbands in Egypt. The story idea came from a public debate that was emerging at the time, especially on social media, about the rising frequency of such crimes. So, we decided to settle the debate and explore what was behind it – with data.

News stories as a data source

First, we needed a dataset to analyze. In Egypt, we still do not have NGOs or other organizations that collect data about such crimes, or about almost anything else! Contacting the police departments or any governmental institution was not an option because, you know, transparency about data is not really their thing in our region. That's why we decided to rely on news stories as a source of data about the crimes we aimed to analyze.

To make things clear, news stories are not the best source of data:

News reports don't cover each and every crime that happened in the geographic area and during the time period we were concerned with.

Some news reports may not be fully accurate.

Some news stories may lack fundamental information about the case.

This is in addition to the possibility of repetition and other issues.

This treemap compares the ways in which husbands and wives are killed. Female murder victims are stabbed more than twice as often as male victims, while male victims are poisoned more often.

How to work with shaky data

Since we had no better choice than depending on the news sources, though, we dealt with these issues directly:

We tried to make it as clear as possible for the readers that we were analyzing only a sample of the crimes that actually happened, that this sample was not representative, that it just depended on the news coverage – and we admitted that it, consequently, followed its biases.

We tried to verify each crime that went through our analysis from multiple sources of news.

We worked on completing the missing pieces of information from complementary news sources.

After all of that – a practice which is even more important in data journalism – we disclosed our data source to the readers, together with a short description of the data we gathered from it, as an intro to the story.

This bubble chart compares the reasons for which husbands and wives are killed. We can see that wives are killed for a wider variety of reasons. It also shows that cheating is a major reason for killing husbands.

Scraping: Extract the info you need from a website

We then had to decide which news website we would depend on for the coverage. The editor chose the site Youm7 because it is known for being one of the websites with the most continuous and systematic coverage of crime. It is also the most-read news website in Egypt according to Alexa, and it is well known to Egyptian readers. In addition, its website structure was consistent and organized, so it was easy to scrape stories from it.

The word “scraping” means extracting the news stories from the website and recording them in a dataset, basically a table of rows and columns. Each row recorded one story, detailing, across the columns, its title, date, excerpt, category and URL. So we scraped the crime section of the website, using a piece of programming code. And so we had a dataset of recorded news stories about the covered crimes that happened in Egypt during a certain time period. It still contained a lot more information than we needed, in addition to being messy and incomplete with regard to the murder cases.

Afterwards, the team went through the news stories. We had to classify them based on whether the story was about a murder or attempted murder between spouses, or about something else. So we excluded all the news stories about other crimes that we were not concerned with – and found ourselves left with 222 cases to focus on.
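The article does not say which language or library the team's scraper used, so the following is only a minimal Python sketch of the idea described above: walk through a listing page and record one row per story with its title, date, excerpt, category and URL. The HTML snippet, the class names (`story`, `title`, `date`, `category`, `excerpt`) and the `example.com` URLs are all made up for illustration; a real page would have its own markup, and would have to be downloaded first.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical markup: the real structure of the scraped site is not given in the article.
SAMPLE_HTML = """
<div class="story">
  <a class="title" href="https://example.com/story/1">Example headline one</a>
  <span class="date">2017-03-01</span>
  <span class="category">Crime</span>
  <p class="excerpt">Short summary of the first story.</p>
</div>
<div class="story">
  <a class="title" href="https://example.com/story/2">Example headline two</a>
  <span class="date">2017-03-02</span>
  <span class="category">Crime</span>
  <p class="excerpt">Short summary of the second story.</p>
</div>
"""

TEXT_FIELDS = ("title", "date", "category", "excerpt")
COLUMNS = ["title", "date", "excerpt", "category", "url"]


class StoryParser(HTMLParser):
    """Collects one dict per <div class="story"> block (assumes no nested divs)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, one per story
        self._row = None    # the row currently being filled, if any
        self._field = None  # the text field we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        css_class = attrs.get("class") or ""
        if tag == "div" and css_class == "story":
            self._row = {column: "" for column in COLUMNS}
        elif self._row is not None and css_class in TEXT_FIELDS:
            self._field = css_class
            if css_class == "title":
                self._row["url"] = attrs.get("href") or ""

    def handle_data(self, data):
        # Only keep text that sits inside one of the tagged fields.
        if self._row is not None and self._field is not None:
            self._row[self._field] += data.strip()

    def handle_endtag(self, tag):
        if tag == "div" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        self._field = None


def scrape(html):
    """Return a list of row dicts extracted from the listing HTML."""
    parser = StoryParser()
    parser.feed(html)
    return parser.rows


def to_csv(rows):
    """Serialize the rows as CSV text, one story per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape(SAMPLE_HTML)
print(to_csv(rows))
```

The resulting table still holds every crime story on the page, which is why a classification pass like the manual one described above is needed afterwards.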

This chart compares the number of victim husbands and victim wives in each quarter across the years covered. The lines represent the quarterly average for each year. We can see that the victim is usually the wife across all the years.

How to make your data show the full picture

To be able to analyze it thoroughly, we had to extend the dataset with more details about each case. We added new columns, which were not originally present in the dataset we scraped, to show who the murderer was (the husband or the wife), who the victim was and, additionally, the tool of the murder, the place of the crime, the age of the murderer and the age of the victim. This is when I joined the team, and this was the most time-consuming stage.
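The team kept these columns in a spreadsheet, but the shape of an extended row can be sketched in Python as well. The class and field names below (`CaseRecord`, `perpetrator`, `tool` and so on) are hypothetical labels for the columns listed above, not the team's actual headers. The small helper shows which hand-coded columns are still empty for a case, which is the kind of gap that had to be filled from other news reports.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CaseRecord:
    # Columns that came from the scrape
    title: str
    date: str
    url: str
    # Columns added by hand afterwards; None means "not filled in yet"
    perpetrator: Optional[str] = None      # "husband" or "wife"
    victim: Optional[str] = None           # "husband" or "wife"
    tool: Optional[str] = None
    place: Optional[str] = None
    perpetrator_age: Optional[int] = None
    victim_age: Optional[int] = None


def missing_columns(record: CaseRecord) -> List[str]:
    """Names of the columns that still have no value for this case."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Made-up example row, only to show the helper in use
example = CaseRecord(
    title="Example headline",
    date="2017-03-01",
    url="https://example.com/story/1",
    perpetrator="husband",
    victim="wife",
)
print(missing_columns(example))
```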

The dataset was still messy, because of initial mistakes in the scraped data plus human mistakes made by the team. It was my responsibility to clean the data. Typical errors were typos and the same word written with two different spellings or in different formats, in addition to duplicates and other problems. I did the data cleaning for this story using Google Spreadsheets. I could have used Microsoft Excel, but I personally often prefer the former for its online availability and simpler interface. After the data was clean, I started my analysis, also using Google Spreadsheets, and found some results that could be interesting for the readers and that we could build the story on.
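The cleaning for this story was done by hand in Google Spreadsheets. For readers who prefer code, the same kinds of fixes mentioned above (inconsistent formats, spelling variants, duplicates) might look roughly like this in Python. The rows, the `tool` and `url` column names and the `CANONICAL` spelling map are invented for the example and are not taken from the real dataset.

```python
from collections import Counter

# Hypothetical map from spelling variants to one canonical label
CANONICAL = {
    "knive": "knife",
    "kitchen knife": "knife",
    "poisson": "poison",
}


def normalize(value):
    """Trim the cell, collapse repeated whitespace and lowercase it."""
    return " ".join(value.split()).lower()


def clean_rows(rows):
    """Unify the spelling of the 'tool' column and drop rows with a repeated URL."""
    seen_urls = set()
    cleaned = []
    for row in rows:
        url = row["url"].strip()
        if url in seen_urls:
            continue  # same story recorded twice
        seen_urls.add(url)
        tool = normalize(row["tool"])
        tool = CANONICAL.get(tool, tool)
        cleaned.append({**row, "url": url, "tool": tool})
    return cleaned


# Made-up messy rows, only to exercise the functions
messy = [
    {"url": "https://example.com/1", "tool": " Knife "},
    {"url": "https://example.com/1 ", "tool": "knife"},
    {"url": "https://example.com/2", "tool": "Poisson"},
    {"url": "https://example.com/3", "tool": "kitchen  knife"},
]

cleaned = clean_rows(messy)
tool_counts = Counter(row["tool"] for row in cleaned)
print(tool_counts.most_common())
```

Counting a cleaned column like this is the code equivalent of a pivot table in a spreadsheet, and only gives meaningful totals once the variants have been merged.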

This diagram compares the ages of the murderers on both sides. The width of the grey area reflects the width of the age range of the murderers. We can see that in the sample we have analyzed, husbands tend to kill their wives at almost all stages of their lives. However, they are most likely to commit the crime in their thirties.

After discussions with my editor on these results, we started drafting, editing, co-editing, polishing and translating the story. While I was editing the text, I was also working with my editor to see which data visualizations would best tell the story we had. I built the visualizations using Tableau Desktop. It is a relatively easy and quick tool for digging visually into your dataset and coming up with appealing, interactive visualizations that can be presented to the public online. The resulting visualizations are usually quite slow to load, though, which is a catastrophic issue for websites, especially journalistic ones. Tableau visualizations also usually do not look exactly the same across different browsers and devices, which means you'll have to compromise on some of your design choices most of the time. Anyway, we believed Tableau Desktop would do in our case, and, to a large extent, it did.

The discussions between me and my editor at this stage included choosing the chart types and choosing colors, shapes and sizes. Almost all of the successful choices of colors in particular are my editor's (he's my ex-editor now, so I'm not dissembling! Swear! I mean, kind of!).

The final visualization is an interactive dashboard that lets the viewer hover over each case to know all the data we have for it. It also allows the viewer to use multiple filters to explore our data from the angle of interest.

Bringing a new perspective to a live debate

We published the story in both Arabic and English. The Arabic version is for the typical local readers, and the English version is for local readers who prefer English, for non-Arab readers and for global awards and competitions. The story with the title 'Murder in the arms of marriage: Story of 222 cases' was one of the stories that helped us win the GEN's Data Journalism Awards 2018 as the best small data journalism team. Locally, it was one of the stories that touched on a live debate with an approach that the audience was not yet familiar with, so it sparked some interest both in our analyses in particular and in data-driven journalism in general.

About

Islam Salahuddin

I'm a freelance data journalist, researcher, writer and translator. I previously worked for InfoTimes. I'm also particularly interested in interactive storytelling techniques and innovation in media.

Runs on:

How many stickers do you have on your laptop? Guess what. None!

How many pie charts have you built? Hehehehe. A lot! But most of them change later. I always try to escape from pie charts.

Rate your CMS from 1 to 10. 1. Very bad!

How many times per week do you have to explain what "data journalism" is? Two to three times a day. It is good to have such an interesting topic when you are with friends or family, or when you are getting introduced to someone new. The sense of uniqueness is worth it!

Swear words per day? I believe this depends on whether you mean secretly or openly. I'm generally very successful in keeping the swearing out of my public speaking.

How big was your biggest data set? I cannot remember, but I still remember the first time I reached the maximum number of cells in Google Spreadsheets (which is 200,000 cells). That was about two years ago. I felt like I was reaching for the sky!

Your funniest file name? I have a folder with the title 'لا شيء', which means 'Nothing!' in Arabic. This is the folder where I keep some of my writings that I cannot classify or name!