Mapping the GDELT data in R (and some Russian protests, too)

In this post I show how to select relevant bits of the GDELT data in R and present some introductory ideas about how to visualise it as a network map. I've included all the code used to generate the illustrations. Because of this, if you're here for the shiny visualisations, you'll have to scroll way down.

The Guardian recently published an article linking to a database of 250 million events. Sounds too good to be true, but as I'm writing a PhD on recent Russian memory events, I was excited to try it out. I downloaded the data, generously made available by Kalev Leetaru of the University of Illinois, and got going. It's a large 650 MB zip file (4.6 GB uncompressed!), and this is apparently the abbreviated version. Consequently this early stage of the analysis was dominated by eager anticipation, as the Cambridge University internet did its thing.

The data comes bundled with a few Python and R scripts, which I quickly set about shaping to my needs. Although hampered by my lack of Python knowledge, I got the extract script running easily enough, and was soon pulling out all events associated with Russia. It was still a bit unclear what these 'events' were.

According to David Masad, there are four types of event:
1. Material Cooperation
2. Verbal Cooperation
3. Verbal Conflict
4. Material Conflict

I was still not sure exactly what these represented, but decided to press on and look at the data more closely. The events recorded are very obviously skewed towards the present, so any time-series conclusions should be taken with a pinch of salt, or at the very least adjusted somehow.

I was a bit worried about how R would handle a file that size, so I tried reading in a few lines at a time. The first line looked like this.

Clearly the data is tab delimited. The first column is the date, the second and third actor codes, then three columns I later discovered were 'event code, quad category, and Goldstein Scale'. The last six columns are longitude and latitude, representing the location of the actors and the event.
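A minimal sketch of peeking at the top of a large file without loading it all, assuming a tab-delimited extract. The temporary file and the two rows in it are invented stand-ins so that the snippet runs on its own; point `path` at the real extract instead.

```r
# Toy stand-in for the extract: two made-up, tab-delimited rows with the
# twelve columns described above. Replace `path` with the real file.
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "19790101\tRUS\tUSA\t042\t1\t1.9\t55.75\t37.62\t38.90\t-77.04\t55.75\t37.62",
  "20110704\tRUSGOV\tGOV\t141\t4\t-6.5\t59.94\t30.31\t59.94\t30.31\t59.94\t30.31"
), path)

# Read only the first n lines; the rest of the file is never touched
first_lines <- readLines(path, n = 2)

# Split the first line on tabs to check the delimiter and count the fields
fields <- strsplit(first_lines[1], "\t", fixed = TRUE)[[1]]
length(fields)
```

Raising `n` gives a larger sample to experiment on before committing to a full load.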

By the way, for anyone preferring to do this in Python, which frankly makes a lot of sense, here is a great tutorial for finding exactly the data you need.

Now for loading the data in. As it turned out I had enough memory to load the data, but would run out quite quickly if I kept it all in its raw shape. This was when I realised just how massive this data set is: more than 3 million events featuring Russia. Wow.

First I renamed the columns, fixed the dates, and saved the file in R's native RData format (this got the data down to a positively minuscule 37 MB).
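The three steps could look roughly like this. The column names are my own labels rather than official GDELT field names, the latitude/longitude order within each pair is an assumption, and the two rows are invented so the sketch is self-contained.

```r
# Invented two-row sample in the twelve-column layout described earlier
raw <- paste(
  "19790101\tRUS\tUSA\t042\t1\t1.9\t55.75\t37.62\t38.90\t-77.04\t55.75\t37.62",
  "20110704\tRUSGOV\tGOV\t141\t4\t-6.5\t59.94\t30.31\t59.94\t30.31\t59.94\t30.31",
  sep = "\n"
)

# Read codes as character so leading zeros in event codes survive
events <- read.delim(
  text = raw, header = FALSE, stringsAsFactors = FALSE,
  colClasses = c(rep("character", 4), "integer", rep("numeric", 7))
)

# Hypothetical column labels (lat/lon order within each pair is assumed)
names(events) <- c("date", "actor1", "actor2", "event_code", "quad", "goldstein",
                   "a1_lat", "a1_lon", "a2_lat", "a2_lon", "ev_lat", "ev_lon")

# Dates arrive as strings like "19790101"; convert them to Date objects
events$date <- as.Date(events$date, format = "%Y%m%d")

# Save in R's compressed native format
out <- file.path(tempdir(), "russia_events.RData")
save(events, file = out)
```

A later session can then restore the data frame with `load(out)` instead of re-parsing the text file.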

The list of actors looks promising: Russia, Russian government, a generic 'GOV' (what's that about?), USA, MED and MIL which I imagine are medical and military respectively, followed by the Russian military etc. [In the CAMEO codebook MED actually stands for media, not medical.]

The actor2 column looked quite similar, but featured some intriguing entries, such as IGOWSTNAT. Ideas, anyone? [Turns out this IGO refers to international or regional inter-governmental organisations.]
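One way to get this kind of overview is a sorted frequency table per actor column. The vectors below are invented, and the three-letter chunking at the end rests on my understanding that CAMEO actor codes are concatenations of three-character parts; treat it as a reading aid, not a parser.

```r
# Invented actor columns standing in for the real data
actor1 <- c("RUS", "RUSGOV", "GOV", "RUS", "USA", "RUS", "RUSMIL")
actor2 <- c("USA", "IGOWSTNAT", "RUS", "GOV", "RUS", "RUSGOV", "USA")

# Most frequent codes first
top_actor1 <- sort(table(actor1), decreasing = TRUE)
head(top_actor1, 3)

# Pull out every actor2 code that mentions an inter-governmental organisation
igo <- grep("IGO", actor2, value = TRUE)

# Break a compound code into three-character chunks (assumed CAMEO convention)
code <- "IGOWSTNAT"
starts <- seq(1, nchar(code), by = 3)
chunks <- substring(code, starts, starts + 2)
chunks
```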

As for date distribution, it is reasonably sparse until the mid 90s, since which there has been a general rise, with sharp increases in events during the crises of the late 90s and late 2000s.
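The date distribution can be checked by counting events per year. A small sketch with made-up dates:

```r
# Invented event dates; in practice this would be the parsed date column
dates <- as.Date(c("1985-03-01", "1998-08-17", "1998-09-02",
                   "2008-08-08", "2008-08-09", "2008-10-01"))

# Extract the year and tabulate
years <- as.integer(format(dates, "%Y"))
per_year <- table(years)
per_year

# barplot(per_year) would draw the counts as a simple bar chart
```

Because coverage grows towards the present, raw yearly counts mostly reflect reporting volume, so some form of normalisation would be needed before reading them as a trend.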

Now, what about plotting some on a map? For this initial plot I ignored actor and date data, and just used the final two columns of event location, grouping them by count. This left 30,000 distinct geo locations in which events with Russian involvement have been recorded.
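Grouping by location boils down to counting rows per unique latitude/longitude pair. A sketch in base R, with hypothetical column names `ev_lat` and `ev_lon` and invented coordinates:

```r
# Invented event locations, including one row with missing coordinates
ev <- data.frame(
  ev_lat = c(55.75, 55.75, 59.94, 55.75, NA),
  ev_lon = c(37.62, 37.62, 30.31, 37.62, NA)
)

# Drop events without a location, then count events per distinct point
ev <- ev[complete.cases(ev), ]
ev$n <- 1L
loc_counts <- aggregate(n ~ ev_lat + ev_lon, data = ev, FUN = sum)

# Busiest locations first
loc_counts <- loc_counts[order(-loc_counts$n), ]
loc_counts

# A crude map: plot(loc_counts$ev_lon, loc_counts$ev_lat, cex = sqrt(loc_counts$n))
```

Scaling point size by the square root of the count keeps the busiest locations from swamping everything else.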

Clearly there is a lot going on here. Maybe focusing on one type of activity would be more productive. This is where it really gets interesting. The data set uses the CAMEO coding scheme, which includes quite detailed information about event type. Let's isolate examples of civil disobedience: protests, rallies, hunger strikes, strikes, boycotts, obstructions, violent protest, etc.

This reduces the data to 25 000 events. Still a lot.
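In CAMEO these event types all sit under root code 14 (Protest), so one way to isolate them is to keep codes that start with "14". The sketch below assumes the event codes were read in as character strings; the sample codes are only there to exercise the filter.

```r
# Sample event codes, kept as strings so "014" does not collapse to 14
events <- data.frame(
  event_code = c("042", "141", "1431", "145", "190", "014"),
  stringsAsFactors = FALSE
)

# The first two characters give the CAMEO root category
is_protest <- substr(events$event_code, 1, 2) == "14"
protests <- events[is_protest, , drop = FALSE]
protests$event_code
```

If the codes had been read as numbers, "014" would become 14 and be miscounted as a protest, which is why the character type matters here.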

But the CAMEO specifications include information about actors, e.g. the NGOs involved. Hence the mysterious REB, SPY, JUD, etc.

By keeping only entries where at least one agent is a representative of so-called civil society, we are left with 6443 entries.
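A grep-style filter along these lines would do it. Which codes count as civil society is my assumption (NGO, CVL, OPP, LAB and EDU are CAMEO role codes, but the exact set used for the figure is not stated), and the rows are invented.

```r
# Invented protest events with compound actor codes
protests <- data.frame(
  actor1 = c("RUSOPP", "RUSGOV", "RUSCVL", "RUSMIL"),
  actor2 = c("RUSGOV", "NGO", "RUSCOP", "REB"),
  stringsAsFactors = FALSE
)

# Assumed set of civil society role codes, joined into one regular expression
civil_codes <- c("NGO", "CVL", "OPP", "LAB", "EDU")
pattern <- paste(civil_codes, collapse = "|")

# Keep an event if either actor matches any of the codes
keep <- grepl(pattern, protests$actor1) | grepl(pattern, protests$actor2)
civil <- protests[keep, ]
civil
```

Plain substring matching can in principle hit across the boundary between two three-letter chunks, so rare codes are worth checking by eye.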

On the left is all civil society protest activity from 1979 until July 2011, and on the right the activity in the following 12 months. I've not analysed the plot in great detail, but to the naked eye it looks like Moscow saw about half as many (reports of) protests and acts of civil disobedience in 2011-12 as in the whole of the preceding 30 years, which you may or may not find surprising.
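Splitting the events into the two periods is a matter of comparing each date against a cutoff. The exact cutoff day (1 July 2011) is my assumption, and the dates are invented.

```r
# Invented protest dates on either side of the cutoff
dates <- as.Date(c("1991-08-19", "2005-01-10", "2011-06-30",
                   "2011-07-01", "2011-12-10", "2012-03-05"))

# Assumed boundary between the two panels
cutoff <- as.Date("2011-07-01")

# Label each event by period and count
period <- ifelse(dates < cutoff, "1979 to June 2011", "July 2011 onwards")
period_counts <- table(period)
period_counts
```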

On the subject of protest movements, Alex Hanna has already looked at what GDELT has to say about the Arab Spring.

Back to Russia: a more interesting plot can be achieved by drawing lines between the geo location of actors. For the plot below I have only included events between July 2011 and mid 2012, when the dataset ends.

In the plot below the circles represent the number of events occurring in a given location, while the shaded lines represent events involving two actors in different locations. The red end of the edge is the origin, the white the destination. I removed all links to the USA or the southern hemisphere, as these obstruct the map. I was interested to note how few events link Russia and the former Eastern European satellite states (Poland, Hungary, etc.), while there seems to be an extraordinary amount of activity linking Russia and Israel, and possibly also Syria. Finally, it seems that events taking place in the Russian regions, especially the Caucasus, very rarely elicit an international response.
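The edge list behind a map like this can be sketched as follows: keep events whose two actors sit in different places, apply the two exclusions, then count events per origin/destination pair. Column names and data are invented, and filtering the USA by actor code rather than by coordinates is my assumption.

```r
# Invented events: actor codes plus actor locations
ev <- data.frame(
  actor1 = c("RUS", "RUSGOV", "RUS", "RUS", "RUS"),
  actor2 = c("ISR", "ISR", "RUSOPP", "AUS", "USA"),
  a1_lat = c(55.75, 55.75, 55.75, 55.75, 55.75),
  a1_lon = c(37.62, 37.62, 37.62, 37.62, 37.62),
  a2_lat = c(31.78, 31.78, 55.75, -33.87, 38.90),
  a2_lon = c(35.22, 35.22, 37.62, 151.21, -77.04),
  stringsAsFactors = FALSE
)

# An edge needs two different locations
moved <- ev$a1_lat != ev$a2_lat | ev$a1_lon != ev$a2_lon
# Drop anything touching the southern hemisphere
north <- ev$a1_lat >= 0 & ev$a2_lat >= 0
# Drop links where either actor code starts with USA (assumed approach)
not_usa <- !grepl("^USA", ev$actor1) & !grepl("^USA", ev$actor2)

kept <- ev[moved & north & not_usa, ]
kept$n <- 1L

# One row per origin/destination pair, with the number of events as weight
edges <- aggregate(n ~ a1_lat + a1_lon + a2_lat + a2_lon, data = kept, FUN = sum)
edges

# On an open plot, the lines could be drawn with:
# segments(edges$a1_lon, edges$a1_lat, edges$a2_lon, edges$a2_lat, lwd = edges$n)
```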

13 comments:

Great work! Excuse me for the noobish question, but how do you extract just the Russian events from the R scripts provided? I'm still fairly new to all this coding business, any advice would be helpful.

When you download the data a series of Python scripts are provided. I'm a bit of a noob in Python myself, but what I did was open the file GDELT.select.py and hardcode it to give me what I wanted. To do this I blanked out lines 88-138 (these are for the command line functionality) and added the lines

srccode = "RUS"
tarcode = "ALL"

Then I just ran the script through IDLE.

Whatever you do, don't try to load all the data into R directly! Good luck, R

Hi Eric, great question. Yes, it functions like grep, which is quite handy. Later in the analysis I took different approaches depending on what I was looking for. So when I zoomed in on the protests I refined my search using grep to keep only events where at least one of the actors involved was relevant to me. The coding table is here: http://web.ku.edu/~keds/cameo.dir/CAMEO.CDB.09b5.pdf

When I tried to map all the Russian events (http://quantifyingmemory.blogspot.co.uk/2013/04/big-geo-data-visualisations.html) I just kept everything. In the link I just posted I also write a bit more about the difficulty of slicing the data according to the coding schemes used - it works great if the problem is big enough, but if you're looking for unusual stuff, e.g. actors who feature rarely, Cold War espionage assassinations, or similar, then the errors in the coding become a real issue.

Thank you, Rolf, for your thorough reply. The time lapse video is awesome by the way.

I also think that, without precise classification of actor codes and event codes, some errors are inevitable when we try to aggregate information from multiple aspects. Focused study with controllable subsets of codes may be more precise. kinda tradeoff, imo.

Again, this is quite an inspiration. Will go deep and play with it. Thank you.

Hi Rolf, Very wonderful job! Btw, I wonder if there is a way to draw the points with different colors? I have set the color parameter to a factor variable but I get the error below: "Error: Discrete value supplied to continuous scale". Have you also tried this?

This sounds like a ggplot thing. Stack Overflow will be the place to look, but in a nutshell: if you map colour to a variable, it goes within the aesthetics, e.g. aes(x=yourVar, y=yourVar2, colour=yourGroupVar), whereas if you want to name a colour you move it outside the aesthetics: (aes(x, y), colour="red"). Hope that helps? Best, R

Thanks Rolf, this is some awesome stuff! I wonder how I just came across this post after 2 years... Anyway, check out my analysis of conflict dynamics using GDELT data. https://csaladenes.wordpress.com/2015/05/23/insurgent-dynamics-a-systematic-analysis-of-social-unrest-using-the-gdelt-event-database/


Rolf Fredheim

I am a Postdoctoral Research Fellow at the University of Cambridge, where I work on the Conspiracy and Democracy Project. In my work I apply quantitative methods to big textual data, in particular Russian media and literature. I also create a lot of data visualisations, most recently in d3. In my spare time I enjoy rowing and support Tottenham Hotspur.