If you're reading this, chances are you're already excited about Global Data on Events, Location and Tone, better known as GDELT. If you aren't, you should be. Lots has been written about how revolutionary this dataset might be, and I won't try to add to it here.

Instead, let's dive right in! In this tutorial, I'll go through extracting some basic time series from GDELT.

To follow along, go download the data from the GDELT website and unzip it. The data is about 4.6 GB uncompressed, in a series of text files, one per year.

We're going to need only a few libraries to start with: Matplotlib for visualization, datetime for handling date objects, and Pandas for handling, aggregating and reshaping some of the data. Pandas provides great functionality to easily plot time series, so we'll use it for that too. We'll also import defaultdict while we're at it, since it's often useful for data collection.
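The import cell itself isn't reproduced here, so the following is only a minimal sketch of what the later cells appear to rely on: the `dt` alias for `datetime`, the bare `pandas` name, and `defaultdict`. `PATH` is a placeholder for wherever you unzipped the yearly files; the value below is an assumption, not the original setting.

```python
import datetime as dt                 # later cells call dt.datetime(...)
from collections import defaultdict   # accumulates counts per month

import matplotlib.pyplot as plt       # plotting library behind pandas' .plot()
import pandas                         # later cells use the bare name `pandas`

# Hypothetical directory holding files such as "1979.reduced.txt";
# change this to match your own setup.
PATH = "data/gdelt/"
```

Outside a notebook, calling `plt.show()` after a `.plot()` call displays the figure.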

It's important to know how big our dataset is. It's also important to know if the data available over time is biased -- does GDELT have more events for recent years than for distant ones? If so, is that because more has happened recently, or because the data collection has gotten better?

The paper introducing GDELT (warning: large PDF) goes over this, but it'll be good practice to replicate some basic diagnostics.

So let's start with a simple count of just how many events -- all events -- the dataset has per month (a common unit of temporal aggregation). To do that, we'll open each file, figure out which month each event (meaning each row) occurred in, and add them up.

In [5]:

monthly_data = defaultdict(int)  # We'll use this to store the counts
count = 0  # While we're at it, let's count how many records there are, total.
for file_year in range(1979, 2013):
    # print(file_year)  # Uncomment this line to see the program's progress.
    with open(PATH + str(file_year) + ".reduced.txt") as f:
        next(f)  # Skip the header row.
        for raw_row in f:
            try:
                row = raw_row.split("\t")
                # Get the date, which is in YYYYMMDD format:
                date_str = row[0]
                year = int(date_str[:4])
                month = int(date_str[4:6])
                date = dt.datetime(year, month, 1)
                monthly_data[date] += 1
                count += 1
            except (IndexError, ValueError):
                pass  # Skip error-generating rows for now.
print("Total rows processed:", count)
print("Total months:", len(monthly_data))

Total rows processed: 67927691
Total months: 402

Now we just turn this dictionary into a Pandas series, and plot it. Pandas will automatically recognize that we're dealing with a time series, because it's useful like that.

In [6]:

# Sort by date explicitly; recent pandas keeps a dict's insertion order.
monthly_events = pandas.Series(monthly_data).sort_index()
monthly_events.plot()

Out[6]:

<matplotlib.axes.AxesSubplot at 0x10804f510>

As we might expect, the number of events in the dataset isn't uniform, and goes up rapidly in the later years.

Let's repeat the analysis above, but now examine material cooperation and conflict. Very (very very) roughly, is the world becoming more cooperative, or more violent?

In [7]:

material_coop = defaultdict(int)
material_conf = defaultdict(int)
for file_year in range(1979, 2013):
    with open(PATH + str(file_year) + ".reduced.txt") as f:
        next(f)  # Skip the header row.
        for raw_row in f:
            try:
                row = raw_row.split("\t")
                # Check the quadcat, and skip if not relevant:
                if row[4] not in ['1', '4']:
                    continue
                # Get the date, which is in YYYYMMDD format:
                date_str = row[0]
                year = int(date_str[:4])
                month = int(date_str[4:6])
                date = dt.datetime(year, month, 1)
                if row[4] == '1':
                    material_coop[date] += 1
                elif row[4] == '4':
                    material_conf[date] += 1
            except (IndexError, ValueError):
                pass  # Skip error-generating rows for now.

In [8]:

# Convert both into time series:
monthly_coop = pandas.Series(material_coop)
monthly_conf = pandas.Series(material_conf)
# Join the time series together into a DataFrame
trends = pandas.DataFrame({"Material_Cooperation": monthly_coop,
                           "Material_Conflict": monthly_conf})
trends.plot()

Out[8]:

<matplotlib.axes.AxesSubplot at 0x1080a7dd0>

Both seem to have roughly the same shape as the total counts, with material conflict slightly but persistently remaining more likely than material cooperation.
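One way to check whether these trends are more than a by-product of growing coverage is to express each category as a share of all events recorded in the same month. This step is not part of the original analysis; it is a minimal sketch with made-up numbers, assuming `monthly_events` and `trends` have the same shapes as the objects built in the cells above.

```python
import datetime as dt
import pandas

# Made-up monthly counts standing in for the real series.
months = [dt.datetime(1979, 1, 1), dt.datetime(1979, 2, 1)]
monthly_events = pandas.Series([100, 200], index=months)
trends = pandas.DataFrame(
    {"Material_Cooperation": [10, 30], "Material_Conflict": [20, 50]},
    index=months,
)

# Divide every column by that month's total event count (aligned on the date index).
shares = trends.div(monthly_events, axis=0)
print(shares)
```

If the share of conflict events stays roughly flat while raw counts climb, the growth is more likely due to coverage than to a change in the world.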

The Israeli-Palestinian conflict gets a lot of media attention, so we would expect it to be well-represented in the dataset. It's generally considered to be fairly important, with effects spilling over far from where it is actually taking place. It is also one of the case studies that Leetaru and Schrodt use to compare GDELT against a similar dataset in their paper.

All GDELT events have a source and a target actor. These are coded down to an impressive level of specificity, often down to whether a political party is a member of the government or the opposition when the event occurs. For a first pass, however, the top-level actor codes will suffice: ISR for all Israeli actors, and either PSE or PAL for all Palestinian actors. We'll grab only those events which involve Israel-coded actors acting on Palestinian-coded actors, or vice versa.

Incidentally: learn from my mistakes, and RTFM. My first pass of this analysis was way off because I didn't read the GDELT documentation closely enough, and thought that the actor prefix for Palestine was PAL. In fact, almost all of the events are coded as PSE, the UN code for the Palestinian Occupied Territories. RTFM.
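The filter described above can be sketched as a small predicate on the two actor codes. This is an illustration under assumptions rather than the code used to build the dataset: the function name is hypothetical, the example codes are illustrative strings, and which columns of a row hold the source and target codes is not shown here, so the function simply takes two strings.

```python
ISRAELI_PREFIXES = ("ISR",)
PALESTINIAN_PREFIXES = ("PSE", "PAL")


def is_il_pal_event(source, target):
    """Return True if one actor is Israeli-coded and the other Palestinian-coded."""
    src_il = source.startswith(ISRAELI_PREFIXES)
    src_ps = source.startswith(PALESTINIAN_PREFIXES)
    tgt_il = target.startswith(ISRAELI_PREFIXES)
    tgt_ps = target.startswith(PALESTINIAN_PREFIXES)
    return (src_il and tgt_ps) or (src_ps and tgt_il)


print(is_il_pal_event("ISRGOV", "PSEGOV"))  # True: Israeli source, Palestinian target
print(is_il_pal_event("PAL", "ISRMIL"))     # True: the reverse direction
print(is_il_pal_event("ISR", "USA"))        # False: no Palestinian actor
```

Matching on prefixes rather than exact codes keeps the more specific sub-actors (government, military and so on) in the sample.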

Pandas provides some powerful table manipulation tools; I'm partial to pivot tables, possibly due to several years of using Excel heavily for work. Let's pivot the data so that we count the number of events by QuadCat for each month.

In [11]:

# `index` and `columns` replace the `rows` and `cols` keywords of older pandas releases.
pivot = pandas.pivot_table(ilpalcon, values="Day", index=["Year", "Month"],
                           columns="QuadCat", aggfunc=len)
pivot = pivot.fillna(0)  # Replace any missing data with zeros
pivot = pivot.reset_index()  # Make Year and Month regular columns
pivot.head()

Out[11]:

QuadCat  Year  Month   1   2   3   4
0        1979      1   1  16   8  13
1        1979      2   0  14   5   5
2        1979      3  17  47  13  15
3        1979      4   2  18  10  56
4        1979      5  14  55  26  40

Now that we have a nice table of monthly event counts, we need to index it by date. It would also be nice to rename the columns to the QuadCat description. To create a date from the Year and Month, we need to create a function that generates a datetime object from them, and apply it to each row.

In [12]:

# date-generating function:
def get_date(x):
    return dt.datetime(year=int(x["Year"]), month=int(x["Month"]), day=1)

pivot["date"] = pivot.apply(get_date, axis=1)  # Apply row-wise
pivot = pivot.set_index("date")  # Set the new date column as the index
# Now we no longer need the Year and Month columns, so let's drop them:
pivot = pivot[["1", "2", "3", "4"]]
# Rename the QuadCat columns
pivot = pivot.rename(columns={"1": "Material Cooperation",
                              "2": "Verbal Cooperation",
                              "3": "Verbal Conflict",
                              "4": "Material Conflict"})

In [13]:

pivot.plot(figsize=(8,4))

Out[13]:

<matplotlib.axes.AxesSubplot at 0x10849e910>

Interestingly, it looks like Verbal Cooperation is the most common form of interaction, even when violence (Material Conflict) spikes. We can also clearly see the peace process of the 90s, where Verbal Cooperation events are significantly greater than all others, and the spike in Material Conflict when the Second Intifada breaks out.

Finally, let's see what a general 'peace index' might look like, measuring the difference in volume between cooperation and conflict events.
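As a rough illustration of the idea, the sketch below takes the raw difference between all cooperation events and all conflict events per month. The exact definition of the index is an assumption here (a plain, unweighted difference), and the two rows of data are the January and February 1979 counts from the table shown earlier.

```python
import datetime as dt
import pandas

# First two months of the pivot table, with the renamed QuadCat columns.
months = [dt.datetime(1979, 1, 1), dt.datetime(1979, 2, 1)]
pivot = pandas.DataFrame(
    {
        "Material Cooperation": [1, 0],
        "Verbal Cooperation": [16, 14],
        "Verbal Conflict": [8, 5],
        "Material Conflict": [13, 5],
    },
    index=months,
)

# Assumed definition: cooperation volume minus conflict volume, per month.
cooperation = pivot["Material Cooperation"] + pivot["Verbal Cooperation"]
conflict = pivot["Material Conflict"] + pivot["Verbal Conflict"]
peace_index = cooperation - conflict
print(peace_index)
```

Positive values mean cooperation events outnumber conflict events in that month; dividing by the monthly total instead would give a version that is less sensitive to the growth in coverage.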