News is increasingly being produced and consumed online, supplanting print and broadcast to represent nearly half of the news monitored across the world today by Western intelligence agencies. Recent literature has suggested that computational analysis of large text archives can yield novel insights into the functioning of society, including predicting future economic events. Applying tone and geographic analysis to a 30–year worldwide news archive, global news tone is found to have forecasted the revolutions in Tunisia, Egypt, and Libya, including the removal of Egyptian President Mubarak, predicted the stability of Saudi Arabia (at least through May 2011), estimated Osama Bin Laden’s likely hiding place as a 200–kilometer radius in Northern Pakistan that includes Abbottabad, and offered a new look at the world’s cultural affiliations. Along the way, common assertions about the news, such as “news is becoming more negative” and “American news portrays a U.S.–centric view of the world,” are found to have merit.

The emerging field of “Culturomics” seeks to explore broad cultural trends through the computerized analysis of vast digital book archives, offering novel insights into the functioning of human society (Michel, et al., 2011). Yet, books represent the “digested history” of humanity, written with the benefit of hindsight. People take action based on the imperfect information available to them at the time, and the news media captures a snapshot of the real–time public information environment (Stierholz, 2008). News contains far more than just factual details: an array of cultural and contextual influences strongly impact how events are framed for an outlet’s audience, offering a window into national consciousness (Gerbner and Marvanyi, 1977). A growing body of work has shown that measuring the “tone” of this real–time consciousness can accurately forecast many broad social behaviors, ranging from box office sales (Mishne and Glance, 2006) to the stock market itself (Bollen, et al., 2011).

Can the public tone of global news data forecast even broader behaviors, such as the stability of nations, the location of terrorist leaders, or even offer new insight on conflict and cooperation among countries, as accurately as it predicts movie sales or stock movements? This study makes use of a 30–year translated archive of news reports from nearly every country of the world, applying a range of computational content analysis approaches including tone mining, geocoding, and network analysis, to present “Culturomics 2.0.” The traditional Culturomics approach treats every word or phrase as a generic object with no associated meaning and measures only the change in the frequency of its usage over time. The Culturomics 2.0 approach introduced in this paper focuses on extending this model by imbuing the system with higher–level knowledge about each word, specifically focusing on “news tone” and geographic location, given their importance to the understanding of news coverage. Translating textual geographic references into mappable coordinates and quantifying the latent “tone” of news into computable numeric data permits an entirely new class of research questions to be explored via the news media not possible through the traditional frequency count approach.

This study will explore how the latent tone of a large digital news archive can be visualized to understand macro–level changes in global society in both time and space. Measuring the tone of news coverage about a single geography over time, a fundamentally new approach to conflict early warning is developed that “passively crowdsources” the global mood about each country in the world. This is found to offer highly accurate short–term forecasts of national stability. Focusing on the spatial dimension and moving from the country to the city level, the geographic framing of the news is found to offer significant insights into both nationalistic views of the world and the way in which cultures and “civilizations” are portrayed by the media. Finally, mapping the geographies most closely associated with Osama bin Laden by the news media prior to his capture is found to fairly accurately pinpoint his actual location. Global news media tone that is temporally and spatially aware is found to offer an intriguing new approach to modeling the behavior of global society itself.

Data sources

Capturing the global news discourse and accurately measuring the local press tenor in nearly every country of the world requires a data source that continuously monitors domestic print, Internet, and broadcast media worldwide in their vernacular languages and delivers it as a uniform daily translated compilation. Global news databases like NewsBank’s Access World News emphasize English–language “international” editions of foreign media, intended for a foreign audience, while traditional news aggregators like LexisNexis do not include substantial non–U.S. content. Both cover only print media, while broadcast forms one of the primary news sources in many regions of the world, such as the Middle East (Howard, 2010). Even international newswires like Reuters include limited coverage of many regions (the entire continent of Africa represents just five percent of Reuters World Service) (Thomson Reuters, 2011) and do so through the eyes of Western–trained reporters framing events for their Western audiences.

Recognizing the need for on–the–ground insights into the reaction of local media around the world in the leadup to World War II, the U.S. and British intelligence communities formed the Foreign Broadcast Information Service (FBIS — now the Open Source Center) and Summary of World Broadcasts (SWB) global news monitoring services, respectively. Tasked with monitoring how media coverage “varied between countries, as well as from one show to another within the same country ... the way in which specific incidents were reported ... [and] attitudes toward various countries,” (Princeton University Library, 1998) the services transcribe and translate a sample of all news globally each day. The services work together to capture the “full text and summaries of newspaper articles, conference proceedings, television and radio broadcasts, periodicals, and non–classified technical reports” in their native languages in over 130 countries (World News Connection, 2009) and were responsible for more than 80 percent of actionable intelligence about the Soviet Union during the Cold War (Studeman, 1993). In fact, news monitoring, or “open source intelligence,” now forms such a critical component of the intelligence apparatus that a 2001 Washington Post article noted “so much of what the CIA learns is collected from newspaper clippings that the director of the agency ought to be called the Pastemaster General.” (Pruden, 2001)

While products of the intelligence community, FBIS and SWB are largely strategic resources, maintaining even monitoring coverage across the world, rather than responding to hotspots of interest to the U.S. or U.K. (Leetaru, 2010). A unique iterative translation process emphasizes preserving the minute nuances of vernacular content, capturing the subtleties of domestic reaction. More than 32,000 sources are listed as monitored, but the actual number is likely far lower, as the editors draw a distinction between different editions of the same source. Today, both services are available to the general public, but FBIS is only available in digital form back to 1993, while SWB extends back more than three decades to 1979, and so is the focus of this study. During the January 1979 to July 2010 sample used in this study, SWB contained 3.9 million articles. The only country not covered by SWB is the United States, due to legal restrictions of its partner, the CIA, on monitoring domestic press.

Internet–based news

The mandate of the open source intelligence community is to follow news where it is produced and consumed, and while broadcast and print have dominated the majority of SWB’s 70–year history, the global growth of Web–based news has not been lost on it. As Figure 1 shows, over the last 15 years Internet–based news has displaced print and broadcast to represent 46 percent of all content monitored globally by the service last year. Many of these are traditional print and broadcast outlets that are simply monitored online now, while numerous new outlets have sprung up in Web–only format across the world. The open source intelligence community has thus increasingly transitioned from a print and broadcast monitoring service to a Web crawling and translation service.

Not all news outlets are available online, especially in regions with lower Internet penetration, and by using SWB as the primary data source for this study, translated Web–based news is supplemented with print and broadcast news. Through its combination of print, broadcast, and now Internet news, SWB offers a continuous 30–year view into the global media system and its evolution. Such a multi–decade longitudinal dataset is critical to being able to place contemporary developments into perspective: if a country’s tone suddenly becomes sharply more negative, is that a major development, or when looking back over 30 years is it merely a regular seasonal change in tone?

Figure 1: Percent of Summary of World Broadcasts content sourced from the Internet January 1994–July 2010.

Comparison sources

To ensure that the results of this paper are not merely artifacts of the Summary of World Broadcasts collection and to explore the way in which news collections themselves can yield highly disparate world views, two comparison datasets are used: the complete full text of the New York Times 1945–2005, and an archive of global English–language Web–based news content 2006–present. The full text of all 5.9 million news articles published in the New York Times from 1 January 1945 to 31 December 2005, totaling 2.9 billion words, offers the complete population of the paper of record of the United States. While the Times itself is a sampling of global news considered by its editors to be of interest to an American audience, having the complete population of its reporting allows a comparison with SWB’s sampled news archive. The digitized Times compilation used here ends with 31 December 2005, preventing its application to more recent conflicts, so a manual process was used to update the archive with all articles mentioning Egypt or Cairo from 1 January 2006 through 31 May 2011.

A web crawl of English–language Web–based news sites from across the world is used as a proxy for English Web–only news to test how much additional insight is gained through SWB’s ability to penetrate non–Web broadcast and print media and translate vernacular languages. The crawl includes roughly 10,000–100,000 articles per day from 1 January 2006 through 31 May 2011, and includes all URLs indexed by Google News’ front page, main topic pages, and individual country feeds (using its “location:” functionality).

Social media and other news alternatives

Egypt would seem the ideal test case to explore social media content, having the highest penetration of social media of any Middle Eastern or North African country, with more than five million Facebook users. In the weeks prior to the street protests, social media played a critical organizing role, helping to overcome information asymmetry by letting protesters know there were others willing to take to the streets (Kirkpatrick, 2011). However, recognizing the power of social media, the government took strong steps to wrest control of the social media discourse, limiting Internet access, posting statements of support for the regime, falsely announcing that protests had been canceled and trying to obtain information about protesters (Preston, et al., 2011).

Social media appears to have played more of an organizing role, with the traditional state–controlled news media having a far greater effect in guiding broader public opinion towards the protests. One of the first sites secured by the Army when it entered Cairo was the state television headquarters, and TV programming focused on the lawlessness caused by the protests while highlighting the steps the government was taking to restore peace (Fahim, 2011). Indeed, state television’s coverage of the protests, depicting them as “foreign and violent” or ignoring them altogether, isolated the protesters and helped the regime regain its balance in the early days of the protests (Fahim, et al., 2011). Organizers later conceded that relying on social media alone to get their message out, even in a country as wired as Egypt, was not enough and traditional mainstream news media remains the dominant force in driving public opinion in that country (Fahim, et al., 2011).

Citizen media also presents many unique challenges to computational analysis. While some platforms like Twitter do provide programmer interfaces to their content, and blogs are available through several blog aggregators with RSS feeds, other platforms like Facebook actively prevent crawling even for academic study (Warden, 2010). In addition, social media and other localized indicators tend to be in vernacular languages, making use of localized slang or idiomatic expressions, requiring significant translation effort. Social media also show strong geographic disparity, with Twitter users in California and New York producing more content per capita than anywhere else in the United States or even Europe (Signorini, et al., 2011), while questions have been raised as to whether Twitter captures world events as well as it does entertainment and cultural news (Taylor, 2011).

Search engine trends have also been used to estimate on–the–ground interest in a topic by a specific population. A 2011 study measured Arabic–language searches of Egyptians for several political parties, attempting to gauge which party was attracting the greatest interest (Koehler–Derrick and Goldstein, 2011). Yet, Internet penetration varies by country and some governments actively block searches on key topics. In addition, previous studies have suggested search behavior may in fact be driven more strongly by news attention than interest cycles (Cooper, et al., 2005).

Given its importance as an early organizing tool, social media may provide an important complement to traditional news content as an early precursor of citizen unrest, but the technical and linguistic complexities, especially the need to operate on large numbers of vernacular languages across the world, made it beyond the scope of this study. Yet, as the next section illustrates, monitoring mainstream media alone yields a strong picture of Egypt’s social–media–driven revolution, suggesting mainstream media has not necessarily been fully supplanted as a pulse of global views on the world.

Method

The complete runs of all three news sources were subjected to two key text mining techniques: sentiment mining (Hu and Liu, 2004) and full–text geocoding (Goldberg, et al., 2007). Sentiment mining counts up the number of words in a document that appear in precompiled dictionaries of “positive” and “negative” words to determine the density of emotional language and its overall “tone.” A document with many words like “terrible”, “awful”, and “horrific”, and few words like “good” or “nice” would be given a highly negative score by the algorithm, while one with more positive language would be given a more positive score. While not as accurate as humans, automated sentiment mining systems are robust enough that they are now used by most large companies to monitor the online discourse about their products and learn which areas consumers like and dislike.
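
The dictionary–counting idea described above can be illustrated with a minimal sketch. The word lists, the function name `tone_score`, and the choice to report net hits per 100 words are assumptions made for illustration only; they are not the dictionaries or the scoring formula of the tone engine used in this study.

```python
import re

# Hypothetical toy word lists for illustration; real tonal dictionaries
# contain thousands of entries.
POSITIVE = {"good", "nice", "peace", "support"}
NEGATIVE = {"terrible", "awful", "horrific", "hostile"}


def tone_score(text):
    """Return net tone as (positive hits - negative hits) per 100 words.

    Assumption: tone is expressed as a density so that long and short
    documents are comparable. Negative values mean negative tone.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(1 for w in words if w in POSITIVE)
    neg = sum(1 for w in words if w in NEGATIVE)
    return 100.0 * (pos - neg) / len(words)
```

Under these assumptions, a four–word document containing “terrible” and “awful” scores -50.0, while a document consisting only of “good” and “nice” scores 100.0.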

It is important to note that computer–based tone scores capture only the overall language used in a news article, which is a combination of both factual events and their framing by the reporter. A classic example of this is a college football game: the hometown papers of both teams will report the same facts about the game, but the winning team’s paper will likely cast the game as a positive outcome, while the losing team’s paper will have a more negative take on the game, yielding insight into their respective views towards it. Capturing the global reaction to a political event requires precisely this type of composite tonal measure that emphasizes the editorial framing of the event’s significance and tone.

Given the growing popularity of computerized content analysis, there are numerous manually–compiled lists of words grouped into categories capturing specific dimensions of text, such as the degree of anxiety shown by the writer or the level of authoritative language. A cross–section of openly available dictionaries, covering more than 1,500 categories, was tested before settling on tone as the most reliable metric of conflict. Categories tested included the Regressive Imagery Dictionary (Martindale, 1975), WordStat’s version of WordNet 2.0 (Provalis Research, 2005) and the 1911 Roget Thesaurus (Provalis Research, 2003), both the H4 and Lasswell General Inquirer dictionaries (Stone, et al., 1966), the Body Type Dictionary (Wilson, 2006), and the Forest Value Dictionary (Bengston and Xu, 1995). Graphs of the density of each category in news coverage of Egypt, Libya, Tunisia, and Serbia were examined to look for movement around the periods of major conflict. While some measures showed strong movement for one or two of the four test countries, only tone consistently showed change during periods of conflict. This is in keeping with previous work showing that the vast majority of available text categories show little consistent predictive power (Bollen, et al., 2011).

Finally, full–text geocoding attempts to identify, disambiguate, and convert textual geographic references to geographic coordinates. A geocoder searches a document for mentions of locations such as “Cairo” and uses the surrounding context to estimate which of the 39 locations on earth named Cairo the given reference likely refers to. Ultimately the word “Cairo” is converted to an approximate latitude and longitude coordinate. Converting a collection of textual documents to geographic coordinates that can be mapped, such tools allow the geographic focus of an archive to be explored. Both the sentiment mining and full–text geocoder algorithms used in this paper were adapted from the Carbon Capture Report (http://www.carboncapturereport.org/) project.
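
A simplified sketch of the disambiguation step might look as follows. The two–entry gazetteer, the `cues` word sets, the rounded coordinates and population figures, and the scoring rule (count of context cue words, with population as a tie–breaker) are all hypothetical stand–ins chosen for illustration; the geocoder actually used here is not specified at this level of detail.

```python
import re

# Hypothetical miniature gazetteer. Coordinates and populations are rough,
# illustrative figures; a real gazetteer would list every place named Cairo.
GAZETTEER = {
    "cairo": [
        {"place": "Cairo, Egypt", "lat": 30.04, "lon": 31.24,
         "population": 9_000_000,
         "cues": {"egypt", "egyptian", "nile", "giza"}},
        {"place": "Cairo, Illinois, USA", "lat": 37.01, "lon": -89.18,
         "population": 2_000,
         "cues": {"illinois", "ohio", "mississippi"}},
    ],
}


def geocode(name, text):
    """Pick the candidate whose context cues best match the document.

    Assumption: candidates are ranked by the number of cue words found in
    the text; ties go to the more populous place. Returns
    (place, lat, lon), or None if the name is not in the gazetteer.
    """
    candidates = GAZETTEER.get(name.lower())
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", text.lower()))
    best = max(candidates,
               key=lambda c: (len(c["cues"] & words), c["population"]))
    return best["place"], best["lat"], best["lon"]
```

With this toy gazetteer, a report about flooding on the Ohio river near Cairo, Illinois resolves to the U.S. town, while a report that offers no contextual cues falls back to the Egyptian capital.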

Forecasting unrest: Conflict early warning

“Japanese radio intensifies still further its defiant hostile tone; in contrast to its behavior during earlier periods of Pacific tension, Radio Tokyo makes no peace appeals. Comment on the United States is bitter and increased.” — 6 December 1941 (Mercado, 2001)

From the founding of FBIS and SWB in the leadup to World War II, one of their primary tasks was to analyze the tone of domestic radio broadcasts around the world to determine their posture towards the West. The very first analysis report by SWB’s partner service FBIS was dated 6 December 1941, noting that Japanese radio had dropped its appeals for peace and had increased its criticism of the United States. The Japanese struck Pearl Harbor the following morning. While news monitoring will never perfectly predict the precise details of conflict, it can offer critical advanced warning that the posture of a country has changed. Knowing that the Japanese had dropped their appeals for peace and ramped up their attacks on the United States did not itself portend imminent war, but could have provided critical insight to military planners that there was an increased likelihood of possible conflict.

Recognizing the ability of the news to offer crucial predictive insights into conflict, the U.S. government has funded “conflict early warning” research for more than 40 years. The majority of such projects utilize broad annual indicators such as average educational attainment or GDP, or rely on “event databases” where news reports of violence are converted to database entries recording the date, location, and number involved (O’Brien, 2010). However, annual indicators can be misleading, as Egypt’s GDP had been increasing steadily in recent years, while event databases record physical manifestations of violence, such as riots, meaning by the time they warn of a surge in violence, that country is already in the midst of conflict. Indeed, the outcomes of existing work have been extremely poor. Several popular models of civil war have been shown to miss 90 percent of the cases they were designed to describe, with one model correctly predicting 0 of 107 wars it supposedly explained (Ward, et al., 2010).

FBIS and SWB were founded on a qualitative humanities approach to analysis, rather than these modern quantitative political science methods. As the quote which opens this section shows, the linguistic narrative of the news media provides a rich backdrop of tone and latent clues to global views on an event or country. Leveraging automated techniques like sentiment mining, is it possible to use the tonal information of the news to more accurately forecast potential conflict around the world? In particular, by “passively crowdsourcing” the global tone of all coverage discussing a country, does this tone offer insight into future conflict outbreak in that country?

Egypt

On 25 January 2011, popular dissent with the Egyptian state culminated in mass protests that continued through President Mubarak’s resignation on 11 February. Figure 2 shows the average tone by month from January 1979 to March 2011 of all 52,438 articles captured by SWB mentioning an Egyptian city anywhere in the article. Only articles explicitly mentioning an Egyptian city were included, to filter out casual references to Egypt and return only articles reporting on the country in more detail. To normalize the data, the Y axis reports the number of standard deviations from the mean, with higher numbers indicating greater positivity and lower numbers indicating greater negativity. January 2011 reports only the tone for 1 January through 24 January, capturing the period immediately preceding the protests. Only twice in the last 30 years has the global tone about Egypt dropped more than three standard deviations below average: January 1991 (the U.S. aerial bombardment of Iraqi troops in Kuwait) and 1–24 January 2011, ahead of the mass uprising. The only other period of sharp negative movement was March 2003, the launch of the U.S. invasion of neighboring Iraq.
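
The normalization described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `monthly_z_scores` is hypothetical, the mean and standard deviation are assumed to be taken over the series of monthly averages, and the population standard deviation is used because the text does not say which estimator was applied.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_z_scores(articles):
    """Convert per-article tone into monthly Z-scores.

    articles: iterable of (month, tone) pairs, month as 'YYYY-MM'.
    Returns {month: standard deviations from the mean monthly tone}.
    """
    by_month = defaultdict(list)
    for month, tone in articles:
        by_month[month].append(tone)

    # Average tone of all articles in each month.
    monthly = {m: mean(v) for m, v in sorted(by_month.items())}

    values = list(monthly.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return {m: 0.0 for m in monthly}
    return {m: (t - mu) / sigma for m, t in monthly.items()}
```

A month would then be flagged as exceptional in the sense used here when its value falls below -3.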

January was a tumultuous month for Egypt, with the Tunisian revolution having occurred just weeks prior and a devastating bombing of a Coptic Christian church in Alexandria on New Year’s Eve that killed 21 and injured 70. This domestic terrorism attack stoked anger at the government, which had long justified its limits on personal freedoms with enhanced domestic security. Indeed, one local editorial went as far as to call the bombing the result of “betrayal, negligence, and lack of concern on the part of the government” and to argue that the attack was directed “against the State” rather than Christians, with Egypt becoming a “State without law.” (Al–Badi and Sha’ban, 2011) Global media coverage captured this progression towards negativity, recording in particular the massive outpouring of international condemnation of the church bombing and public views of other countries on how the bombing, coming on the heels of the Tunisian revolution, could destabilize Egypt.

Monitoring these qualitative aspects of news coverage provides substantial benefits over the traditional quantitative political science event database approach. An event database can only capture that a bombing took place, but a church bombing in one country might result only in condemnations, while in another it might push it over the edge to revolt. Measuring the global news tone essentially conducts a passive “poll” of the press across the world, summarizing their combined views on the likely outcome of the event, recording whether a bombing results in only a few isolated factual reports, or widespread extreme negativity.

Despite being hailed as a social media revolution, monitoring the tone of only mainstream media around the world would have been enough to suggest the potential for unrest in Egypt. While such a surge in negativity about Egypt would not have automatically indicated that the government would be overthrown, it would at the very least have suggested to policy–makers and intelligence analysts that there was increased potential for unrest. As for the likely impact on the government, Figure 3 lends additional insight, tracing the average monthly tone of all 13,061 stories mentioning Egyptian President Hosni Mubarak during the same time period. The weeks preceding the protests contained the most negative discourse of his nearly 30–year rule. Combined with the graph of tone towards Egypt, this would have suggested to a policy–maker at the time that there could be an increased possibility of unrest in Egypt, possibly even affecting its previously untouchable head of state.

To verify that these results are not merely artifacts of the SWB data collection process, Figure 4 shows the average tone by month of Summary of World Broadcasts Egyptian coverage plotted against the coverage of the New York Times (16,106 Egyptian articles) and the English–language Web–only news (1,598,056 Egyptian articles) comparison datasets. SWB has a Pearson correlation of r=0.48 (n=63) with the Web news and r=0.29 (n=63) with the New York Times, suggesting a statistically significant relationship among the three. All three show the same general pattern of tone towards Egypt, but SWB tone leads Web tone by one month in several regions of the graph, which in turn leads Times tone. All three show a sharp shift towards negativity 1–24 January 2011, but the Times, in keeping with its reputation as the Grey Lady of journalism, shows a more muted response. Yet, the fact that the Times shows a similar overall trend curve to SWB is strong evidence that SWB’s strategy of sampling the global press is not a primary driving force in its results, given that the Times was analyzed in its entirety. Given SWB’s increasing reliance on Web–based news, it is not surprising that it is highly correlated with the Web dataset. This is in spite of SWB’s incorporation of local translated Web content, while the Web collection here consists only of English–language news. The five–year time span of the Web collection means it is not possible to place current events in a historical context to determine the significance of tonal shifts, but the close alignment of its tone with SWB suggests it may become an increasingly competitive alternative to SWB in the future for some types of analysis.
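
One way to examine both the correlation and the apparent one–month lead is a lagged Pearson correlation, sketched below. The helper names `pearson` and `lagged_pearson` are hypothetical, and the lag test is offered only as an illustration of how a lead could be checked; the text reports the lead as a visual observation from the graph rather than as the output of such a test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def lagged_pearson(leader, follower, lag):
    """Correlate leader[t] with follower[t + lag], lag in months (>= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])
```

If one monthly series genuinely led another by a month, `lagged_pearson(first, second, 1)` would be expected to exceed `lagged_pearson(first, second, 0)`.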

Figure 4: Tone of coverage mentioning Egypt, January 2006–March 2011 for Web, Summary of World Broadcasts, and New York Times (January 2011 is 1–24 January). Y–axis is Z–scores (standard deviations from mean).

To verify that the findings are not artifacts of the tonal dictionary selected, the monthly average tone of SWB coverage of Egypt was recalculated using two well–known tonal dictionaries, the Dictionary of Affect in Language (DAL) (Sweeney and Whissell, 1984) and the Affective Norms for English Words (ANEW) (Bradley and Lang, 1999). Both dictionaries have been applied extensively in the literature to quantify the valence of textual content, and DAL in particular has a long history of validation tests. DAL tone is correlated at r=0.66 (n=387) with the Carbon Capture Report’s tone engine, while ANEW is correlated at r=0.46 (n=387). All three dictionaries exhibit the same macro–level patterns with the sharp surge in negativity in January 2011, but the Carbon Capture Report engine’s specific tuning for use on news content means it yields a slightly clearer picture.

Only articles mentioning an Egyptian city are considered for the average monthly tone to discard casual references to Egypt. Articles mentioning a city in Egypt are more likely to be about Egypt in some fashion than ones that mention only the name of the country. Removing this requirement and including all articles that mention Egypt (95,288 articles) results in a tonal curve that is correlated at r=0.85 (n=387) with the city–filtered tone, showing that the results are not an artifact of this filtering process. Rather, by removing noise articles, the resulting tonal graph shows sharper up/down movement, making trends clearer.

As discussed earlier, the SWB content above includes all articles monitored by SWB from every country in every language. Like all monitoring services, limited resources mean that SWB is unable to capture 100 percent of the daily global discourse and thus must select only a sample to record. Even higher population countries like Egypt may have months where very little coverage is captured from their domestic press, while smaller countries may have multiple months without coverage recorded in SWB. To examine the impact of this, the average tone by month was computed for only those articles sourced from a news outlet based in Egypt (18,978 articles), and again for only those articles sourced from Arabic–language outlets anywhere in the world (24,024 articles).

For the period January 1979 to March 2011, there were 50 months that had fewer than three articles from an Egyptian source mentioning an Egyptian city, while in 25 months there were fewer than three articles from an Arabic–language source. (Three articles a month was set as the cutoff below which there was too little coverage to generate meaningful tone averages.) There is an average of 245 articles a month from all sources that mention Egypt, compared with 56 articles a month from Egyptian sources and 66 articles a month from Arabic sources. Despite these limitations, the tone by month during this period of news from all countries is highly correlated with news from Egyptian and Arabic sources. Tone from Egyptian–based sources is correlated at r=0.63 (n=337), while Arabic–language press is correlated at r=0.67 (n=362). The major difference is that both are significantly more muted, and have less pronounced declines in January 2011, likely a reflection of the sharp state media controls that exist in many Arabic–language countries. As noted earlier, Egypt in particular deployed significant resources to mute mainstream coverage of the unrest. Thus, basing tone about each country on a composite of all global coverage, rather than limiting to only coverage from a specific country, mitigates these issues of state media control and censorship, as well as ensuring a higher volume of content, especially for smaller countries.
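
The effect of the three–article cutoff on the number of comparable months can be sketched as below. The function names are hypothetical, and the assumption that correlations are computed only over months surviving the cutoff in both series is inferred from the reported sample sizes (387 months in total, less 50 and 25 months respectively) rather than stated explicitly.

```python
from collections import defaultdict

MIN_ARTICLES = 3  # cutoff stated in the text


def monthly_mean_tone(articles, min_articles=MIN_ARTICLES):
    """Average tone per month, dropping months with too little coverage.

    articles: iterable of (month, tone) pairs, month as 'YYYY-MM'.
    """
    by_month = defaultdict(list)
    for month, tone in articles:
        by_month[month].append(tone)
    return {m: sum(v) / len(v)
            for m, v in by_month.items()
            if len(v) >= min_articles}


def paired_months(series_a, series_b):
    """Align two monthly series on the months present in both.

    Returns (months, values_a, values_b); len(months) is the n that a
    correlation between the two series would be based on.
    """
    months = sorted(set(series_a) & set(series_b))
    return (months,
            [series_a[m] for m in months],
            [series_b[m] for m in months])
```

The two aligned lists can then be passed to any standard correlation routine.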

Tunisia

Tunisia’s revolution set off the Arab Spring, but unlike the other countries examined here, there were simply too few articles in SWB mentioning specific cities in Tunisia (6,636) to perform city–level filtering. In addition, many months had fewer than 10 articles worldwide mentioning the country, compared with 245 a month for Egypt. For Tunisia, all mentions of the country, regardless of whether they mentioned specific cities in Tunisia, were counted, resulting in 16,856 articles. This results in a weaker tonal profile that is less selective and refined than that used for the other countries. Nevertheless, the two–week period prior to Tunisian President Ben Ali’s resignation was the sixth–most negative period in the last 30 years, coming after a decade–long plunge towards increasing negativity.

Libya

By the time full–scale protests in Libya erupted on 15 February 2011, negative tone in the prior two weeks had reached levels seen only four times prior in the last 30 years (14,109 articles). Unlike Egypt and Tunisia, Libya did not show a sharp increase in negative tone in the weeks prior to the first isolated protests in early January and it was not until the beginning of February that tone reached extreme lows. This likely reflects the fact that protests in Libya did not gather steam until mid–February, and hence there was no major shift in tone until the situation was poised on the verge of descent into full conflict.

Serbia

The ethnic conflicts in the Balkans in the 1990s caught many by surprise and offer an ideal cross–check to ensure that the findings are not an artifact of the time period or geographic region. Serbia’s tonal descent began in June 1990, with a sharp and steady fall towards increasing negativity through March 1991 and the Karađorđevo Agreement. Tone stayed at its most negative levels in the measurement period to that point through late 1993. Tone began to increase towards positivity beginning in July 1995, peaking with the Dayton Agreement in December 1995. In early 1996, tone began a sharp downward slide again. By early 1998, tone was at levels not seen since the first conflict. Indeed, this period has the highest levels of negative tone ever seen in Serbia over this 30–year period (96,251 articles), reflecting the intensity of the conflict during this period.

Unlike the Arab Spring’s focus on personal freedoms and governmental representation, the Balkans conflict was primarily ethnically driven. Similar to tone, the average density of ethnic discourse can be measured to determine how “ethnically charged” news coverage of a country has become. Figure 8 shows the percent of all Serbian coverage each month in SWB that included at least one reference to an ethnic group in the Balkans. The increasing density of ethnic discourse is strongly inversely correlated with tone at r=-0.73 (n=379), meaning that ethnic references increased at nearly the same rate as tone about Serbia became more negative, suggesting strong ethnic contextualization of the negativity. Rather than ethnicities, mentions of religion, social groups, needs such as food or water access, desire for democracy, and other topics can be used to similarly contextualize shifts in tone.
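The density measure described above, the share of a month's articles containing at least one reference from a topic list, can be sketched as follows. The term list, the input format and the function name are all hypothetical and chosen only for illustration; the study's actual vocabulary and matching rules are not given in the text.

```python
import re
from collections import defaultdict

# Hypothetical, illustrative term list; the study's actual list is not given.
ETHNIC_TERMS = frozenset(
    {"serb", "serbs", "croat", "croats", "albanian", "albanians"}
)


def ethnic_density_by_month(articles, terms=ETHNIC_TERMS):
    """Percent of articles per month that contain at least one listed term.

    `articles` is an iterable of (month, text) pairs (an assumed format).
    Matching is on whole lowercase words only, a deliberate simplification.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for month, text in articles:
        words = set(re.findall(r"[a-z]+", text.lower()))
        totals[month] += 1
        if words & terms:
            hits[month] += 1
    return {month: 100.0 * hits[month] / totals[month] for month in totals}
```

Swapping in a list of religious, economic or political terms would give the same kind of monthly series for the other topics mentioned above.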

Figure 8: Percent of articles mentioning Serbia that also reference a Balkan ethnic group, Summary of World Broadcasts, January 1979–March 2011.

Saudi Arabia

A reliable forecasting technique must be accurate at predicting both the presence and the absence of the phenomenon being measured. Saudi Arabia, with 31,196 articles, offers an excellent case study of a country that has managed to resist regime change, even as several of its geographically and culturally proximate neighbors have undergone revolutions. As Figure 9 illustrates, there is a local trough in March 2011, where tone became more negative than previously, but this level had been reached repeatedly over the previous ten years, whereas tone for Egypt, Libya, Tunisia, and other countries undergoing revolution reached sharply lower levels than previously seen. Thus, substantial tonal movement towards negativity is an indicator of possible unrest, while absence of such movement indicates greater likelihood of stability.
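The distinction drawn above, between a trough that has been reached before and one that is lower than anything previously seen, amounts to comparing the current value against the historical record. A minimal sketch, assuming tone is available as a plain list of earlier period averages (the function names are hypothetical and this is not presented as the study's actual decision rule):

```python
def times_reached_before(history, current):
    """Count earlier periods whose tone was at or below `current`."""
    return sum(1 for value in history if value <= current)


def is_unprecedented_low(history, current):
    """True if `current` is lower than every earlier value in `history`."""
    return bool(history) and current < min(history)
```

Under this reading, a Saudi–style trough would give a large count from `times_reached_before`, while the pre–revolution periods described earlier would give a count near zero.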

Looking beyond the tone towards a single country, what does the tone of the entire world look like aggregated by month? Is the world as a whole becoming more negative, at least according to the news? Figure 10 shows the average tone of the entire New York Times by month from January 1945 to December 2005. The Times exhibits a strong decade–long trend towards negativity from the early 1960s to the early 1970s, before recovering towards slight negativity, and has trended slightly more negative in recent years up to the 11 September 2001 attacks, which caused news to become sharply more negative in the following four years. The New York Times has a strong U.S. focus, however, so Figure 11 shows the tone of all Summary of World Broadcasts news January 1979 to July 2010 (content after July 2010 was available only for articles mentioning one of the countries above), showing a steady, near linear, march towards negativity. For the period of overlap, January 1979 to December 2005, the two have a Pearson correlation of r=0.55 (n=324), suggesting that news as a whole is becoming more negative.

Figure 10: Average monthly tone of New York Times news content 1945–2005 (Y–axis is in standard deviations from the mean).

Figure 11: Average monthly tone of Summary of World Broadcasts news content, January 1979–July 2010 (Y–axis is in standard deviations from the mean).
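Both figures express tone in standard deviations from the mean, which puts sources with different raw tone scales on a common axis. A minimal sketch of that standardization follows; it assumes the population standard deviation and a dictionary of monthly averages, neither of which is specified in the text.

```python
from statistics import mean, pstdev


def standardize(series):
    """Convert a {month: tone} mapping to standard deviations from the mean.

    Uses the population standard deviation, which is an assumption.
    """
    values = list(series.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant series")
    return {month: (tone - mu) / sigma for month, tone in series.items()}
```

Because Pearson correlation is unchanged by this kind of rescaling, the r=0.55 reported for the overlap period would be the same whether computed on raw or standardized values.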

The spatial dimension of news

Conflict early warning uses the spatial dimension of the news only as a filtering mechanism, homing in on city–level references to reduce noise. Yet, location is a critical component of the news, with a typical news article averaging one location mention every 200–300 words. The New York Times, with 2.9 billion total words from 1945–2005, mentions 369,000 unique locations more than 10.4 million times, or around one location every 279 words. The Summary of World Broadcasts, with 1.2 billion words from 1979–2010, has 201,000 unique locations mentioned roughly 5.81 million times, or around one geographic reference every 215 words. While the previous section explored news tone visualized through time, this section will visualize it spatially, exploring several key questions about journalism.

The maps below compare the world in 2005 according to the New York Times and the Summary of World Broadcasts. Each city or other geographic landmark (such as islands, oceans, mountains, rivers, etc.) is color–coded on a 400–point scale from bright green (high positivity) to bright red (high negativity), based on the average tone of all articles mentioning that city in 2005. Each article mentioning two or more cities together results in a link being drawn between those cities, and the average tone of all articles mentioning both cities is used to color–code that link on the same color scale as the cities. Alternatively, a grammatical approach could have been used, with data mining tools that could have teased out just the tone of each individual city mentioned, rather than assigning the overall document tone to all cities mentioned in that document. The simpler method here was chosen in order to capture geographic “framing.” In essence, if a city is mentioned in a positive light in highly negative documents over a long period of time, that city is being contextualized by the news media as having some relationship with the negative events, which this technique captures. More critically, this approach also captures the framed connections among cities. A typical New York Times article about a bombing in a foreign city will usually include a quote from the White House condemning the attack. The White House has nothing to do with the attack itself, but is being contextualized as an actor in the events as they are described to an American audience.
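The document–level assignment described above can be sketched as follows: every city in an article inherits the article's overall tone, and every pair of cities in the same article receives a link carrying that tone. The input format and the function name are assumptions made for illustration, and the mapping onto the 400–point color scale is omitted because its details are not given.

```python
from collections import defaultdict
from itertools import combinations


def city_and_link_tone(articles):
    """Average document tone per city and per co-mentioned city pair.

    `articles` is an iterable of (tone, cities) pairs, where `cities` is
    the list of places mentioned in that article (an assumed format).
    Returns (city_tone, link_tone); link keys are alphabetically sorted pairs.
    """
    city_tones = defaultdict(list)
    link_tones = defaultdict(list)
    for tone, cities in articles:
        unique = sorted(set(cities))  # count each city once per article
        for city in unique:
            city_tones[city].append(tone)
        for pair in combinations(unique, 2):
            link_tones[pair].append(tone)

    def average(groups):
        return {key: sum(vals) / len(vals) for key, vals in groups.items()}

    return average(city_tones), average(link_tones)
```

In the bombing example, Washington would inherit the negative tone of the article and gain a negative link to the foreign city, which is exactly the framing effect this method is meant to capture.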

The two maps are highly divergent, with the Times mentioning 19,785 distinct locations on Earth in 2005 to SWB’s 29,592. Reflecting the Times’ U.S. focus, 338 (1.1 percent) of the SWB locations and 8,045 (40.7 percent) of the Times’ were in the United States. Most surprising, however, is that just 3,080 locations (6.7 percent) are mentioned by both sources. A large portion of this difference appears to be that the Times assumes its American readership is unfamiliar with the geography of most foreign countries and so tends to report locations at the province level or as “near” major cities. SWB reports, coming from countries with ties to the countries being discussed, or from the country itself, tend to use a greater density of precise local city and landmark references. Most strikingly, however, the Times map shows a world revolving around the United States, with nearly every foreign location it covers being mentioned alongside a U.S. city, usually Washington D.C. Foreign coverage is very uneven, with Africa, Southern Asia, and Latin America being especially poorly represented, while Europe is well represented. SWB shows a far more balanced view of the world, with far better coverage of most geographic locales and no single set of countries dominating the connections. Clicking on the images below will open an animated GIF movie showing each year in sequence, showing the world over the last half and quarter centuries, respectively. In addition, an animated GIF for the New York Times (1945–2005) is at http://contentanalysis.ichass.illinois.edu/Culturomics20/nyt-movie-1000x1000.gif (a larger version is at http://contentanalysis.ichass.illinois.edu/Culturomics20/nyt-movie-2000x2000.gif). An animated GIF for Summary of World Broadcasts is at http://contentanalysis.ichass.illinois.edu/Culturomics20/swb-movie-1000x1000.gif (a larger version is at http://contentanalysis.ichass.illinois.edu/Culturomics20/swb-movie-2000x2000.gif).
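The location percentages quoted above can be checked with simple arithmetic. The sketch below assumes that the 6.7 percent overlap is the shared locations as a share of all distinct locations mentioned by either source (the union); the text does not state the denominator, but this reading reproduces the reported figure. The function names are hypothetical.

```python
def share_percent(part, whole):
    """`part` as a percentage of `whole`."""
    return 100.0 * part / whole


def overlap_percent(count_a, count_b, shared):
    """Shared locations as a percentage of the union of both sets."""
    union = count_a + count_b - shared
    return share_percent(shared, union)


nyt_total, swb_total, both = 19785, 29592, 3080
us_share_swb = round(share_percent(338, swb_total), 1)
us_share_nyt = round(share_percent(8045, nyt_total), 1)
shared_share = round(overlap_percent(nyt_total, swb_total, both), 1)
```

Under that assumption the union is 46,297 distinct locations, and the three rounded values agree with the 1.1, 40.7 and 6.7 percent given in the text.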

Figure 12: Global geocoded tone of all New York Times content, 2005. Note: Click on image to see animation.

Figure 13: Global geocoded tone of all Summary of World Broadcasts content, 2005. Note: Click on image to see animation.

Mapping the world of Bin Laden

Narrowing the focus from the entire world to that of a particular actor, could global news have given us insights into the hiding place of Osama Bin Laden? The topic has attracted a fair bit of attention over the past decade, including a 2009 study that combined satellite imagery and biogeographic analysis in an attempt to pinpoint his location to three buildings in Pakistan (Gillespie, et al., 2009). Figure 14 shows all geographic references and their co–occurrence links in coverage mentioning Osama Bin Laden in SWB content January 1979 through April 2011 (only “bin laden” was used as the search criterion to avoid the transliteration issues associated with his first name).

From his rise in the global media in the late 1990s to the month prior to his capture, Bin Laden was most commonly associated with Pakistan, and in the map below all roads appear to lead to Northern Pakistan. Indeed, nearly 49 percent of all articles mentioning Bin Laden included a city in Pakistan, and both Islamabad and Peshawar rank in the top five non–Western cities associated with him. The next four most closely associated countries are the United States (38 percent), Iran (33 percent), Afghanistan (28 percent), and the Philippines (20 percent). The city of his capture, Abbottabad, makes only a single appearance in an article on 16 April 2011 regarding the arrest of a terror suspect in the city (Mir, 2011). However, Abbottabad is less than 200 kilometers from both of the two cities most commonly associated with him, or roughly the radius between Islamabad and Peshawar.

While far from a definitive lock on Bin Laden’s location, global news content would have suggested Northern Pakistan in a 200 km. radius around Islamabad and Peshawar as his most likely location, and that he was nearly twice as likely to be making his residence in Pakistan as Afghanistan.
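The 200 km claim can be sanity–checked with a great–circle distance calculation. The sketch below uses the standard haversine formula; the city coordinates are approximate values supplied here for illustration and are not taken from the study, and the names `CITIES` and `haversine_km` are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate (latitude, longitude) in degrees; assumed for illustration.
CITIES = {
    "Islamabad": (33.68, 73.05),
    "Peshawar": (34.02, 71.52),
    "Abbottabad": (34.17, 73.22),
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


to_islamabad = haversine_km(CITIES["Abbottabad"], CITIES["Islamabad"])
to_peshawar = haversine_km(CITIES["Abbottabad"], CITIES["Peshawar"])
```

With these approximate coordinates both distances come out well under 200 km, consistent with the radius described above; straight–line distance is of course only a rough proxy for proximity on the ground.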

Scholars have long sought to organize the world into “civilizations” that collect countries together by shared cultural or political foundations. Perhaps the most famous theory of world civilizations was put forth by Samuel Huntington in his controversial 1996 book entitled The clash of civilizations and the remaking of world order, where he organized the world into ten major divisions. The majority of data–driven civilization theories rely on demographic information such as ethnic group or religious affiliation distribution or economic data such as trade ties. Media–driven studies of the “relatedness” of countries have focused primarily on how often the press of one country covers events in another, measuring media selection bias (Wu, 2000). Yet, as the maps in the previous section illustrate, the global news media appears to cluster regions together, relating cities in one area to those in another more closely than to the rest of the world, offering an implicit grouping of “civilizations.”

Figure 15 visualizes the way in which the global news media frames the world for its readership and the “civilizations” that result. As with Figure 13, all mentions of a city or geographic landmark across SWB content 1979–2009 were pooled together, and a link established between each pair of cities that appear in an article together. These were then aggregated up to the country level, yielding a network diagram with the countries of the world as nodes and the edges between them recording the number of articles those two countries appeared together in. Many countries appear together in just a handful of articles out of SWB’s 3.9 million from this period, and so to reduce noise, only links representing five percent or more of one of the two countries’ total appearances were retained. This discarded single isolated connections of a pair of countries, while retaining those that were more regularly discussed together in the news. To capture the intensity of news attention as accurately as possible, edge weights were not normalized, meaning that higher–volume countries that occur together frequently will have a stronger edge weight than those with lower coverage volumes.
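The network construction described above can be sketched as follows. It is a simplified illustration, not the study's code: it starts from country sets per article rather than from city pairs, it takes a country's “total appearances” to mean the number of articles mentioning it, and the function name is hypothetical. Edge weights are left as raw co–occurrence counts, matching the unnormalized weighting described in the text.

```python
from collections import Counter
from itertools import combinations


def build_country_network(articles, min_share=0.05):
    """Weighted co-occurrence edges between countries.

    `articles` is an iterable of collections of country names, one per
    article (an assumed format). An edge is kept when its article count is
    at least `min_share` of the appearances of at least one endpoint.
    """
    appearances = Counter()
    pair_counts = Counter()
    for countries in articles:
        unique = sorted(set(countries))  # one count per article
        appearances.update(unique)
        pair_counts.update(combinations(unique, 2))
    return {
        (a, b): count
        for (a, b), count in pair_counts.items()
        # "one of the two countries": compare against the smaller total
        if count >= min_share * min(appearances[a], appearances[b])
    }
```

Comparing against the smaller of the two totals is equivalent to asking whether the link reaches five percent for either country, so a small country's strong tie to a heavily covered one is retained.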

The Louvain hierarchical modularity finding process (Blondel, et al., 2008) was then used to extract the “natural communities” of nodes within the network. Modularity finding organizes a network into clusters of nodes where the nodes within each group are more closely connected to each other than to the rest of the network. In the case of this news relatedness network, modularity finding locates groups of countries that are mentioned together more often with each other than with other countries. The resulting partition finds that the global news media, as captured by SWB, divides the world into six major civilizations, visualized spatially in Figure 16. Overall, these civilizations appear to closely track geographic proximity, which might intuitively make sense, with events in one country involving or being related to those in a neighboring country. Notable outliers include Spain, where colonial ties to South America appear to overcome its affinity to Europe, and France and Portugal, reflecting their ties to Africa. The most geographically diverse cluster is centered on the Middle East, but also includes Canada, Norway, and the United Kingdom. The smallest cluster consists of India and several of its immediate neighbors.
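To make “more closely connected to each other than to the rest of the network” concrete, the sketch below computes the weighted modularity score Q of a given partition, which is the quantity the Louvain method tries to maximize. It is not an implementation of the Louvain procedure itself, and the edge–dictionary format and function name are assumptions made for illustration.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity Q of a partition of an undirected graph.

    `edges` maps (u, v) pairs to weights, each undirected edge listed once
    and no self-loops; `community_of` maps each node to a community label.
    """
    total_weight = sum(edges.values())  # m
    if total_weight == 0:
        raise ValueError("graph has no edge weight")
    degree_sum = defaultdict(float)  # summed node strengths per community
    internal = defaultdict(float)    # edge weight falling inside a community
    for (u, v), weight in edges.items():
        degree_sum[community_of[u]] += weight
        degree_sum[community_of[v]] += weight
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += weight
    two_m = 2.0 * total_weight
    # Q = sum over communities of (internal / m) - (degree_sum / 2m)^2
    return sum(
        internal[c] / total_weight - (degree_sum[c] / two_m) ** 2
        for c in degree_sum
    )
```

A partition that groups frequently co–mentioned countries together scores higher than one that lumps everything into a single cluster, which always scores zero.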

Most theories of civilizations feature some approximation of the degree of conflict or cooperation between each group. Figure 17 displays the average tone of all links between cities in each civilization, visualizing the overall “tone” of the relationship between each. Group 1, which roughly encompasses the Asiatic and Australian regions, has largely positive links to the rest of the world and is the only group with a positive connection to Group 4 (Middle East). Group 3 (Africa) has no positive links to any other civilization, while Group 2 (North and South America excluding Canada) has negative links to all but Group 1. As opposed to explicit measures of conflict or cooperation based on armed conflict or trade ties, this approach captures the latent view of conflict and cooperation as portrayed by the world’s news media.

Figure 17: Average tone of links between world “civilizations” according to SWB, 1979–2009.

Figure 18 shows the world civilizations according to the New York Times 1945–2005. It divides the world into five civilizations, but paints a very different picture of the world, with a far greater portion of the global landmass arrayed around the United States. Geographic affinity appears to play a far lesser role in this grouping, and the majority of the world is located in a single cluster with the United States. It is clear from comparing the SWB and NYT civilization maps that even within the news media there is no one “universal” set of civilizations, but that each country’s media system may portray the world very differently to its audience. By pooling all of these varied viewpoints together, SWB’s view of the world’s civilizations offers a “crowdsourced” aggregate view of civilization, but it too is likely subject to some innate Western bias.

After 70 years of monitoring first broadcast and then print media, Western intelligence agencies now derive nearly half of their annual global news monitoring output from Internet–based news, a testament to the Web’s disruptive power as a distribution medium. Pooling together the global tone of all news mentions of a country over time appears to accurately forecast its near–term stability, including predicting the revolutions in Egypt, Tunisia, and Libya, conflict in Serbia, and the stability of Saudi Arabia. Location plays a critical role in news reporting, and “passively crowdsourcing” the media to find the locations most closely associated with Bin Laden prior to his capture finds a 200–kilometer–wide swath of northern Pakistan as his most likely hiding place, an area which contains Abbottabad, the city he was ultimately captured in. Finally, the geographic clustering of the news, the way in which it frames localities together, offers new insights into how the world views itself and the “natural civilizations” of the news media.

While heavily biased and far from complete, the news media captures the only cross–national real–time record of human society available to researchers. The findings of this study suggest that Culturomics, which has thus far focused on the digested history of books, can yield intriguing new understandings of human society when applied to the real–time data of news. From forecasting impending conflict to offering insights on the locations of wanted fugitives, applying data mining approaches to the vast historical archive of the news media offers promise of new approaches to measuring and understanding human society on a global scale.

About the author

Kalev Leetaru is Assistant Director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts, and Social Science at the University of Illinois and Center Affiliate of the National Center for Supercomputing Applications. Among his research areas are “big data” analysis using massive text archives, and he has a book surveying the field of computational content analysis coming from Routledge in Fall 2011.
E–mail: leetaru [at] illinois [dot] edu

Acknowledgements

This research was supported in part by the National Science Foundation using Teragrid resources on the Nautilus SGI UV supercomputer at the National Institute for Computational Sciences (award TG–HUM110001). The author thanks Pragneshkumar Patel at NICS for technical assistance.

References

Du’a Al–Badi and Abd–al–Wahab Sha’ban, 2011. “Memorandum to the Presidency demands putting Al–Awwa, Imarah, and Zaghlul to military trial,” Al–Wafd, in Cairo, in Arabic (4 January), as reported by Summary of World Broadcasts.