Mapping News Outlets: How Different Data Gives Different Perspectives

Kalev Leetaru, Contributor
I write about the broad intersection of data and society. Opinions expressed by Forbes Contributors are their own.

A Google Earth display. (TIMOTHY A. CLARY/AFP/Getty Images)

One of the truisms underlying the data revolution is the idea that hidden in massive pools of data is “universal truth” just waiting to be uncovered by the right algorithms. Indeed, the vast majority of data-driven research today adheres to this vision, believing that there is such a thing as “truth” that can be discerned regardless of the underlying dataset being analyzed. Unfortunately, few data scientists spend the time to actually examine the data they use and thus don’t realize that many of the findings of modern data-driven research are merely artifacts of the dataset being examined. Conversely, findings that researchers dismiss as artifacts of a dataset may instead reflect genuine societal trends. Mapping the geography of the world’s news media offers a useful lesson in the different views we see depending on the data we use.

The ability of digitized and born digital news to allow machines to peer across hundreds of millions or even billions of news articles to tease apart the macro-level patterns of global society has opened up incredible possibilities for understanding the world in which we live. Aggregating media geographically is a common way of examining how different countries or regions of countries are covering a topic or event. For example, how is the domestic press in each country of the world covering a given public figure?

Performing such analyses requires being able to categorize each news outlet in the world by its country of origin. In short, to place the New York Times in the US, BBC in the UK, Xinhua in China and so on.

At first glance, it might seem relatively trivial to determine the geographic location of each of the world’s news outlets. After all, news organizations tend to be legal entities with mailing addresses, addresses of incorporation, headquarters offices and domain names registered to a physical location. However, this apparent simplicity belies the very real complexity of assigning a location to a news outlet.

Historically “whois” domain registration information was a fast way to identify the country of origin of a news outlet. One could run a simple “whois” request and get back the mailing address on file for the news outlet’s website, which presumably would yield its city and country of origin. However, this quickly breaks down due to the complexity of news outlet ownership. For the Los Angeles Times this would have historically yielded the Tribune Company in Chicago, Illinois as the owner of its website, while Nigerian news outlet Vanguard News’ website was registered to a small cottage in Bridgnorth, Shropshire, England, population 12,000. In some parts of the world smaller outlets that cover the perspectives of minority communities or opposition groups are registered and operate in foreign countries, often in Europe or the US, to escape the reach of governmental forces that would otherwise shut them down.

In other cases, outlets with multiple offices will often register their domain to their US or European office. For example, allAfrica has offices in Cape Town, Dakar, Abuja, Monrovia, Nairobi, and Washington, DC, but historically used their DC office as the address on file for their website.

Moreover, a rapidly increasing fraction of news sites use domain privacy services that act as an intermediary shielding their contact information from public view, making whois data nearly useless today for identifying an outlet’s country of origin.
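To make the whois limitation concrete, here is a minimal sketch of pulling a registrant country out of raw whois text. It assumes the common "Registrant Country:" and "Registrant Organization:" field labels, and the privacy keywords and both sample records are invented for illustration; real whois output varies widely by registrar.

```python
# Keywords that hint at a privacy or proxy service. This list is an
# assumption for illustration and is far from exhaustive.
PRIVACY_HINTS = ("privacy", "proxy", "redacted")


def registrant_country(whois_text):
    """Return the registrant country code from raw whois text, or None.

    None is returned when the country field is missing or when the
    registrant organization looks like a privacy service.
    """
    fields = {}
    for line in whois_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip().lower()] = value.strip()
    organization = fields.get("registrant organization", "").lower()
    if any(hint in organization for hint in PRIVACY_HINTS):
        return None
    country = fields.get("registrant country", "")
    return country.upper() or None


# Invented records; these are not real whois responses.
SAMPLE_OPEN = """Domain Name: example-news.com
Registrant Organization: Example News Ltd
Registrant Country: gb"""

SAMPLE_SHIELDED = """Domain Name: example-daily.com
Registrant Organization: Example Privacy Proxy Service
Registrant Country: US"""

print(registrant_country(SAMPLE_OPEN))
print(registrant_country(SAMPLE_SHIELDED))
```

Even when a country comes back, it only reflects the address on file, which for outlets like those described above may be a parent company or a foreign office rather than the place the outlet actually covers.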

News outlets also tend to prioritize “.com” domain names, rather than using the country-specific TLDs of their home country. Even those outlets that do use country-specific TLDs may operate additional generic “.com” websites, such as BBC’s dual bbc.co.uk and bbc.com and The Guardian’s guardian.co.uk and theguardian.com.
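A short sketch shows why the domain name alone is a weak signal. The helper name country_tld is hypothetical, and the sketch assumes that any two-letter ASCII top-level domain is a country code, which holds for the public DNS root.

```python
def country_tld(domain):
    """Return the two-letter country-code TLD of a domain, or None."""
    tld = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    # Generic TLDs such as "com" are longer than two letters and carry
    # no country signal, so they fall through to None.
    if len(tld) == 2 and tld.isascii() and tld.isalpha():
        return tld
    return None


for site in ("bbc.co.uk", "bbc.com", "guardian.co.uk"):
    print(site, country_tld(site))
```

The same broadcaster yields "uk" for one of its domains and nothing at all for the other, so a TLD-based catalog would be both incomplete and inconsistent.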

The physical location of the servers hosting the outlet’s website is also an unreliable indicator of the outlet’s location, given the prevalence of centralized website hosting even 20 years ago and the near-universal practice today of leaving hosting to dedicated cloud facilities, rather than running the site on a server in a broom closet in the basement.

From a data standpoint, what’s fascinating about all of this is that when one digs into domain registrations, incorporation documents, headquarters locations, mailing addresses and other “traditional” data sources about news outlets’ locations, one gets a view of the world that is not necessarily in keeping with how the consumers of those outlets see them.

It may be the case that allAfrica’s website is registered to its Washington, DC office, but most of its readers will likely consider it to be an Africa-focused outlet, rather than a US outlet. Similarly, Vanguard News’ website may have been registered to a rural cottage in the UK, but it is likely to be seen as a Nigerian outlet. Indeed, across the world the geographic landscape of domain registrations and addresses of incorporation does not align well with our perceptions of those outlets, especially for smaller outlets in countries with degraded press freedoms.

How else can one catalog the geographic landscape of the world’s news outlets? Especially in a scalable fashion that can rapidly and autonomously catalog new outlets as they come online across the globe each day?

One approach is to leverage the geographic bias inherent in all journalism: news outlets prioritize local events and issues over those elsewhere in the world. In turn, this geographic affinity affects how we see an outlet. A human rights newspaper incorporated in the UK and managed from London, but with its reporting staff based in Syria and the majority of its coverage focused on Syria, would likely be viewed as a Syrian outlet by most readers. In the web era, geographic focus matters more than geographic location. Of course, this doesn’t always work, as government-owned propaganda outlets in one country might each focus on a country of interest to that government. Yet, even here there is little difference between this and the traditional alternative of simply opening an outlet on that country’s soil.

How can geographic bias be turned into a catalog of news outlet locations?

The most simplistic approach, that of searching each article published by an outlet for country name mentions, unsurprisingly works very poorly. The reason? The simple fact that the New York Times doesn’t add “, United States of America” to every mention of a location in the US. The Times assumes that when it mentions the State of Virginia, its readership will recognize that Virginia is part of the US and thus it doesn’t need to say “Virginia, USA.”
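The failure is easy to reproduce. The sketch below counts literal country-name mentions in a piece of text; the three-name list and both sample sentences are invented for illustration.

```python
# Tiny illustrative list; a real system would need every country name
# along with its variants.
COUNTRY_NAMES = ("United States", "France", "Nigeria")


def count_country_mentions(text):
    """Count literal, case-insensitive country-name mentions in text."""
    lowered = text.lower()
    counts = {}
    for name in COUNTRY_NAMES:
        hits = lowered.count(name.lower())
        if hits:
            counts[name] = hits
    return counts


# Invented sentences written for this example.
domestic = "Lawmakers in Virginia debated the budget while Richmond awaited the vote."
foreign = "The summit in France ended on Friday."

print(count_country_mentions(domestic))  # {}
print(count_country_mentions(foreign))   # {'France': 1}
```

The domestic story produces no country at all, while the foreign story is labeled correctly, so an outlet's home country is exactly the one this method is most likely to miss.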

Instead, the best approach turns out to be textual geocoding, in which the full text of each news article is processed through algorithms that identify potential mentions of location and then use the full contents of the article to confirm the mention and disambiguate its location (separating Paris, Illinois from Paris, France, for example). By geographically annotating every news article published by an outlet over time and then simply assigning each outlet to the country it focuses the majority of its attention on, it turns out we end up with a fairly precise and highly accurate geographic catalog of the world’s media.
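The disambiguation step can be sketched with a toy gazetteer. This is a deliberately simplified illustration and not the algorithm GDELT uses: the gazetteer, its clue words and the scoring rule are all assumptions made for the example.

```python
import re

# Toy gazetteer (an assumption for illustration): each place name maps to
# candidate locations, each with a country code and context clue words.
GAZETTEER = {
    "paris": [
        {"place": "Paris, France", "country": "FR",
         "clues": {"france", "french", "seine"}},
        {"place": "Paris, Illinois", "country": "US",
         "clues": {"illinois", "chicago", "midwest"}},
    ],
}


def geocode(text):
    """Return (place, country) pairs for gazetteer names found in text.

    Each mention resolves to the candidate with the most clue words
    present in the article. Ties and zero scores are treated as
    ambiguous and skipped.
    """
    vocabulary = set(re.findall(r"[a-z]+", text.lower()))
    results = []
    for name, candidates in GAZETTEER.items():
        if name not in vocabulary:
            continue
        scores = [len(c["clues"] & vocabulary) for c in candidates]
        best = max(scores)
        winners = [c for c, s in zip(candidates, scores) if s == best]
        if best > 0 and len(winners) == 1:
            results.append((winners[0]["place"], winners[0]["country"]))
    return results


print(geocode("The mayor of Paris, Illinois spoke in Chicago."))
print(geocode("Tourists walked along the Seine in Paris."))
```

A bare mention of "Paris" with no supporting context returns nothing here, which mirrors the idea of using the rest of the article to confirm a mention before counting it.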

Illustrating this approach in practice, my open source GDELT Project has monitored more than 750 million global news articles in more than 65 languages over the past 3 years, performing full textual geocoding on every one of them, cataloging more than 6.2 billion textual mentions of location.

As a testament to the power of modern cloud-based analytic tools, a single line of SQL code using Google’s BigQuery platform can process these 6.2 billion location mentions and generate a final geographic estimate for each of the nearly 200,000 online news outlets monitored by GDELT, all in just over 16 seconds.
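The shape of such an aggregation can be sketched on a laptop using Python's built-in sqlite3 module as a stand-in for BigQuery. The table name, column names and rows below are hypothetical, and this is not the actual GDELT query; it only illustrates counting mentions per outlet and country and keeping the most-mentioned country.

```python
import sqlite3

# Hypothetical data: one row per geocoded location mention.
ROWS = [
    ("outlet-a.example", "US"), ("outlet-a.example", "US"),
    ("outlet-a.example", "FR"),
    ("outlet-b.example", "NG"), ("outlet-b.example", "NG"),
    ("outlet-b.example", "GB"),
]

# Count mentions per (outlet, country), then keep each outlet's top country.
# If two countries tie for first place, both rows are returned.
QUERY = """
WITH counts AS (
    SELECT outlet, country, COUNT(*) AS n
    FROM location_mentions
    GROUP BY outlet, country
)
SELECT outlet, country, n
FROM counts AS c
WHERE n = (SELECT MAX(n) FROM counts WHERE outlet = c.outlet)
ORDER BY outlet, country
"""


def top_country_per_outlet(rows):
    """Return (outlet, country, mention_count) for each outlet's top country."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE location_mentions (outlet TEXT, country TEXT)")
        connection.executemany(
            "INSERT INTO location_mentions VALUES (?, ?)", rows)
        return connection.execute(QUERY).fetchall()
    finally:
        connection.close()


print(top_country_per_outlet(ROWS))
```

Note that this picks the most-mentioned country, which need not be an absolute majority of an outlet's mentions; a stricter rule could require a minimum share before assigning a country.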

Putting this all together, we often talk of using big data to uncover the “universal truths” of our world. In reality, even the most mundane question, such as where to place a news outlet on a map, can have different answers depending on which dataset one looks at and the lenses through which one sees the question. Placing news outlets at the location of their legal incorporation, mailing address or owner of their website yields a distinct map that is not well aligned with how ordinary consumers of news see many of those outlets. Instead, by geocoding all of their articles and viewing the outlets as a collection of more than 6.2 billion mentions of location, we see journalism through a geographic lens and are able to leverage the geographic affinity of reporting to place outlets where their readers perceive them to be, creating a view much more aligned with our common understanding of the media’s geographic landscape.

I would like to thank Google for the use of Google Cloud resources including BigQuery for this analysis.