Which News Organizations Influence Wikipedia?

The New York Times leads the Wikipedia record of 2013’s biggest news stories, according to a very simple (and perhaps simplistic) analysis of the online encyclopedia’s citations. The analysis and process are quite basic, a demo of a new perspective on the impact of news reporting: figuring out what journalism seeps into Wikipedia, the default first stop for almost everyone researching a new topic.

Across thirty Wikipedia pages covering a list of “top ten stories of 2013”, the ten most cited sources (of any kind, not just news companies) as of January 11th 2014 were:

The New York Times (226 citations)

The Washington Post (206 citations)

CNN (183 citations)

The Guardian (152 citations)

Reuters (125 citations)

BBC (124 citations)

The Huffington Post (80 citations)

NBC News (78 citations)

The New Republic (78 citations)

USA Today (75 citations)

However, the full list has a very long tail. Out of the 4266 total linked citations across 947 separate organizations, 2939 citations were outside this top ten.
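As a quick arithmetic check on that long tail, the figures above can be added up directly. This is only a sketch using the counts quoted in this post; the variable names are mine, for illustration.

```python
# Citation counts for the top ten sources, copied from the list above.
top_ten = {
    "The New York Times": 226,
    "The Washington Post": 206,
    "CNN": 183,
    "The Guardian": 152,
    "Reuters": 125,
    "BBC": 124,
    "The Huffington Post": 80,
    "NBC News": 78,
    "The New Republic": 78,
    "USA Today": 75,
}

TOTAL_CITATIONS = 4266  # total linked citations reported in this post

top_ten_total = sum(top_ten.values())
outside_top_ten = TOTAL_CITATIONS - top_ten_total
share_outside = outside_top_ten / TOTAL_CITATIONS

print(f"Top ten: {top_ten_total} citations")
print(f"Outside top ten: {outside_top_ten} citations ({share_outside:.1%})")
```

The top ten account for 1327 citations, which leaves 2939 (roughly 69%) spread across the other organizations.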

Out of all possible sources, for ten big stories of 2013, Wikipedia cites these organizations.

You can also examine the full data set, the results of scraping webpages and counting the times a source appeared. No sophisticated statistical analysis has been applied at this early demo stage.

At first glance, only The New Republic jumps out as a surprising appearance: all ten sources are well-known journalism brands. However, the surprise for me was the news organizations that don’t make the top ten: CBS News, ABC News and FOX News. Allowing for the fact that ideological position is somewhat in the eye of the beholder, this top ten strikes me as leaning left overall (and would probably strike Rush Limbaugh as archetypal “liberal elite”). In fact, the deeper you go down this list, the more those subtle surprises emerge; but more on that later.

The results are highly dependent on the definition of ‘top news stories’, which is itself a massively subjective concept. I used the list below. It was prompted by some news organizations’ end-of-year lists, but filtered through my own news values.

The Affordable Care Act

The Boston Marathon Bombing

The George Zimmerman Trial

The Egyptian Coup

The NSA & Edward Snowden

Nelson Mandela’s Death

Pope Francis

The Syrian Civil War

The US Economy

The US Government Shutdown

A list of the thirty Wikipedia pages relating to those stories is here. Further below, I go into detail about the process and how other researchers might alter the news mix and analysis for their own purposes. Alternatively you can download the code from GitHub if you want to run or modify the process to your own ends.

Why Does it Matter What Wikipedians Cite?

Wikipedia is the source for millions of people’s research. As such, it’s an influential repository of the public record. Far from the only resource, but certainly a hugely important one.

While working on this, I liked to think of little Janice getting her assignment to write about Syria, or Egypt, or Pope Francis. Where does she go? Even with the best efforts of teachers to broaden the research base, there’s a fair chance she’ll pass through Wikipedia at some point. It’s the sixth most visited site in the world, according to Alexa. According to Wikipedia’s own stats, if Janice visited the “Syrian Civil War” page in December 2013, hers would be among 230,000 page views. The page for Obamacare (actually called the Patient Protection and Affordable Care Act) was viewed 179,000 times that month.

So, understanding what gets into Wikipedia is an extra way of understanding impact – one of the research areas Tow Center Director Emily Bell called out in 2012. With more development, research into journalistic impact on Wikipedia might sit alongside the kind of report that ProPublica publishes: an audit of the legislative, policy and societal change relating to their investigations – a report closely aligned with 4th Estate values. On the other hand, Wikipedia citations of journalistic sources seem to tell us more about how well journalists are fulfilling Reithian values, encapsulated in the phrase “Inform, Educate and Entertain”. (Conservative as that may feel nowadays.) It’s a field of study that is deepening beyond the metrics of page views, unique visitors and time on site. Newsrooms also watch social media metrics, an area of study for Tow Fellows Brian Abelson and Michael Keller.

More Detail on the Results

The first point to re-state here is that my starting collection of news stories and topics reflects my news values. I’m a 36-year-old white male living in Manhattan, doing a research fellowship at Columbia University after having worked for the Australian publicly funded broadcaster for many years. I’m fairly progressive and privileged, and move in similar circles (most of the time). In 2012 the Pew Research Center noted that American news consumers care least about foreign news, echoing its earlier studies. By contrast, my list of news topics includes Egypt and Syria. That said, I think the list of stories covers events that affect a large number of people in momentous ways – even if they weren’t the biggest topics of discussion on social media: No Miley here, I’m afraid.

Often, when I read these kinds of studies, the results hold few surprises. But if you’d asked me ahead of time whether The Huffington Post would be in the top ten, I’d have said ‘no’. I also wouldn’t have guessed that the top ten citation sources for this bundle of subjects would all be news organizations. With such a small data set, I suspect that some of these inclusions might be anomalies; so I’ve broken it down a little. Further on I’ve indicated some future directions of study to get deeper insights.

The main factor in The New York Times’ dominance is consistency. It appears in the top ten sources for each individual news story, normally in the top three, and is well represented in the topics with a high number of citations. The other news brands’ placings are a bit more volatile; no other news organization is in the top ten for every one of these topics.

These graphs also illustrate just how long the long tail is – for all stories. But that long tail also makes the analysis more debatable. Is the language of a ‘top ten’ as relevant when roughly 70% of the citations fall outside that group?

Heavily sourced or long pages may be distorting the overall numbers: some topics have far more citations than others. For example, the Edward Snowden articles had 601 citations across the three pages, whereas Pope Francis had only 332. The New Republic benefited from this effect; it was cited 78 times in the articles about Obamacare, enough to make it the ninth most cited source in the combined list. In later analysis it may be useful to weight appearances according to article length and the total number of citations on each page.
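One way such a weighting might look, as a minimal sketch: instead of adding raw counts, add each source’s share of the citations within each topic, so that a heavily cited topic can’t swamp the rest. This is not something the current analysis does. The function name topic_share_scores and the toy numbers are invented for illustration and are not from the real data set.

```python
from collections import defaultdict


def topic_share_scores(counts_by_topic):
    """Score each source by summing its share of citations within each topic.

    counts_by_topic maps topic -> {source: citation count}.
    """
    scores = defaultdict(float)
    for topic, counts in counts_by_topic.items():
        topic_total = sum(counts.values())
        if topic_total == 0:
            continue  # skip topics with no citations to avoid dividing by zero
        for source, n in counts.items():
            scores[source] += n / topic_total
    return dict(scores)


# Invented toy numbers (NOT the real data): one heavily cited topic, one light one.
toy_counts = {
    "topic_a": {"source_x": 60, "source_y": 40},
    "topic_b": {"source_y": 5, "source_z": 5},
}

scores = topic_share_scores(toy_counts)
for source, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source}: {score:.2f}")
```

In the toy data, source_x has the most raw citations (60), but source_y ranks first once each topic counts equally, because it is cited in both topics.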

I was surprised at how few citations came from direct sources. For NSA/Snowden, fewer than 2% of citations came from government domains; the vast, vast, vast majority came from established news organizations, with a few from .org domains and a few blogs. That’s less true for the pages about The Affordable Care Act and the US economy. The authors of those pages still heavily cite news organizations, but they also use government and agency websites.

As might be expected, particular publications with specialist beats were heavily cited on the Wikipedia pages relating to their turf: Orlando publications were strong on The Zimmerman Trial, and The Boston Globe dominated the Marathon Bombing pages.

Most of the top ten sources fit within Wikipedia’s policy on verification. Although it guides editors towards ‘academic and peer-reviewed publications’, it subsequently suggests they use sources with ‘a reputation for fact-checking and accuracy’. Wikipedia policies say “do not reject sources just because they are hard or costly to access” and three of the top ten cited sources do indeed have paywalls for repeat visitors: The New York Times, The Washington Post and The New Republic. That said, articles from the famously hard-to-reach JSTOR and other academic repositories were not common on these pages.

This is not a big data set, and it’s only one method of enquiry. The results are not statistically significant for all news topics on Wikipedia, much less all Wikipedia articles. They tell us about numbers for thirty pages covering these ten newsworthy topics. As you’ll see from the process outline below, the selection of pages was systematic, but doesn’t cover all the pages potentially seen by a person researching these ten news topics. Furthermore, this method doesn’t reflect whether the information in the citation has been accurately represented on Wikipedia, or whether the page has vast swathes of contradictory information, sourced or not.

These genuine limitations aside, I think this small amount of data raises interesting questions for future research, and provides a conversation starter. To examine all the results data in detail, you can go to this spreadsheet, or download CSV files from GitHub.

The Process

Picked Ten News Stories
As noted above, I didn’t try to be objective on this; I picked stories that I thought were impactful and ‘important’. I veered away from sensationalist and celebrity-powered stories, because serious.

Entered those terms, restricting the site to en.wikipedia.org
I used the Safari browser, not logged in to Google. Although we don’t entirely know how Google’s search results are produced, we do know that the results for the same search terms can change over time, and vary depending on factors including the searcher’s location.
The search string was “[SEARCH TERM] site:en.wikipedia.org”

Picked out the Wikipedia pages that were actually ‘about’ that event.
For example, the search terms returned results which included templates, category pages, and Wikipedia pages about other subjects (for example, “Edward Snowden” produced “Sarah_Harrison_(journalist)”). There is certainly room for debate about this step: My decisions regarding what a page is “about” do not necessarily reflect what a person searching Google would click through to.

The listed Wikipedia pages were then scraped using a short Python script.

Fetch the pages

Find the citations

Group URLs by domain and brand
This involved some judgement calls; for example, all Al Jazeera URLs were grouped together, regardless of whether they were from Al Jazeera English, America, or some other sub-brand. Likewise, CBS brands were aggregated, which included various CBS local brands. However, the most contentious decision might have been to group a number of NBC brands, including MSNBC and NBCNews. The effects of these decisions were logged, and can be seen in the log files on the GitHub repository, as can the code that implemented those decisions.

The domains were counted, both by news topic, and as a master list.
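The scraping steps above can be sketched roughly as follows. This is not the script from the GitHub repository, just a minimal standard-library illustration. It assumes that Wikipedia marks outbound links with an "external" CSS class, and it collects every such link on a page rather than only those inside the reference list. The names ExternalLinkParser, fetch_html and cited_domains, and the User-Agent string, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import Request, urlopen


class ExternalLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry the 'external' class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        href = attr.get("href") or ""
        # Keep absolute and protocol-relative URLs; skip internal /wiki/ links.
        if "external" in classes and href.startswith(("http://", "https://", "//")):
            self.links.append(href)


def fetch_html(url):
    """Fetch one page as text (hypothetical User-Agent string)."""
    req = Request(url, headers={"User-Agent": "citation-count-sketch/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def cited_domains(html):
    """Return the host name of every external link found in the HTML."""
    parser = ExternalLinkParser()
    parser.feed(html)
    domains = []
    for href in parser.links:
        host = urlparse(href).netloc.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com alike
        if host:
            domains.append(host)
    return domains


# A tiny hand-written sample standing in for a fetched page.
sample_html = """
<ol class="references">
<li><a class="external text" href="http://www.nytimes.com/2013/a.html">A</a></li>
<li><a class="external text" href="https://www.nytimes.com/2013/b.html">B</a></li>
<li><a class="external text" href="//www.bbc.co.uk/news/c">C</a></li>
<li><a href="/wiki/Internal_page">internal</a></li>
</ol>
"""

counts = Counter(cited_domains(sample_html))
for domain, n in counts.most_common():
    print(domain, n)
```

For a real run, cited_domains(fetch_html(page_url)) would be called for each of the thirty pages, keeping one Counter per news topic and adding them together for the master list.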

Limitations of the process & analysis

Throughout the text above, I’ve referred to some of the ways that this research is only a small step. Here are more:

This was a snapshot on a single day.
Wikipedia is a dynamic resource. These methods give us no insight into how pages and sources change over time, such as whether the composition of sources moves away from news media and towards other, slower sources as a news topic ‘ages’.

It’s a quantitative approach.
There’s no consideration of the context of the citation: whether the cited source is accurately invoked, or how significant the cited fact is in the overall article. For example, the Wikipedia page on Tylenol (not included in this top ten news topics list) includes an entire section based on reporting by ProPublica.

Syndication and subsequent reporting are not considered.
Aside from Google-hosted AP and AFP reports, I made no effort to understand whether original reporting from one company or journalist was being syndicated to the domain Wikipedia sourced. Currently, journalists on traffic-focussed sites re-write competitors’ original reporting – this analysis does not try to correct for that (although Wikipedia’s sourcing policy does discourage sourcing from syndicated pages).

This counts linked articles only.
As such, it excludes books listed in bibliographies, or citations that don’t have live online sources.

This uses web domains as representations of organizations.
There is not a 1:1 relationship between the domain in a URL and an organization in the real world. Brand considerations, technical considerations and organizational considerations contribute to making this a complicated relationship. However, these complicated relationships need to be managed in the code; guardian.co.uk, guardiannews.com, and m.guardian.co.uk all need to be aggregated as one organization. In this long tail, there may be some obscure web domains that haven’t been accurately aggregated, but those errors don’t impact the top end of the rankings.
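As an illustration of that aggregation, here is a minimal sketch that maps host names to a brand by matching on a known domain or any of its subdomains. The Guardian domains are the ones named above. The table BRAND_DOMAINS and the function brand_for are hypothetical names, and the real code on GitHub may handle this differently.

```python
from collections import Counter

# Known domains mapped to the organization they represent.
# Only the Guardian entries come from this post; a real table would be much longer.
BRAND_DOMAINS = {
    "guardian.co.uk": "The Guardian",
    "guardiannews.com": "The Guardian",
}


def brand_for(host, brand_domains=BRAND_DOMAINS):
    """Return the brand for a host name, or the host itself if it is unknown."""
    host = host.lower().rstrip(".")
    for domain, brand in brand_domains.items():
        # Match the domain itself and any subdomain, e.g. m.guardian.co.uk.
        if host == domain or host.endswith("." + domain):
            return brand
    return host  # unaggregated long-tail domains fall through unchanged


hosts = ["guardian.co.uk", "guardiannews.com", "m.guardian.co.uk", "example.org"]
brand_counts = Counter(brand_for(h) for h in hosts)
print(brand_counts)
```

Falling back to the raw host name keeps unknown domains visible in the long tail, so any missed aggregation can be spotted and added to the table later.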

Where this sits in the context of impact study

The authors of the Tow Center’s ‘Post Industrial Journalism’ Report have explained why the news industry is starting to examine impact and noted some companies that are incorporating impact audits into their operations. Before that, in 2011, Ethan Zuckerman asked what the metrics should be for civic journalism. Jonathan Stray’s 2012 article for Nieman Lab asked what the right metrics are for journalism more widely.

Brian Abelson’s ‘Impact Bibliography’ is a good place to read more widely on the subject. His list of sources includes plenty of papers from the non-profit sector; development organizations – and their funders – have long been concerned with measuring the value their dollars buy.

Few news organizations have such a clearly understood impact measure as Bloomberg. The company’s entire – extremely profitable – business model is based on clients being able to realise a return on paying $20,000+ per terminal each year. So, Bloomberg journalists constantly hear that they should be producing news that ‘moves the market’. Their pay is reportedly linked to how well they do so[1].

The sciences have traditionally used peer-reviewed journals’ Impact Factor to understand influence. That field is also developing ‘Altmetrics’, which attempt to include attention on social media, and readership of individual articles.

Next Steps For Me

This first small bit of scraping and analysis prompts me towards much more work on Wikipedia. Obviously, it’ll be interesting to analyse some bigger data sets: ‘All of English Wikipedia’ springs to mind (and was suggested to me by Michael Keller). I’m also interested in working on large groups of Wikipedia pages that concern and surround controversial topics including family planning, gun laws and shale gas (Taylor Owen suggested that). This might incorporate semantic analysis: computationally grouping Wikipedia pages into topic areas.

Changes over time seem like an interesting subject as well – understanding how the composition of citation sources changes as pages mature: As more sources become available, perhaps the result of more careful original research, do they replace journalism’s ‘first draft’?

I suspect that as I work on larger and more complex data sets, I’ll also need to develop my thinking about how to analyse and express the numbers of citations.

In terms of utility, the process of getting data into the program is fairly clunky. That can be improved. I might even look at hosting the program somewhere and giving it a web interface. No promises though; it could get expensive.

(Thinking about the limitations of this first analysis, and the vast swathes of opportunities still to be realized, a reasonable person might ask “why publish anything at all right now, even a blog post?” Think of this as a demo, a milestone to force thinking and gather feedback.)

Next Steps For You

Your newsroom might have a particular focus; a country, a city, a selection of obsessions or topics. If so, my top ten news stories of 2013 probably aren’t yours. Feel free to grab the code and adapt away.

If you’ve got expertise in statistics or semantic analysis, I’d love to hear from you. Neither of those skill sets is in my quiver.

If you’ve got any other comments, suggestions or critiques, the box below awaits your keystrokes.

[1] This prompts an observation that heavily cited news organizations don’t have much opportunity to directly monetise their inclusion in Wikipedia. Their brand prominence and traffic might go up marginally, but we’re not talking about a major profit center.