Last month’s Wired magazine ran an infographic headlined ‘History’s most influential people, ranked by Wikipedia reach’, showing a group of 20 men arranged in hierarchical order, from Jesus at number 1 to Stalin at number 20. I wondered how ‘influence’ and ‘Wikipedia reach’ were being decided. According to the article, ‘Rankings (were) based on parameters such as the number of language editions in which that person has a page, and the number of people known to speak those languages’. What really surprised me was not the particular arrangement of figures on the page but the conclusions that were being drawn from it.
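To make the quoted parameters concrete, here is a minimal sketch in Python of what a ‘reach’ score of this kind might look like. It is only a guess at the shape of the calculation, not the method used in the study: the article does not say how the parameters are combined. The people, languages and speaker counts are invented placeholders, and the function names are hypothetical.

```python
# Toy illustration only: the speaker counts and page lists below are
# invented placeholders, not real Wikipedia or census data.
SPEAKERS_MILLIONS = {"lang_a": 500, "lang_b": 120, "lang_c": 2}

PAGES = {
    "person_x": ["lang_a", "lang_b", "lang_c"],
    "person_y": ["lang_a"],
    "person_z": ["lang_b", "lang_c"],
}


def reach(editions, speakers):
    """Return (number of editions, total speakers of those languages)."""
    return len(editions), sum(speakers[lang] for lang in editions)


def rank_by_reach(pages, speakers):
    """Sort people by total speakers, breaking ties by edition count.

    Assumption: the two parameters are combined this way. The article
    does not describe the actual weighting.
    """
    def sort_key(person):
        edition_count, total_speakers = reach(pages[person], speakers)
        return total_speakers, edition_count

    return sorted(pages, key=sort_key, reverse=True)


if __name__ == "__main__":
    for person in rank_by_reach(PAGES, SPEAKERS_MILLIONS):
        print(person, reach(PAGES[person], SPEAKERS_MILLIONS))
```

In this toy data, person_z has pages in two editions but ranks below person_y, who has a page in only one, simply because that one language has more speakers. A page in lang_c also adds that language’s entire speaker count to the score no matter who wrote the page or who reads it, which is the assumption the rest of this post questions.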

According to the piece, César Hidalgo, head of the MIT Media Lab’s Macro Connections group, who led the research, made the following claims about the data gathered from Wikipedia:

a) “It shows you how the world perceives your own national culture.”

b) “It’s a socio-cultural mirror.”

c) “We use historical characters as proxies for culture.”

Perhaps most surprising of all is the last line of the story:

Using this quantitative approach, Hidalgo is now testing hypotheses such as whether cultural development is structured or random. “Can you have a Steve Jobs in a country that has not generated enough science or technology?” he wonders. “Ultimately we want to know how culture assembles itself.”

It is difficult to comment on the particular method used in this study because there is little more to go on than the diagram and a few paragraphs of analysis, and the journalist may have misquoted Hidalgo. But I want to draw attention to the statements being made because I think they represent a growing phenomenon: big data analysts using Wikipedia data to make assumptions about ‘culture’.

National culture?

Hidalgo claims that Wikipedia can show you ‘how the world perceives your own national culture’. But Wikipedia is actually pretty bad at showing national differences, for two key reasons: a) Wikipedia divides projects by language, and languages are necessarily cross-national (Portuguese, for example, is spoken by people on several continents; Arabic Wikipedia has contributions by people on three continents), and b) different Wikipedia language editions are not necessarily being edited by people from the place traditionally associated with that language.

In 2011/12, I did a study of Swahili Wikipedia and learned, surprisingly, that the majority of edits were being made not by East Africans, for whom Swahili is a lingua franca, but by Europeans. Only one of the three Swahili Wikipedia bureaucrats, Muddyb from Tanzania, is a native Swahili speaker. A long-time editor, Muddyb complained on his blog about the lack of local contributions to the encyclopedia:

The most active users who are contribute on the Swahili Wikipedia is Wazungu (the White people) who coming from various areas (but mostly from Germany). Kipala is the only user who change the Swahili Wikipedia from the lower level to the highest level. Kipala was the one who took it from the 140 to the 1,000 mark – and make it the first African Wikipedia to reach over the 1,000.

The notion of home/first languages

Similarly, there is a common misconception that editors will edit the encyclopedia in their home/first language rather than a “foreign” language. The Makmende case study showed how Kenyans (who generally speak a native, regional language like Kikuyu or Maasai as well as the “official languages” of Swahili and English) preferred to write the article in English rather than in Swahili (where the article still does not exist). In interviews, I learned that the majority of Kenyan Wikipedia Chapter members choose to edit in English and that more Kenyans actually read English Wikipedia than Swahili Wikipedia, despite the fact that Swahili Wikipedia is one of the largest African Wikipedias with almost 23,000 articles (Zachte, 2011) (1). In Kenya, where the national language is Kiswahili and the official languages are Kiswahili and English, Swahili Wikipedia gets less than 1% of Wikimedia readership, while English Wikipedia receives 89.6% of all Wikimedia requests. The largest share of Swahili Wikipedia’s readership comes from Tanzania, at only 20%, followed closely by the United States at 16.5%, then Kenya at 6.6% and Germany at 5.6%. Much of Swahili Wikipedia’s readership, in other words, appears to be made up of East African expats living abroad and Westerners with ties to the region.

There are many reasons why this is the case. The most important is that language in Kenya (as in many places on the continent) is intimately connected with an individual’s access to education, employment, and political participation (Simpson, 2008). Even though Swahili was chosen as the national language in Kenya after independence, the colonial experience had a large impact on the way that languages were perceived, since being fluent in a colonial language ultimately meant access to the best positions during colonial times (Simpson, 2008). The legacy of this (and, as Simpson argues, a characteristic of African traditional society) is that different languages are used for different situations in modern-day Kenya: English for government, business, and science and technology discussions; Swahili (or a version of it, like Sheng) for everyday conversation.

So, even though many East Africans (like Muddyb) complain about the lack of local support for Swahili Wikipedia, there are very important reasons why people choose to edit and read the English version instead. But this also means that looking at Wikipedia data about Swahili cannot ‘show you how the world perceives your own national culture’, because the editors of a particular language version are often not even native speakers of the language that supposedly reflects this culture.

The role of individuals

The second problem with using ‘Wikipedia as a proxy for culture’ is that some language versions are very small, so individuals can significantly skew the data. Take the example of Mark Graham’s visualization of Swahili Wikipedia.

Wikipedia editors are drawn from a particular slice of the population, and this tends to be reflected in the kinds of articles that are represented (and not represented) on Wikipedia. Metafilter has a great discussion about the infographic, and I really loved what user ‘eotvos’ had to say about why someone like Vasco da Gama or Muhammad doesn’t show up in the graphic:

Once again, the subject as has the potential to be a lot more interesting than either the analysis or the article. While this doesn’t say much at all about “influence” or national reach or whatever else the authors suggest, comparisons of wikipedia databases in multiple language could be fascinating.

Consider Copronymus’ lovely example of Vasco da Gamma and Muhammad above. I suspect most of us have a reasonably good idea of what it means to have a wikipedia page about a person in English, or in Spanish. But, how ought we to think about wikipedia articles in languages with less reach?

To pick an example, what does it mean to have a wikipedia page in Nahuatl? The number of literate Nahuatl speakers with Internet access who don’t also speak one of the big 10 languages may not be zero, but it’s vanishingly close to zero. People who create wikipedia pages in Nahuatl are clearly doing something other than informing the monolingual Nahuatl-speaking public about Vasco da Gama. What are they up to, and what can we learn about them from their editorial choices?

Instead of measuring Vasco da Gama’s influence among Nahuatl speakers, these data probably say something about the canon of world knowledge as it is understood by a very small group of tech-savvy language activists. A detailed analysis of that could be really exciting.

November 27, 2012

Indeed. Wikipedia data about different language encyclopedias has interesting things to show us about the culture of Wikipedians themselves. That data points to a yearning for friendships across the seas, to a desire for access to what some see as greener pastures, to the things that editors want people who are watching or reading to see, and to the things that they think should be part of the global English corpus. But Wikipedia is no proxy for national culture.

In the next post, I’ll talk about the particular mechanics of editing on Wikipedia that shape what ends up being represented.

Footnotes:

(1) Afrikaans and Swahili Wikipedias periodically trade places as the largest African Wikipedia. See Ian Gilfillan for more on this.