The election results on this Wikipedia page are wrong, I can tell. As we collect data for the Social Election Prediction Project, I am reviewing many a Wikipedia political party page and every so often I see mistakes. For this project I am checking that the page exists, ensuring that the page existed before the date of the election so that a voter could have used it to find out political information beforehand. I am not, it should be noted, checking for accuracy of information. Yet sometimes there are errors that glare. As an occasional Wikipedia editor and a stickler for correcting errors, I feel a strong urge to correct the mistakes I come across. Yet, as an academic looking at this page in a research context I am hesitant to alter that which I am studying. What are the ethical boundaries for academics conducting research on Wikipedia?

In 2012, Okoli et al. wrote an overview of scholarship on Wikipedia, a huge and varied field totaling almost 700 articles in peer-reviewed journals in disciplines ranging from Computer Science to Economics to Philosophy (Okoli et al., 2012). The Okoli article, titled “The people’s encyclopedia under the gaze of the sages: A systematic review of scholarly research on Wikipedia,” is comprehensive on the subject of all Wikipedia research up to that date, but does not deal extensively with ethics. The ethical issues that are addressed are those linked with the privacy concerns of studying the Wikipedia community. In their article on using wikis for research, Gerald Kane and Robert Fishman note that while all Wikipedia data is available under General Public License, or GPL, and so can be used without copyright concerns, researchers should still be cognizant of the privacy of Wikipedia editors (Kane & Fishman, 2009). For example, many of the editors Kane and Fishman interacted with were hesitant to connect their real-world identity with their identity on Wikipedia, and so did not want to conduct conversations through email or any other platform.

Of course, acting as a part of a community is not always a research taboo. Participatory action research, a method that arose from psychologist Kurt Lewin’s action research, emphasizes collaboration between researchers and the communities at hand. However, while participatory action research could apply to someone editing a Wikipedia article, studying the behavior of other editors and working with other editors to define the study, Wikipedia editors are not the subjects of the Social Election Prediction Project. The Social Election Prediction Project is a study of Wikipedia as an informational object. The subjects are voters seeking information before an election, and Wikipedia is simply a tool to help us measure their information-seeking behavior.

The ethical ambiguities of researching Wikipedia are just a symptom of Web 2.0, where everyone is a potential contributor. The same question could be asked of researchers studying Twitter, for example: should they tweet? It depends on the objective of the study. For the Social Election Prediction Project, I have not edited any Wikipedia page that I am looking at for research purposes. While I could not alter the outcome for this specific project, as we are looking at past elections and so historic page views, improving political Wikipedia pages could in some small way make more people turn to Wikipedia for political news. However, I will continue to make minor edits to the Wikipedia pages I read in my own time. While not acting as researcher, I can be collaborator and reader both.

“There remains a mistaken belief that qualitative researchers are in the business of interpreting stories and quantitative researchers are in the business of producing facts.” (boyd & Crawford, 2012) The Social Election Prediction project is once again in the data collection phase and we’re here to discuss some of the data collection decision points we have encountered thus far or, in other words, the subjective aspect of big data research. This is not to denigrate this type of quantitative research. The benefits of big data for social science research are too numerous to list here and likely any reader of this blog is more than familiar. In the era of big data, human behaviour that was previously only theorized is now observable at scale and quantifiable. This is particularly true for the topic of this project, information seeking behaviour around elections. While social scientists have long studied voting behaviour, historically they have had to rely on self-reported surveys for signals as to how individuals sought information related to an election.

Now, certain tools such as Wikipedia and Google Trends provide an outside indication as to how and when people search for information on political parties and politicians. However, although Wikipedia page views are not self-reported, this does not mean that they are objective. Wikipedia data collection requires the interjection of personal interpretation, the typical measure of subjectivity. These decisions tend to fall into two general categories: the problem of individuation and the problem of delimitation.

When is something considered a separate entity and when should it be grouped? This first question occurs frequently in big data collection. For this project, it has recurred with party alliances and two-round elections. If we are collecting Wikipedia pages to study information-seeking behaviour related to elections, should we consider views only of the page of a party alliance or of the individual parties as well? This is a problem of individuation: deciding when to consider discrete entities as disparate and when to count them as a single unit. The import of party alliances varies by country, but big data collection necessitates uniformity for the analysis stage. So, a decision must be made. The same issue arises with two-round elections. Should they be considered as one election instance or two? Again, a uniform decision is necessary for the next step of data analysis.
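To make the individuation decision concrete, here is a minimal sketch of how it could be encoded as a single switch applied uniformly to every country. The page titles, view counts and the `merge_alliances` flag are all hypothetical and chosen for illustration; this is not the project's actual pipeline.

```python
from collections import defaultdict


def aggregate_views(views, alliance_of, merge_alliances=True):
    """Sum page views per unit of analysis.

    views: dict mapping page title -> total views (hypothetical data).
    alliance_of: dict mapping a party page -> the page of its alliance.
    merge_alliances: the individuation decision. If True, member parties
        are counted under their alliance; if False, every page stands alone.
    """
    totals = defaultdict(int)
    for page, count in views.items():
        unit = alliance_of.get(page, page) if merge_alliances else page
        totals[unit] += count
    return dict(totals)


# Hypothetical example: one alliance with two member parties, one standalone party.
views = {"Alliance A": 500, "Party A1": 300, "Party A2": 200, "Party B": 600}
alliance_of = {"Party A1": "Alliance A", "Party A2": "Alliance A"}

merged = aggregate_views(views, alliance_of, merge_alliances=True)
separate = aggregate_views(views, alliance_of, merge_alliances=False)
```

With these made-up numbers the choice changes the ranking: counted separately, Party B is the most-read page, while merging puts Alliance A ahead. That is why the decision has to be made once and applied everywhere.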

For decisions of delimitation one must set a logical boundary on something continuous. Think of time. For the Social Election Prediction project, we are collecting the dates of all of the elections under consideration, so that we can compare the Wikipedia page views for the various political parties involved prior to the election. For most electoral systems, the date of an election is simple, but for countries like Italy and the Czech Republic with two-day elections, the question of when to end the information-seeking period arises. The day before the election begins? After the first day? There is no uniform data solution to this question, only yet another subjective decision by the data collector.
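The delimitation choice can also be written down explicitly. The sketch below assumes two candidate rules for where the information-seeking period ends; the rule labels and the function name `window_end` are hypothetical, and the dates are illustrative only.

```python
from datetime import date, timedelta


def window_end(polling_days, rule="before_first_day"):
    """Return the last day of the information-seeking period.

    polling_days: list of dates on which voting took place.
    rule: the delimitation decision (labels are hypothetical):
        "before_first_day" -> the day before voting begins
        "before_last_day"  -> the day before the final polling day
    """
    first, last = min(polling_days), max(polling_days)
    if rule == "before_first_day":
        return first - timedelta(days=1)
    if rule == "before_last_day":
        return last - timedelta(days=1)
    raise ValueError(f"unknown rule: {rule}")


# Illustrative two-day election; the dates are chosen for the example only.
two_day = [date(2009, 6, 6), date(2009, 6, 7)]
one_day = [date(2009, 6, 4)]
```

For a single-day election both rules give the same answer, so the choice only matters for multi-day elections, which is exactly where the subjectivity enters.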

In the article quoted above, boyd and Crawford question the objectivity of data analysis, but the subjective strains in big data research begin even earlier, with the collection stage. Data is defined in the collection stage, and these definitions, as with the analysis, can be context specific. Social media research faces the same definitional problems, but many of the collection decisions have already been made by the social media platform. Of course, the same criticisms could be raised about traditional statistical analysis as well. While there may be unique benefits to big data research, it faces many of the same problems as previous research methods. Big data is often seen as some sort of “black box”, but the process of building that box can be just as subjective as qualitative research.

My colleague Taha Yasseri and I are currently working on a Fell Fund project on social media data and election prediction, looking especially at data from Google and Wikipedia (first paper out soon; will also be presenting on that at IPP 2014 which should be great). As part of that we thought we’d have a bit of fun looking at Scotland’s independence referendum on Wikipedia.

For election prediction the method is relatively straightforward: examine readership stats on the party Wikipedia pages of the country in question, and see which page is read the most (of course that doesn’t correspond straight away to election results – would that life were so simple – and the idea of the project is to see what corrections and biases need to be accounted for to make it work). It isn’t quite so clear how to do that for Scotland, but (just for fun really) we compared the following pages:

First we look at the UK and Scotland -> interesting how Scotland has leapfrogged the UK in the last days of the independence campaign. Points to a yes victory?

In terms of flags, though, the Union Jack is well ahead of the Saltire, peaking in the last few days. Is it a last minute outbreak of unionism?

In terms of national dishes, meanwhile, Haggis has been dominating Fish and Chips for the full period of the campaign, with interest in Haggis especially spiking in the last couple of days.

Well, one of these graphs will predict the winner of the referendum: we just don’t know which one ;-) More seriously, I think it’s interesting how most of these terms are spiking in the days before the vote, showing again how the social web really responds to political events.

UPDATE: Taha has passed me the comparison of the Yes and No campaign pages, as below. Yes for a narrow win following months of No dominance – you heard it here first.

Just wanted to put up a quick plug for the euandi voting advice application [VAA] which has recently been launched by the European University Institute. I was one of the 100 or so political scientists across Europe who got together to produce the application. Fill in a short questionnaire and it will tell you the extent to which other parties share your values (both in your country and across Europe), as well as telling you which other areas in Europe contain like-minded individuals.

There are lots of VAAs around at the moment; the novelty with this one is that there is the option to then go on to connect to people with similar political views through Facebook, with the aim of (for example) getting a European Citizens’ Initiative started. Overall aim is to promote more transnational politics in the European Parliament elections, which are at the moment almost overwhelmingly dominated by national political concerns. Pretty neat stuff and it will be interesting to see what comes of it over the next month or so.

Online political information seeking, at least in the data we’ve gathered so far, happens in short, concentrated bursts. When we began the project, I (JB) was hoping that these bursts would tell us something about how people inform themselves about contemporary democratic politics. However we quickly saw in our first post that the peak of information seeking activity falls after the election itself takes place.

How can this be explained? So far we’ve been toying with two theories, developed out of the observations below. One: this behaviour is driven by news media coverage. People see the elections reported on TV or in the papers, then look them up online to find out more. If the peak in news coverage coincides with the day of the election, and its aftermath, then it’s logical that the peak of information seeking would occur shortly after that.

The second is that this behaviour is instead kind of replacing news coverage. People want to know the result of the election, especially if they participated, but for whatever reason the news media does an ineffective job of informing them of the result, so they look online instead.

One way of trying to distinguish between these two theories is by looking at information seeking activity during the European Parliament elections in countries with different election dates. The European elections in 2009 ran from the 4th to the 7th of June, but the final results were only announced on the 7th. Countries voting on the 4th, 5th and 6th would therefore have had a kind of information gap, whereby voters couldn’t find out the precise result of the elections. If information seeking is driven by a media effect, we might expect these countries to peak on the 8th (when the results are reported). If it is driven by a media replacement strategy, we would expect it to come the day after the relevant country’s election. Right?
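The test described above can be sketched in a few lines. This is an illustration under stated assumptions, not our actual analysis code: the function name `classify_peak`, the labels and the daily counts are all hypothetical, and it looks only at the single busiest day, which is a simplification.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def classify_peak(daily_views, election_day, results_day):
    """Label the busiest day against the two hypotheses.

    daily_views: dict mapping date -> page views (hypothetical data).
    election_day: the day the country voted.
    results_day: the day the final results were announced.
    """
    peak_day = max(daily_views, key=daily_views.get)
    if election_day == results_day:
        # No information gap, so the two theories predict the same day.
        return peak_day, "indistinguishable"
    if peak_day == results_day + ONE_DAY:
        return peak_day, "media effect"
    if peak_day == election_day + ONE_DAY:
        return peak_day, "media replacement"
    return peak_day, "neither"


# Made-up counts for a country voting on the 4th, with results on the 7th.
views = {
    date(2009, 6, 3): 40,
    date(2009, 6, 4): 90,
    date(2009, 6, 5): 150,
    date(2009, 6, 6): 60,
    date(2009, 6, 7): 110,
    date(2009, 6, 8): 140,
}
peak_day, label = classify_peak(views, date(2009, 6, 4), date(2009, 6, 7))
```

With these invented numbers the busiest day is the 5th, so the label is "media replacement". A series with two peaks of similar height would need more than a single-maximum rule to interpret.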

Below are the info seeking graphs for the Netherlands and the Czech Republic. As usual we are looking at page views of the Wikipedia page for the 2009 European Parliament elections in the language of the country of interest (so the Dutch and Czech versions). The Netherlands voted on the 4th of June, while the Czech Republic voted on the 5th.

Somehow, these graphs offer support for both theories, because they contain two peaks, one the day after the elections, and one the day after the results were reported. Well, none of this is perfect of course. Just because they can’t report the final result doesn’t mean the media can’t report (I believe) regional results within their country, and I think exit polls are also allowed. Finally they may just cover the elections on the day, even without reporting the results in detail. So media effects could still be the driving force. More importantly perhaps, the two theories aren’t really mutually exclusive.

In future work we are going to look at other Wikipedia pages which are more specific to the country in question. This will allow us to look at other early voting countries which don’t have a unique language (Austria, the UK, Ireland and perhaps Cyprus if the stats are high enough).

In the last post we looked at patterns of access to the Wikipedia article on the European Parliament election, 2009, and identified an electoral information cycle which consists of a build up period, a peak of information seeking, and a period of decline. In 14 of the 19 language groups we looked at, the dimensions of this pattern were very similar: that is, build up periods featured little information seeking until the few days before the election, peaks were short in duration and occurred just after the election, and periods of decline were very rapid.

In this post we want to focus on the 5 languages which didn’t fit into this trend, which were Estonian, Italian, Maltese, Norwegian and Dutch. Conveniently, each one of these languages overlaps almost perfectly with a single country, meaning that we can engage in some speculation about how the characteristics of different national systems affect patterns of electoral information seeking.

Both Malta and Norway featured very low absolute numbers of views to the web page in question, which presumably relates to the very low absolute population of Maltese speakers, the fact that Norwegian citizens don’t vote in EU Parliamentary elections (Norway does adopt lots of EU legislation through EFTA, so a very politically motivated Norwegian might still get interested to an extent), and of course the fact that English is widely spoken in both countries, meaning that they would have access to the English version of Wikipedia.

Of more interest to us are the patterns emerging from the Estonian, Dutch and Italian cases. Each one is distinct so we’ll look at each of them in turn.

Above is the situation in the Netherlands, which is unusual in that it presents two peaks of activity. The first falls the day after the Dutch portion of the EU elections took place (on the 4th of June), the second on the day after the last day of the election period defined by the EU as a whole. Both events are likely to provoke media coverage and hence drive searching, consistent with our media effects thesis expounded in the last post. However it is also worth noting that the European Commission prohibits distribution of country specific election results until all results are in from all countries in Europe. This means that Dutch voters would have had a several day wait to find out the results of the election they participated in. It could be in other words that instead of being generated by a media effect, this searching is generated by people who want to know the election result but haven’t been able to find it reported in the media.

Above is the situation in Italy. The country is unusual in that the build up period features a relatively large amount of activity (with a clearly visible weekly cyclical pattern), and also in that the peak of information seeking is sustained over several days. A couple of thoughts emerge. Firstly, Italy is known for having an electoral silence law which prevents opinion polling in the 15 days before the election, and any type of campaigning in the day or so before it. It could be that this relatively high build up period is a result of a dearth of information in the media about the elections.

Secondly, the Italian part of the European Parliament elections took place over two days (Saturday afternoon and Sunday). This may explain the longer peak, as people who participated on Saturday might search for the results on Sunday, whilst others might look on Monday.

Finally, we present the view from Estonia. The absolute numbers in this country are very small meaning that we shouldn’t read too much into it. However the pattern is highly unusual, with a peak falling well in advance of the election, meaning that we wanted to say something about it.

The Estonian part of the 2009 European Parliament elections was unusual for two reasons. Firstly, it featured a dramatic 17 point rise in turnout. Secondly, against a backdrop of frustration with the major parties, two independent candidates gained a significant proportion of the votes. How does this relate to an early peak? Honestly it’s not quite clear: perhaps more would need to be known about the specifics of the campaign. Nevertheless it feels relevant that this unusual pattern of information seeking coincided with an unusual electoral result.

Conclusions: Of course this is all speculation at the moment. However, what emerges from this is an alternative thesis to the “media effect” discussed in the last post. Rather than seeking electoral information in response to reporting in the news media, voters may seek it because of a lack of such reporting (either because of an electoral silence law or a delay in the publication of the results). Such effects may be sharpened in an environment with new or unusual candidates. This is important for our overall question because it suggests that those who are seeking the information are likely to be those who participated in the election.

Next steps: see if any of these ideas can find support in the results from other countries. Did other countries which voted on the 4th of June have a twin peak style pattern? Are there any other electoral silence laws? What about victories for independent candidates?

When do people start getting interested in elections, and how does this differ in different countries? In this post we try to get a handle on this question looking at data drawn from Wikipedia. Wikipedia is a useful resource because it has editions in a huge variety of languages, even if the absolute penetration of each varies.

We pull out information on daily readership statistics for the Wikipedia article on the European Parliament election, 2009 for May and June. The election itself was held between 4-7 June in the 27 member states of the EU. We look at 19 different language versions of the article, all of which are either official EU languages or at least widely spoken in one region. As most of the languages we look at are roughly unique to one European country (with some caveats), this gives us an idea of how public attention to the issue of the election builds up and then decays past election time in each country.

The figures above show the case of Swedish, Polish and English Wikipedias by way of example. The overall picture can be broken down into three periods: pre-election, the moment of the election itself, and post-election. Several things are worth highlighting. First, although there is clearly some activity before election time, there isn’t really a “build up” until just a few days before the election. Second, the peak in activity broadly coincides with the election itself, but actually falls just after it. Third, attention decays very quickly after the peak back towards the base level. However, note that the height of the peak can vary significantly from one language to another.

Out of the 19 language editions that we have studied, 13 + English follow this three-stage pattern very closely. The figure above shows a curve collapse for these 13 language editions after normalisation to the maximum daily page views of each language (in the next post we will focus on the 5 remaining outlier language editions).
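For readers unfamiliar with the normalisation step, here is a minimal sketch of the idea: each language's daily series is divided by its own maximum, so that every curve peaks at 1 and the shapes can be compared despite very different absolute numbers. The function name and the two series are hypothetical, not our real data.

```python
def normalise_to_peak(daily_views):
    """Scale a series of daily page views so that its maximum equals 1."""
    peak = max(daily_views)
    if peak == 0:
        raise ValueError("series has no page views to normalise")
    return [v / peak for v in daily_views]


# Two made-up language editions with the same shape but different scale.
small_edition = [10, 20, 40, 20]
large_edition = [100, 200, 400, 200]

small_curve = normalise_to_peak(small_edition)
large_curve = normalise_to_peak(large_edition)
```

Here both editions reduce to the same curve, `[0.25, 0.5, 1.0, 0.5]`, which is what it means for the curves to collapse onto one another.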

We speculate that this graph shows the majority of electoral information seeking occurs in response to publicity generated by media events, and hence reflects the continued importance of the mainstream media for the functioning of democracy. Somewhat paradoxically however, these media seem to stimulate major public interest in the European Parliament elections only after they take place, rather than pushing people to inform themselves before participating.

For our purposes, the next question is of course whether the height of the peaks, which represents several thousand individuals going to the page at election time, corresponds to any electoral outcomes (e.g. turnout). We will tackle this in a future post.