Thursday, October 17, 2013

A small experiment with Twitter's language detection algorithm

Some time ago I captured quite a lot of geo-located tweets for a spatial statistics project I'm doing. The tweets I collected were all confined to Belgium. One of the things I looked at was the language of the tweets. As you might know, Belgium officially has three languages: Dutch, French and German. Of course, when you analyze a large set of tweets, you can't determine the language manually. On the other hand, blindly relying on Twitter's language detection algorithm doesn't feel good either.

That's why I set up a little experiment to assess to what extent Twitter's language detection algorithm can be trusted, in the context of my geo-location project. I stress this because I don't have the ambition to make overall judgments on how Twitter takes care of language detection.

First, let's look at the languages of the 150,000 or so tweets I collected, as determined by the Twitter language detection algorithm. The bar chart below shows the frequency of each language.

I'm not sure if this chart is readable enough, so let me guide you through it.
The green bars are the 3 official languages of Belgium: Dutch, French and German.
French and Dutch take the top positions, while German is in seventh position.
Based on population figures you would expect more Dutch posts than French posts, while this data shows the opposite.
There can be many good reasons why this happens. To start with the obvious, the Twitter population is not the general population, and hence the distribution of languages can be different as well.
Another obvious reason is that tweets can also come from foreigners, tourists for instance. While the sample is large (about 150,000 tweets), I need to rely on Twitter to provide a good sample of all tweets, and I'm not too sure about that.
Also, it might be that Dutch-speaking Belgians tweet more in English than their French-speaking counterparts.
And finally, it is possible that the Twitter detection algorithm is more successful in detecting some languages than others.

The fact that English (the blue bar) comes in third will not come as a surprise. Turkish is fourth (the top red bar), which can be explained by the relatively large immigrant population coming from Turkey. The other languages, such as Spanish and Portuguese (the remaining red bars), decrease quite rapidly in frequency. But notice that the scale of the chart is somewhat deceiving: lower-ranked languages such as Thai and Chinese, which are barely visible in the chart, still represent 40 and 20 tweets respectively. Overall this looks like another example of a power law, where a few languages are responsible for the vast majority of tweets, while a large number of languages are used in the remaining tweets.

You will have noticed that the fifth most important language (the orange bar) is "Undecided": these are the tweets where the Twitter detection algorithm was not able to detect which language was used. Two other cases stand out (purple bars): Indonesian and Tagalog, in positions 9 and 10. Tagalog is an Austronesian language spoken in the Philippines. In a blog post on the Twitter languages of London, Ed Manley (@EdThink) noticed that Tagalog came in seventh place in London. He writes:

One issue with this approach that I did note was the surprising popularity of Tagalog, a language of the Philippines, which initially was identified as the 7th most tweeted language. On further investigation, I found that many of these classifications included just uses of English terms such as ‘hahahahaha’, ‘ahhhhhhh’ and ‘lololololol’. I don’t know much about Tagalog but it sounds like a fun language.

Here are the eight first Tagalog tagged tweets in my dataset:

@xxx hahaha!!!

@xxx hahaha

@xxx das ni goe eh?

@xxx hahaha

SUMBARIE !

Swedish couple named their kid "Brfxxccxxmnpcccclllmmnprxvclmnckssqlbb11116." The name is pronounced "Albin.

#LRT hahahahahaha le salaud

hahah

Basically, what we see in Belgium is very similar to what was observed in London: tweets containing expressions such as 'hahaha' are catalogued as Tagalog. So for my spatial statistics exercise (and for this experiment) I think it is safe to consider both Tagalog and Indonesian as Undecided. (My thoughts go to the poor researchers in the Philippines, who must face quite a challenge when they analyze Twitter data. On the other hand, they now have yet another good reason not to touch Twitter data ;-)
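
The recoding step can be sketched in a few lines of Python. This is only an illustration, not the code used for this post: the language codes ('tl' for Tagalog, 'in' for Indonesian, 'und' for undecided) and the example tweets are assumptions made for the sketch.

```python
# Assumed language codes (not taken from the post):
# 'tl' = Tagalog, 'in' = Indonesian, 'und' = undecided.
UNRELIABLE = {"tl", "in"}


def recode_language(lang):
    """Map language tags considered unreliable to 'und'; leave others unchanged."""
    return "und" if lang in UNRELIABLE else lang


# Hypothetical (text, detected language) pairs, for illustration only.
tweets = [("@xxx hahaha", "tl"), ("Goeiemorgen!", "nl"), ("Bonjour", "fr")]
recoded = [(text, recode_language(lang)) for text, lang in tweets]
print(recoded)
```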

Back to the experiment. I took a simple random sample of 100 tweets and asked 4 coders (including myself) to determine in what language a tweet was expressed. I gave the coders only minimal instructions in an attempt not to influence them too much. I did provide them with a very simple 'coding scheme', based on the most common languages (Dutch, French, or English, and a category for both the cases where the coder was not able to determine the language used and all other languages). Now, this might sound like a trivial exercise, but a tweet like "I'm at Comme Chez Soi in Brussel", can be seen as English, French or Dutch, depending on how you interpret the instructions.

This results in a data matrix consisting of 100 rows and 5 columns (i.e. the language assessments of Twitter and the 4 coders). A data scientist will immediately start to think about how to analyze this (small) dataset. There are many ways of doing that. Let's start with the obvious, i.e. comparing the Twitter outcome with one of the coders. You can easily represent that in a frequency table:

      EN  FR  NL  WN
EN    14   2   1   2
FR     3  34   0   3
NL     1   0  24   0
WN     5   5   1   5

The rows represent the language of a tweet according to Twitter (EN=English, FR=French, NL=Dutch and WN=Don't know or another language). The columns represent the language according to the first coder. We now have different options. Some folks do a Chi-Square-test on this table, but this is not without problems. To start with, testing the hypothesis of independence is not necessarily relevant for assessing the agreement between two coders, and we can run into trouble with zero or near-zero cells and marginals. Either way, here are the results of such a test: X-squared = 136.6476, df = 9, p-value < 2.2e-16

As the $p$-value is smaller than the usual 0.05, we reject the null hypothesis and conclude that the two codings are not independent and hence somehow 'related'. Again, this seems a rather weak requirement given the coding task at hand. Also, $\chi^2$ is sensitive to sample size, so simply increasing the number of tweets would eventually lead to significance even if we had not reached it at $n=100$.
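
For readers who want to check the statistic by hand, here is a minimal Python sketch of the Pearson $\chi^2$ computation on the frequency table above. It is not the code used for this post, and the function name `chi_square` is just a label chosen for the illustration.

```python
# Rows: Twitter's label; columns: first coder's label (order EN, FR, NL, WN).
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


stat, df = chi_square(table)
print(round(stat, 4), df)
```

The statistic should come out close to the 136.65 reported above, with 9 degrees of freedom. The sketch does not compute the $p$-value, which would need a $\chi^2$ distribution function.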

One alternative is to normalize the $\chi^2$-statistic somehow. There are many ways to do that. One approach is to divide by the sample size $n$ and by the number of categories minus 1, and then take the square root. This is called Cramér's V:
$$r_V=\sqrt{{\chi^2 \over n \times \min[R-1, C-1]}}$$,
where $C$ is the number of columns and $R$ is the number of rows. Cramér's V is often used in statistics to measure the association between two categorical variables. If there is no association at all it becomes 0, and perfect association leads to 1. In this example $R=C=4$ because we consider 4 language categories, which results in $r_V=0.6749016$.
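
As a quick check of the arithmetic, the formula can be evaluated directly from the reported $\chi^2$ value. This is a small sketch with a helper name (`cramers_v`) picked for the illustration.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V from a chi-square statistic, sample size and table dimensions."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Values reported in the text: chi-square = 136.6476, n = 100, R = C = 4.
v = cramers_v(136.6476, 100, 4, 4)
print(round(v, 4))
```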

Sometimes simpler or at least more obvious approaches are used, such as taking the proportion of the items for which the two coders agreed. If we assume that both coders have used the same number of categories $G=R=C$, we can formalize this with:
$$r_{pca}= {\sum_{i=1}^G f_{ii}\over n}$$.
In the example this results in $r_{pca}=0.77$. So in more than three quarters of tweets, Twitter and the first coder agree on the language.
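
In code this is just the sum of the diagonal divided by the table total. A minimal Python sketch on the frequency table above (the function name is an assumption of the example):

```python
# Rows: Twitter's label; columns: first coder's label (order EN, FR, NL, WN).
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]


def proportion_agreement(table):
    """Share of items on which both raters chose the same category."""
    n = sum(sum(row) for row in table)
    agreed = sum(table[i][i] for i in range(len(table)))
    return agreed / n


r_pca = proportion_agreement(table)
print(r_pca)
```
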
The drawback here is that we don't account for chance agreement. Cohen's $\kappa$ is an alternative that does. Such a correction is generally made by subtracting the expected value from the original statistic and dividing the result by the maximum value of that statistic minus the expected value. In the case of Cohen's $\kappa$ this results in:
$$r_\kappa={r_{pca} - E(r_{pca})\over 1-E(r_{pca})}$$,
with
$$E(r_{pca})={\sum_{i=1}^G{f_{i.}\times f_{.i}\over n}\over n}$$,
in which $f_{i.}$ and $f_{.i}$ are the marginal frequencies of row $i$ and column $i$. For our example $E(r_{pca})=0.2887$, which yields $r_\kappa=0.6766484$.
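
The same calculation as a short Python sketch, again on the frequency table above and not the code used for this post. It assumes a square table in which both raters use the same categories in the same order.

```python
# Rows: Twitter's label; columns: first coder's label (order EN, FR, NL, WN).
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]


def cohens_kappa(table):
    """Cohen's kappa for a square table of two raters' category counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    observed = sum(table[i][i] for i in range(len(table))) / n
    # Chance agreement: products of matching row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (observed - expected) / (1 - expected)


kappa = cohens_kappa(table)
print(round(kappa, 7))
```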

Yet another interesting alternative is the family of approaches that consider the ${n \choose 2}$ pairs of judgments rather than the $n$ judgments directly. This approach is popular in the cluster analysis and psychometrics literature, with indexes such as the Rand Index and all sorts of variations on that index, such as the Hubert and Arabie Adjusted Rand Index. Recently I stumbled on a very interesting article, "On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index" by Matthijs J. Warrens in the Journal of Classification, which I recommend very strongly.
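
To make the pair-counting idea concrete, here is a sketch of the Hubert-Arabie Adjusted Rand Index computed from a contingency table, using the usual formulation in terms of binomial coefficients of the cell counts and marginals. It is an illustration only and was not part of the original analysis. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb

# Rows: Twitter's label; columns: first coder's label (order EN, FR, NL, WN).
table = [
    [14, 2, 1, 2],
    [3, 34, 0, 3],
    [1, 0, 24, 0],
    [5, 5, 1, 5],
]


def adjusted_rand_index(table):
    """Hubert-Arabie Adjusted Rand Index from a contingency table of counts."""
    n = sum(sum(row) for row in table)
    # Pairs of items placed together by both raters.
    sum_cells = sum(comb(x, 2) for row in table for x in row)
    # Pairs placed together by the row rater, and by the column rater.
    sum_rows = sum(comb(sum(row), 2) for row in table)
    sum_cols = sum(comb(sum(col), 2) for col in zip(*table))
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (maximum - expected)


ari = adjusted_rand_index(table)
print(round(ari, 4))
```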

But one issue that is tackled less often in the literature is that in this type of situation we often have more than one coder or judge. The classical approach is then to calculate the index for all pairwise combinations of coders and take a decision from there.
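
A pairwise approach could look like the following Python sketch, which computes Cohen's $\kappa$ for every pair of raters. The ratings are made up for the illustration (the real data matrix is not reproduced here), and the names `ratings` and `kappa_from_labels` are assumptions of the example.

```python
from collections import Counter
from itertools import combinations


def kappa_from_labels(a, b):
    """Cohen's kappa for two equally long lists of category labels.

    Raises ZeroDivisionError if both raters use one and the same single category.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of six tweets, for illustration only.
ratings = {
    "twitter": ["NL", "FR", "EN", "FR", "NL", "WN"],
    "coder1": ["NL", "FR", "EN", "FR", "NL", "EN"],
    "coder2": ["NL", "FR", "EN", "NL", "NL", "EN"],
}

pairwise = {
    (r1, r2): kappa_from_labels(ratings[r1], ratings[r2])
    for r1, r2 in combinations(ratings, 2)
}
for pair, value in pairwise.items():
    print(pair, round(value, 3))
```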

Incidentally, there are a few areas of research where multiple coders are often used, for example qualitative research. Indeed, qualitative research has a long tradition of handling situations where 'subjectivity' can play an important role. Very often this is done by, among other things, using multiple coders. The literature on the methodology is quite separate from the mainstream statistical literature, but nonetheless there are some interesting things to learn from that field. One of the popular indices in qualitative research is Krippendorff's $\alpha$.

In content analysis, reliability data refers to a situation in which independent coders assign a value from a set of instructed values to a common set of units of analysis. The overall reliability or agreement is expressed as:
$$ \alpha=1-{D\over E(D)}$$,
in which $D$ is a disagreement measure and $E(D)$ its expectation. The details of the calculation would lead us too far, but a simple example is available on the Wikipedia page for Krippendorff's alpha.
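
For the nominal case the calculation is short enough to sketch in Python. The version below builds the usual coincidence matrix, in which every ordered pair of values within a unit is counted with weight $1/(m_u-1)$, where $m_u$ is the number of coders who rated that unit. It is an illustration under these assumptions, not the irr implementation, and the toy data are made up.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list of labels per unit of analysis; None marks a missing value.
    Raises ZeroDivisionError if only a single category occurs in the data.
    """
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # a unit with fewer than two values cannot be paired
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    return 1 - observed / expected


# Hypothetical toy data: four units, two coders.
toy = [["a", "a"], ["a", "a"], ["b", "b"], ["a", "b"]]
alpha = krippendorff_alpha_nominal(toy)
print(round(alpha, 4))
```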

The index can be used for any number of coders, deals with missing data, and can handle different levels of measurement such as binary, nominal, ordinal, interval, and so on. It is also claimed that it 'adjusts itself to small sample sizes of the reliability data'. It is not clear to me where and to what extent these claims are proven. Nonetheless, in practice this index is used to have a single coefficient that allows one to compare reliabilities 'across any numbers of coders and values, different metrics, and unequal sample sizes'.

I used the irr library in R to calculate Krippendorff's $\alpha$ for all 5 coders (Twitter and the 4 human coders), which resulted in $0.796$. That is just below the threshold of 0.8 commonly used in the social sciences. So we can't claim that all coders, including Twitter, agreed completely on the language detection task. On the other hand, we are not too far off what would be considered good.

There were 84 tweets on which all 4 human coders agreed. In 71 of those 84, Twitter came up with the same language as the human coders. That's about 85%. That's not excellent, but it's not bad either.
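
This kind of count is easy to reproduce. The sketch below filters the tweets on which all human coders agree and then checks how often Twitter gives the same label. The rows are hypothetical stand-ins, not the real data, and the function name is an assumption of the example.

```python
def twitter_vs_unanimous(rows):
    """Count unanimous human codings and how many of them Twitter matches.

    rows: (twitter_label, [human labels]) tuples, one per tweet.
    """
    unanimous = [(t, humans[0]) for t, humans in rows if len(set(humans)) == 1]
    matches = sum(t == h for t, h in unanimous)
    return len(unanimous), matches


# Hypothetical rows, for illustration only.
rows = [
    ("NL", ["NL", "NL", "NL", "NL"]),
    ("FR", ["FR", "FR", "FR", "FR"]),
    ("WN", ["EN", "EN", "EN", "EN"]),
    ("EN", ["EN", "NL", "EN", "EN"]),
]
n_unanimous, n_match = twitter_vs_unanimous(rows)
print(n_unanimous, n_match, n_match / n_unanimous)
```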

Let's take a look at a few examples where all 4 human coders agreed, but Twitter didn't:

Examples 1, 4 and 8 seem intrinsically hard because there is no correct answer, so we can't hold that against Twitter. Examples 2, 3 and 6 seem to be very straightforward cases that Twitter didn't capture. Example 5 was catalogued as French by Twitter, while the human coders put it in the rest/Don't know category.

All in all I believe that the number of obvious mistakes is not too high, although that assessment, of course, depends on the type of application. I can very well imagine that for some applications this is not good enough.

Based on all the different indices, interpretations and examples, my conclusion is that for my spatial statistics project, the Twitter language detection algorithm is not perfect, but good enough. I will use the language suggestion, but only after regrouping and after making sure that Tagalog and the like are recoded towards 'undecided'.

About Me

Istvan Hajnal is a veteran of more than 20 years in the fields of data analysis, survey methodology and market research. He worked first at the University of Leuven, Belgium, and then for about 10 years with The Nielsen Company, the world's largest market research company. Istvan is currently Insights Director, Marketing & Data Sciences for GfK, Belgium.
He received a master's degree in computer science (Leuven), a master's degree in quantitative applications in the social sciences (Brussels) and finally a PhD in social sciences from the University of Leuven.
He blogs about Data Science but occasionally also on management and leadership in general and the Market Research Industry in particular.