Those of you who know me may know that I’m a big fan of emoji. I’m also a big fan of linguistics and NLP, so, naturally, I’m very curious about the linguistic roles of emoji. Since I figured some of you might also be curious, I’ve pulled together a discussion of some of the very serious scholarly research on emoji. In particular, I’m going to talk about five recent papers that explore the exact linguistic nature of these symbols: what are they and how do we use them?

Emoji are more than just cute pictures! They play a set of very specific linguistic roles.

Dürscheid & Siever, 2017:

This paper makes one overarching point: emoji are not words. They cannot be unambiguously interpreted without supporting text and they do not have clear syntactic relationships to one another. Rather, the authors consider emoji to be specialized characters, and place them within Gallmann’s 1985 hierarchy of graphical signs. The authors show that emoji can play a range of roles within the Gallmann’s functional classification.

Allography: using emoji to replace specific characters (for example: the word “emoji” written as “em😝ji”)

Ideograms: using emoji to replace a specific word (example: “I’m travelling by 🚘” to mean “I’m travelling by car”)

Border and Sentence Intention signals: using emoji both to clarify the tone of the preceding sentence and also to show that the sentence is over, often replacing the final punctuation marks.

Based on an analysis of a Swiss German Whatsapp corpus, the authors conclude that the final category is far and away the most popular, and that emoji rarely replace any part of the lexical parts of a message.

Na’aman et al, 2017:

Na’aman and co-authors also develop a hierarchy of emoji usage, with three top-level categories: Function, Content (both of which would fall under mostly under the ideogram category in Dürscheid & Siever’s classifications) and Multimodal.

Function: Emoji replacing function words, including prepositions, auxiliary verbs, conjunctions, determinatives and punctuation. An example of this category would be “I like 🍩 you”, to be read as “I do not like you”.

Content: Emoji replacing content words and phrases, including nouns, verbs, adjectives and adverbs. An example of this would be “The 🔑 to success”, to be read as “the key to success”.

Multimodal: These emoji “enrich a grammatically-complete text with markers of
affect or stance”. These would fall under the category of border signals in Dürscheid & Siever’s framework, but Na’aman et all further divide these into four categories: attitude, topic, gesture and other.

Based on analysis of a Twitter corpus made of up of only tweets containing emoji, the authors find that multimodal emoji encoding attitude are far and away the most common, making up over 50% of the emoji spans in their corpus. The next most common uses of emoji are to multimodal:topic and multimodal:gesture. Together, these three categories account for close to 90% of the all the emoji use in the corpus, corroborating the findings of Dürscheid & Siever.

Wood & Ruder, 2016:

Wood and Ruder provide further evidence that emoji are used to express emotion (or “attitude”, in Na’aman et al’s terms). They found a strong correlation between the presence of emoji that they had previously determined were associated with a particular emotion, like 😂 for joy or 😭 for sadness, and human annotations of the emotion expressed in those tweets. In addition, an emotion classifier using only emoji as input performed similarly to one trained using n-grams excluding emoji. This provides evidence that there is an established relationship between specific emoji use and expressing emotion.

Donato & Paggio, 2017:

However, the relationship between text and emoji may not always be so close. Donato & Paggio collected a corpus of tweets which contained at least one emoji and that were hand-annotated for whether the emoji was redundant given the text of the tweet. For example, “We’ll always have Beer. I’ll see to it. I got your back on that one. 🍺” would be redundant, while “Hopin for the best 🎓” would not be, since the beer emoji expresses content already expressed in the tweet, while the motorboard adds new information (that the person is hoping to graduate, perhaps). The majority of emoji, close to 60%, were found not to be redundant and added new information to the tweet.

However, the corpus was intentionally balanced between ten topic areas, of which only one was feelings, and as a result the majority of feeling-related tweets were excluded from analysis. Based on this analysis and Wood and Ruder’s work, we might hypothesize that feelings-related emoji may be more redundant than other emoji from other semantic categories.

Barbieri et al, 2017:

Additional evidence for the idea that emoji, especially those that show emotion, are predictable given the text surrounding them comes from Barbieri et al. In their task, they removed the emoji from a thousand tweets that contained one of the following five emoji: 😂, ❤️, 😍, 💯 or 🔥. These emoji were selected since they were the most common in the larger dataset of half a million tweets. Then then asked human crowd workers to fill in the missing emoji given the text of the tweet, and trained a character-level bidirectional LSTM to do the same task. Both humans and the LSTM performed well over chance, with an F1 score of 0.50 for the humans and 0.65 for the LSTM.

So that was a lot of papers and results I just threw at you. What’s the big picture? There are two main points I want you to take away from this post:

People mostly use emoji to express emotion. You’ll see people playing around more than that, sure, but by far the most common use is to make sure people know what emotion you’re expressing with a specific message.

Emoji, particularly emoji that are used to represent emotions, are predictable given the text of the message. It’s pretty rare for us to actually use emoji to introduce new information, and we generally only do that when we’re using emoji that have a specific, transparent meaning.

If you’re interested in reading more, here are all the papers I mentioned in this post:

First, a confession: part of the reason I’m writing this blog post today is becuase I’m having major FOMO on account of having missed #FAT2018, the first annual Conference on Fairness, Accountability, and Transparency. (You can get the proceedings online here, though!) I have been seeing a lot of tweets & good discussion around the conference and it’s gotten me thinking again about something I’ve been chewing on for a while: what does it mean for a machine learning tool to be fair?

If you’re not familiar with the literature, this recent paper by Friedler et al is a really good introduction, although it’s not intended as a review paper. I also review some of it in these slides. Once you’ve dug into the work a little bit, you may notice that a lot of this work is focused on examples where output of the algorithm is a decision made on the level of the individual: Does this person get a loan or not? Should this resume be passed on to a human recruiter? Should this person receive parole?

Fairness is a balancing act.

While these are deeply important issues, the methods developed to address fairness in these contexts don’t necessarily translate well to evaluating fairness for other applications. I’m thinking specifically about tools like speech recognition or facial recognition for automatic focusing in computer vision: applications where an automatic tool is designed to supplant or augment some sort of human labor. The stakes are lower in these types of applications, but it’s still important that they don’t unintentionally end up working poorly for certain groups of people. This is what I’m calling “parity of utility”.

Parity of utility: A machine learning application which automatically completes a task using human data should not preform reliably worse for members of one or more social groups relevant to the task.

That’s a bit much to throw at you all at once, so let me break down my thinking a little bit more.

Machine learning application: This could be a specific algorithm or model, an ensemble of multiple different models working together, or an entire pipeline from data collection to the final output. I’m being intentionally vague here so that this definition can be broadly useful.

Automatically completes a task: Again, I’m being intentionally vague here. By “a task” I mean some sort of automated decision based on some input stimulus, especially classification or recognition.

Human data: This definition of fairness is based on social groups and is thus restricted to humans. While it may be frustrating that your image labeler is better at recognizing cows than horses, it’s not unfair to horses becuase they’re not the ones using the system. (And, arguably, becuase horses have no sense of fairness.)

Preform reliably worse: Personally I find a statistically significant difference in performance between groups with a large effect size to be convincing evidence. There are other ways of quantifying difference, like the odds ratio or even the raw accuracy across groups, that may be more suitable depending on your task and standards of evidence.

Social groups relevant to the task: This particular fairness framework is concerned with groups rather than individuals. Which groups? That depends. Not every social group is going to relevant to every task. For instance, your mother language(s) is very relevant for NLP applications, while it’s only important for things like facial recognition in as far as it co-varies with other, visible demographic factors. Which social groups are relevant for what types of human behavior is, thankfully, very well studied, especially in sociology.

So how can we turn these very nice-sounding words into numbers? The specifics will depend on your particular task, but here are some examples:

Is there a statistically-significant difference in the word error rate for commercial speech recognition algorithms based on the speaker’s dialect, race and gender? (Answer: yes, yes and maybe, with General American and white talkers having the lowest word error rates.)

So the tools reviewed in these papers don’t demonstrate parity of utility: they work better for folks from specific groups and worse for folks from other groups. (And, not co-incidentally, when we find that systems don’t have parity in utility, they tend work best for more privileged groups and worse for less privileged groups.)

Parity of Utility vs. Overall Performance

So this is the bit where things get a bit complicated: parity of utility is one way to evaluate a system, but it’s a measure of fairness, not overall system performance. Ideally, a high-performing system should also be a fair system and preform well for all groups. But what if you have a situation where you need to choose between prioritizing a fairer system or one with overall higher performance?

I can’t say that one goal is unilaterally better than the other in all situations. What I can say is that focusing on only higher performance and not investigating measures of fairness we risk building systems that have systematically lower performance for some groups. In other words, we can think of an unfair system (under this framework) as one that is overfit to one or more social groups. And, personally, I consider a model overfit to a specific social group while being intended for general use to be flawed.

Parity of Utility is an idea I’ve been kicking around for a while, but it could definitely use some additional polish and wheel-kicking, so feel free to chime in in the comments. I’m especially interested in getting some other perspectives on these questions:

Do you agree that a tool that has parity in utility is “fair”?

What would you need to add (or remove) to have a framework that you would consider fair?

What do you consider an acceptable balance of fairness and overall performance?

Would you prefer to create an unfair system with slightly higher overall performance or a fair system with slightly lower overall performance? Which type of system would you prefer to be a user of? What about if a group you are a member of had reliably lower performance?

Is it more important to ensure that some groups are treated fairly? What about a group that might need a tool for accessibility rather than just convenience?

Like this:

I recently gave a talk at PyCascades (a regional Python language conference) on character encodings and I thought it would be nice to put together a little primer on a couple different important character encodings.

So many characters, so little time.

If you’re unfamiliar with character encodings, they’re just a variety of different systems used to map a string of binary (i.e. 1’s and 0’s) to a specific character. So the Euro character, €, would be represented as “111000101000001010101100” in a character encoding called UTF-8 but “10100100” in the Latin 9 encoding. There are a lot of different character encodings out there, so I’m just going to cover a handful that I think are especially interesting/important.

Name: ASCII

Created: 1960

Also known as:American Standard Code for Information Interchange, US-ASCII

Most often seen: In legacy systems, especially U.S. government databases that need to be backward compatible

ASCII was the first widely-used character encoding. It has space for only 128 characters, and is best suited for English-language data.

Most often seen: Representing languages not covered by ASCII (like Spanish and Portuguese).

Latin 1 is the most popular of a large set of character encodings developed by the ISO (International Organization for Standardization) to deal with the fact that ASCII only really works well for English by adding additional characters. They did this by adding one extra bit to each character (8 instead of the 7 ASCII uses) so that they had space for 256 characters per encoding. However, since there are a lot of characters out there, there are 16 different ISO encodings that can handle different alphabets. (For example, ISO 8859-5 handles Cyrillic characters, while ISO 8859-11 maps Thai characters.)

While the ISO was developing a standard set of character encodings, pretty much every large software company was also developing thier own set of proprietary encodings that did pretty much the same thing. Windows-1252 is slightly tweaked version of Latin 1, but Windows also had a bunch of different encodings, as did Apple, as did IBM. The 1980’s were a wild time for character encodings!

Name: Shift-JIS

Created: 1997

Also known as: Shift Japanese Industrial Standards, SJIS, Shift_JIS

Most often seen: For Japanese

So one thing you may notice about the character encodings above is that they’re all fairly small, i.e. can’t handle more than 256 characters. But what about a language that has way more than 256 characters, like Japanese? (Japanese has a phonetic writing system with 71 characters, as well as 85,000 kanji characters which each represent a single word.) Well, one solution is to create a different character encoding system specifically for that language with enough space for all the characters you’re going to want to use frequently. But, like with the ASCII-based encodings I talked about above, just one encoding isn’t quite enough to cover all the needs of the language, so a lot of variants and extensions popped up.

Which brings us to UTF-8, the current standard for text encoding. The UTF encodings map from binary to what are known as Unicode codepoints, and then those codepoints are mapped to characters. Why the whole “codepoints” thing in the middle? To help overcome the problems with language-specific encodings that were discussed above. There are over one million codepoints, of which a little over 130,000 have actually been assigned to specific characters. You can update which binary patterns map to which codepoints independently of which codepoints map to which characters. The large number of code points also means that UTF encodings are also pretty future-proof: we have space to add a lot of new characters before we run out. And, in case you’re wondering, there is a single body in charge of determining which code points map to which characters (including emoji!). It’s called the Unicode Consortium and anyone’s free to join.

Like I mentioned, there are lots of different character encodings out there, but knowing about these five and how they’re related will give you a good idea of the complexities of the character encodings landscape. If you’re curious to learn more, the Unicode Standard is a good place to start.

One thing I remember very clearly from writing my dissertation is how confused I initially was about which particular methods I could use to evaluate how often my models were correct or wrong. (A big part of my research was comparing human errors with errors from various machine learning models.) With that in mind, I thought it might be handy to put together a very quick equation-free primer of some different ways of measuring error.

The first step is to figure out what type of model you’re evaluating. Which type of error measurement you use depends on the type of model you’re evaluating. This was a big part of what initially confused me: much of my previous work had been with regression, especially mixed-effects regression, but my dissertation focused on multi-class classification instead. As a result, the techniques I was used to using to evaluate models just didn’t apply.

Today I’m going to talk about three types of models: regression, binary classification and multiclass classification.

R-squared: This is a measurement of how correlated your predicted values are with the actual observed values. It ranges from 0 to 1, with 0 being no correlation and 1 being perfect correlation. In general, models with higher r-squareds are a better fit for your data.

Root mean squared error (RMSE), aka standard error: This measurement is an average of how wrong you were for each point you predicted. It ranges from 0 up, with closer to zero being better. Outliers (points you were really wrong about) will disproportionately inflate this measure.

Binary Classification

In binary classification, you aim to predict which of two classes an observation will fall. Examples include predicting whether a student will pass or fail a class or whether or not a specific passenger survived on the Titanic. This is a very popular type of model and there are a lot of ways of evaluating them, so I’m just going to stick to the four that I see most often in the literature.

Accuracy: This is proportion of the test cases that your model got right. It ranges from 0 (you got them all wrong) to 1 (you got them all right).

Precision: This is a measure of how good your model is at selecting only the members of a certain class. So if you were predicting whether students would pass or not and all of the students you predicted would pass actually did, then your model would have perfect precision. Precision ranges from 0 (none of the observations you said were in a specific class actually were) to 1 (all of the observations you said were in that class actually were). It doesn’t tell you about how good your model is at identifying all the members of that class, though!

Recall (aka True Positive Rate, Specificity): This is a measure of how good your model was at finding all the data points that belonged to a specific class. It ranges from 0 (you didn’t find any of them) to 1 (you found all of them). In our students example, a model that just predicted all students would pass would have perfect recall–since it would find all the passing students–but probably wouldn’t have very good precision unless very few students failed.

F1 (aka F-Score): The F score is the (harmonic) mean of both precision and recall. It also ranges from 0 to 1. Like precision and recall, it’s calculated based on a specific class you’re interested in. One thing to note about precision, recall and F1 is that they all don’t count true negatives (when you guessed something wasn’t in a specific class and you were right) so if that’s an important consideration for your model you probably shouldn’t rely on these measures.

Multiclass Classification

Multiclass classification is the task of determining which of three or more classes a specific observation belongs to. Things like predicting which icecream flavor someone will buy or automatically identifying the breed of a dog are multiclass classification.

Confusion Matrix: One of the most common ways to evaluate multiclass classifications is with a confusion matrix, which is a table with the actual labels along one axis and the predicted labels along the other (in the same order). Each cell of the table has a count value for the number of predictions that fell into that category. Correct predictions will fall along the center diagonal. This won’t give you a single summary measure of a system, but it will let you quickly compare performance across different classes.

Cohen’s Kappa: Cohen’s kappa is a measure of how much better than chance a model is at assigning the correct class to an observation. It range from -1 to 1, with higher being better. 0 indicates that the model is at chance levels (i.e. you could do as well just by randomly guessing). (Note that there are some people who will strongly advise against using Cohen’s Kappa.)

Informedness (aka Powers’ Kappa): Informedness tells us how likely we are to make an informed decision rather than a random guess. It is the true positive rate (aka recall) plus the true negative rate, minus 1. Like precision, recall and F1, it’s calculated on a class-by-class basis but we can calculate it for a multiclass classification model by taking the (geometric) mean across all of the classes. It ranges from -1 to 1, with 1 being a model that always makes correct predictions, 0 being a model that makes predictions that are no different than random guesses and -1 being a model that always makes incorrect predictions.

Packages for analysis

For R, the Metrics package and caret package both have implementations of these model metrics, and you’ll often find functions for evaluating more specialized models in the packages that contain the models themselves. In Python, you can find implementations of many of these measurements in the scikit-learn module.

Also, it’s worth noting that any single-value metric can only tell you part of the story about a model. It’s important to consider things besides just accuracy when selecting or training the best model for your needs.

Got other tips and tricks for measuring model error? Did I leave out one of your faves? Feel free to share in the comments. 🙂

Like this:

One of the things I really enjoy about my current job is chatting with other data science folks. Almost inevitably in the course of these conversations, the old “Python vs. R” debate comes up.

For those of you who aren’t familiar, Python and R are both programming languages often used by data scientists and other folks who work with data. Python is a general-purpose programming language (originally designed as a teaching language) that has some popular packages used for data analysis. R is a computer language specifically designed for doing statistics and visualization. They’re both useful languages, but R is much more specialized.

I use both Python & R, but I tend to prefer R for data analysis and vitalization. I also love kitchen gadgets. (I own and routinely use a melon baller, albeit only very rarely for actually balling melon.) My hypothesis is that my preference for R and love of kitchen gadgets share the same underlying cause: I really like specialized tools.

I was curious to see if there was a similar relationship for other people, so I reached out to my Twitter followers with a simple two-question poll:

185 people filled out the poll (if you were one of them, thanks!). Unfortunately for my hypothesis, a quick analysis of the results revealed no evidence that there was any relationship between whether someone prefers Python or R and if they like kitchen gadgets. You can check out the big ol’ null result for yourself:

Regardless of how poorly this experiment illustrates my point, however, it still stands: R is a specialized tool, while Python is general purpose one.

I like to think of R as a bread knife and Python as a pocket knife. It’s much easier to slice bread with a bread knife, but sometimes it’s more convenient to use a pocket knife if you already have it to hand.

This blog post is a little different from my usual stuff. It’s based on a talk I gave yesterday at the first annual Data Institute Conference. As a result, it’s aimed at a slightly more technical audience than my usual stuff, but I hope I’ve done an ok job keeping it accessible. Feel free to drop me a comment if you have any questions or found anything confusing and I’ll be sure to help you out.

You can play with the code yourself by forking this notebook on Kaggle (you don’t even have to download or install anything :).

To explore working with multilingual data, let’s look a real-life dataset of user-generated text. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages. It’s a realistic cross-section of the type of linguistic diversity you’ll see in a large text dataset.

# read in our datatweetsData=pd.read_csv("../input/all_annotated.tsv",sep="\t")# check out some of our tweetstweetsData['Tweet'][0:5]

Maybe you’ve got a deadline coming up fast, or maybe you didn’t get a chance to actually look at some of your text data and just decide to treat it as if it were English. What could go wrong?

To find out, let’s use Spacy to tokenize all our tweets and take a look at the longest tokens in our data.

Spacy is an open-source NLP library that is much faster than the Natural Language Toolkit, although it does not have as many tasks implemented. You can find more information in the Spacy documentation.

The five longest tokens are entire tweets, four produced by an art bot that tweets hashes of Unix timestamps and one that’s just the HTML version of “<3” tweeted a bunch of times. In other words: normal Twitter weirdness. This is actual noise in the data and can be safely discarded without hurting downstream tasks, like sentiment analysis or topic modeling.

The next five longest tokens are also whole tweets which have been identified as single tokens. In this case, though, they were produced by humans!

The tokenizer (which assumes it will be given mainly English data) fails to correct tokenize these tweets because it’s looking for spaces. These tweets are in Japanese, though, and like many Asian languages (including all varieties of Chinese, Korean and Thai) they don’t actually use spaces between words.

In case you’re curious, “、” and “。” are single characters and don’t contain spaces! They are, respectively, the ideographic comma and ideographic full stop, and are part of a very long list of line breaking characters associated with specific orthographic systems.

In order to correctly tokenize Japanese, you’ll need to use a language-specific tokenizer.

The takeaway: if you ignore multiple languages, you’ll end up violating the assumptions behind major out-of-the-box NLP tools¶

Probably the least time-intensive way to do this is by attempting to automatically identify the language that each Tweet is written in. A BIG grain of salt here: automatic language identifiers are very error prone, especially on very short texts. Let’s check out two of them.

LangID does…alright, with three out of five tweets identified correctly. While it’s pretty good at identifying English, the first tweet was identified as Azerbaijani and the second tweet was labeled as Malay, which is very wrong (not even in the same language family as Turkish).

Let’s look at another algorithm, TextCat, which is based on character-level N-Grams.

# N-Gram-Based Text Categorizationtc=TextCat()# try to identify the languages of the first five tweets againtweetsData['Tweet'][0:5].apply(tc.guess_language)

0 tur
1 ind
2 bre
3 eng
4 eng
Name: Tweet, dtype: object

TextCat also only got three out of the five correct. Oddly, it identifier “bed” as Breton. To be fair, “bed” is the Breton word for “world”, but it’s still a bit odd.

The takeaway: Automatic language identification, especially on very short texts, is very error prone. (I’d recommend using multiple language identifiers & taking the majority vote.)¶

Even if language identification were very accurate, how much data would be just be throwing away if we only looked at data we were fairly sure was English?

Note: I’m only going to LangID here for time reasons, but given the high error rate I’d recommend using multiple language identification algorithms.

# get the language id for each textids_langid=tweetsData['Tweet'].apply(langid.classify)# get just the language labellangs=ids_langid.apply(lambdatuple:tuple[0])# how many unique language labels were applied?print("Number of tagged languages (estimated):")print(len(langs.unique()))# percent of the total dataset in Englishprint("Percent of data in English (estimated):")print((sum(langs=="en")/len(langs))*100)

Number of tagged languages (estimated):
95
Percent of data in English (estimated):
40.963625976

Only 40% of our data has been tagged as English by LangId. If we throw the rest of it, we’re going to lose more than half of our dataset! Especially if this is data you spent a lot of time and money collecting, that seems downright wasteful. (Plus, it might skew our analysis.)

So if 40% of our data is in English, what is the other 60% made up of? Let’s check out the distribution data across languages in our dataset.

# convert our list of languages to a dataframelangs_df=pd.DataFrame(langs)# count the number of times we see each languagelangs_count=langs_df.Tweet.value_counts()# horrible-looking barplot (I would suggest using R for visualization)langs_count.plot.bar(figsize=(20,10),fontsize=20)

There’s a really long tail on our dataset; most that were identified in our dataset were only identified a few times. This means that we can get a lot of mileage out of including just a few more popular languages in our analysis. How much will we gain, exactly?

print("Languages with more than 400 tweets in our dataset:")print(langs_count[langs_count>400])print("")print("Percent of our dataset in these languages:")print((sum(langs_count[langs_count>400])/len(langs))*100)

This time, the longest tokens are Spanish-language hashtags. This is exactly the sort of thing we’d expect to see! From here, we can use this tokenized dataset to feed into other downstream like sentiment analysis.

Of course, it would be impractical to do this for every single language in our dataset, even if we could be sure that they were all identified correctly. You’re probably going to have to accept that you probably won’t be able to consider every language in your dataset unless you can commit a lot of time. But including any additional language will enrich your analysis!

The takeaway: It doesn’t have to be onerous to incorporate multiple languages in your analysis pipeline!¶

As we saw, this option will result in violating a lot of the assumptions built into NLP tools (e.g. there are spaces between words). If you do this, you’ll end up with a lot of noise and headaches as you try to move through your analysis pipeline.

Option 2: Only look at English

In this dataset, only looking at English would have led to us throwing away over half of our data. Especailly as NLP tools are developed and made avaliable for more and more languages, there’s less reason to stick to English-only NLP.

Option 3: Seperate your data by language & analyze them independently

This does take a little more work than the other options… but not that much more, especially for languages that already have resources avalialbe for them.

Like this:

In the course of my day-to-day work on Kaggle’s public data platform, I’ve learned a lot about the ecosystem of language data on the web (or at least the portions of it that have been annotated in English). For example, I’ve noticed a weird disconnect between European and American data repositories resources that I’m pretty sure has its roots in historical and disciplinary divisions.

I’ve also found a lot of great resources, though! At some point, I started keeping notes on interesting data repositories and link aggregators. I finally got around to tidying up and annotating my list of resources, and I figured that it would a useful thing to share with everyone. So, without further ado, here’s an (incomplete) list of some places to find language resources on the web:

The Linguistic Data Consortium is an international non-profit that offers archival hosting of datasets. The data offered by them is high quality and usually not free (although they offer data grants for students).

Kaggle’s public data platform has a lot of language/NLP datasets available on it, many not in English. You can also do data analysis on Kaggle (with R or Python) without having to download anything or set up a local environment.

Like a digital object identifier (DOI) for language resources. Not the best search (only looks at the title) but if you have a specific phrase you’re looking for it can be a good way to discover new resources.

If you’ve been following my blog for a while, you may remember that last year I found that YouTube’s automatic captions didn’t work as well for some dialects, or for women. The effects I found were pretty robust, but I wanted to replicate them for a couple of reasons:

I only looked at one system, YouTube’s automatic captions, and even that was over a period of several years instead of at just one point in time. I controlled for time-of-upload in my statistical models, but it wasn’t the fairest system evaluation.

I didn’t control for the audio quality, and since speech recognition is pretty sensitive to things like background noise and microphone quality, that could have had an effect.

Speech Data

I used speech data from four varieties: the South (speakers from Alabama), the Northern Cities (Michigan), California (California) and General American. “General American” is the sort of news-caster style of speech that a lot of people consider unaccented–even though it’s just as much an accent as any of the others! You can hear a sample here.

For each variety, I did an acoustic analysis to make sure that speakers I’d selected actually did use the variety I thought they should, and they all did.

Systems

For the YouTube captions, I just uploaded the speech files to YouTube as videos and then downloaded the subtitles. (I would have used the API instead, but when I was doing this analysis there was no Python Google Speech API, even though very thorough documentation had already been released.)

Bing’s speech API was a little more complex. For this one, my co-author built a custom Android application that sent the files to the API & requested a long-form transcript back. For some reason, a lot of our sound files were returned as only partial transcriptions. My theory is that there is a running confidence function for the accuracy of the transcription, and once the overall confidence drops below a certain threshold, you get back whatever was transcribed up to there. I don’t know if that’s the case, though, since I don’t have access to their source code. Whatever the reason, the Bing transcriptions were less accurate overall than the YouTube transcriptions, even when we account for the fact that fewer words were returned.

Results

OK, now to the results. Let’s start with dialect area. As you might be able to tell from the graphs below, there were pretty big differences between the two systems we looked at. In general, there was more variation in the word error rate for Bing and overall the error rate tended to be a bit higher (although that could be due to the incomplete transcriptions we mentioned above). YouTube’s captions were generally more accurate and more consistent. That said, both systems had different error rates across dialects, with the lowest average error rates for General American English.

Differences in Word Error Rate (WER) by dialect were not robust enough to be significant for Bing (under a one way ANOVA) (F[3, 32] = 1.6, p = 0.21), but they were for YouTube’s automatic captions (F[3, 35] = 3.45,p < 0.05). Both systems had the lowest average WER for General American.

Now, let’s turn to gender. If you read my earlier work, you’ll know that I previously found that YouTube’s automatic captions were more accurate for men and less accurate for women. This time, with carefully recorded speech samples, I found no robust difference in accuracy by gender in either system. Which is great! In addition, the unreliable trends for each system pointed in opposite ways; Bing had a lower WER for male speakers, while YouTube had a lower WER for female speakers.

So why did I find an effect last time? My (untested) hypothesis is that there was a difference in the signal to noise ratio for male and female speakers in the user-uploaded files. Since women are (on average) smaller and thus (on average) slightly quieter when they speak, it’s possible that their speech was more easily masked by background noises, like fans or traffic. These files were all recorded in a quiet place, however, which may help to explain the lack of difference between genders.

Finally, what about race? For this part of the analysis, I excluded General American speakers, since they did not report their race. I also excluded the single Native American speaker. Even with fewer speakers, and thus reduced power, the differences between races were still robust enough to be significant for YouTube’s automatic captions and Bing followed the same trend. Both systems were most accurate for Caucasian speakers.

As with dialect, differences in WER between races were not significant for Bing (F[4, 31] = 1.21, p = 0.36), but were significant for YouTube’s automatic captions (F[4, 34] = 2.86,p< 0.05). Both systems were most accurate for Caucasian speakers.

While I was happy to find no difference in performance by gender, the fact that both systems made more errors on non-Caucasian and non-General-American speaking talkers is deeply concerning. Regional varieties of American English and African American English are both consistent and well-documented. There is nothing intrinsic to these varieties that make them less easy to recognize. The fact that they are recognized with more errors is most likely due to bias in the training data. (In fact, Mozilla is currently collecting diverse speech samples for an open corpus of training data–you can help them out yourself.)

A recent paper proposes using automatic speech recognition to automatically score a clinical test used to diagnose, among other things, speech impairment, ADHD, dementia and schizophrenia.

Voice recognition is increasingly being incorporated into the hiring process. One example is HireVue, a software product designed to pre-screen candidates before they talk to a recruiter. A direct quote from the HireVue CEO when asked whether it might make a mistake on a candidate: “the algorithm is always right. It would be a ‘Yes’ or a ‘No.’” (Yikes!)

Every automatic speech recognition system makes errors. I don’t think that’s going to change (certainly not in my lifetime). But I do think we can get to the point where those error don’t disproportionately affect already-marginalized people. And if we keep using automatic speech recognition into high-stakes situations it’s vital that we get to that point quickly and, in the meantime, stay aware of these biases.

If you’re interested in the long version, you can check out the published paper here.

If you’ve read some of my other posts on sociolinguistics, you may remember that the one of its central ideas is that certain types of language usage pattern together with aspects of people’s social identities. In the US, for example, calling a group of people “yinz” is associated with being from Pittsburgh. Or in Spanish, replacing certain “s” sounds with “th” sounds is associated with being from northern or central Spain. When a particular linguistic form is associated with a specific part of someone’s social identity, we call that a “sociolinguistic variable”

There’s been a lot of work on the type of sociolinguistic variables people use when they’re speaking, but there’s been less work on what people do when they’re writing. And this does make a certain amount of sense: many sociolinguistic variables are either 1) something people aren’t aware they’re doing or 2) something that they’re aware they’re doing but might not consider “proper”. As a result, they tend not to show up in formal writing.

This is where the computational linguistics part comes in; people do a lot of informal writing on computers, especially on the internet. In fact, I’d wager that humans are producing more text now than at any other point in history, and a lot of it is produced in public places. That lets us look for sociolinguistics variables in writing in a way that wasn’t really possible before.

Which is a whole lot of background to be able to say: I’m looking at how punctuation and capitalization pattern with political affiliation on Twitter.

Political affiliation is something that othersociolinguists have definitely looked at. It’s also something that’s very, very noticeable on Twitter these days. This is actually a boon to this type of research. One of the hard things about doing research on Twitter is that you don’t always necessarily know someone’s social identity. And if you use a linguistic feature to try to figure out their identity when what you’re interested in is linguistic features, you quickly end up with the problem of circular logic.

Accounts which are politically active, however, will often explicitly state their political affiliation in their Twitter bio. And I used that information to get tweets from people I was very sure had a specific political affiliation.

For this project, I looked at people who use the hashtags #MAGA and #theResistance in their Twitter bios. The former is an initialism for “Make America Great Again” and is used by politically conservative folks who support President Trump. The latter is used by political liberal folks who are explicitly opposed to President Trump. These two groups not only have different political identities, but also are directly opposed to each other. This means there’s good reason to believe that they will use language in different ways that reflect that identity.

But what about the linguistic half of the equation? Punctuation and capitalization are especially interesting to me because they seem to be capturing some of the same information we might find in prosody or intonation in spoken language. Things like YELLING or…pausing….or… uncertainty? They’re also much, much easier to measure punctuation than intonation, which is notoriously difficult and time-consuming to annotate. At the same time, I have good evidence that how you use punctuation and capitalization has some social meaning. Check out this tweet, for example:

As this tweet shows, putting a capital letter at the beginning of a tweet is anything but “aloof and uninterested yet woke and humorous”.

So, if punctuation and capitalization are doing something socially, is part of what they’re doing expressing political affiliation?

That’s what I looked into. I grabbed up to 100 tweets each from accounts which used either #MAGA or #theResistance in their Twitter bios. Then I looked at how much punctuation and capitalization users from these two groups used in their tweets.

Punctuation

First, I looked at all punctuation marks. I did find that, on average, liberal users tended to use less punctuation. But when I took a closer look at the data, an interesting pattern emerged. In both the liberal and conservative groups, there were two clusters of users: those who used a lot of punctuation and those who used almost none.

Politically liberal users on average tended to use less punctuation than politically conservative users, but in both groups there’s really two sets of users: those who use a lot of punctuation and those who use basically none. There just happen to be more of the latter in #theResistance.

What gives rise to these two clusters? I honestly don’t know, but I do have a hypothesis. I think that there’s probably a second social variable in this data that I wasn’t able to control for. It seems likely that the user’s age might have something to do with it, or their education level, or even whether they use thier Twitter account for professional or personal communication.

Capitalization

My intuition that there’s a second latent variable at work in this data is even stronger given the results for the amount of capitalization folks used. Conservative users tended to use more capitalization than the average liberal user, but there was a really strong bi-modal distribution for the liberal accounts.

Again, we see that conservative accounts use more of the marker (in this case capitalization), but that there’s a strong bi-modal distribution in the liberal users’ data.

What’s more, the liberal accounts that used a lot of punctuation also tended to use a lot of capitalization. Since these features are both ones that I associate with very “proper” usage (things like always starting a tweet with a capital letter, and ending it with a period) this seems to suggest that some liberal accounts are very standardized in their use of language, while others reject at least some of those standards.

So what’s the answer the question I posed in the title? Can capitalization or punctuation reveal political affiliation? For now, I’m going to go with a solid “maybe”. Users who use very little capitalization and punctuation are more likely to be liberal… but so are users who use a lot of both. And, while I’m on the subject of caveats, keep in mind that I was only looking at very politically active accounts who discuss thier politics in their user bios. These observations probably don’t apply to all Twitter accounts (and certainly not across different languages).

If you’re interested in reading more, you can check out the fancy-pants versions of this research here and here. And I definitely intend to consider looking at this; I’ll keep y’all posted on my findings. For now, however, off to find me a Nanimo bar!