What computational linguists actually do all day: The debugging edition


Tell someone you’re a computational linguist, and the next question is almost always this: so, how many languages do you speak? This annoys the shit out of us, in the same way that it might annoy a public health worker if you asked them how many stages of syphilis they have. (There are four. When I was a squid (military slang for “sailor”), one of our cardiologists lost her cool and threw a scalpel. It stuck in one of my mates’ hands. We already knew that the patient had the primary, secondary, and tertiary stages of syphilis, so my buddy was one unhappy boy…)

Being asked “how many languages do you speak?” annoys us because it reflects a total absence of knowledge about what we devote our professional lives to. (This is obviously a little arrogant–why should anyone else bother to find out about what we devote our professional lives to? That’s our problem, right? Nonetheless: the millionth time that you get asked, it’s annoying.) It’s actually easier to explain what linguistics is in French than it is in English, because French has two separate words for things that are both covered by the word language in English:

une langue is a particular language, such as French, or English, or Low Dutch.

le langage is language as a system, as a concept.

No, I did not just make up “tone-bearing unit.”

Linguists study the second, not the first. People who call themselves linguists might specialize in vowels, or in words like “the,” or in how people use language both to segregate themselves and to segregate others. Whatever it is that you do, you’re basing it on data, and the data comes from actual languages, so you might work with any number of them–personally, I wrote a book on a language spoken by about 30,000 people in what is now South Sudan. The point of that work, though, is to investigate broader questions about langage, more so than to speak another language–that’s a very different thing. I can tell you a hell of a lot about the finite state automata that describe tone/tone-bearing-unit mappings in that language, but can’t do anything in it beyond exchange polite greetings (and one very impolite leave-taking used only amongst males of the same age group).

So, if you’re not spending your days sitting around memorizing vocabulary items in three different regional variants of Upper Sorbian, what does a linguist actually do all day? Here’s a typical morning. I was trying to do something with trigrams (3-word sequences–approximately the longest sequence of words that you can include in a statistical model of language before it stops doing what you want it to do), when I ran into this:

Fixed that one, and then there was a problem with my x-ray reports (my speciality is biomedical languages)…

Fixed that one, and then…

…and your guess may well be better than mine on that one. God help you if you run into this kind of thing, though…

Source: me.

…because that message about not having some number of elements (a) usually takes forever to figure out, and then (b) once you do figure it out, reflects some kind of problem with your data that is going to give you a lot of headaches before you get it fixed.
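For the curious, the trigrams mentioned above are easy to sketch in Python. This is a minimal illustration, not the code behind the errors described here: the function name trigrams, the whitespace tokenization, and the example sentence are all assumptions made for the sake of the example.

```python
from collections import Counter


def trigrams(tokens):
    """Return every run of three consecutive tokens as a tuple."""
    # zip stops at the shortest input, so short lists simply yield nothing.
    return list(zip(tokens, tokens[1:], tokens[2:]))


# Hypothetical example sentence, tokenized naively on whitespace.
tokens = "the patient had the primary stage".split()
counts = Counter(trigrams(tokens))

for gram, n in counts.most_common():
    print(" ".join(gram), n)
```

A list with fewer than three tokens produces no trigrams at all, which is exactly the kind of edge case that tends to surface later as a confusing error message.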

I spend a lot of my day looking at things like this:

Source: me.

…which is a bunch of 0s and 1s describing the relationship between word frequency and word rank, plus what goes wrong when your data gets created on an MS-DOS machine, which I will have to fix before I can actually do anything with said data (see the English notes below for what said data means); or this…

Source: me.

…which tells me some things about the effects of “minor” preprocessing differences on type/token ratios–they’re not actually so minor; or this…

…which tells me that either there are some errors in that data, or there is an enormous amount of variability between the official terminology of the field and the way that said terminology actually shows up in the scientific literature. (See the leftmost blob–it indicates that there are plenty of cases of one-word terms that show up as more than 5 words in actual articles. That is certainly possible–disease in which abnormal cells divide without control and can invade nearby tissues is 13 words that together correspond to the single-word term cancer—but, I was surprised to see just how frequent those large discrepancies in lengths were. In my professional life, I love surprises, but they also suggest that you’d better consider the possibility that there are problems with the data.)
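To make the type/token point concrete, here is a small sketch in Python. The type/token ratio is the number of distinct words (types) divided by the total number of words (tokens). The text, the function name type_token_ratio, and the two preprocessing choices are assumptions for illustration, not the actual data or pipeline behind the figures in this post.

```python
import re


def type_token_ratio(tokens):
    """Number of distinct tokens (types) divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Hypothetical text; not from the data discussed in this post.
text = "The cat saw the dog. The dog saw the cat!"

# Choice 1: split on whitespace, keeping case and attached punctuation.
raw = text.split()

# Choice 2: lowercase everything and keep only alphabetic runs.
normalized = re.findall(r"[a-z]+", text.lower())

print(type_token_ratio(raw))
print(type_token_ratio(normalized))
```

On this toy text, the first choice counts 7 types in 10 tokens (ratio 0.7), because "The" and "the" or "dog." and "dog" are treated as different words; the second counts 4 types in 10 tokens (ratio 0.4). Same text, "minor" preprocessing difference, very different number.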

So, yeah: it’s not like I can’t get my hair cut in Japanese, or explain how to do post-surgical hand therapy in Spanish, or piss off a con artist in Turkish (a story for another time)–but, none of those have anything to do with my professional life as a computational linguist. That’s all about computing, which means computers, and I hate computers. Ironic, hein? Life is fucking weird, and I like it that way.

English notes

I think this is Queneau, but couldn’t swear to it. Source: it’s all over the place.

said: a shorter way of saying “the aforementioned.” Both of these are characteristic of written language, more so than of spoken language. Even in writing, though, it’s pretty bizarre if you’re not a native speaker, which is why I picked it to talk about today. A French equivalent would be ledit/ladite/lesdites (not sure about that last one–Phil dAnge?), which I have a soft spot for ’cause I learned it in Queneau’s Exercices de style.

Trying to think of helpful ways to recognize this bizarre usage of said, I went looking for examples of said whose part of speech is adjectival. Here are some of the things that I found:

As such, any dispute that you may have on goods purchased or services availed of should be raised directly with said merchant/s.

A seemingly endless shopping list to conquer, a shrinking budget with which to do said shopping ~ and let’s face it: our businesses don’t run themselves while we’re visiting relatives.

This is a monumental pain in the ass – you don’t exactly trip over Notary Publics in today’s day and age – and I can only assume came from said company having a problem with identity once sometime in the last twelve years, and the president saying “fuck it.”

How it appears in the post:

…what goes wrong when your data gets created on an MS-DOS machine, which I will have to fix before I can actually do anything with said data;…

Either there are some errors in that data, or there is an enormous amount of variability between the official terminology of the field and the way that said terminology actually shows up in the scientific literature.

debugging: A technical term in software programming that refers to finding problems in your program. I used it in the title of today’s post because most of the illustrations that I gave of what I do all day are of irritating problems of one sort or another that I (really did) have to track down in the course of my day. They don’t tell you in school that tracking down such things is literally about 80% of what any programmer spends their time doing. Of course, any problem in a computer program is a problem that you created, so you can get irritated about them, but you most certainly cannot take your irritation out on anyone else…

8 thoughts on “What computational linguists actually do all day: The debugging edition”

About halfway through that, my eyes started spinning in different directions and I had to ask the cat to read the rest and explain it to me. It was an experience that left me grateful that you do what you do so that I don’t have to.