Zipf's Law
https://zipfslaw.org
A blog about the implications of the statistical properties of language
Sun, 01 Sep 2019 10:33:19 +0000
Spotting the wild handler: the clipboard scam
https://zipfslaw.org/2019/09/01/spotting-the-wild-handler-the-clipboard-scam/
Sun, 01 Sep 2019 07:36:22 +0000

There’s this bizarre scam that you see in any heavily touristy area of Paris. Young women pretending to be deaf wander around with clipboards and try to get you to sign what is allegedly a petition. When they get someone to sign, one of two things happens:

I have always heard that there is a “handler” hanging around, watching them in case of trouble, but I’ve never been able to spot one–until today.

My task was made ridiculously simple by the fact that the little team gathered for a smoke break–right next to me. I was at the Place St-Michel, sitting on the edge of the fountain enjoying my own fine American tobacco product, when four women and a guy sat down next to me and lit up. The women were holding clipboards–score!

My level of interest in having this guy figure out that I was filming him was low, and consequently, I didn’t get a great video. What you’re going to see in the following film: the girls have just successfully scored, and their mark has walked away. The handler wandered over unobtrusively while they were taking her money, and then walked away–at the beginning of the clip, you see him (gray t-shirt, with a courier bag over his shoulder) walking away “stage right.” Then he takes up a position leaning–not very unobtrusively at all–against a lamp post.

As they smoked their cigarettes, the women chatted amongst themselves–clearly not deaf. The guy pretty much ignored them, chatting on a cell phone instead. In which language? I don’t know. I was listening for Bulgarian, Rom, or Romanian–but, what I heard sounded more like a dialect of Arabic. A mystery, since this is stereotypically a scam perpetrated by Roma, and personally, I don’t know of any scam in Paris associated with Arabs. (There is a whole ecosystem of scams in the world, with different ethnic groups dominating specific sectors of that ecosystem in Paris.)

I have a lot of respect for the guys that you see all over Paris hustling to sell souvenirs, bottles of water, whatever–they’re just trying to make a living like everyone else, exchanging goods for cash. I have a fair amount of respect for an inventive beggar, too–begging can be much harder and more creative work than you might imagine, and there are some really good ones. I have zero respect for people who rip other people off, who scam them; I have less than zero respect for people who scam others not by manipulating their greediness (e.g. with a get-rich-quick scheme), but by taking advantage of their kindness. That, I think, approaches the lowest of the low: fuck them.

No, the French do not hate Americans
https://zipfslaw.org/2019/08/25/no-the-french-do-not-hate-americans/
Sun, 25 Aug 2019 11:48:04 +0000

It’s the weekend of the celebration of the liberation of Paris from the Nazis. I step out on my balcony for a cigarette, and I see a parade of old World War II military vehicles roll down l’Avenue de la Motte-Picquet. When the American vehicles come, the onlookers cheer and clap. The French vehicles go by unapplauded.

It’s August in Paris, when there is dancing on the banks of the Seine. I walk up to a woman and ask her to dance. She walks into my arms and asks Where are you from? Later, I ask her how she knew so immediately that I wasn’t French–in France, asking a French person where they’re from is rude, although it’s (mostly) fine for non-French. (More on this below, in the French notes.) You hesitated a bit before a word, she said. Then she thought for a moment more: …and you walked up to me with this directness and openness that I admire in Americans.

It’s my first time in France, and I don’t speak French. Someone is telling me where to find a specific hotel in Normandy, and says–in English, obviously–That’s where you saved our fucking asses–twice.

In France, you do not ask a French person where they’re from (vous venez d’où ?). It’s rude, because the implication is that you don’t really belong in France. Rather, you ask What region are you from–vous venez de quelle région ? Point of pride: when I first started spending time in France as a francophone, people would ask me So, you’re an American? Then, they progressed to Where are you from?, or occasionally So, you’re British/Belgian/German/Swiss? Now, after 5 years of constant and intensive study of the langue de Molière, I very, very occasionally get what region are you from? Always warms my heart.

Languages that give you a sore throat
https://zipfslaw.org/2019/08/17/kaqchikel-ejective-consonants/
Sat, 17 Aug 2019 22:59:46 +0000

I’m walking down the street, and in one hand I have a shopping bag containing books that I just paid a week’s worth of grocery money for. In the other hand: a shopping bag containing the most disgusting canned food available, ’cause… see the preceding sentence about books that I just paid a week’s worth of grocery money for. I realize that my mouth hurts. Then I realize why: as I walk down the street, I’m thinking in French–but, I haven’t spoken it much lately.

It’s no secret that speaking a language that you don’t typically speak can make your mouth hurt. I speak Spanish for exactly one week a year, and it always makes my cheeks sore: the kinematics of Spanish are quite different from English and French (my languages of daily life), and the difference is enough to wear out my muscles. If I haven’t spoken French much for a week or two, my lips get tired: the French u (International Phonetic Alphabet [y]) requires more rounding than any sound in English or Spanish. But, Kaqchikel: Kaqchikel is giving me a sore throat.

I spend one week a year volunteering with a group called Surgicorps in Guatemala, a country the size of Tennessee–with 23-25 different languages. 70% of the population is “indígena,” which in these parts (see the English notes below for what in these parts means) means Mayan Indian. There are 20-22 different Mayan languages spoken in Guatemala, plus Spanish and two other non-Mayan Indian languages. Kaqchikel is one of the four mayoritarias, or “big” Mayan languages, being spoken by around half a million people; in preparation for my week of volunteer work I just spent several hours a day for the preceding two weeks studying it in a local language school.

Part of what makes Kaqchikel sound the way that it does is its ejective consonants. Those are the “popping” sounds that you hear in the following YouTube video. Why they “pop:” because of the way that you make the air come out of your mouth when you make them. Most sounds of language are made with what is called a pulmonic egressive airstream mechanism. “Airstream mechanism” refers to the way that you make the air flow to make the sound. Egressive means that when you make the sound, the air flows outward; and pulmonic means that the flow of air is initiated in the lungs.

Ejective consonants are produced by what is known as a glottalic airstream mechanism. That means that the airflow is powered by closing the vocal folds (vocal cords in non-technical English). In the case of a glottalic egressive consonant, you put your tongue wherever it goes to make the sound in question, you close your vocal folds, and then you raise your larynx, carrying the closed glottis upwards. This increases the air pressure in the oral cavity, and when you open your mouth to release the sound, that elevated air pressure gives the consonant the characteristic ejective “pop.”

So… why the sore throat? From clamping my vocal folds shut all day while I (try to) speak Kaqchikel. Mind you, I already (a) smoke way too much, and (b) spend a lot of my waking hours speaking French, so my voice is already so low that making myself heard by an American without shouting is sometimes difficult.

One week a year I head south to Guatemala, where I do English/Spanish interpretation for Surgicorps, a wonderful group of surgeons, nurses, anesthesiologists, technicians, and therapists who provide free specialty surgical services to people who would not otherwise have access to them. We buy our own plane tickets and pay for our own hotel rooms. A donation from you to Surgicorps goes to taking care of our patients, and even a little bit helps—$250 pays all of the surgical expenses for one patient, $25 pays for a pack of instruments, and $10 buys all of the pain-killers that we hand out in a week. If you enjoy my posts from Guatemala, please consider a donation, large or small–just click here.

English notes

in these parts: in this geographical area. I’m just going to give you one example, in the hopes that you will take the time to watch the very powerful video embedded in the tweet.

I work near Pine Bluff and she is right. the number of shootings and killings there are ridiculous. I applaud her because Arkansas is full of racists Hicks and I can’t tell you have often the KKK still meets and regularly march in these parts with no reprecussions. https://t.co/Yr8HaGFQcF

How I used it in the post: I spend one week a year volunteering with a group called Surgicorps in Guatemala, a country the size of Tennessee–with 23-25 different languages. 70% of the population is “indígena,” which in these parts means Mayan Indian. “In these parts” refers back to “Guatemala.”

Once a year I spend a week in Guatemala with Surgicorps, a group of people who provide free surgery for people who have nowhere else to turn. We give up a week of vacation and buy our own plane tickets, and the surgeons, nurses, anesthesiologists, and therapists in our group give away their very valuable professional services to people who they will never see again and who in most cases don’t even share a language in which to say “thank you.” Please consider supporting our work with a donation. $250 pays the complete costs of surgery for one patient, $100 pays for four surgical packs, and $10 pays for all of the pain medications that we will hand out this week. If you enjoy these posts, please consider making a donation, no matter how small–your money will go a long way here in Guatemala.

The “613 pains” idea comes from the novel Everything is illuminated, by Jonathan Safran Foer, and its “613 sadnesses.” The book tells the story of a young man who goes back to the Ukraine to find the woman who saved his grandfather from the Nazis; 613 is a special number in Judaism, being the number of commandments in the Bible. The photograph is taken from the Surgicorps Facebook page. Hey–have I hit you up for a donation yet?

Applying to a graduate program means filling out a lot of paperwork–and writing a thing or two yourself. One of those things is called a personal statement, and there is a bit of an art to writing one. Here’s some advice for doing it.

The first thing to know about a personal statement is this: it’s not actually personal. Your goal in a “personal statement” is not to tell the admissions committee who you are “as a person,” but rather to take advantage of this opportunity to speak to them to show that you would be a good fit for their program.

What that means: you want the admissions committee member who is reading your statement to finish saying this to themself: oh–they could work with our faculty member Dr. Zipf [insert some actual faculty member of the institution in question, unless you’re applying to my institution]. (The pronoun themself is explained in the English notes below.)

How you lead them to that happy conclusion: don’t tell them, but show them. Here are some things that you can do:

State that you are interested in one or two specific areas of research of that department.

State that you became interested in the topic or topics while doing a research project on them…

…or, if you have not done research on that topic, then that you got interested in it/them while doing research on some other topic and coming across a paper on the topic by some member of the faculty of the department to which you are applying.

List some areas of specialization within that topic or some related topics that you would be interested in working on, where those specializations or related topics are actually areas of research that members of the department to which you are applying work within.

Why I say one or two: you very much want to avoid a situation where (a) only one person in the department works on a topic, and (b) you don’t know it, but that person is getting ready to retire/move to another institution/begin a three-year period as the Associate Dean for Reproducibility, or something. You avoid that situation by either (a) talking about a topic that two or more people in the department actually work on, or (b) talking about more than one topic.

Now, you may be asking yourself: what if I can’t find anyone in the department who works on my area of interest? The answer:

If you cannot find anyone in the department who works in your area of interest, then that department is not a good fit for you.

…and that’s exactly what the department wants to know. In fact, if you apply to a graduate school and they don’t accept you, it is entirely reasonable to assume until proven otherwise that they’re not rejecting you, but just don’t see their department as the right place for you.

Need to know how to ask for a letter of recommendation for graduate school?

This post is written on the basis of my time on the admissions committee of a medium-sized graduate program in computational biology. If you have other perspectives/opinions on the subject, please add them to the comments below!

English notes

When you get deep into the weeds of the English language, one of the things that you run into is dialectal variation in pronoun use. For example:

Dative pronouns in conjoined subject noun phrases: In the Pacific Northwest region of the United States, if you have a subject with two or more people joined by a conjunction (e.g. and or or), then the pronouns are in the dative form, not the subject form. For example, look at these contrasts:

I’m going to the store. (subject)

He’s going to the store. (subject)

Me and him are going to the store. (dative)

Him and me are going to the store. (dative)

Anaïs is going to the store. (subject)

They are going to the store. (subject)

Anaïs and them are going to the store. (dative)

Even in the Pacific Northwest, you don’t have to talk this way–it’s pretty regionally specific, and people will understand you just fine if you say he and I are going to the store. But, if you are in that part of the country, you have to be able to understand it.

Atypical reflexive pronouns: Other oddnesses have to do with the reflexive forms of pronouns. For example, in my dialect, the third-person plural forms they/them/their are used if you don’t know the gender of the referent. Straightforward enough–that usage goes back centuries in English. But: in a reflexive context (i.e. when the subject is doing something to itself or for itself), you get a variety of forms, depending on number:

(1) You want the admissions committee member who is reading your statement to finish saying this to themself: oh–they could work with our faculty member Dr. Zipf [insert some actual faculty member of the institution in question, unless you’re applying to my institution]. That is obscure enough that it does not even show up in Merriam-Webster’s online dictionary.

(2) My aunt and uncle bought themselves a new copy of the compact edition of the Oxford English Dictionary. This plural form is totally standard American English.

(3) My aunt and uncle each bought themselfs a new pair of sunglasses. …and that one, again, does not show up in Merriam-Webster.

This raises a question: how would someone who doesn’t speak a dialect like this say (1) and (3)? I’m pretty sure that in (3), they would say themselves. But, (1)? I don’t know another way of saying it–native speakers?

The picture at the top of this post is of Oxley Hall on the Ohio State University campus. I had the pleasure of getting a master’s degree in linguistics there in the 1990s. Mostly we hung out in the basement analyzing spectrograms, but we would occasionally sneak up into the tower. Fun.

French: A farmer in Picardy takes his pig to the vet. The vet says to him: c’est tatoué? The farmer says: ben sûr c’est à mwé!

English: What’s black and white and [rɛd] all over? A newspaper.

American Spanish: How is a cat like a priest? Ambos [kasan].

The French joke relies on a regional dialect where oi is at least sometimes pronounced wé rather than wa. The vet asks the farmer is it tattooed? in standard French, but the farmer understands it in the regional dialect as is it yours?, and answers of course it’s mine!

The English joke relies on the homophony between the color red and the past tense of the verb to read. This riddle puzzled the shit out of me when I was a small child, which in retrospect I should have realized meant that I was never going to be a very good linguist.

The Spanish joke relies on the American Spanish non-distinction between the pronunciation of z and s. (“American Spanish” means Spanish as spoken in the Americas, i.e. South, Central, and North America.) A cat caza (hunts), while a priest casa (marries). They’re written differently, and in Spain (and maybe some upper-class American dialects, but I can’t swear to it) are pronounced differently, but they’re pronounced the same in the Americas.

Sucking the joy out of language since 1989,

Beauregard Zipf

English notes

vet: This word can mean two things in American English:

veterinarian, as in the joke. Examples:

took my dog to the vet just to find out he’s sick af (af = “as fuck,” an adverb meaning “a lot”)

veteran, i.e. a former member of the armed forces. Examples:

She and other vets said there’s frustration that the President is quick to claim credit for successes and happy to bask in the reflection of the military’s luster but doesn’t follow through on tough issues.

Vets groups decry hatred, racism in wake of Charlottesville violence (Source: headline here. Charlottesville is a city in Virginia where the president of the United States of America defended a white supremacist rally at which an anti-racism protester was killed.)

The veteran’s voice is crucial to changing the hate rhetoric directed at Muslims. “When I served in the United States Marine Corps, I took an oath to the Constitution of the United States. There is a First Amendment, which respects religious tolerance and freedom of speech,” stated John Amidon, Vietnam vet and member of Veterans For Peace.

I just spent several frustrating hours trying to fix a bug in my code. In the end, the bug was purely a logic bug, and it was purely the product of poor variable-naming.

Code is the instructions that you write in a computer language, for a program to execute.

Here’s what happened. I’m writing the world’s simplest script–I just need to read in some files that contain values for features for individual files–or, to put it better: for individual papers that I want to classify.

A script is a kind of computer program, typically one that does a relatively simple task.

…and, with that, I think you can already guess what happened. I was opening files that contained features that I had extracted from other files, and I reused a variable name. Consequently, once my script reached some critical length, I could no longer keep track in my own head of the code that I was editing. So, my test cases found a simple bug, and in the process of fixing that bug, I got myself so confused that I was mixing up the “files” in the sense of “papers that I’m classifying” and the “files” in the sense of “files containing feature values from papers,” and the next thing you know, several hours have gone by.

A variable is something in a computer program whose value can be changed. It’s the opposite of a constant, which is something whose value cannot be changed. For example, the number 3 is a constant–its value will always be 3. On the other hand, a computer program might contain something called length_of_word, intended to store the length of some word that you’re looking at, and that length could be anything, in principle. (Really? How about 0? Or a negative number? This kind of unstated assumption is one way that computer programs can go wrong.)
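
The parenthetical about 0 and negative numbers can be made concrete with a minimal Python sketch. The names `MIN_WORD_LENGTH` and `checked_length_of_word` are invented for illustration and are not from the script described in this post; the point is only that the unstated assumption (a word has at least one character) becomes a stated one that the program checks.

```python
MIN_WORD_LENGTH = 1  # a constant by convention: nothing below reassigns it


def checked_length_of_word(word):
    """Return the length of word, rejecting input that breaks the assumption."""
    length_of_word = len(word)  # a variable: its value depends on the input
    if length_of_word < MIN_WORD_LENGTH:
        raise ValueError("expected a word with at least one character")
    return length_of_word


print(checked_length_of_word("zombie"))  # prints 6
```

Passing an empty string raises a `ValueError` instead of quietly storing a length of 0 that later code might not expect.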

This is one of those things that gets fixed by (1) printing out my code on actual paper, noticing the same variable name in two clearly-marked-off-as-different sections of the code, and thinking “Zipf, you might be even more stupid than you knew…”; (2) sitting in the Philadelphia sun with a pack of cigarettes and a quality zombie novel for a while (Déchirés, by Peter Stenson–the zombie apocalypse comes and the only people who survive are meth addicts–I think you can come to your own conclusion about the metaphor a lot quicker than I fixed my code); and then (3) you go back and look at the code and you see immediately how you managed to confuse the heck out of yourself.

My error here was in reusing my variable to store two different kinds of information. This is a classic error in computer programming. I either didn’t notice that I was doing it when I moved from the first part of the program to the second part, or more likely, noticed it but didn’t think that it would be a problem because the script was relatively short and simple. The problem with variable reuse is not for the program itself; rather, the problem is for the programmer, because variable reuse is a great way to confuse yourself. That’s exactly what I did–bad Zipf, bad!
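
Here is a hedged sketch of that kind of reuse, in Python. It is not the original script: the paper names, the feature-file names, and the feature values are hypothetical stand-ins, and the "files" live in a dictionary so that the example runs without touching the disk. Both functions run without error; the difference is only in how easy it is to keep track of what each name means.

```python
# Hypothetical stand-ins: two papers to classify, plus one feature file per paper.
papers = ["paper_a.txt", "paper_b.txt"]
feature_file_contents = {
    "paper_a.features": "0.1 0.9",
    "paper_b.features": "0.7 0.3",
}


def confusing_version():
    files = papers  # here "files" means the papers being classified...
    paper_count = len(files)
    files = sorted(feature_file_contents)  # ...and here it means feature files
    return paper_count, files


def clearer_version():
    # One name per kind of thing: paper_name versus feature_file_name.
    features_by_paper = {}
    for paper_name in papers:
        feature_file_name = paper_name.replace(".txt", ".features")
        raw_values = feature_file_contents[feature_file_name].split()
        features_by_paper[paper_name] = [float(value) for value in raw_values]
    return features_by_paper
```

In the second version, a reader who lands in the middle of the loop can tell from the name alone which kind of "file" is in hand.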

Happy Saturday from Penn Student Housing, where either the kid in D3 is going to stop throwing rotting chicken in the communal trash can or he’s going to wake up with it in his bed,

Zipf

I notice that I’ve been writing a lot of whiny posts about computational linguistics lately. In fact I LOVE my job, enough so that I am probably one of the happiest people you know–or don’t know. Want the English-language version of Déchirés? Here it is: Fiend. I read it three times in English before I read it in French, so it MUST be good, right?

Billet-doux: love letter
https://zipfslaw.org/2019/07/01/billet-doux-love-letter/
Mon, 01 Jul 2019 12:05:21 +0000

This is a love letter. It’s not to my grandmother, although it could be. My favorite memories of her: sitting together on her front porch in the morning, sharing a cup of coffee and a cigarette, talking about nothing–or just not talking at all.
Prévert in Paris, 1946. Photographer: unknown. Cat: unknown.

This is a love letter. It’s not to Jacques Prévert, although it could be. I’m usually up at daybreak, and sometimes as the sun peeks over the horizon I’ll go outside to have a smoke and read his Encore une fois sur le fleuve. I’ve read some of his poems so often that they form a sort of soundtrack in my head as I walk the streets. In his photographs, he looks like the uncle you always wanted–a face that you can tell is just barely hiding a smile, a cigarette in his hand–or just hanging from his lips.

This is a love letter. It’s not to my grandmother, although it could be. When she died, I found her long white evening gloves and her cigarette holder.

This is a love letter. It’s not to my grandfather, but it could be. One of my mother’s friends told me this about him: his apartment was nothing but books and cigarette smoke.

This is a love letter to cigarettes. Yeah, I know: they’re gonna kill me. Hell–if I didn’t smoke, I might live two years longer! Two years against some connection, any connection, with the French grandfather who had my mother when he was as old as I am now (very), and died before I was born. Two years against Jacques Prévert in my head when I walk the streets in Paris, or anywhere in the world, really. Two years against that memory of my grandmother, the warm Florida mornings, the ashtray that my father made for her in summer camp. Seems like I come out ahead on this one.

The picture at the top of this page is not my grandmother, but the American actress Carole Landis, photographed in 1946 for a Kislav glove ad. Photographer: unknown.

English notes:

To walk the streets: be careful with this one. It can mean walking nowhere in particular–not flâner, as it connotes a certain intensity and solitariness that is lacking in flâner. It can also mean living by prostitution–compare the noun streetwalker, a prostitute qui fait le trottoir. Yet another meaning: to be free after a time in prison.

How I used it in the post: I’ve read some of his poems so often that they form a sort of soundtrack in my head as I walk the streets.

With the “out of prison” meaning: Many are outraged that the convicted killer will be walking the streets after spending just two years in prison. (Source: the Farlex Free Dictionary.)

le billet-doux: an old term for a love letter. I understand that you can use it for comic effect. But, compared to la lettre d’amour, I like the sound of billet-doux much more. Doux: it just sounds…right. (Phil dAnge, can you comment?)

What computational linguists actually do all day: The recursion edition
https://zipfslaw.org/2019/06/23/what-computational-linguists-actually-do-all-day-the-recursion-edition/
Sun, 23 Jun 2019 14:35:51 +0000

I know, I know: computational linguistics sounds like the world’s most glamorous profession, right? You imagine a bunch of geeks in hip glasses sitting around talking about Sanskrit is-aorist verbs, playing a little foosball after a free sushi lunch in the Google cafeteria, and then writing code to translate Jacques Prévert into idiomatic American English with a little stock ticker in the upper-right corner of their screen so that they can watch the value of their vested options go up, and up, and up, and…

In reality, I’m sitting in the international student dormitory of a well-known East Coast American university. Yesterday was a good day, because the shitwad in room D2 left his dirty dishes in the sink for the full 48 hours that let me feel fine about throwing the reeking things in the trash can.

But, then I realized something: I can only get easy copyright releases for the book I’m writing for papers published in 2016 or later. That means that I need to do a serious analysis of what I’m citing in the book, which means…writing code (the computer language that makes up a program) to go through a bunch of citations to figure out what year they were published, in which conference or journal, etc., etc., etc.

…which wasn’t particularly difficult, but caused a little pinprick in my soul, ’cause I knew as I was writing it that it would mess up any time that I had a title with a curly-brace in it ({}), and practicing your profession shittily never feels good. For reasons that we need not go into, having curly-braces in the title of a work happens a hell of a lot more often than you might think, and fixing that little flaw would require writing something called a recursive function, which really shouldn’t be that complicated for a computational linguist (recursion is one of the fundamental properties of language (the picture at the top of this page is a humorous illustration of recursion (which is probably oxymoronic (and as you might have guessed, these embedded parentheticals are themselves an example of recursion (as is the second sentence of this post (an example, that is–not necessarily a humorous one (unlike the cartoon))))))), and yet still, is more than my little brain de pois chiche (garbanzo bean) can handle on a Sunday morning.
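
For readers who have never seen one, here is a minimal sketch of a recursive function in Python. The post does not say exactly what the function needed to do, so this is an assumption: it takes a title and removes balanced curly-braces, however deeply nested, while keeping the text inside them. The names `strip_braces` and `find_matching_brace` are made up for illustration.

```python
def find_matching_brace(text, open_index):
    """Return the index of the brace that closes the one at open_index."""
    depth = 0
    for index in range(open_index, len(text)):
        if text[index] == "{":
            depth += 1
        elif text[index] == "}":
            depth -= 1
            if depth == 0:
                return index
    raise ValueError("unbalanced curly-braces")


def strip_braces(text):
    """Remove balanced curly-braces, keeping the text inside them."""
    pieces = []
    index = 0
    while index < len(text):
        if text[index] == "{":
            close_index = find_matching_brace(text, index)
            # The recursive step: the inside of a braced span is handled
            # by calling the very same function on it.
            pieces.append(strip_braces(text[index + 1:close_index]))
            index = close_index + 1
        else:
            pieces.append(text[index])
            index += 1
    return "".join(pieces)


print(strip_braces("{A} study of {{BERT}} models"))  # prints: A study of BERT models
```

The function calls itself once per level of nesting, which is why it copes with braces inside braces without knowing in advance how deep they go.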

Then, in order to be able to see any actual output, I had to write code like the following:

That gave me the first thought I’d had all morning that was actually interesting, as I contemplated how hard I’m pretty sure that it would have been–how impossible I at least hope it would be, for the moment at any rate–for a computer to find and fix that particular bug.

Another half hour or so of work, and now I can actually see what I wanted to know, which is the venues where the works that I cite were published. This was useful, in that I noticed that one that should be heavily represented in my bibliography in fact barely figures there at all. But, what it meant was that I needed to Google hither and yon to find out how to search Google Scholar (we’re just getting more and more meta here all the time) by name of conference. Not particularly challenging; but, not particularly interesting, either.
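
The tallying step itself is small. A rough sketch in Python, assuming the citations have already been parsed into records with a venue and a year: the three records below are invented placeholders, not entries from the actual bibliography, and `CUTOFF_YEAR` just encodes the 2016 cutoff mentioned earlier in the post.

```python
from collections import Counter

CUTOFF_YEAR = 2016  # papers from this year on are the easy copyright releases

# Invented placeholder records, standing in for parsed citations.
citations = [
    {"title": "Placeholder paper 1", "venue": "Conference A", "year": 2015},
    {"title": "Placeholder paper 2", "venue": "Conference A", "year": 2017},
    {"title": "Placeholder paper 3", "venue": "Journal B", "year": 2018},
]

# How often does each venue appear among the cited works?
venue_counts = Counter(citation["venue"] for citation in citations)

# Which cited works fall on the easy side of the cutoff?
easy_releases = [c for c in citations if c["year"] >= CUTOFF_YEAR]

for venue, count in venue_counts.most_common():
    print(venue, count)
```

Sorting by count is what makes an under-represented venue stand out at the bottom of the list.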

This is a whiny post, right? Totally tongue in cheek, though. Actually, I have the incredible good luck to love what I do, and the book in question really is a labor of…a labor of love.

English notes

Something in this post that is perfectly fine English but that I probably would not have written if I didn’t spend a lot of time writing (poorly) in French these days:

I noticed that a publication venue that should be heavily represented in my bibliography in fact barely figures there at all.

An educated speaker of the langue de Molière will be aware that figurer sur une liste is perfectly natural (as far as I know) French. What I wrote is perfectly fine English, but I would suspect that it doesn’t occur very often, even in written academic or official English. Why did it pop out of my mouth (well…fingers) today? French-language interference, which is funny, ’cause in language teaching we often talk about first-language interference (carrying over aspects of the grammar of your native language, such that they fuck up your mastery of a foreign or second language), but I can’t recall ever running into the concept of second-language interference, and French is most definitely a second language for me, not my first. Go figure…

go figure is an expression that expresses surprise about something that you’ve just been talking about, or an assertion that you are about to make. How I used it in the post:

I can’t recall ever running into the concept of second-language interference, and French is most definitely a second language for me, not my first. Go figure…

I occasionally use this blog to try out materials for something that I will be publishing. This post is a casual version of something that will go into a book that I’m writing about…writing.

So, you’re going to do a data science project. Maybe you’re going to use natural language processing (processing: using a computer program to do something; natural language: human language, as opposed to computer languages) to analyze social media data because you want to find out how veterans feel about the medical care that they receive through the Veterans’ Administration. (Spoiler alert: a number of my buddies are vets, and they do indeed use the Veterans’ Administration health care system, and they both (a) are happy with it, and (b) recommend it to the rest of us.) Maybe you’re doing it as a project for a course; maybe you’re doing it as your first assignment at your high-paying brand-new data scientist job; maybe you’re planning to write a research paper for a journal on military health care. How do you go about doing it?

An excellent piece of advice when you’re trying to figure out how to do any research project: write out what you’re going to do, in prose, before you start doing it. As my colleague Graciela Gonzalez, of the Health Language Processing Laboratory at the University of Pennsylvania School of Medicine, puts it:

Most of us make some mistakes in the process of thinking through how we will test our hypothesis. The advantage of writing down what you’re going to do–the Methods section of a research paper, the design of your research project–before you do it is that when you see it on paper, spelled out explicitly and step by step, you will often notice the logical or procedural errors in what you were thinking, and then you won’t spend weeks making those errors before realizing that they were never going to get you where you wanted to go.

OK, so: you know that you’re going to write out your methods, very explicitly and in the order in which you will do them. But, how do you figure out what those methods should be?

An efficient way to go about this is to read research papers by other people who have done similar things. As you read them, you’re going to look for a general pattern–think of this as an example of the frameworks that we’ve talked about in other parts of this book. Returning to our example of using natural language processing to analyze social media data, you might go to PubMed/MEDLINE, the National Library of Medicine’s database of 27 million biomedical research articles, and search for papers that mention either natural language processing or text mining, and also have the words social media in the title or abstract. (That search would find a set of 190+ papers.)
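As a rough sketch, that search can be written out as a query string and turned into a request URL for NCBI’s E-utilities `esearch` endpoint. The exact query wording, the function names, and the `retmax` value are assumptions for illustration, not necessarily the search behind the numbers above. The code only builds the URL; it does not contact PubMed.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query() -> str:
    """Papers that mention NLP or text mining AND have 'social media' in title/abstract."""
    methods = '("natural language processing" OR "text mining")'
    topic = '"social media"[Title/Abstract]'
    return f"{methods} AND {topic}"


def build_esearch_url(term: str, retmax: int = 20) -> str:
    """Return a URL that asks PubMed for up to `retmax` matching article IDs."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = build_pubmed_query()
url = build_esearch_url(query)
print(query)
print(url)
```

Pasting the printed query into the PubMed search box should give the same kind of result list as the URL, which is handy for checking the query by eye before automating anything.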

The results of that search will include these three papers, which study problems similar to yours: they’re using natural language processing to find women talking about their pregnancy, people talking about adverse reactions to drugs, or people talking about abuse of prescription medications–not exactly what you need to do, but similar. You’ll see two steps that are carried out in all of them. I’ve highlighted the points where they’re mentioned in the abstracts of the three papers:

METHODS: Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined.

METHODS: One of our three data sets contains annotated sentences from clinical reports, and the two other data sets, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in attempts to boost classification accuracies.

METHODS: We collected Twitter user posts (tweets) associated with three commonly abused medications (Adderall®, oxycodone, and quetiapine). We manually annotated 6400 tweets mentioning these three medications and a control medication (metformin) that is not the subject of abuse due to its mechanism of action. We performed quantitative and qualitative analyses of the annotated data to determine whether posts on Twitter contain signals of prescription medication abuse. Finally, we designed an automatic supervised classification technique to distinguish posts containing signals of medication abuse from those that do not and assessed the utility of Twitter in investigating patterns of abuse over time.

Now we can abstract out the two steps that we found in all three papers:

1. The authors built a data set.

2. The authors used a technique called classification–a form of machine learning–to differentiate between the social media posts that did and did not talk about a person’s own pregnancy, or an adverse reaction to a medication, or abuse of prescription medications.

So, now you have a basic outline of your methodology. Your goal being to use natural language processing to investigate, using social media data, how veterans feel about the care that they receive through the Veterans’ Administration health care system, maybe your methodology will look like this:

1. Create a data set containing tweets in which veterans are talking about how they feel about the care that they receive in the VA health care system.

2. Use machine learning to classify those tweets into ones where the vets feel (a) positive, (b) negative, or (c) neutral about that care.

OK, so: now you can expand that. You’re quickly going to realize that Step 2–classifying those tweets–is actually going to require you to be able to do three classifications:

1. You have to be able to differentiate tweets written by veterans from tweets written by everybody else.

2. You have to be able to differentiate tweets where the vets are talking about the VA health care system from tweets where they’re talking about other things.

3. You have to be able to classify whether the feelings that they express about the VA health care system are positive, negative, or neutral.
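One way to see why this triples the labeling work is to write the three classifications as a cascade, where each tweet only reaches the next decision if it passed the previous one. This is a minimal sketch: `TweetLabels`, `classify_tweet`, and the three `toy_*` functions are hypothetical names, and the keyword rules are crude stand-ins for classifiers that would really be trained on labeled data.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TweetLabels:
    """The three labels that one tweet may need."""
    written_by_veteran: bool
    about_va_care: Optional[bool] = None  # only decided for veterans' tweets
    sentiment: Optional[str] = None       # "positive", "negative", or "neutral"


def classify_tweet(
    text: str,
    is_veteran: Callable[[str], bool],
    is_about_va: Callable[[str], bool],
    sentiment_of: Callable[[str], str],
) -> TweetLabels:
    """Run the three classifications in order, stopping at the first 'no'."""
    if not is_veteran(text):
        return TweetLabels(written_by_veteran=False)
    if not is_about_va(text):
        return TweetLabels(written_by_veteran=True, about_va_care=False)
    return TweetLabels(True, True, sentiment_of(text))


# Hypothetical keyword stand-ins, for illustration only.
def toy_is_veteran(text: str) -> bool:
    return "veteran" in text.lower()


def toy_is_about_va(text: str) -> bool:
    return "VA" in text.split()


def toy_sentiment(text: str) -> str:
    lowered = text.lower()
    if "great" in lowered:
        return "positive"
    if "awful" in lowered:
        return "negative"
    return "neutral"


labels = classify_tweet(
    "As a veteran, my VA clinic is great",
    toy_is_veteran, toy_is_about_va, toy_sentiment,
)
print(labels)
```

Each of the three callables needs its own labeled examples before it can be trained, which is where the data set work multiplies.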

Now that you’ve started to flesh out your methodology, you realize something: creating that data set is going to take a really long time, since you essentially have to be able to label three different kinds of things in the social media posts. You have a finite amount of time and resources with which to do it, so how are you going to make that possible?

Faced with an enormous amount of work to accomplish with limited time and resources, the most sane approach is this: go to your supervisor, show them your detailed methods plan, and let them come to the conclusion that they had better either (a) give you a lot more resources, or (b) modify your assignment. Having gone through this multiple times over the course of my career, I can tell you that (b) is a hell of a lot more likely. What is the modified assignment going to look like? It’s probably going to be a reduction of the task to “just” the task of detecting tweets that were and weren’t written by veterans. Now you can go back to your outline, and modify it:

1. Create a data set containing tweets written by veterans, and tweets written by anybody else.

2. Use machine learning to classify those tweets into the ones that were written by veterans, and the ones that weren’t.
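To make the two steps concrete, here is a minimal sketch of what they could look like in code. The technique is an assumption: the outline only says “machine learning,” and this uses a small multinomial Naive Bayes classifier with add-one smoothing because it fits in a few lines. The four training tweets are made up, and a real data set would need thousands of labeled examples.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real tokenizer would do more."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(count / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: a (made-up, far too small) labeled data set.
training_data = [
    ("served on the uss biddle back in 1980", "veteran"),
    ("got my dd214 and retired from the navy", "veteran"),
    ("mario brothers still nothing like it", "other"),
    ("raffle drawing at the mall today", "other"),
]

# Step 2: train a classifier and apply it to an unseen tweet.
model = NaiveBayes().fit(training_data)
print(model.predict("retired from the navy in 1980"))
```

A classifier like this only learns from the words it has seen, so it knows nothing about the difference between a veteran and someone still on active duty unless the training data shows it.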

This is going to be hard enough, believe me. Here are some examples of what those tweets might look like–I made them up, but they’re totally plausible:

1. HM1 Zipf here, USS Biddle 1980-1982–BT3 Raven McDavid, you out there?

2. AFOSC raffle drawing at 1500–win that lawnmower and help us buy books for the squadron?

3. FTN today, FTN tomorrow, FTN and fuck Chief Chomsky til I get out this motherfucker

4. Mario Brothers, still nothin like it, bitchboys

Have you figured it out? Here are the answers:

1. Clearly written by a veteran.

2. Almost certainly written by the spouse of an active duty Air Force officer, so not written by a veteran.

3. Clearly written by a sailor who is still on active duty, so not written by a veteran.

4. No clue who it was written by, but there’s no reason whatsoever to think that it was written by a veteran, so it should be classified as not written by a veteran.

What’s that you say? It wasn’t clear to you at all? Think about this: if it wasn’t clear to you, it’s certainly not going to be clear to a computer program, so your classification step is going to be difficult. In fact, if it’s not clear to you, you’re going to have a hell of a difficult time building the data set–time to go back to your supervisor and ask for the resources to hire some veterans to help you out!

…and (4) raises a super-difficult question: what the hell counts as a reasonable experimental control for this research project? (Spoiler: I don’t know, and I have a doctoral degree in this particular topic.)

All of this to say:

Your redefined project is going to be plenty hard, thank you very much.

You wouldn’t know how crucial it was to redefine said project if you hadn’t started the process of writing out what exactly you’re going to do.

…and hell–you hadn’t even gotten to the “exactly” part yet! So: take Graciela’s point seriously, and write some things down before you start doing anything else.

…and now you can think about what you’re going to measure to figure out whether or not you were successful in doing what you were trying to do.
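As one possible answer, here is a minimal sketch of three measures that are commonly used for classification tasks like this one: precision, recall, and their harmonic mean, F1. Choosing these measures is an assumption (although the first abstract quoted above does mention optimizing precision), and the function name and the label strings are hypothetical.

```python
def precision_recall_f1(gold, predicted, positive="veteran"):
    """Compare predicted labels against gold labels for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    # Precision: of the tweets called "veteran", how many really were?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the real veteran tweets, how many did we find?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of the two.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["veteran", "veteran", "other", "other"]
predicted = ["veteran", "other", "veteran", "other"]
print(precision_recall_f1(gold, predicted))
```

Whichever measures you pick, writing them into the plan before running anything keeps you from choosing the flattering one afterwards.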

Linguistic geekery: Raven McDavid was a dialectologist back in the day. He is said to be the inspiration for the Harrison Ford character in Raiders of the Lost Ark. Chomsky is Noam Chomsky, the most important (although not the best, in my humble opinion) linguist of the 20th century. Where they appear in the post:

HM1 Zipf here, USS Biddle 1980-1982–BT3 Raven McDavid, you out there?

AFOSC raffle drawing at 1500–win that lawnmower and help us buy books for the squadron?

FTN today, FTN tomorrow, FTN and fuck Chief Chomsky til I get out this motherfucker