One of the five people who interviewed me asked a question that resulted in an hour-long discussion: "Explain how you would develop a frequency-sorted list of the ten thousand most-used words in the English language."

My initial response was to assail the assumptions underlying the problem. Language is a fluid thing, I argued. It changes in real time. Vocabulary and usage patterns shift day-to-day. To develop a list of words and their frequencies means taking a snapshot of a moving target. Whatever snapshot you take today isn't going to look like the snapshot you take tomorrow or even five minutes from now.

The reason they ask you this question is to see how you solve problems. So your initial response is great: being critical and independent, and wondering whether a requirement makes sense. At the same time, it is important at some point to let your interviewer (the hypothetical customer) make a decision such as "just take a snapshot once a day" and then go with it. That lets them determine that you're actually committed to getting things done.
– Deckard, Apr 26 '11 at 12:15


This is one of those questions that does not actually have a correct answer. It appears to be designed to do exactly what happened: produce a detailed discussion. If there is a "correct" answer at all, it is simply to "satisfy" the requirements. Beyond that, the question was asked to do exactly what happened: have a technical discussion with you. It is meant to weed out the people they are not interested in.
– Ramhound, Apr 26 '11 at 12:17

9 Answers

My initial response was to assail the
assumptions underlying the problem.

... that resulted in an hour-long discussion

Assuming that you were not interviewing for a position as a developer within the domain of linguistic processing (or where specification errors have life-threatening consequences), you probably gave the interviewer the impression that you either have difficulty separating the important from the incidental or that you tend to be confrontational.

Don't get me wrong, I have the same tendencies and have to work hard to keep them under control, but I had to learn that finding loopholes in requests for cooperation is usually - and often for good reasons - interpreted as lack of empathy and not as a sign of intelligence.

Consider the situation of the interviewer: he probably just wanted to know how you approach the common programming task of accumulating, sorting and evaluating data, and how well you can communicate your solution to another person - while being constrained by the situation of a job interview.

Anything more elaborate than: "Gathering data about language use is, due to its fluid nature, an interesting problem by itself. Unless you'd like me to explore it with you in detail, I will for now just assume it is static." before diving into a solution for a reasonably idealized problem statement, is likely to leave a mixed impression.

To answer the original question:

1. Pick a suitable dictionary (e.g. Oxford), preferably one whose words are already properly 'stemmed' (if not, run them through a suitable stemming algorithm).

2. Put the resulting word stems as keys in an associative array/hash map/dictionary and initialize each value to 0.

3. For each element in the corpus of interest:

   - stem each word

   - if the stemmed word can be found as a key in the previously created hash map, increment its corresponding value by 1

   - if it can't be found, ignore it

4. Print the hash map as a list of key-value tuples, sorted by value.

After that, go into optimization possibilities for the hand-wavy "sorting" step if the interviewer hints at being interested in them.
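As a rough illustration, the steps above could be sketched in Python as follows. This is only a sketch under assumptions: `stem` is a hypothetical placeholder (it lowercases and strips a trailing "s") standing in for a real stemming algorithm, the word list and corpus are made-up toy data, and `heapq.nlargest` is just one possible optimization of the sorting step, not necessarily what an interviewer would expect.

```python
import heapq
import re


def stem(word):
    # Hypothetical placeholder stemmer (an assumption, not a real algorithm):
    # lowercase the word and strip a single trailing "s".
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def count_frequencies(dictionary_words, corpus_lines):
    # Build the hash map: stemmed dictionary words as keys, every count at 0.
    counts = {stem(w): 0 for w in dictionary_words}
    # Stem each corpus word; count it only if it is a known key, else ignore it.
    for line in corpus_lines:
        for token in re.findall(r"[A-Za-z']+", line):
            key = stem(token)
            if key in counts:
                counts[key] += 1
    return counts


def top_words(counts, n=10_000):
    # Return the n highest-count (word, count) pairs, highest first.
    # heapq.nlargest avoids sorting the whole map when n is much smaller than it.
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


# Toy data, invented for illustration only.
dictionary_words = ["cat", "dog", "bird", "run"]
corpus = ["The cats chase dogs.", "A dog runs; the cat naps.", "Cats!"]

counts = count_frequencies(dictionary_words, corpus)
ranking = top_words(counts, n=3)
print(ranking)
```

For the full task, `n` would be 10,000 and the corpus would be streamed from files rather than held in a list.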

This question is a good interview question because it invites discussion on several levels. At one level, it is a programming question, to see if you know how to efficiently scan a corpus of English text to count the occurrence of words. A person with a good CS education and a couple of years' experience will probably tackle it as a programming question. It is also a question about defining the problem domain.

A person with more experience may also know that English is a moving target and that where you get your corpus of English words will make a difference.

An even more sophisticated user may have some notion that the 10k most frequently used words will be virtually independent of the corpus, unless you make a huge mistake like using Shakespeare or a medical dictionary as your corpus. They may want to know how they are to distinguish proper names from other words. There are some real subtleties here.

If the company is in the business of analyzing English text, this question sorts out candidates by skill level. It's a check on the claims they make in their resume.

You mustn't get in a huff because the question is ill-defined. Of course it's ill-defined. One purpose in asking the question is to see if you recognize issues in the definition. But don't focus too intensely on that part of the question either. It may be that the interviewer really intends to talk about your programming smarts, and is himself not very sophisticated. After a little preliminary conversation, ask the interviewer, "Do you want me to talk about programming issues or about the domain of analyzing English text?" If you want to score full marks on this question, you need to answer the question the interviewer intends to ask. The interviewer might not even have a completely formed idea of what he wants to hear.

In my recent experience interviewing, most interviewers ask coding questions. Coding is simple and well defined. The interviewer doesn't have to think too hard to ask you a coding question. He has years of experience reading and writing code, and lots of time to look for bugs while you fumble at the whiteboard.

Not nailing down requirements and making assumptions is not how I would have answered this question.
– user23157, Apr 26 '11 at 15:59


Exactly, just answer the question :-) Nailing down reqs or defining precise meanings of words is not really the point of this question; the interviewers are clearly after an algorithm, and you should make some basic assumptions and describe an approach. You should of course state any assumptions, and there is no harm in saying something along the lines of "in reality I'd spend more time nailing down reqs here", but debating the nature of language or stating that the requirements are not well defined would look to me like you were evading the technical point somewhat.
– Steve Haigh, Apr 26 '11 at 16:41

Solve the programming problem. If they want you to solve the business problem of what data sources to use, then discuss possible sources and what you think their merits are, and say you'd want to do testing, compare the results to existing frequency lists, etc.
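One simple way to compare a computed ranking against an existing frequency list is to measure how much their top-n entries overlap. The sketch below is only an assumption about how such a check might look: `top_n_overlap` is a hypothetical helper, and both word lists are invented for illustration rather than taken from any real frequency list.

```python
def top_n_overlap(ranked_a, ranked_b, n):
    """Fraction of the top-n words of ranked_a that also appear in the top-n of ranked_b."""
    top_a = set(ranked_a[:n])
    top_b = set(ranked_b[:n])
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


# Made-up rankings, most frequent word first.
ours = ["the", "of", "and", "to", "cat"]
reference = ["the", "and", "of", "a", "to"]

overlap = top_n_overlap(ours, reference, n=5)
print(f"top-5 overlap: {overlap:.0%}")
```

This ignores the order within the top n; a rank correlation would be a stricter comparison if position matters.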

The question itself is not sufficiently defined to be a technical question.

I find it odd for an hour-long discussion to arise from that single question... But since it evidently did occur, you can be sure that the interviewing committee didn't even know what they were looking for. It's most likely they were assessing your first action (i.e. if the interviewee spat out (not-pseudo-)code, it would've been seen negatively).

Without any additional information, I would simply give the bog-standard obvious answer: record every word spoken by every English-speaking person on Earth for a day, and then state these assumptions:

Recording whoever I want is a feasible and viable option (i.e. the equipment is available and it was somehow possible).

A day, or whatever timeframe is picked, is a sufficiently big sample of words.

A semi-related aside: Sorry, but I disagree with

Vocabulary and usage patterns shift
day-to-day. To develop a list of words
and their frequencies means taking a
snapshot of a moving target. Whatever
snapshot you take today isn't going to
look like the snapshot you take
tomorrow or even five minutes from
now.

A person's vernacular is controlled by those he is addressing or speaking to. While the people you interact with each day may be different - therefore, effecting a change in spoken words - it is naive to assume that every English-speaking person has the same subsequent change.

First I would ask for a set of requirements for the list object: "In what form is the data added to the list?" would be my first question. Once the requirements were nailed down I would then set about designing the object to match.

Whether or not you think the project is a sensible one is to a certain extent neither here nor there; as a developer you might not think a project you are asked to work on is sensible or viable, but it is in general your job to give the customer what they've asked for.

It might have been an interesting conversation, but do you think an hour arguing the toss over a point of semantics was as useful as showing the guy that you knew how to write software?

They would look at me with blank stares and eventually say: "But that is the slowest possible way to sort!", implying I had just failed the interview.

I would reply: It would be slow the first time, granted, but then it would be fast later, because, let's be honest, the frequency list of the most common 10K English words isn't going to change much on a daily basis, so you should pick the simplest algorithm possible, one that even a 5-year-old can understand. That way, no strange bugs, edge cases, etc. can creep in.

Then I would pick up my stuff and say: "It's been a pleasure meeting with you" as obviously I was not what they were looking for.

Clarification

After the original sort, the elements on the list would see their ranking go up and down by very little, if at all. For example, "F*ck" would stay near the top for many years to come, so the swap-sort (bubble-sort for you college students) would have very little swapping to do before the list was ordered again.
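To make the claim concrete, here is a minimal sketch of a bubble sort with an early exit, run on a list that is already almost in order. The function name, the word counts and the single changed count are all invented for illustration; the point is only that a nearly sorted list needs very few swaps.

```python
def bubble_sort_desc(items):
    """Sort (word, count) pairs in place by descending count; return the number of swaps."""
    swaps = 0
    n = len(items)
    while True:
        swapped = False
        for i in range(n - 1):
            if items[i][1] < items[i + 1][1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swaps += 1
                swapped = True
        # The smallest remaining count has bubbled to the end of the range.
        n -= 1
        if not swapped:
            break  # early exit: a full pass with no swaps means the list is sorted
    return swaps


# Yesterday's ranking with one updated count ("and" rose from 80 to 95).
ranking = [("the", 100), ("of", 90), ("and", 95), ("to", 70)]
swaps = bubble_sort_desc(ranking)
print(swaps, ranking)
```

Each pass still compares every adjacent pair, so re-sorting 10K entries costs at least one linear scan even when only a handful of swaps are needed.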