A recruiter gave me a homework problem as a part of the recruiting process and after receiving my submission he told me that he decided not to proceed with me. When I asked for the reason, he told me that I should ask for advice online from more experienced Python programmers. He said that "the issues were algorithm design and code structure". The (short) description of the problem and the code are in this gist. Please let me know what you think he was pointing at.

Problem description:

Given a page of content with alphanumeric words, and a search phrase
of N words, write an algorithm that will return the shortest snippet
of content that contains all N words in any order. Example: The
George Washington Bridge in New York City is one of the oldest bridges
ever constructed. It is now being remodeled because the bridge is a
landmark. City officials say that the landmark bridge effort will
create a lot of new jobs in the city.

2 Answers
2

I would agree with the recruiter: this code does not make a good first impression.

Red flags

Any one of these issues could be grounds for disqualification within the first few seconds of reading.

Too much code. If someone gives you a programming puzzle, chances are that they expect a short, elegant solution. Nobody wants to read a long, rambling answer, so they would not ask a question that requires a lot of code to solve.

The bulk of your code is in one long function. I couldn't point to a chunk of code within that function and say what that chunk's purpose is. The comments that you left are worse-than-useless junk: they are all disabled print statements for debugging.

Global variables. It is widely agreed that global variables are to be used only as a last resort. Here, there's absolutely no justification for them. Not only do they make it hard to reason about the code by making side-effects non-localized, they indicate that your function's interface is sloppy: it's unclear what the function's inputs and outputs are.

Too many variables. Within FindShortest(), we have:

CurrentTerm

MatchIndexes (global?)

LeftBorder (global)

RightBorder (global)

FinalLeftBorder (global)

FinalRightBorder (global)

SearchTerms

OptimalRightBorderFound

OldLeftBorder

OldRightBorder

The human mind can keep track of about seven things at once. Ideally, you should only have about three variables per function.

Non-idiomatic Python. You used regular expressions, which is good. Other than that, your answer is written not much differently from a C solution. You need to demonstrate your ability to think at a more abstract level.

Non-standard naming. You should call your function find_shortest, and its parameter current_term, and similarly for all variables. UpperCase identifiers look like class names. See PEP 8, which is the standard style guide for Python.

The naming of your functions is particularly important, since its effects extend beyond your own code. You've just indicated that you will burden your future colleagues with non-standard naming.

Yellow flags

These are also serious issues. They are just not as obvious as the red flags at first glance.

Lots of free-floating preparatory code. There's a lot of code between opening the file and calling FindShortest(). Why does it do, and why is it not packaged in functions too?

Careless concatenation.RegexVariable = r"\W" + Term + r"\W" trusts that Term contains no characters that have special meaning within regular expressions. From that one line of careless concatenation, I would infer that you would probably write code that is vulnerable to SQL injection, cross-site scripting, arbitrary command execution, header-splitting attacks, etc.

Irresponsible recursion. When a function calls itself, that is recursion. However, recursion should be used responsibly: there should be invariants, well-defined function inputs and outputs, a base case, and a recursive case. But you don't have any of those elements, so effectively you have a weird goto.

Proposed solution

For comparison, here's what I came up with. It's optimized more for simplicity than performance — it's usually beneficial to convey that goal to your interviewer.

Note that it accomplishes the task by chaining three functions, each with well defined inputs and outputs.

from collections import namedtuple
from itertools import product
import re
WordPos = namedtuple('WordPos', ['start', 'end'])
def find_all_words(text, words):
"""
For each word in the list, find all positions in which they
appear in the text. Results are returned as a dict, with
the words as keys. Each value is a list, with a WordPos
object to mark each occurrence of the word in the text.
"""
def positions(text, word):
word_re = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
return [WordPos(match.start(), match.end()) for match in
word_re.finditer(text)]
return { word: positions(text, word) for word in words }
def cluster(found_words):
"""
Given a dict resulting from find_all_words(), pick the
occurrence of each word that minimizes the length of the
substring of text that contains them all.
The result is a WordPos object that represents the span of
the cluster. If any of the words does not appear in the
text, then this function returns None.
"""
def bounds(combo):
start = min(word.start for word in combo)
end = max(word.end for word in combo)
return WordPos(start, end)
positions = found_words.values()
combo_bounds = [bounds(combo) for combo in product(*positions)]
if combo_bounds:
# Find the shortest combo
return min(combo_bounds, key=lambda span: span.end - span.start)
else:
# At least one search term was not found
return None
def excerpt(text, combo_bounds):
"""
Take the substring of text corresponding to the given span.
"""
if not combo_bounds:
return None
return text[combo_bounds.start : combo_bounds.end]
test_text = """The George Washington Bridge in New York City…"""
test_terms = 'Landmark City Bridge'.split()
print(excerpt(test_text, cluster(find_all_words(test_text, test_terms))))

Good review! {word: positions(text, word) for word in words} would be a dictionary-comprehension alternative to dict(...). The logic for finding start and end, though simple, is duplicated and could be extracted: start, end = find_bounds(combo) instead of having the nested span function.
–
codesparkleApr 28 '14 at 4:33

Thank you so much! This has been more than educational. As @Veedrac rightfully pointed about I am indeed lacking experience with real code (a student with only competitive programming experience under my belt). As you pointed out, my code looks like C code because I only started coding in Python recently. Besides what is already recommended by Veedrac, are there any resources you would recommend me to study to develop more abstract, Pythonic thinking?
–
Arman TokanovApr 28 '14 at 6:08

1

Try following the online book Composing Programs. If you're used to languages like C, then Section 1.6 (Higher-Order Functions) will blow your mind.
–
200_success♦Apr 28 '14 at 6:54

3

To expand on 200_success's point about commented out print statements, I would suggest replacing them with assert statements, unit tests, trips through pdb or another debugger, or removing them all together. Commented out code almost never belongs in a finished product.
–
Michael UrmanApr 29 '14 at 12:18

Thank you! So how do I manipulate global variables within recursion without stating them as global in the recursion?
–
Arman TokanovApr 30 '14 at 7:54

Use standard naming. ReplaceTheseFormsOfVariableNames with these_forms_of_variable_names, at least in accordance to PEP 8.

State in global variables is pretty much always bad. Encapsulate that behaviour in returns.

Files need safety. I suggest you look up the with statement. You'd use

with open(...) as myfile:
do_stuff_with(myfile)

Never use addition on strings or list-like structures in a loop unless they've been explicitly designed for it, or you know enough to know it doesn't apply. f.read() would have worked better.

x[:len(x) - y] is better just written x[:-y].

Encapsulate more things in functions and classes.

sys.stdout.write is a bit of an odd thing to use; print will do fine and a function that just returns a list (with no printing) would have worked yet nicer.

All those debug statements don't make reading the code any easier.

I think something you're missing is just experience with real code. Reading stuff like arbitrary open source Python code or maybe this or how about some standard-library stuff¹ will help you get to grips with how code is meant to feel. Because the overriding sense I got from reading your code is that it wasn't Python, largely because it's too different. I never got to the point where I cared about the algorithm. If you were working in a team you'd be creating a lot of difficulty just by being nonstandard, and I assume that harmed your chances a lot.