Permuted Index

December 22, 2009

In 1972, David Parnas proposed the construction of a permuted index, also known as a keyword-in-context or kwic index, as an exercise in program design. For example, the three sentences

All's well that ends well.
Nature abhors a vacuum.
Every man has a price.

produce the permuted index:

Nature abhors a vacuum.
All's well that ends well.
All's well that ends well.
Every man has a price.
Every man has a price.
Every man has a price.
Nature abhors a vacuum.
Every man has a price.
All's well that ends well.
Nature abhors a vacuum.
All's well that ends well.
All's well that ends well.

Parnas proposed a three-step algorithm: rotate, sort, unrotate. The rotate step takes a sentence and produces all its rotations. The sort step sorts the rotations by their back half. The unrotate step produces neatly-formatted output. Somewhere, in one of the three steps, rotations that produce back halves starting with words on a stop list are discarded; note that there is no output for “a vacuum” or “a price.”

Your task is to implement a program that produces permuted indexes using Parnas’ three-step algorithm. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

def rotate(phrase):
“””
Takes a given phrase and performs a keyword rotation on it, returning a
list of (part 1, part 2) tuples.
“””
# I don’t want to figure out the sorting of capital letters.
phrase = phrase.lower()
# split the phrase into words.
split_phrase = phrase.split(‘ ‘)
rotation_list = []
# Perform the actual rotation and return the splits.
for i in range(len(split_phrase)):
non_words =\
[“a”,”an”,”and”,”by”,”for”,”if”,”in”,”is”,”of”,”on”,”the”,”to”]
# If the split is before a non word skip the split, unless it’s the first
# word in the phrase.
if (split_phrase[i] in non_words) and (i is not 0):
continue
# Create the left and right portions of the split, and insert into the
# return tuple list.
left = “”
right = “”
for n in range(len(split_phrase[:i])):
if n == 0:
left = split_phrase[n]
else:
left = left + ‘ ‘ + split_phrase[n]
for n in range(len(split_phrase[i:])):
if n == 0:
right = split_phrase[i+n]
else:
right = right + ‘ ‘ + split_phrase[i+n]
rotation_list.append((left, right))
return rotation_list

def sort(rotation_list):
“””
Given a rotation list, bubble sort by the right hand side because I’m lazy.
“””
for i in range(len(rotation_list)):
for n in range(len(rotation_list)-1):
if (rotation_list[n][1] > rotation_list[n+1][1]):
rotation_list[n], rotation_list[n+1] = \
rotation_list[n+1], rotation_list[n]

def unrotate(rotation_list):
“””
Neatly output the given rotation list to the user.
“””
# Find the length of the longest string in the rotation list.
max_size = 0
for i in range(len(rotation_list)):
for j in range(2):
if (len(rotation_list[i][j]) > max_size):
max_size = len(rotation_list[i][j])

Parnas’s paper on kwic index looked like he had spent a week on very intensive design. In my opinion, parnas’s ideas are roughly the same as Niklaus Wirth’s “refinement” method – or at least compatible with it.
I think Parnas was on the right track with his ideas.

I was astounded a couple of years ago to see that someone had written a kwic index progam in 3 lines of Python. I did not keep a note of the Author. I still think however, that Parnas’s design method is very good.