There's supposed to be a tag in each line enclosed in parentheses. But as you can see, the data is very dirty: there are extra spaces inside tags, missing parentheses, misspelled tags, etc.

How would I identify the lines containing "(tag)" or something similar? The output I would expect in the example is to retrieve all lines, except for the last.

I need to compare a long string to a shorter sub-string, so something like Levenshtein distance won't work (or will it?). Tokenized techniques seem too rigid, because although I might tokenize everything in parentheses, what happens when a parenthesis is missing?

2 Answers

I've done something similar to this a while back, only I was just matching words to a sought word. This might not help you, but it might give you an idea.

Essentially, what I did was something like this:

For each word in my range, I did something similar to a diff: I considered each letter in the string and its position. For each correct letter I gave a point, and for a correct position in the string I gave another point. For bad placement I would remove a point, and for missing letters I would remove a point. (It's hard to remember, because I don't have access to the source anymore, but I think I also based the placement point on the distance from the expected position.)

After I tallied up the 'score' of this word, I would normalize. I had an arbitrary threshold (found via trial and error, but it could be feedback driven) that I used to determine if it was close enough.

Then all of the words that were above the threshold were returned to the user in descending order.
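Since I no longer have the original source, here is only a rough Python sketch of the idea, not the code I actually used. The names `score_word` and `rank_matches`, the exact point weights, the normalization by `2 * len(target)`, and the default threshold of 0.3 are all assumptions made for illustration. The distance-based placement penalty mentioned above is left out to keep it short.

```python
from collections import Counter

def score_word(candidate: str, target: str) -> float:
    """Score candidate against target; 1.0 means an exact match.

    Assumed weights: +1 for a correct letter, +1 more if it is in the
    correct position, -1 for bad placement, -1 for a missing letter.
    """
    if not target:
        return 0.0
    score = 0
    unmatched = []          # target letters with no positional match
    for i, ch in enumerate(target):
        if i < len(candidate) and candidate[i] == ch:
            score += 2      # correct letter in the correct position
        else:
            unmatched.append(ch)
    # Candidate letters that were not consumed by a positional match.
    leftover = Counter(
        ch for i, ch in enumerate(candidate)
        if not (i < len(target) and target[i] == ch)
    )
    for ch in unmatched:
        if leftover[ch] > 0:
            leftover[ch] -= 1   # present but misplaced: +1 - 1 = 0
        else:
            score -= 1          # letter is missing entirely
    # Normalize by the best possible score for this target.
    return score / (2 * len(target))

def rank_matches(words, target, threshold=0.3):
    """Return (word, score) pairs above the threshold, best first."""
    scored = [(w, score_word(w, target)) for w in words]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    for word, s in rank_matches(["tag", "tga", "tg", "xyz", "tab"], "tag"):
        print(f"{word}: {s:.2f}")
```

With these assumed weights, "tag", "tab" and "tga" clear the 0.3 threshold for the target "tag", while "tg" and "xyz" do not. The threshold would need tuning on real data, just as it did in my case.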

IIRC, it wasn't terribly effective for short words, but overall, it was fairly effective for its implementation (search over a specific domain).