I'm attempting to write a script which finds words from several lists (one with over 20,000 entries) in a string. I'll be honest from the start: this is for a university project, so I'm not looking for answers, just pointers and advice. My current approach (probably not very efficient, but it works) is to check whether each list entry occurs as a substring of the original string.

So, say I have the list (Purely an example):

black car
white car
green car
red car
...

And the string:

"my friend drives a bright red car"

It would attempt to find the substring "black car", then the substring "white car", until it gets to "red car" which is a match.
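
A minimal sketch of that lookup, to make the approach concrete. The sample data and the name `find_first_match` are made up for illustration; the real lists are read from files, and the formatting step is left out here.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real lists are much longer.
my @entries = ('black car', 'white car', 'green car', 'red car');
my $string  = 'my friend drives a bright red car';

# Return the first list entry that occurs as a substring of $text,
# or undef if none of them does.
sub find_first_match {
    my ($text, @candidates) = @_;
    for my $entry (@candidates) {
        # index() returns -1 when $entry does not occur in $text
        return $entry if index($text, $entry) >= 0;
    }
    return;
}

my $match = find_first_match($string, @entries);
print defined $match ? "matched: $match\n" : "no match\n";
```

With the sample data this prints `matched: red car`, after first trying "black car", "white car" and "green car" without success.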

(If anybody has any suggestions for a different, more efficient approach to this lookup, please let me know.)

Anyway, that all works fine. However, I'm required to strip all punctuation and format both the original string and the list entries in a certain way before attempting to find an entry in the original string.

After reading up on function calls in Perl, it seems that calling a function from within a loop 20,000+ times is a bad idea where optimization is concerned, because every call carries its own overhead.

I'm struggling to find an alternative approach that doesn't involve duplicating the code within the formatString subroutine 4 or so times. Maybe I'm just being braindead.

I thought about writing a subroutine which takes a filehandle (or a reference to one) as a parameter and does the formatting and lookup within that function, but the lookup for each list is different: some look for matches, some replace text, and others simply remove text.
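
Roughly what that idea would look like, as a sketch only. The name `find_in_list` is hypothetical, and the three formatting lines are placeholders rather than the actual formatString rules. The formatting happens inline, so there is one sub call per list instead of one per entry.

```perl
use strict;
use warnings;

# Hypothetical sketch: read list entries from a filehandle, format each one
# inline (no per-entry sub call), and collect the entries found in $text.
sub find_in_list {
    my ($fh, $text) = @_;
    my @found;
    while (my $entry = <$fh>) {
        chomp $entry;
        # Placeholder formatting, NOT the real formatString rules:
        $entry = lc $entry;
        $entry =~ s/[[:punct:]]//g;   # strip punctuation
        $entry =~ s/\s+/ /g;          # collapse runs of whitespace
        next unless length $entry;    # skip entries that end up empty
        push @found, $entry if index($text, $entry) >= 0;
    }
    return @found;
}

# An in-memory filehandle stands in for a real list file here.
my $list = "Black car\nRed, car!\nGreen  car\n";
open my $fh, '<', \$list or die "cannot open in-memory list: $!";
my @hits = find_in_list($fh, 'my friend drives a bright red car');
close $fh;
print "$_\n" for @hits;
```

This only covers the plain matching case. The lists that replace or remove text would each need their own variant of the loop body, which is exactly the duplication I'm trying to avoid.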

Any ideas?

The formatString subroutine basically performs a bunch of regex substitutions on the passed string: