What do I use to search for multiple words in a string? I would like the logical operation to be AND, so that all the words appear somewhere in the string. I have a bunch of nonsense paragraphs and one plain English paragraph, and I'd like to narrow it down by specifying a couple of common words like "the" and "and", but I would like it to match all the words I specify.

Regular expressions support a "lookahead" assertion (one kind of lookaround) that tests whether a term occurs later in the string without consuming any characters. Because the match position does not advance, each following search term is tested again from the beginning of the string. This allows searching a string for a group of words in any order.

The regular expression for this is:

^(?=.*\bword1\b)(?=.*\bword2\b)(?=.*\bword3\b)

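Where \b is a word boundary and (?=...) is a positive lookahead assertion. Note that in most engines . does not match a newline by default, so for multi-line text you may need to enable the dot-matches-newline option.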

If you have a variable number of words you want to search for, you will need to build this regular expression string with a loop - just wrap each word in the lookaround syntax and append it to the expression.
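As a rough illustration of building the expression in a loop, here is a minimal Python sketch. The function names build_all_words_pattern and contains_all_words are made up for this example, and the use of re.escape and re.DOTALL is my own assumption about what you would want (escaping metacharacters in the words, and letting . cross line breaks in a paragraph).

```python
import re

def build_all_words_pattern(words):
    """Build a regex that matches only if every word occurs somewhere in the text."""
    # Wrap each word in a positive lookahead; re.escape protects against
    # words that happen to contain regex metacharacters.
    lookaheads = "".join(rf"(?=.*\b{re.escape(w)}\b)" for w in words)
    # re.DOTALL lets '.' match newlines, so multi-line paragraphs are covered.
    return re.compile("^" + lookaheads, re.DOTALL)

def contains_all_words(text, words):
    """Return True if every word in `words` appears as a whole word in `text`."""
    return build_all_words_pattern(words).search(text) is not None
```

For example, contains_all_words("the cat and the dog", ["the", "and"]) is true, while the same call on "the cat or the dog" is false. Because of \b, this assumes the search terms begin and end with word characters.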

Doesn't that just match a sentence that contains two words, either word1 followed by word2, or word2 followed by word1 (as desired), or word1 followed by word1, or word2 followed by word2 (as not desired)? That was the sort of problem I ran into when trying to answer.
– Jonathan Leffler, Oct 17 '08 at 3:20

Assuming PCRE (Perl regexes), I am not sure that you can do it at all easily. The AND operation is concatenation of regexes, but you want to be able to permute the order in which the words appear without having to formally generate the permutation. For N words, when N = 2, it is bearable; with N = 3, it is barely OK; with N > 3, it is unlikely to be acceptable. So, the simple iterative solution - N regexes, one for each word, and iterate ensuring each is satisfied - looks like the best choice to me.
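The iterative approach could look something like the following Python sketch (Python rather than Perl, purely for illustration). The function name has_every_word and the sample paragraphs are invented for this example.

```python
import re

def has_every_word(text, words):
    """Check each word with its own small regex; order of appearance is irrelevant."""
    # all() stops at the first word that is not found.
    return all(re.search(rf"\b{re.escape(w)}\b", text) for w in words)

# Hypothetical sample data: nonsense paragraphs plus one plain English one.
paragraphs = [
    "xq zzv blorp wug",
    "the quick fox and the lazy dog",
    "bland theory of nothing",
]

# Keep only the paragraphs that contain every required word.
matches = [p for p in paragraphs if has_every_word(p, ["the", "and"])]
```

Here matches ends up holding only the second paragraph, since "bland" and "theory" do not contain "and" or "the" as whole words.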

Why do the N things have to be regexes though? Could just use "index" here.
– Account deleted, Oct 16 '08 at 22:58

\b(foo|bar|baz)\b.*\b(?!\1)(foo|bar|baz)\b.*\b(?!\1)(?!\2)(foo|bar|baz)\b ought to handle permutations by using back references and negative lookahead to avoid matching a word twice. It's still properly evil, but at least the pattern length isn't O(N!)
– stevemegson, Oct 16 '08 at 23:19

@BKB: I'm not sure what you mean by using an index.
– Jonathan Leffler, Oct 17 '08 at 3:23

@SteveMegson: Yes, I think I see what you're up to - and not being sure of the scope of negative lookahead (a relatively new feature of Perl - since I was really learning it, back in the days of 4.x, and 5.[0-6]), I was not dogmatic in my answer. As you say, not nice, but not combinatorial either.
– Jonathan Leffler, Oct 17 '08 at 3:25