I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. "hede", using a regular expression?

Input:

hoho
hihi
haha
hede

Code:

grep "<Regex for 'doesn't contain hede'>" input

Desired output:

hoho
hihi
haha

Probably a couple years late, but what's wrong with: ([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$)))*? The idea is simple. Keep matching until you see the start of the unwanted string, then only match in the N-1 cases where the string is unfinished (where N is the length of the string). These N-1 cases are "h followed by non-e", "he followed by non-d", and "hed followed by non-e". If you managed to pass these N-1 cases, you successfully didn't match the unwanted string so you can start looking for [^h]* again
– stevendesu Sep 29 '11 at 3:44

@stevendesu: try this for 'a-very-very-long-word' or even better half a sentence. Have fun typing. BTW, it is nearly unreadable. Don't know about the performance impact.
– Peter Schuetze Jan 30 '12 at 18:45

@PeterSchuetze: Sure it's not pretty for very very long words, but it is a viable and correct solution. Although I haven't run tests on the performance, I wouldn't imagine it being too slow since most of the latter rules are ignored until you see an h (or the first letter of the word, sentence, etc.). And you could easily generate the regex string for long strings using iterative concatenation. If it works and can be generated quickly, is legibility important? That's what comments are for.
– stevendesu Feb 2 '12 at 3:14

@stevendesu: i'm even later, but that answer is almost completely wrong. for one thing, it requires the subject to contain "h" which it shouldn't have to, given the task is "match lines which [do] not contain a specific word". let us assume you meant to make the inner group optional, and that the pattern is anchored: ^([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$))?)*$ this fails when instances of "hede" are preceded by partial instances of "hede" such as in "hhede".
– jaytea Sep 10 '12 at 10:41
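
The counterexample in the comment above can be checked mechanically. The following is a minimal sketch in Python (an assumption; the thread itself only mentions grep), using the anchored pattern exactly as quoted in that comment. The helper name `accepted` is made up for illustration.

```python
import re

# The anchored, optional-inner-group variant quoted in the comment above.
HAND_BUILT = re.compile(r"^([^h]*(h([^e]|$)|he([^d]|$)|hed([^e]|$))?)*$")

def accepted(line: str) -> bool:
    """Return True if the hand-built pattern accepts the whole line."""
    return HAND_BUILT.search(line) is not None

for line in ["hoho", "hede", "hhede"]:
    # A correct "does not contain hede" pattern would make both columns agree.
    print(f"{line!r}: accepted={accepted(line)}, really lacks hede={'hede' not in line}")
```

On "hhede" the first branch consumes "hh" as "h followed by non-e", after which "ede" is ordinary non-h text, so the line is accepted even though it contains "hede".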

28 Answers

The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:

^((?!hede).)*$

The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
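
As a quick check of the pattern against the input from the question, here is a small sketch in Python (the language is an assumption; any engine with negative lookahead should behave the same way for this input). The names `NO_HEDE` and `kept` are illustrative only.

```python
import re

# The pattern from the answer, unchanged.
NO_HEDE = re.compile(r"^((?!hede).)*$")

lines = ["hoho", "hihi", "haha", "hede"]

# Keep only the lines the pattern matches, i.e. those without "hede".
kept = [line for line in lines if NO_HEDE.search(line)]
print(kept)
```

Note that the empty string also matches, since the group may repeat zero times.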

And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):

/^((?!hede).)*$/s

or use it inline:

/(?s)^((?!hede).)*$/

(where the /.../ are the regex delimiters, i.e., not part of the pattern)

If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:

/^((?!hede)[\s\S])*$/
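
The three variants above can be compared side by side. This is a sketch in Python, where the DOT-ALL modifier is spelled `re.DOTALL` (or `(?s)` inline); the sample strings are made up for illustration.

```python
import re

text = "hoho\nhihi"   # no "hede", but contains a line break
bad = "hoho\nhede"    # "hede" appears after the line break

plain = re.compile(r"^((?!hede).)*$")              # dot stops at "\n"
dotall = re.compile(r"^((?!hede).)*$", re.DOTALL)  # trailing-s equivalent
inline = re.compile(r"(?s)^((?!hede).)*$")         # inline modifier
charcls = re.compile(r"^((?!hede)[\s\S])*$")       # no modifier needed

for name, pattern in [("plain", plain), ("dotall", dotall),
                      ("inline", inline), ("charcls", charcls)]:
    print(name, bool(pattern.search(text)), bool(pattern.search(bad)))
```

Without the modifier, the plain pattern cannot get past the line break, so it rejects `text` even though no "hede" is present.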

Explanation

A string is just a list of n characters. Before and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":

e1 A e2 B e3 h e4 e e5 d e6 e e7 C e8 D e9

where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.

So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$

As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
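
The walk over the empty strings can be reproduced directly. This is a sketch in Python (an assumed language), where index i in the string corresponds to the empty string e(i+1) in the explanation; the variable names are illustrative.

```python
import re

s = "ABhedeCD"
lookahead = re.compile(r"(?!hede)")

# Try the bare lookahead at every position 0..len(s), i.e. at e1..e9.
for i in range(len(s) + 1):
    ok = lookahead.match(s, i) is not None
    print(f"e{i + 1} (index {i}): lookahead {'passes' if ok else 'fails'}")

# Collect the empty strings at which the lookahead fails.
failing = [i + 1 for i in range(len(s) + 1) if lookahead.match(s, i) is None]
print(failing)
```

Only one position fails, and one failure is enough: the repeated group cannot get past it, so the end anchor is never reached.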

I would not go so far as to say that this is something regex is bad at. The convenience of this solution is pretty obvious and the performance hit compared to a programmatic search is often going to be unimportant.
– Archimaredes Mar 3 '16 at 16:09

@PeterK, sure, but this is SO, not MathOverflow or CS-Stackexchange. People asking a question here are generally looking for a practical answer. Most libraries or tools (like grep, which the OP mentions) with regex support have features that make them non-regular in a theoretical sense.
– Bart Kiers Nov 18 '16 at 15:08

@Bart Kiers, no offense to your answer, just this abuse of terminology irritates me a bit. The really confusing part here is that regular expressions in the strict sense can very much do what OP wants, but the common language to write them does not allow it, which leads to (mathematically ugly) workarounds like look-aheads. Please see this answer below and my comment there for the (theoretically aligned) proper way of doing it. Needless to say, it works faster on large inputs.
– Peter K Nov 18 '16 at 15:33

In case you ever wondered how to do this in vim: ^\(\(hede\)\@!.\)*$
– baldrs Nov 24 '16 at 11:58

Thanks, I used it to validate that the string doesn't contain a sequence of digits: ^((?!\d{5,}).)*
– Samih A May 10 '15 at 10:42

Hello! I can't compose a 'does not end with "hede"' regex. Can you help with it?
– Aleks Ya Oct 18 '15 at 21:33

@AleksYa: just use the "contain" version, and include the end anchor into the search string: change the string to "not match" from "hede" to "hede$"
– Nyerguds May 4 '16 at 10:42

@AleksYa: the does not end version could be done using negative lookbehind as: (.*)(?<!hede)$. @Nyerguds' version would work as well but completely misses the point on performance the answer mentions.
– thisismydesign Sep 14 '17 at 16:53
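
Both "does not end with" suggestions from the two comments above can be tried out. This is a sketch in Python (an assumed language; its `re` module supports fixed-width lookbehind). The names `LOOKBEHIND` and `ANCHORED_LOOKAHEAD` are made up here.

```python
import re

# Lookbehind variant from the comment above.
LOOKBEHIND = re.compile(r"(.*)(?<!hede)$")

# "Contain" version with the end anchor moved into the lookahead.
ANCHORED_LOOKAHEAD = re.compile(r"^((?!hede$).)*$")

for line in ["hoho", "hede", "hohohede", "hedehoho"]:
    print(line,
          bool(LOOKBEHIND.search(line)),
          bool(ANCHORED_LOOKAHEAD.search(line)))
```

For these single-line samples the two patterns agree: only lines that end in "hede" are rejected, while "hedehoho" is accepted.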


Why are so many answers saying ^((?!hede).)*$ ? Is it not more efficient to use ^(?!.*hede).*$ ? It does the same thing but in fewer steps
– JackPRead Jan 15 at 10:53

^        the beginning of the string,
(        group and capture to \1 (0 or more times (matching the most amount possible)),
(?!      look ahead to see if there is not,
hede     your string,
)        end of look-ahead,
.        any character except \n,
)*       end of \1 (Note: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in \1)
$        before an optional \n, and the end of the string

Important to note this only uses basic POSIX.2 regular expressions and thus whilst terse is more portable for when PCRE is not available.
– Steve-o Feb 19 '14 at 17:25

I agree. Many if not most regular expressions are not regular languages and could not be recognized by a finite automaton.
– ThomasMcLeod Mar 22 '14 at 21:36

@ThomasMcLeod, Hades32: Is it within the realms of any possible regular language to be able to say ‘not’ and ‘and’ as well as the ‘or’ of an expression such as ‘(hede|Hihi)’? (This may be a question for CS.)
– James Haigh Jun 13 '14 at 16:54

@JohnAllen: ME!!! …Well, not the actual regex but the academic reference, which also relates closely to computational complexity; PCREs fundamentally can not guarantee the same efficiency as POSIX regular expressions.
– James Haigh Jun 13 '14 at 17:04

Sorry, this answer just doesn't work: it will match hhehe and even match hehe partially (the second half).
– Falco Aug 13 '14 at 12:57

You first define the type of your expressions: labels are letter (lal_char) to pick from a to z for instance (defining the alphabet when working with complementation is, of course, very important), and the "value" computed for each word is just a Boolean: true the word is accepted, false, rejected.

True, but ugly, and only doable for small character sets. You don't want to do this with Unicode strings :-)
– reinierpost Nov 8 '15 at 23:43

There are more tools that allow it, one of the most impressive being Ragel. There it would be written as (any* - ('hehe' any*)) for start-aligned match or (any* -- ('hehe' any*)) for unaligned.
– Peter K Nov 18 '16 at 15:09

@reinierpost: why is it ugly and what's the problem with unicode? I can't agree on both. (I have no experience with vcsn, but have with DFA).
– Peter K Nov 18 '16 at 15:39

The regexp ()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).* didn't work for me using egrep. It matches hede. I also tried anchoring it to the beginning and end, and it still didn't work.
– Pedro Gimeno Dec 6 '16 at 23:18

@PedroGimeno When you anchored, you made sure to put this regex in parens first? Otherwise the precedences between anchors and | won't play nicely. '^(()|h(ed?)?|([^h]|h([^e]|e([^d]|d([^e]|e.)))).*)$'.
– akim Dec 8 '16 at 9:03

Here's a good explanation of why it's not easy to negate an arbitrary regex. I have to agree with the other answers, though: if this is anything other than a hypothetical question, then a regex is not the right choice here.

Some tools, and specifically mysqldumpslow, only offer this way to filter data, so in such a case, finding a regex to do this is the best solution apart from rewriting the tool (various patches for this have not been included by MySQL AB / Sun / Oracle).
– FGM Aug 7 '12 at 12:21

Exactly analogous to my situation. Velocity template engine uses regular expressions to decide when to apply a transformation (escape html) and I want it to always work EXCEPT in one situation.
– Henno Vermeulen Oct 18 '13 at 14:43

What alternative is there? I've never encountered anything that could do precise string matching besides regex. If OP is using a programming language, there may be other tools available, but if he/she is not writing code, there probably isn't any other choice.
– kingfrito_5005 Oct 20 '16 at 18:32

One of many non-hypothetical scenarios where a regex is the best available choice: I'm in an IDE (Android Studio) that shows log output, and the only filtering tools provided are: plain strings, and regex. Trying to do this with plain strings would be a complete fail.
– LarsH Dec 5 '16 at 16:11

With negative lookahead, a regular expression can match something that does not contain a specific pattern. This is answered and explained by Bart Kiers. Great explanation!

However, with Bart Kiers' answer, the lookahead part will test 1 to 4 characters ahead while matching any single character. We can avoid this and let the lookahead part check out the whole text, ensure there is no 'hede', and then the normal part (.*) can eat the whole text all at one time.

Here is the improved regex:

/^(?!.*?hede).*$/

Note that the lazy quantifier (*?) in the negative lookahead part is optional; you can use the greedy quantifier (*) instead, depending on your data: if 'hede' is present and in the first half of the text, the lazy quantifier can be faster; otherwise, the greedy quantifier can be faster. However, if 'hede' is not present, both will be equally slow.
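
To confirm that the lazy and greedy forms, and the per-character lookahead they replace, accept the same single-line inputs, here is a small sketch in Python (an assumed language). It checks agreement only and makes no timing claims; `PATTERNS` and `verdicts` are illustrative names.

```python
import re

PATTERNS = {
    "per-character lookahead": re.compile(r"^((?!hede).)*$"),
    "single lookahead, lazy": re.compile(r"^(?!.*?hede).*$"),
    "single lookahead, greedy": re.compile(r"^(?!.*hede).*$"),
}

def verdicts(line: str) -> dict:
    """Map each pattern name to whether it matches the line."""
    return {name: p.search(line) is not None for name, p in PATTERNS.items()}

for line in ["hoho", "hihi", "haha", "hede", "", "xxhede", "hedexx", "hhede"]:
    print(repr(line), verdicts(line))
```

The choice between lazy and greedy therefore affects only how much work the engine does, not which lines are matched.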

So, to simply check that a given string contains neither str1 nor str2: ^(?!.*(str1|str2)).*$
– S.Serpooshan Mar 1 '17 at 7:20

Yes, or you can use lazy quantifier: ^(?!.*?(?:str1|str2)).*$, depending on your data. Added the ?: since we don't need to capture it.
– amobiz Mar 2 '17 at 9:59

This is by far the best answer by a factor of 10xms. If you added your jsfiddle code and results onto the answer people might notice it. I wonder why the lazy version is faster than the greedy version when there is no hede. Shouldn't they take the same amount of time?
– user5389726598465 Jul 23 '17 at 9:06

Yes, they take the same amount of time since they both test the whole text.
– amobiz Aug 3 '17 at 3:50

Since .NET doesn't support action Verbs (*FAIL, etc.) I couldn't test the solutions P1 and P2.

Summary:

I tried to test most of the proposed solutions. Some optimizations are possible for certain words.
For example, if the first two letters of the search string are not the same, answer 03 can be expanded to
^(?>[^R]+|R+(?!egex Hero))*$, resulting in a small performance gain.

But the most readable and overall fastest solution seems to be 05, which uses a conditional statement,
or 04, with the possessive quantifier. I think the Perl solutions should be even faster and more easily readable.

You should time ^(?!.*hede) too. /// Also, it's probably better to rank the expressions for the matching corpus and the non-matching corpus separately, because it's usually the case that most lines match or most lines don't.
– ikegami Aug 23 '16 at 0:07

Good point; I'm surprised nobody mentioned this approach before. However, that particular regex is prone to catastrophic backtracking when applied to text that doesn't match. Here's how I would do it: /^[^h]*(?:h+(?!ede)[^h]*)*$/
– Alan Moore Apr 14 '13 at 5:26
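
The unrolled pattern from the comment above can be sanity-checked against a plain substring test. This is a sketch in Python (an assumed language) for single-line inputs only; `UNROLLED` and `lacks_hede` are made-up names.

```python
import re

# The pattern from the comment above: runs of non-h text, separated by
# runs of h whose last h is not followed by "ede".
UNROLLED = re.compile(r"^[^h]*(?:h+(?!ede)[^h]*)*$")

def lacks_hede(line: str) -> bool:
    """Return True if the unrolled pattern accepts the line."""
    return UNROLLED.search(line) is not None

samples = ["hoho", "hihi", "haha", "hede", "", "h", "hhede",
           "hehede", "hedhede", "hehe", "xxhedexx"]
for line in samples:
    print(repr(line), lacks_hede(line), "hede" not in line)
```

The partial-prefix cases such as "hhede" and "hedhede" are rejected, because whichever h begins "hede" is always the last h of its run and so always hits the lookahead.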

...or you can just make all the quantifiers possessive. ;)
– Alan Moore Apr 15 '13 at 15:17

@Alan Moore - I'm surprised too. I saw your comment (and best regex in the pile) here only after posting this same pattern in an answer below.
– ridgerunner Dec 20 '13 at 3:08

@ridgerunner, doesn't have to be the best tho. I've seen benchmarks where the top answer performs better. (I was surprised about that tho.)
– Qtax Feb 20 '14 at 13:10

There is no "lookcurrent" in perl regexp's. This is truly a negative lookahead (prefix (?!). Positive lookahead's prefix would be (?= while the corresponding lookbehind prefixes would be (?<! and (?<= respectively. A lookahead means that you read the next characters (hence “ahead”) without consuming them. A lookbehind means that you check characters that have already been consumed.
– Didier L May 21 '12 at 16:35

So the line which contains the string hede would be matched. Once the regex engine sees the following (*SKIP)(*F) verb (note: you could write (*F) as (*FAIL)), it skips and makes the match fail. The |, called the alternation or logical OR operator, is added next to the PCRE verb, which in turn matches all the boundaries that exist between each and every character on all the lines except the line that contains the exact string hede. See the demo here. That is, it tries to match the characters from the remaining string. Now the regex in the second part would be executed.

PART 2

^.*$

Explanation:

^ Asserts that we are at the start. ie, it matches all the line starts except the one in the hede line. See the demo here.

.* In the Multiline mode, . would match any character except newline or carriage return characters. And * would repeat the previous character zero or more times. So .* would match the whole line. See the demo here.

Hey, why did you add .* instead of .+ ?

Because .* would match a blank line but .+ won't. We want to match all the lines except hede, and there may be blank lines in the input, so you must use .* instead of .+. The pattern .+ would repeat the previous character one or more times. See .* matches a blank line here.

Since no one else has given a direct answer to the question that was asked, I'll do it.

The answer is that with POSIX grep, it's impossible to literally satisfy this request:

grep "Regex for doesn't contain hede" Input

The reason is that POSIX grep is only required to work with Basic Regular Expressions, which are simply not powerful enough for accomplishing that task (they are not capable of parsing regular languages, because of lack of alternation and grouping).

However, GNU grep implements extensions that allow it. In particular, \| is the alternation operator in GNU's implementation of BREs, and \( and \) are the grouping operators. If your regular expression engine supports alternation, negative bracket expressions, grouping and the Kleene star, and is able to anchor to the beginning and end of the string, that's all you need for this approach.

For those interested in the details, the technique employed is to convert the regular expression that matches the word into a finite automaton, then invert the automaton by changing every acceptance state to non-acceptance and vice versa, and then converting the resulting FA back to a regular expression.
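
The first two steps of that technique (build the automaton, flip the acceptance states) can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the generator the answer describes: it runs the complemented automaton directly and does not perform the final conversion back to a regular expression. The function names are hypothetical.

```python
def build_contains_dfa(word: str):
    """Return the transition function of a DFA for "input contains word".

    A state is the length of the longest prefix of `word` that is a suffix
    of the input read so far; state len(word) is accepting and absorbing.
    """
    def step(state: int, ch: str) -> int:
        if state == len(word):
            return state  # once the word has been seen, stay accepting
        candidate = word[:state] + ch
        # Fall back to the longest suffix of candidate that is a prefix of word.
        for k in range(len(candidate), -1, -1):
            if candidate.endswith(word[:k]):
                return k
        return 0
    return step

def accepts_complement(word: str, text: str) -> bool:
    """Run the DFA, then invert acceptance: True iff text lacks word."""
    step = build_contains_dfa(word)
    state = 0
    for ch in text:
        state = step(state, ch)
    return state != len(word)  # acceptance flipped

for line in ["hoho", "hihi", "haha", "hede", "hhede"]:
    print(line, accepts_complement("hede", line))
```

Because a DFA has exactly one run per input, swapping accepting and non-accepting states is enough to complement it; that is why the construction goes through an automaton rather than manipulating the regex text.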

Finally, as everyone has noted, if your regular expression engine supports negative lookahead, that simplifies the task a lot. For example, with GNU grep:

grep -P '^((?!hede).)*$' Input

Update: I have recently found Kendall Hopkins' excellent FormalTheory library, written in PHP, which provides a functionality similar to Grail. Using it, and a simplifier written by myself, I've been able to write an online generator of negative regular expressions given an input phrase (only alphanumeric and space characters currently supported): http://www.formauri.es/personal/pgimeno/misc/non-match-regex/

Basically, "match at the beginning of the line if and only if it does not have 'hede' in it" - so the requirement translated almost directly into regex.

Of course, it's possible to have multiple failure requirements:

^(?!.*(hede|hodo|hada))
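
A short check of the multi-word form, sketched in Python (an assumed language); the sample lines are made up.

```python
import re

# The pattern from the answer, unchanged: fail if any listed word appears.
NONE_OF = re.compile(r"^(?!.*(hede|hodo|hada))")

lines = ["hoho", "xhodox", "hada", "hihi", "hede"]
kept = [line for line in lines if NONE_OF.search(line)]
print(kept)
```

Since the pattern consists only of an anchor and a lookahead, a successful match is empty; it tells you whether the line qualifies, not what the line contains.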

Details: The ^ anchor ensures the regex engine doesn't retry the match at every location in the string, which would match every string.

The ^ anchor in the beginning is meant to represent the beginning of the line. The grep tool matches each line one at a time; in contexts where you're working with a multiline string, you can use the "m" flag:

It may be more maintainable to use two regexes in your code: one to do the first match, and then, if it matches, a second regex to check for the outlier cases you wish to block, for example ^.*(hede).*, with appropriate logic in your code.
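
A minimal sketch of that two-step logic in Python (an assumed language). The first pattern, `WANTED`, is purely hypothetical and stands in for whatever the code already matches; only the blocking pattern comes from the answer.

```python
import re

WANTED = re.compile(r"^h\w+$")         # hypothetical "first match" pattern
BLOCKED = re.compile(r"^.*(hede).*")   # outlier check from the answer

def keep(line: str) -> bool:
    """Accept a line only if it passes the first regex and not the second."""
    if not WANTED.search(line):
        return False
    return BLOCKED.search(line) is None

for line in ["hoho", "hede", "xyz"]:
    print(line, keep(line))
```

Keeping the exclusion in ordinary code means each regex stays simple, at the cost of a second pass over lines that pass the first check.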

OK, I admit this is not really an answer to the posted question, and it may also use slightly more processing than a single regex. But for developers who came here looking for a fast emergency fix for an outlier case, this solution should not be overlooked.

Regex negation is not particularly useful on its own but when you also have intersection, things get interesting, since you have a full set of boolean set operations: you can express "the set which matches this, except for things which match that".

How to use PCRE's backtracking control verbs to match a line not containing a word

Here's a method that I haven't seen used before:

/.*hede(*COMMIT)^|/

How it works

First, it tries to find "hede" somewhere in the line. If successful, at this point, (*COMMIT) tells the engine to, not only not backtrack in the event of a failure, but also not to attempt any further matching in that case. Then, we try to match something that cannot possibly match (in this case, ^).

If a line does not contain "hede" then the second alternative, an empty subpattern, successfully matches the subject string.

This method is no more efficient than a negative lookahead, but I figured I'd just throw it on here in case someone finds it nifty and finds a use for it for other, more interesting applications.

Maybe you'll find this on Google while trying to write a regex that is able to match segments of a line (as opposed to entire lines) which do not contain a substring. Took me a while to figure out, so I'll share:

With ConyEdit, you can use the command line cc.gl !/hede/ to get lines that do not contain the regex matching, or use the command line cc.dl /hede/ to delete lines that contain the regex matching. They have the same result.

^((?!hede).)*$ is an elegant solution, except since it consumes characters you won't be able to combine it with other criteria. For instance, say you wanted to check for the non-presence of "hede" and the presence of "haha." This solution would work because it won't consume characters:
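
The pattern this answer refers to is not shown in this copy. As an assumption about the kind of pattern meant, two lookaheads placed at the same position can express both criteria without consuming anything. A sketch in Python (an assumed language; `BOTH` is a made-up name and not necessarily the author's regex):

```python
import re

# Assumed illustration: reject "hede" and require "haha", both checked
# from the start of the line, with no characters consumed by either test.
BOTH = re.compile(r"^(?!.*hede)(?=.*haha)")

for line in ["haha hoho", "haha hede", "hoho"]:
    print(repr(line), BOTH.search(line) is not None)
```

Because neither lookahead moves the match position, further conditions or a consuming pattern can be appended after them.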
