Other than the fact that in 2.6 re.sub won't take a flags argument...
– new123456Jun 6 '11 at 3:27

41

I just ran into a case where using re.compile gave a 10-50x improvement. The moral is that if you have a lot of regexes (more than MAXCACHE = 100) and you use them a lot of times each (and separated by more than MAXCACHE regexes in between, so that each one gets flushed from the cache: so using the same one a lot of times and then moving on to the next one doesn't count), then it would definitely help to compile them. Otherwise, it doesn't make a difference.
– ShreevatsaRDec 30 '13 at 14:21

6

One small thing to note is that for strings that do not need a regex, the in substring test is MUCH faster:

>python -m timeit -s "import re" "re.match('hello', 'hello world')"
1000000 loops, best of 3: 1.41 usec per loop
>python -m timeit "x = 'hello' in 'hello world'"
10000000 loops, best of 3: 0.0513 usec per loop
– GamrixSep 1 '15 at 23:06

3

NOTE: Don't use "in". @Gamrix Using "in" for the check is bad because it matches exact character sequences instead of space-separated words. E.g. 'wo' in 'hello world' will return True, and so will 'world' in 'hello world'. Better to use a regex.
– MANUAug 21 '17 at 11:03

8

@MANU, dude, really? Did you even look at his regex? re.match('hello', 'hello world') is exactly equivalent to "in". This behavior is not bad on the whole. It's only bad for your specific use case, which is far from universal.
– arjunygAug 21 '17 at 22:10

22 Answers
22

I've had a lot of experience running a compiled regex 1000s of times versus compiling on-the-fly, and have not noticed any perceivable difference. Obviously, this is anecdotal, and certainly not a great argument against compiling, but I've found the difference to be negligible.

EDIT:
After a quick glance at the actual Python 2.5 library code, I see that Python internally compiles AND CACHES regexes whenever you use them anyway (including calls to re.match()), so you're really only changing WHEN the regex gets compiled, and shouldn't be saving much time at all - only the time it takes to check the cache (a key lookup on an internal dict type).

Your conclusion is inconsistent with your answer. If regexes are compiled and stored automatically, there is no need in most cases to do it by hand.
– jfsJan 17 '09 at 0:21

74

J. F. Sebastian, it serves as a signal to the programmer that the regexp in question will be used a lot and is not meant to be a throwaway.
– kaleissinJan 20 '09 at 14:28

30

More than that, I'd say that if you don't want to suffer the compile & cache hit in some performance-critical part of your application, you're best off compiling them beforehand in a non-critical part of your application.
– Eddie ParkerJan 20 '09 at 18:10

19

I see the main advantage of using a compiled regex when you're reusing the same regex multiple times, thereby reducing the possibility of typos. If you're just calling it once, then uncompiled is more readable.
– monkutMar 19 '09 at 1:00

17

So, the main difference will be when you are using lots of different regex (more than _MAXCACHE), some of them just once and others lots of times... then it's important to keep your compiled expressions for those that are used more so they're not flushed out of the cache when it's full.
– fortranJul 6 '09 at 10:36

For me, the biggest benefit to re.compile isn't any kind of premature optimization (which is the root of all evil, anyway). It's being able to separate definition of the regex from its use.

Even a simple expression such as 0|[1-9][0-9]* (integer in base 10 without leading zeros) can be complex enough that you'd rather not have to retype it, check if you made any typos, and later have to recheck if there are typos when you start debugging. Plus, it's nicer to use a variable name such as num or num_b10 than 0|[1-9][0-9]*.

It's certainly possible to store strings and pass them to re.match; however, that's less readable:

num = "..."
# then, much later:
m = re.match(num, input)

Versus compiling:

num = re.compile("...")
# then, much later:
m = num.match(input)

Though it is fairly close, the last line of the second feels more natural and simpler when used repeatedly.

I agree with this answer; oftentimes using re.compile results in more, not less readable code.
– Carl MeyerFeb 1 '09 at 19:26

1

Sometimes the opposite is true, though - e.g. if you define the regex in one place and use its matching groups in another far-away place.
– Ken WilliamsJul 17 '17 at 15:51

@KenWilliams Not necessarily, a well named regex for a specific purpose should be clear even when used far from the original definition. For example us_phone_number or social_security_number etc.
– Brian M. SheldonOct 3 '18 at 13:53

@BrianM.Sheldon naming the regex well doesn't really help you know what its various capturing groups represent.
– Ken WilliamsOct 23 '18 at 3:58

so, if you're going to be using the same regex a lot, it may be worth it to do re.compile (especially for more complex regexes).

The standard arguments against premature optimization apply, but I don't think you really lose much clarity/straightforwardness by using re.compile if you suspect that your regexps may become a performance bottleneck.

Update:

Under Python 3.6 (I suspect the above timings were done using Python 2.x) and 2018 hardware (MacBook Pro), I now get the following timings:

I also added a case (notice the quotation mark differences between the last two runs) that shows that re.match(x, ...) is literally [roughly] equivalent to re.compile(x).match(...), i.e. no behind-the-scenes caching of the compiled representation seems to happen.
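Since the original timing output did not survive here, a minimal sketch of how such a comparison can be reproduced with timeit (absolute numbers will of course vary by Python version and hardware):

```python
import timeit

# Time re.match(pattern, ...) against a precompiled pattern's match().
# Both strings below are set up outside the timed statement, so only
# the per-call cost (cache lookup vs. direct method call) is measured.
t_uncompiled = timeit.timeit(
    "re.match('hello', 'hello world')",
    setup="import re",
    number=100_000,
)
t_compiled = timeit.timeit(
    "h.match('hello world')",
    setup="import re; h = re.compile('hello')",
    number=100_000,
)

print(f"re.match:       {t_uncompiled:.3f}s")
print(f"compiled match: {t_compiled:.3f}s")
```

The compiled version skips the module-level cache lookup, which is where any per-call difference comes from.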

Major problems with your methodology here, since the setup argument is NOT included in the timing. Thus, you've removed the compilation time from the second example, and just averaged it out in the first example. This doesn't mean the first example compiles every time.
– TriptychJan 16 '09 at 22:12

1

Yes, I agree that this is not a fair comparison of the two cases.
– KivJan 16 '09 at 22:15

6

I see what you mean, but isn't that exactly what would happen in an actual application where the regexp is used many times?
– dF.Jan 17 '09 at 0:05

23

@Triptych, @Kiv: The whole point of compiling regexps separate from use is to minimize compilation; removing it from the timing is exactly what dF should have done, because it represents real-world use most accurately. Compilation time is especially irrelevant with the way timeit.py does its timings here; it does several runs and only reports the shortest one, at which point the compiled regexp is cached. The extra cost you're seeing here is not the cost of compiling the regexp, but the cost of looking it up in the compiled regexp cache (a dictionary).
– jemfinchApr 14 '10 at 11:47

2

@Triptych Should the import re be moved out of setup? It's all about where you want to measure. If I run a python script numerous times, it would have the import re time hit. When comparing the two it is important to separate the two lines for timing. Yes as you say it is when you will have the time hit. The comparison shows that either you take the time hit once and repeat the lesser time hit by compiling or you take the hit each time assuming the cache gets cleared between calls, which as it has been pointed out could happen. Adding a timing of h=re.compile('hello') would help clarify.
– Tom MyddeltynAug 5 '16 at 19:17

it doesn't really matter, the point is to try the benchmark in the environment where you'll be running the code
– david kingOct 6 '14 at 17:27

1

For me the performance is almost exactly the same for 1000 loops or more. The compiled version is faster for 1-100 loops. (On both pythons 2.7 and 3.4).
– ZitraxNov 23 '15 at 12:34

2

On my Python 2.7.3 setup there is hardly any difference. Sometimes compile is faster, sometimes it's slower. The difference is always <5%, so I count the difference as measurement uncertainty, since the device only has one CPU.
– DakkaronDec 17 '15 at 11:52

1

In Python 3.4.3 seen in two separate runs: using compiled was even slower than not compiled.
– ZelphirJan 2 '16 at 23:33

I just tried this myself. For the simple case of parsing a number out of a string and summing it, using a compiled regular expression object is about twice as fast as using the re methods.

As others have pointed out, the re methods (including re.compile) look up the regular expression string in a cache of previously compiled expressions. Therefore, in the normal case, the extra cost of using the re methods is simply the cost of the cache lookup.

However, examination of the code shows the cache is limited to 100 expressions. This raises the question: how painful is it to overflow the cache? The code contains an internal interface to the regular expression compiler, re.sre_compile.compile. If we call it, we bypass the cache. It turns out to be about two orders of magnitude slower for a basic regular expression, such as r'\w+\s+([0-9_]+)\s+\w*'.

I agree with Honest Abe that the match(...) calls in the given examples are different. They are not one-to-one comparisons, and thus the outcomes vary. To simplify my reply, I use A, B, C, D for the functions in question. Oh yes, we are dealing with 4 functions in re.py instead of 3.

Running this piece of code:

h = re.compile('hello') # (A)
h.match('hello world') # (B)

is the same as running this code:

re.match('hello', 'hello world') # (C)

Because, when you look into the source re.py, (A + B) means:

h = re._compile('hello') # (D)
h.match('hello world')

and (C) is actually:

re._compile('hello').match('hello world')

So, (C) is not the same as (B). In fact, (C) calls (B) after calling (D), which is also called by (A). In other words, (C) = (A) + (B). Therefore, comparing (A + B) inside a loop gives the same result as (C) inside a loop.

George's regexTest.py proved this for us.

noncompiled took 4.555 seconds. # (C) in a loop
compiledInLoop took 4.620 seconds. # (A + B) in a loop
compiled took 2.323 seconds. # (A) once + (B) in a loop

Everyone's interest is how to get the result of 2.323 seconds. In order to make sure compile(...) only gets called once, we need to store the compiled regex object in memory. If we are using a class, we could store the object and reuse it every time our function gets called.

If we are not using a class (which is my situation today), then I have no comment. I'm still learning to use global variables in Python, and I know global variables are a bad thing.

One more point: I believe the (A) + (B) approach has the upper hand. Here are some facts as I observed them (please correct me if I'm wrong):

Calling A once will do one search in the _cache followed by one sre_compile.compile() to create a regex object. Calling A twice will do two searches and one compile (because the regex object is cached).

If the _cache gets flushed in between, then the regex object is released from memory and Python needs to compile again. (Someone suggested that Python won't recompile.)

If we keep the regex object by using (A), the regex object will still get into the _cache and may get flushed somehow. But our code keeps a reference to it, so the regex object will not be released from memory. Thus, Python does not need to compile again.

The 2-second difference in George's test between compiledInLoop and compiled is mainly the time required to build the key and search the _cache. It is not the compile time of the regex.

George's reallycompile test shows what happens if it really re-does the compile every time: it is 100x slower (he reduced the loop from 1,000,000 to 10,000).

Here are the only cases where (A + B) is better than (C):

If we can cache a reference to the regex object inside a class.

If we need to call (B) repeatedly (inside a loop or multiple times), we must cache the reference to the regex object outside the loop.

Cases where (C) is good enough:

We cannot cache a reference.

We only use it once in a while.

Overall, we don't have too many regexes (assuming the compiled ones never get flushed).
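To illustrate the first case above, a minimal sketch of keeping a reference inside a class so the compiled object survives any cache flush (the class and pattern names here are made up for the example):

```python
import re

class NumberParser:
    # (A) runs exactly once, when the class body is executed. The
    # class attribute holds a reference, so the compiled object is
    # never released even if re's internal cache is flushed.
    _num_re = re.compile(r'\w+\s+([0-9_]+)\s+\w*')

    def extract(self, text):
        # (B) reuses the stored object directly: no cache lookup,
        # no recompile.
        m = self._num_re.match(text)
        return m.group(1) if m else None

parser = NumberParser()
print(parser.extract('foo 123 bar'))  # -> '123'
```

The same effect can be had with a module-level constant; the point is only that some long-lived name holds the compiled object.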

In addition to the small speed benefit from using re.compile, people also like the readability that comes from naming potentially complex pattern specifications and separating them from the business logic where they are applied:

is the " in def search(pattern, string, flags=0):" a typo?
– phuclvJul 7 '17 at 6:05

1

Note that if pattern is already a compiled pattern, the caching overhead becomes significant : hashing a SRE_Pattern is expensive and the pattern is never written to cache, so the lookup fails each time with a KeyError.
– Eric DuminilNov 25 '17 at 9:20

The optional second parameter pos gives an index in the string where
the search is to start; it defaults to 0. This is not completely
equivalent to slicing the string; the '^' pattern character matches at
the real beginning of the string and at positions just after a
newline, but not necessarily at the index where the search is to
start.

endpos

The optional parameter endpos limits how far the string will be
searched; it will be as if the string is endpos characters long, so
only the characters from pos to endpos - 1 will be searched for a
match. If endpos is less than pos, no match will be found; otherwise,
if rx is a compiled regular expression object, rx.search(string, 0,
50) is equivalent to rx.search(string[:50], 0).

The regex object's search, findall, and finditer methods also support these parameters.
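A small sketch illustrating both points from the quoted documentation: pos does not behave like slicing where '^' is concerned, and endpos acts as if the string were truncated.

```python
import re

s = 'hello world'

# pos is not the same as slicing: '^' still anchors to the real
# start of the string, not to the index where the search begins.
anchored = re.compile(r'^world')
assert anchored.search(s, 6) is None      # pos=6: '^' cannot match there
assert anchored.search(s[6:]) is not None  # slicing moves the real start

# endpos behaves like truncating the string for search purposes:
# rx.search(s, 0, 8) searches only 'hello wo'.
rx = re.compile(r'world')
assert rx.search(s, 0, 8) is None
assert rx.search(s, 0, 11).group() == 'world'

print('pos/endpos behave as documented')
```

Note that the module-level re.search has no pos/endpos parameters, so this is one small capability you only get from a compiled pattern object.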

Although this does not affect the speed of running your code, I like to do it this way as it is part of my commenting habit. I thoroughly dislike spending time trying to remember the logic behind my code 2 months down the line when I want to make modifications.

I've edited your answer. I think mentioning re.VERBOSE is worthwhile, and it does add something that the other answers seem to have left out. However, leading your answer with "I'm posting here because I can't comment yet" is sure to get it deleted. Please don't use the answers box for anything other than answers. You're only one or two good answers away from being able to comment anywhere (50 rep), so please just be patient. Putting comments in answer boxes when you know you shouldn't won't get you there any faster. It will get you downvotes and deleted answers.
– skrrgwasmeMar 20 '15 at 3:46

1

thank you for your advice =) i will remember that
– cyneoMar 20 '15 at 4:43

Same issue as with dF's performance comparison. It's not really fair unless you include the performance cost of the compile statement itself.
– Carl MeyerFeb 1 '09 at 19:27

6

Carl, I disagree. The compile is only executed once, while the matching loop is executed a million times
– Eli BenderskyFeb 1 '09 at 20:19

@eliben: I agree with Carl Meyer. The compilation takes place in both cases. Triptych mentions that caching is involved, so in an optimal case (re stays in cache) both approaches are O(n+1), although the +1 part is kind of hidden when you don't use re.compile explicitly.
– paprikaFeb 19 '09 at 4:02

1

Don't write your own benchmarking code. Learn to use timeit.py, which is included in the standard distribution.
– jemfinchApr 14 '10 at 11:48

How much of that time is spent recreating the pattern string in the for loop? This overhead can't be trivial.
– IceArdorApr 24 '14 at 8:16

This answer might be arriving late but is an interesting find. Using compile can really save you time if you are planning on using the regex multiple times (this is also mentioned in the docs). Below you can see that using a compiled regex is the fastest when the match method is called directly on it. Passing a compiled regex to re.match makes it even slower, and passing re.match the pattern string is somewhere in the middle.

This is a good question. You often see people use re.compile without reason. It lessens readability. But there are certainly plenty of times when pre-compiling the expression is called for, like when you use it repeatedly in a loop or some such.

It's like everything about programming (everything in life actually). Apply common sense.

I've had a lot of experience running a compiled regex 1000s
of times versus compiling on-the-fly, and have not noticed
any perceivable difference

The votes on the accepted answer lead to the assumption that what @Triptych says is true for all cases. This is not necessarily true. One big difference is when you have to decide whether to accept a regex string or a compiled regex object as a parameter to a function:
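One way to handle that decision, sketched with a hypothetical helper name: re.compile passes an already-compiled pattern through unchanged, so a function can accept either form. (Be aware that, as a comment above notes, routing a compiled pattern back through the module-level cache machinery can carry its own lookup cost on some Python versions.)

```python
import re

def find_all_words(pattern, text):
    # re.compile is a no-op pass-through if `pattern` is already a
    # compiled pattern object, so callers may pass either a string
    # or a re.Pattern.
    rx = re.compile(pattern)
    return rx.findall(text)

print(find_all_words(r'\w+', 'a b c'))              # string pattern
print(find_all_words(re.compile(r'\w+'), 'a b c'))  # compiled pattern
```

Both calls return ['a', 'b', 'c']; the caller gets to choose whether to precompile.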

Regular expressions are compiled before being used in the second version. If you are going to execute it many times, it is definitely better to compile it first. If not, compiling every time you match is fine for one-offs.

this is about as simple in terms of functionality as it can get. because this example is so short, i conflated the way to get _text_has_foobar_re_search all in one line. the disadvantage of this code is that it occupies a little memory for whatever the lifetime of the TYPO library object is; the advantage is that when doing a foobar search, you'll get away with two function calls and two class dictionary lookups. how many regexes are cached by re and the overhead of that cache are irrelevant here.

I readily admit that my style is highly unusual for python, maybe even debatable. however, in the example that more closely matches how python is mostly used, in order to do a single match, we must instantiate an object, do three instance dictionary lookups, and perform three function calls; additionally, we might get into re caching troubles when using more than 100 regexes. also, the regular expression gets hidden inside the method body, which most of the time is not such a good idea.

be it said that every subset of measures---targeted, aliased import statements; aliased methods where applicable; reduction of function calls and object dictionary lookups---can help reduce computational and conceptual complexity.

WTF. Not only you dugg out an old, answered question. Your code is non-idiomatic as well and wrong on so many levels - (ab)using classes as namespaces where a module is enough, capitalizing class names, etc... See pastebin.com/iTAXAWen for better implementations. Not to mention the regex you use is broken, too. Overall, -1
– user395760Nov 6 '10 at 20:34

2

guilty. this is an old question, but i don't mind being #100 in a slowed-down conversation. the question has not been closed. i did warn my code could be adversary to some tastes. i think if you could view it as a mere demonstration of what is doable in python, like: if we take everything , everything we believe, as optional, and then tinker together in what any way, what do the things look like that we can get? i am sure you can discern merits and dismerits of this solution and can complain more articulatedly. otherwise i must conclude your claim of wrongness relies on little more than PEP008
– flowNov 6 '10 at 22:14

2

No, it's not about PEP8. That's just naming conventions, and I'd never downvote for not following those. I downvoted you because the code you showed is simply poorly written. It defies conventions and idioms for no reason, and is an incarnation of permature optimization: You'd have to optimize the living daylight out of all other code for this to become a bottleneck, and even then the third rewrite I offered is shorter, more idiomatic and just as fast by your reasoning (same number of attribute access).
– user395760Nov 7 '10 at 16:09

"poorly written"--like why exactly? "defies conventions and idioms"--i warned you. "for no reason"--yes i do have a reason: simplify where complexity serves no purpose; "incarnation of premature optimization"--i'm very much for a programming style that chooses a balance of readability and efficiency; OP asked for elicitation of "benefit in using re.compile", which i understand as a question about efficiency. "(ab)using classes as namespaces"--it is your words that are abusive. class is there so you have a "self" point-of-reference. i tried using modules for this purpose, classes work better.
– flowNov 9 '10 at 13:36

"capitalizing class names", "No, it's not about PEP8"--you're apparently so outrageously angry you can't even tell about what to bicker first. "WTF", "wrong"---see how emotional you are? more objectivity and less froth please.
– flowNov 9 '10 at 14:19

My understanding is that those two examples are effectively equivalent. The only difference is that in the first, you can reuse the compiled regular expression elsewhere without causing it to be compiled again.

Calling the compiled pattern object's search function with the string 'M' accomplishes the same thing as calling re.search with both the regular expression and the string 'M'. Only much, much faster. (In fact, the re.search function simply compiles the regular expression and calls the resulting pattern object's search method for you.)
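A quick sketch of that equivalence, showing that both routes produce the same match:

```python
import re

s = 'the quick brown fox'
pat = r'q\w+'

# re.search compiles the pattern (or fetches it from the internal
# cache) and then delegates to the pattern object's search method,
# so the two calls below are behaviorally identical.
m1 = re.search(pat, s)
m2 = re.compile(pat).search(s)

assert m1.group() == m2.group() == 'quick'
print(m1.group())  # -> quick
```

The only practical difference is where the compile step happens and whether you hold a reference that lets you skip the cache lookup on later calls.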