Speak and Shout

Tuesday, December 18, 2007

Something I hate about Python's re module

UPDATE: I've gotten a couple of comments about this, and both seem to illustrate (with varying degrees of tact) that my complaint wasn't all that clear. My problem with the re module is this: Python's documentation for the re.match function is match(pattern, string, [flags]), where pattern can be either a regex string or a compiled regex object. If it's a compiled regex object, then supplying an optional flag to re.match (in my case, re.IGNORECASE) doesn't work and, more to the point, fails silently. I think the call should raise an exception if it's not going to work. Better still, I think it should work the way I've illustrated because, IMO, that's the most natural way to use the API.

----
I've been burned at least three separate times by the following problem: I'll start out with a simple uncompiled regex for testing and then switch over to a compiled regex. Suddenly, the whole thing stops working.

Here's an example of what I'm doing. (Taken from O'Reilly's Regular Expression Pocket Reference by Tony Stubblebine.)

import re
dailybugle = r'Spider-Man Menaces City!'
pattern = r'spider[- ]?man.'
if re.match(pattern, dailybugle, re.IGNORECASE):
    print dailybugle

This prints out 'Spider-Man Menaces City!' as expected. Now I want to compile the regular expression for speed, so I change the code to look like this:

import re
dailybugle = r'Spider-Man Menaces City!'
pattern = re.compile(r'spider[- ]?man.')
if re.match(pattern, dailybugle, re.IGNORECASE):
    print dailybugle


Looks simple, right? I just surrounded the pattern string with a call to re.compile(). Unfortunately, the whole thing now quietly fails: nothing is printed, and no error is raised. What the ... ? Take the re.compile out, and it starts working again.



The solution is to move the re.IGNORECASE flag into the re.compile call, like so:

import re
dailybugle = r'Spider-Man Menaces City!'
pattern = re.compile(r'spider[- ]?man.', re.IGNORECASE)
if re.match(pattern, dailybugle):
    print dailybugle


In my opinion, this solution is very unintuitive and requires more rejiggering of the code than it should. Even worse, in my first attempt to use a compiled regex, re.match accepted a re.IGNORECASE flag and then disregarded it without any warning. That type of call should throw an exception.
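Here's a minimal sketch of the behavior I'd prefer. It assumes a Python 3 interpreter (the examples above use the Python 2 print statement), where, to the best of my knowledge, re.match raises ValueError when it is handed both a compiled pattern and a non-zero flags argument. The helper name match_with_flag is made up for illustration; it is not part of the re module.

```python
import re

dailybugle = 'Spider-Man Menaces City!'
pattern = re.compile(r'spider[- ]?man.')  # compiled WITHOUT re.IGNORECASE


def match_with_flag(compiled, text):
    """Describe what re.match does when given a compiled pattern plus a flag.

    Hypothetical helper, written only to make the outcome visible.
    """
    try:
        result = re.match(compiled, text, re.IGNORECASE)
    except ValueError as exc:
        # The interpreter refuses to combine a compiled pattern with flags.
        return 'raised ValueError: %s' % exc
    # Reaching this point means the flag was accepted without complaint.
    return 'matched' if result else 'no match (flag silently ignored)'


outcome = match_with_flag(pattern, dailybugle)
print(outcome)
```

On an interpreter that raises, the mistake surfaces the first time the line runs instead of showing up later as a match that mysteriously never happens.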

Anyone know a reason for this bad (and seemingly buggy) behavior?


11 Comments:

  • pattern = re.compile(r'spider[- ]?man.', re.IGNORECASE)
    if pattern.match(dailybugle):
        print dailybugle

    By Blogger Justin, At 11:21 AM  

  • Your url to this entry ( http://mysite.verizon.net/bcorfman/2007/12/something-i-hate-about-pythons-re.html ) is broken - well, it's a blank page with nothing on it.

    Tim

    By Anonymous Anonymous, At 11:29 AM  

  • Twist and Shout: Something I hate about bloggers.

    Why not actually *read* the documentation of the re module before spewing your own programming mistakes onto the Internet.
    http://docs.python.org/lib/module-re.html

    By Anonymous Anonymous, At 12:08 PM  

  • As the Anonymous Jackass says, it does seem that it's just a confusion for re between start/endpos and your flag:

    reObj.match(string, start, end)

    re.IGNORECASE is equal to int(2), so

    reObj.match(string, re.IGNORECASE)

    definitely does the wrong thing ;)

    By Blogger Titus Brown, At 1:20 PM  

  • While I agree that you've found a bug (a quick check of sre.py confirms that 'flags' is ignored if you pass a compiled object to re.match/search), I think it's sort of bad form to pass the object to the module-level shortcuts.

    If you compile an object, then you should just call search/match directly on the object. And since these don't accept 'flags', you'd quickly realize that you need to pass it to compile.

    By Blogger David Avraamides, At 4:41 PM  

  • @David, I like the single step to go from uncompiled regex to a compiled one. I think it's more TOOWTDI. It's certainly much smoother than moving the flag and changing the module name to a variable name.

    By Blogger Brandon Corfman, At 5:14 PM  

  • I don't know how the Python re module works internally, so maybe I'm wrong, but I'm guessing that you will get a different compiled regex if you alternate between ignoring and not ignoring case. If you compile the regex without ignore case, then call re.match with ignore case, you would end up compiling the regex twice. You would have to recompile every time re.match was called, negating any benefit from compiling your regex to begin with.

    By Anonymous Anonymous, At 9:47 PM  

  • Yeah it's a bug.

    But you really should get in the habit of using compiled regexes from the get-go. Why wouldn't you, other than habit?

    Also get into the habit of using objects and methods instead of functions. It's the wave of the future! Circa the 1990s :)

    By Blogger njharman, At 3:55 AM  

  • @Anon, you're probably right about the flags needing to be part of the compiled regex. In the case where I forget though, it'd be good to have the bug fixed so I can't pass a flag to re.match by accident. (Although after this post, I hope I remember what to do now! :)

    By Blogger Brandon Corfman, At 8:06 AM  

  • Based on a comment by effbot on programming.reddit.com, I went ahead and submitted this problem as a bug on Python's issue tracker.

    By Blogger Brandon Corfman, At 10:47 AM  

  • The bug was just fixed within hours by Raymond Hettinger and checked into the Python SVN repository. Wow!

    By Blogger Brandon Corfman, At 1:20 PM  
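Two points from the comments above can be checked with a short sketch, assuming a Python 3 interpreter. First, as Titus Brown describes, the second positional argument of a compiled pattern's match method is the start position, so a flag passed there is read as the integer 2. Second, as the anonymous commenter guessed, the flags are stored in the compiled object itself. The variable names are made up for illustration.

```python
import re

dailybugle = 'Spider-Man Menaces City!'
regex_text = r'spider[- ]?man.'

case_blind = re.compile(regex_text, re.IGNORECASE)
case_exact = re.compile(regex_text)

# Point 1: on a compiled pattern, the second positional argument is pos.
plain = case_blind.match(dailybugle)                  # starts matching at index 0
as_pos = case_blind.match(dailybugle, re.IGNORECASE)  # flag is read as pos=2
explicit = case_blind.match(dailybugle, 2)            # the same call, spelled out

print(int(re.IGNORECASE))
print(plain is not None, as_pos is None, explicit is None)

# Point 2: the flag is recorded in the compiled object, not supplied per call.
has_flag = bool(case_blind.flags & re.IGNORECASE)
lacks_flag = not (case_exact.flags & re.IGNORECASE)
print(has_flag, lacks_flag)
```

Starting at index 2 skips the "Sp" of "Spider-Man", so the pattern no longer lines up with the text and the match fails, which looks exactly like a case-sensitivity problem even though the pattern was compiled with re.IGNORECASE.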
