Wednesday, November 16, 2011

Regular expressions: backtracking can kill your performance

Or why you should learn atomic grouping...

After last post on turning off useless backtracking by using atomic grouping, I kept on reading the Regular Expressions Cookbook and ran another experiment to validate the performance difference. Let's start with the results:

Normal Regex - # of loops: 1000, # of matches: 0, time (ms): 66149
Atomic Grouping - # of loops: 1000, # of matches: 0, time (ms): 11196

Note that a normal regex was 6 times slower than atomic grouping to fail to match.

The example from the book is a regex to match a well-formed html page. I saved a Wikipedia page, made a few adjustments to the program used in the last post (e.g. to read the html contents from file), and ran the tests.

The regular expression used is:
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Every test I ran with this regex and with the version without atomic grouping, I can see the normal regex being 6 times slower. If you want to know more about using atomic grouping or a regular regex, please read my last post.

And this difference was found without setting the regex to be compiled. After setting this flag, these are the values I get:

Normal Regex - # of loops: 1000, # of matches: 0, time (ms): 49319
Atomic Grouping - # of loops: 1000, # of matches: 0, time (ms): 9471

Still a pretty significant change between normal regex and atomic grouping.
Post a Comment