Tuesday, November 15, 2011

Regular expressions: turning off useless backtracking

Last time I mentioned the nice feature on how to change quantifier to be greedy or lazy, and then help you match what you really want. This time, it is how to make your regular expression more efficient. First, we need to remember last post where greedy or lazy quantifier change how backtracking works. Sometimes backtracking just doesn’t make sense. Look at this example:
\b\d+\b
It is supposed to match integers at word boundaries (\b means boundaries). At first, it may be a bit hard to understand, but backtracking is unnecessary here. For instance, try to run this regular expression on an example: .
789abcdef654 123
At the point when the regular expression fails (when it checks that “a” is a word boundary), it doesn’t make sense to start backtracking to see if 9 is a word boundary (or 8, for that matter). We should just go ahead and move on to the next token.

This is where it pays off to know well the tool you’re using. Different flavors of regular expressions offer ways to avoid keep backtracking positions, so when a match fails, it just moves on. In case of Java and .NET, both support atomic grouping. .
\b(?>\d+)\b
Atomic grouping here is represented by “(?>)”. When the regular expression engine leaves the group, all backtracking positions are lost, so a failed match will not have any recourse other than moving on to the next characters to find further matches. It will give up on the current context.

In the example above, when it matches 789, it will leave the atomic group, so when \b fails to be matched there are no backtracking positions to be tried. From this quick analysis, we see that we avoid a lot of extra computation by avoid these useless backtracking.

The question about saving time is how much we actually save here. I wrote a test code to benchmark these different options and verify whether we are talking about any substantial savings or not.
static void Main(string[] args)
{
TestAtomicGrouping(1000);
TestAtomicGrouping(10000);
TestAtomicGrouping(100000);
}

static void TestAtomicGrouping(int numLoops)
{
Regex regex1 = new Regex(@"\b\d+\b", RegexOptions.Compiled);
Regex regex2 = new Regex(@"\b(?>\d+)\b", RegexOptions.Compiled);

StringBuilder sb = new StringBuilder();
for (int i = 0; i < 100; i++)
sb.Append('1');
sb.Append('a');
sb.Append(' ');
for (int i = 0; i < 100; i++)
sb.Append('1');

string testString = sb.ToString();
int firstMatchCount = 0;
DateTime start = DateTime.Now;
for (int i = 0; i < numLoops; i++)
{
if (regex1.IsMatch(testString))
firstMatchCount++;
}
TimeSpan firstTest = DateTime.Now - start;

start = DateTime.Now;
int secondMatchCount = 0;
for (int i = 0; i < numLoops; i++)
{
if (regex2.IsMatch(testString))
secondMatchCount++;
}
TimeSpan secondTest = DateTime.Now - start;

Console.WriteLine("Normal Regex - # of loops: {0}, # of matches: {1}, time (ms): {2}",
numLoops, firstMatchCount, firstTest.TotalMilliseconds);
Console.WriteLine("Atomic Grouping - # of loops: {0}, # of matches: {1}, time (ms): {2}",
numLoops, secondMatchCount, secondTest.TotalMilliseconds);
}
Now we need to see results:

Normal Regex - # of loops: 1000, # of matches: 1000, time (ms): 48.0028
Atomic Grouping - # of loops: 1000, # of matches: 1000, time (ms): 30.0017

Normal Regex - # of loops: 10000, # of matches: 10000, time (ms): 386.0221
Atomic Grouping - # of loops: 10000, # of matches: 10000, time (ms): 239.0137

Normal Regex - # of loops: 100000, # of matches: 100000, time (ms): 3145.1799
Atomic Grouping - # of loops: 100000, # of matches: 100000, time (ms): 2079.1189

So, at the end of the day, getting rid of backtracking can be quite significant if you’re matching this regular expression quite often. In this test, we could save 32% or more of the matching time just by “turning off” backtracking with atomic grouping.
Post a Comment