Monday, November 14, 2011

Regular expressions: greedy vs. lazy quantifier

I am reading the great book Regular Expressions Cookbook after seeing today a few things that I did not about regular expressions. I will get to the interesting regular expression I had to work on in a future post, but for now I will share something I found quite interesting: greedy and lazy quantifiers.

Let's start by trying to match a paragraph in HTML. A paragraph is typically surrounded by <p> and </p>. So, I would write regular expression as:

<p>.*</p>
This should take care of matching the paragraph, right? Partially right. If you have a long HTML, with multiple paragraphs, this will match from the first paragraph start (<p>) to the very last paragraph end (</p>). This "*" is actually called a greedy quantifier.

If you want to make it behave differently, you will want to use what is called lazy quantifier. This is just the regular question mark placed after another quantifier. Note that, if question mark is placed after a regex token, it means "zero or once". This is not what we are talking about here - question mark here after a quantifier means that it changes the quantifier behavior.

<p>.*?</p>
In the example above, it matches the first paragraph only, not the entire text.

Under the covers, the regular expression engine uses backtracking to match the expression. For a greedy quantifier, it eats up all the content that matches the current regular expression and then moves to the next token. In the case of the paragraph matching example, it reads the entire text until the very end. Then it moves on to the next token (in this case <) - and since it fails as the document finished, it back tracks, and tries to match < again. It keeps going back each character until it matches.

For the lazy quantifier, it repeats as few times as it can, moving to the next regex token (here <). If the token is not matched, then it back tracks and moves forward another time, seeing if the token is matched.

I was happy to learn this, as I had always asked myself how to control this behavior. And quite interesting to understand the regular expression engine behavior, what can come in handy. Just be careful when using the question mark. As said above, it can serve two purposes depending on where it's placed.
Post a Comment