QUANTIFIERS MASTERED
Last time we touched on greedy vs non-greedy. You learned that * is greedy and *? is non-greedy. But we didn't really explain WHY. Tonight we open the hood and look at the engine.
How the Engine Actually Works
A regex engine reads your pattern left to right and tries to match it against the string. When it hits a quantifier, it has to make a choice: how much text should this thing consume?
A greedy quantifier says "give me as much as possible." It lunges forward and grabs everything it can. Then, if the rest of the pattern fails to match, it gives back one character at a time (this is called backtracking) until the overall pattern succeeds or there is nothing left to give back.
A non-greedy quantifier says "give me as little as possible." It starts by consuming as little as it is allowed to (nothing at all, for *?), tries to match the rest of the pattern, and if that fails, it takes one more character and tries again.
Same destination, opposite directions:
+--------------------------------------------------+
| GREEDY (.*)                                      |
| Start: grab EVERYTHING >>>>>>>>>>>>>>>>>>>>      |
| Fail? Give back one <                            |
| Fail? Give back one <                            |
| Match! Stop here.                                |
+--------------------------------------------------+
+--------------------------------------------------+
| NON-GREEDY (.*?)                                 |
| Start: grab NOTHING                              |
| Fail? Take one more >                            |
| Fail? Take one more >                            |
| Match! Stop here.                                |
+--------------------------------------------------+
Both find a valid match. But they often find DIFFERENT valid matches because they approach from opposite ends.
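To watch the two directions on the smallest possible input, here is a quick sketch. The sample string aXbXb is made up for illustration: one "a" followed by two places where the closing "b" could land. It assumes a grep built with PCRE support (the -P flag).

```shell
# Hypothetical sample: one "a", then two candidate closing "b"s.
sample='aXbXb'

# Greedy: .* runs to the end of the line, then backs off until a "b" fits.
greedy=$(printf '%s\n' "$sample" | grep -oP 'a.*b')

# Non-greedy: .*? starts empty and grows until the first "b" fits.
lazy=$(printf '%s\n' "$sample" | grep -oP 'a.*?b')

echo "greedy: $greedy"   # greedy: aXbXb
echo "lazy: $lazy"       # lazy: aXb
```

Same pattern shape, same input, two different answers. The greedy version ends on the last "b" it can reach, the non-greedy version on the first.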
The Classic HTML Example
Say you have this string:
<b>bold</b> and <i>italic</i>
Greedy pattern <.*> grabs from the first < to the LAST >:
<b>bold</b> and <i>italic</i>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
One giant match. Almost never what you want.
Non-greedy pattern <.*?> starts small and stops at the NEXT >:
<b>bold</b> and <i>italic</i>
^^^                            Match 1: <b>
       ^^^^                    Match 2: </b>
                ^^^            Match 3: <i>
                         ^^^^  Match 4: </i>
Four separate matches. Try it yourself:
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*>'
echo '<b>bold</b> and <i>italic</i>' | grep -oP '<.*?>'
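The -o flag shows only the matched portions so you can see the difference clearly. The -P flag turns on Perl-compatible (PCRE) syntax, which grep needs for the *? quantifier; GNU grep supports it when built with PCRE.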
Quoted Strings: Same Problem
Given this line:
name="Alice" city="Portland" role="admin"
Greedy ".*" grabs from the first quote to the LAST quote:
"Alice" city="Portland" role="admin"
One match. Useless.
Non-greedy ".*?" gives you each quoted value:
echo 'name="Alice" city="Portland" role="admin"' | grep -oP '".*?"'
Output:
"Alice"
"Portland"
"admin"
Log Line Gotcha
Parsing bracketed fields in log entries:
[2026-02-19 10:30:00] [INFO] Server started on port 8080
Greedy \[.*\] swallows both bracketed sections into one match. Non-greedy \[.*?\] gives you two clean, separate matches. This pattern comes up constantly in real log parsing.
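Here is that log line run through both patterns, as a sketch you can paste into a shell. It assumes the same grep -oP setup as the earlier examples; the variable names are just for illustration.

```shell
line='[2026-02-19 10:30:00] [INFO] Server started on port 8080'

# Greedy: from the first [ to the LAST ] on the line.
greedy=$(printf '%s\n' "$line" | grep -oP '\[.*\]')

# Non-greedy: each [ pairs with the NEXT ] after it.
lazy=$(printf '%s\n' "$line" | grep -oP '\[.*?\]')

echo "$greedy"
# [2026-02-19 10:30:00] [INFO]

echo "$lazy"
# [2026-02-19 10:30:00]
# [INFO]
```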
Non-Greedy Bounded Quantifiers
You know {n,m} for bounded repetition. Make it non-greedy by adding a question mark: {n,m}?
\d{2,4} on "12345" -> grabs "1234" (greedy, takes max)
\d{2,4}? on "12345" -> grabs "12" (non-greedy, takes min)
Not used every day, but when you need it, you need it.
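You can check the bounded case the same way. One thing to expect in this sketch: grep -o prints every match on the line, not just the first, so the non-greedy run shows a second match after "12". Again this assumes a grep with -P support.

```shell
# Greedy: takes up to four digits in one go. The leftover "5" is too
# short to satisfy {2,4}, so there is no second match.
greedy=$(echo '12345' | grep -oP '\d{2,4}')

# Non-greedy: stops at the minimum of two digits, then grep -o resumes
# after the match and finds "34". The final "5" is again too short.
lazy=$(echo '12345' | grep -oP '\d{2,4}?')

echo "$greedy"
# 1234

echo "$lazy"
# 12
# 34
```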
Possessive Quantifiers: The Third Option
Now we get to the part most people never learn. PCRE has a third type of quantifier: possessive. Add a plus sign after the quantifier:
*+ possessive star
++ possessive plus
?+ possessive question mark
{n,m}+ possessive bounded
A possessive quantifier is like greedy, but it NEVER backtracks. Once it grabs text, it refuses to give any back. If the rest of the pattern can't match after that, the match attempt at that position fails on the spot instead of retrying with a shorter grab.
+--------------------------------------------------+
| POSSESSIVE (*+)                                  |
| Start: grab EVERYTHING >>>>>>>>>>>>>>>>>>>>      |
| Fail? Too bad. FAIL. No backtracking.            |
+--------------------------------------------------+
Why would you want this? Two reasons:
- Performance. When you KNOW backtracking won't help, possessive quantifiers skip it. That can head off catastrophic backtracking, where the engine tries millions of combinations and appears to hang.
- Correctness. Sometimes you want "all or nothing" behavior.
Performance example: matching lines that are all digits with ^\d+$. On input "12345678x", the greedy \d+ grabs all eight digits, fails at $, gives back the 8, fails, gives back the 7... pointless work, because giving back digits will never make "x" match $. With ^\d++$ the possessive ++ grabs all the digits, fails at $, and immediately gives up.
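A quick sanity check that the possessive version changes only the amount of work, not the verdict. This is a sketch: check is a hypothetical helper written for this example, not part of grep. It shows the results only, not timing, since on an input this short you won't see a speed difference, and an optimizing engine may shortcut the greedy case on its own.

```shell
# check is a hypothetical helper, not part of grep: it reports whether
# the pattern ($1) matches the input line ($2).
check() {
  if printf '%s\n' "$2" | grep -qP "$1"; then
    echo 'match'
  else
    echo 'no match'
  fi
}

check '^\d+$'  '12345678'     # match
check '^\d++$' '12345678'     # match
check '^\d+$'  '12345678x'    # no match
check '^\d++$' '12345678x'    # no match
```

Both patterns accept the all-digit line and reject the one ending in "x". The possessive one just gets to "no" without walking back through the digits.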
The behavior difference in action:
echo "aaaaaab" | grep -P "a++ab"
FAILS to match. The a++ possessively grabs all six a's and refuses to give one back for the literal "a" in the pattern. Compare:
echo "aaaaaab" | grep -P "a+ab"
Succeeds. Greedy a+ grabs all six, fails, backtracks one, and the literal "a" and "b" match.
When To Use Which
+--------------------------------------------+
| Greedy (* + ?)                             |
| Default. Longest possible match.           |
| Backtracking is fine.                      |
+--------------------------------------------+
| Non-greedy (*? +? ??)                      |
| Shortest match. Parsing delimited          |
| fields, tags, quotes.                      |
+--------------------------------------------+
| Possessive (*+ ++ ?+)                      |
| No backtracking. Performance guard or      |
| "all or nothing" semantics.                |
+--------------------------------------------+
In practice: greedy most of the time, non-greedy when parsing delimited content, possessive when you know backtracking can't help and the pattern will run over large volumes of text in production.
Now you understand what the engine is doing under the hood. That's the difference between using regex and mastering it.