CAPTURE GROUPS - THE FULL PICTURE
In Regex Therapy we learned to put parentheses around parts of our pattern and refer to them as $1, $2, $3 in replacements. That's the beginning. Tonight we see everything capture groups can really do.
Quick Refresher: Numbered Groups
Parentheses create numbered groups, numbered by the position of their opening paren, left to right, starting at 1.
Pattern: (foo)(bar)(baz)
Group 1: foo
Group 2: bar
Group 3: baz
In a replacement, $1 gives you whatever group 1 captured. You know this part. Let's go further.
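If you'd rather poke at this from a script than a one-liner, here is a minimal Python sketch of the same three groups. The input string and the variable names are made up for illustration. Note that Python writes \1 in a replacement where Perl writes $1.

```python
import re

# Three numbered groups, counted by opening parenthesis, left to right.
pattern = re.compile(r"(foo)(bar)(baz)")

match = pattern.search("foobarbaz")
groups = match.groups()   # every capture, in group order
first = match.group(1)    # the text Perl would expose as $1

# Reorder the captures in a replacement: \3 is group 3, and so on.
reordered = pattern.sub(r"\3-\2-\1", "foobarbaz")

print(groups)
print(reordered)
```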
Named Capture Groups
When you have a complex pattern with five or six groups, keeping track of which number means what gets painful fast. Named groups fix this.
Two syntaxes (because regex standards are fun like that):
(?<name>pattern) Perl/PCRE style
(?P<name>pattern) Python style (also works in PCRE)
Both do the same thing. Instead of this:
(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)
Where you have to remember that $1 is date, $2 is time, $3 is level, $4 is message... you write this:
(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) (?<level>\w+) (?<message>.*)
Now the pattern documents itself. Anyone reading it can tell exactly what each group captures.
Using Named Groups in Practice
Let's parse a real log line:
2026-02-19 14:30:45 ERROR Database connection timeout
Try it with perl to see named captures in action:
echo '2026-02-19 14:30:45 ERROR Database connection timeout' | \
perl -ne 'if (/(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) (?<level>\w+) (?<message>.*)/) {
print "Date: $+{date}\n";
print "Time: $+{time}\n";
print "Level: $+{level}\n";
print "Message: $+{message}\n";
}'
Output:
Date: 2026-02-19
Time: 14:30:45
Level: ERROR
Message: Database connection timeout
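In Perl, named captures live in the %+ hash. In Python you'd use match.group('date'). In PHP it's $matches['date']. Every language has its own way to access them, and the regex syntax is nearly the same everywhere. The main exception: Python's re module only accepts the (?P<name>pattern) form, not (?<name>pattern).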
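For comparison, here is a sketch of the same log-line parse in Python. It uses the (?P<name>...) spelling because that is the one Python's re module accepts. The names LOG_PATTERN and fields are just illustrative.

```python
import re

# Same four fields as the Perl example, written with Python's (?P<name>...) form.
LOG_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.*)"
)

line = "2026-02-19 14:30:45 ERROR Database connection timeout"
match = LOG_PATTERN.match(line)

fields = match.groupdict()    # all named captures as a dict
level = match.group("level")  # or fetch one capture by name

for name, value in fields.items():
    print(f"{name}: {value}")
```

groupdict() plays roughly the role of Perl's %+ hash: one lookup table keyed by group name.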
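For replacements in Perl, pull the named captures from the same %+ hash with $+{name}. (Plain ${name} won't do it here: Perl reads that as an ordinary variable called $name, which is empty.)
perl -pe 's/(?<first>\w+) (?<last>\w+)/$+{last}, $+{first}/' <<< "John Smith"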
Output: Smith, John
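The replacement syntax is one of the places where languages disagree. As a small sketch, Python's re.sub refers to a named group in the replacement string as \g<name>:

```python
import re

# Swap "First Last" into "Last, First" using named groups.
name_pattern = re.compile(r"(?P<first>\w+) (?P<last>\w+)")

# \g<name> in the replacement inserts the text captured by that named group.
swapped = name_pattern.sub(r"\g<last>, \g<first>", "John Smith")

print(swapped)
```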
Non-Capturing Groups: Quick Refresher
We covered (?:...) last time. Non-capturing groups let you group for alternation or quantifiers without creating a numbered capture.
(?:https?|ftp)://(\S+)
The (?:https?|ftp) groups the protocol alternatives but doesn't capture them. Group 1 ($1) is everything after the ://, meaning the host plus the path. Rule of thumb: if you need parentheses for structure but don't need the captured text, use (?:...). Keeps your group numbers clean.
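To see the numbering difference side by side, here is a short Python sketch that runs the non-capturing pattern and a capturing variant against the same text. The sample URL is a made-up example.

```python
import re

text = "see ftp://example.com/pub/file.txt for details"

# Non-capturing protocol group: only one numbered group exists.
non_capturing = re.compile(r"(?:https?|ftp)://(\S+)")
# Capturing protocol group: the protocol takes group 1 and the rest shifts to group 2.
capturing = re.compile(r"(https?|ftp)://(\S+)")

clean_groups = non_capturing.search(text).groups()
shifted_groups = capturing.search(text).groups()

print(clean_groups)
print(shifted_groups)
```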
Backreferences Inside the Pattern
You know you can use $1 in a REPLACEMENT string. But you can also reference a previous capture INSIDE THE PATTERN ITSELF using \1.
This doesn't mean "match the same pattern again." It means "match the exact same TEXT that was already captured."
The classic example: finding duplicated words.
\b(\w+)\s+\1\b
\b word boundary
(\w+) capture a word into group 1
\s+ one or more whitespace characters
\1 match the SAME TEXT that group 1 captured
\b word boundary
Test it (the -P flag needs GNU grep):
echo "the the quick brown fox fox jumped" | grep -oP '\b(\w+)\s+\1\b'
Output:
the the
fox fox
The \1 matched the SPECIFIC word that (\w+) captured on that pass. With named groups, use \k<name> instead of \1:
\b(?<word>\w+)\s+\k<word>\b
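Here is the duplicate-word check as a Python sketch. One difference to watch for: Python's re module writes the named backreference as (?P=word) rather than \k<word>. The numbered \1 form works the same way.

```python
import re

text = "the the quick brown fox fox jumped"

# Numbered backreference: \1 must match the exact text group 1 captured.
numbered = re.compile(r"\b(\w+)\s+\1\b")
# Named backreference: Python spells it (?P=word).
named = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

doubled_phrases = [m.group(0) for m in numbered.finditer(text)]
doubled_words = [m.group("word") for m in named.finditer(text)]

print(doubled_phrases)
print(doubled_words)
```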
Matching Paired Tags
Another real use of backreferences: matching HTML-style paired tags.
<(\w+)>.*?</\1>
Captures the tag name, then requires the closing tag to match.
echo '<b>bold text</b> and <i>italic</i>' | grep -oP '<(\w+)>.*?</\1>'
Output:
<b>bold text</b>
<i>italic</i>
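Properly paired tags only. A mismatched pair like <b>text</i> doesn't match. Keep the limits in mind, though: this pattern only handles bare tags with no attributes, and it stops at the first closing tag with the right name, so nested tags of the same name won't pair up correctly.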
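A quick Python sketch of the same idea, with a deliberately mismatched pair tacked onto the end of the input to show it being skipped. The sample string is made up.

```python
import re

# The closing tag must repeat whatever name group 1 captured.
tag_pair = re.compile(r"<(\w+)>.*?</\1>")

html = "<b>bold text</b> and <i>italic</i> and <b>broken</i>"

pairs = [m.group(0) for m in tag_pair.finditer(html)]
tag_names = [m.group(1) for m in tag_pair.finditer(html)]

print(pairs)
print(tag_names)
```

The trailing <b>broken</i> is ignored because no </b> follows it.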
Nested Group Numbering
Here's a gotcha. Nested groups are numbered by where their OPENING parenthesis appears, reading left to right.
((a)(b(c)))
Count the opening parens:
Position 1: ( -> Group 1 captures: abc
Position 2: ( -> Group 2 captures: a
Position 3: ( -> Group 3 captures: bc
Position 4: ( -> Group 4 captures: c
Verify it:
echo "abc" | perl -ne 'if (/((a)(b(c)))/) {
print "Group 1: $1\n";
print "Group 2: $2\n";
print "Group 3: $3\n";
print "Group 4: $4\n";
}'
Output:
Group 1: abc
Group 2: a
Group 3: bc
Group 4: c
A practical case: parsing a function call with (\w+)\((\w+),\s*(\w+)\) gives you groups 1/2/3 for name, arg1, arg2. Wrap the whole thing in an outer group and everything shifts up by one because the new outer paren becomes group 1. Named groups avoid this completely because names don't shift when you add or remove parens.
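The shift is easy to watch happen. This Python sketch runs the nested example, then the function-call pattern with and without an outer group, then a named version. The call string max(x, y) and the group names call, name, arg1 and arg2 are assumptions for illustration.

```python
import re

# Nested groups are numbered by the position of their opening paren.
nested_groups = re.match(r"((a)(b(c)))", "abc").groups()

call = "max(x, y)"

# Without an outer group: name, arg1, arg2 are groups 1, 2, 3.
plain_groups = re.match(r"(\w+)\((\w+),\s*(\w+)\)", call).groups()

# Add an outer group and every existing number shifts up by one.
wrapped_groups = re.match(r"((\w+)\((\w+),\s*(\w+)\))", call).groups()

# Named groups keep their names no matter how many parens surround them.
named = re.match(
    r"(?P<call>(?P<name>\w+)\((?P<arg1>\w+),\s*(?P<arg2>\w+)\))", call
)
func_name = named.group("name")

print(nested_groups)
print(plain_groups)
print(wrapped_groups)
print(func_name)
```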
Putting It Together
Parse an Apache access log line with named captures:
(?<ip>\d+\.\d+\.\d+\.\d+) - - \[(?<date>[^\]]+)\] "(?<method>\w+) (?<path>\S+) \S+" (?<status>\d{3}) (?<size>\d+)
Against:
192.168.1.50 - - [19/Feb/2026:14:30:00 +0000] "GET /api/users HTTP/1.1" 200 1534
This pulls out IP, date, HTTP method, path, status code, and response size, all as named captures you can reference individually. Compare that to six numbered groups where you'd have to count and remember which number is which.
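As a closing sketch, here is that access-log pattern run from Python against the sample line. The group names are respelled in Python's (?P<name>...) form, and ACCESS_LOG and entry are illustrative names. Every captured value comes back as a string, so convert status and size yourself if you need numbers.

```python
import re

# The access-log pattern from above, split across lines for readability.
ACCESS_LOG = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = (
    '192.168.1.50 - - [19/Feb/2026:14:30:00 +0000] '
    '"GET /api/users HTTP/1.1" 200 1534'
)

entry = ACCESS_LOG.match(line).groupdict()
status_code = int(entry["status"])  # captures are strings until converted

for name, value in entry.items():
    print(f"{name}: {value}")
```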
Named captures turn regex from a write-only language into something your future self can actually read.