Techalicious Academy / 2026-02-19-advanced-regex

CAPTURE GROUPS - THE FULL PICTURE

In Regex Therapy we learned to put parentheses around parts of our pattern and refer to them as $1, $2, $3 in replacements. That's the beginning. Tonight we see everything capture groups can really do.

Quick Refresher: Numbered Groups

Parentheses create capture groups. Each group is numbered by the position of its opening paren, counting left to right and starting at 1.

Pattern:  (foo)(bar)(baz)
Group 1:  foo
Group 2:  bar
Group 3:  baz

In a replacement, $1 gives you whatever group 1 captured. You know this part. Let's go further.
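
If you'd rather poke at this from a script than a one-liner, here is a minimal Python sketch of the same three groups. The sample strings are made up for illustration, and note that Python's re module writes a replacement reference as \1 where Perl writes $1.

```python
import re

# Three groups, numbered by the position of each opening paren.
m = re.search(r"(foo)(bar)(baz)", "xxfoobarbazxx")
print(m.group(1), m.group(2), m.group(3))
print(m.groups())

# In a Python replacement string, \1 plays the role of $1.
swapped = re.sub(r"(foo)(bar)(baz)", r"\3\2\1", "foobarbaz")
print(swapped)
```

group(0) is always the whole match, which is why the numbered groups start at 1.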

Named Capture Groups

When you have a complex pattern with five or six groups, keeping track of which number means what gets painful fast. Named groups fix this.

Two syntaxes (because regex standards are fun like that):

(?<name>pattern)    Perl/PCRE style
(?P<name>pattern)   Python style (also works in PCRE)

Both do the same thing. Instead of this:

(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)

Where you have to remember that $1 is date, $2 is time, $3 is level, $4 is message... you write this:

(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) (?<level>\w+) (?<message>.*)

Now the pattern documents itself. Anyone reading it can tell exactly what each group captures.

Using Named Groups in Practice

Let's parse a real log line:

2026-02-19 14:30:45 ERROR Database connection timeout after 30s

Try it with perl to see named captures in action:

echo '2026-02-19 14:30:45 ERROR Database connection timeout after 30s' | \
  perl -ne 'if (/(?<date>\d{4}-\d{2}-\d{2}) (?<time>\d{2}:\d{2}:\d{2}) (?<level>\w+) (?<message>.*)/) {
    print "Date:    $+{date}\n";
    print "Time:    $+{time}\n";
    print "Level:   $+{level}\n";
    print "Message: $+{message}\n";
  }'

Output:

Date:    2026-02-19
Time:    14:30:45
Level:   ERROR
Message: Database connection timeout after 30s

In Perl, named captures live in the %+ hash. In Python you'd use match.group('date'). In PHP it's $matches['date']. Every language has its own way to access them, and the group syntax is mostly shared, with one catch: Python's re module only accepts the (?P<name>...) form, not (?<name>...).
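
For comparison, here is a rough Python version of the same log parse. It is a sketch, not the only way to do it: the pattern is rewritten with (?P<name>...) because that is the spelling Python's re module expects, and the names LOG_RE and fields are just illustrative.

```python
import re

# Same four groups as the Perl one-liner, in Python's (?P<name>...) spelling.
LOG_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) "
    r"(?P<message>.*)"
)

line = "2026-02-19 14:30:45 ERROR Database connection timeout after 30s"
m = LOG_RE.match(line)

# Access one group by name, or grab them all as a dict.
print(m.group("date"))
fields = m.groupdict()
print(fields["level"])
print(fields["message"])
```

groupdict() is handy when you want to hand the parsed line to other code without caring about group order.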

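For replacements in Perl, read from the same %+ hash with $+{name}. (Writing ${last} would interpolate an ordinary Perl variable called $last, which is empty here.)

perl -pe 's/(?<first>\w+) (?<last>\w+)/$+{last}, $+{first}/' <<< "John Smith"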

Output: Smith, John
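
The replacement syntax is one of the places where languages differ. As a small sketch, Python's re.sub refers to a named group in the replacement string as \g<name>; the variable names below are only for illustration.

```python
import re

# Named groups in the pattern, \g<name> references in the replacement.
name_re = re.compile(r"(?P<first>\w+) (?P<last>\w+)")
result = name_re.sub(r"\g<last>, \g<first>", "John Smith")
print(result)
```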

Non-Capturing Groups: Quick Refresher

We covered (?:...) last time. Non-capturing groups let you group for alternation or quantifiers without creating a numbered capture.

(?:https?|ftp)://(\S+)

The (?:https?|ftp) groups the protocol alternatives but doesn't capture them. Group 1 is everything after the ://, meaning the host and path together. Rule of thumb: if you need parentheses for structure but don't need the captured text, use (?:). Keeps your group numbers clean.
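
To see the effect on numbering, here is a short Python sketch that runs the pattern twice, once with the non-capturing group and once with a plain capturing group. The example.com addresses are placeholders, not real endpoints.

```python
import re

# Non-capturing scheme: the only numbered group is the part after ://
url_re = re.compile(r"(?:https?|ftp)://(\S+)")
m = url_re.search("Docs live at https://example.com/guide/intro today")
print(m.groups())

# Capturing scheme: the scheme takes group 1 and the rest moves to group 2.
capturing_re = re.compile(r"(https?|ftp)://(\S+)")
m2 = capturing_re.search("Mirror at ftp://files.example.com/pub")
print(m2.groups())
```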

Backreferences Inside the Pattern

You know you can use $1 in a REPLACEMENT string. But you can also reference a previous capture INSIDE THE PATTERN ITSELF using \1.

This doesn't mean "match the same pattern again." It means "match the exact same TEXT that was already captured."

The classic example: finding duplicated words.

\b(\w+)\s+\1\b

\b        word boundary
(\w+)     capture a word into group 1
\s+       one or more whitespace characters
\1        match the SAME TEXT that group 1 captured
\b        word boundary

Test it (the -P flag needs GNU grep):

echo "the the quick brown fox fox jumped" | grep -oP '\b(\w+)\s+\1\b'

Output:

the the
fox fox

The \1 matched the SPECIFIC word that (\w+) captured on that pass. With named groups, use \k<name> instead of \1:

\b(?<word>\w+)\s+\k<word>\b
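
Here is a minimal Python sketch of the duplicate-word check, plus one way to repair the text. One assumption to flag: Python's re module does not use \k<name>, so the named version below uses Python's own spelling, (?P=word).

```python
import re

text = "the the quick brown fox fox jumped"

# Numbered backreference: \1 must repeat the exact text group 1 captured.
dup_re = re.compile(r"\b(\w+)\s+\1\b")
dups = [m.group(0) for m in dup_re.finditer(text)]
print(dups)

# Named version, then collapse each doubled word to a single copy.
named_re = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
fixed = named_re.sub(r"\g<word>", text)
print(fixed)
```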

Matching Paired Tags

Another real use of backreferences: matching HTML-style paired tags.

<(\w+)>.*?</\1>

Captures the tag name, then requires the closing tag to match.

echo '<b>bold text</b> and <i>italic</i>' | grep -oP '<(\w+)>.*?</\1>'

Output:

<b>bold text</b>
<i>italic</i>

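Only properly paired tags match; a mismatched pair like <b>text</i> is rejected. Keep in mind this pattern only handles tags without attributes, and with nested tags of the same name it pairs the first opener with the first closer.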
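
The same check as a short Python sketch, including the mismatched case. This is only meant to show the backreference at work; it is not a substitute for a real HTML parser.

```python
import re

# \1 forces the closing tag name to equal the captured opening tag name.
tag_re = re.compile(r"<(\w+)>.*?</\1>")

html = "<b>bold text</b> and <i>italic</i>"
pairs = [m.group(0) for m in tag_re.finditer(html)]
print(pairs)

# A mismatched pair finds no match at all, so search() returns None.
mismatch = tag_re.search("<b>text</i>")
print(mismatch)
```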

Nested Group Numbering

Here's a gotcha. Nested groups are numbered by where their OPENING parenthesis appears, reading left to right.

((a)(b(c)))

Count the opening parens:

Position 1:  (           -> Group 1 captures: abc
Position 2:    (         -> Group 2 captures: a
Position 3:      (       -> Group 3 captures: bc
Position 4:        (     -> Group 4 captures: c

Verify it:

echo "abc" | perl -ne 'if (/((a)(b(c)))/) {
  print "Group 1: $1\n";
  print "Group 2: $2\n";
  print "Group 3: $3\n";
  print "Group 4: $4\n";
}'

Output:

Group 1: abc
Group 2: a
Group 3: bc
Group 4: c

A practical case: parsing a function call with (\w+)\((\w+),\s*(\w+)\) gives you groups 1/2/3 for name, arg1, arg2. Wrap the whole thing in an outer group and everything shifts up by one because the new outer paren becomes group 1. Named groups avoid this completely because names don't shift when you add or remove parens.
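
The shift is easy to watch in a few lines of Python. This is a sketch with a made-up call string, max(x, y), and illustrative group names (call, name, arg1, arg2) that are not part of the original pattern.

```python
import re

# Nested groups are numbered by opening paren, left to right.
m = re.fullmatch(r"((a)(b(c)))", "abc")
print(m.groups())

call = "max(x, y)"

# Without an outer group: name, arg1, arg2 are groups 1, 2, 3.
plain = re.match(r"(\w+)\((\w+),\s*(\w+)\)", call)
print(plain.groups())

# Add an outer group and every number shifts up by one.
wrapped = re.match(r"((\w+)\((\w+),\s*(\w+)\))", call)
print(wrapped.group(1), "|", wrapped.group(2))

# Named groups keep working no matter how many parens are added.
named = re.match(
    r"(?P<call>(?P<name>\w+)\((?P<arg1>\w+),\s*(?P<arg2>\w+)\))", call
)
print(named.group("name"), named.group("arg1"), named.group("arg2"))
```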

Putting It Together

Parse an Apache access log line with named captures:

(?<ip>\d+\.\d+\.\d+\.\d+) - - \[(?<date>[^\]]+)\] "(?<method>\w+) (?<path>\S+) \S+" (?<status>\d{3}) (?<size>\d+)

Against:

192.168.1.50 - - [19/Feb/2026:14:30:00 +0000] "GET /api/users HTTP/1.1" 200 1534

This pulls out IP, date, HTTP method, path, status code, and response size, all as named captures you can reference individually. Compare that to six numbered groups where you'd have to count and remember which number is which.
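
As a rough sketch, here is the same pattern run from Python against that sample line. The groups are respelled as (?P<name>...) for Python's re module, and ACCESS_RE and entry are illustrative names.

```python
import re

# The access-log pattern, split across two raw strings for readability.
ACCESS_RE = re.compile(
    r'(?P<ip>\d+\.\d+\.\d+\.\d+) - - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

line = ('192.168.1.50 - - [19/Feb/2026:14:30:00 +0000] '
        '"GET /api/users HTTP/1.1" 200 1534')

m = ACCESS_RE.match(line)
entry = m.groupdict()

# Print each field next to its name.
for key, value in entry.items():
    print(f"{key:7} {value}")
```

Every field arrives as a string, so convert status and size with int() if you need to do arithmetic on them.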

Named captures turn regex from a write-only language into something your future self can actually read.