LOOKAHEAD AND LOOKBEHIND

This is where regex gets surgical.

Everything we've matched so far has been about consuming characters. The engine reads forward, eats characters, and says "matched" or "didn't." Lookaround assertions work differently. They check what's around a position without consuming anything. They peek, then step back.

Think of it like standing in a doorway. You lean forward and look into the next room. You see what's there. But you haven't actually moved. Your feet are still in the same spot. That's a lookahead.

The Four Types

There are exactly four lookaround assertions. Two look forward, two look backward. Two check for presence, two check for absence.

(?=...)   Positive lookahead    what follows IS this
(?!...)   Negative lookahead    what follows is NOT this
(?<=...)  Positive lookbehind   what came before IS this
(?<!...)  Negative lookbehind   what came before is NOT this

The syntax is ugly. No way around it. But once you use them a few times, they become second nature.

Here's the key concept in a picture:

+-------------------------------------------------------+
|  Normal match:   the engine MOVES through the text    |
|                                                       |
|    /foo/  on "foobar"                                 |
|    ^^^                                                |
|    engine advances past "foo", match consumes it      |
|                                                       |
|  Lookaround:     the engine PEEKS but stays put       |
|                                                       |
|    /(?=foo)/  on "foobar"                             |
|    ^                                                  |
|    engine confirms "foo" is ahead, but matches        |
|    a zero-width position. Nothing consumed.           |
+-------------------------------------------------------+

That "zero-width" part is everything. The match has no length. It matches a position between characters, not characters themselves.
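
You can see the zero-width behavior directly by asking an engine for the span of each match. Here is a minimal sketch using Python's built-in re module; Python and the variable names are choices made for this illustration only, and the same idea applies in other engines:

```python
import re

text = "foobar"

# A normal match consumes characters: the span covers "foo".
normal = re.search(r"foo", text)
normal_span = normal.span()      # (0, 3)

# A lookahead matches a position: start and end are the same index.
peek = re.search(r"(?=foo)", text)
peek_span = peek.span()          # (0, 0)
peek_text = peek.group()         # '' (nothing was consumed)

print(normal_span, peek_span, repr(peek_text))
```

The lookahead "succeeds" at index 0, but the matched text is the empty string.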

Positive Lookahead: (?=...)

"Match this, but only if THAT follows."

The classic example. You want to match a number, but only when it appears before the word "dollars":

Pattern:  \d+(?= dollars)
Text:     He paid 100 dollars for 50 cats
Matches:  100

The 50 doesn't match because "cats" follows it, not "dollars." And here's the critical detail: the match is just "100". The word "dollars" is not part of the match. The lookahead confirmed it was there but didn't consume it.

Another practical one. Find function names in code by matching words that come right before an opening parenthesis:

Pattern:  \w+(?=\()
Text:     result = calculate(x) + transform(y)
Matches:  calculate, transform

The parentheses aren't included in the match. You get just the function names. Clean.

One more. Find words followed by a comma:

Pattern:  \w+(?=,)
Text:     apples, bananas, cherries
Matches:  apples, bananas

Not cherries, because there's no comma after it.
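
If you want to try these three yourself, here is one way to do it, sketched with Python's built-in re module (an assumption of this example; the variable names are made up):

```python
import re

# Numbers followed by " dollars"; the word itself stays out of the match.
amounts = re.findall(r"\d+(?= dollars)", "He paid 100 dollars for 50 cats")

# Identifiers that sit directly before an opening parenthesis.
calls = re.findall(r"\w+(?=\()", "result = calculate(x) + transform(y)")

# Words directly followed by a comma.
listed = re.findall(r"\w+(?=,)", "apples, bananas, cherries")

print(amounts)  # ['100']
print(calls)    # ['calculate', 'transform']
print(listed)   # ['apples', 'bananas']
```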

Negative Lookahead: (?!...)

"Match this, but only if THAT does NOT follow."

Flip it around. Match numbers that are NOT followed by "dollars":

Pattern:  \d+(?! dollars)
Text:     He paid 100 dollars for 50 cats
Matches:  10, 50

Wait, why "10" and not "100"? This is a subtlety that trips people up. At the start of "100", \d+ greedily grabs all three digits. The lookahead then sees " dollars" and fails. But the engine doesn't give up on that starting position. It backtracks, and \d+ gives back one digit. Now \d+ has matched "10", and what follows is "0 dollars", which doesn't begin with " dollars". The lookahead is satisfied. Match. The leftover "0" can't match on its own, because " dollars" comes right after it.

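To avoid this gotcha, put a word boundary on both sides of the digits:

Pattern:  \b\d+\b(?! dollars)
Text:     He paid 100 dollars for 50 cats
Matches:  50

Better. The trailing \b is what does the work here: it stops \d+ from backtracking to a partial number like "10", because there is no word boundary between "10" and the "0" after it. The leading \b keeps a match from starting partway through a number. A leading \b alone would not be enough; \b\d+(?! dollars) still matches "10". With both in place, we get whole numbers only.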

Another useful one. Match "foo" when it's NOT followed by "bar":

Pattern:  foo(?!bar)
Text:     fooXYZ foobaz foobar food
Matches:  foo (in fooXYZ), foo (in foobaz), foo (in food)

The foobar is skipped. The rest all match.
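
The backtracking behavior is easy to check for yourself. A small sketch, again assuming Python's built-in re module; the second pattern is a variant with a word boundary on each side of the digits, so that a partial number can't match:

```python
import re

text = "He paid 100 dollars for 50 cats"

# Without boundaries, \d+ backtracks from "100" to "10" to satisfy the lookahead.
loose = re.findall(r"\d+(?! dollars)", text)

# With \b on both sides of the digits, only whole numbers can match.
strict = re.findall(r"\b\d+\b(?! dollars)", text)

# "foo" not followed by "bar"; finditer exposes where each match starts.
foo_starts = [
    m.start() for m in re.finditer(r"foo(?!bar)", "fooXYZ foobaz foobar food")
]

print(loose)       # ['10', '50']
print(strict)      # ['50']
print(foo_starts)  # [0, 7, 21]
```

The start offsets 0, 7 and 21 are fooXYZ, foobaz and food. Offset 14, where foobar begins, is missing.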

Positive Lookbehind: (?<=...)

"Match this, but only if THAT came before it."

Now we look backward. Match digits, but only when they come right after a dollar sign:

Pattern:  (?<=\$)\d+
Text:     Price is $500 and 200 units left
Matches:  500

The 200 doesn't match because there's no $ before it. And the $ is not part of the match. You get just the digits.

Extract the part after the @ from email-like strings (on a real address such as user@example.com, \w+ would stop at the first dot):

Pattern:  (?<=@)\w+
Text:     user@example, admin@server, test@host
Matches:  example, server, host

Find values that follow an equals sign:

Pattern:  (?<==)\w+
Text:     name=Mike age=42 city=Portland
Matches:  Mike, 42, Portland

The equals sign isn't captured. Just the values.
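
The same three patterns, run through Python's built-in re module as a quick check (Python and the variable names are assumptions of this sketch, not part of the patterns):

```python
import re

# Digits directly after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "Price is $500 and 200 units left")

# Word characters directly after an @.
after_at = re.findall(r"(?<=@)\w+", "user@example, admin@server, test@host")

# Word characters directly after an equals sign.
after_equals = re.findall(r"(?<==)\w+", "name=Mike age=42 city=Portland")

print(after_dollar)  # ['500']
print(after_at)      # ['example', 'server', 'host']
print(after_equals)  # ['Mike', '42', 'Portland']
```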

Negative Lookbehind: (?<!...)

"Match this, but only if THAT did NOT come before it."

Match digits that are NOT preceded by a dollar sign:

Pattern:  (?<!\$)\d+
Text:     Price is $500 and 200 units left
Matches:  00, 200

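The same kind of gotcha, this time from the other side. The engine rejects the "5" because "$" comes right before it, but the "00" that follows is preceded by "5", not "$", so it matches. Since the problem is at the start of the match, a word boundary before the digits is enough to get clean results: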

Pattern:  (?<!\$)\b\d+
Text:     Price is $500 and 200 units left
Matches:  200

The classic example everyone remembers. Match "happy" but not "unhappy":

Pattern:  (?<!un)happy
Text:     I'm happy but she's unhappy
Matches:  happy (the first one only)
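
A short sketch of the negative lookbehind examples, assuming Python's built-in re module (the variable names are made up for illustration):

```python
import re

text = "Price is $500 and 200 units left"

# Without a boundary, the "00" inside "$500" slips through.
loose = re.findall(r"(?<!\$)\d+", text)

# Requiring a word boundary before the digits rules out mid-number starts.
strict = re.findall(r"(?<!\$)\b\d+", text)

# "happy" not preceded by "un"; record where each match starts.
happy_starts = [
    m.start() for m in re.finditer(r"(?<!un)happy", "I'm happy but she's unhappy")
]

print(loose)         # ['00', '200']
print(strict)        # ['200']
print(happy_starts)  # [4]
```

Only one "happy" is found, at offset 4. The one inside "unhappy" is rejected by the lookbehind.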

Combining Lookarounds

Here's where it gets genuinely powerful. You can stack multiple lookaheads at the same position. They all check from the same spot.

The textbook example is password validation. Say you need a password that has all of these:

At least 8 characters
At least one uppercase letter
At least one digit
At least one special character from !@#$%

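Each of the three character requirements becomes a lookahead anchored at the start, and the length requirement becomes the part that actually consumes the string: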

(?=.*[A-Z])         somewhere there's an uppercase letter
(?=.*\d)            somewhere there's a digit
(?=.*[!@#$%])       somewhere there's a special char
.{8,}               and the whole thing is 8+ chars

Combined:

^(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%]).{8,}$

Let's test it:

"password"       Fails (no uppercase, no digit, no special)
"Password1"      Fails (no special character)
"Password1!"     Matches (has uppercase P, digit 1, special !)
"P1!"            Fails (only 3 characters, need 8)

Each lookahead independently checks its condition without advancing the position. They all fire from the ^ anchor. Then .{8,} actually consumes the string and enforces the length.

This pattern shows up constantly in form validation. Now you know exactly how it works.
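
Here is one way the combined pattern might be wrapped up for use, sketched in Python with the built-in re module. The function name is_valid_password is hypothetical, invented for this example:

```python
import re

# Three lookaheads check for an uppercase letter, a digit and a special
# character; .{8,} then consumes the string and enforces the length.
PASSWORD_RE = re.compile(r"^(?=.*[A-Z])(?=.*\d)(?=.*[!@#$%]).{8,}$")


def is_valid_password(candidate: str) -> bool:
    """Return True when the candidate satisfies all four rules."""
    return PASSWORD_RE.search(candidate) is not None


results = {
    pw: is_valid_password(pw)
    for pw in ["password", "Password1", "Password1!", "P1!"]
}

for pw, ok in results.items():
    print(f"{pw!r}: {'matches' if ok else 'fails'}")
```

One Python-specific caveat: $ also matches just before a trailing newline, so if stray newlines matter in your input, \Z in place of $ is the stricter choice.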

Lookbehind Limitations

One thing to know. In many regex engines, lookbehinds must be fixed-width. That means you can't use quantifiers like * or + inside a lookbehind.

(?<=\d+)foo     ILLEGAL in most engines
(?<=\d{3})foo   LEGAL (fixed width of 3)

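Perl is stricter than you might expect here. It has traditionally required fixed-width lookbehind, and since version 5.30 it accepts variable-length lookbehind only when the length is bounded (255 characters at most), so (?<=\d+) is still rejected. .NET handles variable-width lookbehind fine. So does Python's regex module (not re, but regex), and so do modern JavaScript engines. Many other languages and tools require fixed or at least bounded width. Keep this in mind when you're writing portable patterns.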

Lookaheads have no such restriction. You can put whatever you want in a lookahead.
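
Python's built-in re module is one of the fixed-width engines, which makes it a handy place to see the restriction. A minimal sketch; the helper name compiles is made up for this example:

```python
import re


def compiles(pattern: str) -> bool:
    """Report whether Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Variable-width lookbehind: rejected by re.
variable_width_ok = compiles(r"(?<=\d+)foo")

# Fixed-width lookbehind: accepted.
fixed_width_ok = compiles(r"(?<=\d{3})foo")

# The same quantifier inside a lookahead: accepted.
lookahead_ok = compiles(r"foo(?=\d+)")

print(variable_width_ok, fixed_width_ok, lookahead_ok)  # False True True
```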

Real-World Patterns

Extract prices with currency symbols from mixed text:

Pattern:  (?<=\$)\d+(?:\.\d{2})?
Text:     Items cost $19.99 and $5.00 plus 3.50 shipping
Matches:  19.99, 5.00

The 3.50 is skipped because there's no $ before it.

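Find single words between quotes without capturing the quotes:

Pattern:  (?<=")\w+(?=")
Text:     She said "hello" and then "goodbye"
Matches:  hello, goodbye

Be careful with the more obvious [^"]+ here. Because the lookarounds don't consume the quotes, the closing quote of "hello" can also serve as the opening quote of the next match, so [^"]+ would return " and then " as well, the text sitting between the two quoted words.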

Match a comma-separated value that's NOT the last field:

Pattern:  [^,]+(?=,)
Text:     alice,bob,carol
Matches:  alice, bob

Carol is skipped because no comma follows.
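
A sketch of these patterns in Python's built-in re module (an assumption of this example, as are the variable names). The quoted-word check runs both \w+ and [^"]+ between the lookarounds so you can compare what each one returns:

```python
import re

# Prices: digits after "$", with an optional two-digit decimal part.
prices = re.findall(
    r"(?<=\$)\d+(?:\.\d{2})?",
    "Items cost $19.99 and $5.00 plus 3.50 shipping",
)

quote_text = 'She said "hello" and then "goodbye"'

# \w+ between the quotes returns only the quoted words.
quoted_words = re.findall(r'(?<=")\w+(?=")', quote_text)

# [^"]+ also accepts the stretch between a closing and an opening quote,
# because the lookarounds never consume the quote characters.
quoted_loose = re.findall(r'(?<=")[^"]+(?=")', quote_text)

# Every comma-separated field except the last.
fields = re.findall(r"[^,]+(?=,)", "alice,bob,carol")

print(prices)        # ['19.99', '5.00']
print(quoted_words)  # ['hello', 'goodbye']
print(quoted_loose)  # ['hello', ' and then ', 'goodbye']
print(fields)        # ['alice', 'bob']
```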

Lookarounds are one of those features where once you get them, you wonder how you ever wrote regex without them. They show up in many of the serious patterns you'll write from here on out.