REGEX FUNDAMENTALS - THE QUICK VERSION
This is the foundation everything else tonight builds on. If you were at Regex Therapy, this is a fast refresher. If this is your first time with pattern matching, this gets you up to speed.
Atoms and Left-to-Right Matching
A regex pattern is made of "atoms." An atom is the smallest unit the engine can match. A literal character is an atom. A character class is an atom. A group is an atom. The engine takes your pattern and breaks it into these atoms, then works through them left to right.
Here's what actually happens when the engine matches the pattern "cat" against the text "the cat sat":
    Text:      t  h  e  _  c  a  t  _  s  a  t     (_ = space)
    Position:  0  1  2  3  4  5  6  7  8  9  10
    Position 0: Try 'c' against 't' → no match, slide forward
    Position 1: Try 'c' against 'h' → no match, slide forward
    Position 2: Try 'c' against 'e' → no match, slide forward
    Position 3: Try 'c' against ' ' → no match, slide forward
    Position 4: Try 'c' against 'c' → match! Try 'a' against position 5
    Position 5: Try 'a' against 'a' → match! Try 't' against position 6
    Position 6: Try 't' against 't' → match! All atoms matched!
    Result: "cat" found at position 4
The engine is persistent. It tries every starting position in the text until it finds a match or runs out of text. This is the fundamental mechanism behind everything we do tonight. Every advanced feature we cover is built on top of this left-to-right, atom-by-atom process.
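The trace above can be sketched in a few lines of Python. This is a toy illustration of the slide-forward loop for literal characters only, not how a real engine is implemented. The function name find_literal is made up for this example.

```python
def find_literal(pattern, text):
    """Return the index of the first match of a literal pattern, or -1."""
    # Try every starting position, left to right.
    for start in range(len(text) - len(pattern) + 1):
        # Compare atom by atom (here every atom is one literal character).
        for offset, atom in enumerate(pattern):
            if text[start + offset] != atom:
                break  # mismatch: give up on this start and slide forward
        else:
            return start  # the inner loop never broke, so every atom matched
    return -1  # ran out of text without a match


print(find_literal("cat", "the cat sat"))  # 4
```

Real engines add backtracking and many optimizations on top of this, but the outer "try each starting position" loop is the same idea.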
Character Classes
Square brackets define a set of allowed characters. The engine matches exactly one character from the set.
    [aeiou]        One vowel
    [abc]          The letter a, b, or c
Ranges use a dash inside the brackets:
    [a-z]          Any lowercase letter
    [A-Z]          Any uppercase letter
    [0-9]          Any digit
    [a-zA-Z]       Any letter, upper or lower
    [a-zA-Z0-9]    Any letter or digit
You can mix ranges and individual characters:
    [a-z0-9_]      Lowercase letter, digit, or underscore
Negate with a caret at the start:
    [^0-9]         Anything that is NOT a digit
    [^aeiou]       Anything that is NOT a vowel
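If you want to poke at these classes outside of grep, here is a small sketch using Python's built-in re module. The choice of Python and the sample strings are assumptions for illustration; the patterns themselves are the ones listed above.

```python
import re

# One character from a set: every vowel in "regex".
vowels = re.findall(r"[aeiou]", "regex")

# Ranges mixed with an individual character, repeated with + to grab runs.
tokens = re.findall(r"[a-z0-9_]+", "user_42 logged in at 09:15")

# A negated class: runs of anything that is NOT a digit.
non_digits = re.findall(r"[^0-9]+", "abc123def")

print(vowels)      # ['e', 'e']
print(tokens)      # ['user_42', 'logged', 'in', 'at', '09', '15']
print(non_digits)  # ['abc', 'def']
```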
Shorthand classes save typing:
    \d    Digit             same as [0-9]
    \D    Not a digit       same as [^0-9]
    \w    Word character    same as [a-zA-Z0-9_]
    \W    Not word char     same as [^a-zA-Z0-9_]
    \s    Whitespace        space, tab, newline
    \S    Not whitespace
Try it:
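    echo "Hello World 123" | grep -Po '[A-Z][a-z]+'
That matches a capital letter followed by one or more lowercase letters. The -P flag (Perl-compatible syntax) requires GNU grep, and -o prints each match on its own line, so the output is "Hello" and then "World".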
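The same experiment can be run in Python's re module, sketched below as an assumption-laden aside rather than part of tonight's toolchain. One thing worth knowing: for ordinary Python 3 strings, \d, \w, and \s are Unicode-aware, so they match more than the ASCII sets in the table unless you pass re.ASCII.

```python
import re

sample = "Hello World 123"

capitalized = re.findall(r"[A-Z][a-z]+", sample)  # capital, then lowercase run
digits = re.findall(r"\d+", sample)               # runs of digits
words = re.findall(r"\w+", sample)                # runs of word characters
chunks = re.findall(r"\S+", sample)               # runs of non-whitespace

# Unicode-aware by default: the accented letter counts as a word character.
unicode_word = re.fullmatch(r"\w+", "café")
# With re.ASCII, \w shrinks back to [a-zA-Z0-9_] and the match fails.
ascii_word = re.fullmatch(r"\w+", "café", flags=re.ASCII)

print(capitalized)               # ['Hello', 'World']
print(digits)                    # ['123']
print(unicode_word is not None)  # True
print(ascii_word is not None)    # False
```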
Meta Characters
These characters have special meaning in regex. They're the grammar of the language. When the engine sees them, it doesn't match them literally. It interprets them.
    .      Any single character (except newline by default)
    ^      Start of line or string
    $      End of line or string
    *      Zero or more of the previous atom
    +      One or more of the previous atom
    ?      Zero or one of the previous atom (makes it optional)
    |      Alternation (OR)
    ( )    Grouping and capturing
    [ ]    Character class
    { }    Quantifier bounds like {3} or {2,5}
    \      Escape character
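A few of the meta characters in the table are easier to feel than to read about. Here is a minimal sketch in Python's re module; the test strings are invented for this example.

```python
import re

# . matches any single character, but not the newline in "c\nt".
any_char = re.findall(r"c.t", "cat cot c\nt cut")

# ^ and $ pin the match to the start and end of the string.
starts_with_cat = re.search(r"^cat", "cat sat") is not None
cat_not_at_start = re.search(r"^cat", "the cat") is not None
ends_with_sat = re.search(r"sat$", "the cat sat") is not None

# Quantifiers apply to the previous atom only (here, the letter b or u).
zero_or_more = re.findall(r"ab*", "a ab abbb")
one_or_more = re.findall(r"ab+", "a ab abbb")
optional_u = re.findall(r"colou?r", "color colour")

# {2,3} means at least two and at most three of the previous atom.
bounded = re.findall(r"a{2,3}", "a aa aaa aaaa")

print(any_char)      # ['cat', 'cot', 'cut']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(bounded)       # ['aa', 'aaa', 'aaa']
```

Notice the last result: against "aaaa" the quantifier takes three a's and stops, and the single leftover a is too short to match on its own.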
When you need to match a literal meta character, escape it with a backslash:
    \.    A literal dot
    \$    A literal dollar sign
    \^    A literal caret
    \(    A literal opening parenthesis
    \\    A literal backslash
Examples: matching a filename with a real dot, and a price with a real dollar sign:
    photo\.jpg      matches "photo.jpg" (not "photoxjpg")
    price: \$\d+    matches "price: $50"
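To see why the backslash matters, compare the escaped and unescaped dot side by side. This is a small sketch in Python's re module (an assumption, since the examples here are engine-neutral). re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

# Escaped dot: only a real "." is accepted.
escaped_hits_real = re.search(r"photo\.jpg", "photo.jpg") is not None
escaped_hits_fake = re.search(r"photo\.jpg", "photoxjpg") is not None

# Unescaped dot: "any character", so the wrong filename slips through.
unescaped_hits_fake = re.search(r"photo.jpg", "photoxjpg") is not None

# Escaped dollar sign followed by one or more digits.
price = re.search(r"price: \$\d+", "price: $50").group()

# Let the library escape a literal string for use inside a pattern.
safe = re.escape("photo.jpg")

print(escaped_hits_real)    # True
print(escaped_hits_fake)    # False
print(unescaped_hits_fake)  # True
print(price)                # price: $50
print(safe)                 # photo\.jpg
```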
Alternation
The pipe character | means "or." The engine tries the left side first. If that doesn't match, it tries the right side.
    cat|dog           matches "cat" or "dog"
    red|green|blue    matches any of the three
    gray|grey         matches both spellings
Scope matters. The pipe applies to everything on each side unless you use parentheses to contain it:
    cat|dog food      matches "cat" OR "dog food" (so "food" belongs only to the dog branch)
    (cat|dog) food    matches "cat food" OR "dog food"
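The scope difference is easy to verify by asking which whole strings each pattern accepts. A quick sketch in Python's re module, using fullmatch so that partial hits don't muddy the picture (the candidate strings are made up for this example):

```python
import re

unscoped = re.compile(r"cat|dog food")    # "cat" OR "dog food"
scoped = re.compile(r"(cat|dog) food")    # "cat food" OR "dog food"

candidates = ["cat", "cat food", "dog", "dog food"]

# fullmatch succeeds only when the pattern covers the entire string.
unscoped_ok = [s for s in candidates if unscoped.fullmatch(s)]
scoped_ok = [s for s in candidates if scoped.fullmatch(s)]

print(unscoped_ok)  # ['cat', 'dog food']
print(scoped_ok)    # ['cat food', 'dog food']
```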
A cleaner way to handle spelling variations:
    gr(a|e)y     matches "gray" or "grey"
    colo(u|)r    matches "color" or "colour"
You can build entire day-of-week matchers:
    (Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day
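Here is the day-of-week matcher in action, as a minimal Python sketch with an invented sentence. It also shows that the parentheses capture: the group holds only the part inside them, while the full match includes the trailing "day".

```python
import re

day = re.compile(r"(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day")
text = "Meet Tuesday or Thursday, not Sunday."

# group(0) is the whole match; group(1) is what the parentheses captured.
full_matches = [m.group(0) for m in day.finditer(text)]
captured = [m.group(1) for m in day.finditer(text)]

# The empty alternative in colo(u|)r makes the "u" optional.
spellings = [s for s in ["color", "colour", "colr"]
             if re.fullmatch(r"colo(u|)r", s)]

print(full_matches)  # ['Tuesday', 'Thursday', 'Sunday']
print(captured)      # ['Tues', 'Thurs', 'Sun']
print(spellings)     # ['color', 'colour']
```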
+-------------------------------------------------------+
| If all of this is comfortable, you're ready for       |
| tonight. If any of it feels fuzzy, the Regex          |
| Therapy tutorial on techalicious.academy covers       |
| every piece in detail. Grab it for five bucks.        |
+-------------------------------------------------------+