POSIX CLASSES AND SMART ALTERNATION

Two topics that look simple on the surface but have real depth once you start using them seriously.

POSIX Character Classes

You already know \d for digits, \w for word characters, \s for whitespace. POSIX character classes are an older, more portable set of predefined character groups. They come from the POSIX standard and they have one killer advantage: locale awareness.

Here are the classes:

[:alpha:]    Letters (a-z, A-Z, and locale-dependent like accented chars)
[:digit:]    Digits 0-9
[:alnum:]    Letters and digits combined
[:upper:]    Uppercase letters
[:lower:]    Lowercase letters
[:space:]    All whitespace (space, tab, newline, carriage return, etc.)
[:blank:]    Just horizontal whitespace (space and tab only)
[:punct:]    Punctuation characters
[:print:]    Printable characters (including space)
[:graph:]    Printable characters (excluding space)
[:cntrl:]    Control characters (null, bell, backspace, etc.)
[:xdigit:]   Hexadecimal digits (0-9, a-f, A-F)

Now here's the thing that trips everyone up the first time.

The Double Bracket Rule

POSIX classes MUST go inside a character class bracket. The syntax looks like double brackets:

WRONG:   [:alpha:]
RIGHT:   [[:alpha:]]

The outer brackets are the character class. The inner [:...:] is the POSIX class name. This is not optional. If you write [:alpha:] without the outer brackets, many engines interpret it as a character class containing the characters : a l p h (which is absolutely not what you wanted). Some tools, GNU grep among them, reject it with an error instead.

+-----------------------------------------------------------+
|  ALWAYS double-bracket POSIX classes.                     |
|                                                           |
|    [[:alpha:]]    Correct. Matches any letter.            |
|    [:alpha:]      WRONG. Matches ':', 'a', 'l', 'p', 'h' |
+-----------------------------------------------------------+

When POSIX Beats Shorthand

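So why bother when we have \d and \w? Because \w is often hardcoded.

In many regex engines, \w is ASCII-only by default: exactly [a-zA-Z0-9_]. It knows English letters, digits, and underscore. That's it. It has no idea that "e" with an accent is still a letter. Other engines (Python 3 on text strings, .NET) make \w Unicode-aware by default, so check what yours does before relying on it.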

But [[:alpha:]] is locale-aware. On a system set to French or German or Spanish, it recognizes accented characters as letters. If you're processing international text, this matters a lot.

Pattern:  \w+
Text:     cafe
Matches:  cafe

Pattern:  \w+   (ASCII-only engine)
Text:     café
Matches:  caf   (stops at the accented character!)

Pattern:  [[:alpha:]]+   (locale-aware engine)
Text:     café
Matches:  café  (the whole word, accent included)
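
How \w treats the accented letter depends on the engine and its flags. As a minimal sketch in Python, whose re module makes \w Unicode-aware for text strings unless the re.ASCII flag is set, the same pattern produces both outcomes. Python is used here only for illustration; its re module does not support POSIX bracket classes, so the [[:alpha:]] case is not shown.

```python
import re

word = "caf\u00e9"  # "café" with a precomposed accented e

# re.ASCII restricts \w to [a-zA-Z0-9_], like an ASCII-only engine.
ascii_match = re.match(r"\w+", word, re.ASCII).group()

# Without the flag, Python's \w follows Unicode and accepts the accented letter.
unicode_match = re.match(r"\w+", word).group()

print(ascii_match)    # caf
print(unicode_match)  # café
```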

For English-only text on modern systems, \d and \w are fine. For anything international, lean on POSIX.

Combining POSIX Classes

You can mix POSIX classes with other characters inside brackets:

[[:alpha:]_]         Letters plus underscore
[[:digit:].-]        Digits, dots, and hyphens
[[:upper:][:digit:]] Uppercase letters and digits

You can negate them too:

[^[:digit:]]         Anything that's not a digit
[^[:space:]]         Anything that's not whitespace

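One trap when combining: the ^ only negates when it is the first character after the opening bracket. Anywhere else it is a literal caret.

[[:alpha:]^[:digit:]]  Letters, a literal ^, and digits. Nothing is negated.
[^[:digit:][:space:]]  Any character that is neither a digit nor whitespace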

Practical POSIX Examples

Validate a username that allows international characters:

^[[:alpha:]][[:alnum:]_]{2,19}$

That says: start with any letter (including accented), followed by 2 to 19 letters, digits, or underscores. Total length 3 to 20.

Match hex color codes:

#[[:xdigit:]]{6}\b

Matches #FF00AA, #1a2b3c, and friends.

Find lines that start with a printable character (no control chars):

^[[:print:]]

Match only visible characters (no spaces, no control):

[[:graph:]]+
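
To try these patterns in an engine without POSIX bracket support, one option is to expand each class name into an explicit range first. The sketch below does that for Python's re module. Assumptions to note: POSIX_ASCII and expand_posix are hypothetical names invented for this example, and the ranges are the ASCII-only meanings the classes have in the C locale, so the locale awareness described earlier is lost.

```python
import re

# ASCII-only stand-ins for the POSIX classes (their meaning in the C locale).
# POSIX_ASCII and expand_posix are hypothetical names for this sketch.
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "upper": "A-Z",
    "lower": "a-z",
    "space": r" \t\n\r\f\v",
    "blank": r" \t",
    "punct": r"!-/:-@\[-`{-~",
    "print": r"\x20-\x7e",
    "graph": r"\x21-\x7e",
    "cntrl": r"\x00-\x1f\x7f",
    "xdigit": "0-9A-Fa-f",
}

def expand_posix(pattern):
    """Swap each [:name:] for its ASCII range so Python's re can compile it."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

username = re.compile(expand_posix(r"^[[:alpha:]][[:alnum:]_]{2,19}$"))
hex_color = re.compile(expand_posix(r"#[[:xdigit:]]{6}\b"))

print(bool(username.match("ana_42")))  # True
print(bool(username.match("4na")))     # False: must start with a letter
print(hex_color.findall("#FF00AA and #1a2b3c, but not #12345"))
```

An unknown class name raises KeyError here, which is deliberate: a typo like [:alpa:] should fail loudly rather than match the wrong characters.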

ALTERNATION DONE RIGHT

You learned alternation in the basics class. cat|dog. Simple OR logic. Now let's talk about the gotchas and the power moves.

Alternation Scope

The pipe character has the lowest precedence of any regex operator. That means it splits the ENTIRE pattern unless you constrain it with grouping.

Pattern:  cat|dog food
Means:    "cat" OR "dog food"
NOT:      "cat food" OR "dog food"

If you want both options to share a suffix:

Pattern:  (cat|dog) food
Means:    "cat food" OR "dog food"

This is the most common alternation mistake. The pipe is greedier in scope than people expect. When in doubt, add parentheses.
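
A quick way to see the scope difference is to require the whole string to match. This is a small sketch using Python's re.fullmatch; the sample text is made up for illustration.

```python
import re

text = "cat food"

# Without grouping, the alternatives are "cat" and "dog food".
ungrouped = re.fullmatch(r"cat|dog food", text)

# With grouping, " food" is shared by both alternatives.
grouped = re.fullmatch(r"(cat|dog) food", text)

print(ungrouped)        # None
print(grouped.group())  # cat food
```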

Order Matters

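In backtracking engines (Perl, PCRE, Python, Java, JavaScript), the regex engine tries alternation options left to right and takes the first one that lets the overall match succeed. POSIX tools such as grep -E and awk follow a leftmost-longest rule instead, so order has no effect on what they match. Everything below is about the backtracking kind, where order has real consequences.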

Pattern:  Jan|January
Text:     January 15th
Matches:  Jan

Wait, what? You might expect "January" to match since it's right there in the text. But the engine tries "Jan" first. It matches. Done. Nothing after the alternation fails, so the engine never has a reason to try the second alternative.

The fix is simple. Put longer options first:

Pattern:  January|Jan
Text:     January 15th
Matches:  January

+-----------------------------------------------------------+
|  Rule: In alternation, put longer options before shorter   |
|  ones. The engine takes the first match it finds.          |
+-----------------------------------------------------------+

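This applies to any set of alternatives where one is a prefix of another:

WRONG:   http|https       (stops at "http" even when the text says "https")
RIGHT:   https|http       (tries "https" first)

WRONG:   do|done|doing    (stops at "do" every time)
RIGHT:   doing|done|do    (longest first)

If more pattern follows the alternation, backtracking can still rescue the longer option: (?:http|https):// does match "https://". Putting the longest option first means you never have to rely on that.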
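
The ordering effect is easy to reproduce in a backtracking engine. The sketch below uses Python's re module and adds a small helper, longest_first, which is a hypothetical name for this example: it sorts alternatives by length so that no prefix can shadow a longer option.

```python
import re

text = "January 15th"
short_first = re.search(r"Jan|January", text).group()
long_first = re.search(r"January|Jan", text).group()

print(short_first)  # Jan
print(long_first)   # January

def longest_first(words):
    """Build an alternation with longer alternatives placed first."""
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps any regex metacharacters in the words literal.
    return "|".join(re.escape(w) for w in ordered)

pattern = longest_first(["do", "done", "doing"])
print(pattern)                                  # doing|done|do
print(re.search(pattern, "doing well").group()) # doing
```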

Non-Capturing Groups for Scope

When you use alternation with a suffix or prefix, you need grouping. But you might not need to capture. That's where (?:...) comes in:

Pattern:  (?:red|green|blue) pill
Matches:  "red pill", "green pill", "blue pill"

No capture group created. Just scoping. This is a best practice when you don't need the matched alternative for backreferences.
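
The difference shows up in what the match object stores. A minimal sketch in Python, with a made-up sample sentence:

```python
import re

text = "take the red pill"

capturing = re.search(r"(red|green|blue) pill", text)
non_capturing = re.search(r"(?:red|green|blue) pill", text)

# Both match the same text...
print(capturing.group())      # red pill
print(non_capturing.group())  # red pill

# ...but only the capturing version stores the chosen alternative.
print(capturing.groups())      # ('red',)
print(non_capturing.groups())  # ()
```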

Character Classes vs Alternation

For single characters, use character classes. Always.

SLOW:    (a|e|i|o|u)
FAST:    [aeiou]

They match the same characters (the parenthesized form also creates a capture group), but character classes are optimized by the regex engine. A character class is a single operation: "is this character in this set?" Alternation is a backtracking search through multiple options.

For multi-character patterns, you need alternation:

[Monday]         Matches M, o, n, d, a, y (single chars!)
(?:Mon|Tue|Wed)  Matches the actual words

Character classes are always single characters. Never put whole words in brackets thinking they'll work like alternation.
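
A short sketch in Python makes the single-character behavior visible. It only checks what gets matched and makes no claim about timing.

```python
import re

# A character class picks out one character at a time.
vowels = re.findall(r"[aeiou]", "education")
print(vowels)   # ['e', 'u', 'a', 'i', 'o']

# [Monday] is a set of six letters, not the word "Monday".
letters = re.findall(r"[Monday]", "Monday")
print(letters)  # ['M', 'o', 'n', 'd', 'a', 'y']

# Whole words need alternation.
days = re.findall(r"(?:Mon|Tue|Wed)", "Mon Tue Wed")
print(days)     # ['Mon', 'Tue', 'Wed']
```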

Nested Alternation

You can factor shared text out of the alternatives and nest groups for structured matching:

(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day

That matches any day of the week. The alternation handles the variable prefix, and "day" is the shared suffix.

More complex nesting:

(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?)

Matches both abbreviated and full month names for Jan through April. Each month has an optional suffix handled by a nested group with ?.
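
Both patterns can be checked against a sample sentence. This is a sketch in Python; the test strings are invented for the example.

```python
import re

DAY = re.compile(r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day")
MONTH = re.compile(r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?)")

days = DAY.findall("Monday, Tuesday and someday")
months = MONTH.findall("Jan, February, Mar and April")

print(days)    # ['Monday', 'Tuesday']
print(months)  # ['Jan', 'February', 'Mar', 'April']
```

The optional suffix is greedy, so "February" is matched in full rather than stopping at "Feb".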

Alternation with Other Features

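Combine alternation with lookaround for precise matching:

\b(?:Dr|Mr|Mrs|Ms)\.(?=\s)

Matches honorifics like "Dr." or "Mrs." but only when they start at a word boundary and whitespace follows the period. The lookahead checks for the whitespace without consuming it. (Wrapping the boundary in a lookbehind, as in (?<=\b), adds nothing: \b is already zero-width.)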
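
Here is a small sketch of the honorific pattern in Python, written with a plain \b at the front (equivalent to wrapping it in a lookbehind, since \b consumes nothing). The sample sentence is made up.

```python
import re

HONORIFIC = re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.(?=\s)")

text = "Dr. Smith met Mrs. Jones and Mr.Brown"
found = HONORIFIC.findall(text)

# "Mr.Brown" is skipped: no whitespace follows the period.
print(found)  # ['Dr.', 'Mrs.']
```

Note that "Mrs." still matches even though "Mr" comes first in the alternation: after "Mr" the engine needs a period, finds "s", and backtracks to try "Mrs".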

Combine with backreferences:

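\b(\w+)\s+\1\b

Matches duplicated words like "the the" or "is is". The word boundaries keep it from firing inside longer words: without them, "this is" would match as "is is". There's no alternation here yet, only a capture group and a backreference, but in more complex patterns you might alternate between different duplication patterns.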
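
The sketch below, in Python, shows why word boundaries matter for the duplicated-word pattern. It compares an unanchored version against one with \b on both ends; the sample strings are invented.

```python
import re

UNANCHORED = re.compile(r"(\w+)\s+\1")
ANCHORED = re.compile(r"\b(\w+)\s+\1\b")

# Without boundaries, the tail of "this" pairs up with "is".
false_hit = UNANCHORED.search("this is")
print(false_hit.group())  # is is

# With boundaries, only genuine repeats are reported.
print(ANCHORED.search("this is"))  # None

text = "this is is the the test"
repeats = [m.group() for m in ANCHORED.finditer(text)]
print(repeats)  # ['is is', 'the the']
```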

Real-World Example

Parse log levels from various log formats:

\b(?:DEBUG|INFO|WARN(?:ING)?|ERROR|FATAL|CRIT(?:ICAL)?)\b

This handles the common variations: WARN and WARNING, CRIT and CRITICAL. Each pair is written as a short form plus an optional suffix, so the ordering problem never comes up: the ? is greedy and tries the longer form first. Word boundaries on both sides ensure you don't match these words inside other words.

Test it (the -P flag requires GNU grep built with PCRE support):

grep -P '\b(?:DEBUG|INFO|WARN(?:ING)?|ERROR|FATAL|CRIT(?:ICAL)?)\b' server.log
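
The same pattern can be exercised from a script. This is a sketch in Python; the log lines are hypothetical samples, not output from a real server.

```python
import re

LEVEL = re.compile(r"\b(?:DEBUG|INFO|WARN(?:ING)?|ERROR|FATAL|CRIT(?:ICAL)?)\b")

# Hypothetical log lines for illustration.
lines = [
    "12:00:01 INFO service started",
    "12:00:02 WARNING disk at 91%",
    "12:00:03 WARN retrying",
    "12:00:04 CRITICAL out of memory",
    "12:00:05 INFORMATION only",
]

levels = []
for line in lines:
    m = LEVEL.search(line)
    levels.append(m.group() if m else None)

# The last line yields None: \b rejects "INFO" inside "INFORMATION".
print(levels)
```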