REGEX MEETS AI

You might think regex is old-school now that AI can do everything. Actually, regex and AI are partners, not competitors. They solve different parts of the same problems, and the people who understand both have a massive advantage over people who only know one.

Let's talk about how they work together in practice.

AI Can Write Regex For You

This is the obvious one. You can ask an LLM to generate patterns.

Prompt: "Write a regex that matches IPv4 addresses"

LLM says: \b(?:\d{1,3}\.){3}\d{1,3}\b

And that works. It matches things like 192.168.1.1 and 10.0.0.255. But here's the question: does it ONLY match valid IP addresses? What about 999.999.999.999? That matches the pattern but it's not a real IP address.

The LLM gave you a starting point. Whether it's correct enough for your use case depends on understanding what the pattern does. Let's break it down using what we learned tonight:

\b                   word boundary
(?:\d{1,3}\.){3}     non-capturing group: 1-3 digits then a dot,
                      repeated exactly 3 times
\d{1,3}              1-3 more digits (the last octet)
\b                   word boundary

It matches the shape of an IP but doesn't validate the values. For strict validation, you'd need the octet-checking pattern we'll build in the exercises.
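To see the gap between shape and value, here's a minimal Python sketch. It is not the octet-checking pattern from the exercises; it takes a different route and lets the regex check the shape while ordinary integer comparison checks the range. The names IPV4_SHAPE and is_valid_ipv4 are made up for illustration.

```python
import re

# The shape-only pattern from above: four groups of 1-3 digits.
IPV4_SHAPE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def is_valid_ipv4(text: str) -> bool:
    """Return True if text is exactly one dotted quad with octets 0-255."""
    if not IPV4_SHAPE.fullmatch(text):
        return False
    # The regex checked the shape; integer comparison checks the values.
    return all(int(octet) <= 255 for octet in text.split("."))


for candidate in ["192.168.1.1", "10.0.0.255", "999.999.999.999"]:
    shape_ok = IPV4_SHAPE.fullmatch(candidate) is not None
    print(candidate, "shape:", shape_ok, "valid:", is_valid_ipv4(candidate))
```

All three candidates pass the shape check, but only the first two pass the value check.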

+-----------------------------------------------------------+
|  The AI generates. You validate.                          |
|  Can't validate what you don't understand.                |
+-----------------------------------------------------------+

This is exactly why tonight matters. AI makes regex easier to produce, but you still need to read it, verify it, and know when it's wrong. The people who skip learning regex and just paste what ChatGPT gives them are the ones who ship broken validation into production.

Another example. Ask an LLM to write a regex for email validation:

Prompt: "Write a regex to validate email addresses"

You'll get something like:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Is that right? Reading it with what you know now:

^                       start of string
[a-zA-Z0-9._%+-]+       one or more valid username chars
@                        literal @
[a-zA-Z0-9.-]+          one or more domain chars
\.                       literal dot
[a-zA-Z]{2,}            two or more letters (TLD)
$                        end of string

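Decent for most cases, and it does handle plus addressing (mike+news@example.com) because + is in the username character class. But it's too permissive in places: it accepts consecutive dots like a..b@example.com. And it's too strict in others: the letters-only TLD rejects internationalized TLDs, which show up in punycode form such as xn--p1ai. You can evaluate this because you can READ it.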
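Reading the pattern is step one. Step two is running it against a few probe addresses to confirm what it really accepts. A small sketch in Python (the sample addresses are invented for the test):

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

samples = [
    "mike@example.com",       # ordinary address
    "mike+news@example.com",  # plus addressing
    "a..b@example.com",       # consecutive dots in the username
    "mike@example.xn--p1ai",  # punycode form of an internationalized TLD
    "mike@localhost",         # no dot in the domain
]

results = {address: EMAIL.fullmatch(address) is not None for address in samples}
for address, accepted in results.items():
    print(f"{address:24} {'match' if accepted else 'no match'}")
```

The first three match and the last two don't, which tells you where the pattern is looser or stricter than you might want.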

Parsing AI Output With Regex

This is where regex becomes essential in AI workflows. LLMs return freeform text. Sometimes you need to extract structured data from that text. Regex is the tool for the job.

Extract JSON from an LLM response that has conversation around it:

LLM output:  "Here is the data you requested: {"name": "Mike", "age": 42}
              I hope that helps!"

Pattern:  \{[^{}]*\}
Matches:  {"name": "Mike", "age": 42}

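For nested JSON, you'd need a more careful approach (or a proper JSON parser), but for simple single-level objects this works well. One catch: a { or } inside a string value will cut the match short, so run whatever you extract through a JSON parser before trusting it.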
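A reasonable way to combine the two is to let the regex find the candidate span and let a real parser decide whether it's JSON. A minimal sketch, assuming Python's standard json module and the sample response from above:

```python
import json
import re

llm_output = (
    'Here is the data you requested: {"name": "Mike", "age": 42}\n'
    "I hope that helps!"
)

# Find the first brace-delimited span that contains no nested braces.
match = re.search(r"\{[^{}]*\}", llm_output)

data = None
if match:
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        # The span looked like an object but was not valid JSON.
        data = None

print(data)
```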

Extract code blocks from markdown-formatted AI output:

Pattern:  ```(\w*)\n([\s\S]*?)```

That captures:
  Group 1: the language name (python, bash, etc.)
  Group 2: the code content

LLM output:  "Here's a Python example:
              ```python
              print('hello')
              ```"

Match group 1: python
Match group 2: print('hello')  (plus the newline before the closing fence, so trim it)
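Here's the same extraction as a Python sketch. The FENCE variable is only there so the three backticks can be written without typing them literally; it is not part of the technique.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly for readability

CODE_BLOCK = re.compile(FENCE + r"(\w*)\n([\s\S]*?)" + FENCE)

llm_output = (
    "Here's a Python example:\n"
    + FENCE + "python\n"
    + "print('hello')\n"
    + FENCE
)

# findall returns one (language, code) tuple per fenced block.
blocks = [
    (lang, code.rstrip("\n"))  # group 2 ends with the newline before the fence
    for lang, code in CODE_BLOCK.findall(llm_output)
]
print(blocks)
```

Because \w* only matches letters, digits, and underscores, a language tag like c++ would not be captured in full by this pattern.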

Parse confidence scores from AI analysis text:

Pattern:  [Cc]onfidence:?\s*(\d+(?:\.\d+)?)\s*%?

LLM output: "The image shows a cat. Confidence: 94.5%"
Group 1:    94.5
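In code you usually want a number, not a string. A minimal sketch, assuming the score is meant as a percentage between 0 and 100 (that range check and the function name parse_confidence are assumptions, not something the pattern enforces):

```python
import re
from typing import Optional

CONFIDENCE = re.compile(r"[Cc]onfidence:?\s*(\d+(?:\.\d+)?)\s*%?")


def parse_confidence(text: str) -> Optional[float]:
    """Return the confidence as a float, or None if missing or out of range."""
    match = CONFIDENCE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    # Assumption: scores are percentages, so anything above 100 is rejected.
    return value if 0.0 <= value <= 100.0 else None


print(parse_confidence("The image shows a cat. Confidence: 94.5%"))
```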

Pull structured lists from AI responses:

Pattern:  ^\d+\.\s+(.+)$  (with /m flag)

LLM output: "Here are the results:
             1. First item found
             2. Second item found
             3. Third item found"

Group 1 matches: "First item found", "Second item found",
                 "Third item found"

Every time you build a pipeline that calls an API and processes the response, these patterns come into play.
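As one example, the numbered-list pattern translates directly to Python, where the /m flag is spelled re.MULTILINE. A short sketch using the sample response from above:

```python
import re

llm_output = """Here are the results:
1. First item found
2. Second item found
3. Third item found"""

# MULTILINE makes ^ and $ match at every line start and end,
# not just at the start and end of the whole string.
items = re.findall(r"^\d+\.\s+(.+)$", llm_output, flags=re.MULTILINE)
print(items)
```

Without the flag, ^ would only match at the very start of the response and the list would come back empty.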

Cleaning Data For AI

Garbage in, garbage out. Before you feed text to an AI model, you often need to clean it up. Regex is the scalpel for this.

Strip HTML tags before sending text to an LLM:

Pattern:  <[^>]+>
Replace:  (empty string)

Input:   "<p>Hello <b>world</b></p>"
Output:  "Hello world"

Normalize whitespace so the model doesn't waste tokens on it:

Pattern:  \s+
Replace:  (single space)

Input:   "too    many     spaces   here"
Output:  "too many spaces here"

Remove URLs from text to focus the AI on content:

Pattern:  https?://\S+
Replace:  [URL]

Input:   "Check out https://example.com/page?id=5 for details"
Output:  "Check out [URL] for details"

Clean up control characters and garbage encoding:

Pattern:  [^\x20-\x7E\n\t]
Replace:  (empty string)

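That keeps only printable ASCII, newlines, and tabs. Everything else gets stripped, and that includes accented letters, non-Latin scripts, and emoji, so only use it when you know the text is supposed to be plain ASCII. To drop just the control characters, use [\x00-\x08\x0B\x0C\x0E-\x1F\x7F] instead.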
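These cleanup steps chain together naturally. Here's a minimal Python sketch that applies the four patterns above in sequence; the function name clean_for_llm and the sample input are invented for illustration, and the ordering is one sensible choice rather than the only one.

```python
import re


def clean_for_llm(text: str) -> str:
    """Apply the cleanup patterns in order and return the tidied text."""
    # Tags go first so a URL followed by "</p>" does not swallow the tag.
    text = re.sub(r"<[^>]+>", "", text)
    # Mask URLs with a placeholder.
    text = re.sub(r"https?://\S+", "[URL]", text)
    # Drop everything outside printable ASCII, newline, and tab.
    text = re.sub(r"[^\x20-\x7E\n\t]", "", text)
    # Collapse whitespace runs (this also flattens newlines into spaces).
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "<p>Check   out https://example.com/page?id=5 for\x00 details</p>"
print(clean_for_llm(raw))
```

For the sample input this prints "Check out [URL] for details".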

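Pull the individual fields out of a CSV line before analysis:

Pattern:  (?:^|,)(?:"([^"]*)"|([^",]*))
Input:    name,"Mike Smith",42,"Portland, OR"
Fields:   name | Mike Smith | 42 | Portland, OR

Quoted fields land in group 1 and unquoted fields in group 2. The two alternatives matter: a single [^",]* would stop at the comma inside "Portland, OR" and split that field in two. For CSV with escaped quotes or embedded newlines, use a real CSV parser.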
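Quoted fields with commas inside them are where hand-rolled CSV patterns tend to go wrong, so it's worth knowing the parser route too. A minimal sketch using Python's standard csv module on the same input line:

```python
import csv
import io

line = 'name,"Mike Smith",42,"Portland, OR"'

# csv.reader expects an iterable of lines, so wrap the string in StringIO.
fields = next(csv.reader(io.StringIO(line)))
print(fields)
```

The comma inside "Portland, OR" stays part of the fourth field, and the surrounding quotes are removed.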

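Three of these as Perl one-liners:

perl -pe 's/<[^>]+>//g' input.html > clean.txt
perl -pe 's/\h+/ /g' messy.txt > normalized.txt
perl -pe 's/https?:\/\/\S+/[URL]/g' text.txt > cleaned.txt

Note the \h (horizontal whitespace) in the second one. With -p, Perl processes one line at a time, newline included, so \s+ would replace every line ending with a space and glue the whole file into a single line.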

Validating AI Output

AI models hallucinate. They generate confident-sounding nonsense. Regex can serve as a sanity check on what comes back.

Check that an AI-generated email looks valid:

if ($email =~ /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$/) {
    # Probably a real email format
} else {
    # AI hallucinated garbage, discard it
}

Verify that AI-extracted dates are in a valid format:

Pattern:  ^\d{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01])$

That checks for YYYY-MM-DD with valid month and day ranges. Not just any four digits and any two digits, but actually valid months 01 through 12 and valid days 01 through 31. It doesn't know how long each month is, though, so 2023-02-31 still passes.
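A regex can check the ranges but not the calendar. One way to cover both, sketched in Python: let the pattern check the format, then let datetime.date check that the day exists. The non-capturing groups are changed to capturing groups here so the parts can be pulled out, and the function name is_real_date is made up.

```python
import re
from datetime import date

# Same ranges as the pattern above, with capturing groups for the parts.
ISO_DATE = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")


def is_real_date(text: str) -> bool:
    """Return True if text is YYYY-MM-DD and names a day that exists."""
    match = ISO_DATE.fullmatch(text)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for dates like Feb 31
    except ValueError:
        return False
    return True


for candidate in ["2024-02-29", "2023-02-31", "2023-13-01"]:
    print(candidate, is_real_date(candidate))
```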

Ensure US phone numbers extracted by AI look right (the pattern assumes the 3-3-4 North American format, with an optional +1):

Pattern:  ^(?:\+1\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$

Validate that a URL extracted by AI is well-formed:

Pattern:  ^https?://[a-zA-Z0-9][\w.-]*\.[a-zA-Z]{2,}(?:/\S*)?$

None of these are foolproof. They check format, not truth: a well-formed email address that belongs to nobody still passes. But they catch the malformed output. If the AI says someone's email is "yes_the_email_is_real@" or their phone number is "approximately 5", regex catches that instantly.
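In a real pipeline you would typically run every extracted field through its check in one pass. A minimal Python sketch, assuming the AI output has already been parsed into a dict (the field names, the VALIDATORS table, and invalid_fields are all hypothetical):

```python
import re

# One format check per field, using the patterns from this section.
VALIDATORS = {
    "email": re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"),
    "phone": re.compile(r"^(?:\+1\s?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}$"),
    "url": re.compile(r"^https?://[a-zA-Z0-9][\w.-]*\.[a-zA-Z]{2,}(?:/\S*)?$"),
}


def invalid_fields(record: dict) -> list:
    """Return the names of fields whose values fail their format check."""
    return [
        field
        for field, pattern in VALIDATORS.items()
        if field in record and not pattern.fullmatch(str(record[field]))
    ]


extracted = {
    "email": "yes_the_email_is_real@",
    "phone": "approximately 5",
    "url": "https://example.com/page?id=5",
}
print(invalid_fields(extracted))
```

Here the email and phone fields are flagged and the URL passes.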

Building AI Pipelines

The real power is in the full pipeline. Raw data comes in. Regex cleans it. AI processes it. Regex extracts the results.

+-----------------------------------------------------------+
|                                                           |
|  Raw Data                                                 |
|     |                                                     |
|     v                                                     |
|  [ Regex Cleanup ]  strip HTML, normalize whitespace,     |
|     |                remove noise                         |
|     v                                                     |
|  [ AI Processing ]  classify, summarize, extract          |
|     |                                                     |
|     v                                                     |
|  [ Regex Extract ]  pull structured data from AI output   |
|     |                                                     |
|     v                                                     |
|  [ Regex Validate ] sanity-check the extracted data       |
|     |                                                     |
|     v                                                     |
|  Clean Results                                            |
|                                                           |
+-----------------------------------------------------------+

Here's a concrete example. You have thousands of server log lines and you want AI to classify them by severity and suggest fixes.

Step 1: Regex extracts the relevant log lines:

grep -P '\b(?:ERROR|WARN|FATAL)\b' server.log > problems.txt

Step 2: Regex cleans and formats them for the AI:

perl -pe 's/^\d{4}-\d{2}-\d{2}T[\d:]+\s+//' problems.txt > cleaned.txt

Step 3: Feed cleaned text to an LLM for classification.

Step 4: Regex extracts the AI's classifications:

perl -ne 'print "$1: $2\n" if /(\w+)\s*[:-]\s*(.+)/' ai_output.txt

Step 5: Regex validates the output format:

perl -ne 'print if /^(?:HIGH|MEDIUM|LOW):\s+.{10,}$/' results.txt
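The same five steps can live in one script. Here's a rough Python sketch under some loud assumptions: fake_llm is a hypothetical stand-in that returns canned text instead of calling a model, the log lines are invented, and steps 4 and 5 are folded into a single pattern that both extracts and validates.

```python
import re

PROBLEM = re.compile(r"\b(?:ERROR|WARN|FATAL)\b")
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:]+\s+")
RESULT = re.compile(r"^(HIGH|MEDIUM|LOW):\s+(.{10,})$")


def fake_llm(lines):
    """Hypothetical stand-in for a real model call; ignores its input."""
    return (
        "Sure! Here is my assessment:\n"
        "HIGH: Database connection refused, check the pool size\n"
        "LOW: Cache miss rate is elevated, consider warming the cache\n"
        "maybe: ok\n"
    )


def run_pipeline(log_text):
    # Step 1: keep only lines that mention a problem level.
    problems = [ln for ln in log_text.splitlines() if PROBLEM.search(ln)]
    # Step 2: strip the leading ISO timestamp.
    cleaned = [TIMESTAMP.sub("", ln) for ln in problems]
    # Step 3: hand the cleaned lines to the model (stubbed here).
    reply = fake_llm(cleaned)
    # Steps 4 and 5: extract "LEVEL: text" lines, keeping well-formed ones.
    results = []
    for ln in reply.splitlines():
        match = RESULT.match(ln.strip())
        if match:
            results.append((match.group(1), match.group(2)))
    return cleaned, results


log = (
    "2024-05-01T12:00:01 INFO Service started\n"
    "2024-05-01T12:00:05 ERROR Database connection refused\n"
    "2024-05-01T12:00:09 WARN Cache miss rate at 85%\n"
)
cleaned, results = run_pipeline(log)
print(cleaned)
print(results)
```

The chatty opening line and the malformed "maybe: ok" line from the stubbed reply are both dropped, which is the validation step doing its job.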

Each step is simple. The pipeline is powerful. And regex is the glue holding the whole thing together.

The Bottom Line

Regex and AI aren't competing technologies. Regex is precise, deterministic, and fast. AI is flexible, probabilistic, and smart. Use regex where you need exact pattern matching. Use AI where you need understanding and generation. Use both together where you need a pipeline that's both smart and reliable.

The people building serious AI tools right now all know regex. It's not a coincidence.