Pattern Matching and Regular Expressions

🔍 Topic 10 — Pattern Matching and Regular Expressions

Bash uses two related but distinct pattern systems. Glob patterns (also called shell patterns) are used for filename matching and the case statement — they use *, ?, and [...]. Regular expressions (regex) are used inside [[ =~ ]] and tools like grep, sed, and awk — they are far more expressive. This chapter covers both systems thoroughly, explains where each one applies, and shows the key patterns you'll reach for constantly in real scripts.

1 — Glob Patterns (Shell Wildcards)

Globs are expanded by the shell itself before any command sees them. They match filenames in the filesystem — or strings in [[ == ]] and case.

Pattern	Matches	Example
`*`	Any string of zero or more characters (not including `/`)	`*.log` → all `.log` files
`?`	Exactly one character	`file?.txt` → `file1.txt`, `fileA.txt`
`[abc]`	One character from the set	`[abc].sh` → `a.sh`, `b.sh`, `c.sh`
`[a-z]`	One character in the range	`[0-9].txt` → single-digit filenames
`[^abc]`	One character NOT in the set (`[!abc]` is the portable POSIX spelling)	`[^0-9]*` → files not starting with a digit
`**`	Any path including `/` (requires `globstar` option)	`**/*.py` → all `.py` files recursively

🐧 Glob patterns in action

# Filename expansion (pathname expansion)
ls *.sh                         # all shell scripts
ls report_2026-??.csv            # ?? matches any two characters: report_2026-01.csv, report_2026-12.csv, ...
ls [A-Z]*.txt                   # text files starting with a capital letter

# Recursive glob — must enable globstar first
shopt -s globstar
for f in **/*.py; do
    echo "$f"
done

# Include dotfiles (hidden files) in globs
shopt -s dotglob
ls *                             # now includes .hidden files

# Control what happens when a glob matches nothing
# (default: the pattern is passed through as a literal string)
# nullglob: expands to nothing if no match; failglob: reports an error instead
shopt -s nullglob
files=( *.csv )
[[ "${#files[@]}" -eq 0 ]] && echo "No CSV files found"

# Globs in case — match strings, not filenames
filename="archive.tar.gz"
case "$filename" in
    *.tar.gz)  echo "gzipped tarball" ;;
    *.zip)     echo "zip archive" ;;
    *.sh)      echo "shell script" ;;
    *)         echo "unknown type" ;;
esac
gzipped tarball
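
The effect of nullglob is easiest to see in a directory with no matching files. The following is a minimal sketch, assuming a throwaway directory created with mktemp; the variable names default_words and nullglob_words are made up for this illustration.

```shell
#!/bin/bash
demo_dir=$(mktemp -d)    # empty directory: no *.csv files exist here

shopt -u nullglob
default_words=( "$demo_dir"/*.csv )    # stays as one literal, unexpanded pattern

shopt -s nullglob
nullglob_words=( "$demo_dir"/*.csv )   # expands to an empty list

shopt -u nullglob        # restore the default for the rest of the script
rmdir "$demo_dir"

echo "default : ${#default_words[@]} word(s)"     # default : 1 word(s)
echo "nullglob: ${#nullglob_words[@]} word(s)"    # nullglob: 0 word(s)
```

Without nullglob, a loop such as for f in *.csv would run once with the literal pattern as its value, which is why scripts that loop over globs often enable the option first.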

⚠️ Globs are not expanded in a plain variable assignment: file=*.txt stores the literal string *.txt. Use an array instead: files=(*.txt) expands into one element per matching file. When looping, for f in *.txt is fine unquoted, but once the filename is in a variable, use "$f" everywhere.
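
The difference between a scalar and an array assignment can be checked directly. This is a small sketch that assumes a temporary directory holding two sample files (one with a space in its name); the file names are invented for the demo.

```shell
#!/bin/bash
# Work in a throwaway directory so the demo does not depend on existing files.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch "a.txt" "b c.txt"

pattern=*.txt          # scalar assignment: no filename expansion happens here
files=( *.txt )        # array assignment: the glob expands, one element per file

echo "scalar holds : $pattern"        # scalar holds : *.txt
echo "array length : ${#files[@]}"    # array length : 2
for f in "${files[@]}"; do
    echo "file: $f"    # quoting keeps 'b c.txt' as a single word
done

cd - >/dev/null || exit 1
rm -rf "$demo_dir"
```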

2 — Extended Globs

Extended globs add five powerful pattern operators. Enable them with shopt -s extglob. They work in filename expansion, case, and [[ == ]].

Pattern	Meaning	Example
`?(pat)`	Zero or one occurrence of pat	`file?(s).txt` → `file.txt` or `files.txt`
`*(pat)`	Zero or more occurrences of pat	`*(0)1` → `1`, `01`, `001`
`+(pat)`	One or more occurrences of pat	`+([0-9])` → one or more digits
`@(pat)`	Exactly one occurrence of pat (alternation)	`@(jpg|png|gif)` → exactly one of those
`!(pat)`	Anything except pat	`!(*.log)` → all files except `.log`

🐧 Extended glob examples

shopt -s extglob

# Match image files with one extension from a list
for img in *.@(jpg|jpeg|png|gif|webp); do
    echo "Image: $img"
done

# List everything EXCEPT backup files
ls !(*.bak|*.tmp)

# Match version strings: v1, v12, v123 (but not v)
ver="v42"
[[ "$ver" == v+([0-9]) ]] && echo "valid version"
valid version

# Strip extension using extended glob in parameter expansion
file="photo.backup.tar.gz"
echo "${file%%+(.+([a-z]))}"      # strip all dot-extensions
photo

# case with extended globs
input="yes"
case "$input" in
    @(y|yes|Y|YES))  echo "Confirmed" ;;
    @(n|no|N|NO))   echo "Declined" ;;
    *)               echo "Unknown response" ;;
esac
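
Extended globs also combine well with parameter expansion. As one possible illustration, the sketch below trims leading and trailing whitespace; trim is a hypothetical helper name, not a bash builtin.

```shell
#!/bin/bash
shopt -s extglob    # +(...) below is only understood with extglob enabled

# trim is a hypothetical helper for this example.
trim() {
    local s="$1"
    s="${s##+([[:space:]])}"   # remove the longest leading run of whitespace
    s="${s%%+([[:space:]])}"   # remove the longest trailing run of whitespace
    printf '%s' "$s"
}

raw="   hello world   "
printf '[%s]\n' "$(trim "$raw")"    # [hello world]
```

Whitespace inside the string is left alone, because ## and %% only remove text anchored at the start or end of the value.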

3 — Pattern Matching Inside `[[ ]]`

The double-bracket [[ ]] construct supports two kinds of matching: glob patterns with ==, and regular expressions with =~.

🐧 [[ == ]] glob matching vs [[ =~ ]] regex matching

filename="report_2026-06.csv"

# == uses a GLOB pattern (not regex) — right side is unquoted
[[ "$filename" == report_*.csv ]]  && echo "glob match"
glob match

# If you QUOTE the pattern it becomes a literal string comparison
[[ "$filename" == "report_*.csv" ]] && echo "this won't print — literal * char"

# =~ uses EXTENDED REGEX — right side is also unquoted
[[ "$filename" =~ report_[0-9]{4}-[0-9]{2}\.csv ]] && echo "regex match"
regex match

# Practical: validate an IPv4 address
ip="192.168.1.105"
octet='([0-9]{1,3})'
if [[ "$ip" =~ ^${octet}\.${octet}\.${octet}\.${octet}$ ]]; then
    echo "Looks like an IP address"
fi

# Store the regex in a variable for readability (do NOT quote it)
email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
[[ "test@example.com" =~ $email_re ]] && echo "valid email format"
valid email format

Store complex regex patterns in a variable and use $var unquoted on the right side of =~. Quoting the right side of either == or =~ makes it a literal string comparison — the pattern metacharacters are ignored.
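
The quoting rule can be verified with the same pattern used both ways. This is a minimal sketch; the version string and variable names are assumptions chosen for the demo.

```shell
#!/bin/bash
version="v1.2.3"
re='^v[0-9]+\.[0-9]+\.[0-9]+$'

# Unquoted: $re is interpreted as an extended regular expression.
if [[ "$version" =~ $re ]]; then unquoted="match"; else unquoted="no match"; fi

# Quoted: "$re" is treated as literal text, so bash searches for the
# characters ^v[0-9]+... themselves inside "v1.2.3".
if [[ "$version" =~ "$re" ]]; then quoted="match"; else quoted="no match"; fi

echo "unquoted: $unquoted"    # unquoted: match
echo "quoted  : $quoted"      # quoted  : no match
```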

4 — Capturing Groups with `BASH_REMATCH`

After a successful =~ match, bash populates the array BASH_REMATCH: index [0] is the whole match, and indices [1], [2]… are the captured groups (parenthesised parts of the pattern). The next =~ replaces its contents, so copy any groups you need before matching again.

🐧 Using BASH_REMATCH to extract parts of a string

# Extract year, month, day from a date string
date_str="Today is 2026-06-09 and it's Tuesday."
date_re='([0-9]{4})-([0-9]{2})-([0-9]{2})'

if [[ "$date_str" =~ $date_re ]]; then
    echo "Full match : ${BASH_REMATCH[0]}"   # → 2026-06-09
    echo "Year       : ${BASH_REMATCH[1]}"   # → 2026
    echo "Month      : ${BASH_REMATCH[2]}"   # → 06
    echo "Day        : ${BASH_REMATCH[3]}"   # → 09
fi

# Parse a URL into components
url="https://api.example.com:8443/v2/users?page=2"
url_re='^(https?)://([^:/]+)(:([0-9]+))?(/[^?]*)(\?.*)?$'

[[ "$url" =~ $url_re ]] && {
    echo "Scheme : ${BASH_REMATCH[1]}"    # → https
    echo "Host   : ${BASH_REMATCH[2]}"    # → api.example.com
    echo "Port   : ${BASH_REMATCH[4]}"    # → 8443
    echo "Path   : ${BASH_REMATCH[5]}"    # → /v2/users
    echo "Query  : ${BASH_REMATCH[6]}"    # → ?page=2
}
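
A common way to use BASH_REMATCH in a larger script is to wrap the match in a function and copy the groups into named variables straight away. The sketch below assumes a simple KEY=value line format; parse_kv, key and value are hypothetical names.

```shell
#!/bin/bash
# Assumed format for this illustration: an identifier, "=", then any text.
kv_re='^([A-Za-z_][A-Za-z0-9_]*)=(.*)$'

# parse_kv is a hypothetical helper; it sets the globals key and value.
parse_kv() {
    local line="$1"
    [[ "$line" =~ $kv_re ]] || return 1
    # Copy the groups immediately: a later =~ would replace BASH_REMATCH.
    key="${BASH_REMATCH[1]}"
    value="${BASH_REMATCH[2]}"
}

if parse_kv "TIMEOUT=30"; then
    echo "key=$key value=$value"    # key=TIMEOUT value=30
fi

parse_kv "not a pair" || echo "no match"    # no match
```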

5 — Regular Expression Fundamentals

Regex is its own mini-language. Here are the building blocks you need to know. Bash's =~ uses Extended Regular Expressions (ERE) — the same dialect as grep -E and awk.

Anchors

^   start of string
$   end of string
\b  word boundary
     (grep/sed only)

# Match whole string
^hello$  → only "hello"

# Start only
^error   → "error" at start

Character classes

[abc]   one of a, b, c
[^abc]  not a, b, or c
[a-z]   lowercase letter
[0-9]   digit
.       any char (not \n)
\d      digit (PCRE only, e.g. grep -P)

# POSIX classes (portable); only valid inside a bracket expression
[[:alpha:]]  letters
[[:digit:]]  digits
[[:space:]]  whitespace
[[:alnum:]]  letters+digits

Quantifiers

?      0 or 1
*      0 or more
+      1 or more
{n}    exactly n
{n,}   n or more
{n,m}  between n and m

# Greedy vs lazy
# Default is greedy (match as much as possible)
.*   greedy
# Lazy quantifiers are not available in BRE or ERE;
# use Perl-style grep -P for .*?

Groups & alternation

(abc)    group / capture
(?:abc)  non-capture group
             (PCRE only; not in BRE or ERE)
a|b      a OR b
(cat|dog) cat OR dog

# Alternation examples
^(ERROR|WARN|INFO)
# matches log level prefixes

Escaping special chars

# In ERE these need escaping
# to be treated as literals:
\.  \*  \+  \?  \(  \)
\[  \]  \{  \}  \^  \$  \|

# Match a literal dot
192\.168\.1\.[0-9]+

# Match a literal parenthesis
\(deprecated\)

Useful shortcuts

# Integer: one or more digits
[0-9]+    or    [[:digit:]]+

# Word: letters/digits/underscore
[a-zA-Z0-9_]+

# Optional sign + integer
-?[0-9]+

# Whitespace (one or more)
[[:space:]]+

# Blank line
^[[:space:]]*$
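
Since ERE has no lazy quantifier, the usual workaround for greedy matching is a negated character class. The following sketch shows both behaviours with =~; the sample line and variable names are assumptions made for the demo.

```shell
#!/bin/bash
line='name="alice" role="admin"'

# Greedy: .* runs from the first double quote to the LAST one on the line.
greedy_re='"(.*)"'
[[ "$line" =~ $greedy_re ]] && greedy="${BASH_REMATCH[1]}"

# ERE workaround: [^"]* cannot cross a quote, so it stops at the first closing one.
bounded_re='"([^"]*)"'
[[ "$line" =~ $bounded_re ]] && bounded="${BASH_REMATCH[1]}"

echo "greedy : $greedy"     # greedy : alice" role="admin
echo "bounded: $bounded"    # bounded: alice
```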

BRE vs ERE — What Changes?

Two regex dialects in Linux:

BRE (Basic Regular Expressions): used by default by grep and sed. The metacharacters + ? | ( ) { } are literal unless you escape them with \. So + is a literal plus; \+ means "one or more". Note that \( \) and \{ \} are standard POSIX BRE, while \+, \? and \| are GNU extensions.

ERE (Extended Regular Expressions) — used by grep -E, awk, and bash's =~. The metacharacters + ? | ( ) { } are special by default; you escape with \ to match them literally.

Rule of thumb: always use ERE. Pass -E to grep and -E to sed. Less escaping, more readable.
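
The same interval expression behaves differently in the two dialects. Here is a small sketch using grep on three sample lines; the sample data and variable names are invented for this comparison.

```shell
#!/bin/bash
sample=$'ab\naab\nb'

# BRE: braces are literal unless escaped with a backslash.
bre_hits=$(printf '%s\n' "$sample" | grep -c 'a\{2\}b' || true)

# ERE: the same metacharacters are special without escaping.
ere_hits=$(printf '%s\n' "$sample" | grep -Ec 'a{2}b' || true)

# In BRE an unescaped { is an ordinary character, so this searches for the
# literal text a{2}b, which does not occur in the sample.
literal_hits=$(printf '%s\n' "$sample" | grep -c 'a{2}b' || true)

echo "BRE escaped : $bre_hits"        # BRE escaped : 1
echo "ERE         : $ere_hits"        # ERE         : 1
echo "BRE literal : $literal_hits"    # BRE literal : 0
```

The || true guards are there because grep exits with status 1 when nothing matches, which would otherwise stop a script running under set -e.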

6 — Regex in `grep`

🐧 grep -E (Extended Regex) patterns

# Match lines starting with a log level
grep -E "^(ERROR|WARN|CRITICAL)" app.log

# Extract email addresses from a file
grep -Eo "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" file.txt

# Find lines with a 4-digit year between 1900 and 2099
# (\b stops longer digit runs such as 120199 from matching; GNU grep extension)
grep -E "\b(19|20)[0-9]{2}\b" dates.txt

# Match lines containing BOTH "error" AND "disk" (chained grep)
grep -i "error" syslog | grep -i "disk"

# Match lines of 10 or more characters
grep -E ".{10,}" file.txt

# Find lines with repeated words ("the the", "is is", etc.)
# (back-references such as \1 in ERE are a GNU grep extension)
grep -E "\b([a-z]+) \1\b" essay.txt

# Extract IPv4 addresses
grep -Eo "([0-9]{1,3}\.){3}[0-9]{1,3}" access.log | sort -u

# Lines that do NOT match a pattern
grep -Ev "^(#|$)" config.txt      # exclude comments and blank lines
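
In scripts, grep is often used for its exit status rather than its output. The sketch below assumes a small sample log written to a temporary file; the log contents and the variable error_count are made up for the example.

```shell
#!/bin/bash
log=$(mktemp)
printf '%s\n' 'INFO start' 'ERROR disk full' 'WARN slow' 'ERROR timeout' > "$log"

# -q prints nothing; the exit status alone says whether any line matched.
if grep -Eq '^(ERROR|CRITICAL)' "$log"; then
    # -c counts matching lines instead of printing them.
    error_count=$(grep -Ec '^(ERROR|CRITICAL)' "$log")
    echo "found $error_count error line(s)"    # found 2 error line(s)
else
    error_count=0
    echo "log is clean"
fi

rm -f "$log"
```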

7 — Regex in `sed`

The s/pattern/replacement/ command in sed uses BRE by default. Use sed -E to switch to ERE (avoiding the need to escape +, (), etc.).

🐧 sed -E (Extended Regex) substitutions

# Normalise whitespace (collapse runs of whitespace to a single space)
sed -E 's/[[:space:]]+/ /g' file.txt

# Remove HTML tags
sed -E 's/<[^>]+>//g' page.html

# Capture groups with \1 \2 back-references
# Reformat dates from DD/MM/YYYY to YYYY-MM-DD
echo "09/06/2026" | sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\2-\1|'
2026-06-09

# Wrap numbers in brackets using & (whole match)
echo "Item costs 42 pounds" | sed -E 's/[0-9]+/[&]/g'
Item costs [42] pounds

# Swap first and second fields on a colon-delimited line
echo "alice:admin" | sed -E 's/^([^:]+):([^:]+)/\2:\1/'
admin:alice

# Delete lines matching a pattern
sed -E '/^[[:space:]]*(#|$)/d' config.txt   # strip blanks and comments

# Print only lines between two markers
sed -n '/^START/,/^END/p' file.txt
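
Capture groups and POSIX classes can be combined in a single substitution. As a sketch, the command below tidies "key = value" lines into key=value; the input format and the variable names normalise and normalised are assumptions for this illustration.

```shell
#!/bin/bash
# Group 1: the key (no "=" or whitespace). Group 2: the value, which must end
# in a non-whitespace character so trailing blanks fall outside the group.
normalise='s/^[[:space:]]*([^=[:space:]]+)[[:space:]]*=[[:space:]]*(.*[^[:space:]])[[:space:]]*$/\1=\2/'

normalised=$(printf '%s\n' '  host =  example.com  ' 'port=8080' | sed -E "$normalise")
echo "$normalised"
# host=example.com
# port=8080
```

Keeping the sed script in a single-quoted variable avoids any shell interpretation of the backslashes and the $ anchor.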

8 — Regex in `awk`

awk natively uses ERE. Patterns can appear as standalone conditions, inside if, or with the ~ (match) and !~ (no match) operators on specific fields.

🐧 awk regex patterns and operators

# Standalone regex — filter lines matching the pattern
awk '/^ERROR/' app.log

# Negate — lines NOT matching
awk '!/^#/' config.txt

# ~ operator: match a specific field
awk -F: '$1 ~ /^[a-z]/ {print $1}' /etc/passwd   # usernames starting lowercase
awk -F, '$3 !~ /[0-9]/ {print $0}' data.csv     # rows where col 3 has no digit

# Extract + reformat using match() and capture
# gawk (GNU awk) supports capture groups in match()
echo "2026-06-09" | gawk 'match($0, /([0-9]{4})-([0-9]{2})-([0-9]{2})/, a) {
    printf "Day: %s, Month: %s, Year: %s\n", a[3], a[2], a[1]
}'
Day: 09, Month: 06, Year: 2026

# Process a range of lines between two patterns
awk '/BEGIN_SECTION/,/END_SECTION/' file.txt

# Conditional with sub() / gsub() for regex replacement
awk '{gsub(/[[:space:]]+/, "_"); print}' file.txt   # spaces → underscores
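
The ~ operator pairs naturally with awk's associative arrays for counting. This is a minimal sketch that uses only POSIX awk features; the log lines and the name summary are assumptions for the demo.

```shell
#!/bin/bash
summary=$(printf '%s\n' 'ERROR disk full' 'INFO started' 'ERROR timeout' 'WARN slow' \
    | awk '$1 ~ /^(ERROR|WARN)$/ { count[$1]++ }    # tally only these two levels
           END { printf "ERROR=%d WARN=%d\n", count["ERROR"], count["WARN"] }')

echo "$summary"    # ERROR=2 WARN=1
```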

9 — Practical Validation Patterns

These regex fragments cover common validation tasks in Bash scripts. They check the format of a value only; whether an address or host actually exists is a separate question.

🐧 Input validation with [[ =~ ]]

#!/bin/bash

# ── Integers ────────────────────────────────────────
is_integer() { [[ "$1" =~ ^-?[0-9]+$ ]]; }

# ── Positive integer (no sign) ──────────────────────
is_positive_int() { [[ "$1" =~ ^[0-9]+$ ]]; }

# ── Decimal number ──────────────────────────────────
is_number() { [[ "$1" =~ ^-?[0-9]+(\.[0-9]+)?$ ]]; }

# ── Email (basic) ───────────────────────────────────
email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
is_email() { [[ "$1" =~ $email_re ]]; }

# ── IPv4 address (octets 0-255) ─────────────────────
octet_re='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
ipv4_re="^${octet_re}(\.${octet_re}){3}\$"
is_ipv4() { [[ "$1" =~ $ipv4_re ]]; }

# ── ISO date YYYY-MM-DD ─────────────────────────────
date_re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
is_iso_date() { [[ "$1" =~ $date_re ]]; }

# ── Hostname ────────────────────────────────────────
host_re='^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$'
is_hostname() { [[ "$1" =~ $host_re ]]; }

# ── Usage ───────────────────────────────────────────
is_integer  "-42"            && echo "integer ✓"
is_email    "me@example.com" && echo "email ✓"
is_iso_date "2026-06-09"     && echo "date ✓"
! is_ipv4   "999.0.0.1"      && echo "bad IP ✓"
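
Numeric ranges are awkward to express in a regex, so one option is to let the regex check the format and arithmetic check the range. The sketch below applies that idea to TCP/UDP port numbers; is_port is a hypothetical helper in the style of the functions above.

```shell
#!/bin/bash
# is_port is a hypothetical helper, not part of the validators listed above.
is_port() {
    # Format check first: 1 to 5 digits, no sign, no leading zero.
    [[ "$1" =~ ^[1-9][0-9]{0,4}$ ]] || return 1
    # Range check second: simpler than encoding 1-65535 in the pattern.
    (( $1 <= 65535 ))
}

for candidate in 22 8080 65536 0 http; do
    if is_port "$candidate"; then
        echo "$candidate: valid"
    else
        echo "$candidate: invalid"
    fi
done
# 22: valid
# 8080: valid
# 65536: invalid
# 0: invalid
# http: invalid
```

Running the regex before the arithmetic matters: it guarantees that only plain digits ever reach (( )), which would otherwise try to evaluate arbitrary input as an expression.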

10 — Quick Reference

Glob vs Regex — side by side

Goal	Glob (shell, case, ==)	Regex (=~, grep -E, awk)
Any string	`*`	.*
Any single character	`?`	.
One of these characters	`[abc]`	[abc]
Start of string	N/A (matches whole string)	^
End of string	N/A	$
One or more	`+(pat)` extglob	+
Zero or one	`?(pat)` extglob	?
Alternation	`@(a|b)` extglob	(a|b)
Negate	`!(pat)` extglob	[^...] or grep `-v`

Regex metacharacter summary (ERE)

Symbol	Meaning
^	Start of string / line
$	End of string / line
.	Any single character (except newline)
*	Zero or more of preceding
+	One or more of preceding
?	Zero or one of preceding
{n,m}	Between n and m occurrences
[abc]	Character class
[^abc]	Negated character class
(abc)	Capturing group
a|b	Alternation: a or b
\	Escape next metacharacter
[:alpha:]	POSIX letter class (inside `[...]`)
[:digit:]	POSIX digit class (inside `[...]`)
[:space:]	POSIX whitespace class (inside `[...]`)

Where each pattern system is used

Context	System	Notes
Filename expansion	Glob	Expanded by shell before command runs
`case` patterns	Glob (+ extglob)	Matches whole string, not substring
`[[ str == pat ]]`	Glob (+ extglob)	Right side must be unquoted
`[[ str =~ re ]]`	ERE	Right side unquoted; sets `BASH_REMATCH`
`grep` (default)	BRE	Use `-E` for ERE
`grep -E`	ERE	Recommended for readability
`sed` (default)	BRE	Use `-E` for ERE
`awk`	ERE	Always ERE, no flag needed
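
One practical consequence of the table is that a glob must match the whole string, while a regex matches any substring unless it is anchored. A short sketch with assumed sample data:

```shell
#!/bin/bash
text="backup_2026.tar.gz"

# Glob with ==: the pattern has to cover the entire string.
if [[ "$text" == 2026 ]];   then glob_bare="match";  else glob_bare="no match";  fi
if [[ "$text" == *2026* ]]; then glob_wild="match";  else glob_wild="no match";  fi

# Regex with =~: an unanchored pattern may match anywhere in the string.
if [[ "$text" =~ 2026 ]];   then regex_bare="match"; else regex_bare="no match"; fi
if [[ "$text" =~ ^2026$ ]]; then regex_anch="match"; else regex_anch="no match"; fi

echo "== 2026    : $glob_bare"     # == 2026    : no match
echo "== *2026*  : $glob_wild"     # == *2026*  : match
echo "=~ 2026    : $regex_bare"    # =~ 2026    : match
echo "=~ ^2026\$  : $regex_anch"    # =~ ^2026$  : no match
```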

✏️ Exercises

Apply what you have learned. Write each script yourself before looking at the sample solution.

Exercise 1

Write a script called validate_input.sh that prompts the user for five pieces of information one at a time — name, age, email address, an IPv4 address, and a date in YYYY-MM-DD format — and validates each with a regex using [[ =~ ]]. Keep re-prompting for each field until valid input is entered. Print a summary once all five fields are collected.

Hint: write one while true loop per field. Use the validation functions from section 9 (is_integer, is_email, is_ipv4, is_iso_date). For the name, require at least 2 alphabetic characters.

Sample Solution

#!/bin/bash
# validate_input.sh

email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
ipv4_re='^([0-9]{1,3}\.){3}[0-9]{1,3}$'
date_re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'

prompt_until() {
    local label="$1" re="$2" result
    while true; do
        read -r -p "$label: " result || return 1   # give up on EOF instead of looping forever
        [[ "$result" =~ $re ]] && break
        echo "  ✗ Invalid — please try again." >&2
    done
    echo "$result"
}

name=$(prompt_until "Full name (letters and spaces, 2+ chars)" '^[a-zA-Z ]{2,}$')
age=$(prompt_until  "Age (1-3 digit number)"              '^[0-9]{1,3}$')
email=$(prompt_until "Email address"                      "$email_re")
ip=$(prompt_until    "IPv4 address (e.g. 192.168.1.1)"     "$ipv4_re")
dob=$(prompt_until   "Date of birth (YYYY-MM-DD)"          "$date_re")

echo
echo "── Summary ──────────────"
printf "Name  : %s\n"  "$name"
printf "Age   : %s\n"  "$age"
printf "Email : %s\n"  "$email"
printf "IP    : %s\n"  "$ip"
printf "DOB   : %s\n"  "$dob"

Exercise 2

Write a script called parse_log.sh that reads an Apache-style access log file (path as argument) and uses BASH_REMATCH to parse each line. Extract the IP address, HTTP method, URL path, and response code. Print a summary showing: total requests, unique IPs, count of each HTTP method, and count of each response code, sorted numerically.

Hint: Apache combined log format is: IP - - [date] "METHOD /path HTTP/1.1" CODE size .... Build a regex with capture groups for IP (group 1), method (group 2), path (group 3), code (group 4). Use associative arrays to accumulate counts.

Sample Solution

#!/bin/bash
# parse_log.sh — usage: ./parse_log.sh access.log

logfile="${1:?Usage: $0 <access.log>}"
[[ -f "$logfile" ]] || { echo "Not found: $logfile" >&2; exit 1; }

# Apache combined log regex
log_re='^([0-9.]+) [^ ]+ [^ ]+ \[[^]]+\] "([A-Z]+) ([^ ]+) [^"]*" ([0-9]{3})'

declare -A methods codes ips
total=0

while IFS= read -r line; do
    [[ "$line" =~ $log_re ]] || continue
    ip="${BASH_REMATCH[1]}"
    method="${BASH_REMATCH[2]}"
    code="${BASH_REMATCH[4]}"
    (( total++ ))
    ips["$ip"]=1
    (( methods["$method"]++ ))
    (( codes["$code"]++ ))
done < "$logfile"

printf "Total requests : %d\n"  "$total"
printf "Unique IPs     : %d\n\n" "${#ips[@]}"

echo "HTTP Methods:"
for m in $(printf "%s\n" "${!methods[@]}" | sort); do
    printf "  %-8s %d\n" "$m" "${methods[$m]}"
done

echo
echo "Response Codes:"
for c in $(printf "%s\n" "${!codes[@]}" | sort -n); do
    printf "  %s  %d\n" "$c" "${codes[$c]}"
done

Exercise 3

Write a script called rename_dated.sh that renames files in the current directory whose names contain a date in DD-MM-YYYY format to use ISO format (YYYY-MM-DD) instead. For example, report_09-06-2026.csv becomes report_2026-06-09.csv. Dry-run mode (when called with --dry-run) should print what would be renamed without actually doing it.

Hint: use for f in * with [[ "$f" =~ ([0-9]{2})-([0-9]{2})-([0-9]{4}) ]]. Rebuild the new filename using BASH_REMATCH and sed or parameter expansion to swap the date portion. Check for a --dry-run argument with $1.

Sample Solution

#!/bin/bash
# rename_dated.sh — usage: ./rename_dated.sh [--dry-run]

dry_run=0
[[ "$1" == "--dry-run" ]] && dry_run=1

date_re='([0-9]{2})-([0-9]{2})-([0-9]{4})'
count=0

for f in *; do
    [[ -f "$f" ]] || continue
    [[ "$f" =~ $date_re ]] || continue

    dd="${BASH_REMATCH[1]}"
    mm="${BASH_REMATCH[2]}"
    yyyy="${BASH_REMATCH[3]}"
    old_date="${dd}-${mm}-${yyyy}"
    new_date="${yyyy}-${mm}-${dd}"
    new_name="${f//${old_date}/${new_date}}"

    if [[ "$new_name" != "$f" ]]; then
        printf "  %s  →  %s\n" "$f" "$new_name"
        if [[ $dry_run -eq 0 ]]; then
            mv -- "$f" "$new_name"
        fi
        (( count++ ))
    fi
done

if [[ $count -eq 0 ]]; then
    echo "No files with DD-MM-YYYY dates found."
elif [[ $dry_run -eq 1 ]]; then
    printf "\n[dry-run] %d file(s) would be renamed.\n" "$count"
else
    printf "\n%d file(s) renamed.\n" "$count"
fi

Exercise 4

Write a script called extract_urls.sh that accepts a file (HTML or text) and extracts all unique URLs from it using grep -Eo. Print them one per line, sorted, with duplicates removed. As a bonus, categorise them: print http/https URLs first, then mailto: links, then any other protocol.

Hint: use grep -Eo 'https?://[^"<> ]+|mailto:[^"<> ]+' to extract URLs. Pipe through sort -u to deduplicate. Use grep to split into categories, or use a loop with [[ =~ ]] to classify each URL.

Sample Solution

#!/bin/bash
# extract_urls.sh — usage: ./extract_urls.sh page.html

file="${1:?Usage: $0 <file>}"
[[ -f "$file" ]] || { echo "Not found: $file" >&2; exit 1; }

# Extract all URLs into an array (deduplicated)
url_re='https?://[^"<> ]+|mailto:[^"<> ]+'
urls=()
while IFS= read -r url; do
    urls+=( "$url" )
done < <(grep -Eo "$url_re" "$file" | sort -u)

if [[ "${#urls[@]}" -eq 0 ]]; then
    echo "No URLs found in $file"
    exit 0
fi

printf "Found %d unique URL(s):\n\n" "${#urls[@]}"

# Categorise
declare -a http_urls mailto_urls other_urls
for url in "${urls[@]}"; do
    case "$url" in
        https://*|http://*) http_urls+=( "$url" ) ;;
        mailto:*)           mailto_urls+=( "$url" ) ;;
        *)                  other_urls+=( "$url" ) ;;
    esac
done

print_section() {
    local label="$1"; shift
    [[ $# -eq 0 ]] && return
    echo "── $label ──"
    printf "  %s\n" "$@"
    echo
}

print_section "HTTP/HTTPS  (${#http_urls[@]})"   "${http_urls[@]}"
print_section "Mailto      (${#mailto_urls[@]})" "${mailto_urls[@]}"
print_section "Other       (${#other_urls[@]})"  "${other_urls[@]}"
