Pattern Matching and Regular Expressions
🔍 Topic 10 — Pattern Matching and Regular Expressions
Bash uses two related but distinct pattern systems. Glob patterns (also called shell patterns) are used for filename matching and the case statement — they use *, ?, and [...]. Regular expressions (regex) are used inside [[ =~ ]] and tools like grep, sed, and awk — they are far more expressive. This chapter covers both systems thoroughly, explains where each one applies, and shows the key patterns you'll reach for constantly in real scripts.
1 — Glob Patterns (Shell Wildcards)
Globs are expanded by the shell itself before any command sees them. They match filenames in the filesystem — or strings in [[ == ]] and case.
| Pattern | Matches | Example |
|---|---|---|
| * | Any string of zero or more characters (not including /) | *.log → all .log files |
| ? | Exactly one character | file?.txt → file1.txt, fileA.txt |
| [abc] | One character from the set | [abc].sh → a.sh, b.sh, c.sh |
| [a-z] | One character in the range | [0-9].txt → single-digit filenames |
| [^abc] | One character NOT in the set | [^0-9]* → files not starting with a digit |
| ** | Any path including / (requires globstar option) | **/*.py → all .py files recursively |
# Filename expansion (pathname expansion)
ls *.sh # all shell scripts
ls report_2026-??.csv # any two characters after the dash: report_2026-01.csv, report_2026-12.csv, etc.
ls [A-Z]*.txt # text files starting with a capital letter
# Recursive glob — must enable globstar first
shopt -s globstar
for f in **/*.py; do
echo "$f"
done
# Include dotfiles (hidden files) in globs
shopt -s dotglob
ls * # now includes .hidden files
# By default, a glob that matches nothing is passed to the command as the literal string.
# nullglob: expands to nothing if no match; failglob: raises an error instead
shopt -s nullglob
files=( *.csv )
[[ "${#files[@]}" -eq 0 ]] && echo "No CSV files found"
# Globs in case — match strings, not filenames
filename="archive.tar.gz"
case "$filename" in
*.tar.gz) echo "gzipped tarball" ;;
*.zip) echo "zip archive" ;;
*.sh) echo "shell script" ;;
*) echo "unknown type" ;;
esac
gzipped tarball
file=*.txt stores the literal string *.txt. files=(*.txt) correctly expands into an array. When looping: for f in *.txt is fine unquoted, but once the filename is in a variable, use "$f" everywhere.
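The array pattern above can be tried safely in a throwaway directory. This is a minimal sketch; count_matches and the file names are made up for illustration, and the subshell is one possible way to keep nullglob from leaking into the rest of a script.

```shell
#!/usr/bin/env bash
# count_matches DIR EXT: print how many files in DIR end in .EXT.
# (count_matches is a hypothetical helper, not a bash builtin.)
count_matches() {
    local dir=$1 ext=$2
    # Subshell: the shopt setting disappears when the parentheses close.
    (
        shopt -s nullglob
        files=( "$dir"/*."$ext" )   # empty array when nothing matches
        echo "${#files[@]}"
    )
}

demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b c.csv" "$demo_dir/notes.txt"

count_matches "$demo_dir" csv    # prints 2; the space in "b c.csv" is harmless
count_matches "$demo_dir" log    # prints 0 rather than counting a literal *.log

rm -r "$demo_dir"
```

Without nullglob the second call would report 1, because the unmatched pattern itself would become the single array element.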
2 — Extended Globs
Extended globs add five powerful pattern operators. Enable them with shopt -s extglob. They work in filename expansion, case, and [[ == ]].
| Pattern | Meaning | Example |
|---|---|---|
| ?(pat) | Zero or one occurrence of pat | file?(s).txt → file.txt or files.txt |
| *(pat) | Zero or more occurrences of pat | *(0)1 → 1, 01, 001 |
| +(pat) | One or more occurrences of pat | +([0-9]) → one or more digits |
| @(pat) | Exactly one occurrence of pat (alternation) | @(jpg|png|gif) → exactly one of those |
| !(pat) | Anything except pat | !(*.log) → all files except .log |
shopt -s extglob
# Match image files with one extension from a list
for img in *.@(jpg|jpeg|png|gif|webp); do
echo "Image: $img"
done
# List everything EXCEPT backup files
ls !(*.bak|*.tmp)
# Match version strings: v1, v12, v123 (but not v)
ver="v42"
[[ "$ver" == v+([0-9]) ]] && echo "valid version"
valid version
# Strip extension using extended glob in parameter expansion
file="photo.backup.tar.gz"
echo "${file%%+(.+([a-z]))}" # strip all dot-extensions
photo
# case with extended globs
input="yes"
case "$input" in
@(y|yes|Y|YES)) echo "Confirmed" ;;
@(n|no|N|NO)) echo "Declined" ;;
*) echo "Unknown response" ;;
esac
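Another common use of extended globs in parameter expansion is trimming whitespace. The sketch below assumes a helper named trim (a made-up name) and relies on +(...) matching a whole run of characters.

```shell
#!/usr/bin/env bash
shopt -s extglob   # the +(...) operator below needs extglob

# trim STRING: strip leading and trailing whitespace.
# (trim is a hypothetical helper name.)
trim() {
    local s=$1
    s=${s##+([[:space:]])}   # remove the longest leading run of whitespace
    s=${s%%+([[:space:]])}   # remove the longest trailing run of whitespace
    printf '%s\n' "$s"
}

trim "   hello world   "     # prints: hello world
```

Inner whitespace is left alone because ## and %% only remove text anchored at the start or end of the value.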
3 — Pattern Matching Inside [[ ]]
The double-bracket [[ ]] construct supports two kinds of matching: glob patterns with ==, and regular expressions with =~.
filename="report_2026-06.csv"
# == uses a GLOB pattern (not regex) — right side is unquoted
[[ "$filename" == report_*.csv ]] && echo "glob match"
glob match
# If you QUOTE the pattern it becomes a literal string comparison
[[ "$filename" == "report_*.csv" ]] && echo "this won't print — literal * char"
# =~ uses EXTENDED REGEX — right side is also unquoted
[[ "$filename" =~ report_[0-9]{4}-[0-9]{2}\.csv ]] && echo "regex match"
regex match
# Practical: validate an IPv4 address
ip="192.168.1.105"
octet='([0-9]{1,3})'
if [[ "$ip" =~ ^${octet}\.${octet}\.${octet}\.${octet}$ ]]; then
echo "Looks like an IP address"
fi
# Store the regex in a variable for readability (do NOT quote it)
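# Inside [...] a backslash is an ordinary character in POSIX regex, so put - last instead of escaping it
email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'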
[[ "test@example.com" =~ $email_re ]] && echo "valid email format"
valid email format
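The effect of quoting the right-hand side of =~ can be checked directly. A small sketch with an invented sample string:

```shell
#!/usr/bin/env bash
s="report_2026.csv"
re='^report_[0-9]+\.csv$'

# Unquoted: $re is interpreted as a regular expression.
[[ $s =~ $re ]] && unquoted=match || unquoted="no match"

# Quoted: "$re" is searched for as literal text, so ^, [0-9]+ and $ lose their meaning.
[[ $s =~ "$re" ]] && quoted=match || quoted="no match"

echo "unquoted: $unquoted"   # prints: unquoted: match
echo "quoted:   $quoted"     # prints: quoted:   no match
```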
Leave $var unquoted on the right side of =~. Quoting the right side of either == or =~ makes it a literal string comparison, so the pattern metacharacters are ignored.
4 — Capturing Groups with BASH_REMATCH
After a successful =~ match, bash populates the array BASH_REMATCH (read-only in older bash versions): index [0] is the whole match, and indices [1], [2]… are the captured groups (parenthesised parts of the pattern).
# Extract year, month, day from a date string
date_str="Today is 2026-06-09 and it's Tuesday."
date_re='([0-9]{4})-([0-9]{2})-([0-9]{2})'
if [[ "$date_str" =~ $date_re ]]; then
echo "Full match : ${BASH_REMATCH[0]}" # → 2026-06-09
echo "Year : ${BASH_REMATCH[1]}" # → 2026
echo "Month : ${BASH_REMATCH[2]}" # → 06
echo "Day : ${BASH_REMATCH[3]}" # → 09
fi
# Parse a URL into components
url="https://api.example.com:8443/v2/users?page=2"
url_re='^(https?)://([^:/]+)(:([0-9]+))?(/[^?]*)(\?.*)?$'
[[ "$url" =~ $url_re ]] && {
echo "Scheme : ${BASH_REMATCH[1]}" # → https
echo "Host : ${BASH_REMATCH[2]}" # → api.example.com
echo "Port : ${BASH_REMATCH[4]}" # → 8443 (group 3 is the colon plus port, ":8443")
echo "Path : ${BASH_REMATCH[5]}" # → /v2/users
echo "Query : ${BASH_REMATCH[6]}" # → ?page=2
}
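The same capture-group technique works for simple configuration lines. A minimal sketch, in which parse_kv, kv_key and kv_value are hypothetical names and the accepted key syntax is an assumption:

```shell
#!/usr/bin/env bash
# Key: a letter or underscore followed by letters, digits, underscores.
kv_re='^[[:space:]]*([A-Za-z_][A-Za-z0-9_]*)[[:space:]]*=[[:space:]]*(.*)$'

# parse_kv LINE: split "key = value" into the globals kv_key and kv_value.
parse_kv() {
    [[ $1 =~ $kv_re ]] || return 1     # no match: do not read BASH_REMATCH
    kv_key=${BASH_REMATCH[1]}          # first parenthesised group
    kv_value=${BASH_REMATCH[2]}        # second parenthesised group
}

if parse_kv "  max_retries = 5"; then
    echo "$kv_key -> $kv_value"        # prints: max_retries -> 5
fi

parse_kv "# just a comment" || echo "not a key=value line"
```

Copying the groups into ordinary variables straight away matters because the next =~ test overwrites BASH_REMATCH.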
5 — Regular Expression Fundamentals
Regex is its own mini-language. Here are the building blocks you need to know. Bash's =~ uses Extended Regular Expressions (ERE) — the same dialect as grep -E and awk.
^ start of string
$ end of string
\b word boundary (grep/sed only)
# Match whole string
^hello$ → only "hello"
# Start only
^error → "error" at start
[abc] one of a, b, c
[^abc] not a, b, or c
[a-z] lowercase letter
[0-9] digit
. any char (not \n)
\d digit (PCRE only, e.g. grep -P; not part of ERE)
# POSIX classes (portable); they only work inside a bracket expression
[[:alpha:]] letters
[[:digit:]] digits
[[:space:]] whitespace
[[:alnum:]] letters+digits
? 0 or 1
* 0 or more
+ 1 or more
{n} exactly n
{n,} n or more
{n,m} between n and m
# Greedy vs lazy
# Default is greedy (match as much as possible)
.* greedy
# Lazy quantifiers are not available in BRE or ERE;
# use Perl-style grep -P for .*?
(abc) group / capture
(?:abc) non-capture group (PCRE only; not in BRE or ERE)
a|b a OR b
(cat|dog) cat OR dog
# Alternation examples
^(ERROR|WARN|INFO)
# matches log level prefixes
# In ERE these need escaping
# to be treated as literals:
\. \* \+ \? \( \)
\[ \] \{ \} \^ \$ \|
# Match a literal dot
192\.168\.1\.[0-9]+
# Match a literal parenthesis
\(deprecated\)
# Integer: one or more digits
[0-9]+ or [[:digit:]]+
# Word: letters/digits/underscore
[a-zA-Z0-9_]+
# Optional sign + integer
-?[0-9]+
# Whitespace (one or more)
[[:space:]]+
# Blank line
^[[:space:]]*$
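Because ERE has no lazy quantifier, the usual workaround is a negated character class that cannot run past the delimiter. A short sketch comparing the two, using an invented sample line:

```shell
#!/usr/bin/env bash
line='<b>one</b> and <b>two</b>'

greedy_re='<b>(.*)</b>'        # .* extends to the last </b> on the line
bounded_re='<b>([^<]*)</b>'    # [^<]* stops at the first < it meets

[[ $line =~ $greedy_re ]]  && greedy=${BASH_REMATCH[1]}
[[ $line =~ $bounded_re ]] && bounded=${BASH_REMATCH[1]}

echo "greedy : $greedy"    # prints: greedy : one</b> and <b>two
echo "bounded: $bounded"   # prints: bounded: one
```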
BRE vs ERE — What Changes?
BRE (Basic Regular Expressions): used by default in grep and sed. The metacharacters + ? | ( ) { } are literal unless you escape them with \. So + is a literal plus; \+ means "one or more" (\+, \? and \| are GNU extensions to BRE).
ERE (Extended Regular Expressions): used by grep -E, awk, and bash's =~. The metacharacters + ? | ( ) { } are special by default; you escape them with \ to match them literally.
Rule of thumb: prefer ERE. Pass -E to grep and -E to sed. Less escaping, more readable.
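The difference is easy to see by running the same pattern through grep with and without -E. A small sketch with invented input:

```shell
#!/usr/bin/env bash
sample=$'aaa\na+\nb'

# BRE (plain grep): + is an ordinary character, so only the line containing "a+" matches.
bre=$(printf '%s\n' "$sample" | grep 'a+')

# ERE (grep -E): + is a quantifier, so every line with at least one "a" matches.
ere=$(printf '%s\n' "$sample" | grep -E 'a+')

echo "BRE matched: $bre"
echo "ERE matched: $ere"
```

The BRE run prints only a+, while the ERE run prints both aaa and a+.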
6 — Regex in grep
# Match lines starting with a log level
grep -E "^(ERROR|WARN|CRITICAL)" app.log
# Extract email addresses from a file
grep -Eo "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}" file.txt
# Find lines with a 4-digit year between 1900 and 2099
grep -E "\b(19|20)[0-9]{2}\b" dates.txt # \b (GNU extension) keeps 12019 or 20261 from matching
# Match lines containing BOTH "error" AND "disk" (chained grep)
grep -i "error" syslog | grep -i "disk"
# Match lines of 10 or more characters
grep -E ".{10,}" file.txt
# Find lines with repeated words ("the the", "is is", etc.)
grep -E "\b([a-z]+) \1\b" essay.txt
# Extract IPv4 addresses
grep -Eo "([0-9]{1,3}\.){3}[0-9]{1,3}" access.log | sort -u
# Lines that do NOT match a pattern
grep -Ev "^(#|$)" config.txt # exclude comments and blank lines
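In scripts, grep is often used for its count or its exit status rather than its output. A minimal sketch of both, using made-up log lines:

```shell
#!/usr/bin/env bash
# Sample log text invented for this sketch.
log=$'ERROR disk full\nINFO started\nWARN slow response\nERROR disk offline\n# comment'

# -c counts matching lines instead of printing them.
errors=$(printf '%s\n' "$log" | grep -Ec '^ERROR')

# -q prints nothing; only the exit status is used by the if.
if printf '%s\n' "$log" | grep -Eq '^ERROR.*disk'; then
    disk_problem=yes
else
    disk_problem=no
fi

echo "errors=$errors disk_problem=$disk_problem"   # prints: errors=2 disk_problem=yes
```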
7 — Regex in sed
The s/pattern/replacement/ command in sed uses BRE by default. Use sed -E to switch to ERE (avoiding the need to escape +, (), etc.).
# Normalise whitespace (collapse runs of spaces to one)
sed -E 's/[[:space:]]+/ /g' file.txt
# Remove HTML tags
sed -E 's/<[^>]+>//g' page.html
# Capture groups with \1 \2 back-references
# Reformat dates from DD/MM/YYYY to YYYY-MM-DD
echo "09/06/2026" | sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\2-\1|'
2026-06-09
# Wrap numbers in brackets using & (whole match)
echo "Item costs 42 pounds" | sed -E 's/[0-9]+/[&]/g'
Item costs [42] pounds
# Swap first and second fields on a colon-delimited line
echo "alice:admin" | sed -E 's/^([^:]+):([^:]+)/\2:\1/'
admin:alice
# Delete lines matching a pattern
sed -E '/^[[:space:]]*(#|$)/d' config.txt # strip blanks and comments
# Print only lines between two markers
sed -n '/^START/,/^END/p' file.txt
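Capture groups are also handy for editing one setting while keeping its original spacing. The sketch below assumes a helper named set_key (hypothetical) and assumes the key and value contain no characters that are special to sed, such as /, & or regex metacharacters.

```shell
#!/usr/bin/env bash
# set_key KEY VALUE: rewrite "KEY = old" lines read from stdin as "KEY = VALUE".
# (set_key is a hypothetical helper; KEY and VALUE must be sed-safe, see above.)
set_key() {
    local key=$1 value=$2
    # Group 1 keeps the key, the = sign and the surrounding spaces; .* is the old value.
    sed -E "s/^([[:space:]]*${key}[[:space:]]*=[[:space:]]*).*/\\1${value}/"
}

printf 'level = info\nhost = localhost\n' | set_key level debug
# prints:
# level = debug
# host = localhost
```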
8 — Regex in awk
awk natively uses ERE. Patterns can appear as standalone conditions, inside if, or with the ~ (match) and !~ (no match) operators on specific fields.
# Standalone regex — filter lines matching the pattern
awk '/^ERROR/' app.log
# Negate — lines NOT matching
awk '!/^#/' config.txt
# ~ operator: match a specific field
awk -F: '$1 ~ /^[a-z]/ {print $1}' /etc/passwd # usernames starting lowercase
awk -F, '$3 !~ /[0-9]/ {print $0}' data.csv # rows where col 3 has no digit
# Extract + reformat using match() and capture
# gawk (GNU awk) supports capture groups in match()
echo "2026-06-09" | gawk 'match($0, /([0-9]{4})-([0-9]{2})-([0-9]{2})/, a) {
printf "Day: %s, Month: %s, Year: %s\n", a[3], a[2], a[1]
}'
Day: 09, Month: 06, Year: 2026
# Process a range of lines between two patterns
awk '/BEGIN_SECTION/,/END_SECTION/' file.txt
# Conditional with sub() / gsub() for regex replacement
awk '{gsub(/[[:space:]]+/, "_"); print}' file.txt # spaces → underscores
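Field matching with ~ combines naturally with counters and an END block. A sketch that uses only POSIX awk features, on invented lines shaped like "METHOD PATH STATUS":

```shell
#!/usr/bin/env bash
# Count 2xx responses versus 4xx/5xx responses. The sample data is made up.
summary=$(printf '%s\n' \
    'GET /index.html 200' \
    'GET /missing 404' \
    'POST /login 500' \
    'GET /about 200' |
  awk '
    $3 ~ /^2[0-9][0-9]$/    { ok++ }        # field 3 is a 2xx status
    $3 ~ /^[45][0-9][0-9]$/ { failed++ }    # field 3 is a 4xx or 5xx status
    END { printf "ok=%d failed=%d\n", ok, failed }
  ')

echo "$summary"    # prints: ok=2 failed=2
```

The digits are written out as [0-9][0-9] rather than [0-9]{2} because some older awk implementations do not support interval expressions.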
9 — Practical Validation Patterns
These regex fragments are reasonable starting points for common validation tasks in Bash scripts. They check format, not meaning: is_iso_date, for example, accepts 2026-02-31.
#!/bin/bash
# ── Integers ────────────────────────────────────────
is_integer() { [[ "$1" =~ ^-?[0-9]+$ ]]; }
# ── Positive integer (no sign) ──────────────────────
is_positive_int() { [[ "$1" =~ ^[0-9]+$ ]]; }
# ── Decimal number ──────────────────────────────────
is_number() { [[ "$1" =~ ^-?[0-9]+(\.[0-9]+)?$ ]]; }
# ── Email (basic) ───────────────────────────────────
email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
is_email() { [[ "$1" =~ $email_re ]]; }
# ── IPv4 address ────────────────────────────────────
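# Each octet must be 0-255 (leading zeros such as 01 are rejected)
ipv4_re='^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$'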
is_ipv4() { [[ "$1" =~ $ipv4_re ]]; }
# ── ISO date YYYY-MM-DD ─────────────────────────────
date_re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
is_iso_date() { [[ "$1" =~ $date_re ]]; }
# ── Hostname ────────────────────────────────────────
host_re='^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$'
is_hostname() { [[ "$1" =~ $host_re ]]; }
# ── Usage ───────────────────────────────────────────
is_integer "-42" && echo "integer ✓"
is_email "me@example.com" && echo "email ✓"
is_iso_date "2026-06-09" && echo "date ✓"
! is_ipv4 "999.0.0.1" && echo "bad IP ✓"
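An alternative to encoding the 0-255 range in the regex itself is to capture the octets and check them arithmetically. A minimal sketch, with is_ipv4_strict a made-up name:

```shell
#!/usr/bin/env bash
# is_ipv4_strict ADDR: succeed only for four dot-separated octets in 0..255.
# (is_ipv4_strict is a hypothetical name chosen for this sketch.)
is_ipv4_strict() {
    local re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'
    [[ $1 =~ $re ]] || return 1          # wrong shape
    local i
    for i in 1 2 3 4; do
        # 10# forces base 10 so that "08" or "09" is not rejected as bad octal.
        (( 10#${BASH_REMATCH[i]} <= 255 )) || return 1
    done
}

is_ipv4_strict "192.168.1.105" && echo "accepted"   # prints: accepted
is_ipv4_strict "999.0.0.1"     || echo "rejected"   # prints: rejected
```

Splitting the check this way keeps the regex short; the trade-off is a few extra lines of shell.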
10 — Quick Reference
Glob vs Regex — side by side
| Goal | Glob (shell, case, ==) | Regex (=~, grep -E, awk) |
|---|---|---|
| Any string | * | .* |
| Any single character | ? | . |
| One of these characters | [abc] | [abc] |
| Start of string | N/A (matches whole string) | ^ |
| End of string | N/A | $ |
| One or more | +(pat) extglob | + |
| Zero or one | ?(pat) extglob | ? |
| Alternation | @(a|b) extglob | (a|b) |
| Negate | !(pat) extglob | [^...] or grep -v |
Regex metacharacter summary (ERE)
| Symbol | Meaning |
|---|---|
| ^ | Start of string / line |
| $ | End of string / line |
| . | Any single character (except newline) |
| * | Zero or more of preceding |
| + | One or more of preceding |
| ? | Zero or one of preceding |
| {n,m} | Between n and m occurrences |
| [abc] | Character class |
| [^abc] | Negated character class |
| (abc) | Capturing group |
| a|b | Alternation — a or b |
| \ | Escape next metacharacter |
| [:alpha:] | POSIX letter class (inside [...]) |
| [:digit:] | POSIX digit class (inside [...]) |
| [:space:] | POSIX whitespace class (inside [...]) |
Where each pattern system is used
| Context | System | Notes |
|---|---|---|
| Filename expansion | Glob | Expanded by shell before command runs |
| case patterns | Glob (+ extglob) | Matches whole string, not substring |
| [[ str == pat ]] | Glob (+ extglob) | Right side must be unquoted |
| [[ str =~ re ]] | ERE | Right side unquoted; sets BASH_REMATCH |
| grep (default) | BRE | Use -E for ERE |
| grep -E | ERE | Recommended for readability |
| sed (default) | BRE | Use -E for ERE |
| awk | ERE | Always ERE, no flag needed |
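The "whole string versus substring" distinction in the tables is a frequent source of surprises. A short sketch with an invented file name shows all four cases side by side:

```shell
#!/usr/bin/env bash
s="my_report_2026.csv"

# A glob with == must cover the entire string.
[[ $s == report_* ]]  && glob_plain=yes   || glob_plain=no     # no: the string starts with "my_"
[[ $s == *report_* ]] && glob_wrapped=yes || glob_wrapped=no   # yes: the leading * absorbs "my_"

# A regex with =~ matches anywhere unless it is anchored.
[[ $s =~ report_ ]]   && re_plain=yes     || re_plain=no       # yes: found as a substring
[[ $s =~ ^report_ ]]  && re_anchored=yes  || re_anchored=no    # no: ^ pins the match to the start

echo "$glob_plain $glob_wrapped $re_plain $re_anchored"   # prints: no yes yes no
```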
✏️ Exercises
Apply what you have learned. Write each script yourself before looking at the sample solution.
Write a script called validate_input.sh that prompts the user for five pieces of information one at a time (name, age, email address, an IPv4 address, and a date in YYYY-MM-DD format) and validates each with a regex using [[ =~ ]]. Keep re-prompting for each field until valid input is entered. Print a summary once all five fields are collected.
Hint: use a while true loop per field. Use the validation functions from section 9 (is_integer, is_email, is_ipv4, is_iso_date). For the name, require at least 2 alphabetic characters.
#!/bin/bash
# validate_input.sh
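email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
# Each octet must be 0-255 (leading zeros such as 01 are rejected)
ipv4_re='^((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])\.){3}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$'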
date_re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
prompt_until() {
local label="$1" re="$2" result
while true; do
read -r -p "$label: " result
[[ "$result" =~ $re ]] && break
echo " ✗ Invalid — please try again." >&2
done
echo "$result"
}
name=$(prompt_until "Full name (letters only, 2+ chars)" '^[a-zA-Z ]{2,}$')
age=$(prompt_until "Age (1-3 digit number)" '^[0-9]{1,3}$')
email=$(prompt_until "Email address" "$email_re")
ip=$(prompt_until "IPv4 address (e.g. 192.168.1.1)" "$ipv4_re")
dob=$(prompt_until "Date of birth (YYYY-MM-DD)" "$date_re")
echo
echo "── Summary ──────────────"
printf "Name : %s\n" "$name"
printf "Age : %s\n" "$age"
printf "Email : %s\n" "$email"
printf "IP : %s\n" "$ip"
printf "DOB : %s\n" "$dob"
Write a script called parse_log.sh that reads an Apache-style access log file (path as argument) and uses BASH_REMATCH to parse each line. Extract the IP address, HTTP method, URL path, and response code. Print a summary showing: total requests, unique IPs, count of each HTTP method, and count of each response code, sorted numerically.
Hint: log lines have the form IP - - [date] "METHOD /path HTTP/1.1" CODE size .... Build a regex with capture groups for IP (group 1), method (group 2), path (group 3), code (group 4). Use associative arrays to accumulate counts.
#!/bin/bash
# parse_log.sh — usage: ./parse_log.sh access.log
logfile="${1:?Usage: $0 <access.log>}"
[[ -f "$logfile" ]] || { echo "Not found: $logfile" >&2; exit 1; }
# Apache combined log regex
log_re='^([0-9.]+) [^ ]+ [^ ]+ \[[^]]+\] "([A-Z]+) ([^ ]+) [^"]*" ([0-9]{3})'
declare -A methods codes ips
total=0
while IFS= read -r line; do
[[ "$line" =~ $log_re ]] || continue
ip="${BASH_REMATCH[1]}"
method="${BASH_REMATCH[2]}"
code="${BASH_REMATCH[4]}"
(( total++ ))
ips["$ip"]=1
(( methods["$method"]++ ))
(( codes["$code"]++ ))
done < "$logfile"
printf "Total requests : %d\n" "$total"
printf "Unique IPs : %d\n\n" "${#ips[@]}"
echo "HTTP Methods:"
for m in $(printf "%s\n" "${!methods[@]}" | sort); do
printf " %-8s %d\n" "$m" "${methods[$m]}"
done
echo
echo "Response Codes:"
for c in $(printf "%s\n" "${!codes[@]}" | sort -n); do
printf " %s %d\n" "$c" "${codes[$c]}"
done
Write a script called rename_dated.sh that renames files in the current directory whose names contain a date in DD-MM-YYYY format to use ISO format (YYYY-MM-DD) instead. For example, report_09-06-2026.csv becomes report_2026-06-09.csv. Dry-run mode (when called with --dry-run) should print what would be renamed without actually doing it.
Hint: use for f in * with [[ "$f" =~ ([0-9]{2})-([0-9]{2})-([0-9]{4}) ]]. Rebuild the new filename using BASH_REMATCH and sed or parameter expansion to swap the date portion. Check for a --dry-run argument with $1.
#!/bin/bash
# rename_dated.sh — usage: ./rename_dated.sh [--dry-run]
dry_run=0
[[ "$1" == "--dry-run" ]] && dry_run=1
date_re='([0-9]{2})-([0-9]{2})-([0-9]{4})'
count=0
for f in *; do
[[ -f "$f" ]] || continue
[[ "$f" =~ $date_re ]] || continue
dd="${BASH_REMATCH[1]}"
mm="${BASH_REMATCH[2]}"
yyyy="${BASH_REMATCH[3]}"
old_date="${dd}-${mm}-${yyyy}"
new_date="${yyyy}-${mm}-${dd}"
new_name="${f//${old_date}/${new_date}}"
if [[ "$new_name" != "$f" ]]; then
printf " %s → %s\n" "$f" "$new_name"
if [[ $dry_run -eq 0 ]]; then
mv -- "$f" "$new_name"
fi
(( count++ ))
fi
done
if [[ $count -eq 0 ]]; then
echo "No files with DD-MM-YYYY dates found."
elif [[ $dry_run -eq 1 ]]; then
printf "\n[dry-run] %d file(s) would be renamed.\n" "$count"
else
printf "\n%d file(s) renamed.\n" "$count"
fi
Write a script called extract_urls.sh that accepts a file (HTML or text) and extracts all unique URLs from it using grep -Eo. Print them one per line, sorted, with duplicates removed. As a bonus, categorise them: print http/https URLs first, then mailto: links, then any other protocol.
Hint: use grep -Eo 'https?://[^"<> ]+|mailto:[^"<> ]+' to extract URLs. Pipe through sort -u to deduplicate. Use grep to split into categories, or use a loop with [[ =~ ]] to classify each URL.
#!/bin/bash
# extract_urls.sh — usage: ./extract_urls.sh page.html
file="${1:?Usage: $0 <file>}"
[[ -f "$file" ]] || { echo "Not found: $file" >&2; exit 1; }
# Extract all URLs into an array (deduplicated)
url_re='https?://[^"<> ]+|mailto:[^"<> ]+'
urls=()
while IFS= read -r url; do
urls+=( "$url" )
done < <(grep -Eo "$url_re" "$file" | sort -u) # "< <(...)": redirect stdin from a process substitution
if [[ "${#urls[@]}" -eq 0 ]]; then
echo "No URLs found in $file"
exit 0
fi
printf "Found %d unique URL(s):\n\n" "${#urls[@]}"
# Categorise
declare -a http_urls mailto_urls other_urls
for url in "${urls[@]}"; do
case "$url" in
https://*|http://*) http_urls+=( "$url" ) ;;
mailto:*) mailto_urls+=( "$url" ) ;;
*) other_urls+=( "$url" ) ;;
esac
done
print_section() {
local label="$1"; shift
[[ $# -eq 0 ]] && return
echo "── $label ──"
printf " %s\n" "$@"
echo
}
print_section "HTTP/HTTPS (${#http_urls[@]})" "${http_urls[@]}"
print_section "Mailto (${#mailto_urls[@]})" "${mailto_urls[@]}"
print_section "Other (${#other_urls[@]})" "${other_urls[@]}"