🔍 Topic 10 — Pattern Matching and Regular Expressions

Bash uses two related but distinct pattern systems. Glob patterns (also called shell patterns) are used for filename matching and the case statement — they use *, ?, and [...]. Regular expressions (regex) are used inside [[ =~ ]] and tools like grep, sed, and awk — they are far more expressive. This chapter covers both systems thoroughly, explains where each one applies, and shows the key patterns you'll reach for constantly in real scripts.

1 — Glob Patterns (Shell Wildcards)

Globs are expanded by the shell itself before any command sees them. They match filenames in the filesystem — or strings in [[ == ]] and case.

Pattern   Matches                                                      Example
*         Any string of zero or more characters (not including /)      *.log → all .log files
?         Exactly one character                                        file?.txt → file1.txt, fileA.txt
[abc]     One character from the set                                   [abc].sh → a.sh, b.sh, c.sh
[a-z]     One character in the range                                   [0-9].txt → single-digit filenames
[^abc]    One character NOT in the set ([!abc] is the portable form)   [^0-9]* → files not starting with a digit
**        Any path including / (requires globstar option)              **/*.py → all .py files recursively
🐧 Glob patterns in action
    # Filename expansion (pathname expansion)
    ls *.sh                   # all shell scripts
    ls report_2026-??.csv     # report_2026-01.csv, report_2026-02.csv, etc.
    ls [A-Z]*.txt             # text files starting with a capital letter

    # Recursive glob: must enable globstar first
    shopt -s globstar
    for f in **/*.py; do
        echo "$f"
    done

    # Include dotfiles (hidden files) in globs
    shopt -s dotglob
    ls *                      # now includes .hidden files

    # By default an unmatched glob is passed through as the literal pattern.
    #   nullglob: an unmatched glob expands to nothing
    #   failglob: an unmatched glob is an error
    shopt -s nullglob
    files=( *.csv )
    [[ "${#files[@]}" -eq 0 ]] && echo "No CSV files found"

    # Globs in case: match strings, not filenames
    filename="archive.tar.gz"
    case "$filename" in
        *.tar.gz) echo "gzipped tarball" ;;
        *.zip)    echo "zip archive" ;;
        *.sh)     echo "shell script" ;;
        *)        echo "unknown type" ;;
    esac
    # → gzipped tarball

⚠️ Globs are not expanded in a plain variable assignment. file=*.txt stores the literal string *.txt (it only expands later, wherever $file is used unquoted). files=(*.txt) expands into an array with one element per match. When looping, for f in *.txt is fine unquoted, but once the filename is in a variable, use "$f" everywhere.
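The difference between a plain assignment, an array assignment, and nullglob can be checked directly. The following is a minimal sketch; the scratch directory and the file names a.txt and b.txt are made up for the illustration.

```bash
#!/bin/bash
# Work in a scratch directory so the example does not depend on existing files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.txt b.txt        # hypothetical sample files

# Plain assignment: no pathname expansion happens, the pattern is stored as-is.
pattern=*.txt

# Array assignment: the glob is expanded into one element per matching file.
txt_files=( *.txt )

# With nullglob set, a glob that matches nothing expands to zero words.
shopt -s nullglob
csv_files=( *.csv )
shopt -u nullglob

printf 'pattern   = %s\n' "$pattern"
printf 'txt count = %d\n' "${#txt_files[@]}"
printf 'csv count = %d\n' "${#csv_files[@]}"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Without nullglob, csv_files would hold one element containing the literal text *.csv, which is why the empty-array check is only reliable with the option set.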

2 — Extended Globs

Extended globs add five powerful pattern operators. Enable them with shopt -s extglob. They work in filename expansion, case, and [[ == ]].

Pattern   Meaning                                        Example
?(pat)    Zero or one occurrence of pat                  file?(s).txt → file.txt or files.txt
*(pat)    Zero or more occurrences of pat                *(0)1 → 1, 01, 001
+(pat)    One or more occurrences of pat                 +([0-9]) → one or more digits
@(pat)    Exactly one occurrence of pat (alternation)    @(jpg|png|gif) → exactly one of those
!(pat)    Anything except pat                            !(*.log) → all files except .log
🐧 Extended glob examples
    shopt -s extglob

    # Match image files with one extension from a list
    for img in *.@(jpg|jpeg|png|gif|webp); do
        echo "Image: $img"
    done

    # List everything EXCEPT backup and temp files
    ls !(*.bak|*.tmp)

    # Match version strings: v1, v12, v123 (but not v)
    ver="v42"
    [[ "$ver" == v+([0-9]) ]] && echo "valid version"
    # → valid version

    # Strip extensions using an extended glob in parameter expansion
    file="photo.backup.tar.gz"
    echo "${file%%+(.+([a-z]))}"    # strip all dot-extensions
    # → photo

    # case with extended globs
    input="yes"
    case "$input" in
        @(y|yes|Y|YES)) echo "Confirmed" ;;
        @(n|no|N|NO))   echo "Declined" ;;
        *)              echo "Unknown response" ;;
    esac
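Extended globs in parameter expansion are also a common way to trim whitespace without calling an external tool. The sketch below assumes a helper named trim, which is a hypothetical name chosen for this illustration.

```bash
#!/bin/bash
shopt -s extglob   # enable extended globs before the patterns below are used

# trim (hypothetical helper): remove leading and trailing whitespace
# from its first argument and print the result.
trim() {
    local s="$1"
    s="${s##+([[:space:]])}"   # strip the longest leading run of whitespace
    s="${s%%+([[:space:]])}"   # strip the longest trailing run of whitespace
    printf '%s\n' "$s"
}

trim "   hello world   "
```

The ## and %% operators remove the longest match, so a single +(...) pattern is enough to consume the whole run of whitespace on each side while leaving inner spaces alone.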

3 — Pattern Matching Inside [[ ]]

The double-bracket [[ ]] construct supports two kinds of matching: glob patterns with ==, and regular expressions with =~.

🐧 [[ == ]] glob matching vs [[ =~ ]] regex matching
    filename="report_2026-06.csv"

    # == uses a GLOB pattern (not regex); the right side is unquoted
    [[ "$filename" == report_*.csv ]] && echo "glob match"
    # → glob match

    # If you QUOTE the pattern it becomes a literal string comparison
    [[ "$filename" == "report_*.csv" ]] && echo "never printed: * is a literal character here"

    # =~ uses EXTENDED REGEX; the right side is also unquoted
    [[ "$filename" =~ report_[0-9]{4}-[0-9]{2}\.csv ]] && echo "regex match"
    # → regex match

    # Practical: check that a string has the shape of an IPv4 address
    # (this does not check that each octet is at most 255)
    ip="192.168.1.105"
    octet='([0-9]{1,3})'
    if [[ "$ip" =~ ^${octet}\.${octet}\.${octet}\.${octet}$ ]]; then
        echo "Looks like an IP address"
    fi

    # Store the regex in a variable for readability (do NOT quote it).
    # Inside [...] a backslash is an ordinary character in ERE, so place - last
    # in the set rather than writing \- (which would also admit a backslash).
    email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    [[ "test@example.com" =~ $email_re ]] && echo "valid email format"
    # → valid email format
Store complex regex patterns in a variable and use $var unquoted on the right side of =~. Quoting the right side of either == or =~ makes it a literal string comparison — the pattern metacharacters are ignored.
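The effect of quoting the right side of =~ can be seen by running the same pattern both ways. This is a small sketch; the function names matches_unquoted and matches_quoted are hypothetical and exist only for the comparison.

```bash
#!/bin/bash
# matches_unquoted: the second argument is used as an ERE.
matches_unquoted() { local re="$2"; [[ "$1" =~ $re ]]; }

# matches_quoted: the second argument is used as a literal substring.
matches_quoted() { local re="$2"; [[ "$1" =~ "$re" ]]; }

re='^a.c$'

matches_unquoted "abc" "$re"     && echo 'unquoted: abc matches the regex'
matches_quoted   "abc" "$re"     || echo 'quoted: abc does not contain the literal pattern text'
matches_quoted   'x^a.c$y' "$re" && echo 'quoted: literal pattern text found inside the string'
```

In the quoted form the characters ^, . and $ lose their special meaning, so the test succeeds only when the string contains those exact characters.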

4 — Capturing Groups with BASH_REMATCH

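After a successful =~ match, bash populates the array BASH_REMATCH (read-only in bash versions before 5.1): index [0] is the whole match, and indices [1], [2]… are the captured groups (the parenthesised parts of the pattern), numbered by the position of their opening parenthesis. Every new =~ overwrites the array, so copy out the values you need before matching again.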

🐧 Using BASH_REMATCH to extract parts of a string
    # Extract year, month, day from a date string
    date_str="Today is 2026-06-09 and it's Tuesday."
    date_re='([0-9]{4})-([0-9]{2})-([0-9]{2})'
    if [[ "$date_str" =~ $date_re ]]; then
        echo "Full match : ${BASH_REMATCH[0]}"    # → 2026-06-09
        echo "Year       : ${BASH_REMATCH[1]}"    # → 2026
        echo "Month      : ${BASH_REMATCH[2]}"    # → 06
        echo "Day        : ${BASH_REMATCH[3]}"    # → 09
    fi

    # Parse a URL into components
    url="https://api.example.com:8443/v2/users?page=2"
    url_re='^(https?)://([^:/]+)(:([0-9]+))?(/[^?]*)(\?.*)?$'
    [[ "$url" =~ $url_re ]] && {
        echo "Scheme : ${BASH_REMATCH[1]}"    # → https
        echo "Host   : ${BASH_REMATCH[2]}"    # → api.example.com
        echo "Port   : ${BASH_REMATCH[4]}"    # → 8443 (group 3 is ":8443")
        echo "Path   : ${BASH_REMATCH[5]}"    # → /v2/users
        echo "Query  : ${BASH_REMATCH[6]}"    # → ?page=2
    }
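Because BASH_REMATCH is overwritten by every later match, a common pattern is to wrap the match in a function that copies the groups into named variables straight away. The sketch below assumes a hypothetical parse_semver helper and a simple MAJOR.MINOR.PATCH format with an optional leading v.

```bash
#!/bin/bash
# Hypothetical version format for this example: optional "v", then three
# dot-separated numbers.
semver_re='^v?([0-9]+)\.([0-9]+)\.([0-9]+)$'

# parse_semver: set the global variables major, minor and patch.
# Returns non-zero (and leaves the variables untouched) on no match.
parse_semver() {
    [[ "$1" =~ $semver_re ]] || return 1
    # Copy the groups immediately: the next =~ overwrites BASH_REMATCH.
    major="${BASH_REMATCH[1]}"
    minor="${BASH_REMATCH[2]}"
    patch="${BASH_REMATCH[3]}"
}

if parse_semver "v2.14.3"; then
    echo "major=$major minor=$minor patch=$patch"
fi
parse_semver "2.14" || echo "2.14 is not MAJOR.MINOR.PATCH"
```

Returning the match status from the function lets callers use it directly in if or with && and ||, in the same way as the [[ =~ ]] test itself.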

5 — Regular Expression Fundamentals

Regex is its own mini-language. Here are the building blocks you need to know. Bash's =~ uses Extended Regular Expressions (ERE) — the same dialect as grep -E and awk.

Anchors
    ^     start of string
    $     end of string
    \b    word boundary (GNU grep/sed extension)

    # Match whole string
    ^hello$    → only "hello"
    # Start only
    ^error     → "error" at start

Character classes
    [abc]     one of a, b, c
    [^abc]    not a, b, or c
    [a-z]     lowercase letter
    [0-9]     digit
    .         any char (not \n)
    \d        digit (PCRE only, e.g. grep -P; not part of ERE)

    # POSIX classes (portable); valid only inside a bracket expression
    [[:alpha:]]    letters
    [[:digit:]]    digits
    [[:space:]]    whitespace
    [[:alnum:]]    letters+digits

Quantifiers
    ?        0 or 1
    *        0 or more
    +        1 or more
    {n}      exactly n
    {n,}     n or more
    {n,m}    between n and m

    # Greedy vs lazy
    # Default is greedy (match as much as possible)
    .*       greedy
    # Lazy quantifiers do not exist in ERE/BRE;
    # use Perl-style grep -P for .*?

Groups & alternation
    (abc)      group / capture
    (?:abc)    non-capture group (PCRE only; not in BRE or ERE)
    a|b        a OR b
    (cat|dog)  cat OR dog

    # Alternation example
    ^(ERROR|WARN|INFO)    # matches log level prefixes

Escaping special chars
    # In ERE these need escaping to be treated as literals:
    \. \* \+ \? \( \) \[ \] \{ \} \^ \$ \|

    # Match a literal dot
    192\.168\.1\.[0-9]+
    # Match a literal parenthesis
    \(deprecated\)

Useful shortcuts
    # Integer: one or more digits
    [0-9]+  or  [[:digit:]]+
    # Word: letters/digits/underscore
    [a-zA-Z0-9_]+
    # Optional sign + integer
    -?[0-9]+
    # Whitespace (one or more)
    [[:space:]]+
    # Blank line
    ^[[:space:]]*$

BRE vs ERE — What Changes?

Two regex dialects in Linux:

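BRE (Basic Regular Expressions): used by default grep and sed. The metacharacters + ? | ( ) { } are literal unless you escape them with \. So + is a literal plus; \+ means "one or more". Note that \( \) and \{ \} are standard POSIX BRE, while \+, \? and \| are GNU extensions that other implementations may not support.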

ERE (Extended Regular Expressions) — used by grep -E, awk, and bash's =~. The metacharacters + ? | ( ) { } are special by default; you escape with \ to match them literally.

Rule of thumb: always use ERE. Pass -E to grep and -E to sed. Less escaping, more readable.
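The difference is easiest to see by running one pattern through both dialects. The sketch below uses a made-up three-line sample and the intent "one or more digits followed by ms".

```bash
#!/bin/bash
# Hypothetical sample input: three lines.
sample=$'took 120ms\ntook ms\nliteral 1+ms'

# ERE: + is a quantifier, so this finds digits followed by "ms".
ere_hits=$(printf '%s\n' "$sample" | grep -E '[0-9]+ms')

# BRE: an unescaped + is a literal plus sign, so this finds "1+ms" instead.
bre_hits=$(printf '%s\n' "$sample" | grep '[0-9]+ms')

# BRE with an interval expression, written \{m,n\} in this dialect.
bre_interval_hits=$(printf '%s\n' "$sample" | grep '[0-9]\{1,\}ms')

echo "ERE          : $ere_hits"
echo "BRE          : $bre_hits"
echo "BRE interval : $bre_interval_hits"
```

The first and third commands select "took 120ms", while the second selects "literal 1+ms". A pattern copied between tools without adjusting the dialect can therefore match something quite different without raising any error.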

6 — Regex in grep

🐧 grep -E (Extended Regex) patterns
    # Match lines starting with a log level
    grep -E '^(ERROR|WARN|CRITICAL)' app.log

    # Extract email addresses from a file
    # (- goes last in the set; a backslash inside [...] would be literal)
    grep -Eo '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' file.txt

    # Find lines with a 4-digit year between 1900 and 2099
    # (\b keeps the pattern from matching inside longer numbers)
    grep -E '\b(19|20)[0-9]{2}\b' dates.txt

    # Match lines containing BOTH "error" AND "disk" (chained grep)
    grep -i "error" syslog | grep -i "disk"

    # Match lines of 10 or more characters
    grep -E '.{10,}' file.txt

    # Find lines with repeated words ("the the", "is is", etc.)
    # (back-references in ERE are a GNU extension)
    grep -E '\b([a-z]+) \1\b' essay.txt

    # Extract IPv4 addresses
    grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}' access.log | sort -u

    # Lines that do NOT match a pattern
    grep -Ev '^(#|$)' config.txt    # exclude comments and blank lines

7 — Regex in sed

The s/pattern/replacement/ command in sed uses BRE by default. Use sed -E to switch to ERE (avoiding the need to escape +, (), etc.).

🐧 sed -E (Extended Regex) substitutions
    # Normalise whitespace (collapse runs of whitespace to one space)
    sed -E 's/[[:space:]]+/ /g' file.txt

    # Remove HTML tags
    sed -E 's/<[^>]+>//g' page.html

    # Capture groups with \1 \2 back-references
    # Reformat dates from DD/MM/YYYY to YYYY-MM-DD
    echo "09/06/2026" | sed -E 's|([0-9]{2})/([0-9]{2})/([0-9]{4})|\3-\2-\1|'
    # → 2026-06-09

    # Wrap numbers in brackets using & (whole match)
    echo "Item costs 42 pounds" | sed -E 's/[0-9]+/[&]/g'
    # → Item costs [42] pounds

    # Swap first and second fields on a colon-delimited line
    echo "alice:admin" | sed -E 's/^([^:]+):([^:]+)/\2:\1/'
    # → admin:alice

    # Delete lines matching a pattern
    sed -E '/^[[:space:]]*(#|$)/d' config.txt    # strip blanks and comments

    # Print only lines between two markers
    sed -n '/^START/,/^END/p' file.txt

8 — Regex in awk

awk natively uses ERE. Patterns can appear as standalone conditions, inside if, or with the ~ (match) and !~ (no match) operators on specific fields.

🐧 awk regex patterns and operators
    # Standalone regex: filter lines matching the pattern
    awk '/^ERROR/' app.log

    # Negate: lines NOT matching
    awk '!/^#/' config.txt

    # ~ operator: match a specific field
    awk -F: '$1 ~ /^[a-z]/ {print $1}' /etc/passwd    # usernames starting lowercase
    awk -F, '$3 !~ /[0-9]/ {print $0}' data.csv       # rows where col 3 has no digit

    # Extract + reformat using match() and capture
    # gawk (GNU awk) supports capture groups in match()
    echo "2026-06-09" | gawk 'match($0, /([0-9]{4})-([0-9]{2})-([0-9]{2})/, a) {
        printf "Day: %s, Month: %s, Year: %s\n", a[3], a[2], a[1]
    }'
    # → Day: 09, Month: 06, Year: 2026

    # Process a range of lines between two patterns
    awk '/BEGIN_SECTION/,/END_SECTION/' file.txt

    # Conditional with sub() / gsub() for regex replacement
    awk '{gsub(/[[:space:]]+/, "_"); print}' file.txt    # whitespace runs → underscores
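The three-argument match() used in the gawk example is a GNU extension. Where only a POSIX awk is available, the two-argument form sets RSTART and RLENGTH, and substr() can pull out the matched text. The following is a minimal sketch under that assumption; extract_year is a hypothetical helper name, and the digit class is repeated by hand because some awk implementations lack {n} intervals.

```bash
#!/bin/bash
# extract_year (hypothetical helper): print the first run of four digits
# found on each input line, using only POSIX awk features.
extract_year() {
    awk 'match($0, /[0-9][0-9][0-9][0-9]/) {
        print substr($0, RSTART, RLENGTH)
    }'
}

printf '%s\n' "released 2026-06-09" "no date here" | extract_year
```

Lines without a match are skipped because match() returns 0, which makes the pattern part of the rule false.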

9 — Practical Validation Patterns

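These regex fragments are reasonable starting points for common validation tasks in Bash scripts. They check the shape of the input rather than its full validity: the date pattern, for example, accepts 2026-02-31, and the email pattern is deliberately basic.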

🐧 Input validation with [[ =~ ]]
    #!/bin/bash

    # ── Integers ────────────────────────────────────────
    is_integer() { [[ "$1" =~ ^-?[0-9]+$ ]]; }

    # ── Non-negative integer (no sign) ──────────────────
    is_positive_int() { [[ "$1" =~ ^[0-9]+$ ]]; }

    # ── Decimal number ──────────────────────────────────
    is_number() { [[ "$1" =~ ^-?[0-9]+(\.[0-9]+)?$ ]]; }

    # ── Email (basic) ───────────────────────────────────
    # - goes last in each set; a backslash inside [...] would be literal
    email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    is_email() { [[ "$1" =~ $email_re ]]; }

    # ── IPv4 address ────────────────────────────────────
    # Each octet is limited to 0-255, so 999.0.0.1 is rejected
    octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
    ipv4_re="^${octet}(\.${octet}){3}\$"
    is_ipv4() { [[ "$1" =~ $ipv4_re ]]; }

    # ── ISO date YYYY-MM-DD ─────────────────────────────
    # Checks month 01-12 and day 01-31, not days per month
    date_re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
    is_iso_date() { [[ "$1" =~ $date_re ]]; }

    # ── Hostname ────────────────────────────────────────
    host_re='^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(\.[a-zA-Z]{2,})+$'
    is_hostname() { [[ "$1" =~ $host_re ]]; }

    # ── Usage ───────────────────────────────────────────
    is_integer "-42"          && echo "integer ✓"
    is_email "me@example.com" && echo "email ✓"
    is_iso_date "2026-06-09"  && echo "date ✓"
    ! is_ipv4 "999.0.0.1"     && echo "bad IP ✓"
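The ISO date pattern in this section constrains the month to 01-12 and the day to 01-31, but not the combination, so it accepts dates such as 2026-02-31. When that matters, the regex can do the splitting and shell arithmetic can do the calendar check. The sketch below assumes a hypothetical helper named is_real_date and the Gregorian leap-year rule.

```bash
#!/bin/bash
# is_real_date (hypothetical helper): succeed only for a YYYY-MM-DD string
# that names a day which exists on the calendar.
is_real_date() {
    local re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'
    [[ "$1" =~ $re ]] || return 1

    # 10# forces base 10 so that "08" and "09" are not read as invalid octal.
    local y=$(( 10#${BASH_REMATCH[1]} ))
    local m=$(( 10#${BASH_REMATCH[2]} ))
    local d=$(( 10#${BASH_REMATCH[3]} ))
    (( m >= 1 && m <= 12 && d >= 1 )) || return 1

    # Days per month, indexed by month number (index 0 is unused).
    local days=(0 31 28 31 30 31 30 31 31 30 31 30 31)

    # Gregorian leap year: divisible by 4 and not by 100, or divisible by 400.
    if (( m == 2 && ( (y % 4 == 0 && y % 100 != 0) || y % 400 == 0 ) )); then
        days[2]=29
    fi

    (( d <= days[m] ))
}

is_real_date "2024-02-29" && echo "2024-02-29 exists"
is_real_date "2026-02-31" || echo "2026-02-31 does not exist"
```

Splitting the work this way keeps the regex short: it only has to recognise the shape, and the numeric rules live in arithmetic where they are easier to read and test.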

10 — Quick Reference

Glob vs Regex — side by side

Goal                       Glob (shell, case, ==)        Regex (=~, grep -E, awk)
Any string                 *                             .*
Any single character       ?                             .
One of these characters    [abc]                         [abc]
Start of string            N/A (matches whole string)    ^
End of string              N/A                           $
One or more                +(pat) extglob                +
Zero or one                ?(pat) extglob                ?
Alternation                @(a|b) extglob                (a|b)
Negate                     !(pat) extglob                [^...] or grep -v

Regex metacharacter summary (ERE)

Symbol       Meaning
^            Start of string / line
$            End of string / line
.            Any single character (except newline)
*            Zero or more of preceding
+            One or more of preceding
?            Zero or one of preceding
{n,m}        Between n and m occurrences
[abc]        Character class
[^abc]       Negated character class
(abc)        Capturing group
a|b          Alternation: a or b
\            Escape next metacharacter
[:alpha:]    POSIX letter class (inside [...])
[:digit:]    POSIX digit class (inside [...])
[:space:]    POSIX whitespace class (inside [...])

Where each pattern system is used

Context               System             Notes
Filename expansion    Glob               Expanded by shell before command runs
case patterns         Glob (+ extglob)   Matches whole string, not substring
[[ str == pat ]]      Glob (+ extglob)   Right side must be unquoted
[[ str =~ re ]]       ERE                Right side unquoted; sets BASH_REMATCH
grep (default)        BRE                Use -E for ERE
grep -E               ERE                Recommended for readability
sed (default)         BRE                Use -E for ERE
awk                   ERE                Always ERE, no flag needed
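One practical consequence of this table is anchoring: a glob in == or case has to cover the whole string, while an unanchored regex in =~ may match anywhere inside it. The following small sketch, using a made-up file name, shows all four combinations.

```bash
#!/bin/bash
s="my_report_2026.csv"   # hypothetical sample string

# Glob with ==: the pattern must account for the entire string.
[[ "$s" == report_* ]]  && glob_plain=yes  || glob_plain=no
[[ "$s" == *report_* ]] && glob_star=yes   || glob_star=no

# Regex with =~: an unanchored pattern can match any substring.
[[ "$s" =~ report_ ]]   && re_plain=yes    || re_plain=no
[[ "$s" =~ ^report_ ]]  && re_anchored=yes || re_anchored=no

echo "glob  report_*    : $glob_plain"
echo "glob  *report_*   : $glob_star"
echo "regex report_     : $re_plain"
echo "regex ^report_    : $re_anchored"
```

The glob report_* and the regex ^report_ both fail here, while *report_* and the bare regex report_ both succeed. When translating a pattern from one system to the other, leading and trailing * in a glob correspond to leaving out ^ and $ in a regex.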

✏️ Exercises

Apply what you have learned. Write each script yourself before looking at the sample solution.

Exercise 1
Write a script called validate_input.sh that prompts the user for five pieces of information one at a time — name, age, email address, an IPv4 address, and a date in YYYY-MM-DD format — and validates each with a regex using [[ =~ ]]. Keep re-prompting for each field until valid input is entered. Print a summary once all five fields are collected.
Hint: write one while true loop per field. Use the validation functions from section 9 (is_integer, is_email, is_ipv4, is_iso_date). For the name, require at least 2 alphabetic characters.
Sample Solution
    #!/bin/bash
    # validate_input.sh

    # - goes last in each set; a backslash inside [...] would be literal
    email_re='^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    # Each octet is limited to 0-255
    octet='(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])'
    ipv4_re="^${octet}(\.${octet}){3}\$"
    date_re='^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'

    # Prompt until the reply matches the regex, then print the reply on stdout.
    # The prompt and error messages go to stderr so that $(...) captures
    # only the validated value.
    prompt_until() {
        local label="$1" re="$2" result
        while true; do
            read -r -p "$label: " result
            [[ "$result" =~ $re ]] && break
            echo "  ✗ Invalid. Please try again." >&2
        done
        echo "$result"
    }

    # The name must start and end with a letter, so it has at least 2 letters.
    name=$(prompt_until "Full name (letters and spaces, 2+ letters)" '^[a-zA-Z][a-zA-Z ]*[a-zA-Z]$')
    age=$(prompt_until "Age (1-3 digit number)" '^[0-9]{1,3}$')
    email=$(prompt_until "Email address" "$email_re")
    ip=$(prompt_until "IPv4 address (e.g. 192.168.1.1)" "$ipv4_re")
    dob=$(prompt_until "Date of birth (YYYY-MM-DD)" "$date_re")

    echo
    echo "── Summary ──────────────"
    printf "Name  : %s\n" "$name"
    printf "Age   : %s\n" "$age"
    printf "Email : %s\n" "$email"
    printf "IP    : %s\n" "$ip"
    printf "DOB   : %s\n" "$dob"
Exercise 2
Write a script called parse_log.sh that reads an Apache-style access log file (path as argument) and uses BASH_REMATCH to parse each line. Extract the IP address, HTTP method, URL path, and response code. Print a summary showing: total requests, unique IPs, count of each HTTP method, and count of each response code, sorted numerically.
Hint: Apache combined log format is: IP - - [date] "METHOD /path HTTP/1.1" CODE size .... Build a regex with capture groups for IP (group 1), method (group 2), path (group 3), code (group 4). Use associative arrays to accumulate counts.
Sample Solution
    #!/bin/bash
    # parse_log.sh (usage: ./parse_log.sh access.log)

    logfile="${1:?Usage: $0 <access.log>}"
    [[ -f "$logfile" ]] || { echo "Not found: $logfile" >&2; exit 1; }

    # Apache combined log regex
    # Groups: 1 = IP, 2 = method, 3 = path, 4 = response code
    log_re='^([0-9.]+) [^ ]+ [^ ]+ \[[^]]+\] "([A-Z]+) ([^ ]+) [^"]*" ([0-9]{3})'

    declare -A methods codes ips
    total=0

    while IFS= read -r line; do
        [[ "$line" =~ $log_re ]] || continue    # skip lines that do not parse
        ip="${BASH_REMATCH[1]}"
        method="${BASH_REMATCH[2]}"
        path="${BASH_REMATCH[3]}"               # extracted but not used in the summary
        code="${BASH_REMATCH[4]}"
        (( total++ ))
        ips["$ip"]=1
        (( methods["$method"]++ ))
        (( codes["$code"]++ ))
    done < "$logfile"

    printf "Total requests : %d\n" "$total"
    printf "Unique IPs     : %d\n\n" "${#ips[@]}"

    echo "HTTP Methods:"
    for m in $(printf "%s\n" "${!methods[@]}" | sort); do
        printf "  %-8s %d\n" "$m" "${methods[$m]}"
    done

    echo
    echo "Response Codes:"
    for c in $(printf "%s\n" "${!codes[@]}" | sort -n); do
        printf "  %s  %d\n" "$c" "${codes[$c]}"
    done
Exercise 3
Write a script called rename_dated.sh that renames files in the current directory whose names contain a date in DD-MM-YYYY format to use ISO format (YYYY-MM-DD) instead. For example, report_09-06-2026.csv becomes report_2026-06-09.csv. Dry-run mode (when called with --dry-run) should print what would be renamed without actually doing it.
Hint: use for f in * with [[ "$f" =~ ([0-9]{2})-([0-9]{2})-([0-9]{4}) ]]. Rebuild the new filename using BASH_REMATCH and sed or parameter expansion to swap the date portion. Check for a --dry-run argument with $1.
Sample Solution
    #!/bin/bash
    # rename_dated.sh (usage: ./rename_dated.sh [--dry-run])

    dry_run=0
    [[ "$1" == "--dry-run" ]] && dry_run=1

    date_re='([0-9]{2})-([0-9]{2})-([0-9]{4})'
    count=0

    for f in *; do
        [[ -f "$f" ]] || continue
        [[ "$f" =~ $date_re ]] || continue

        dd="${BASH_REMATCH[1]}"
        mm="${BASH_REMATCH[2]}"
        yyyy="${BASH_REMATCH[3]}"
        old_date="${dd}-${mm}-${yyyy}"
        new_date="${yyyy}-${mm}-${dd}"
        new_name="${f//${old_date}/${new_date}}"

        if [[ "$new_name" != "$f" ]]; then
            # Never overwrite an existing file
            if [[ -e "$new_name" ]]; then
                printf "  skipped %s: %s already exists\n" "$f" "$new_name" >&2
                continue
            fi
            printf "  %s → %s\n" "$f" "$new_name"
            if [[ $dry_run -eq 0 ]]; then
                mv -- "$f" "$new_name"
            fi
            (( count++ ))
        fi
    done

    if [[ $count -eq 0 ]]; then
        echo "No files with DD-MM-YYYY dates found."
    elif [[ $dry_run -eq 1 ]]; then
        printf "\n[dry-run] %d file(s) would be renamed.\n" "$count"
    else
        printf "\n%d file(s) renamed.\n" "$count"
    fi
Exercise 4
Write a script called extract_urls.sh that accepts a file (HTML or text) and extracts all unique URLs from it using grep -Eo. Print them one per line, sorted, with duplicates removed. As a bonus, categorise them: print http/https URLs first, then mailto: links, then any other protocol.
Hint: use grep -Eo 'https?://[^"<> ]+|mailto:[^"<> ]+' to extract URLs. Pipe through sort -u to deduplicate. Use grep to split into categories, or use a loop with [[ =~ ]] to classify each URL.
Sample Solution
    #!/bin/bash
    # extract_urls.sh (usage: ./extract_urls.sh page.html)

    file="${1:?Usage: $0 <file>}"
    [[ -f "$file" ]] || { echo "Not found: $file" >&2; exit 1; }

    # Extract all URLs into an array (deduplicated).
    # The first alternative accepts any scheme://, a superset of the hint's
    # https?:// pattern, so that the "Other" category can be populated.
    url_re='[a-zA-Z][a-zA-Z0-9+.-]*://[^"<> ]+|mailto:[^"<> ]+'
    urls=()
    # Note "< <(...)": a redirection from a process substitution
    while IFS= read -r url; do
        urls+=( "$url" )
    done < <(grep -Eo "$url_re" "$file" | sort -u)

    if [[ "${#urls[@]}" -eq 0 ]]; then
        echo "No URLs found in $file"
        exit 0
    fi

    printf "Found %d unique URL(s):\n\n" "${#urls[@]}"

    # Categorise
    declare -a http_urls mailto_urls other_urls
    for url in "${urls[@]}"; do
        case "$url" in
            https://*|http://*) http_urls+=( "$url" ) ;;
            mailto:*)           mailto_urls+=( "$url" ) ;;
            *)                  other_urls+=( "$url" ) ;;
        esac
    done

    # Print a labelled section; print nothing when the section has no URLs
    print_section() {
        local label="$1"; shift
        [[ $# -eq 0 ]] && return
        echo "── $label ──"
        printf "  %s\n" "$@"
        echo
    }

    print_section "HTTP/HTTPS (${#http_urls[@]})" "${http_urls[@]}"
    print_section "Mailto (${#mailto_urls[@]})" "${mailto_urls[@]}"
    print_section "Other (${#other_urls[@]})" "${other_urls[@]}"