Chapter 1 — Bash Internals: Parsing and Expansion Order

Most Bash bugs are not logic errors — they are expansion order surprises. When a line of shell script behaves unexpectedly, the cause is almost always that the programmer had a different mental model of when the shell interprets each piece of syntax than what actually happens. This chapter builds that model precisely, from the moment Bash reads a character to the moment a process is launched.

1 — Tokenisation: What the Lexer Sees

Before any expansion happens, Bash reads its input and breaks it into tokens. A token is either a word (a sequence of characters treated as a unit) or an operator (a sequence built from metacharacters, such as |, ;, >, ( or )).

The rules the lexer applies — in strict priority order:

  1. Quoting. Any character preceded by \, or any sequence inside '…' or "…", is shielded from operator and delimiter recognition. Quoting is recognised during tokenisation, not during expansion; the quote characters themselves are deleted at the very end, by quote removal.
  2. Operator recognition. If the current character starts or continues an operator token (|, &, ;, (, ), <, >, and their two-character forms), it becomes an operator token — not a word.
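  3. Whitespace delimiters. Unquoted blanks (space and tab) delimit words but are not themselves tokens. An unquoted newline is a token in its own right: it terminates the command. IFS plays no part here; it only matters later, at word splitting and in read.
  4. Comments. An unquoted # that appears as the first character of a word starts a comment; everything to the end of the line is discarded before parsing. (Interactive shells do this only while the interactive_comments option is on, which is the default.)

# Illustrating tokenisation: what are the tokens here?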
echo hello | cat
# Tokens: [echo] [hello] [|] [cat]
# 'hello' and 'cat' are WORDS; '|' is an OPERATOR

# Quoting prevents operator recognition
echo 'hello | cat'
# Tokens: [echo] [hello | cat]
# The pipe is INSIDE a quoted word — not an operator

# The # rule: only a word-initial # is a comment
echo foo#bar   # prints: foo#bar  — # is not word-initial here
echo foo #bar  # prints: foo      — space before # makes it word-initial
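
One way to check a tokenisation guess is to hand the words to a function that prints each argument it receives in brackets. The sketch below assumes a helper called show_args; the name is hypothetical and introduced only for illustration, it is not a Bash builtin.

```shell
# show_args: hypothetical helper (not a builtin). Prints the number of
# arguments it received, then each argument in brackets.
show_args() {
  printf '%d:' "$#"
  local arg
  for arg in "$@"; do
    printf ' [%s]' "$arg"
  done
  printf '\n'
}

show_args hello world       # 2: [hello] [world]
show_args 'hello | cat'     # 1: [hello | cat]
show_args foo#bar           # 1: [foo#bar]
show_args foo #bar          # 1: [foo]
```

Because the brackets make word boundaries visible, the same helper is handy for checking word-splitting and globbing results later in the chapter.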

2 — The Seven-Stage Expansion Pipeline

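After tokenisation, each word is passed through up to seven expansions. Brace expansion runs first. Tilde expansion, parameter expansion, arithmetic expansion, and command substitution (numbered 2 to 5 below) are then performed together in a single left-to-right pass over the word, so the text one of them produces is never rescanned by another. Word splitting and pathname expansion follow, in that order, and operate on the results. Finally, quote removal deletes the quote characters that were not themselves produced by an expansion. Understanding the order is everything: text produced at one point is only ever seen by the steps that come after it.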

#  Expansion                         Trigger syntax
1  Brace expansion                   {a,b}   {1..5}
2  Tilde expansion                   ~   ~user   ~+   ~-
3  Parameter & variable expansion    $var   ${var}   ${!var}
4  Arithmetic expansion              $(( expr ))
5  Command substitution              $( cmd )   `cmd`
6  Word splitting                    (implicit: splits on $IFS)
7  Pathname expansion (globbing)     *   ?   […]

Process substitution (<(cmd) / >(cmd)) is performed at the same time as stages 2 to 5 but is not part of the standard seven-stage pipeline. It is a Bash extension, available on systems that support it, that produces a filename (typically /dev/fd/N, otherwise a named pipe) before the command line is assembled.
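
The ordering can be checked with a few one-liners. This is a minimal sketch; the variable names (cmd_text, home_text, first, second) are made up for the demonstration. It shows that text produced by parameter expansion is not rescanned for command substitution or tilde expansion, while brace expansion, which runs earlier, can still generate words that contain variables.

```shell
cmd_text='$(echo injected)'
home_text='~'
first=1 second=2

# The expanded text is only word-split and globbed, never re-expanded.
rescan=$(echo $cmd_text)            # $(echo injected)   (the command never runs)
tilde=$(echo $home_text)            # ~                  (no tilde expansion)

# Brace expansion runs first and yields "$first $second", which is
# then expanded normally.
braces=$(echo {$first,$second})     # 1 2

printf '%s\n' "$rescan" "$tilde" "$braces"
```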

Stage 1 — Brace expansion

Brace expansion happens first, before any variable is looked up. This has a critical implication: a variable cannot supply the braces, the commas, or the endpoints of a sequence, because at this stage $var is still just literal text.

# Works as expected
echo file{1,2,3}.txt   # file1.txt file2.txt file3.txt

# Sequence expressions
echo {a..e}            # a b c d e
echo {01..05}          # 01 02 03 04 05
echo {0..20..5}        # 0 5 10 15 20   (with step, bash 4+)

# Brace expansion is purely textual and runs before any variable lookup
list='a,b,c'
echo {$list}         # stage 1 sees no comma, so no brace expansion; prints: {a,b,c}
eval echo {$list}    # eval re-parses the expanded text: prints a b c

# Preamble and postscript are repeated for each element
mkdir -p project/{src,test,docs}/{main,util}
# creates: project/src/main, project/src/util, project/test/main, ...
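
The most common place this bites is a sequence whose endpoint lives in a variable. The sketch below (n, literal and nums are illustrative names) shows the failure and one possible workaround, a C-style loop, which reads the variable at run time.

```shell
n=3

# Stage 1 sees {1..$n}, which is not a valid sequence, and leaves it alone.
# Parameter expansion then fills in the number, braces and all.
literal=$(echo {1..$n})             # {1..3}

# A C-style loop evaluates n when the loop runs.
nums=()
for (( i = 1; i <= n; i++ )); do
  nums+=( "$i" )
done

echo "$literal"
echo "${nums[*]}"                   # 1 2 3
```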

Stage 2 — Tilde expansion

# ~ expands to $HOME
echo ~          # /home/alice
echo ~root      # /root  (another user's home)
echo ~+         # $PWD (current directory)
echo ~-         # $OLDPWD (previous directory)

# Tilde expansion ONLY fires when ~ is the first char of a word
# and the word is unquoted
echo "~"        # prints literal ~  (quoted)
echo /foo/~     # prints /foo/~    (not word-initial)

# Assignment tilde expansion fires after = and each :
PATH=~/bin:$PATH   # ~ expands here inside an assignment
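
A frequent mistake is storing a quoted tilde in a variable and expecting it to expand later. A small sketch of the difference, with illustrative variable names:

```shell
quoted="~/bin"        # stays literal: ~/bin
unquoted=~/bin        # expanded at assignment time to the home directory plus /bin
portable="$HOME/bin"  # same idea, and it works inside double quotes

printf '%s\n' "$quoted" "$unquoted" "$portable"
```

When a path has to be built inside double quotes, $HOME is usually the simpler choice, since parameter expansion still fires there and tilde expansion does not.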

Stage 3 — Parameter expansion

This is the richest stage. The key advanced points:

# Indirect expansion — ${!var} looks up the value OF the variable named by var
target='PATH'
echo "${!target}"   # prints the value of $PATH

# ${!prefix*} and ${!prefix@} — variable names matching a prefix
FOO_A=1; FOO_B=2
echo "${!FOO_*}"    # FOO_A FOO_B

# Substring extraction: ${var:offset:length}
s='hello world'
echo "${s:6}"       # world
echo "${s:6:3}"     # wor
echo "${s: -5}"     # world  (negative offset — note the space!)

# Pattern removal
f='/path/to/file.tar.gz'
echo "${f##*/}"     # file.tar.gz   (greedy from left)
echo "${f%.*}"      # /path/to/file.tar  (lazy from right)
echo "${f%%.*}"     # /path/to/file  (greedy from right)

# Case conversion (bash 4+)
x='Hello World'
echo "${x,,}"       # hello world
echo "${x^^}"       # HELLO WORLD
echo "${x^}"        # Hello World (first char only)
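
The pattern-removal operators combine naturally to take a path apart without calling dirname or basename. The following is a sketch using the same example path; the variable names are illustrative, and the last lines show the ${var:-default} form, which substitutes a fallback when a variable is unset or empty.

```shell
f='/path/to/file.tar.gz'

dir=${f%/*}          # /path/to      (shortest suffix matching /*)
base=${f##*/}        # file.tar.gz   (longest prefix matching */)
stem=${base%%.*}     # file          (longest suffix matching .*)
ext=${base#*.}       # tar.gz        (shortest prefix matching *.)

# ${var:-default}: use a fallback when var is unset or empty.
unset -v maybe_unset
fallback=${maybe_unset:-none}        # none

printf '%s\n' "$dir" "$base" "$stem" "$ext" "$fallback"
```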

Stages 4 & 5 — Arithmetic and command substitution

Both produce a string that re-enters the pipeline at stage 6 (word splitting). They are evaluated in the order they appear left-to-right within a word, but each is fully resolved before the result is passed forward.

# Arithmetic expansion — integer math only, no fork
n=5
echo $(( n * 3 + 1 ))     # 16
echo $(( 0xFF ))           # 255  (hex literal)
echo $(( 2**10 ))          # 1024  (exponentiation)

# Command substitution — the output feeds back as a word
today=$(date '+%Y-%m-%d')

# CRITICAL: trailing newlines are stripped by command substitution
x=$(printf 'foo\n\n\n')
echo "'${x}'"   # 'foo'  — all trailing newlines gone

# Preserve trailing newlines with a sentinel character
x=$(printf 'foo\n\n'; echo x)
x="${x%x}"          # strip the sentinel, keep the newlines

Stage 6 — Word splitting

This stage is the source of the majority of quoting bugs. It applies only to the unquoted results of parameter expansion, arithmetic expansion, and command substitution, never to literal text already in the script.

# IFS controls word splitting — default is space, tab, newline
words='one two   three'
set -- $words        # word-split: $1=one $2=two $3=three
set -- "$words"     # quoted: $1='one two   three'

# Changing IFS changes what counts as a delimiter
IFS=':'
set -- $PATH         # splits PATH into positional parameters
IFS=$' \t\n'        # restore default

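# An empty IFS disables word splitting entirely. As a prefix to read it applies
# to that one command only, and stops read trimming leading/trailing whitespace.
while IFS= read -r line; do
  # "$line" is quoted, so it is never split, even if it contains spaces or tabs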
  echo "$line"
done < file.txt

# Word splitting does NOT apply inside [[ … ]] or (( … ))
x='a b'
[[ $x == 'a b' ]] && echo "match"   # safe — [[ never word-splits
# [ $x == 'a b' ] would fail — old-style [ word-splits
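
IFS whitespace and other IFS characters do not split in quite the same way, which matters as soon as IFS is set to something like a colon. The sketch below assumes a helper called count_fields (a hypothetical name) that splits a string on a given separator and reports how many fields resulted.

```shell
# count_fields SEP STRING: hypothetical helper. Splits STRING on the
# characters in SEP and prints how many fields word splitting produced.
count_fields() {
  local IFS="$1"
  set -f              # keep globbing out of the picture while we split
  set -- $2           # unquoted on purpose: this is where splitting happens
  set +f
  echo "$#"
}

count_fields ' ' 'a   b'    # 2: runs of IFS whitespace collapse
count_fields ':' 'a::b'     # 3: a, an empty field, b
count_fields ':' 'a:b:'     # 2: a trailing delimiter adds no field
```

Note that the helper turns noglob on and off again, so a caller that relies on set -f being active should call it in a subshell.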

Stage 7 — Pathname expansion (globbing)

# Globbing fires after word splitting — on each resulting word
pattern='*.txt'
ls $pattern      # splits then globs: ls file1.txt file2.txt
ls "$pattern"   # quoted: ls literally looks for a file named '*.txt'

# Extended globs (shopt -s extglob)
shopt -s extglob
ls !(*.log)      # everything except .log files
ls +(foo|bar)*   # one or more of 'foo' or 'bar', then anything
ls @(*.jpg|*.png)  # exactly one of the alternatives

# globstar for recursive matching (bash 4+)
shopt -s globstar
ls **/*.py        # all .py files in the current directory and any subdirectory

# If no match: by default, the pattern is passed literally
# With nullglob: no-match expands to nothing (empty)
shopt -s nullglob
files=( *.xyz )
(( ${#files[@]} == 0 )) && echo "no matches"

# failglob: treat no-match as an error instead
shopt -s failglob
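
Putting nullglob together with an array gives a loop-safe way to collect matches. This is a sketch that works in a scratch directory created with mktemp; the file names are invented for the demonstration.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.txt" "$tmp/b c.txt"

shopt -u failglob       # in case an earlier example left it on
shopt -s nullglob
txt=( "$tmp"/*.txt )    # 2 elements; 'b c.txt' stays one word
none=( "$tmp"/*.xyz )   # 0 elements under nullglob
shopt -u nullglob

echo "${#txt[@]} ${#none[@]}"   # 2 0
rm -r -- "$tmp"
```

The file name containing a space survives intact because globbing is the last stage: its results are never passed back through word splitting.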

3 — Where Quoting Lives in This Model

A persistent confusion: people think quoting is about expansion. It is not. Quoting is recognised by the lexer, before any expansion stage runs. What quoting actually does is mark characters as not subject to the rules that apply to unquoted characters in later stages. The quote characters themselves are deleted last of all, by quote removal.

Quote form  What it prevents                        What still happens inside it
'…'         Everything: no expansion at all         Nothing
"…"         Word splitting, globbing, brace         Parameter, arithmetic, command substitution;
            expansion, tilde expansion              \ before $ ` " \ newline
\x          All special meaning of the next         Nothing (the backslash is consumed)
            character
$'…'        All expansion                           C-style escapes: \n \t \xNN \uNNNN etc.

# Double-quote prevents word splitting and globbing but NOT substitution
f='hello world'
echo "The value is: $f"      # parameter expansion fires inside ""
echo "Result: $((2+2))"       # arithmetic expansion fires inside ""
echo "Files: $(ls)"           # command substitution fires inside ""
echo "Glob: *.txt"            # glob does NOT fire inside "" — literal

# $'...' for control characters without $() or printf
NL=$'\n'
TAB=$'\t'
ESC=$'\e'   # or \033
printf 'col1%scol2\n' "$TAB"
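
Because quoting is a property of characters rather than of whole words, differently quoted segments can sit side by side and still form one word. A short sketch, with illustrative variable names:

```shell
name='World'

# Three segments, one word: single-quoted, double-quoted, single-quoted.
greeting='Hello, '"$name"'! That costs $5.'

# A single quote cannot appear inside '…': close, escape, reopen.
apostrophe='it'\''s'

printf '%s\n' "$greeting" "$apostrophe"
# Hello, World! That costs $5.
# it's
```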

4 — Evaluation Contexts That Change the Rules

The seven stages apply to simple commands. Several other constructs have different evaluation rules that catch experienced programmers off guard.

[[ … ]] — the compound test command

# Inside [[: word splitting and pathname expansion are DISABLED
# Pattern matching on the right-hand side of == is still active

x='foo bar'
[[ $x == 'foo bar' ]]       # safe without quotes — no splitting
[[ $x == foo* ]]             # RHS is a PATTERN (not a string)
[[ $x == 'foo*' ]]            # quoted RHS = literal string match
[[ $x =~ ^foo[[:space:]] ]]  # RHS is an ERE regex; quoting it would force a literal match

# BASH_REMATCH captures regex groups
if [[ '2026-06-09' =~ ^([0-9]{4})-([0-9]{2})-([0-9]{2})$ ]]; then
  echo "year=${BASH_REMATCH[1]} month=${BASH_REMATCH[2]} day=${BASH_REMATCH[3]}"
fi
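
A common way to avoid quoting trouble with =~ is to keep the regex in a variable and expand it unquoted. The sketch below uses made-up names (re, s) and also shows what happens when the variable is quoted.

```shell
re='^([a-z]+)-([0-9]+)$'
s='build-42'

if [[ $s =~ $re ]]; then
  name=${BASH_REMATCH[1]}      # build
  number=${BASH_REMATCH[2]}    # 42
fi

# Quoting the variable turns the regex into a literal string, so this fails.
if [[ $s =~ "$re" ]]; then literal_match=yes; else literal_match=no; fi

echo "$name $number $literal_match"   # build 42 no
```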

(( … )) — arithmetic evaluation context

# Inside ((): variable names do NOT need $
# String values that look like integers are evaluated as integers
# An expression that evaluates to 0 returns exit code 1 (false)

a=5; b=3
(( a > b )) && echo "a is larger"
(( a += b ))           # a is now 8 — side effects are allowed

# 0 is false, non-zero is true — opposite of shell exit codes
(( 0 )) || echo "zero is false"
(( 1 )) && echo "non-zero is true"

# Pitfall: under set -e, a (( … )) that evaluates to 0 exits the script!
set -e
count=0
# (( count++ ))         # DANGER: evaluates to 0 (the old value), so status 1 ends the script
# Safe patterns:
(( count++ )) || true
# or:
count=$(( count + 1 ))    # an assignment returns 0 whatever the value
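
The exit statuses involved can be inspected directly. This is a minimal sketch; post, pre and assign are illustrative names that record the status of each form when count starts at 0.

```shell
count=0; post=0
(( count++ )) || post=$?              # post-increment yields the old value 0: status 1

count=0; pre=0
(( ++count )) || pre=$?               # pre-increment yields the new value 1: status 0

count=0; assign=0
count=$(( count + 1 )) || assign=$?   # a plain assignment: status 0

echo "$post $pre $assign"             # 1 0 0
```

Pre-increment is not a general fix: it returns status 1 whenever the new value happens to be 0, for example when counting up from -1.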

Here-strings and here-documents

# Here-string: a single word, expanded then fed as stdin (with a newline appended)
# Expansion stages 2-5 apply; no word splitting or globbing
cat <<<"Hello $USER, you have $(ls | wc -l) files"

# Heredoc: delimiter controls expansion
# Unquoted delimiter: stages 3-5 apply
cat << EOF
Hello $USER
EOF

# Quoted delimiter ('EOF', "EOF", \EOF): NO expansion — literal text
cat << 'EOF'
Hello $USER   <-- this is literal, not expanded
EOF
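
The effect of quoting the delimiter is easiest to see by capturing both variants of the same text. A sketch, assuming a demonstration variable called user_name:

```shell
user_name='alice'   # hypothetical value for the demonstration

expanded=$(cat <<EOF
Hello $user_name, 1+1=$((1+1))
EOF
)

literal=$(cat <<'EOF'
Hello $user_name, 1+1=$((1+1))
EOF
)

printf '%s\n' "$expanded" "$literal"
# Hello alice, 1+1=2
# Hello $user_name, 1+1=$((1+1))
```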

5 — Alias Expansion: Stage Zero

Aliases are not part of the seven-stage pipeline. They are expanded by the parser, before any other processing — including before brace expansion. This makes them powerful but also fragile.

# Aliases are expanded during parsing, not execution
# They apply only to interactive shells and scripts that explicitly enable them
shopt -s expand_aliases   # required in non-interactive scripts

alias ll='ls -la'
ll /tmp          # alias expanded before ll is parsed as a command

# Aliases vs functions: aliases cannot take arguments by position
# Functions are nearly always preferable in scripts
# The ONE valid use case for aliases in scripts: redefining a command globally
alias grep='grep --color=auto'
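
To make the comparison with functions concrete, here is a sketch of something an alias cannot express, because the argument has to appear in two places. The helper name mkcd is hypothetical.

```shell
# mkcd DIR: hypothetical helper. Creates DIR (and parents) and changes into it.
# An alias could only append its arguments at the end of one command.
mkcd() {
  mkdir -p -- "$1" && cd -- "$1"
}

base=$(mktemp -d)
mkcd "$base/project/src"
echo "${PWD##*/}"             # src
cd / && rm -r -- "$base"
```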

6 — Practical Consequences: Reading Bugs Through the Model

Let's apply the model to diagnose real bugs you will encounter.

Bug: glob fires when you don't want it to

pattern='*.log'
if [[ -n $pattern ]]; then
  echo "pattern is set"   # safe: [[ never globs
fi

# Dangerous: passes to an external command unquoted
find . -name $pattern   # if *.log matches files HERE, it expands before find sees it
find . -name "$pattern" # correct — quote prevents premature glob

Bug: array element count broken by word splitting

# Unquoted: word splitting AND globbing both fire, so the element count depends on the data
items='one two three'
arr=( $items )       # three elements here, but a '*' or a different IFS would change that
arr=( "$items" )     # ONE element: 'one two three' (no splitting at all)

# Use mapfile for line-delimited data instead
mapfile -t arr < file.txt    # each line becomes one element, no splitting
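
For data already held in a variable, read -ra and mapfile can both be fed from a here-string, which avoids the unquoted expansion altogether. A sketch with invented sample data, assuming the default IFS:

```shell
items='one * three'

# read -ra splits on IFS but never globs, so '*' stays a literal star.
read -ra words <<< "$items"

# mapfile -t splits on newlines only; spaces inside a line are kept.
mapfile -t lines <<< $'first line\nsecond line'

echo "${#words[@]} ${words[1]}"     # 3 *
echo "${#lines[@]} ${lines[1]}"     # 2 second line
```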

Bug: nested command substitution and quoting

# Outer $() creates a subshell; inner quotes work independently
result=$(echo "$(date '+%Y-%m-%d')")   # both levels of quoting are independent

# Backtick nesting requires escaping the inner backticks; avoid backticks
old=`echo \`date\``   # ugly; $() is always clearer

Bug: arithmetic result split by a non-default IFS

x=$(( 1 + 2 ))    # safe: an assignment never word-splits its right-hand side
# But consider:
IFS=2
x=$(( 10 + 2 ))   # x is '12'; the assignment itself is unaffected by IFS
echo $x           # prints 1: the unquoted expansion is split on '2'
echo "$x"         # prints 12: quote it
IFS=$' \t\n'      # restore default

Exercises

Exercise 1 — Expansion order detective

For each of the following lines, predict exactly what will be printed (or whether it will error), then verify in your shell. Explain which expansion stage causes the behaviour.

a='*.sh'
echo $a
echo "$a"
echo ${a/sh/py}
b='PATH'
echo ${!b}
set -- $'one\ntwo\nthree'
echo $#

# echo $a       parameter expansion → '*.sh'; then word splitting (no spaces)
#               → then globbing: if .sh files exist, lists them; else prints '*.sh'
# Stage 7 (globbing) is the key stage.

# echo "$a"     quoted: parameter expansion → '*.sh', no word splitting, no globbing
#               → prints literally:  *.sh

# echo ${a/sh/py}  pattern substitution during parameter expansion (stage 3)
#                  → '*.py'; then word splitting (nothing to split)
#                  → then globbing: expands to .py files if any, else '*.py'

# echo ${!b}    indirect expansion (stage 3): ${!b} looks up the variable whose name is in b
#               → expands to the value of $PATH. Unquoted, that value still goes through
#               word splitting and globbing, so "${!b}" is the safer spelling.

# set -- $'one\ntwo\nthree'; echo $#
# $'...' is a quoting form, resolved by the LEXER (before any expansion stage).
# The result is a single word with embedded newlines, treated as if it had
# been single-quoted. Word splitting (stage 6) only touches the unquoted
# results of expansions, so it never sees this word.
# $# == 1

Exercise 2 — IFS surgery

Write a function split_by SEP STRING RETARRAY that splits STRING on the single-character delimiter SEP and stores the result in the array named RETARRAY — without using any external commands (no awk, cut, tr). Use IFS manipulation and the knowledge from this chapter about word splitting and namerefs.

split_by() {
  # Args: SEP STRING RETARRAY
  local _sep="$1"
  local _str="$2"
  local -n _ret="$3"

  # Save and set IFS locally — local IFS is scoped to this function
  local IFS="$_sep"

  # An unquoted array assignment, _ret=( $_str ), would split here too, but it
  # would also glob. read -ra splits on IFS without globbing:
  read -ra _ret <<<"$_str"
}

# Test
declare -a parts
split_by ':' 'one:two:three' parts
printf '[%s]\n' "${parts[@]}"
# [one]
# [two]
# [three]

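split_by ',' 'a,,b,c' parts   # empty fields
printf '[%s]\n' "${parts[@]}"
# [a]
# []
# [b]
# [c]
# Note: only IFS whitespace (space, tab, newline) collapses consecutive delimiters;
# any other separator keeps interior empty fields. read still drops a trailing
# empty field ('a,b,' gives two elements) and stops at the first newline.
# For a split that preserves every field, use a while loop with ${str%%SEP*}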
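
One possible shape for such a loop is sketched below. The function name split_keep_empty and its internal names are hypothetical; like split_by, it relies on namerefs and so assumes Bash 4.3 or later.

```shell
# split_keep_empty SEP STRING RETARRAY: hypothetical helper.
# Splits STRING on the literal separator SEP, keeping every empty field,
# including a trailing one. Uses parameter expansion only; IFS is untouched.
split_keep_empty() {
  local _sep="$1" _str="$2"
  local -n _out="$3"
  [[ -n $_sep ]] || return 1          # an empty separator would loop forever
  _out=()
  while [[ $_str == *"$_sep"* ]]; do
    _out+=( "${_str%%"$_sep"*}" )     # text before the first separator
    _str=${_str#*"$_sep"}             # drop that text and the separator
  done
  _out+=( "$_str" )                   # whatever remains is the last field
}

declare -a fields
split_keep_empty ',' 'a,,b,' fields
printf '[%s]\n' "${fields[@]}"
# [a]
# []
# [b]
# []
```

Quoting "$_sep" inside the patterns makes the separator literal, so characters such as * or ? can be used as separators too. The caller's array must not itself be named _out, since that would make the nameref circular.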

Exercise 3 — Glob-safe pattern passing

Write a function safe_find DIR PATTERN that uses find to locate files matching PATTERN in DIR. The function must work correctly regardless of whether the current directory contains files that match PATTERN — i.e., it must prevent premature glob expansion. Also handle the case where DIR does not exist, exiting with a clear message.

safe_find() {
  local dir="${1:?safe_find: DIR required}"
  local pattern="${2:?safe_find: PATTERN required}"

  if [[ ! -d "$dir" ]]; then
    printf 'safe_find: not a directory: %s\n' "$dir" >&2
    return 1
  fi

  # Always quote $pattern so the shell never expands it.
  # find receives the literal string and does its own pattern matching.
  find "$dir" -type f -name "$pattern"
}

# Usage — safe even when *.log files exist in the current directory
safe_find /var/log '*.log'
safe_find /nonexistent '*.txt'   # prints error, returns 1

Exercise 4 — Expansion order puzzle: build a general renderer

Write a function render_template TEMPLATE_FILE that reads a file where {{VAR}} placeholders should be replaced with the values of shell variables of the same name. Requirements:

  • Must NOT use eval (injection risk)
  • Must handle multi-line templates
  • Must leave {{UNDEFINED}} placeholders intact if the variable is not set
  • Should use only Bash built-ins — no sed or awk

Hint: think about how parameter expansion's ${!varname} indirect form and ${var:-default} combine here.

render_template() {
  local file="${1:?template file required}"
  [[ -f "$file" ]] || { printf 'render_template: not found: %s\n' "$file" >&2; return 1; }

  local line out rest varname value

  while IFS= read -r line || [[ -n $line ]]; do
    out=''
    rest="$line"

    # Scan for {{...}} placeholders using parameter expansion — no external tools
    while [[ $rest == *'{{'*'}}'* ]]; do
      # Everything before the first {{
      out+="${rest%%\{\{*}"
      # Everything from {{ onward
      rest="${rest#*\{\{}"
      # The variable name is before }}
      varname="${rest%%\}\}*}"
      # Remainder after }}
      rest="${rest#*\}\}}"

      # Validate: varname must be a legal identifier
      if [[ $varname =~ ^[a-zA-Z_][a-zA-Z0-9_]*$ ]]; then
        if [[ -v $varname ]]; then
          # Variable is set: indirect expansion ${!varname} retrieves its value.
          # Expanding only inside this branch keeps the function safe under set -u.
          value="${!varname}"
          out+="$value"
        else
          # Variable not set: leave placeholder intact
          out+="{{$varname}}"
        fi
      else
        # Not a valid identifier — leave it as-is
        out+="{{$varname}}"
      fi
    done

    printf '%s\n' "${out}${rest}"
  done < "$file"
}

# Test
NAME='Alice'
HOST='prod-01'
# Template file contains:
#   Hello {{NAME}}, deploying to {{HOST}}.
#   Token: {{SECRET}}   <-- SECRET not set
render_template template.txt
# Hello Alice, deploying to prod-01.
# Token: {{SECRET}}