🛡️ Topic 11 — Error Handling and Debugging
A script that silently continues past a failed command and corrupts data is far more dangerous than one that crashes loudly. Professional Bash scripts are defensive by design — they set strict execution flags, trap unexpected exits, log what they do, and clean up after themselves no matter how they end. This chapter covers every layer of that defence, plus the full toolkit for finding and fixing bugs when things go wrong.
1 — Exit Codes
Every command in Bash exits with a numeric status code — 0 means success, any non-zero value means failure. This is the foundation of all error handling.
# $? holds the exit code of the most recent command
ls /tmp
echo "Exit code: $?" → 0 (success)
ls /nonexistent 2>/dev/null
echo "Exit code: $?" → 2 (no such file)
# Check exit code immediately after a command
cp source.txt dest.txt
if [[ $? -ne 0 ]]; then
echo "Copy failed" >&2
exit 1
fi
# More concise idiom: use the command directly in if
if ! cp source.txt dest.txt; then
echo "Copy failed" >&2
exit 1
fi
# Common exit code conventions
exit 0 # success
exit 1 # general error
exit 2 # misuse of shell built-in / bad argument
exit 126 # command found but not executable
exit 127 # command not found
exit 130 # terminated by Ctrl+C (128 + signal 2)
Check $? on the very next line, or save it first: rc=$?. Any intervening command, even an echo or an assignment, overwrites it with its own exit code (usually 0).
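The save-it-first advice can be sketched as a small helper. The name status_of is hypothetical and exists only for this illustration; it runs whatever command it is given and stores $? before anything else can overwrite it.

```bash
#!/usr/bin/env bash
# status_of is a hypothetical helper name used only for this illustration.
status_of() {
    local rc
    "$@" >/dev/null 2>&1
    rc=$?                  # save immediately; the echo below would reset $?
    echo "$rc"
}

status_of true                # prints 0
status_of false               # prints 1
status_of sh -c 'exit 42'     # prints 42
```

Declaring rc on its own line and assigning it afterwards keeps the saved value tied to the command rather than to the declaration.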
2 — Strict Mode: set -euo pipefail
These three options — almost always combined — form the backbone of defensive scripting. Put them at the top of every non-trivial script.
#!/bin/bash
set -euo pipefail
# Equivalent to writing all three separately:
# set -e Exit immediately if any command fails
# set -u Treat unset variables as an error
# set -o pipefail Make a pipeline fail if ANY stage fails
set -e — exit on error
set -e
# Without set -e: script continues after failure
ls /nonexistent # exits 2 — script would continue silently!
echo "This runs"
# With set -e: script exits immediately on that ls failure
# set -e does NOT trigger for:
# - commands in an if condition
# - the left side of && or ||
# - commands preceded by ! (negation)
# - commands in a while/until condition
if grep -q "pattern" file.txt; then # grep's exit code is handled here
echo "found"
fi
# Use || true to intentionally allow a command to fail
rm stale_lock.pid || true # don't abort if file doesn't exist
mkdir -p output/ || true # -p already handles this, but pattern is common
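The exemptions listed above can be checked in throwaway child shells. This is a rough sketch that assumes bash is on the PATH; run_strict is a hypothetical helper that runs a snippet under set -e and passes through whatever the child printed.

```bash
#!/usr/bin/env bash
# run_strict is a hypothetical helper: it runs a snippet in a child bash
# with set -e enabled and prints whatever that child shell printed.
run_strict() {
    bash -c "set -e; $1" 2>/dev/null || true
}

run_strict 'false; echo reached'                  # prints nothing: set -e aborted the child
run_strict 'if false; then :; fi; echo reached'   # prints "reached": if conditions are exempt
run_strict 'false || true; echo reached'          # prints "reached": left side of || is exempt
run_strict '! true; echo reached'                 # prints "reached": a negated command is exempt
```

Running the snippets in a child shell keeps set -e from affecting the shell you are experimenting in.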
set -u — catch unset variables
set -u
# Without set -u: typos silently expand to empty string
username="alice"
echo "Hello, $usrname" # typo — prints "Hello, " silently
# With set -u: bash throws an error immediately
bash: usrname: unbound variable
# Use ${var:-default} to safely allow a variable to be unset
log_level="${LOG_LEVEL:-info}" # default to "info" if LOG_LEVEL not set
output_dir="${1:-/tmp/output}" # default to /tmp/output if $1 not given
# Positional parameters such as $1 need care with set -u ($@ and $* are exempt)
# Use "${1:-}" or check $# first
[[ $# -gt 0 ]] && echo "First arg: $1"
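The difference between an unset, an empty, and a set variable under set -u can be sketched as follows. APP_PORT and port_or_default are hypothetical names chosen for the example, not part of any real tool.

```bash
#!/usr/bin/env bash
# APP_PORT and port_or_default are hypothetical names used for illustration.
port_or_default() {
    # The subshell keeps set -u from leaking into the calling shell.
    ( set -u; echo "${APP_PORT:-8080}" )
}

unset -v APP_PORT
port_or_default        # prints 8080 (unset, so the default applies)
APP_PORT=""
port_or_default        # prints 8080 (empty also counts, because of the colon)
APP_PORT=9000
port_or_default        # prints 9000
```

Dropping the colon (${APP_PORT-8080}) would apply the default only when the variable is unset, and would keep an explicitly empty value.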
set -o pipefail — catch pipeline failures
# Without pipefail: pipeline exit code = last command's code
# This silently succeeds even though cat failed!
cat /nonexistent/file.txt | sort
echo "Exit: $?" → 0 (sort's code, not cat's)
set -o pipefail
cat /nonexistent/file.txt | sort
cat: /nonexistent/file.txt: No such file or directory
# Pipeline exit code is now 1 (cat's failure); with set -e the script exits here
# PIPESTATUS — array of exit codes for each stage of last pipeline
cat file.txt | grep "x" | sort
echo "cat: ${PIPESTATUS[0]}, grep: ${PIPESTATUS[1]}, sort: ${PIPESTATUS[2]}"
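PIPESTATUS is replaced by the very next command, so it usually needs to be copied at once. The sketch below uses deliberately artificial pipeline stages (false, true, and a shell that exits 3) so the per-stage codes are easy to recognise; codes and pipeline_rc are illustrative variable names.

```bash
#!/usr/bin/env bash
# Artificial stages so each exit code is easy to recognise.
false | true | sh -c 'exit 3'
codes=( "${PIPESTATUS[@]}" )      # snapshot: one exit code per stage
echo "stages: ${codes[*]}"        # prints "stages: 1 0 3"

# With pipefail, the pipeline's own status is the rightmost non-zero code.
# The command substitution is a subshell, so pipefail does not leak out.
pipeline_rc=$( set -o pipefail; false | sh -c 'exit 3' | true; echo "$?" )
echo "pipeline: $pipeline_rc"     # prints "pipeline: 3"
```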
3 — The trap Command
trap registers a command or function to run when the script receives a signal or exits. It is essential for cleanup — removing temp files, releasing locks, printing a useful error message — no matter how the script ends.
# trap 'command' SIGNAL [SIGNAL...]
# Key pseudo-signals:
# EXIT — runs when the script exits for any reason
# ERR — runs when a command returns non-zero (same exemptions as set -e; works without it, add set -E to cover functions)
# INT — Ctrl+C (SIGINT)
# TERM — kill command (SIGTERM)
# HUP — terminal hang-up (SIGHUP)
# DEBUG — runs before every command (useful for tracing)
# Remove a trap
trap - EXIT
# List current traps
trap -p
EXIT trap — guaranteed cleanup
#!/bin/bash
set -euo pipefail
# Create a temp directory and guarantee its removal on exit
TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
# Everything in TMPDIR is cleaned up whether script succeeds,
# fails, or is killed with Ctrl+C
echo "Working in: $TMPDIR"
cp important_data.csv "$TMPDIR/"
# ... do processing ...
mv "$TMPDIR/result.csv" ./final_result.csv
# Cleanup happens automatically at this point
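One way to convince yourself that the EXIT trap fires on failure is to run the work in a subshell and inspect the result from outside. This is only a sketch: the marker file is a hypothetical device for reporting the temp path back to the parent shell, and work_dir is an illustrative name.

```bash
#!/usr/bin/env bash
# The marker file is only a hypothetical way to report the temp path
# back to the parent shell for this demonstration.
marker=$(mktemp)
(
    work_dir=$(mktemp -d)
    trap 'rm -rf "$work_dir"' EXIT
    printf '%s\n' "$work_dir" > "$marker"
    exit 1                         # simulate a failure part-way through
) || true

work_dir=$(<"$marker")
rm -f "$marker"
if [[ -d "$work_dir" ]]; then
    echo "temp dir still exists"
else
    echo "temp dir was removed"    # this branch runs
fi
```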
ERR trap — error location reporting
#!/bin/bash
set -euo pipefail
on_error() {
local exit_code=$?
local line_no="$1"
printf '\n\033[31m[ERROR]\033[0m Script failed at line %d (exit code %d)\n' \
"$line_no" "$exit_code" >&2
}
# $LINENO expands to the current line number at the time trap fires
trap 'on_error $LINENO' ERR
# Combine with EXIT for full coverage
TMPDIR=$(mktemp -d)
trap 'rm -rf "$TMPDIR"' EXIT
echo "Script starting..."
cp /nonexistent "$TMPDIR" # this will fail
echo "This line is never reached"
[ERROR] Script failed at line 18 (exit code 1)
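One caveat worth checking: by default Bash does not pass an ERR trap on to shell functions, so a failure inside a function can go unreported unless errtrace (set -E) is enabled. The sketch below compares the two cases in child shells; fail_inside and the snippet text are hypothetical.

```bash
#!/usr/bin/env bash
# fail_inside is a hypothetical function. It fails internally (false) but
# returns 0, so only a trap inherited by the function can notice the failure.
snippet='
trap "echo ERR-trap-fired" ERR
fail_inside() { false; true; }
fail_inside
'
without_E=$(bash -c "$snippet" 2>&1)
with_E=$(bash -c "set -E; $snippet" 2>&1)

echo "default : ${without_E:-<no output>}"    # prints "default : <no output>"
echo "set -E  : ${with_E:-<no output>}"       # prints "set -E  : ERR-trap-fired"
```

If error reporting has to cover code inside functions, adding -E to the strict-mode line (set -Eeuo pipefail) is a common choice.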
INT and TERM — graceful interruption
#!/bin/bash
LOCK_FILE="/var/run/myscript.lock"
cleanup() {
echo ""
echo "Caught signal — cleaning up..." >&2
rm -f "$LOCK_FILE"
exit 130 # 128 + SIGINT(2) — conventional exit code for Ctrl+C
}
trap 'cleanup' INT TERM
# Acquire lock
touch "$LOCK_FILE"
echo "Running (PID $$). Press Ctrl+C to stop."
while true; do
echo "Working..."
sleep 2
done
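The handler above exits 130 for both signals. If the caller needs to know which signal arrived, one possible variation is a separate trap per signal, following the 128 + signal number convention (130 for INT, 143 for TERM). The child script below is a hypothetical test harness that sends the signal to itself.

```bash
#!/usr/bin/env bash
# Hypothetical child script: one trap per signal, then it signals itself.
child='
trap "exit 130" INT
trap "exit 143" TERM
kill -s "$1" $$     # simulate the signal arriving
sleep 1             # not reached: the trap runs first and exits
'
bash -c "$child" _ TERM
term_rc=$?
echo "after TERM: $term_rc"    # prints "after TERM: 143"

bash -c "$child" _ INT
int_rc=$?
echo "after INT: $int_rc"      # 130, unless the caller started us with SIGINT ignored
```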
4 — Error Handling Patterns
die() — centralised fatal error function
die() {
local msg="${1:-Fatal error}"
local code="${2:-1}"
printf '\033[31m[FATAL]\033[0m %s\n' "$msg" >&2
exit "$code"
}
# Usage — call it anywhere you want to abort with a message
[[ -f "$config" ]] || die "Config file not found: $config"
ping -c1 -W1 "$host" >/dev/null 2>&1 || die "Host unreachable: $host"
[[ $EUID -eq 0 ]] || die "This script must be run as root" 2
require() — checking dependencies upfront
require() {
local cmd
for cmd in "$@"; do
command -v "$cmd" >/dev/null 2>&1 || \
die "Required command not found: $cmd"
done
}
# At the top of the script, check all dependencies at once
require curl jq awk sed git
# command -v is preferred over which for portability
# It returns 0 if found, non-zero if not
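A possible variation is to report every missing dependency at once instead of stopping at the first one. missing_commands is a hypothetical helper, and the deliberately bogus command name exists only to trigger the failure path.

```bash
#!/usr/bin/env bash
# missing_commands is a hypothetical variant of require(): instead of dying
# on the first missing tool, it prints every missing name, one per line.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
    return 0
}

missing=$(missing_commands sh ls definitely-not-a-real-command-xyz)
if [[ -n "$missing" ]]; then
    echo "Missing: $missing" >&2    # reports the one bogus name
fi
```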
Handling errors in subshells and command substitution
set -e
# GOTCHA: set -e is NOT inherited by command substitution $()
result=$(failing_command) # failing_command runs in a subshell
# The assignment itself fails — set -e DOES catch this
# BUT if you assign in a local declaration, set -e is bypassed!
bad_example() {
local val=$(failing_command) # local always exits 0 — failure is hidden!
}
# CORRECT: declare local first, assign separately
good_example() {
local val
val=$(failing_command) # now set -e sees the failure
}
# Explicitly check when you need the value AND the exit code
output=$(some_command) || { echo "some_command failed" >&2; exit 1; }
local var=$(cmd) silently swallows errors. The local builtin always exits with code 0, masking the failure of the command substitution inside it. Always declare local var on one line and assign it on the next. This is one of the most common silent failure bugs in Bash scripts.
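The masking effect is easy to observe directly. In this sketch, masked and visible are hypothetical function names; each runs a failing command substitution and prints the $? that follows.

```bash
#!/usr/bin/env bash
# masked and visible are hypothetical names for this demonstration.
masked() {
    local val=$(false)      # local succeeds, so the failure of false is lost
    echo "$?"
}
visible() {
    local val
    val=$(false)            # plain assignment keeps the substitution's status
    echo "$?"
}

masked      # prints 0
visible     # prints 1
```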
5 — Structured Logging
Good logging tells you what happened after a script has run unattended. A minimal log library takes only a couple of dozen lines to write and pays dividends immediately.
#!/bin/bash
# lib/log.sh — source this from your scripts
LOG_LEVEL="${LOG_LEVEL:-INFO}" # override with: LOG_LEVEL=DEBUG ./script.sh
LOG_FILE="${LOG_FILE:-}" # set to a path to also write to a file
declare -A _LOG_LEVELS=( [DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3 )
_log() {
local level="$1"; shift
local msg="$*"
local ts
ts=$(date '+%Y-%m-%d %H:%M:%S')
# Skip if below configured log level
[[ "${_LOG_LEVELS[$level]:-0}" -lt "${_LOG_LEVELS[$LOG_LEVEL]:-1}" ]] && return
local colour
case "$level" in
DEBUG) colour='\033[36m' ;; # cyan
INFO) colour='\033[32m' ;; # green
WARN) colour='\033[33m' ;; # yellow
ERROR) colour='\033[31m' ;; # red
esac
local line
line="[$ts] [${level}] $msg"
# Coloured output to stderr
printf "${colour}%s\033[0m\n" "$line" >&2
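# Plain output to log file (no colour codes). Use if rather than &&, so that
# _log still returns 0 under set -e when LOG_FILE is empty
if [[ -n "$LOG_FILE" ]]; then printf "%s\n" "$line" >> "$LOG_FILE"; fi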
}
log_debug() { _log DEBUG "$@"; }
log_info() { _log INFO "$@"; }
log_warn() { _log WARN "$@"; }
log_error() { _log ERROR "$@"; }
#!/bin/bash
set -euo pipefail
source "$(dirname "$0")/lib/log.sh"
LOG_FILE="/var/log/myscript.log"
log_info "Script started (PID $$)"
log_debug "Arguments: $*"
if ! ping -c1 -W1 google.com >/dev/null 2>&1; then
log_warn "No network connectivity"
fi
log_info "Processing complete"
# Run with debug logging:
# LOG_LEVEL=DEBUG ./myscript.sh
[2026-06-09 14:32:01] [INFO] Script started (PID 4521)
[2026-06-09 14:32:01] [DEBUG] Arguments: file.csv
[2026-06-09 14:32:02] [INFO] Processing complete
6 — Debugging Tools
set -x — execution tracing
# Enable at the command line — no script modification needed
bash -x myscript.sh arg1 arg2
# Or add to the script header alongside other options
set -euxo pipefail
# Enable/disable around a specific section only
echo "Before the tricky bit"
set -x
complex_operation "$arg1" "$arg2"
set +x
echo "After the tricky bit"
# Customise the trace prompt — show line numbers
PS4='+ ${BASH_SOURCE[0]}:${LINENO}: '
set -x
# Trace output (each line starts with PS4; its first character is repeated for nested levels such as command substitutions):
+ myscript.sh:12: cp source.txt /tmp/
+ myscript.sh:13: echo Done
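PS4 can carry more than the line number. As a rough sketch, the child shell below adds the current function name to each trace line; the "main" fallback label and the greet function are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Capture a trace from a child shell so the demo does not clutter this one.
# "main" is a fallback label chosen here for code outside any function.
trace=$(bash -c '
    PS4="+ [\${FUNCNAME[0]:-main}] "
    set -x
    greet() { : "hello from greet"; }
    greet
' 2>&1)
printf '%s\n' "$trace"    # one trace line per command, tagged with [main] or [greet]
```

The backslash before the dollar sign matters: it stores the expression in PS4 unexpanded, so Bash evaluates it afresh for every traced command.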
bash -n — syntax check without running
# Check syntax of a script — runs no commands
bash -n myscript.sh
# No output = no syntax errors
# bash -n only catches syntax errors, NOT logic errors or missing files
# In a CI pipeline, run bash -n first: it is fast and fails before any command executes
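A quick way to see bash -n at work is to feed it a deliberately broken file. The temporary script below is hypothetical: it omits the closing fi, so the syntax check should fail without running anything.

```bash
#!/usr/bin/env bash
# Write a deliberately broken script (missing "fi") and check it with bash -n.
broken=$(mktemp)
printf '%s\n' 'if true; then' '  echo "never runs"' > "$broken"

if bash -n "$broken" 2>/dev/null; then
    syntax_ok=yes
else
    syntax_ok=no                 # bash -n exits non-zero on a syntax error
fi
echo "syntax ok? $syntax_ok"     # prints "syntax ok? no"
rm -f "$broken"
```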
ShellCheck — static analysis
- Quoting bugs ($var instead of "$var")
- The local var=$(cmd) silent failure pattern
- Unquoted globs and word-splitting issues
- Portability problems (bash-only features in #!/bin/sh scripts)
- Common logic errors and deprecated syntax
Install: apt install shellcheck / brew install shellcheck
Run: shellcheck myscript.sh
Online: shellcheck.net (paste your script for instant analysis)
Debugging techniques in practice
# 1. Print variable contents and types
declare -p my_array # shows type and value — great for arrays
declare -p my_var
# 2. Check where a function is defined
declare -f function_name # prints the function body
# 3. Print a stack trace on error
print_stack() {
local i=0
echo "Call stack:" >&2
while caller $i; do
i=$(( i + 1 )) # not (( i++ )): that returns 1 when i is 0 and would trip set -e
done >&2
}
trap 'print_stack' ERR
# 4. Time a section of code
start=$(date +%s%N) # nanoseconds (GNU date; BSD/macOS date does not support %N)
# ... work ...
elapsed=$(( ($(date +%s%N) - start) / 1000000 ))
echo "Elapsed: ${elapsed}ms"
# 5. BASH_SOURCE, FUNCNAME, LINENO — where am I?
debug_location() {
printf "[%s:%d in %s()]\n" \
"${BASH_SOURCE[1]}" "${BASH_LINENO[0]}" "${FUNCNAME[1]}" >&2
}
# 6. Pause and inspect mid-script
breakpoint() {
set +x
read -r -p "[breakpoint] Press Enter to continue..."
set -x
}
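The FUNCNAME array mentioned in technique 5 can also be walked in a loop to list the chain of callers. stack_names, inner, and outer are hypothetical names for this sketch.

```bash
#!/usr/bin/env bash
# stack_names is a hypothetical helper: it prints the chain of calling
# functions, innermost first, by walking the FUNCNAME array.
stack_names() {
    local i
    for (( i = 1; i < ${#FUNCNAME[@]}; i++ )); do
        printf '%s\n' "${FUNCNAME[i]}"
    done
}
inner() { stack_names; }
outer() { inner; }

outer    # prints inner, then outer, then the top-level frame (main when run as a script)
```

Index 0 is stack_names itself, so the loop starts at 1 to report only the callers.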
7 — The Defensive Script Template
This is a production-ready starting point that combines everything from this chapter. Copy it as your base for any non-trivial script.
#!/usr/bin/env bash
# =============================================================
# script_name.sh — One-line description of what this does
# Usage: ./script_name.sh [OPTIONS] ARGUMENT
# =============================================================
set -Eeuo pipefail # -E (errtrace) lets the ERR trap fire inside functions such as main()
IFS=$'\n\t' # word-split only on newlines and tabs, not spaces
# ── Script metadata ──────────────────────────────────────────
# Assign first, then mark readonly: readonly VAR="$(cmd)" would hide a failing cmd
SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
readonly SCRIPT_NAME SCRIPT_DIR
# ── Logging ──────────────────────────────────────────────────
LOG_LEVEL="${LOG_LEVEL:-INFO}"
log() { printf '[%s] [%s] %s\n' "$(date '+%H:%M:%S')" "$1" "$2" >&2; }
info() { log INFO "$*"; }
warn() { log WARN "$*"; }
die() { log FATAL "$*"; exit 1; }
# ── Cleanup ───────────────────────────────────────────────────
TMPDIR=$(mktemp -d)
cleanup() {
local rc=$?
rm -rf "$TMPDIR"
# Use if rather than &&: a false test as the last command would make cleanup return 1
if [[ $rc -ne 0 ]]; then warn "Script exited with code $rc"; fi
}
trap 'cleanup' EXIT
on_error() { die "Unexpected error on line $1"; }
trap 'on_error $LINENO' ERR
# ── Argument parsing ──────────────────────────────────────────
usage() {
printf 'Usage: %s [--verbose] INPUT_FILE\n' "$SCRIPT_NAME"
exit "${1:-0}"
}
verbose=0
input_file=""
while [[ $# -gt 0 ]]; do
case "$1" in
--verbose|-v) verbose=1; shift ;;
--help|-h) usage ;;
--) shift; break ;;
-*) die "Unknown option: $1" ;;
*) input_file="$1"; shift ;;
esac
done
[[ -n "$input_file" ]] || usage 2
[[ -f "$input_file" ]] || die "File not found: $input_file"
# ── Main logic ────────────────────────────────────────────────
main() {
info "Starting $SCRIPT_NAME"
# ... your work here ...
info "Done"
}
main "$@"
#!/bin/bash
# No strict mode
# No trap
# No error checking
cp $1 /backup/
rm $1
echo "Done"
# If cp fails: rm runs anyway
# If $1 has spaces: breaks
# If /backup/ full: silent fail
#!/usr/bin/env bash
set -euo pipefail
trap 'echo "Failed on line $LINENO" >&2' ERR
[[ -f "${1:?No file given}" ]] \
|| { echo "Not a file: $1" >&2; exit 1; }
if cp "$1" /backup/; then
rm "$1"
echo "Done"
else
echo "Copy failed — original kept" >&2
exit 1
fi
8 — Quick Reference
| Tool / Option | What it does | Notes |
|---|---|---|
| `$?` | Exit code of most recent command | 0 = success; save to `rc=$?` before it's overwritten |
| `set -e` | Exit on any command failure | Does not trigger in `if`, `\|\|`, `&&`, `!` contexts |
| `set -u` | Error on unset variable | Use `${var:-default}` for intentionally optional vars |
| `set -o pipefail` | Pipeline fails if any stage fails | `PIPESTATUS` array has per-stage codes |
| `set -x` | Print each command before executing | Customise prompt with `PS4` |
| `bash -n script` | Syntax check without running | Quick pre-flight check |
| `trap 'cmd' EXIT` | Run on any exit | Use for cleanup: temp files, locks |
| `trap 'cmd' ERR` | Run when a command fails (same exemptions as `set -e`) | Use `$LINENO` to report location; add `set -E` to cover functions |
| `trap 'cmd' INT TERM` | Handle Ctrl+C / kill | Exit with code 130 for INT (143 for TERM) |
| `trap - SIGNAL` | Remove a trap | Restores the default handling |
| `command -v name` | Check a command exists (portable) | Prefer over `which` |
| `declare -p var` | Print variable type and value | Essential for debugging arrays |
| `caller N` | Print call stack frame N | Use in a loop for full stack trace |
| `shellcheck` | Static analysis that catches subtle bugs | Run on every script before committing |
✏️ Exercises
Apply what you have learned. Write each script yourself before looking at the sample solution.
awk is available before using it.
Broken script to fix:
awk '{sum+=$1} END{print sum}' $1 > /tmp/out.txt; echo "Total: $(cat /tmp/out.txt)"
main() function called at the end. Use ${1:?...} for argument checking, [[ -r "$1" ]] to verify readability, and command -v awk to check availability. Create the temp file with mktemp and trap its removal on EXIT.
#!/usr/bin/env bash
# sum_column.sh — sum the first column of a file
set -Eeuo pipefail # -E so the ERR trap also fires inside main()
TMPFILE=""
cleanup() {
[[ -n "$TMPFILE" ]] && rm -f "$TMPFILE"
}
trap 'cleanup' EXIT
trap 'echo "[ERROR] Failed on line $LINENO" >&2' ERR
main() {
local input
input="${1:?Usage: $0 <file>}"
[[ -r "$input" ]] || { echo "Not readable: $input" >&2; exit 1; }
command -v awk >/dev/null 2>&1 || { echo "awk not found" >&2; exit 1; }
TMPFILE=$(mktemp)
awk '{sum += $1} END {print sum}' "$input" > "$TMPFILE"
echo "Total: $(<"$TMPFILE")"
}
main "$@"
safe_deploy.sh that simulates a deployment with five steps (each represented by a function that may or may not succeed). Use a full defensive setup: strict mode, an ERR trap that logs the failing step name and line number, an EXIT trap that logs whether the deployment succeeded or failed (based on the exit code), and a rollback() function that is called on ERR to undo any completed steps. Each step should log its progress using a simple log function.
$? to determine success or failure. Simulate random step failure with (( RANDOM % 3 == 0 )).
#!/usr/bin/env bash
# safe_deploy.sh
set -euo pipefail
completed_steps=()
log() { printf '[%s] %s\n' "$(date '+%H:%M:%S')" "$*"; }
info() { log "INFO $*"; }
error(){ log "ERROR $*" >&2; }
rollback() {
error "Rolling back ${#completed_steps[@]} completed step(s)..."
for (( i=${#completed_steps[@]}-1; i>=0; i-- )); do
error " ↩ Undoing: ${completed_steps[$i]}"
sleep 0.3
done
error "Rollback complete."
}
on_error() {
error "Deployment failed at line $1"
rollback
}
trap 'on_error $LINENO' ERR
on_exit() {
local rc=$?
if [[ $rc -eq 0 ]]; then
info "✓ Deployment SUCCEEDED"
else
error "✗ Deployment FAILED (exit $rc)"
fi
}
trap 'on_exit' EXIT
run_step() {
local name="$1"
info "Running: $name"
sleep 0.5
# Simulate random failure (1-in-3 chance)
(( RANDOM % 3 != 0 )) || { error "Step failed: $name"; return 1; }
completed_steps+=( "$name" )
info " ✓ $name"
}
info "=== Deployment starting ==="
run_step "1. Run database migrations"
run_step "2. Upload static assets"
run_step "3. Deploy application code"
run_step "4. Restart application servers"
run_step "5. Warm up caches"
info "=== All steps complete ==="
retry.sh that wraps any command and retries it up to N times with a configurable delay between attempts. Usage: ./retry.sh --attempts 5 --delay 2 -- curl https://example.com. Log each attempt number, whether it succeeded or failed, and the exit code. Exit 0 only if the command eventually succeeds; exit 1 if all attempts fail.
--attempts and --delay from $@, stopping at --. Use a for loop with a C-style counter. Capture the command's exit code with cmd_rc=$? inside a subshell. Use sleep "$delay" between attempts and skip the sleep after the final attempt.
#!/usr/bin/env bash
# retry.sh — usage: ./retry.sh [--attempts N] [--delay S] -- COMMAND [ARGS...]
set -uo pipefail # note: no -e so we can capture failing command's exit code
attempts=3
delay=1
log() { printf '[retry] %s\n' "$*" >&2; }
while [[ $# -gt 0 && "$1" != "--" ]]; do
case "$1" in
--attempts) attempts="$2"; shift 2 ;;
--delay) delay="$2"; shift 2 ;;
*) echo "Unknown option: $1" >&2; exit 2 ;;
esac
done
shift # remove the '--'
[[ $# -gt 0 ]] || { echo "No command given after --" >&2; exit 2; }
log "Command : $*"
log "Attempts : $attempts"
log "Delay : ${delay}s"
for (( i=1; i<=attempts; i++ )); do
log "Attempt $i / $attempts..."
if "$@"; then
log "✓ Succeeded on attempt $i"
exit 0
else
rc=$? # plain assignment: local is only valid inside a function
log "✗ Failed (exit $rc)"
if [[ $i -lt $attempts ]]; then
log "Waiting ${delay}s before retry..."
sleep "$delay"
fi
fi
done
log "All $attempts attempt(s) failed."
exit 1
health_check.sh that checks a list of services and URLs. For each service (e.g. nginx, ssh), it should verify the service is running using systemctl is-active. For each URL, it should check HTTP reachability using curl -sf. Results should be logged with coloured PASS/FAIL labels. At the end, print a summary count of passes and failures. Exit 0 if all checks pass, exit 1 if any fail.
check() function that takes a label and a command; it runs the command, captures the exit code, and prints PASS in green or FAIL in red using printf '\033[32mPASS\033[0m'. Count failures in a variable, not with set -e.
#!/usr/bin/env bash
# health_check.sh
set -uo pipefail # no -e — we handle each failure ourselves
# ── Configure checks here ─────────────────────────────────────
services=( ssh cron )
urls=(
"https://example.com"
"https://api.github.com"
)
# ── Counters ──────────────────────────────────────────────────
passes=0
failures=0
check() {
local label="$1"; shift
if "$@" >/dev/null 2>&1; then
printf ' \033[32mPASS\033[0m %s\n' "$label"
(( passes++ ))
else
printf ' \033[31mFAIL\033[0m %s\n' "$label"
(( failures++ ))
fi
}
printf '\n\033[1mHealth Check — %s\033[0m\n' "$(date '+%Y-%m-%d %H:%M:%S')"
printf '%-6s %s\n' "Status" "Check"
printf '%.0s─' {1..40}; echo
echo "Services:"
for svc in "${services[@]}"; do
check "service: $svc" systemctl is-active "$svc"
done
echo "URLs:"
for url in "${urls[@]}"; do
check "url: $url" curl -sf --max-time 5 "$url"
done
printf '%.0s─' {1..40}; echo
printf 'Summary: \033[32m%d passed\033[0m, \033[31m%d failed\033[0m\n\n' \
"$passes" "$failures"
[[ $failures -eq 0 ]]