Error Handling and Debugging

🛡️ Topic 11 — Error Handling and Debugging

A script that silently continues past a failed command and corrupts data is far more dangerous than one that crashes loudly. Professional Bash scripts are defensive by design — they set strict execution flags, trap unexpected exits, log what they do, and clean up after themselves no matter how they end. This chapter covers every layer of that defence, plus the full toolkit for finding and fixing bugs when things go wrong.

1 — Exit Codes

Every command in Bash exits with a numeric status code — 0 means success, any non-zero value means failure. This is the foundation of all error handling.

🐧 Reading and using exit codes
# $? holds the exit code of the most recent command ls /tmp echo "Exit code: $?" → 0 (success) ls /nonexistent 2>/dev/null echo "Exit code: $?" → 2 (no such file) # Check exit code immediately after a command cp source.txt dest.txt if [[ $? -ne 0 ]]; then echo "Copy failed" >&2 exit 1 fi # More concise idiom: use the command directly in if if ! cp source.txt dest.txt; then echo "Copy failed" >&2 exit 1 fi # Common exit code conventions exit 0 # success exit 1 # general error exit 2 # misuse of shell built-in / bad argument exit 126 # command found but not executable exit 127 # command not found exit 130 # terminated by Ctrl+C (128 + signal 2)
⚠️ $? is reset after every command. If you need to test the exit code of a command, check $? on the very next line, or save it: rc=$?. An intervening echo or assignment will overwrite it with its own exit code (usually 0).

2 — Strict Mode: set -euo pipefail

These three options — almost always combined — form the backbone of defensive scripting. Put them at the top of every non-trivial script.

🐧 The strict mode header
#!/bin/bash set -euo pipefail # Equivalent to writing all three separately: # set -e Exit immediately if any command fails # set -u Treat unset variables as an error # set -o pipefail Make a pipeline fail if ANY stage fails

set -e — exit on error

🐧 What set -e does and when it doesn't trigger
set -e # Without set -e: script continues after failure ls /nonexistent # exits 2 — script would continue silently! echo "This runs" # With set -e: script exits immediately on that ls failure # set -e does NOT trigger for: # - commands in an if condition # - the left side of && or || # - commands followed by ! (negation) # - commands in a while/until condition if grep -q "pattern" file.txt; then # grep's exit code is handled here echo "found" fi # Use || true to intentionally allow a command to fail rm stale_lock.pid || true # don't abort if file doesn't exist mkdir -p output/ || true # -p already handles this, but pattern is common

set -u — catch unset variables

🐧 set -u in action
set -u # Without set -u: typos silently expand to empty string username="alice" echo "Hello, $usrname" # typo — prints "Hello, " silently # With set -u: bash throws an error immediately bash: usrname: unbound variable # Use ${var:-default} to safely allow a variable to be unset log_level="${LOG_LEVEL:-info}" # default to "info" if LOG_LEVEL not set output_dir="${1:-/tmp/output}" # default to /tmp/output if $1 not given # Special variables $@ and $* need care with set -u # Use "${@:-}" or check $# first [[ $# -gt 0 ]] && echo "First arg: $1"

set -o pipefail — catch pipeline failures

🐧 Why pipefail matters
# Without pipefail: pipeline exit code = last command's code # This silently succeeds even though cat failed! cat /nonexistent/file.txt | grep "pattern" echo "Exit: $?" → 1 (grep's code, not cat's) set -o pipefail cat /nonexistent/file.txt | grep "pattern" cat: /nonexistent/file.txt: No such file or directory # Pipeline exit code is now 1 (cat's failure), script exits # PIPESTATUS — array of exit codes for each stage of last pipeline cat file.txt | grep "x" | sort echo "cat: ${PIPESTATUS[0]}, grep: ${PIPESTATUS[1]}, sort: ${PIPESTATUS[2]}"

3 — The trap Command

trap registers a command or function to run when the script receives a signal or exits. It is essential for cleanup — removing temp files, releasing locks, printing a useful error message — no matter how the script ends.

🐧 trap syntax and signal names
# trap 'command' SIGNAL [SIGNAL...] # Key pseudo-signals: # EXIT — runs when the script exits for any reason # ERR — runs after any command that returns non-zero (with set -e) # INT — Ctrl+C (SIGINT) # TERM — kill command (SIGTERM) # HUP — terminal hang-up (SIGHUP) # DEBUG — runs before every command (useful for tracing) # Remove a trap trap - EXIT # List current traps trap -p

EXIT trap — guaranteed cleanup

🐧 Cleanup on exit with a temp directory
#!/bin/bash set -euo pipefail # Create a temp directory and guarantee its removal on exit TMPDIR=$(mktemp -d) trap 'rm -rf "$TMPDIR"' EXIT # Everything in TMPDIR is cleaned up whether script succeeds, # fails, or is killed with Ctrl+C echo "Working in: $TMPDIR" cp important_data.csv "$TMPDIR/" # ... do processing ... mv "$TMPDIR/result.csv" ./final_result.csv # Cleanup happens automatically at this point

ERR trap — error location reporting

🐧 ERR trap that prints which line failed
#!/bin/bash set -euo pipefail on_error() { local exit_code=$? local line_no="$1" printf '\n\033[31m[ERROR]\033[0m Script failed at line %d (exit code %d)\n' \ "$line_no" "$exit_code" >&2 } # $LINENO expands to the current line number at the time trap fires trap 'on_error $LINENO' ERR # Combine with EXIT for full coverage TMPDIR=$(mktemp -d) trap 'rm -rf "$TMPDIR"' EXIT echo "Script starting..." cp /nonexistent "$TMPDIR" # this will fail echo "This line is never reached" [ERROR] Script failed at line 18 (exit code 1)

INT and TERM — graceful interruption

🐧 Handling Ctrl+C and kill signals
#!/bin/bash LOCK_FILE="/var/run/myscript.lock" cleanup() { echo "" echo "Caught signal — cleaning up..." >&2 rm -f "$LOCK_FILE" exit 130 # 128 + SIGINT(2) — conventional exit code for Ctrl+C } trap 'cleanup' INT TERM # Acquire lock touch "$LOCK_FILE" echo "Running (PID $$). Press Ctrl+C to stop." while true; do echo "Working..." sleep 2 done

4 — Error Handling Patterns

die() — centralised fatal error function

🐧 A reusable die() function
die() { local msg="${1:-Fatal error}" local code="${2:-1}" printf '\033[31m[FATAL]\033[0m %s\n' "$msg" >&2 exit "$code" } # Usage — call it anywhere you want to abort with a message [[ -f "$config" ]] || die "Config file not found: $config" ping -c1 -W1 "$host" >/dev/null 2&1 || die "Host unreachable: $host" [[ $EUID -eq 0 ]] || die "This script must be run as root" 2

require() — checking dependencies upfront

🐧 Verify required tools exist before doing any work
require() { local cmd for cmd in "$@"; do command -v "$cmd" >/dev/null 2&1 || \ die "Required command not found: $cmd" done } # At the top of the script, check all dependencies at once require curl jq awk sed git # command -v is preferred over which for portability # It returns 0 if found, non-zero if not

Handling errors in subshells and command substitution

🐧 set -e and command substitution gotchas
set -e # GOTCHA: set -e is NOT inherited by command substitution $() result=$(failing_command) # failing_command runs in a subshell # The assignment itself fails — set -e DOES catch this # BUT if you assign in a local declaration, set -e is bypassed! bad_example() { local val=$(failing_command) # local always exits 0 — failure is hidden! } # CORRECT: declare local first, assign separately good_example() { local val val=$(failing_command) # now set -e sees the failure } # Explicitly check when you need the value AND the exit code output=$(some_command) || { echo "some_command failed" >&2; exit 1; }
🚨 Critical: local var=$(cmd) silently swallows errors. The local builtin always exits with code 0, masking the failure of the command substitution inside it. Always declare local var on one line and assign it on the next. This is one of the most common silent failure bugs in Bash scripts.

5 — Structured Logging

Good logging is what tells you what happened after a script runs unattended. A minimal log library takes only a dozen lines to write and pays dividends immediately.

🐧 A reusable log library (lib/log.sh)
#!/bin/bash # lib/log.sh — source this from your scripts LOG_LEVEL="${LOG_LEVEL:-INFO}" # override with: LOG_LEVEL=DEBUG ./script.sh LOG_FILE="${LOG_FILE:-}" # set to a path to also write to a file declare -A _LOG_LEVELS=( [DEBUG]=0 [INFO]=1 [WARN]=2 [ERROR]=3 ) _log() { local level="$1"; shift local msg="$*" local ts ts=$(date '+%Y-%m-%d %H:%M:%S') # Skip if below configured log level [[ "${_LOG_LEVELS[$level]:-0}" -lt "${_LOG_LEVELS[$LOG_LEVEL]:-1}" ]] && return local colour case "$level" in DEBUG) colour='\033[36m' ;; # cyan INFO) colour='\033[32m' ;; # green WARN) colour='\033[33m' ;; # yellow ERROR) colour='\033[31m' ;; # red esac local line line="[$ts] [${level}] $msg" # Coloured output to stderr printf "${colour}%s\033[0m\n" "$line" >&2 # Plain output to log file (no colour codes) [[ -n "$LOG_FILE" ]] && printf "%s\n" "$line" >> "$LOG_FILE" } log_debug() { _log DEBUG "$@"; } log_info() { _log INFO "$@"; } log_warn() { _log WARN "$@"; } log_error() { _log ERROR "$@"; }
🐧 Using the log library in a script
#!/bin/bash set -euo pipefail source "$(dirname "$0")/lib/log.sh" LOG_FILE="/var/log/myscript.log" log_info "Script started (PID $$)" log_debug "Arguments: $*" if ! ping -c1 -W1 google.com >/dev/null 2&1; then log_warn "No network connectivity" fi log_info "Processing complete" # Run with debug logging: # LOG_LEVEL=DEBUG ./myscript.sh [2026-06-09 14:32:01] [INFO] Script started (PID 4521) [2026-06-09 14:32:01] [DEBUG] Arguments: file.csv [2026-06-09 14:32:02] [INFO] Processing complete

6 — Debugging Tools

set -x — execution tracing

🐧 Trace mode: see every command as it executes
# Enable at the command line — no script modification needed bash -x myscript.sh arg1 arg2 # Or add to the script header alongside other options set -euxo pipefail # Enable/disable around a specific section only echo "Before the tricky bit" set -x complex_operation "$arg1" "$arg2" set +x echo "After the tricky bit" # Customise the trace prompt — show line numbers PS4='+ ${BASH_SOURCE[0]}:${LINENO}: ' set -x # Trace output (each line prefixed with ++): ++ myscript.sh:12: cp source.txt /tmp/ ++ myscript.sh:13: echo "Done"

bash -n — syntax check without running

🐧 Validate syntax before executing
# Check syntax of a script — runs no commands bash -n myscript.sh # No output = no syntax errors # bash -n only catches syntax errors, NOT logic errors or missing files # Combine with set -e in a CI pipeline: # bash -n is fast — run it first before the real execution

ShellCheck — static analysis

ShellCheck is the single most valuable tool for Bash development. It performs static analysis and catches:
  • Quoting bugs ($var instead of "$var")
  • The local var=$(cmd) silent failure pattern
  • Unquoted globs and word-splitting issues
  • Portability problems (bash-only features in #!/bin/sh scripts)
  • Common logic errors and deprecated syntax

Install: apt install shellcheck / brew install shellcheck
Run: shellcheck myscript.sh
Online: shellcheck.net — paste your script for instant analysis.

Debugging techniques in practice

🐧 Other useful debugging patterns
# 1. Print variable contents and types declare -p my_array # shows type and value — great for arrays declare -p my_var # 2. Check where a function is defined declare -f function_name # prints the function body # 3. Print a stack trace on error print_stack() { local i=0 echo "Call stack:" >&2 while caller $i; do (( i++ )) done >&2 } trap 'print_stack' ERR # 4. Time a section of code start=$(date +%s%N) # nanoseconds # ... work ... elapsed=$(( ($(date +%s%N) - start) / 1000000 )) echo "Elapsed: ${elapsed}ms" # 5. BASH_SOURCE, FUNCNAME, LINENO — where am I? debug_location() { printf "[%s:%d in %s()]\n" \ "${BASH_SOURCE[1]}" "${BASH_LINENO[0]}" "${FUNCNAME[1]}" >&2 } # 6. Pause and inspect mid-script breakpoint() { set +x read -r -p "[breakpoint] Press Enter to continue..." set -x }

7 — The Defensive Script Template

This is a production-ready starting point that combines everything from this chapter. Copy it as your base for any non-trivial script.

🐧 Complete defensive script skeleton
#!/usr/bin/env bash # ============================================================= # script_name.sh — One-line description of what this does # Usage: ./script_name.sh [OPTIONS] ARGUMENT # ============================================================= set -euo pipefail IFS=$'\n\t' # word-split only on newlines and tabs, not spaces # ── Script metadata ────────────────────────────────────────── readonly SCRIPT_NAME="$(basename "${BASH_SOURCE[0]}")" readonly SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # ── Logging ────────────────────────────────────────────────── LOG_LEVEL="${LOG_LEVEL:-INFO}" log() { printf '[%s] [%s] %s\n' "$(date '+%H:%M:%S')" "$1" "$2" >&2; } info() { log INFO "$*"; } warn() { log WARN "$*"; } die() { log FATAL "$*"; exit 1; } # ── Cleanup ─────────────────────────────────────────────────── TMPDIR=$(mktemp -d) cleanup() { local rc=$? rm -rf "$TMPDIR" [[ $rc -ne 0 ]] && warn "Script exited with code $rc" } trap 'cleanup' EXIT on_error() { die "Unexpected error on line $1"; } trap 'on_error $LINENO' ERR # ── Argument parsing ────────────────────────────────────────── usage() { printf 'Usage: %s [--verbose] INPUT_FILE\n' "$SCRIPT_NAME" exit "${1:-0}" } verbose=0 input_file="" while [[ $# -gt 0 ]]; do case "$1" in --verbose|-v) verbose=1; shift ;; --help|-h) usage ;; --) shift; break ;; -*) die "Unknown option: $1" ;; *) input_file="$1"; shift ;; esac done [[ -n "$input_file" ]] || usage 2 [[ -f "$input_file" ]] || die "File not found: $input_file" # ── Main logic ──────────────────────────────────────────────── main() { info "Starting $SCRIPT_NAME" # ... your work here ... info "Done" } main "$@"
❌ Fragile script
#!/bin/bash # No strict mode # No trap # No error checking cp $1 /backup/ rm $1 echo "Done" # If cp fails: rm runs anyway # If $1 has spaces: breaks # If /backup/ full: silent fail
✅ Defensive script
#!/usr/bin/env bash set -euo pipefail trap 'echo "Failed on line $LINENO" >&2' ERR [[ -f "${1:?No file given}" ]] \ || { echo "Not a file: $1" >&2; exit 1; } if cp "$1" /backup/; then rm "$1" echo "Done" else echo "Copy failed — original kept" >&2 exit 1 fi

8 — Quick Reference

Tool / OptionWhat it doesNotes
$?Exit code of most recent command0 = success; save to rc=$? before it's overwritten
set -eExit on any command failureDoes not trigger in if, ||, &&, ! contexts
set -uError on unset variableUse ${var:-default} for intentionally optional vars
set -o pipefailPipeline fails if any stage failsPIPESTATUS array has per-stage codes
set -xPrint each command before executingCustomise prompt with PS4
bash -n scriptSyntax check without runningQuick pre-flight check
trap 'cmd' EXITRun on any exitUse for cleanup — temp files, locks
trap 'cmd' ERRRun after any error (with set -e)Use $LINENO to report location
trap 'cmd' INT TERMHandle Ctrl+C / killExit with code 130 for INT
trap - SIGNALRemove a trap
command -v nameCheck a command exists (portable)Prefer over which
declare -p varPrint variable type and valueEssential for debugging arrays
caller NPrint call stack frame NUse in a loop for full stack trace
shellcheckStatic analysis — catches subtle bugsRun on every script before committing

✏️ Exercises

Apply what you have learned. Write each script yourself before looking at the sample solution.

Exercise 1
Add full defensive error handling to this broken script. It should: (1) use strict mode, (2) trap EXIT to remove a temp file it creates, (3) trap ERR to print the failing line number, (4) validate that exactly one argument was given and that it is a readable file, (5) check that awk is available before using it.

Broken script to fix:
awk '{sum+=$1} END{print sum}' $1 > /tmp/out.txt; echo "Total: $(cat /tmp/out.txt)"
Hint: wrap the logic in a main() function called at the end. Use ${1:?...} for argument checking, [[ -r "$1" ]] to verify readability, and command -v awk to check availability. Create the temp file with mktemp and trap its removal on EXIT.
Sample Solution
#!/usr/bin/env bash # sum_column.sh — sum the first column of a file set -euo pipefail TMPFILE="" cleanup() { [[ -n "$TMPFILE" ]] && rm -f "$TMPFILE" } trap 'cleanup' EXIT trap 'echo "[ERROR] Failed on line $LINENO" >&2' ERR main() { local input input="${1:?Usage: $0 <file>}" [[ -r "$input" ]] || { echo "Not readable: $input" >&2; exit 1; } command -v awk >/dev/null 2&1 || { echo "awk not found" >&2; exit 1; } TMPFILE=$(mktemp) awk '{sum += $1} END {print sum}' "$input" > "$TMPFILE" echo "Total: $(<"$TMPFILE")" } main "$@"
Exercise 2
Write a script called safe_deploy.sh that simulates a deployment with five steps (each represented by a function that may or may not succeed). Use a full defensive setup: strict mode, an ERR trap that logs the failing step name and line number, an EXIT trap that logs whether the deployment succeeded or failed (based on the exit code), and a rollback() function that is called on ERR to undo any completed steps. Each step should log its progress using a simple log function.
Hint: track completed steps in an array. In the rollback function, iterate the array in reverse and call an undo function for each step. The EXIT trap can read $? to determine success or failure. Simulate random step failure with (( RANDOM % 3 == 0 )).
Sample Solution
#!/usr/bin/env bash # safe_deploy.sh set -euo pipefail completed_steps=() log() { printf '[%s] %s\n' "$(date '+%H:%M:%S')" "$*"; } info() { log "INFO $*"; } error(){ log "ERROR $*" >&2; } rollback() { error "Rolling back ${#completed_steps[@]} completed step(s)..." for (( i=${#completed_steps[@]}-1; i>=0; i-- )); do error " ↩ Undoing: ${completed_steps[$i]}" sleep 0.3 done error "Rollback complete." } on_error() { error "Deployment failed at line $1" rollback } trap 'on_error $LINENO' ERR on_exit() { local rc=$? if [[ $rc -eq 0 ]]; then info "✓ Deployment SUCCEEDED" else error "✗ Deployment FAILED (exit $rc)" fi } trap 'on_exit' EXIT run_step() { local name="$1" info "Running: $name" sleep 0.5 # Simulate random failure (1-in-3 chance) (( RANDOM % 3 != 0 )) || { error "Step failed: $name"; return 1; } completed_steps+=( "$name" ) info " ✓ $name" } info "=== Deployment starting ===" run_step "1. Run database migrations" run_step "2. Upload static assets" run_step "3. Deploy application code" run_step "4. Restart application servers" run_step "5. Warm up caches" info "=== All steps complete ==="
Exercise 3
Write a script called retry.sh that wraps any command and retries it up to N times with a configurable delay between attempts. Usage: ./retry.sh --attempts 5 --delay 2 -- curl https://example.com. Log each attempt number, whether it succeeded or failed, and the exit code. Exit 0 only if the command eventually succeeds; exit 1 if all attempts fail.
Hint: parse --attempts and --delay from $@, stopping at --. Use a for loop with a C-style counter. Capture the command's exit code with cmd_rc=$? inside a subshell. Use sleep "$delay" between attempts and skip the sleep after the final attempt.
Sample Solution
#!/usr/bin/env bash # retry.sh — usage: ./retry.sh [--attempts N] [--delay S] -- COMMAND [ARGS...] set -uo pipefail # note: no -e so we can capture failing command's exit code attempts=3 delay=1 log() { printf '[retry] %s\n' "$*" >&2; } while [[ $# -gt 0 && "$1" != "--" ]]; do case "$1" in --attempts) attempts="$2"; shift 2 ;; --delay) delay="$2"; shift 2 ;; *) echo "Unknown option: $1" >&2; exit 2 ;; esac done shift # remove the '--' [[ $# -gt 0 ]] || { echo "No command given after --" >&2; exit 2; } log "Command : $*" log "Attempts : $attempts" log "Delay : ${delay}s" for (( i=1; i<=attempts; i++ )); do log "Attempt $i / $attempts..." if "$@"; then log "✓ Succeeded on attempt $i" exit 0 else local rc=$? log "✗ Failed (exit $rc)" if [[ $i -lt $attempts ]]; then log "Waiting ${delay}s before retry..." sleep "$delay" fi fi done log "All $attempts attempt(s) failed." exit 1
Exercise 4
Write a script called health_check.sh that checks a list of services and URLs. For each service (e.g. nginx, ssh), it should verify the service is running using systemctl is-active. For each URL, it should check HTTP reachability using curl -sf. Results should be logged with coloured PASS/FAIL labels. At the end, print a summary count of passes and failures. Exit 0 if all checks pass, exit 1 if any fail.
Hint: define arrays for services and URLs at the top of the script. Use a generic check() function that takes a label and a command; it runs the command, captures the exit code, and prints PASS in green or FAIL in red using printf '\033[32mPASS\033[0m'. Count failures in a variable, not with set -e.
Sample Solution
#!/usr/bin/env bash # health_check.sh set -uo pipefail # no -e — we handle each failure ourselves # ── Configure checks here ───────────────────────────────────── services=( ssh cron ) urls=( "https://example.com" "https://api.github.com" ) # ── Counters ────────────────────────────────────────────────── passes=0 failures=0 check() { local label="$1"; shift local rc if "$@" >/dev/null 2&1; then printf ' \033[32mPASS\033[0m %s\n' "$label" (( passes++ )) else printf ' \033[31mFAIL\033[0m %s\n' "$label" (( failures++ )) fi } printf '\n\033[1mHealth Check — %s\033[0m\n' "$(date '+%Y-%m-%d %H:%M:%S')" printf '%-6s %s\n' "Status" "Check" printf '%.0s─' {1..40}; echo echo "Services:" for svc in "${services[@]}"; do check "service: $svc" systemctl is-active "$svc" done echo "URLs:" for url in "${urls[@]}"; do check "url: $url" curl -sf --max-time 5 "$url" done printf '%.0s─' {1..40}; echo printf 'Summary: \033[32m%d passed\033[0m, \033[31m%d failed\033[0m\n\n' \ "$passes" "$failures" [[ $failures -eq 0 ]]