Chapter 9 — Daemon Scripts and Process Management
A well-written daemon is invisible when healthy and unambiguous when broken. This chapter covers the full lifecycle: the double-fork daemonisation pattern, robust PID files, idiomatic start/stop/status dispatch, and native integration with systemd, the init system on most mainstream Linux distributions. You will also see how to handle signals cleanly, manage child processes, and write the .service unit that makes the hand-crafted shell daemonisation unnecessary on a systemd host.
1 — Why Daemonisation Is Complicated
A daemon is a process that runs detached from any controlling terminal, in its own session, with its standard file descriptors closed or redirected to /dev/null. Getting there from a normal shell script requires several deliberate steps, each one correcting a specific way the process could remain accidentally attached to its parent environment.
| Problem | Cause | Fix |
|---|---|---|
| Daemon killed when terminal closes | Still member of the shell's session; SIGHUP is delivered | Call setsid() — first fork + setsid in child |
| Session leader can acquire a new controlling terminal | First child after setsid() is session leader | Second fork — grandchild can never be session leader |
| Working directory locks a filesystem (prevents unmount) | CWD inherited from parent | cd / in daemon |
| Inherited file descriptors leak resources or hold files open | All parent FDs carried across fork | Close or redirect stdin/stdout/stderr to /dev/null |
| Inherited umask produces unexpected file permissions | umask from shell | umask 022 (or your required value) |
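To check whether a process really is detached, it helps to look at its session ID, process group and controlling terminal. The sketch below is a hypothetical helper (the name `describe_process` is not from this chapter) that assumes Linux and reads the fields documented in proc(5) from /proc/PID/stat.

```shell
#!/usr/bin/env bash
# Print the parent, process group, session and controlling-terminal number
# of a PID. Linux-only: parses /proc/PID/stat (see proc(5)).
describe_process() {
    local pid="$1" stat rest state ppid pgrp sid tty_nr
    stat=$(< "/proc/${pid}/stat") || return 1
    # The second field (comm) may contain spaces or parentheses, so drop
    # everything up to the last ") " before splitting the remaining fields.
    rest="${stat##*) }"
    read -r state ppid pgrp sid tty_nr _ <<< "$rest"
    printf 'pid=%s ppid=%s pgrp=%s sid=%s tty_nr=%s\n' \
        "$pid" "$ppid" "$pgrp" "$sid" "$tty_nr"
}

# Example: inspect the current shell
describe_process "$$"
```

A detached daemon should report tty_nr=0 (no controlling terminal). If sid equals pid, the process is a session leader, which is the situation the second fork is meant to avoid.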
2 — The Double-Fork Pattern
    #!/usr/bin/env bash
    # lib/daemonise.sh: double-fork helpers in pure Bash
    # Usage: daemonise PIDFILE CMD [ARGS...]
    # PIDFILE should be an absolute path, because the daemon changes to /.

    daemonise() {
        local pidfile="$1"; shift

        # ── First fork ──
        # The outer subshell is backgrounded and exits at once, so the
        # shell that launched us returns to the prompt.
        (
            # First child. setsid(2) is not callable from Bash, and a
            # subshell on its own does NOT start a new session: this
            # variant only backgrounds twice. Use daemonise_setsid below
            # when the 'setsid' utility is available.

            # ── Second fork ──
            (
                # Grandchild: reparented once the first child exits.
                cd /
                umask 022

                # Redirect the standard file descriptors
                exec 0</dev/null
                exec 1>/dev/null
                exec 2>&1

                # Write the PID file before exec so it is available
                # immediately. Inside a subshell "$$" is still the PID of
                # the original shell; BASHPID is this subshell's own PID,
                # which the exec below keeps.
                printf '%d\n' "$BASHPID" > "$pidfile"

                # Replace the grandchild with the actual daemon command
                exec "$@"
            ) &
        ) &
    }

    # ── Better approach on systems with the 'setsid' utility ──
    daemonise_setsid() {
        local pidfile="$1"; shift

        # setsid --fork forks, calls setsid(2) in the child, then execs the
        # command. The command therefore leads a new session that has no
        # controlling terminal. A session leader can still acquire one by
        # opening a terminal without O_NOCTTY, so add a further fork if you
        # need the strict guarantee from the table above.
        # The PID file path is passed as $1 instead of being interpolated
        # into the script text, which avoids quoting problems.
        setsid --fork bash -c '
            pidfile="$1"; shift
            cd /
            umask 022
            exec 0</dev/null 1>/dev/null 2>&1
            printf "%d\n" "$$" > "$pidfile"
            exec "$@"
        ' daemonise "$pidfile" "$@"
    }
3 — PID Files: Correct Usage
A PID file is a single-line file containing the daemon's PID. It is the canonical way for a start script to later find, signal, and stop the daemon. PID files are simple in concept but easy to get subtly wrong: stale files, PID reuse, and two instances racing for the same file are the usual traps.
    # ── Write ──
    # Write BEFORE exec so the PID is visible as soon as the process exists.
    # Use printf, not echo, to avoid platform differences.
    # (Inside a subshell use "$BASHPID"; "$$" still names the parent shell.)
    printf '%d\n' "$$" > /run/myapp.pid

    # ── Read and validate ──
    # Never trust a PID file blindly. The process may have died and
    # a different process may now hold the same PID.
    read_pid() {
        local pidfile="$1"
        [[ -f "$pidfile" ]] || return 1
        local pid
        read -r pid < "$pidfile"
        # Validate: must be a positive integer
        [[ $pid =~ ^[0-9]+$ ]] || { echo "corrupt pidfile" >&2; return 1; }
        printf '%d' "$pid"
    }

    pid_is_alive() {
        local pid="$1"
        # kill -0 sends no signal but checks that the process exists and
        # that we have permission to signal it.
        kill -0 "$pid" 2>/dev/null
    }

    pid_is_our_daemon() {
        local pid="$1" name="$2"
        # Cross-check the process name to guard against PID reuse.
        # Linux truncates /proc/PID/comm to 15 characters.
        local comm
        comm=$(cat "/proc/${pid}/comm" 2>/dev/null)
        [[ "$comm" == "$name" || "$comm" == "${name:0:15}" ]]
    }

    # ── Cleanup on exit ──
    PIDFILE=/run/myapp.pid
    cleanup() { rm -f "$PIDFILE"; }
    trap cleanup EXIT

    # ── Locking: prevent two instances ──
    # The safest way is to hold an exclusive lock on the PID file itself.
    # Open with >> so a running instance's file is NOT truncated before we
    # know whether we own the lock. flock -n fails immediately if another
    # process holds the lock.
    exec 200>>"$PIDFILE"
    flock -n 200 || { echo "already running" >&2; exit 1; }
    printf '%d\n' "$$" > "$PIDFILE"   # we own the lock: overwrite in place
    # The lock is held for the lifetime of the process (FD 200 stays open).
    # When the process exits (cleanly or via signal), the OS releases the
    # lock, provided no child process inherited FD 200.
    # This is more robust than writing then deleting the PID file.
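The locking behaviour is easy to observe on a scratch file. The following is a minimal sketch, assuming the flock utility from util-linux (or BusyBox) is installed; the variable names and the temporary file are illustrative only and stand in for a real PID file.

```shell
#!/usr/bin/env bash
# Demonstrate flock -n semantics on a scratch file (not a real PID file).
lockfile=$(mktemp)
first="refused"

# First holder: open without truncating, then take the lock on that FD.
exec {lock_fd}>>"$lockfile"
flock -n "$lock_fd" && first="acquired"

# A second, independent open of the same file is refused while we hold it.
if flock -n "$lockfile" -c true; then second="acquired"; else second="refused"; fi

# Closing the FD releases the lock, just as process exit would.
exec {lock_fd}>&-
if flock -n "$lockfile" -c true; then third="acquired"; else third="refused"; fi

printf 'first=%s second=%s after-release=%s\n' "$first" "$second" "$third"
rm -f "$lockfile"
```

The second attempt should be refused and the third should succeed, which is the property that lets a start script detect a running instance without trusting the PID written in the file.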
4 — Start / Stop / Status Dispatch Pattern
A well-structured init script is a dispatcher: one script, one argument, clean exit codes. The LSB (Linux Standard Base) defines the exit codes that init systems and monitoring tools expect.
| Action | Exit 0 | Exit 1 | Exit 2 | Exit 3 |
|---|---|---|---|---|
| start | started (or already running) | generic error | invalid argument | - |
| stop | stopped (or already stopped) | generic error | - | - |
| status | running | dead, PID file exists | dead, lock file exists | not running |
| restart | restarted | generic error | - | - |
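A monitoring wrapper often needs to turn the status exit code back into text. The sketch below assumes the LSB meanings from the status row above; the function name `lsb_status_text` is hypothetical and is not part of any init system.

```shell
#!/usr/bin/env bash
# Translate the exit code of an LSB "status" action into text.
lsb_status_text() {
    case "$1" in
        0) echo "running" ;;
        1) echo "dead, PID file exists" ;;
        2) echo "dead, lock file exists" ;;
        3) echo "not running" ;;
        *) echo "unknown (code $1)" ;;
    esac
}

# Hypothetical usage with an init script:
#   /etc/init.d/myapp status >/dev/null; lsb_status_text "$?"
lsb_status_text 3
```

Keeping the mapping in one function means the dispatcher and any external checks agree on what each code means.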
    #!/usr/bin/env bash
    # /etc/init.d/myapp: LSB-style init script skeleton
    set -euo pipefail

    NAME=myapp
    DAEMON=/usr/local/bin/myapp
    PIDFILE=/run/${NAME}.pid
    LOGFILE=/var/log/${NAME}.log
    RUNAS=myapp                      # run as this user
    DAEMON_ARGS="--config /etc/myapp/myapp.conf"

    # ── Helper functions ──
    get_pid() {
        [[ -f "$PIDFILE" ]] || return 1
        local pid
        read -r pid < "$PIDFILE"
        [[ $pid =~ ^[0-9]+$ ]] && printf '%d' "$pid"
    }

    is_running() {
        local pid
        pid=$(get_pid) || return 1
        kill -0 "$pid" 2>/dev/null
    }

    # ── Actions ──
    do_start() {
        if is_running; then
            echo "$NAME is already running (pid $(get_pid))"
            return 0
        fi
        echo -n "Starting $NAME... "
        # Drop privileges and daemonise. $DAEMON_ARGS is left unquoted on
        # purpose so it splits into separate arguments. The redirect captures
        # start-stop-daemon's own messages; with --background the daemon's
        # stdio is typically detached from this shell.
        start-stop-daemon --start \
            --quiet \
            --background \
            --make-pidfile --pidfile "$PIDFILE" \
            --chuid "$RUNAS" \
            --exec "$DAEMON" \
            -- $DAEMON_ARGS \
            >> "$LOGFILE" 2>&1 || { echo "FAILED"; return 1; }

        # Wait up to 5 s for the PID file to appear
        local i
        for i in $(seq 1 10); do
            is_running && { echo "OK"; return 0; }
            sleep 0.5
        done
        echo "FAILED"
        return 1
    }

    do_stop() {
        if ! is_running; then
            echo "$NAME is not running"
            rm -f "$PIDFILE"             # clean up stale PID file
            return 0
        fi
        local pid; pid=$(get_pid)
        echo -n "Stopping $NAME (pid $pid)... "
        # Graceful shutdown: SIGTERM, wait, then SIGKILL if needed.
        # '|| true' keeps set -e from aborting if the process has just exited.
        kill -TERM "$pid" 2>/dev/null || true
        local i
        for i in $(seq 1 20); do
            is_running || { rm -f "$PIDFILE"; echo "OK"; return 0; }
            sleep 0.5
        done
        echo -n "(SIGKILL) "
        kill -KILL "$pid" 2>/dev/null || true
        rm -f "$PIDFILE"
        echo "OK"
    }

    do_status() {
        if is_running; then
            echo "$NAME is running (pid $(get_pid))"
            return 0                     # LSB: 0 = running
        elif [[ -f "$PIDFILE" ]]; then
            echo "$NAME is dead but PID file exists"
            return 1                     # LSB: 1 = dead, PID file present
        else
            echo "$NAME is not running"
            return 3                     # LSB: 3 = not running, no PID file
        fi
    }

    do_reload() {
        is_running || { echo "$NAME is not running"; return 1; }
        kill -HUP "$(get_pid)"
        echo "$NAME reloaded"
    }

    # ── Dispatcher ──
    case "${1:-}" in
        start)   do_start ;;
        stop)    do_stop ;;
        restart) do_stop; do_start ;;
        reload)  do_reload ;;
        status)  do_status ;;
        *)
            echo "Usage: $0 {start|stop|restart|reload|status}" >&2
            exit 2
            ;;
    esac
5 — Signal Handling in Long-Running Scripts
    #!/usr/bin/env bash
    # A daemon main loop with clean signal handling
    set -euo pipefail

    PIDFILE=/run/myapp.pid
    LOGFILE=/var/log/myapp.log
    RUNNING=1
    RELOAD=0

    log() { printf '[%s] %s\n' "$(date '+%F %T')" "$*" >> "$LOGFILE"; }

    # ── Signal handlers ──
    # Handlers only set flags or do short work; the main loop reacts.
    handle_term() {
        log "SIGTERM received, shutting down"
        RUNNING=0
    }

    handle_hup() {
        log "SIGHUP received, will reload config on next iteration"
        RELOAD=1
    }

    handle_usr1() {
        log "SIGUSR1 received, dumping stats"
        dump_stats
    }

    trap handle_term TERM INT
    trap handle_hup  HUP
    trap handle_usr1 USR1

    # ── Write PID and start main loop ──
    printf '%d\n' "$$" > "$PIDFILE"
    trap 'rm -f "$PIDFILE"' EXIT

    load_config() { log "loading config"; }
    # Bash runs a trap only after the current foreground command returns,
    # so a signal that arrives during this sleep is handled up to 5 s later.
    do_work()     { log "working..."; sleep 5; }
    dump_stats()  { log "items processed: ${PROCESSED:-0}"; }

    load_config
    log "started (pid $$)"

    while (( RUNNING )); do
        if (( RELOAD )); then
            load_config
            RELOAD=0
        fi
        do_work
    done

    log "shutdown complete"
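Bash defers trap handlers until the current foreground command finishes, so a long sleep in the main loop delays shutdown. A common workaround is to run the sleep in the background and block in the wait builtin, which returns as soon as a trapped signal arrives. The helper name `interruptible_sleep` below is an assumption for illustration, not something the script above defines.

```shell
#!/usr/bin/env bash
# Sleep that a trapped signal can cut short.
interruptible_sleep() {
    local secs="$1" pid rc
    sleep "$secs" &
    pid=$!
    wait "$pid"                  # returns >128 at once if a trapped signal arrives
    rc=$?
    kill "$pid" 2>/dev/null      # stop a leftover sleep; harmless if it already ended
    return "$rc"
}

# Example: a USR1 handler runs after about one second, not after thirty.
got_signal=0
trap 'got_signal=1' USR1
( sleep 1; kill -USR1 "$$" ) &
SECONDS=0
interruptible_sleep 30 || true
printf 'handler ran: %s, waited about %ss\n' "$got_signal" "$SECONDS"
```

Under `set -e`, call it as `interruptible_sleep N || true`, because an interrupted wait returns a non-zero status.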
Signals and their conventional meanings
| Signal | Number (Linux on x86/ARM) | Conventional use in daemons |
|---|---|---|
| SIGTERM | 15 | Graceful shutdown: finish current work, clean up, exit |
| SIGKILL | 9 | Immediate kill; cannot be caught or ignored |
| SIGHUP | 1 | Reload configuration (re-read config file without restart) |
| SIGUSR1 | 10 | User-defined: dump stats, rotate logs, toggle debug |
| SIGUSR2 | 12 | User-defined: second custom action |
| SIGINT | 2 | Interactive interrupt (Ctrl-C); often handled the same as SIGTERM |
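Because some signal numbers differ between architectures, scripts are safer using names and letting the shell translate. The sketch below relies on Bash's builtin `kill -l`, which maps a name to a number, a number to a name, and also accepts an exit status above 128; the function name `describe_exit` is hypothetical.

```shell
#!/usr/bin/env bash
# Describe how a child ended, given its exit status.
# Note: a status above 128 usually means "killed by signal", but a program
# may also return such a value on its own.
describe_exit() {
    local rc="$1"
    if (( rc > 128 )); then
        printf 'killed by SIG%s\n' "$(kill -l "$rc")"
    else
        printf 'exited with status %d\n' "$rc"
    fi
}

kill -l TERM          # name -> number
kill -l USR1          # value depends on the architecture
describe_exit 143     # 143 = 128 + 15
describe_exit 3
```

Using `kill -TERM "$pid"` in scripts, as this chapter does, avoids hard-coding any of these numbers.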
6 — Managing Child Processes
    #!/usr/bin/env bash
    # A supervisor: spawn N workers and restart them if they die
    set -uo pipefail   # no -e: we handle errors ourselves

    WORKERS=4
    RUNNING=1
    declare -A WORKER_PIDS   # slot → PID

    log() { printf '[%s] %s\n' "$(date '+%F %T')" "$*"; }

    start_worker() {
        local slot="$1"
        # Start the worker in the background; record its PID
        /usr/local/bin/worker --slot "$slot" &
        WORKER_PIDS["$slot"]="$!"
        log "slot $slot: started worker pid ${WORKER_PIDS[$slot]}"
    }

    stop_all() {
        log "stopping all workers"
        local slot
        for slot in "${!WORKER_PIDS[@]}"; do
            kill -TERM "${WORKER_PIDS[$slot]}" 2>/dev/null
        done
        wait   # wait for all children to exit
        log "all workers stopped"
    }

    # The handler only sets a flag; workers are stopped after the loop so
    # that the loop cannot restart a worker we have just terminated.
    trap 'RUNNING=0' TERM INT

    # ── Spawn initial workers ──
    for slot in $(seq 0 $(( WORKERS - 1 ))); do
        start_worker "$slot"
    done

    # ── Supervisor loop: restart dead workers ──
    while (( RUNNING )); do
        for slot in "${!WORKER_PIDS[@]}"; do
            (( RUNNING )) || break        # never restart during shutdown
            pid="${WORKER_PIDS[$slot]}"   # 'local' is only valid inside a function
            # wait -n would consume the status of whichever child exited
            # first, so probe each worker with kill -0 instead.
            if ! kill -0 "$pid" 2>/dev/null; then
                wait "$pid"; rc=$?        # collect the worker's exit status
                log "slot $slot: worker $pid exited (rc=$rc), restarting"
                start_worker "$slot"
            fi
        done
        sleep 1
    done

    stop_all
7 — Integrating with systemd
On a systemd-based distribution, you should not write a double-fork daemon at all. systemd manages process lifetimes natively and handles PID tracking, logging, dependency ordering, socket activation, and auto-restart. Write a simple foreground process and let systemd do the rest.
Type=simple — the common case
    # /etc/systemd/system/myapp.service
    [Unit]
    Description=My Application Daemon
    After=network.target
    Wants=network.target

    [Service]
    Type=simple
    User=myapp
    Group=myapp
    WorkingDirectory=/var/lib/myapp
    ExecStart=/usr/local/bin/myapp --foreground --config /etc/myapp.conf
    ExecReload=/bin/kill -HUP $MAINPID
    ExecStop=/bin/kill -TERM $MAINPID
    # PIDFile= is not needed with Type=simple: systemd tracks the main PID
    # itself. Unit files do not allow a trailing comment on a setting line,
    # so comments must sit on lines of their own.
    Restart=on-failure
    RestartSec=5s
    TimeoutStopSec=30s

    # Security hardening
    NoNewPrivileges=true
    PrivateTmp=true
    ProtectSystem=strict
    ProtectHome=true
    ReadWritePaths=/var/lib/myapp /var/log/myapp

    [Install]
    WantedBy=multi-user.target
Type=notify — sd_notify integration
With Type=notify, systemd waits for your process to send a READY=1 notification before marking it as started. This is the correct type when your daemon needs to initialise (open sockets, load caches) before accepting connections — it prevents dependent services from starting too early.
    #!/usr/bin/env bash
    # Daemon that uses sd_notify to report readiness
    # systemd-notify is part of systemd; it writes to the notification socket.

    sd_notify() {
        # NOTIFY_SOCKET is set by systemd. If it is not set, we are not
        # running under systemd, so silently do nothing and let the script
        # work standalone. Test for non-empty, not for a socket file: the
        # value may name an abstract socket that has no filesystem path.
        [[ -n "${NOTIFY_SOCKET:-}" ]] || return 0
        systemd-notify "$@"
    }

    log() { printf '%s\n' "$*"; }   # systemd captures stdout to the journal

    # ── Initialisation phase ──
    log "Starting up..."
    sd_notify "STATUS=Initialising..."

    # Simulate slow startup (load DB, open sockets, etc.)
    sleep 2
    log "Init complete"

    # Tell systemd we are ready
    sd_notify "READY=1" "STATUS=Running"

    # ── Signal handling ──
    RUNNING=1
    trap 'RUNNING=0' TERM INT

    # ── Main loop ──
    COUNT=0
    while (( RUNNING )); do
        (( COUNT++ )) || true
        sd_notify "STATUS=Processed $COUNT items"

        # Watchdog: tell systemd we are still alive
        sd_notify "WATCHDOG=1"

        sleep 10
    done

    sd_notify "STOPPING=1" "STATUS=Shutting down"
    log "Shutdown complete"
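    # Service unit for Type=notify with watchdog
    [Service]
    Type=notify
    # systemd-notify runs as a child of the script, so its messages may not
    # come from the main PID. NotifyAccess=main accepts the main PID only;
    # NotifyAccess=all accepts any process of the service.
    NotifyAccess=all
    # Treat the service as failed if no WATCHDOG=1 arrives within 30 s;
    # Restart= then decides whether it is started again.
    WatchdogSec=30s
    Restart=on-failure
    ExecStart=/usr/local/bin/myapp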
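When WatchdogSec= is set, systemd exports the timeout to the service in the WATCHDOG_USEC environment variable (in microseconds), and the sd_watchdog_enabled documentation recommends pinging at half that interval. The sketch below derives a sleep interval from it; `watchdog_interval` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print a ping interval in whole seconds: half of WATCHDOG_USEC.
# Returns 1 and prints nothing when the watchdog is not enabled.
watchdog_interval() {
    local usec="${WATCHDOG_USEC:-}"
    [[ $usec =~ ^[0-9]+$ ]] || return 1
    local secs=$(( 10#$usec / 2 / 1000000 ))   # microseconds -> half, in seconds
    (( secs >= 1 )) || secs=1                  # never sleep for 0 s
    printf '%d\n' "$secs"
}

# Example: WatchdogSec=30s arrives as 30000000 microseconds
WATCHDOG_USEC=30000000 watchdog_interval
```

A main loop could then use `sleep "$(watchdog_interval || echo 10)"`, where the fallback of 10 seconds is an arbitrary choice for standalone runs.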
Useful systemctl and journalctl commands
    # Service lifecycle
    systemctl daemon-reload            # re-read unit files after editing
    systemctl enable myapp             # start at boot
    systemctl disable myapp            # do not start at boot
    systemctl start myapp
    systemctl stop myapp
    systemctl restart myapp
    systemctl reload myapp             # runs the unit's ExecReload command
    systemctl status myapp             # state summary + recent log lines
    systemctl is-active myapp          # exits 0 if running
    systemctl is-enabled myapp         # exits 0 if enabled at boot

    # Journal (logs)
    journalctl -u myapp                # all logs for the unit
    journalctl -u myapp -f             # follow (tail -f equivalent)
    journalctl -u myapp --since "1 hour ago"
    journalctl -u myapp -n 100         # last 100 lines
    journalctl -u myapp -p err         # errors and above only

    # Inspect a unit
    systemctl cat myapp                # show the unit file
    systemctl show myapp               # all properties as key=value
    systemctl list-dependencies myapp  # dependency tree
8 — Transient Systemd Units with systemd-run
You can launch a one-off or temporary service without writing a unit file using systemd-run. This is useful for testing, for wrapping cron jobs with resource limits, and for running scripts under systemd's cgroup management without permanent installation.
    # Run a command as a transient service (foreground, output to terminal)
    systemd-run --pty --same-dir --wait --collect /path/to/script.sh

    # Run in the background under an explicit unit name
    # (omit --unit= to get an auto-generated name)
    systemd-run --unit=my-job /path/to/script.sh
    systemctl status my-job
    journalctl -u my-job -f

    # With resource limits: lowest CPU priority, at most 512 MB RAM
    systemd-run --nice=19 \
        --property=MemoryMax=512M \
        --property=CPUWeight=10 \
        /path/to/heavy_job.sh

    # Timer: run at a calendar time, like a cron entry.
    # A transient timer does not survive a reboot.
    systemd-run --on-calendar="*-*-* 02:00:00" \
        --unit=nightly-backup \
        /usr/local/bin/backup.sh
Exercises
Exercise 1 — Write a robust PID-file library
Implement a sourced Bash library lib/pidfile.sh providing these five functions:
- pidfile_acquire PIDFILE: write our PID, fail if another process holds the lock
- pidfile_release PIDFILE: remove the PID file and release the lock
- pidfile_read PIDFILE: print the PID; return 1 if the file is absent or corrupt
- pidfile_is_running PIDFILE: return 0 if the process is alive
- pidfile_stale PIDFILE: return 0 if the file exists but the process is dead
Use flock for the exclusive lock so the lock is released automatically on abnormal exit. Include a usage example that registers pidfile_release on the EXIT trap.
    #!/usr/bin/env bash
    # lib/pidfile.sh: source this file; do not execute it directly

    # FD that holds the flock. One global FD keeps the example simple;
    # real code would keep an associative array of path → FD.
    _PIDFILE_FD=""

    pidfile_acquire() {
        local pidfile="$1"

        # Open for append so an existing file is NOT truncated before we
        # hold the lock. Bash (4.1+) picks a free FD and stores its number
        # in _PIDFILE_FD.
        exec {_PIDFILE_FD}>>"$pidfile" || return 1

        # Try an exclusive, non-blocking lock
        if ! flock -n "$_PIDFILE_FD"; then
            local other_pid
            other_pid=$(pidfile_read "$pidfile" 2>/dev/null)
            printf '%s: already running (pid %s)\n' \
                "$pidfile" "${other_pid:-(unknown)}" >&2
            exec {_PIDFILE_FD}>&-
            _PIDFILE_FD=""
            return 1
        fi

        # Write our PID in place. Renaming a temp file over the PID file
        # would swap the inode and leave the lock on the old, unlinked file.
        printf '%d\n' "$$" > "$pidfile"
    }

    pidfile_release() {
        local pidfile="$1"
        rm -f "$pidfile"
        if [[ -n "$_PIDFILE_FD" ]]; then
            exec {_PIDFILE_FD}>&-   # close FD; the OS releases the flock
            _PIDFILE_FD=""
        fi
    }

    pidfile_read() {
        local pidfile="$1"
        [[ -f "$pidfile" ]] || return 1
        local pid
        read -r pid < "$pidfile"
        [[ $pid =~ ^[0-9]+$ ]] || { printf 'corrupt pidfile: %s\n' "$pidfile" >&2; return 1; }
        printf '%d' "$pid"
    }

    pidfile_is_running() {
        local pidfile="$1"
        local pid
        pid=$(pidfile_read "$pidfile") || return 1
        kill -0 "$pid" 2>/dev/null
    }

    pidfile_stale() {
        local pidfile="$1"
        [[ -f "$pidfile" ]] || return 1               # no file → not stale
        pidfile_is_running "$pidfile" && return 1     # still running → not stale
        return 0                                      # file exists, process dead → stale
    }

    # ── Usage example ──
    #
    # source lib/pidfile.sh
    # PIDFILE=/run/myapp.pid
    # pidfile_acquire "$PIDFILE" || exit 1
    # trap 'pidfile_release "$PIDFILE"' EXIT
    # # ... main logic ...
Exercise 2 — Complete init script with all LSB actions
Write a complete /etc/init.d/myworker init script for a daemon called myworker that runs as user worker. The script must implement all LSB actions: start, stop, restart, reload, force-reload, status, and try-restart (restart only if currently running). Use start-stop-daemon for start. Implement correct LSB exit codes for status. The stop action must wait for the process to exit gracefully (SIGTERM, 10 s timeout) before sending SIGKILL.
    #!/usr/bin/env bash
    ### BEGIN INIT INFO
    # Provides:          myworker
    # Required-Start:    $network $local_fs
    # Required-Stop:     $network $local_fs
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: My Worker Daemon
    ### END INIT INFO

    NAME=myworker
    DAEMON=/usr/local/bin/$NAME
    PIDFILE=/run/${NAME}.pid
    LOGFILE=/var/log/${NAME}.log
    RUNAS=worker
    ARGS="--config /etc/myworker.conf"
    STOP_TIMEOUT=10

    get_pid() {
        [[ -f "$PIDFILE" ]] || return 1
        local p; read -r p < "$PIDFILE"
        [[ $p =~ ^[0-9]+$ ]] && printf '%d' "$p"
    }

    is_running() {
        local pid; pid=$(get_pid) || return 1
        kill -0 "$pid" 2>/dev/null
    }

    do_start() {
        if is_running; then
            echo "$NAME already running (pid $(get_pid))"; return 0
        fi
        echo -n "Starting $NAME... "
        # $ARGS is unquoted on purpose so it splits into separate arguments
        start-stop-daemon --start --quiet --background \
            --make-pidfile --pidfile "$PIDFILE" \
            --chuid "$RUNAS" --exec "$DAEMON" -- $ARGS \
            >> "$LOGFILE" 2>&1 || { echo "FAILED"; return 1; }
        local i
        for i in $(seq 1 20); do
            is_running && { echo "OK"; return 0; }
            sleep 0.5
        done
        echo "FAILED (did not start)"; return 1
    }

    do_stop() {
        if ! is_running; then
            echo "$NAME is not running"; rm -f "$PIDFILE"; return 0
        fi
        local pid; pid=$(get_pid)
        echo -n "Stopping $NAME (pid $pid)... "
        kill -TERM "$pid"
        local i
        # Poll twice per second for STOP_TIMEOUT seconds
        for i in $(seq 1 $(( STOP_TIMEOUT * 2 ))); do
            is_running || { rm -f "$PIDFILE"; echo "OK"; return 0; }
            sleep 0.5
        done
        echo -n "(SIGKILL) "
        kill -KILL "$pid" 2>/dev/null
        rm -f "$PIDFILE"; echo "OK"
    }

    do_status() {
        if is_running; then
            echo "$NAME running (pid $(get_pid))"; return 0
        elif [[ -f "$PIDFILE" ]]; then
            echo "$NAME dead but PID file exists"; return 1
        else
            echo "$NAME not running"; return 3
        fi
    }

    case "${1:-}" in
        start)   do_start ;;
        stop)    do_stop ;;
        restart) do_stop; do_start ;;
        reload|force-reload)
            is_running || { echo "not running"; exit 1; }
            kill -HUP "$(get_pid)"; echo "reloaded" ;;
        status)  do_status ;;
        try-restart)
            # Restart only if running; a stopped service is not an error.
            # An explicit 'if' keeps a failed restart from being masked.
            if is_running; then do_stop && do_start; fi ;;
        *)
            printf 'Usage: %s {start|stop|restart|reload|force-reload|status|try-restart}\n' \
                "$0" >&2
            exit 2 ;;
    esac
Exercise 3 — Foreground daemon with sd_notify and watchdog
Write a script bin/poller.sh that:
- Loops every 30 seconds, reading URLs from /etc/poller/urls.conf and checking each with curl -sf
- Logs results to the journal (stdout, one line per URL)
- Sends READY=1 after completing its first poll round
- Sends WATCHDOG=1 after each round (only if $WATCHDOG_USEC is set)
- Updates STATUS= with the count of passing/failing URLs
- Handles SIGHUP to re-read the URL list without restarting
- Handles SIGTERM/SIGINT for clean shutdown
Also write the accompanying poller.service unit file with Type=notify, WatchdogSec=120s, and reasonable security hardening directives.
    #!/usr/bin/env bash
    # bin/poller.sh
    set -uo pipefail

    CONFIG="${POLLER_CONFIG:-/etc/poller/urls.conf}"
    INTERVAL=30
    RUNNING=1
    RELOAD=0
    declare -a URLS=()

    log() { printf '%s\n' "$*"; }   # systemd captures stdout to the journal

    sd_notify() {
        # Non-empty test: NOTIFY_SOCKET may name an abstract socket
        [[ -n "${NOTIFY_SOCKET:-}" ]] || return 0
        systemd-notify "$@"
    }

    watchdog_ping() {
        # WATCHDOG_USEC is set only when WatchdogSec= is configured
        if [[ -n "${WATCHDOG_USEC:-}" ]]; then
            sd_notify "WATCHDOG=1"
        fi
    }

    load_config() {
        URLS=()
        if [[ ! -f "$CONFIG" ]]; then
            log "WARN: config not found: $CONFIG"; return
        fi
        local line
        while IFS= read -r line; do
            [[ "$line" =~ ^[[:space:]]*(#|$) ]] && continue   # skip comments and blanks
            URLS+=("$line")
        done < "$CONFIG"
        log "Loaded ${#URLS[@]} URLs from $CONFIG"
    }

    poll_urls() {
        local ok=0 fail=0 url
        # ${URLS[@]+...} keeps an empty list safe under set -u on Bash < 4.4
        for url in ${URLS[@]+"${URLS[@]}"}; do
            if curl -sf --max-time 10 "$url" >/dev/null; then
                log "OK   $url"; (( ok++ )) || true
            else
                log "FAIL $url"; (( fail++ )) || true
            fi
        done
        sd_notify "STATUS=OK:$ok FAIL:$fail"
    }

    # ── Signals ──
    trap 'RUNNING=0' TERM INT
    trap 'RELOAD=1'  HUP

    # ── Startup ──
    load_config
    # Braces are required: "$INTERVALs" would name an unset variable
    log "Poller starting, ${INTERVAL}s interval"
    poll_urls                        # first poll before READY=1
    sd_notify "READY=1" "STATUS=Running"

    # ── Main loop ──
    while (( RUNNING )); do
        if (( RELOAD )); then
            load_config; RELOAD=0
        fi
        sleep "$INTERVAL"
        (( RUNNING )) || break
        poll_urls
        watchdog_ping
    done

    sd_notify "STOPPING=1"
    log "Poller stopped"
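    # poller.service
    [Unit]
    Description=URL Health Poller
    After=network-online.target
    Wants=network-online.target

    [Service]
    Type=notify
    # The script sends notifications through the systemd-notify helper,
    # which runs as a child process, so accept messages from any process
    # of the service and not only from the main PID.
    NotifyAccess=all
    User=poller
    ExecStart=/usr/local/bin/poller.sh
    ExecReload=/bin/kill -HUP $MAINPID
    Restart=on-failure
    RestartSec=10s
    WatchdogSec=120s
    TimeoutStartSec=60s
    TimeoutStopSec=15s

    # Hardening
    NoNewPrivileges=true
    PrivateTmp=true
    ProtectSystem=strict
    ProtectHome=true
    ReadOnlyPaths=/etc/poller
    # An empty value drops all capabilities. The comment must be on its own
    # line: unit files do not support trailing comments.
    CapabilityBoundingSet=

    [Install]
    WantedBy=multi-user.target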
Exercise 4 — Supervisor with BATS tests
Write a supervisor script bin/supervise.sh COMMAND [ARGS...] that:
- Starts the given command as a child process
- Restarts it automatically if it exits with non-zero status (up to 5 times total, with exponential back-off: 1s, 2s, 4s, 8s, 16s)
- Exits with status 0 if the command exits cleanly (status 0)
- Exits with status 1 after 5 failed restarts
- Propagates SIGTERM to the child and exits cleanly when it receives SIGTERM itself
Then write a BATS test file covering all four cases (clean exit, restart on failure, max-retries exhausted, SIGTERM propagation) using PATH-mocked helper scripts as the supervised command.
    #!/usr/bin/env bash
    # bin/supervise.sh
    set -uo pipefail

    (( $# >= 1 )) || { printf 'Usage: supervise.sh COMMAND [ARGS...]\n' >&2; exit 2; }

    MAX_RETRIES=5
    CHILD_PID=""
    TERMINATING=0

    log() { printf '[supervise] %s\n' "$*"; }

    handle_term() {
        TERMINATING=1
        log "SIGTERM received, stopping child $CHILD_PID"
        [[ -n "$CHILD_PID" ]] && kill -TERM "$CHILD_PID" 2>/dev/null
    }
    trap handle_term TERM INT

    attempt=0
    while (( attempt <= MAX_RETRIES )); do
        (( TERMINATING )) && { log "terminated cleanly"; exit 0; }
        (( attempt > 0 )) && log "restart attempt $attempt/$MAX_RETRIES"

        "$@" &
        CHILD_PID="$!"
        wait "$CHILD_PID"
        rc="$?"

        if (( TERMINATING )); then
            # The trap interrupted the first wait; wait again so the child
            # has really exited before the supervisor does.
            wait "$CHILD_PID" 2>/dev/null
            log "terminated cleanly"
            exit 0
        fi
        CHILD_PID=""

        (( rc == 0 )) && { log "command exited cleanly"; exit 0; }
        log "command exited with rc=$rc"
        (( attempt >= MAX_RETRIES )) && break

        # No 'local' here: it is only valid inside a function.
        wait_s=$(( 1 << attempt ))   # 1, 2, 4, 8, 16
        log "backing off ${wait_s}s"
        sleep "$wait_s"
        (( attempt++ ))
    done

    log "max retries ($MAX_RETRIES) exhausted, giving up"
    exit 1
    #!/usr/bin/env bats
    # test/integration/supervise_test.bats

    setup() {
        load '../test_helper/bats-support/load'
        load '../test_helper/bats-assert/load'
        SCRIPT="${BATS_TEST_DIRNAME}/../../bin/supervise.sh"
        MOCK_DIR="${BATS_TEST_TMPDIR}/mock"
        mkdir -p "$MOCK_DIR"
    }

    # Helper: install a no-op 'sleep' first in PATH to skip the back-off
    mock_sleep() {
        printf '#!/usr/bin/env bash\nexit 0\n' > "$MOCK_DIR/sleep"
        chmod +x "$MOCK_DIR/sleep"
        export PATH="$MOCK_DIR:$PATH"
    }

    @test "clean exit: supervisor exits 0 immediately" {
        printf '#!/usr/bin/env bash\nexit 0\n' > "$MOCK_DIR/success_cmd"
        chmod +x "$MOCK_DIR/success_cmd"

        run bash "$SCRIPT" "$MOCK_DIR/success_cmd"
        assert_success
        assert_output --partial "exited cleanly"
    }

    @test "restart on failure: command fails twice then succeeds" {
        # Fails on the first 2 attempts, succeeds on the 3rd
        printf '%s\n' \
            '#!/usr/bin/env bash' \
            "C=\"$MOCK_DIR/c.count\"" \
            'n=$(cat "$C" 2>/dev/null || echo 0)' \
            'echo $(( n+1 )) > "$C"' \
            '(( n < 2 )) && exit 1 || exit 0' \
            > "$MOCK_DIR/flaky"
        chmod +x "$MOCK_DIR/flaky"
        mock_sleep

        run bash "$SCRIPT" "$MOCK_DIR/flaky"
        assert_success
        assert_output --partial "restart attempt"
    }

    @test "max retries exhausted: exits 1" {
        printf '#!/usr/bin/env bash\nexit 1\n' > "$MOCK_DIR/always_fail"
        chmod +x "$MOCK_DIR/always_fail"
        mock_sleep

        run bash "$SCRIPT" "$MOCK_DIR/always_fail"
        assert_failure
        assert_output --partial "exhausted"
    }

    @test "SIGTERM propagates to child and supervisor exits 0" {
        # Long-running child that records whether it got SIGTERM
        printf '%s\n' \
            '#!/usr/bin/env bash' \
            "trap 'touch \"$MOCK_DIR/got_term\"; exit 0' TERM" \
            'sleep 60 &' \
            'wait' \
            > "$MOCK_DIR/long_cmd"
        chmod +x "$MOCK_DIR/long_cmd"

        # Start the supervisor in the background. Closing FD 3 stops the
        # leftover 'sleep 60' from keeping Bats waiting after the test.
        bash "$SCRIPT" "$MOCK_DIR/long_cmd" >/dev/null 3>&- &
        SUP_PID="$!"
        sleep 0.5
        kill -TERM "$SUP_PID"

        sup_rc=0
        wait "$SUP_PID" || sup_rc="$?"
        assert_equal "$sup_rc" "0"
        # assert_file_exists belongs to bats-file, which is not loaded here
        [ -f "$MOCK_DIR/got_term" ]
    }