Daemon Scripts and Process Management

Chapter 9 — Daemon Scripts and Process Management

A well-written daemon is invisible when healthy and unambiguous when broken. This chapter covers the full lifecycle: the double-fork daemonisation pattern, robust PID files, idiomatic start/stop/status dispatch, and native integration with systemd, the init system on most modern Linux distributions. You will also see how to handle signals cleanly, manage child processes, and write the .service unit that makes hand-crafted shell daemonisation unnecessary in a systemd world.

1 — Why Daemonisation Is Complicated

A daemon is a process that runs detached from any controlling terminal, with its file descriptors closed, in its own session and process group. Getting there from a normal shell script requires several deliberate steps — each one correcting a specific way the process could remain accidentally attached to its parent environment.

Problem | Cause | Fix
--- | --- | ---
Daemon killed when terminal closes | Still a member of the shell's session; SIGHUP is delivered | Call setsid(): first fork, then setsid in the child
Session leader can acquire a new controlling terminal | First child after setsid() is session leader | Second fork: grandchild can never be session leader
Working directory locks a filesystem (prevents unmount) | CWD inherited from parent | cd / in daemon
Inherited file descriptors leak resources or hold files open | All parent FDs carried across fork | Close or redirect stdin/stdout/stderr to /dev/null
Inherited umask produces unexpected file permissions | umask from shell | umask 022 (or your required value)
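
To check which of these attachments a process still has, you can read its session, process-group, and terminal fields from /proc. The following is a minimal sketch that assumes Linux with /proc mounted; the function name proc_ids is made up for illustration.

```bash
#!/usr/bin/env bash
# proc_ids PID: print the parent, process-group, session and tty fields
# of a process, read from /proc/PID/stat (Linux only).
# 'proc_ids' is an illustrative name, not a standard tool.
proc_ids() {
  local pid="$1" stat rest state ppid pgrp sid tty_nr others
  stat=$(< "/proc/${pid}/stat") || return 1
  # The second field is "(comm)" and may contain spaces or parentheses,
  # so cut everything up to the last ") " before splitting.
  rest=${stat##*') '}
  read -r state ppid pgrp sid tty_nr others <<< "$rest"
  printf 'pid=%s ppid=%s pgid=%s sid=%s tty_nr=%s\n' \
    "$pid" "$ppid" "$pgrp" "$sid" "$tty_nr"
}

proc_ids "$$"
```

A fully detached daemon should report tty_nr=0 (no controlling terminal) and a sid that differs from the session of the shell that launched it.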

2 — The Double-Fork Pattern

#!/usr/bin/env bash
# lib/daemonise.sh — portable double-fork in pure Bash
# Usage: daemonise PIDFILE CMD [ARGS...]

daemonise() {
  local pidfile="$1"; shift

  # ── First fork ──────────────────────────────────────────────────
  # The parent exits immediately; the shell that launched us
  # considers the job done and returns to the prompt.
  (
    # We are now in a subshell — first child.
    # Caveat: Bash has no builtin for setsid(2), and a subshell does
    # NOT start a new session, so this pure-Bash version stays in the
    # caller's session. Use the 'setsid' utility (see
    # daemonise_setsid below) when you need a real setsid call.

    # ── Second fork ─────────────────────────────────────────────
    (
      # Grandchild: can never become session leader.
      # Detach from terminal and reset environment.
      cd /
      umask 022

      # Redirect standard file descriptors
      exec 0< /dev/null
      exec 1> /dev/null
      exec 2>&1

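      # Write PID file before exec so it's available immediately.
      # Use $BASHPID: inside a subshell, $$ still holds the PID of the
      # original shell, not of this grandchild.
      printf '%d\n' "$BASHPID" > "$pidfile"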

      # Replace grandchild with the actual daemon command
      exec "$@"
    ) &
  ) &
}

# ── Better approach on systems with the 'setsid' utility ────────
daemonise_setsid() {
  local pidfile="$1"; shift

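  # setsid --fork forks, calls setsid(2) in the child, then execs the
  # command. The parent returns immediately, so no manual double fork
  # is needed to detach. Note that the command itself becomes the
  # session leader; it must not open a terminal device without
  # O_NOCTTY, or it could acquire a controlling terminal.
  # The PID file path is passed as an argument rather than spliced
  # into the script text, so unusual characters in it are harmless.
  setsid --fork bash -c '
    pidfile=$1; shift
    cd /
    umask 022
    exec 0</dev/null 1>/dev/null 2>&1
    printf "%d\n" "$$" > "$pidfile"
    exec "$@"
  ' daemonise "$pidfile" "$@"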
}

Note: Pure-Bash double-forking with two subshells puts the command in the background but never performs a true setsid(2) call, so the grandchild stays in the session of the shell that launched it. For production use, call setsid --fork (util-linux) or write a tiny C wrapper. On systemd systems you should skip all of this and use a Type=simple or Type=forking service unit instead.

3 — PID Files: Correct Usage

A PID file is a single-line file containing the daemon's PID. It is the conventional way for a start script to later find, signal, and stop the daemon. PID files look simple but are easy to get subtly wrong.

# ── Write ────────────────────────────────────────────────────────
# Write BEFORE exec so the PID is visible as soon as the process exists.
# Use printf, not echo, to avoid platform differences.
printf '%d\n' "$$" > /run/myapp.pid

# ── Read and validate ────────────────────────────────────────────
# Never trust a PID file blindly. The process may have died and
# a different process may now hold the same PID.

read_pid() {
  local pidfile="$1"
  [[ -f "$pidfile" ]] || return 1
  local pid
  read -r pid < "$pidfile"
  # Validate: must be a positive integer. PID 0 is rejected because
  # "kill -0 0" targets our own process group and always succeeds.
  [[ $pid =~ ^[1-9][0-9]*$ ]] || { echo "corrupt pidfile" >&2; return 1; }
  printf '%d' "$pid"
}

pid_is_alive() {
  local pid="$1"
  # kill -0 sends no signal but checks if the process exists and
  # we have permission to signal it.
  kill -0 "$pid" 2>/dev/null
}

pid_is_our_daemon() {
  local pid="$1" name="$2"
  # Cross-check the process name to guard against PID reuse
  local comm
  comm=$(cat "/proc/${pid}/comm" 2>/dev/null)
  [[ "$comm" == "$name" || "$comm" == "${name:0:15}" ]]
  # Linux truncates comm to 15 chars in /proc/PID/comm
}

# ── Cleanup on exit ──────────────────────────────────────────────
PIDFILE=/run/myapp.pid

cleanup() {
  rm -f "$PIDFILE"
}
trap cleanup EXIT

# ── Locking: prevent two instances ───────────────────────────────
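# The safest way is to hold an exclusive lock on the PID file itself.
# flock -n fails immediately if another process holds the lock.
# Open in append mode (>>) so that a losing contender does not
# truncate the PID file of the instance that is already running.
# Install the EXIT cleanup trap only after the lock is held, or the
# loser would delete the winner's PID file on its way out.
exec 200>>"$PIDFILE"
flock -n 200 || { echo "already running" >&2; exit 1; }
: > "$PIDFILE"               # we own the lock: discard any stale content
printf '%d\n' "$$" >&200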
# The lock is held for the lifetime of the process (FD 200 stays open).
# When the process exits (cleanly or via signal), the OS releases the lock.
# This is more robust than writing then deleting the PID file.
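
The locking behaviour described above can be observed in isolation. This is a small demonstration, assuming the flock utility from util-linux is installed; the temp file and FD numbers 8 and 9 are arbitrary choices for the sketch.

```bash
#!/usr/bin/env bash
# Demonstration only: the lock file is a throwaway temp file.
lockfile=$(mktemp)

# First holder: open the file on FD 9 and take the lock.
exec 9>>"$lockfile"
if flock -n 9; then first=acquired; else first=refused; fi

# A contender that opens the file itself gets a separate open file
# description, so its lock attempt conflicts with ours.
second=$( exec 8>>"$lockfile"; flock -n 8 && echo acquired || echo refused )

# Closing FD 9 drops the lock; the next attempt succeeds.
exec 9>&-
third=$( exec 8>>"$lockfile"; flock -n 8 && echo acquired || echo refused )

printf 'first=%s second=%s third=%s\n' "$first" "$second" "$third"
rm -f "$lockfile"
```

The second attempt is refused while FD 9 is open and the third succeeds once it is closed, which is the same mechanism that frees the lock when a daemon dies unexpectedly.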

4 — Start / Stop / Status Dispatch Pattern

A well-structured init script is a dispatcher: one script, one argument, clean exit codes. The LSB (Linux Standard Base) defines the exit codes that init systems and monitoring tools expect.

Action | Exit 0 | Exit 1 | Exit 2 | Exit 3
--- | --- | --- | --- | ---
start | started (or already running) | generic error | invalid argument | -
stop | stopped (or already stopped) | generic error | - | -
status | running | dead, PID file exists | dead, lock file exists | not running
restart | restarted | generic error | - | -

#!/usr/bin/env bash
# /etc/init.d/myapp — LSB-compliant init script skeleton
set -euo pipefail

NAME=myapp
DAEMON=/usr/local/bin/myapp
PIDFILE=/run/${NAME}.pid
LOGFILE=/var/log/${NAME}.log
RUNAS=myapp            # run as this user
DAEMON_ARGS="--config /etc/myapp/myapp.conf"

# ── Helper functions ─────────────────────────────────────────────

get_pid() {
  [[ -f "$PIDFILE" ]] || return 1
  local pid
  read -r pid < "$PIDFILE"
  [[ $pid =~ ^[0-9]+$ ]] && printf '%d' "$pid"
}

is_running() {
  local pid
  pid=$(get_pid) || return 1
  kill -0 "$pid" 2>/dev/null
}

# ── Actions ──────────────────────────────────────────────────────

do_start() {
  if is_running; then
    echo "$NAME is already running (pid $(get_pid))"
    return 0
  fi

  echo -n "Starting $NAME... "

  # Drop privileges and daemonise
  start-stop-daemon --start \
    --quiet \
    --background \
    --make-pidfile --pidfile "$PIDFILE" \
    --chuid "$RUNAS" \
    --exec "$DAEMON" \
    -- $DAEMON_ARGS \
    >> "$LOGFILE" 2>&1

  # Wait up to 5 s for the PID file to appear
  local i
  for i in $(seq 1 10); do
    is_running && { echo "OK"; return 0; }
    sleep 0.5
  done

  echo "FAILED"
  return 1
}

do_stop() {
  if ! is_running; then
    echo "$NAME is not running"
    rm -f "$PIDFILE"   # clean up stale PID file
    return 0
  fi

  local pid; pid=$(get_pid)
  echo -n "Stopping $NAME (pid $pid)... "

  # Graceful shutdown: SIGTERM, wait, then SIGKILL if needed
  kill -TERM "$pid"

  local i
  for i in $(seq 1 20); do
    is_running || { rm -f "$PIDFILE"; echo "OK"; return 0; }
    sleep 0.5
  done

  echo -n "(SIGKILL) "
  kill -KILL "$pid" 2>/dev/null || true   # may already be gone; don't trip set -e
  rm -f "$PIDFILE"
  echo "OK"
}

do_status() {
  if is_running; then
    echo "$NAME is running (pid $(get_pid))"
    return 0     # LSB: 0 = running
  elif [[ -f "$PIDFILE" ]]; then
    echo "$NAME is dead but PID file exists"
    return 1     # LSB: 1 = dead, PID file present
  else
    echo "$NAME is not running"
    return 3     # LSB: 3 = not running, no PID file
  fi
}

do_reload() {
  is_running || { echo "$NAME is not running"; return 1; }
  kill -HUP "$(get_pid)"
  echo "$NAME reloaded"
}

# ── Dispatcher ───────────────────────────────────────────────────
case "${1:-}" in
  start)   do_start  ;;
  stop)    do_stop   ;;
  restart) do_stop; do_start ;;
  reload)  do_reload ;;
  status)  do_status ;;
  *)
    echo "Usage: $0 {start|stop|restart|reload|status}" >&2
    exit 2
  ;;
esac
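
Monitoring tools typically branch on the status exit code rather than parse the message text. As a rough sketch, a caller could translate the LSB status codes like this; describe_status is a hypothetical helper, not part of any init system.

```bash
#!/usr/bin/env bash
# describe_status CODE: translate an LSB 'status' exit code into text.
# 'describe_status' is an illustrative helper, not part of any init system.
describe_status() {
  case "$1" in
    0) echo "running" ;;
    1) echo "dead, PID file exists" ;;
    2) echo "dead, lock file exists" ;;
    3) echo "not running" ;;
    *) echo "unknown (code $1)" ;;
  esac
}

# Hypothetical use against the init script shown above:
#   /etc/init.d/myapp status >/dev/null 2>&1; describe_status "$?"
describe_status 3
```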

5 — Signal Handling in Long-Running Scripts

#!/usr/bin/env bash
# A daemon main loop with clean signal handling
set -euo pipefail

PIDFILE=/run/myapp.pid
LOGFILE=/var/log/myapp.log
RUNNING=1
RELOAD=0

log() { printf '[%s] %s\n' "$(date '+%F %T')" "$*" >> "$LOGFILE"; }

# ── Signal handlers ──────────────────────────────────────────────
handle_term() {
  log "SIGTERM received — shutting down"
  RUNNING=0
}

handle_hup() {
  log "SIGHUP received — will reload config on next iteration"
  RELOAD=1
}

handle_usr1() {
  log "SIGUSR1 received — dumping stats"
  dump_stats
}

trap handle_term TERM INT
trap handle_hup  HUP
trap handle_usr1 USR1

# ── Write PID and start main loop ────────────────────────────────
printf '%d\n' "$$" > "$PIDFILE"
trap 'rm -f "$PIDFILE"' EXIT

load_config() { log "loading config"; }
do_work()     { log "working..."; sleep 5; }
dump_stats()  { log "items processed: ${PROCESSED:-0}"; }

load_config
log "started (pid $$)"

while (( RUNNING )); do
  if (( RELOAD )); then
    load_config
    RELOAD=0
  fi
  do_work
done

log "shutdown complete"
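
Bash runs a trap handler only after the current foreground command has finished, so in the loop above a SIGTERM that arrives during sleep 5 is acted on up to five seconds late. One common workaround, sketched here under the assumption that the work step is mostly waiting, is to sleep in the background and block in the wait builtin instead. The name interruptible_sleep is illustrative.

```bash
#!/usr/bin/env bash
# interruptible_sleep SECONDS: like sleep, but returns as soon as a
# trapped signal arrives. The function name is illustrative.
interruptible_sleep() {
  local rc=0 sleep_pid
  sleep "$1" &
  sleep_pid=$!
  # wait is a builtin, so a trapped signal interrupts it immediately
  # (status > 128) and the trap handler runs right away.
  wait "$sleep_pid" || rc=$?
  # If we were interrupted, the sleep is still running: stop and reap it.
  if kill -0 "$sleep_pid" 2>/dev/null; then
    kill "$sleep_pid" 2>/dev/null
    wait "$sleep_pid" 2>/dev/null || true
  fi
  return "$rc"
}

# In the loop above, do_work could then read:
#   do_work() { log "working..."; interruptible_sleep 5 || true; }
```

The trailing "|| true" matters in scripts that use set -e, because an interrupted wait returns a non-zero status.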

Signals and their conventional meanings

Signal | Number | Conventional use in daemons
--- | --- | ---
SIGTERM | 15 | Graceful shutdown: finish current work, clean up, exit
SIGKILL | 9 | Immediate kill: cannot be caught or ignored
SIGHUP | 1 | Reload configuration (re-read config file without restart)
SIGUSR1 | 10 | User-defined: dump stats, rotate logs, toggle debug
SIGUSR2 | 12 | User-defined: second custom action
SIGINT | 2 | Interactive interrupt (Ctrl-C), often handled like SIGTERM
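
Signal numbers for SIGUSR1 and SIGUSR2 differ between some architectures, so scripts are safer using names. If you need the number on the machine at hand, the Bash builtin kill -l converts between the two; signal_number below is just an illustrative wrapper.

```bash
#!/usr/bin/env bash
# signal_number NAME: print the number the running system assigns to a signal.
# 'signal_number' is an illustrative wrapper around the Bash builtin 'kill -l'.
signal_number() {
  kill -l "$1"
}

for sig in TERM KILL HUP INT USR1 USR2; do
  printf 'SIG%-5s %s\n' "$sig" "$(signal_number "$sig")"
done
```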

6 — Managing Child Processes

#!/usr/bin/env bash
# A supervisor: spawn N workers and restart them if they die
set -uo pipefail   # no -e: we handle errors ourselves

WORKERS=4
RUNNING=1
declare -A WORKER_PIDS   # slot → PID

log() { printf '[%s] %s\n' "$(date '+%F %T')" "$*"; }

start_worker() {
  local slot="$1"
  # Start the worker in background; record its PID
  /usr/local/bin/worker --slot "$slot" &
  WORKER_PIDS["$slot"]="$!"
  log "slot $slot: started worker pid ${WORKER_PIDS[$slot]}"
}

stop_all() {
  log "stopping all workers"
  local slot
  for slot in "${!WORKER_PIDS[@]}"; do
    kill -TERM "${WORKER_PIDS[$slot]}" 2>/dev/null
  done
  wait   # wait for all children to exit
  log "all workers stopped"
}

trap 'RUNNING=0; stop_all' TERM INT

# ── Spawn initial workers ────────────────────────────────────────
for slot in $(seq 0 $(( WORKERS - 1 ))); do
  start_worker "$slot"
done

# ── Supervisor loop: restart dead workers ────────────────────────
while (( RUNNING )); do
  for slot in "${!WORKER_PIDS[@]}"; do
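    (( RUNNING )) || break        # shutdown requested mid-scan: restart nothing
    pid="${WORKER_PIDS[$slot]}"   # no 'local' here: we are not inside a function
    # kill -0 only probes for existence; the exit status is collected
    # by the explicit wait below.
    if ! kill -0 "$pid" 2>/dev/null; then
      wait "$pid"    # fetch the exit status Bash recorded for the child
      rc=$?
      log "slot $slot: worker $pid exited (rc=$rc), restarting"
      start_worker "$slot"
    fi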
  done
  sleep 1
done
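
An alternative to polling with kill -0 is wait -n, which blocks until any one child exits. The sketch below assumes Bash 4.3 or newer and uses two subshells as stand-in workers; it shows only the waiting primitive, not a full supervisor.

```bash
#!/usr/bin/env bash
# Two stand-in "workers": one fails quickly, one runs longer.
( sleep 0.2; exit 3 ) &
( sleep 1;   exit 0 ) &

# wait -n blocks until any one child exits and returns its status.
wait -n
first_rc=$?
printf 'first worker to exit returned %d\n' "$first_rc"

# Collect the remaining child so nothing is left behind.
wait
```

Because wait -n does not say which child exited, a supervisor built on it still has to work out which slot to restart, for example by probing each recorded PID afterwards.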

7 — Integrating with systemd

On modern Linux, you should not write a double-fork daemon at all. systemd manages process lifetimes natively and handles PID tracking, logging, dependency ordering, socket activation, and auto-restart. Write a simple foreground process and let systemd do the rest.

Type=simple — the common case

# /etc/systemd/system/myapp.service
[Unit]
Description=My Application Daemon
After=network.target
Wants=network.target

[Service]
Type=simple
User=myapp
Group=myapp
WorkingDirectory=/var/lib/myapp
ExecStart=/usr/local/bin/myapp --foreground --config /etc/myapp.conf
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -TERM $MAINPID
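# PIDFile= is only needed for Type=forking; with Type=simple systemd
# already knows the main PID. (Unit files have no trailing comments:
# a "#" after a value would become part of the value.)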
Restart=on-failure
RestartSec=5s
TimeoutStopSec=30s

# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/myapp /var/log/myapp

[Install]
WantedBy=multi-user.target
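
One pitfall when annotating unit files: systemd only treats a line as a comment when it starts with "#" or ";", so text after a value is read as part of the value. A rough grep-based check, offered as a heuristic sketch with a made-up function name, can flag such lines before you install the unit.

```bash
#!/usr/bin/env bash
# find_trailing_comments FILE: list "Key=value   # text" lines.
# Heuristic only: a "#" that legitimately belongs to a value (for
# example inside a quoted ExecStart= argument) is reported as well.
# The function name is made up for this sketch.
find_trailing_comments() {
  grep -nE '^[[:space:]]*[A-Za-z][A-Za-z0-9]*=.*[[:space:]]#' "$1"
}

# Example against a throwaway unit fragment.
unit=$(mktemp)
printf '%s\n' '[Service]' '# a proper comment line' \
  'Restart=on-failure' 'RestartSec=5s   # wait before restarting' > "$unit"
find_trailing_comments "$unit"
rm -f "$unit"
```

For a more thorough check of an installed unit, systemd-analyze verify is the tool to reach for.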

Type=notify — sd_notify integration

With Type=notify, systemd waits for your process to send a READY=1 notification before marking it as started. This is the correct type when your daemon needs to initialise (open sockets, load caches) before accepting connections — it prevents dependent services from starting too early.

#!/usr/bin/env bash
# Daemon that uses sd_notify to report readiness
# systemd-notify is part of systemd; it writes to the notification socket.

sd_notify() {
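  # NOTIFY_SOCKET is set by systemd. If not set, we're not under systemd,
  # so silently do nothing and the script works standalone too.
  # Test for non-empty rather than -S: the value may name an abstract
  # socket ("@..."), which has no filesystem entry.
  [[ -n "${NOTIFY_SOCKET:-}" ]] || return 0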
  systemd-notify "$@"
}

log() { printf '%s\n' "$*"; }   # systemd captures stdout to the journal

# ── Initialisation phase ─────────────────────────────────────────
log "Starting up..."
sd_notify "STATUS=Initialising..."

# Simulate slow startup (load DB, open sockets, etc.)
sleep 2
log "Init complete"

# Tell systemd we are ready
sd_notify "READY=1" "STATUS=Running"

# ── Signal handling ──────────────────────────────────────────────
RUNNING=1
trap 'RUNNING=0' TERM INT

# ── Main loop ────────────────────────────────────────────────────
COUNT=0
while (( RUNNING )); do
  (( COUNT++ )) || true
  sd_notify "STATUS=Processed $COUNT items"
  # Watchdog: tell systemd we are still alive
  sd_notify "WATCHDOG=1"
  sleep 10
done

sd_notify "STOPPING=1" "STATUS=Shutting down"
log "Shutdown complete"
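
# Service unit for Type=notify with watchdog.
# Comments must sit on their own lines; systemd does not support
# trailing comments after a value.
[Service]
Type=notify
# systemd-notify runs as a child of the script, not as the main PID,
# so with NotifyAccess=main its messages may be rejected.
NotifyAccess=all
# Treat the service as failed if no WATCHDOG=1 arrives within 30s
# (it is restarted only when a Restart= setting allows it).
WatchdogSec=30s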
ExecStart=/usr/local/bin/myapp
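
When WatchdogSec= is set, systemd exports the timeout to the service as WATCHDOG_USEC, in microseconds, and the usual advice is to ping at about half that interval. A minimal sketch of deriving the ping period in a script follows; watchdog_interval is an illustrative name.

```bash
#!/usr/bin/env bash
# watchdog_interval: print how many whole seconds to wait between
# WATCHDOG=1 pings, i.e. half of the timeout systemd passes in
# WATCHDOG_USEC (microseconds). Prints nothing and returns 1 when the
# watchdog is not enabled. The function name is illustrative.
watchdog_interval() {
  local usec="${WATCHDOG_USEC:-}"
  [[ $usec =~ ^[1-9][0-9]*$ ]] || return 1
  local secs=$(( usec / 2 / 1000000 ))
  (( secs >= 1 )) || secs=1
  printf '%d\n' "$secs"
}

# With WatchdogSec=30s systemd exports WATCHDOG_USEC=30000000.
WATCHDOG_USEC=30000000 watchdog_interval
```

A main loop could use the result as its sleep duration so that the ping rate follows the unit file instead of a hard-coded value.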

Useful systemctl and journalctl commands

# Service lifecycle
systemctl daemon-reload              # re-read unit files after editing
systemctl enable  myapp              # start at boot
systemctl disable myapp              # do not start at boot
systemctl start   myapp
systemctl stop    myapp
systemctl restart myapp
systemctl reload  myapp              # runs the unit's ExecReload= command
systemctl status  myapp              # state summary + recent log lines
systemctl is-active  myapp           # exits 0 if running
systemctl is-enabled myapp           # exits 0 if enabled at boot

# Journal (logs)
journalctl -u myapp                  # all logs for the unit
journalctl -u myapp -f              # follow (tail -f equivalent)
journalctl -u myapp --since "1 hour ago"
journalctl -u myapp -n 100          # last 100 lines
journalctl -u myapp -p err           # errors and above only

# Inspect a unit
systemctl cat    myapp               # show the unit file
systemctl show   myapp               # all properties as key=value
systemctl list-dependencies myapp    # dependency tree

8 — Transient Systemd Units with systemd-run

You can launch a one-off or temporary service without writing a unit file using systemd-run. This is useful for testing, for wrapping cron jobs with resource limits, and for running scripts under systemd's cgroup management without permanent installation.

# Run a command as a transient service (foreground, output to terminal)
systemd-run --pty --same-dir --wait --collect /path/to/script.sh

# Run in the background as a transient unit named my-job
systemd-run --unit=my-job /path/to/script.sh
systemctl status my-job
journalctl -u my-job -f

# With resource limits: lowest CPU priority (nice 19), max 512 MB RAM
systemd-run --nice=19 \
  --property=MemoryMax=512M \
  --property=CPUWeight=10 \
  /path/to/heavy_job.sh

# Transient timer: run daily at 02:00 (cron-like, but lost on reboot)
systemd-run --on-calendar="*-*-* 02:00:00" \
  --unit=nightly-backup \
  /usr/local/bin/backup.sh

Exercises

Exercise 1 — Write a robust PID-file library

Implement a sourced Bash library lib/pidfile.sh providing these five functions:

  • pidfile_acquire PIDFILE — write our PID, fail if another process holds a lock
  • pidfile_release PIDFILE — remove the PID file and release the lock
  • pidfile_read PIDFILE — print the PID; return 1 if file absent or corrupt
  • pidfile_is_running PIDFILE — return 0 if the process is alive
  • pidfile_stale PIDFILE — return 0 if the file exists but the process is dead

Use flock for the exclusive lock so the lock is released automatically on abnormal exit. Include a usage example that registers pidfile_release on the EXIT trap.

#!/usr/bin/env bash
# lib/pidfile.sh — source this file; do not execute it directly

# Internal: FD used for flock (one per pidfile path via name mangling)
_pidfile_fd_for() {
  # Map a path to a stable FD number (200–254 range)
  # For simplicity, use a single global FD; real code would use an assoc array.
  printf '200'
}

pidfile_acquire() {
  local pidfile="$1"
  local fd; fd=$(_pidfile_fd_for "$pidfile")

  # Open the file on the chosen FD in append mode: this creates it if
  # absent but does not truncate another instance's PID.
  # printf %q quotes the path safely for eval.
  eval "exec ${fd}>>$(printf '%q' "$pidfile")"

  # Try an exclusive, non-blocking lock
  if ! flock -n "$fd"; then
    local other_pid
    other_pid=$(pidfile_read "$pidfile" 2>/dev/null)
    printf '%s: already running (pid %s)\n' \
      "$pidfile" "${other_pid:-(unknown)}" >&2
    return 1
  fi

  # Write our PID through the locked FD. Replacing the file with mv
  # would swap in a new inode that nobody holds a lock on.
  : > "$pidfile"
  printf '%d\n' "$$" >&"$fd"
}

pidfile_release() {
  local pidfile="$1"
  local fd; fd=$(_pidfile_fd_for "$pidfile")
  rm -f "$pidfile"
  eval "exec ${fd}>&-"   # close FD — OS releases the flock
}

pidfile_read() {
  local pidfile="$1"
  [[ -f "$pidfile" ]] || return 1
  local pid
  read -r pid < "$pidfile"
  # Reject 0 as well: "kill -0 0" probes our own process group.
  [[ $pid =~ ^[1-9][0-9]*$ ]] || { printf 'corrupt pidfile: %s\n' "$pidfile" >&2; return 1; }
  printf '%d' "$pid"
}

pidfile_is_running() {
  local pidfile="$1"
  local pid
  pid=$(pidfile_read "$pidfile") || return 1
  kill -0 "$pid" 2>/dev/null
}

pidfile_stale() {
  local pidfile="$1"
  [[ -f "$pidfile" ]] || return 1   # no file → not stale
  pidfile_is_running "$pidfile" && return 1  # still running → not stale
  return 0   # file exists, process dead → stale
}

# ── Usage example ────────────────────────────────────────────────
#
#   source lib/pidfile.sh
#   PIDFILE=/run/myapp.pid
#   pidfile_acquire "$PIDFILE" || exit 1
#   trap 'pidfile_release "$PIDFILE"' EXIT
#   # ... main logic ...

Exercise 2 — Complete init script with all LSB actions

Write a complete /etc/init.d/myworker init script for a daemon called myworker that runs as user worker. The script must implement all LSB actions: start, stop, restart, reload, force-reload, status, and try-restart (restart only if currently running). Use start-stop-daemon for start. Implement correct LSB exit codes for status. The stop action must wait for the process to exit gracefully (SIGTERM, 10 s timeout) before sending SIGKILL.

#!/usr/bin/env bash
### BEGIN INIT INFO
# Provides:          myworker
# Required-Start:    $network $local_fs
# Required-Stop:     $network $local_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: My Worker Daemon
### END INIT INFO

NAME=myworker
DAEMON=/usr/local/bin/$NAME
PIDFILE=/run/${NAME}.pid
LOGFILE=/var/log/${NAME}.log
RUNAS=worker
ARGS="--config /etc/myworker.conf"
STOP_TIMEOUT=10

get_pid() {
  [[ -f "$PIDFILE" ]] || return 1
  local p; read -r p < "$PIDFILE"
  [[ $p =~ ^[0-9]+$ ]] && printf '%d' "$p"
}

is_running() {
  local pid; pid=$(get_pid) || return 1
  kill -0 "$pid" 2>/dev/null
}

do_start() {
  if is_running; then
    echo "$NAME already running (pid $(get_pid))"; return 0
  fi
  echo -n "Starting $NAME... "
  start-stop-daemon --start --quiet --background \
    --make-pidfile --pidfile "$PIDFILE" \
    --chuid "$RUNAS" --exec "$DAEMON" -- $ARGS \
    >> "$LOGFILE" 2>&1 || { echo "FAILED"; return 1; }
  local i; for i in $(seq 1 20); do
    is_running && { echo "OK"; return 0; }; sleep 0.5
  done
  echo "FAILED (did not start)"; return 1
}

do_stop() {
  if ! is_running; then
    echo "$NAME is not running"; rm -f "$PIDFILE"; return 0
  fi
  local pid; pid=$(get_pid)
  echo -n "Stopping $NAME (pid $pid)... "
  kill -TERM "$pid"
  local i
  for i in $(seq 1 $(( STOP_TIMEOUT * 2 ))); do
    is_running || { rm -f "$PIDFILE"; echo "OK"; return 0; }
    sleep 0.5
  done
  echo -n "(SIGKILL) "
  kill -KILL "$pid" 2>/dev/null
  rm -f "$PIDFILE"; echo "OK"
}

do_status() {
  if   is_running;      then echo "$NAME running (pid $(get_pid))";          return 0
  elif [[ -f "$PIDFILE" ]]; then echo "$NAME dead but PID file exists"; return 1
  else                        echo "$NAME not running";                    return 3
  fi
}

case "${1:-}" in
  start)        do_start ;;
  stop)         do_stop ;;
  restart)      do_stop; do_start ;;
  reload|force-reload)
                is_running || { echo "not running"; exit 1; }
                kill -HUP "$(get_pid)"; echo "reloaded" ;;
  status)       do_status ;;
  try-restart)  if is_running; then do_stop && do_start; fi ;;
  *)
    printf 'Usage: %s {start|stop|restart|reload|force-reload|status|try-restart}\n' \
      "$0" >&2; exit 2 ;;
esac

Exercise 3 — Foreground daemon with sd_notify and watchdog

Write a script bin/poller.sh that:

  • Loops every 30 seconds, reading URLs from /etc/poller/urls.conf and checking each with curl -sf
  • Logs results to the journal (stdout, one line per URL)
  • Sends READY=1 after completing its first poll round
  • Sends WATCHDOG=1 after each round (only if $WATCHDOG_USEC is set)
  • Updates STATUS= with the count of passing/failing URLs
  • Handles SIGHUP to re-read the URL list without restarting
  • Handles SIGTERM/SIGINT for clean shutdown

Also write the accompanying poller.service unit file with Type=notify, WatchdogSec=120s, and reasonable security hardening directives.

#!/usr/bin/env bash
# bin/poller.sh
set -uo pipefail

CONFIG="${POLLER_CONFIG:-/etc/poller/urls.conf}"
INTERVAL=30
RUNNING=1
RELOAD=0
declare -a URLS=()

log() { printf '%s\n' "$*"; }   # systemd captures stdout to journal

sd_notify() {
  # Non-empty test instead of -S: the socket may be abstract ("@...").
  [[ -n "${NOTIFY_SOCKET:-}" ]] || return 0
  systemd-notify "$@"
}

watchdog_ping() {
  [[ -n "${WATCHDOG_USEC:-}" ]] && sd_notify "WATCHDOG=1"
}

load_config() {
  URLS=()
  if [[ ! -f "$CONFIG" ]]; then
    log "WARN: config not found: $CONFIG"; return
  fi
  while IFS= read -r line; do
    [[ "$line" =~ ^[[:space:]]*(#|$) ]] && continue
    URLS+=("$line")
  done < "$CONFIG"
  log "Loaded ${#URLS[@]} URLs from $CONFIG"
}

poll_urls() {
  local ok=0 fail=0
  local url
  for url in "${URLS[@]}"; do
    if curl -sf --max-time 10 "$url" >/dev/null; then
      log "OK  $url"; (( ok++ )) || true
    else
      log "FAIL $url"; (( fail++ )) || true
    fi
  done
  sd_notify "STATUS=OK:$ok FAIL:$fail"
}

# ── Signals ──────────────────────────────────────────────────────
trap 'RUNNING=0'  TERM INT
trap 'RELOAD=1'   HUP

# ── Startup ──────────────────────────────────────────────────────
load_config
log "Poller starting (${INTERVAL}s interval)"   # braces needed: $INTERVALs is a different, unset variable
poll_urls         # first poll before READY=1
sd_notify "READY=1" "STATUS=Running"

# ── Main loop ────────────────────────────────────────────────────
while (( RUNNING )); do
  if (( RELOAD )); then
    load_config; RELOAD=0
  fi
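  # Sleep in the background and wait on it: unlike a foreground sleep,
  # wait returns as soon as a trapped signal (HUP, TERM) arrives, so
  # shutdown does not have to outlast the 30 s interval.
  sleep "$INTERVAL" & wait "$!"
  kill "$!" 2>/dev/null   # stop the leftover sleep if we were interrupted
  (( RELOAD )) && continue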
  (( RUNNING )) || break
  poll_urls
  watchdog_ping
done

sd_notify "STOPPING=1"
log "Poller stopped"

# poller.service
[Unit]
Description=URL Health Poller
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
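# systemd-notify runs as a child of the script, not as the main PID,
# so it needs NotifyAccess=all for its messages to be accepted.
NotifyAccess=all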
User=poller
ExecStart=/usr/local/bin/poller.sh
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=10s
WatchdogSec=120s
TimeoutStartSec=60s
TimeoutStopSec=15s

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadOnlyPaths=/etc/poller
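# An empty value drops all capabilities. The comment must be on its
# own line: systemd would read a trailing "#..." as part of the value.
CapabilityBoundingSet=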

[Install]
WantedBy=multi-user.target
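
The comment-skipping logic in load_config is easy to try out on its own. The sketch below reimplements it as a standalone function under an invented name, read_url_list, and feeds it a throwaway file; the example URLs are placeholders.

```bash
#!/usr/bin/env bash
# read_url_list FILE: print the usable entries of a list file, skipping
# blank lines and lines whose first non-blank character is '#'.
# The function name is made up for this sketch.
read_url_list() {
  local line
  # '|| [[ -n $line ]]' keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    [[ $line =~ ^[[:space:]]*(#|$) ]] && continue
    printf '%s\n' "$line"
  done < "$1"
}

# Load the result into an array, as load_config does with URLS.
conf=$(mktemp)
printf '%s\n' '# health checks' '' 'https://example.com/' \
  '   # indented comment' 'https://example.org/status' > "$conf"
mapfile -t urls < <(read_url_list "$conf")
printf 'loaded %d URLs\n' "${#urls[@]}"
rm -f "$conf"
```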

Exercise 4 — Supervisor with BATS tests

Write a supervisor script bin/supervise.sh COMMAND [ARGS...] that:

  • Starts the given command as a child process
  • Restarts it automatically if it exits with non-zero status (up to 5 times total, with exponential back-off: 1s, 2s, 4s, 8s, 16s)
  • Exits with status 0 if the command exits cleanly (status 0)
  • Exits with status 1 after 5 failed restarts
  • Propagates SIGTERM to the child and exits cleanly when it receives SIGTERM itself

Then write a BATS test file covering all four cases (clean exit, restart on failure, max-retries exhausted, SIGTERM propagation) using PATH-mocked helper scripts as the supervised command.

#!/usr/bin/env bash
# bin/supervise.sh
set -uo pipefail

(( $# >= 1 )) || { printf 'Usage: supervise.sh COMMAND [ARGS...]\n' >&2; exit 2; }

MAX_RETRIES=5
CHILD_PID=""
TERMINATING=0

log() { printf '[supervise] %s\n' "$*"; }

handle_term() {
  TERMINATING=1
  log "SIGTERM received — stopping child $CHILD_PID"
  [[ -n "$CHILD_PID" ]] && kill -TERM "$CHILD_PID" 2>/dev/null
}
trap handle_term TERM INT

attempt=0
while (( attempt <= MAX_RETRIES )); do
  (( TERMINATING )) && { log "terminated cleanly"; exit 0; }

  (( attempt > 0 )) && log "restart attempt $attempt/$MAX_RETRIES"

  "$@" &
  CHILD_PID="$!"
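  wait "$CHILD_PID"
  rc="$?"
  # A trapped signal makes wait return early, before the child is gone.
  # Wait once more so the child has finished before the supervisor exits.
  (( TERMINATING )) && wait "$CHILD_PID" 2>/dev/null
  CHILD_PID=""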

  (( TERMINATING )) && { log "terminated cleanly"; exit 0; }
  (( rc == 0    )) && { log "command exited cleanly"; exit 0; }

  log "command exited with rc=$rc"
  (( attempt >= MAX_RETRIES )) && break

  wait_s=$(( 1 << attempt ))  # 1,2,4,8,16 ('local' is only valid inside functions)
  log "backing off ${wait_s}s"
  sleep "$wait_s"
  (( attempt++ ))
done

log "max retries ($MAX_RETRIES) exhausted — giving up"
exit 1
#!/usr/bin/env bats
# test/integration/supervise_test.bats

setup() {
  load '../test_helper/bats-support/load'
  load '../test_helper/bats-assert/load'
  SCRIPT="${BATS_TEST_DIRNAME}/../../bin/supervise.sh"
  MOCK_DIR="${BATS_TEST_TMPDIR}/mock"
  mkdir -p "$MOCK_DIR"
}

@test "clean exit: supervisor exits 0 immediately" {
  # Command that exits 0
  printf '#!/usr/bin/env bash\nexit 0\n' > "$MOCK_DIR/success_cmd"
  chmod +x "$MOCK_DIR/success_cmd"

  run bash "$SCRIPT" "$MOCK_DIR/success_cmd"
  assert_success
  assert_output --partial "exited cleanly"
}

@test "restart on failure: command fails twice then succeeds" {
  # Fails first 2 attempts, succeeds on 3rd
  printf '#!/usr/bin/env bash
C="%s/c.count"
n=$(cat "$C" 2>/dev/null || echo 0)
echo $(( n+1 )) > "$C"
(( n < 2 )) && exit 1 || exit 0
' "$MOCK_DIR" > "$MOCK_DIR/flaky"
  chmod +x "$MOCK_DIR/flaky"

  # Patch out sleep to speed up the test
  printf '#!/usr/bin/env bash\nexit 0\n' > "$MOCK_DIR/sleep"
  chmod +x "$MOCK_DIR/sleep"
  export PATH="$MOCK_DIR:$PATH"

  run bash "$SCRIPT" "$MOCK_DIR/flaky"
  assert_success
  assert_output --partial "restart attempt"
}

@test "max retries exhausted: exits 1" {
  printf '#!/usr/bin/env bash\nexit 1\n' > "$MOCK_DIR/always_fail"
  chmod +x "$MOCK_DIR/always_fail"
  printf '#!/usr/bin/env bash\nexit 0\n' > "$MOCK_DIR/sleep"
  chmod +x "$MOCK_DIR/sleep"
  export PATH="$MOCK_DIR:$PATH"

  run bash "$SCRIPT" "$MOCK_DIR/always_fail"
  assert_failure
  assert_output --partial "exhausted"
}

@test "SIGTERM propagates to child and supervisor exits 0" {
  # Long-running child that records whether it got SIGTERM
  printf '#!/usr/bin/env bash
trap "touch %s/got_term; exit 0" TERM
sleep 60 &
wait
' "$MOCK_DIR" > "$MOCK_DIR/long_cmd"
  chmod +x "$MOCK_DIR/long_cmd"

  # Start supervisor in background, send SIGTERM after short delay
  bash "$SCRIPT" "$MOCK_DIR/long_cmd" &
  SUP_PID="$!"
  sleep 0.5
  kill -TERM "$SUP_PID"
  wait "$SUP_PID"
  sup_rc="$?"

  assert_equal "$sup_rc" "0"
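  # assert_file_exists comes from bats-file, which setup() does not load,
  # so use a plain test. Poll briefly: the child may still be running
  # its trap when the supervisor exits.
  for _try in 1 2 3 4 5 6 7 8 9 10; do
    [ -e "$MOCK_DIR/got_term" ] && break
    sleep 0.2
  done
  [ -e "$MOCK_DIR/got_term" ]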
}
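
The back-off schedule in the exercise (1s, 2s, 4s, 8s, 16s) is a left shift of 1 by the attempt number. As a small sketch, the calculation can be isolated in a function and capped so that a larger retry budget does not produce very long waits; the function name and the cap parameter are assumptions of this example, not part of the exercise.

```bash
#!/usr/bin/env bash
# backoff_delay ATTEMPT [CAP]: seconds to wait before restart number
# ATTEMPT+1, doubling each time: 1, 2, 4, 8, 16, ...
# CAP (default 16) is an assumption of this sketch; it bounds the delay.
backoff_delay() {
  local attempt="$1" cap="${2:-16}"
  local delay=$(( 1 << attempt ))
  (( delay > cap )) && delay=$cap
  printf '%d\n' "$delay"
}

for attempt in 0 1 2 3 4; do
  backoff_delay "$attempt"
done
```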