Chapter 6 — Process Management & Control
Understanding processes in depth is what separates an administrator who reacts to problems from one who controls them. This chapter covers finding, inspecting, and terminating processes safely — including the cases where the obvious approach (kill -9) doesn't work and you need to understand why.
What this chapter covers: Reading ps output and process trees. pgrep and pstree for finding processes. Signals (SIGTERM, SIGKILL, SIGHUP, SIGUSR1/SIGUSR2): what they mean and in what order to use them. Process groups and why killing the parent isn't always enough. pkill -P for process trees. strace and lsof -p for inspecting stuck processes. systemctl's signal control. Scenario 1: a script spawned 200 workers. Scenario 2: a process ignores SIGTERM. Scenario 3: kill -9 did nothing, with uninterruptible D state explained.
ps — Reading Process Output
ps snapshots the current process table. Unlike top/htop it doesn't update — it captures the state at the moment you run it, making it useful for scripting and for examining a specific process without the noise of a live display.
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.1 167852 9248 ? Ss Jun01 0:04 /sbin/init
postgres 2341 0.3 2.1 312148 87321 ? Ss Jun01 12:44 postgres: writer process
www-data 8821 98.0 0.4 12341 4182 ? R 14:30 0:42 python3 worker.py
$ ps faux
USER PID %CPU %MEM VSZ RSS STAT COMMAND
root 1234 0.0 0.1 8234 4121 Ss bash run_jobs.sh
root 1235 0.4 0.2 9341 4822 S \_ python3 worker.py --id=1
root 1236 0.4 0.2 9341 4822 S \_ python3 worker.py --id=2
root 1237 0.4 0.2 9341 4822 S \_ python3 worker.py --id=3
$ ps -o pid,ppid,stat,rss,comm -p 8821
PID PPID STAT RSS COMM
8821 1234 R 4182 python3
$ ps aux --sort=-%cpu | head -10
$ ps aux --sort=-%mem | head -10
Process states — the STAT column
| Code | State | What it means | Action if stuck |
| R | Running | Currently executing on CPU, or runnable and waiting for a CPU core. High %CPU in top correlates with R. | Normal — investigate only if unexpected |
| S | Sleeping | Waiting for an event (timer, network, user input). Most idle processes sit here. Interruptible — signals are delivered normally. | Normal idle state |
| D | Uninterruptible sleep | Waiting for a kernel I/O operation to complete. Signals, including SIGKILL, stay pending until the wait ends, so kill -9 appears to do nothing. Usually brief; if persistent, the I/O source is hung. | Fix the hung resource, often NFS or a failing disk. May require reboot. |
| Z | Zombie | Process has exited but parent hasn't called wait() to collect exit status. No resources used (no memory, no CPU). Just a row in the process table. | Kill or restart the parent. kill -9 has no effect on the zombie itself. |
| T | Stopped | Suspended after receiving SIGSTOP, or SIGTSTP from Ctrl+Z. Not running, not consuming CPU. Waiting to be resumed (SIGCONT) or terminated. | Send SIGCONT to resume, or SIGKILL to terminate |
| s | Session leader | Lowercase modifier. This process is the leader of its session (usually the shell that started a job group). | Informational |
| l | Multi-threaded | Lowercase L. Process has multiple threads. Kill the process PID, not individual thread PIDs. | Informational |
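To see at a glance how many processes sit in each state, the first letter of the STAT column can be tallied. The following is a minimal sketch assuming procps `ps`; the function name `state_summary` is made up for illustration.

```shell
# state_summary: count processes per primary state (first letter of STAT).
# Reads STAT values on stdin, one per line; modifiers such as "s" or "l"
# after the first letter are ignored.
state_summary() {
    awk '{ count[substr($1, 1, 1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use against the live process table:
#   ps -eo stat= | state_summary
```

A D count that stays above zero across several runs, or a Z count that keeps growing, is the cue to look closer.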
pgrep and pstree
$ pgrep python3
$ pgrep -c python3
$ pgrep -l python3
$ pgrep -a python3
$ pgrep -f "worker.py"
$ pgrep -u www-data python3
$ pstree -p
$ pstree -p 1234
bash(1234)─┬─python3(1235)
├─python3(1236)
└─python3(1237)
Signals — What They Are and Which to Use
A signal is a software interrupt sent to a process. The kernel delivers it asynchronously, so the process can be in the middle of any operation when it arrives. Each signal has a default action, but a process can install a custom handler (catching the signal and doing something else) or explicitly ignore it. The exceptions are SIGKILL and SIGSTOP: neither can be caught, blocked, or ignored under any circumstances.
15
SIGTERM
Polite termination request. The default signal sent by kill PID and systemctl stop. A well-behaved process catches this, finishes in-flight work, flushes buffers, removes PID files, and exits cleanly. A process can ignore or delay responding to SIGTERM — it is just a request.
9
SIGKILL
Unconditional force kill. Cannot be caught, blocked, or ignored — the kernel enforces it directly. The process has no chance to clean up: open files are closed without flushing, database transactions are abandoned, PID files are left on disk. Use as a last resort after SIGTERM fails.
1
SIGHUP
Hangup / reload config. Originally meant "the terminal disconnected." Modern daemons (nginx, sshd, rsyslog) use it as "re-read your configuration file without restarting." Check the application's documentation before sending: the default action for SIGHUP is to terminate, so a process that installs no handler will simply exit.
2
SIGINT
Interrupt (Ctrl+C). Sent to the foreground process group when you press Ctrl+C in the terminal. Catchable: a process can install a handler for it. Many programs treat SIGINT the same as SIGTERM but exit immediately rather than draining work.
10/12
SIGUSR1 / SIGUSR2
User-defined, application-specific. Each application decides what these mean, and the default action when no handler is installed is to terminate the process, so never send them blind. nginx: SIGUSR1 = reopen log files (use for log rotation). Apache: SIGUSR1 = graceful restart. PostgreSQL: SIGUSR1 = various internal uses. Always check the application's docs.
18/19
SIGCONT / SIGSTOP
Resume / Suspend. SIGSTOP suspends a process: it enters T state and consumes no CPU. Ctrl+Z has the same effect but sends SIGTSTP, which a process can catch. SIGCONT resumes the process from where it stopped. Useful for temporarily pausing a CPU-intensive process. Like SIGKILL, SIGSTOP cannot be caught or ignored.
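What "catching" a signal looks like in practice: a shell script installs a handler with trap. The sketch below is a hypothetical worker (the name `graceful_worker` and its lock-file argument are assumptions for illustration, not taken from any real service) that removes its lock file when SIGTERM arrives.

```shell
# graceful_worker LOCKFILE: loop until SIGTERM arrives, then clean up and exit 0.
# Meant to be started in the background, because the handler calls "exit".
graceful_worker() {
    lockfile=$1
    : > "$lockfile"                         # create the lock file
    trap 'rm -f "$lockfile"; exit 0' TERM   # handler runs when SIGTERM arrives
    while :; do
        # Sleep in the background and wait on it: the shell runs a trap
        # handler as soon as "wait" is interrupted, instead of only after
        # a foreground sleep has finished.
        sleep 1 &
        wait $!
    done
}

# Typical use:
#   graceful_worker /tmp/worker.lock &
#   kill $!        # SIGTERM: the lock file is removed, exit status is 0
```

Sending SIGKILL to the same worker would leave the lock file behind, because the handler never gets a chance to run.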
The correct order — try SIGTERM before SIGKILL
1. Try SIGTERM first (always)
kill PID — sends SIGTERM (signal 15 by default)
Wait 10–30 seconds. A well-behaved process will:
— Finish the current request
— Flush write buffers
— Close database connections cleanly
— Remove PID/lock files
— Exit with a meaningful exit code
For services, systemctl stop nginx sends the unit's stop signal (SIGTERM by default), waits up to TimeoutStopSec (90 seconds by default), and then sends SIGKILL to anything still running.
2. Only then — SIGKILL
kill -9 PID or kill -SIGKILL PID
Use only after SIGTERM has not worked after a reasonable wait. Consequences:
— Partially written files may be corrupt
— Database WAL/journals may need recovery on restart
— PID files, lock files, and socket files may be left on disk
— Temporary files in /tmp may not be cleaned up
If SIGKILL also has no effect, the process is most likely in D state (see Scenario 3), or it is already a zombie waiting for its parent to collect it.
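The SIGTERM-then-SIGKILL sequence above can be sketched as a small helper. The function names `is_running` and `stop_gracefully` and the 20-second default are assumptions for illustration; the liveness check uses `ps` rather than `kill -0` alone so that a not-yet-reaped zombie counts as exited.

```shell
# is_running PID: true if the process exists and is not a zombie.
# (kill -0 alone also succeeds for a zombie that has not been reaped yet.)
is_running() {
    state=$(ps -o stat= -p "$1" 2>/dev/null) || return 1
    case $state in
        ''|Z*) return 1 ;;
        *)     return 0 ;;
    esac
}

# stop_gracefully PID [GRACE_SECONDS]
# Returns 0 if the process exited after SIGTERM, 1 if SIGKILL was sent,
# 2 if the process could not be signalled at all.
stop_gracefully() {
    pid=$1
    grace=${2:-20}                 # assumed default; tune per application
    kill -TERM "$pid" 2>/dev/null || return 2   # no such process, or no permission
    waited=0
    while is_running "$pid"; do
        if [ "$waited" -ge "$grace" ]; then
            echo "PID $pid still alive after ${grace}s, sending SIGKILL" >&2
            kill -KILL "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

For example, `stop_gracefully 8821 30` gives PID 8821 thirty seconds to exit cleanly before escalating. A return value of 1 is the reminder to go looking for leftover PID, lock, and socket files.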
$ kill PID
$ kill -9 PID
$ kill -1 PID
$ kill -SIGTERM PID
$ kill -l
$ pkill python3
$ pkill -9 python3
$ pkill -f "worker.py"
$ pkill -u www-data
$ kill -0 PID
bash: kill: (8821) - No such process
$ ps -p PID
Process Groups — Killing a Whole Tree
Every process belongs to a process group (PGID). When you run a script, the script and all the processes it spawns typically share the same process group. This makes it possible to terminate an entire job tree in one command without having to enumerate every child PID.
Process group PGID=1234
┌─────────────────────────────────────┐
│ │
│ bash run_jobs.sh PID=1234 │
│ │ │
│ ├── python3 worker.py 1235 │
│ ├── python3 worker.py 1236 │
│ │ └── gzip 1298 │
│ └── python3 worker.py 1237 │
│ │
└─────────────────────────────────────┘
kill 1234 → SIGTERM to the bash script only (parent)
pkill -P 1234 → SIGTERM to direct children of PID 1234 (1235, 1236, 1237)
kill -- -1234 → SIGTERM to every process in PGID 1234 (all of the above)
kill -9 -- -1234 → SIGKILL to entire process group
$ ps -o pid,pgid,ppid,comm -p 1235
PID PGID PPID COMM
1235 1234 1234 python3
$ pkill -P 1234
$ pkill -9 -P 1234
$ kill -- -1234
$ kill -9 -- -1234
$ systemctl stop myapp
$ systemctl kill --signal=SIGHUP nginx
Why killing the parent often isn't enough: If you kill the parent script with SIGTERM, the parent exits but does not automatically kill its children. They become orphaned, are re-parented to PID 1 (systemd), and keep running. Use kill -- -PGID to reach the whole tree, or pkill -P to reach the direct children only.
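When the tree does not share one process group, grandchildren such as the gzip process in the diagram are missed by pkill -P. One way to reach them is to walk the tree with pgrep -P. This is a minimal sketch; the function name `descendants` is hypothetical, and processes that fork while the walk is running can still be missed.

```shell
# descendants PID: print every descendant PID, one per line, depth first.
# Relies on pgrep -P (direct children of a given parent) from procps.
descendants() {
    for child in $(pgrep -P "$1"); do
        echo "$child"
        descendants "$child"
    done
}

# Typical use: signal the whole tree under 1234, deeper levels included.
#   kill -TERM $(descendants 1234) 1234
```

Collect the list before signalling the parent: once the parent has exited, its children are re-parented and can no longer be found through its PID.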
strace and lsof -p — Inspecting a Stuck Process
$ strace -p 8821
strace: Process 8821 attached
read(5, ^ ← blocked on read() of file descriptor 5 — waiting for data
$ ls -la /proc/8821/fd/5
lrwxrwxrwx 1 www-data www-data 0 Jun 14 14:30 /proc/8821/fd/5 -> socket:[234821]
$ strace -p 8821 -T 2>&1 | head -30
futex(0x7f..., FUTEX_WAIT, 1, NULL) = 0 <0.000001>
futex(0x7f..., FUTEX_WAIT, 1, NULL) = 0 <0.000001>
$ lsof -p 8821
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python3 8821 www cwd DIR 8,1 4096 2 /opt/app
python3 8821 www txt REG 8,1 19504 2345678 /usr/bin/python3
python3 8821 www mem REG 8,1 167984 1234567 /lib/x86_64-linux-gnu/libc.so
python3 8821 www 0u CHR 1,3 0t0 5 /dev/null
python3 8821 www 1u REG 8,1 45890 9876543 /var/log/app/app.log
python3 8821 www 5u IPv4 234821 0t0 TCP 10.0.0.5:44321->10.0.0.10:5432 (ESTABLISHED)
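When lsof is not installed, the same descriptor-to-target mapping can be read straight from /proc, as the ls -la /proc/8821/fd/5 lookup above does for a single descriptor. A minimal sketch for Linux; the function name `list_fds` is an assumption for illustration.

```shell
# list_fds PID: print "FD -> target" for every open descriptor of a process,
# using the symlinks the kernel exposes under /proc/PID/fd.
list_fds() {
    for link in /proc/"$1"/fd/*; do
        [ -L "$link" ] || continue          # skip if the glob matched nothing
        printf '%s -> %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

# Typical use (as root or as the process owner):
#   list_fds 8821 | grep socket
```

The socket inode printed here (for example socket:[234821]) is the same number that lsof shows in its DEVICE column for network sockets, which is how the blocked descriptor from strace can be tied to a remote address.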
Scenario 1 — A Script Spawned 200 Workers
1. Confirm what you're dealing with.
$ pgrep -c -f "worker.py"
203
$ pgrep -af "worker.py" | head -10
1235 python3 /opt/app/worker.py --job-id=5a12b --queue=default
1236 python3 /opt/app/worker.py --job-id=5a12b --queue=default
1237 python3 /opt/app/worker.py --job-id=5a12b --queue=default
$ ps -o pid,ppid,stat,comm -p 1235
PID PPID STAT COMM
1235 1234 S python3
$ ps -o pid,ppid,stat,comm -p 1234
PID PPID STAT COMM
1234 1 S bash
2. Check resource impact before deciding on an approach.
$ ps aux | awk '/[w]orker\.py/ {cpu+=$3; mem+=$4; count++} END {print count " workers using " cpu "% CPU and " mem "% MEM"}'
203 workers using 812.4% CPU and 6.2% MEM
3. Terminate the parent first to give it a chance to clean up.
$ kill 1234
$ sleep 10
$ pgrep -c -f "worker.py"
203
4. Terminate all workers with pkill, matching on the full command line.
$ pkill -TERM -f "worker.py --job-id=5a12b"
$ sleep 15
$ pgrep -c -f "worker.py"
12
$ pkill -9 -f "worker.py --job-id=5a12b"
$ pgrep -c -f "worker.py"
0
5. Alternative: if they all share a process group, one command kills the tree.
$ ps -o pgid= -p 1235 | head -1
1234
$ kill -- -1234
$ sleep 10
$ kill -9 -- -1234
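A group-wide kill is only safe when the target group is not also your own, which can happen when the workers were started from a non-interactive script. The sketch below wraps the PGID lookup and adds that one check; `pgid_of` and `kill_group` are hypothetical names used for illustration.

```shell
# pgid_of PID: print the process group ID with padding stripped.
pgid_of() {
    ps -o pgid= -p "$1" | tr -d ' '
}

# kill_group PID [SIGNAL]: signal the whole process group that PID belongs to,
# but refuse when that group is also the caller's own group.
kill_group() {
    target=$(pgid_of "$1")
    own=$(pgid_of $$)
    if [ -z "$target" ]; then
        echo "kill_group: no such process: $1" >&2
        return 1
    fi
    if [ "$target" = "$own" ]; then
        echo "kill_group: PID $1 shares this shell's group ($own), refusing" >&2
        return 1
    fi
    kill -s "${2:-TERM}" -- "-$target"
}
```

`kill_group 1235` would send SIGTERM to everything in PGID 1234, and `kill_group 1235 KILL` is the follow-up if some members are still running after the grace period.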
Always use -f with pkill to match against the full command line — matching just the process name risks catching unrelated processes with the same name (e.g., pkill python3 would kill all Python processes on the server). The more specific your match, the safer.
Scenario 2 — Process Ignores SIGTERM
1. Check whether the process is actually in D state. If so, SIGKILL won't help either.
$ ps -o pid,stat,comm -p 8821
PID STAT COMM
8821 S python3
2. Use strace to see what it's doing: is it stuck, or deliberately ignoring the signal?
$ strace -p 8821 -T 2>&1 | head -20
read(5, ← blocked waiting for data on fd 5 (a socket — database connection)
3. If it's genuinely ignoring SIGTERM and needs to be stopped now, use SIGKILL.
$ kill -9 8821
$ sleep 2
$ kill -0 8821
bash: kill: (8821) - No such process
$ find /run -name "*.pid" 2>/dev/null
$ find /run/lock -name "*.lock" 2>/dev/null
$ ls -lt /tmp | head -20
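After a SIGKILL, the PID file of the dead process is still on disk. A quick way to spot such leftovers is to compare each recorded PID against /proc. This is a sketch under the assumption that PID files hold the PID on their first line; the function name `stale_pidfiles` is hypothetical.

```shell
# stale_pidfiles DIR: print PID files under DIR whose recorded PID no longer
# corresponds to a live process.
stale_pidfiles() {
    find "$1" -name '*.pid' -type f 2>/dev/null | while read -r file; do
        pid=$(head -n 1 "$file" 2>/dev/null | tr -cd '0-9')
        [ -n "$pid" ] || continue                 # empty or unreadable file
        [ -d "/proc/$pid" ] || echo "$file"       # no /proc entry: process is gone
    done
}

# Typical use:
#   stale_pidfiles /run
```

PID numbers are reused, so a file that passes this check is not proof that the original process is still the one running. Treat the output as a list of candidates to review, not files to delete automatically.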
4. For services managed by systemd, let systemctl handle the sequence for you.
$ systemctl stop myapp
$ systemctl status myapp
$ systemctl show -p TimeoutStopUSec myapp
$ journalctl -u myapp --since "5 minutes ago" | grep -E "stop|kill|timeout"
Many well-written services intentionally delay SIGTERM — they finish an in-flight transaction or drain a queue before exiting. This is correct behaviour. The question to ask before reaching for SIGKILL is: "is this process making progress, or is it genuinely stuck?" strace answers that in seconds.
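Alongside strace, a rough "is it making progress?" signal is whether the process consumed any CPU time over a short window. The sketch below reads the utime and stime counters from /proc/PID/stat on Linux; `cpu_ticks`, `is_progressing`, and the 5-second default window are assumptions for illustration.

```shell
# cpu_ticks PID: print utime + stime (clock ticks of CPU consumed so far),
# read from /proc/PID/stat. The comm field can contain spaces and
# parentheses, so everything up to the last ")" is stripped before splitting.
cpu_ticks() {
    sed 's/^.*) //' "/proc/$1/stat" 2>/dev/null | awk '{ print $12 + $13 }'
}

# is_progressing PID [SECONDS]: true if the process used any CPU during the
# sampling window (default 5 seconds, an arbitrary choice).
is_progressing() {
    before=$(cpu_ticks "$1")
    sleep "${2:-5}"
    after=$(cpu_ticks "$1")
    [ -n "$before" ] && [ -n "$after" ] && [ "$after" -gt "$before" ]
}
```

This is only a heuristic. A process that is legitimately waiting on a slow database reply also shows no CPU use, so a negative result is a prompt to attach strace, not a verdict.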
Scenario 3 — kill -9 Did Nothing
1. Confirm D state: SIGKILL stays pending while a process is in uninterruptible sleep.
$ ps -o pid,stat,comm -p 8821
PID STAT COMM
8821 D python3
$ ps -o pid,stat,etimes,comm -p 8821
PID STAT ELAPSED COMM
8821 D 1842 python3
2. Identify what I/O the process is waiting on.
$ cat /proc/8821/wchan
nfs_wait_on_request
$ dmesg -T | grep -E "NFS|nfs|hung|timeout" | tail -20
[Jun14 14:01] nfs: server 192.168.50.100 not responding, timed out
[Jun14 14:05] nfs: server 192.168.50.100 not responding, timed out
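A hung resource rarely blocks just one process. To list every D-state process together with the kernel function it is waiting in, the ps output can be filtered on the STAT column. A minimal sketch assuming procps `ps`; the filter name `dstate_only` is made up for illustration.

```shell
# dstate_only: filter "ps -eo pid,stat,wchan:32,comm" output down to the
# header line plus processes whose state starts with D.
dstate_only() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Typical use:
#   ps -eo pid,stat,wchan:32,comm | dstate_only
```

Several processes sharing the same wait function usually point at one shared cause, such as a single unresponsive NFS server.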
3. Try to resolve the underlying resource. Once the blocked I/O returns, the process can receive the pending signal and exit.
$ watch -n 2 'ps -o pid,stat,comm -p 8821'
$ umount -l /mnt/nfs
$ sleep 5
$ ps -o stat -p 8821 | head -2
STAT
S
$ umount -f -l /mnt/nfs
4. If the resource cannot be recovered, a reboot is the only remaining option.
$ systemctl list-units --state=failed
$ who
$ lsof /mnt/nfs 2>/dev/null | grep -v COMMAND
$ shutdown -r +5 "NFS mount hung, rebooting in 5 minutes. Please save your work."
5. Prevention: mount NFS with a timeout so this can't happen again.
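# Before: the default hard mount retries forever, so a dead server leaves processes in D state
192.168.50.100:/exports /mnt/nfs nfs defaults 0 0
# After: a soft mount returns an I/O error once retries are exhausted (timeo is in tenths of a second)
192.168.50.100:/exports /mnt/nfs nfs soft,timeo=30,retrans=3,_netdev 0 0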
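To check which NFS mounts on a running system still use the hard behaviour, the options column of /proc/mounts can be inspected. This is a sketch; the function name `hard_nfs_mounts` is hypothetical, and it assumes the kernel lists soft or softerr among the options when either is in effect.

```shell
# hard_nfs_mounts: read /proc/mounts-style lines on stdin and print the mount
# point of every NFS filesystem that carries neither "soft" nor "softerr".
hard_nfs_mounts() {
    awk '$3 ~ /^nfs4?$/ && ("," $4 ",") !~ /,soft(err)?,/ { print $2 }'
}

# Typical use:
#   hard_nfs_mounts < /proc/mounts
```

Soft mounts trade hangs for I/O errors, which applications must then handle, so whether to convert a given mount depends on what writes to it.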
D-state processes stuck on local disk I/O (not NFS) are rarer and usually indicate a failing drive or a kernel bug. Check dmesg for I/O errors, run smartctl -a /dev/sdX for SMART data, and check whether the filesystem has been remounted read-only due to errors (dmesg | grep -i "read-only").
Quick Reference — Chapter 6 Commands
| Command | Purpose | Notes |
| ps faux | Process list with forest/tree view showing parent→child relationships | ps -o pid,ppid,stat,comm -p PID for a specific process with PPID |
| pgrep -c -f "pattern" | Count processes matching a full command line pattern | -a show full command line · -l PID + name · -u user filter by owner |
| pstree -p PID | Visual process tree rooted at a specific PID | Omit PID to see the full system tree |
| kill PID | SIGTERM (15) — polite termination request. Try first, always. | kill -0 PID checks existence without sending a signal |
| kill -9 PID | SIGKILL — unconditional force kill. Use only after SIGTERM fails. | Cannot kill D-state processes. Look for leftover PID/lock files after. |
| kill -1 PID | SIGHUP — reload configuration without restarting (for daemons) | Check application docs — not universal. nginx/sshd/rsyslog all support it. |
| pkill -f "pattern" | SIGTERM by full command line match — safer than name match alone | pkill -9 -f for SIGKILL · pkill -P PPID for direct children only |
| kill -- -PGID | SIGTERM to the entire process group: the parent plus every descendant still in that group | kill -9 -- -PGID for SIGKILL · find PGID with ps -o pgid= -p PID |
| systemctl stop service | SIGTERM → wait TimeoutStopSec (90s by default) → SIGKILL. Covers the full cgroup. | Change the wait with TimeoutStopSec= in the unit file · systemctl kill --signal=SIGHUP for custom signal |
| strace -p PID | Attach to running process and show every system call in real time | -T show time per call · -e trace=network filter to network calls only · adds CPU overhead |
| lsof -p PID | List all open files, sockets, and devices held by a process | lsof -p PID | wc -l count FDs · grep deleted find open-but-deleted files |
| cat /proc/PID/wchan | Which kernel function a D-state process is waiting on | nfs_wait_on_request = NFS hang · jbd2_log_wait_commit = disk journal · blk_wait_io = block I/O |
| umount -l /mnt/path | Lazy unmount: detaches the mount from the directory tree immediately and cleans up once it is no longer busy | Can free processes stuck on NFS · -f adds force (risk of data loss) |