Chapter 6 — Process Management & Control

Understanding processes in depth is what separates an administrator who reacts to problems from one who controls them. This chapter covers finding, inspecting, and terminating processes safely — including the cases where the obvious approach (kill -9) doesn't work and you need to understand why.

What this chapter covers: Reading ps output and process trees. pgrep and pstree for finding processes. Signals — SIGTERM, SIGKILL, SIGHUP, SIGUSR — what they mean and in what order to use them. Process groups and why killing the parent isn't always enough. pkill -P for process trees. strace and lsof -p for inspecting stuck processes. systemctl's signal control. Scenario 1: a script spawned 200 workers. Scenario 2: a process ignores SIGTERM. Scenario 3: kill -9 did nothing — uninterruptible D-state explained.

ps — Reading Process Output

ps snapshots the current process table. Unlike top/htop it doesn't update — it captures the state at the moment you run it, making it useful for scripting and for examining a specific process without the noise of a live display.

# The standard all-process listing
$ ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY STAT START   TIME COMMAND
root         1  0.0  0.1 167852  9248 ?   Ss   Jun01   0:04 /sbin/init
postgres  2341  0.3  2.1 312148 87321 ?   Ss   Jun01  12:44 postgres: writer process
www-data  8821 98.0  0.4  12341  4182 ?   R    14:30   0:42 python3 worker.py
# VSZ  = virtual memory size (total address space allocated)
# RSS  = resident set size (memory actually in RAM)   ← the useful number
# STAT = process state (see the state table below)
# TIME = total CPU time consumed since process started (not "right now")

# Forest view: shows parent/child relationships visually
$ ps faux
USER  PID %CPU %MEM  VSZ  RSS STAT COMMAND
root 1234  0.0  0.1 8234 4121 Ss   bash run_jobs.sh
root 1235  0.4  0.2 9341 4822 S     \_ python3 worker.py --id=1
root 1236  0.4  0.2 9341 4822 S     \_ python3 worker.py --id=2
root 1237  0.4  0.2 9341 4822 S     \_ python3 worker.py --id=3
# The \_ indentation shows these python3 workers are children of the bash script (PID 1234)

# Get a specific PID's details, including its parent
$ ps -o pid,ppid,stat,rss,comm -p 8821
 PID PPID STAT  RSS COMM
8821 1234 R    4182 python3
# PPID=1234: this python3 process's parent is PID 1234 (the bash script)

# Sort by CPU or memory usage
$ ps aux --sort=-%cpu | head -10   # top 10 by CPU
$ ps aux --sort=-%mem | head -10   # top 10 by memory
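
ps reports RSS in KiB, which is awkward to compare by eye. The following is a minimal sketch, assuming a hypothetical helper named rss_mib that reads "PID RSS COMMAND" rows on stdin and prints the size in MiB, largest first. The helper name and output layout are illustrative, not part of ps.

```shell
# rss_mib (hypothetical helper): read "PID RSS_KiB COMMAND" rows on stdin
# and print "PID MiB COMMAND", sorted by memory use, largest first.
# Only the first word of the command is kept.
rss_mib() {
    awk '{ printf "%s %.1f %s\n", $1, $2 / 1024, $3 }' | sort -k2,2 -rn
}
```

A typical call would be ps -eo pid=,rss=,comm= | rss_mib | head -10, where the trailing = signs suppress the header row.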

Process states — the STAT column

Code	State	What it means	Action if stuck
R	Running	Currently executing on CPU, or runnable and waiting for a CPU core. High %CPU in top correlates with R.	Normal — investigate only if unexpected
S	Sleeping	Waiting for an event (timer, network, user input). Most idle processes sit here. Interruptible — signals are delivered normally.	Normal idle state
D	Uninterruptible sleep	Waiting for a kernel I/O operation to complete. Signals, including SIGKILL, stay pending until the operation returns. Usually brief; if persistent, the I/O source is hung.	Fix the hung resource, often NFS or a failing disk. May require a reboot.
Z	Zombie	Process has exited but parent hasn't called wait() to collect exit status. No resources used (no memory, no CPU). Just a row in the process table.	Kill or restart the parent. kill -9 has no effect on the zombie itself.
T	Stopped	Suspended after receiving SIGSTOP, or SIGTSTP from Ctrl+Z. Not running, not consuming CPU. Waiting to be resumed (SIGCONT) or terminated.	Send SIGCONT to resume, or SIGKILL to terminate
s	Session leader	Lowercase modifier. This process is the leader of its session (usually the shell that started a job group).	Informational
l	Multi-threaded	Lowercase L. Process has multiple threads. Kill the process PID, not individual thread PIDs.	Informational
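
STAT values combine one state letter with optional modifiers, so "Ss" reads as "sleeping, session leader". A small sketch of that decoding, assuming a hypothetical describe_stat helper that knows only the codes listed in the table above (real ps output has further modifiers, which this sketch ignores):

```shell
# describe_stat (hypothetical helper): expand a ps STAT value such as "Ss"
# or "Sl" into words, using only the codes from the table above.
describe_stat() {
    stat=$1
    case $stat in
        R*) desc="running or runnable" ;;
        S*) desc="interruptible sleep" ;;
        D*) desc="uninterruptible sleep" ;;
        Z*) desc="zombie" ;;
        T*) desc="stopped" ;;
        *)  desc="unknown state" ;;
    esac
    # Modifier flags follow the first character.
    case ${stat#?} in *s*) desc="$desc, session leader" ;; esac
    case ${stat#?} in *l*) desc="$desc, multi-threaded" ;; esac
    printf '%s\n' "$desc"
}
```

For example, describe_stat "$(ps -o stat= -p 8821)" would describe the state of PID 8821 from the listings above.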

pgrep and pstree

# pgrep: find PIDs by process name (faster than ps | grep)
$ pgrep python3               # PIDs of all processes named python3
$ pgrep -c python3            # count of matching processes
$ pgrep -l python3            # PID + name
$ pgrep -a python3            # PID + full command line (like ps -f)
$ pgrep -f "worker.py"        # match against full command line, not just name
$ pgrep -u www-data python3   # only processes owned by www-data

# pstree: visual process tree
$ pstree -p        # full tree with PIDs
$ pstree -p 1234   # tree rooted at PID 1234
bash(1234)─┬─python3(1235)
           ├─python3(1236)
           └─python3(1237)

Signals — What They Are and Which to Use

A signal is a software interrupt sent to a process. The kernel delivers it asynchronously, so the process can be in the middle of any operation when it arrives. Each signal has a default action, but a process can install a custom handler (catching the signal and doing something else) or explicitly ignore it. The exceptions are SIGKILL and SIGSTOP: neither can be caught, blocked, or ignored.

15 SIGTERM

Polite termination request. The default signal sent by kill PID and systemctl stop. A well-behaved process catches this, finishes in-flight work, flushes buffers, removes PID files, and exits cleanly. A process can ignore or delay responding to SIGTERM — it is just a request.

9 SIGKILL

Unconditional force kill. Cannot be caught, blocked, or ignored — the kernel enforces it directly. The process has no chance to clean up: open files are closed without flushing, database transactions are abandoned, PID files are left on disk. Use as a last resort after SIGTERM fails.

1 SIGHUP

Hangup / reload config. Originally meant "the terminal disconnected." Many daemons (nginx, sshd, rsyslog) install a handler that re-reads the configuration file without restarting. Check the application's documentation before sending: the default action for SIGHUP is to terminate, so a process without such a handler simply exits.

2 SIGINT

Interrupt (Ctrl+C). Sent to the foreground process group when you press Ctrl+C in the terminal. A process can catch it. Many programs treat SIGINT much like SIGTERM but exit immediately rather than draining work.

10/12 SIGUSR1 / SIGUSR2

User-defined, application-specific. These signals have no predefined meaning: each application decides what they do, and a process that has not installed a handler is terminated by them. nginx: SIGUSR1 = reopen log files (use for log rotation). Apache: SIGUSR1 = graceful restart. PostgreSQL: SIGUSR1 = various. Always check the application's docs.

18/19 SIGCONT / SIGSTOP

Resume / Suspend. SIGSTOP suspends a process: it enters T state and consumes no CPU. SIGCONT resumes it from where it stopped. Useful for temporarily pausing a CPU-intensive process. Like SIGKILL, SIGSTOP cannot be caught or ignored. Ctrl+Z sends the related SIGTSTP, which a process can catch.
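
To make "catching" a signal concrete, here is a minimal sketch of a worker loop that installs a SIGTERM handler and cleans up before exiting. The function name run_worker and the lock-file argument are hypothetical and exist only for illustration. The handler just sets a flag, which is the pattern a well-behaved process typically follows.

```shell
# run_worker (hypothetical): loop until SIGTERM arrives, then clean up.
# $1 is an illustrative lock-file path, not a convention of any real tool.
run_worker() {
    lockfile=$1
    stop=0
    trap 'stop=1' TERM            # custom handler: only set a flag
    : > "$lockfile"
    while [ "$stop" -eq 0 ]; do
        sleep 1 &                 # sleep in the background and wait on it,
        wait $!                   # so the trap runs as soon as TERM arrives
    done
    rm -f "$lockfile"             # cleanup that SIGKILL would have skipped
    return 0
}
```

Started with run_worker /tmp/demo.lock & and stopped with kill on its PID, the worker removes its lock file and exits with status 0. Stopped with kill -9, the lock file stays behind, because the handler never runs.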

The correct order — try SIGTERM before SIGKILL

1. Try SIGTERM first (always)

kill PID — sends SIGTERM (signal 15 by default)

Wait 10–30 seconds. A well-behaved process will:
— Finish the current request
— Flush write buffers
— Close database connections cleanly
— Remove PID/lock files
— Exit with a meaningful exit code

For services, systemctl stop nginx sends the unit's stop signal (SIGTERM by default) and waits up to TimeoutStopSec (90 seconds by default) before escalating to SIGKILL.

2. Only then — SIGKILL

kill -9 PID or kill -SIGKILL PID

Use only after SIGTERM has not worked after a reasonable wait. Consequences:
— Partially written files may be corrupt
— Database WAL/journals may need recovery on restart
— PID files, lock files, and socket files may be left on disk
— Temporary files in /tmp may not be cleaned up

If SIGKILL also has no effect → the process is most likely in D state (see Scenario 3), or it is already a zombie waiting for its parent to reap it.

# Sending signals
$ kill PID            # SIGTERM (15), the polite default
$ kill -9 PID         # SIGKILL, force kill
$ kill -1 PID         # SIGHUP, reload config
$ kill -SIGTERM PID   # same as kill PID, but explicit
$ kill -l             # list all signal names and numbers

# pkill: send a signal by name instead of PID
$ pkill python3          # SIGTERM to all processes named python3
$ pkill -9 python3       # SIGKILL to all processes named python3
$ pkill -f "worker.py"   # match full command line
$ pkill -u www-data      # terminate all processes owned by www-data

# Verify a process is gone
$ kill -0 PID   # "signal 0": no signal sent, just checks if PID exists
bash: kill: (8821) - No such process   ← process is gone
$ ps -p PID     # header only, no process row = process no longer exists
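
The SIGTERM-then-SIGKILL sequence can be wrapped in a small function. This is a sketch under stated assumptions: the helper names is_running and term_then_kill are hypothetical, the default 30-second grace period is taken from the 10 to 30 second wait suggested above, and the zombie check relies on Linux /proc.

```shell
# is_running PID: true while PID exists and is not a zombie.
# Assumes Linux /proc; a zombie still answers "kill -0", so check State too.
is_running() {
    kill -0 "$1" 2>/dev/null || return 1
    [ -r "/proc/$1/status" ] || return 0
    state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status" 2>/dev/null)
    [ "$state" != "Z" ]
}

# term_then_kill PID [GRACE_SECONDS] (hypothetical helper)
# Sends SIGTERM, polls once per second, and escalates to SIGKILL only if
# the process is still running when the grace period ends.
term_then_kill() {
    pid=$1
    grace=${2:-30}
    # If the signal cannot be sent, the PID is gone (or not ours to signal).
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        is_running "$pid" || return 0
        sleep 1
        grace=$((grace - 1))
    done
    is_running "$pid" || return 0
    echo "PID $pid still running after SIGTERM; sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
    sleep 1
    ! is_running "$pid"           # still fails for a process stuck in D state
}
```

Calling term_then_kill 8821 15 gives the process 15 seconds to exit cleanly. A non-zero return after the SIGKILL step is the cue to check for D state.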

Process Groups — Killing a Whole Tree

Every process belongs to a process group (PGID). When you run a script, the script and all the processes it spawns typically share the same process group. This makes it possible to terminate an entire job tree in one command without having to enumerate every child PID.

Process group PGID=1234
┌─────────────────────────────────────┐
│                                     │
│  bash run_jobs.sh          PID=1234 │
│    │                                │
│    ├── python3 worker.py   1235     │
│    ├── python3 worker.py   1236     │
│    │     └── gzip          1298     │
│    └── python3 worker.py   1237     │
│                                     │
└─────────────────────────────────────┘

kill 1234          → SIGTERM to the bash script only (parent)
pkill -P 1234      → SIGTERM to direct children of PID 1234 (1235, 1236, 1237)
kill -- -1234      → SIGTERM to every process in PGID 1234 (all of the above)
kill -9 -- -1234   → SIGKILL to entire process group

# Find the process group ID (PGID) of a process
$ ps -o pid,pgid,ppid,comm -p 1235
 PID PGID PPID COMM
1235 1234 1234 python3
# PGID=1234: same as the parent bash script. Killing PGID 1234 kills everything.

# Kill direct children of a parent PID (one level deep)
$ pkill -P 1234      # SIGTERM to all children of PID 1234
$ pkill -9 -P 1234   # SIGKILL to all children of PID 1234

# Kill entire process group (parent + all descendants at all levels)
$ kill -- -1234      # the -- separates options from the negative PGID
$ kill -9 -- -1234   # SIGKILL to entire group

# For a service: systemctl knows the full cgroup tree
$ systemctl stop myapp                   # cleanly stops all processes in the service's cgroup
$ systemctl kill --signal=SIGHUP nginx   # send custom signal to a service

Why killing the parent often isn't enough: If you kill the parent script with SIGTERM, the parent exits but does not automatically kill its children — they become orphaned and are re-parented to PID 1 (systemd). They keep running. Use pkill -P or kill -- -PGID to reach the whole tree.
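
Because pkill -P reaches only one level, a grandchild such as the gzip process in the diagram needs a recursive walk when the processes do not share a process group. A minimal sketch, assuming a hypothetical descendants helper that reads "PID PPID" pairs on stdin:

```shell
# descendants ROOT_PID (hypothetical helper)
# Reads "PID PPID" pairs on stdin and prints every descendant of ROOT_PID,
# one per line, at any depth.
descendants() {
    awk -v root="$1" '
        { parent[$1] = $2 }
        END {
            for (pid in parent) {
                # Walk up the ancestry of this PID until we reach root
                # or run out of known parents.
                p = parent[pid]
                hops = 0
                while (p != root && (p in parent) && hops < 100000) {
                    p = parent[p]
                    hops++
                }
                if (p == root) print pid
            }
        }
    ' | sort -n
}
```

Fed from ps, the call looks like ps -eo pid=,ppid= | descendants 1234. Collect the list before killing the parent: once the parent exits, its children are re-parented to PID 1 and the ancestry link is lost.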

strace and lsof -p — Inspecting a Stuck Process

strace intercepts and logs every system call (read, write, open, connect, etc.) a process makes. Attach to a running process with -p. Useful for answering "why is this process hung — what is it waiting for?"

strace -p PID — attach to running PID
strace -p PID -e trace=network — only network calls
strace -p PID -e trace=file — only file-related calls
strace -p PID -T — show time spent in each call
strace -p PID -tt — show wall-clock timestamp per call

Important: strace adds overhead (~20–30% CPU cost) to the traced process. Don't leave it attached on production long-term.

lsof (list open files) lists every file descriptor a process has open — including regular files, directories, sockets, pipes, and devices.

lsof -p PID — all open files for one PID
lsof -p PID | wc -l — count open file descriptors
lsof -p PID | grep IPv4 — network connections only
lsof -p PID | grep REG — regular files only
lsof -p PID | grep deleted — open-but-deleted files

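For "too many open files" errors: compare ls /proc/PID/fd | wc -l (the actual descriptor count; lsof output also includes a header line plus cwd, txt, and mem entries) against grep "open files" /proc/PID/limits, which shows both the soft and the hard limit.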
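
That comparison can be done in one step. The following sketch assumes Linux /proc and a hypothetical helper named fd_usage. It prints the open descriptor count followed by the soft and hard limits.

```shell
# fd_usage PID (hypothetical helper, Linux /proc assumed)
# Prints "<open> <soft-limit> <hard-limit>" for the given process.
fd_usage() {
    pid=$1
    open=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    # The limits line looks like: "Max open files  1024  1048576  files"
    limits=$(awk '/^Max open files/ { print $4, $5 }' "/proc/$pid/limits") || return 1
    # Unquoted on purpose: collapses any padding into single spaces.
    echo $open $limits
}
```

Reading another user's /proc/PID/fd requires root. Without it the count comes back as 0.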

# strace: diagnosing why a process is stuck
$ strace -p 8821
strace: Process 8821 attached
read(5,    ← blocked on read() of file descriptor 5, waiting for data

# Which file is fd=5?
$ ls -la /proc/8821/fd/5
lrwxrwxrwx 1 www-data www-data 0 Jun 14 14:30 /proc/8821/fd/5 -> socket:[234821]
# fd 5 is a socket: the process is blocked waiting for data from a network connection.
# This could be a database query that never returned, or a hung API call.

$ strace -p 8821 -T 2>&1 | head -30
futex(0x7f..., FUTEX_WAIT, 1, NULL) = 0 <0.000001>
futex(0x7f..., FUTEX_WAIT, 1, NULL) = 0 <0.000001>
# futex (fast mutex) calls taking ~0ms each = thread waiting on a lock.
# This is normal for multi-threaded idle processes.

# lsof: inventory of what a process has open
$ lsof -p 8821
COMMAND  PID USER  FD  TYPE DEVICE SIZE/OFF    NODE NAME
python3 8821 www   cwd DIR     8,1     4096       2 /opt/app
python3 8821 www   txt REG     8,1    19504 2345678 /usr/bin/python3
python3 8821 www   mem REG     8,1   167984 1234567 /lib/x86_64-linux-gnu/libc.so
python3 8821 www   0u  CHR     1,3      0t0       5 /dev/null
python3 8821 www   1u  REG     8,1    45890 9876543 /var/log/app/app.log
python3 8821 www   5u  IPv4 234821      0t0     TCP 10.0.0.5:44321->10.0.0.10:5432 (ESTABLISHED)
# fd 5 is a TCP connection to 10.0.0.10:5432, which is PostgreSQL. Still ESTABLISHED.
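
When strace shows a process blocked on a descriptor number, the same /proc lookup can be scripted. This sketch assumes Linux /proc and a hypothetical fd_targets helper. The optional second argument is a shell glob used to filter the targets, for example 'socket:*'.

```shell
# fd_targets PID [GLOB] (hypothetical helper, Linux /proc assumed)
# Prints "<fd> <target>" for each open descriptor of PID whose target
# matches GLOB (default: everything).
fd_targets() {
    pid=$1
    glob=${2:-*}
    for link in "/proc/$pid/fd"/*; do
        target=$(readlink "$link" 2>/dev/null) || continue
        case $target in
            $glob) printf '%s %s\n' "${link##*/}" "$target" ;;
        esac
    done
}
```

For the example above, fd_targets 8821 'socket:*' would list descriptor 5 alongside its socket inode, which can then be matched against the DEVICE column in the lsof output.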

Scenario 1 — A Script Spawned 200 Workers

Confirm what you're dealing with.

# How many processes match?
$ pgrep -c -f "worker.py"
203

# See their full command lines: are they all the same job?
$ pgrep -af "worker.py" | head -10
1235 python3 /opt/app/worker.py --job-id=5a12b --queue=default
1236 python3 /opt/app/worker.py --job-id=5a12b --queue=default
1237 python3 /opt/app/worker.py --job-id=5a12b --queue=default
# Same --job-id. All 203 belong to the same batch job.

# Who is the parent? (look at one PID's PPID)
$ ps -o pid,ppid,stat,comm -p 1235
 PID PPID STAT COMM
1235 1234 S    python3
$ ps -o pid,ppid,stat,comm -p 1234
 PID PPID STAT COMM
1234    1 S    run_jobs.sh
# Parent PID is 1234 (a bash script), whose own parent is PID 1 (systemd).
# The original shell that launched it has already exited.

Check resource impact before deciding on an approach.

$ ps aux | grep '[w]orker.py' | awk '{cpu+=$3; mem+=$4; count++} END {print count " workers using " cpu "% CPU and " mem "% MEM"}'
203 workers using 812.4% CPU and 6.2% MEM
# The [w] in the pattern stops grep from matching its own command line.
# 812% CPU across all 203 workers on a 16-core machine = 51% of total CPU capacity.
# That is a significant load, so terminating the job is worthwhile once you have confirmed it is unwanted.
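
The 51% figure comes from dividing the summed %CPU by the core count, since ps counts 100% as one full core. A small sketch of that arithmetic, using a hypothetical cpu_share helper:

```shell
# cpu_share TOTAL_PCT CORES (hypothetical helper)
# ps reports %CPU per core, so 100% means one full core. Dividing the summed
# figure by the core count gives the share of the whole machine.
cpu_share() {
    awk -v total="$1" -v cores="$2" 'BEGIN { printf "%.1f\n", total / cores }'
}
```

With the numbers from this scenario, cpu_share 812.4 16 prints 50.8. On a live system the core count can come from nproc, as in cpu_share 812.4 "$(nproc)".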

Terminate the parent first — give it a chance to clean up.

# Send SIGTERM to the parent script; it may clean up children itself
$ kill 1234
$ sleep 10
$ pgrep -c -f "worker.py"
203
# Parent exited but workers are still running: they ignored the parent dying.
# The workers are now orphaned (adopted by PID 1 / systemd).

Terminate all workers with pkill — match on the full command line.

# SIGTERM to all matching workers first (polite)
$ pkill -TERM -f "worker.py --job-id=5a12b"
$ sleep 15   # wait for graceful shutdown
$ pgrep -c -f "worker.py"
12
# 12 processes didn't respond to SIGTERM

# SIGKILL the stragglers
$ pkill -9 -f "worker.py --job-id=5a12b"
$ pgrep -c -f "worker.py"
0
# all gone

Alternative: if they all share a process group, one command kills the tree.

# Check the process group ID
$ ps -o pgid= -p 1235 | head -1
1234

# Kill the entire process group in one shot
$ kill -- -1234      # SIGTERM to all 203 workers + parent at once
$ sleep 10
$ kill -9 -- -1234   # SIGKILL stragglers

Always use -f with pkill to match against the full command line — matching just the process name risks catching unrelated processes with the same name (e.g., pkill python3 would kill all Python processes on the server). The more specific your match, the safer.

Scenario 2 — Process Ignores SIGTERM

Check if the process is actually in D state — if so, SIGKILL won't help either.

$ ps -o pid,stat,comm -p 8821
 PID STAT COMM
8821 S    python3
# STAT=S (Sleeping): it CAN receive signals. It's choosing not to respond to SIGTERM.
# This is different from D state, where signals stay pending until the I/O returns.

Use strace to see what it's doing — is it stuck or deliberately ignoring?

$ strace -p 8821 -T 2>&1 | head -20
read(5,    ← blocked waiting for data on fd 5 (a socket, the database connection)
# The process is waiting on a slow database query. It installed a SIGTERM handler
# that sets a flag, but the flag is only checked AFTER the current read() returns.
# Once the database responds (or times out), the process will honour the SIGTERM.
# Wait longer, or check if the database is hung (and fix that instead).

If it's genuinely ignoring SIGTERM and needs to be stopped now — use SIGKILL.

$ kill -9 8821
$ sleep 2
$ kill -0 8821
bash: kill: (8821) - No such process
# SIGKILL worked: the process is gone.

# Now check for leftover resources that it didn't clean up.
# PID files modified in the last hour? (/var/run is a symlink to /run on most systems, so search /run directly)
$ find /run -name "*.pid" -mmin -60 2>/dev/null
# Lock files?
$ find /run/lock -name "*.lock" 2>/dev/null
# Temp files?
$ ls -lt /tmp | head -20   # recently modified files in /tmp
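
A leftover PID file is only a problem if the PID inside it is no longer alive. A minimal sketch of that check, using a hypothetical pidfile_is_stale helper built on the kill -0 test shown earlier:

```shell
# pidfile_is_stale FILE (hypothetical helper)
# Succeeds when FILE names a PID that no longer exists, which is what a
# SIGKILLed daemon typically leaves behind.
pidfile_is_stale() {
    [ -r "$1" ] || return 1
    read -r pid < "$1" || [ -n "$pid" ] || return 1
    case $pid in
        ''|*[!0-9]*) return 1 ;;     # not a plain number: leave it alone
    esac
    # kill -0 sends nothing; it only checks whether the PID can be signalled.
    # Run as root: for other users, a PID owned by someone else also fails here.
    ! kill -0 "$pid" 2>/dev/null
}
```

A cautious cleanup would then be pidfile_is_stale /run/myapp.pid && rm /run/myapp.pid, where the path is only an example. PIDs are reused over time, so a "not stale" answer does not prove the original daemon is the one still running.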

For services managed by systemd — let systemctl handle the sequence for you.

# systemctl stop sends SIGTERM, waits TimeoutStopSec (default 90s), then sends SIGKILL
$ systemctl stop myapp
$ systemctl status myapp   # check result: should say "inactive (dead)"

# If 90 seconds is too long to wait, escalate by hand from a second terminal:
$ systemctl kill --signal=SIGKILL myapp
# To shorten the wait permanently, set TimeoutStopSec=10 in the unit's [Service] section (systemctl edit myapp).

# Check how long systemd waited (the logs will show):
$ journalctl -u myapp --since "5 minutes ago" | grep -iE "stop|kill|timeout"

Many well-written services intentionally delay SIGTERM — they finish an in-flight transaction or drain a queue before exiting. This is correct behaviour. The question to ask before reaching for SIGKILL is: "is this process making progress, or is it genuinely stuck?" strace answers that in seconds.
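
strace is the direct way to answer that question. As a complementary heuristic, accumulated CPU time can be sampled twice to see whether the process is doing any work at all. This is a sketch assuming Linux /proc, with hypothetical helper names cpu_ticks and is_using_cpu. A process that is legitimately waiting on I/O uses no CPU either, so a negative result is a hint, not proof of a hang.

```shell
# cpu_ticks PID: total user+system CPU ticks from /proc/PID/stat (Linux).
# The command name in field 2 may contain spaces, so strip through the
# last ") " before splitting the remaining fields.
cpu_ticks() {
    line=$(cat "/proc/$1/stat" 2>/dev/null) || return 1
    set -- ${line##*") "}
    # After stripping, $1 is the state; utime and stime are $12 and $13.
    echo $(( ${12} + ${13} ))
}

# is_using_cpu PID [SECONDS] (hypothetical helper)
# Succeeds if the process consumed CPU time during the sampling window.
# Returns 2 if the process cannot be read (for example, it has exited).
is_using_cpu() {
    before=$(cpu_ticks "$1") || return 2
    sleep "${2:-2}"
    after=$(cpu_ticks "$1") || return 2
    [ "$after" -gt "$before" ]
}
```

For example, is_using_cpu 8821 5 samples PID 8821 over five seconds.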

Scenario 3 — kill -9 Did Nothing

Confirm D state — SIGKILL is undeliverable to a process in uninterruptible sleep.

$ ps -o pid,stat,comm -p 8821
 PID STAT COMM
8821 D    python3
# STAT=D: uninterruptible sleep. The process is inside a kernel I/O operation
# and cannot be interrupted until the kernel gives control back.
# SIGKILL has been queued: it will be delivered the moment the process
# leaves D state. If it never leaves D state, the kill never lands.

# How long has the process existed?
$ ps -o pid,stat,etimes,comm -p 8821
 PID STAT ELAPSED COMM
8821 D       1842 python3
# ELAPSED is the time since the process started (1842 seconds, about 30 minutes), not the time spent in D state.
# If STAT still shows D on repeated checks over several minutes, the I/O is unlikely to complete on its own.

Identify what I/O the process is waiting on.

# Check where in the kernel the process is blocked
$ cat /proc/8821/wchan
nfs_wait_on_request
# wchan = "wait channel": the kernel function where the process is blocked.
# "nfs_wait_on_request" = waiting for an NFS server to respond. Classic D-state cause.
# Other common wchan values:
#   jbd2_log_wait_commit → waiting for ext4 journal commit (disk problem)
#   blk_wait_io          → waiting for block device I/O (disk problem)
#   __refrigerator       → process is being frozen (systemd suspend/cgroup freeze)
#   futex_wait           → waiting on a mutex (normal for threads)

# Also check dmesg for related messages
$ dmesg -T | grep -E "NFS|nfs|hung|timeout" | tail -20
[Jun14 14:01] nfs: server 192.168.50.100 not responding, timed out
[Jun14 14:05] nfs: server 192.168.50.100 not responding, timed out
# NFS server at 192.168.50.100 is unreachable. The process is stuck
# waiting for it, and will stay stuck until the NFS situation resolves.
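
When several processes hang on the same resource, it helps to list every D-state process with its wait channel at once. A minimal sketch, assuming a hypothetical dstate_report helper that reads "PID STAT WCHAN COMMAND" rows on stdin:

```shell
# dstate_report (hypothetical helper)
# Reads "PID STAT WCHAN COMMAND" rows on stdin and prints one line for
# each process whose state starts with D.
dstate_report() {
    awk '$2 ~ /^D/ { printf "%s blocked in %s (%s)\n", $1, $3, $4 }'
}
```

On a live system the input would come from ps -eo pid=,stat=,wchan=,comm= | dstate_report. Many rows sharing one wait channel usually point to a single hung resource.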

Try to resolve the underlying resource — the process will then exit on its own.

# Option 1: If the NFS server comes back, the process will resume and the
# queued SIGKILL will be delivered. Monitor with watch:
$ watch -n 2 'ps -o pid,stat,comm -p 8821'

# Option 2: Lazy unmount the NFS filesystem. This detaches the mount from the
# file hierarchy now and cleans up once it is no longer busy (safer than umount -f).
# It does not always release a process that is already blocked.
$ umount -l /mnt/nfs   # -l = lazy unmount
$ sleep 5
$ ps -o stat -p 8821
STAT
# Header only: the process left D state, the queued SIGKILL was delivered, and it exited.

# Option 3: Force unmount (can cause data loss on any open files on the mount)
$ umount -f -l /mnt/nfs   # -f force + -l lazy: use only when data loss is acceptable

If the resource cannot be recovered — a reboot is the only remaining option.

# Confirm nothing else depends on this stuck process before rebooting
$ systemctl list-units --state=failed           # anything already failed?
$ who                                           # anyone else logged in?
$ lsof /mnt/nfs 2>/dev/null | grep -v COMMAND   # other processes using the mount?

# Schedule a reboot in 5 minutes and broadcast a warning to logged-in users
$ shutdown -r +5 "NFS mount hung, rebooting in 5 minutes. Please save your work."

Prevention — mount NFS with a timeout so this can't happen again.

# In /etc/fstab, change a hard NFS mount to soft with a timeout:

# BAD (default: hard mount, hangs indefinitely if NFS dies):
192.168.50.100:/exports  /mnt/nfs  nfs  defaults  0 0

# BETTER: soft mount with timeout (returns an error to the process once the retries are exhausted):
192.168.50.100:/exports  /mnt/nfs  nfs  soft,timeo=30,retrans=3,_netdev  0 0

# soft      → returns an error to the process instead of hanging forever
# timeo=30  → timeout in deciseconds (30 = 3 seconds) before a request is retried
# retrans=3 → retry 3 times before giving up
# _netdev   → wait for network before mounting (important on boot)
# Trade-off: a soft mount can fail writes that a hard mount would eventually complete, so avoid it for data that must not be lost.
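
To audit which NFS mounts are still hard, the options column can be scanned mechanically. This is a sketch using a hypothetical nfs_mount_modes helper. It reads fstab-style lines (device, mount point, type, options) on stdin and treats a mount as hard unless the soft option appears.

```shell
# nfs_mount_modes (hypothetical helper)
# Reads fstab-style lines on stdin and prints "<mountpoint> hard|soft" for
# each NFS entry. NFS mounts are hard unless "soft" appears in the options.
nfs_mount_modes() {
    awk '
        /^[ \t]*#/ { next }                       # skip comment lines
        $3 ~ /^nfs4?$/ {
            mode = "hard"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "soft") mode = "soft"
            print $2, mode
        }
    '
}
```

Both nfs_mount_modes < /etc/fstab and nfs_mount_modes < /proc/mounts work, since the two files share the same first four columns. The second form shows what is actually mounted right now.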

D-state processes stuck on local disk I/O (not NFS) are rarer and usually indicate a failing drive or a kernel bug. Check dmesg for I/O errors, run smartctl -a /dev/sdX for SMART data, and check whether the filesystem has been remounted read-only due to errors (dmesg | grep -i "read-only").

Quick Reference — Chapter 6 Commands

Command	Purpose	Notes
ps faux	Process list with forest/tree view showing parent→child relationships	`ps -o pid,ppid,stat,comm -p PID` for a specific process with PPID
pgrep -c -f "pattern"	Count processes matching a full command line pattern	`-a` show full command line · `-l` PID + name · `-u user` filter by owner
pstree -p PID	Visual process tree rooted at a specific PID	Omit PID to see the full system tree
kill PID	SIGTERM (15) — polite termination request. Try first, always.	`kill -0 PID` checks existence without sending a signal
kill -9 PID	SIGKILL — unconditional force kill. Use only after SIGTERM fails.	Cannot kill D-state processes. Look for leftover PID/lock files after.
kill -1 PID	SIGHUP — reload configuration without restarting (for daemons)	Check application docs — not universal. nginx/sshd/rsyslog all support it.
pkill -f "pattern"	SIGTERM by full command line match — safer than name match alone	`pkill -9 -f` for SIGKILL · `pkill -P PPID` for direct children only
kill -- -PGID	SIGTERM to entire process group — kills parent + all descendants	`kill -9 -- -PGID` for SIGKILL · find PGID with `ps -o pgid= -p PID`
systemctl stop service	SIGTERM → wait TimeoutStopSec (90s) → SIGKILL. Covers the full cgroup.	Lower `TimeoutStopSec=` in the unit to shorten the wait · `systemctl kill --signal=SIGHUP` for custom signal
strace -p PID	Attach to running process and show every system call in real time	`-T` show time per call · `-e trace=network` filter to network calls only · adds CPU overhead
lsof -p PID	List all open files, sockets, and devices held by a process	`lsof -p PID \| wc -l` count FDs · `grep deleted` find open-but-deleted files
cat /proc/PID/wchan	Which kernel function a D-state process is waiting on	`nfs_wait_on_request` = NFS hang · `jbd2_log_wait_commit` = disk journal · `blk_wait_io` = block I/O
umount -l /mnt/path	Lazy unmount: detaches the mount from the file hierarchy now and cleans up once it is no longer busy	May free D-state processes stuck on NFS · `-f` adds force (risk of data loss)