Chapter 1 — Performance Overview & Essential Tools
Before you can fix a performance problem, you need to know what kind of problem you have. A server that "feels slow" could be running out of CPU, exhausting its RAM, waiting on a disk, throttled by the network, or it might just be a caching layer warming up. This chapter gives you a framework for thinking about performance and a toolkit for finding out which one it actually is.
Load Average — What It Actually Means
The first number to look at on a sluggish server is load average. Run uptime, or look at the top line of top or htop. Each reports three load average figures.
These three numbers are the average number of processes that were running, waiting to run (runnable), or waiting on I/O (uninterruptible sleep) over the last 1, 5, and 15 minutes. They are not percentages.
Interpreting load average against CPU count
A load average of 4.0 means very different things on a single-core machine versus a 16-core machine. The meaningful figure is load per core: divide load average by the number of logical CPUs.
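The division is simple enough to script. A minimal sketch, assuming a POSIX shell with awk; load_per_core is a hypothetical helper name, not a standard command:

```shell
# load_per_core LOAD NCPU: print LOAD divided by NCPU, to two decimal places.
# (Hypothetical helper for illustration.)
load_per_core() {
  awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.2f\n", load / ncpu }'
}

load_per_core 4.0 1     # single core: prints 4.00
load_per_core 4.0 16    # 16 cores:    prints 0.25
```

On a live Linux host you could feed it real values with load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)". As a rough guide, a result near or above 1.0 suggests the CPUs have more runnable work than they can service.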
Reading the trend — the numbers tell a story
| 1-min | 15-min | What it means | Action |
|---|---|---|---|
| 0.5 | 0.5 | Quiet and stable. System is idle. | No action needed. |
| 3.0 | 0.5 | Recent spike within the last minute or two. | Check what ran recently: `dmesg -T \| tail`, then check cron logs. |
| 0.5 | 3.0 | Something busy before, calmed now. | Check logs for earlier in the window. May be resolved. |
| 8.0 | 8.0 | Sustained high load (on an 8-core machine: 100%). | Investigate CPU and I/O immediately. |
| 20.0 | 2.0 | Sudden severe spike right now. | Open htop immediately — something just went wrong. |
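The comparison in the table can be sketched as a small classifier. This is only an illustration: load_trend is a hypothetical name, and the 2x ratio used as the cut-off is an assumption chosen to match the rows above, not an established threshold.

```shell
# load_trend ONE_MIN FIFTEEN_MIN: classify the direction of load.
# The 2x cut-off is an illustrative assumption, not a standard.
load_trend() {
  awk -v one="$1" -v fifteen="$2" 'BEGIN {
    if (one > 2 * fifteen)      print "rising"
    else if (fifteen > 2 * one) print "falling"
    else                        print "steady"
  }'
}

load_trend 3.0 0.5    # prints rising
load_trend 0.5 3.0    # prints falling
load_trend 8.0 8.0    # prints steady
```

"steady" says nothing about whether the level itself is acceptable; an 8.0/8.0 reading still needs comparing against the core count.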
The USE Method — A Diagnostic Framework
Created by Brendan Gregg (one of the foremost authorities on Linux performance), the USE method gives you a structured way to check every system resource rather than guessing. For each resource (CPU, memory, disk, network) ask three questions:
Utilization: how busy is the resource?
CPU: `top` → %CPU
Disk: `iostat -x` → %util
Network: `iftop` → bandwidth %
Saturation: is work queuing because the resource cannot keep up?
CPU: load average above core count
Disk: `iostat` avgqu-sz > 1 (the column is named aqu-sz in newer sysstat releases)
Memory: swap in use and growing
Errors: is the resource reporting error events?
`dmesg -T | tail -20`
`journalctl -p err -n 50`
NIC: `ip -s link` → errors/dropped
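"Swap in use and growing" needs two readings, not one. A minimal sketch, assuming the SwapTotal and SwapFree fields of /proc/meminfo on Linux; swap_used_kb is a hypothetical helper name:

```shell
# swap_used_kb: read /proc/meminfo-style text on stdin and print the
# amount of swap in use, in kB (SwapTotal minus SwapFree).
swap_used_kb() {
  awk '$1 == "SwapTotal:" { total = $2 }
       $1 == "SwapFree:"  { free = $2 }
       END { print total - free }'
}

# Take two samples a few seconds apart; a rising figure suggests memory saturation.
if [ -r /proc/meminfo ]; then
  before=$(swap_used_kb < /proc/meminfo)
  sleep 5
  after=$(swap_used_kb < /proc/meminfo)
  echo "swap used: ${before} kB -> ${after} kB"
fi
```

The 5-second gap is arbitrary. A longer interval gives a clearer trend on a system that swaps slowly.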
The Core Diagnostic Toolkit
These tools ship with or are available on almost every Linux distribution. Between them they cover the entire USE method across all major resources.
uptime
Load average and uptime at a glance. The same load figures appear in the top line of top and htop.
Alternative: `cat /proc/loadavg` for scripting.
htop
`htop`: open interactive view
`htop -u username`: filter by user
`htop -p PID`: watch a specific process
Keys: F6 sort · F4 filter · F5 tree · t tree toggle · k send signal
top
`top`: default interactive view
`top -b -n 1`: single snapshot (good for scripts)
Keys: P sort CPU · M sort memory · 1 show per-core · k kill · q quit
vmstat
`vmstat 1`: update every second
`vmstat 1 10`: 10 updates, then stop
Key columns: r run queue · b blocked · si/so swap in/out · wa I/O wait
free
The -h flag makes it human-readable. Simple but often misread: see the memory chapter for why "available" is the number to watch, not "free".
`free -h`: human-readable (MB/GB)
`free -h -s 2`: refresh every 2 seconds
Watch: the available column, not free. Watch for swap used growing over time.
df
`df -h`: human-readable sizes
`df -i`: inode usage (not space)
`df -hT`: include filesystem type
Red flag: any filesystem at 100% Use%.
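Scanning df output by eye gets tedious on hosts with many mounts. A minimal sketch that filters it, assuming the six-column POSIX layout produced by df -P; full_filesystems is a hypothetical helper name and the 90% threshold in the usage line is an arbitrary example:

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print the
# mount point and Use% of every filesystem at or above THRESHOLD percent.
# Assumes mount points contain no spaces.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    pct = $5
    sub(/%$/, "", pct)                      # strip the trailing percent sign
    if (pct + 0 >= limit + 0) print $6, $5
  }'
}

# Usage: list filesystems that are at least 90% full.
df -P | full_filesystems 90
```

The same filter should work on inode output, since df -i prints its IUse% figure in the same column position.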
dmesg
`dmesg -T`: include human-readable timestamps
`dmesg -T | tail -30`: recent kernel messages
`dmesg -T | grep -i error`: filter for errors
`dmesg -T | grep -i oom`: OOM killer events
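When grep -i oom returns a wall of text, it helps to pull out just the victims. A rough sketch, assuming the kernel log wording "Out of memory: Killed process PID (name)" that recent kernels use (older kernels and other log sources may word it differently); oom_victims is a hypothetical helper name:

```shell
# oom_victims: read kernel log text on stdin and print "PID NAME" for each
# OOM kill. The message wording matched here is an assumption and may vary
# between kernel versions.
oom_victims() {
  sed -n 's/.*[Oo]ut of memory: Kill[a-z]* process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage (dmesg may require root on some systems):
# dmesg -T | oom_victims
```

If the same process name appears repeatedly, that process or its memory limit is the first place to look.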
iostat
Part of the sysstat package.
`iostat -x 1`: extended stats, every second
`iostat -xh 1`: human-readable sizes
Key columns: %util (disk busy %) · await (avg wait ms) · r/s and w/s (ops per sec)
sar
Historical readings need the sysstat service running.
`sar -u 1 5`: CPU, 5 readings, 1 second apart
`sar -r 1 5`: memory stats
`sar -b 1 5`: I/O stats
`sar -u -f /var/log/sysstat/sa14`: historical (day 14)
w, who, last
`w`: logged-in users, their processes, load average
`who`: simpler, just login info
`last`: login history (who logged in when)
Useful when you see unexpectedly high CPU and want to know if a developer is running a build job.
Reading htop — What Every Column Means
htop is the most information-dense single screen you'll look at when diagnosing a performance issue. Here's what you're seeing:
| Column | Meaning | What to watch for |
|---|---|---|
| PID | Process ID — unique number assigned by the kernel | Use this to target commands: kill PID, strace -p PID |
| USER | The user the process is running as | Unexpected users (root running web processes, or www-data running shells) are a red flag |
| PRI / NI | Priority and Nice value. NI ranges −20 (highest priority) to +19 (lowest) | NI = 0 is normal. Negative = high priority. Use renice to adjust |
| VIRT | Total virtual memory the process has mapped (includes shared libs, mmap'd files). Usually large, usually misleading. | Rarely the number to worry about |
| RES | Resident Set Size — RAM actually in physical memory right now. The real memory figure. | This is the one to watch. A process with 8GB RES is using 8GB of RAM |
| SHR | Shared memory — portion of RES that's shared with other processes (shared libs) | Real exclusive memory ≈ RES − SHR |
| S | Process state: R running, S sleeping (interruptible), D sleeping (uninterruptible/disk wait), Z zombie, T stopped | Many D state processes = disk or NFS problem. Z zombies = a parent isn't reaping children |
| CPU% | CPU usage over the last sampling interval, across all cores. Can exceed 100% on multi-core systems (e.g. 800% = 8 full cores) | A single process at 100% is using one full core. Look for processes unexpectedly high |
| MEM% | Percentage of total physical RAM used by this process (based on RES) | Quick relative view. Absolute RES figure is more useful |
| TIME+ | Total CPU time consumed since the process started (not wall clock time) | A process that's been running for 3 hours and shows 2h45m of CPU time is CPU-intensive |
| Command | The command that launched the process | Press F5 in htop to switch to tree view — shows parent/child relationships |
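The RES − SHR estimate from the table can be reproduced outside htop. A minimal sketch, assuming the /proc/PID/statm layout on Linux, where the second and third fields are resident and shared pages; private_kb is a hypothetical helper name:

```shell
# private_kb STATM_LINE PAGE_BYTES: approximate exclusive memory in kB,
# computed as (resident pages - shared pages) * page size / 1024.
private_kb() {
  printf '%s\n' "$1" | awk -v page="$2" '{ print ($2 - $3) * page / 1024 }'
}

# Example with made-up figures: 500 resident pages, 200 shared, 4 KiB pages.
private_kb "1000 500 200 0 0 0 0" 4096    # prints 1200
```

For a real process, something like private_kb "$(cat /proc/PID/statm)" "$(getconf PAGESIZE)" should give a comparable figure, with PID replaced by the process ID. Treat it as an approximation: shared pages are counted in full for every process that maps them.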
vmstat — Seeing Everything at Once
vmstat 1 gives a continuous stream of system-wide stats that lets you see how resources are moving together. The first line after the header is an average since boot — ignore it. Watch the subsequent lines.
| Column | Meaning | Alert when… |
|---|---|---|
| r | Processes runnable (waiting for CPU time) | Consistently above number of CPUs = CPU saturation |
| b | Processes in uninterruptible sleep (blocked on I/O) | Greater than 2–3 sustained = disk or NFS problem |
| swpd | Swap space in use (KB) | Any value above 0 when memory should be adequate |
| si / so | Swap in / Swap out (KB/sec) | Any non-zero sustained value = active swap thrashing |
| bi / bo | Block in / Block out (blocks/sec) — disk reads/writes | Very high sustained bi or bo = disk-heavy workload |
| wa | % of time CPU spent waiting for I/O | Above 10–15% sustained = I/O is slowing things down |
| us / sy | User CPU% / System (kernel) CPU% | High sy (kernel%) = driver issue, syscall-heavy app, NFS |
| id | CPU idle % | If this is high but the server feels slow, the problem is elsewhere (disk, network, lock contention) |
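The alert conditions in the table can be checked mechanically. A rough sketch, assuming the usual vmstat header row (r, b, swpd, ... wa); vmstat_flags is a hypothetical helper name, and the 10% wa threshold is taken from the low end of the range in the table:

```shell
# vmstat_flags NCPU: read `vmstat 1` output on stdin and print a warning for
# each sample that crosses a threshold. Column positions are looked up from
# the header row, so extra columns in newer vmstat versions do not matter.
vmstat_flags() {
  awk -v ncpu="$1" '
    $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header row
    !("r" in col) || $1 !~ /^[0-9]+$/ { next }                  # banner or junk
    { n++ }
    n == 1 { next }                     # first sample is the since-boot average
    $col["r"] > ncpu            { print "sample " n ": run queue " $col["r"] " exceeds " ncpu " CPUs" }
    $col["si"] + $col["so"] > 0 { print "sample " n ": swapping (si=" $col["si"] " so=" $col["so"] ")" }
    $col["wa"] > 10             { print "sample " n ": I/O wait " $col["wa"] "%" }
  '
}

# Usage: vmstat 1 10 | vmstat_flags "$(nproc)"
```

A single flagged sample is not a diagnosis. The table's conditions are about sustained values, so look for the same warning repeating across most samples.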
Scenario — "The Server Feels Slow"
A user contacts you: the web application is responding slowly. Nobody has deployed anything recently. You have SSH access. Here's the systematic first-response approach.
1. Run uptime the moment you log in. Note the load average and how it compares to the CPU count (nproc).
2. Run dmesg -T | tail -20. Look for OOM kills, disk errors (I/O errors, sector failures), or driver warnings. If the kernel is reporting errors, that is often the root cause, and you may not need to look any further.
3. Open htop and sort by CPU (F6 → CPU%). Is one process consuming an unusual amount? Look at the state column: lots of D (uninterruptible) processes means disk or NFS. Lots of R means CPU contention. Check whether the top processes make sense (your app, database) or are unexpected (a cron job, a compiler, a rogue script).
4. Re-sort htop by MEM%, then check the Swp bar at the top, which shows swap usage. Any swap in use on a server that should have enough RAM is a signal.
5. Run vmstat 1 for 10 seconds. Watch the wa (I/O wait) column. If it's consistently above 10%, the bottleneck is disk, so move to the disk chapter's tools (iostat -x 1, iotop). If si/so (swap in/out) are non-zero, the bottleneck is memory.
6. Run df -h. Any filesystem at 100% will cause write failures and very strange application behaviour that looks like a performance problem but is actually a hard error. Also run df -i to check inodes separately.
7. Run w. It shows logged-in users and what they're running. A developer running a find / -name "*.log" or a database dump at the wrong time can saturate I/O for everyone else. A friendly conversation often resolves this faster than tuning.
High CPU% → Chapter 2 (CPU) · High MEM / swap activity → Chapter 3 (Memory) · High wa / disk errors → Chapter 4 (Disk) · All normal here but app slow → Chapter 5 (Network)
uptime → dmesg -T | tail -20 → htop (CPU sort, then MEM sort) → vmstat 1 (10 seconds) → df -h && df -i → w. Run these in order and you'll have a confident hypothesis within 3 minutes of logging in.
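The sequence above can be wrapped in a script so the output is captured in one pass. A minimal sketch under stated assumptions: run_step and first_response are hypothetical names, and top -b -n 1 stands in for htop because htop needs an interactive terminal.

```shell
# run_step CMD [ARGS...]: print a header, then run CMD if it is installed.
run_step() {
  printf '== %s\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || printf '(exit status %s)\n' "$?"
  else
    printf '(%s not found, skipping)\n' "$1"
  fi
}

# first_response: the non-interactive part of the sequence, in order.
first_response() {
  run_step uptime
  run_step nproc
  run_step sh -c 'dmesg -T | tail -n 20'      # may need root on some systems
  run_step sh -c 'top -b -n 1 | head -n 20'   # batch-mode stand-in for htop
  run_step vmstat 1 10
  run_step df -h
  run_step df -i
  run_step w
}

# Usage: first_response | tee "first-response-$(date +%Y%m%d-%H%M%S).log"
```

Saving the output to a timestamped file gives you a baseline to compare against if the problem comes back.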
Quick Reference — Chapter 1 Commands
| Command | Purpose | Key flags |
|---|---|---|
| uptime | Load average + uptime at a glance | None needed |
| nproc | Number of logical CPU cores (context for load average) | None needed |
| htop | Interactive process viewer — best all-round first look | -u user · -p PID · F5 tree · F6 sort · F4 filter |
| top | Classic process viewer — always available | -b -n 1 (snapshot) · P sort CPU · M sort mem · 1 per-core |
| vmstat 1 | System-wide stats: CPU, memory, swap, I/O, all in one | vmstat 1 10 — 10 readings then stop |
| free -h | Memory and swap usage in human-readable form | -s 2 refresh every 2s · watch the "available" column |
| df -h | Disk space per mounted filesystem | -i inode usage · -T include FS type |
| dmesg -T | Kernel messages: OOM kills, hardware errors, driver issues | `\| tail -30` · `\| grep -i oom` · `\| grep -i error` |
| iostat -x 1 | Disk I/O stats per device (needs sysstat package) | -h human-readable · watch %util and await |
| sar -u 1 5 | CPU history — 5 readings, 1 second apart (needs sysstat) | -r memory · -b I/O · -f /var/log/.../saNN historical |
| w | Logged-in users + what they're running + load average | No flags needed. Use last for login history |