Chapter 1 — Performance Overview & Essential Tools

Before you can fix a performance problem, you need to know what kind of problem you have. A server that "feels slow" could be running out of CPU, exhausting its RAM, waiting on a disk, throttled by the network, or it might just be a caching layer warming up. This chapter gives you a framework for thinking about performance and a toolkit for finding out which one it actually is.

What this chapter covers:
Load average: what it really means
The USE method: a structured diagnostic framework
The core toolkit (uptime, top, htop, vmstat, free, df, dmesg)
How to read htop's columns
A practical scenario: a user reports the server is slow. Where do you start?

Load Average — What It Actually Means

The first number you look at on a sluggish server is load average. Run uptime, or look at the header of top or htop:

$ uptime
 14:32:08 up 42 days,  3:21,  2 users,  load average: 0.85, 1.42, 2.10

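The three numbers are the average number of processes that were running, waiting to run (runnable), or blocked in uninterruptible sleep (usually waiting on I/O) over the last 1, 5, and 15 minutes. Strictly, they are exponentially damped moving averages, so recent activity counts for more than older activity. They are not percentages.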

Window	Example	How to read it
1 minute	0.85	The most recent and most volatile figure. A spike here that does not show in the 15-minute figure suggests a short burst, not a sustained problem.
5 minutes	1.42	The most useful number for spotting a developing trend. A 1-minute figure above the 5-minute figure means load is increasing.
15 minutes	2.10	The long-term baseline. If this is high, you have a sustained problem, not a blip. Compare it against the CPU count.

Interpreting load average against CPU count

A load average of 4.0 means very different things on a single-core machine versus a 16-core machine. The meaningful figure is load per core: divide load average by the number of logical CPUs.

$ nproc
8
# On this 8-core machine, a load of 8.0 is roughly 100%: on average, every core has one task to run
# A load of 4.0 is roughly 50%: half the cores are occupied on average
# A load of 12.0 is roughly 150%: on average, about 4 tasks are waiting for a core (or for I/O)
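The division above is easy to script. The following is a minimal sketch, assuming a Linux system with /proc/loadavg and nproc available; the helper name load_per_core is illustrative and not part of any standard tool.

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES as a percentage.
# (Hypothetical helper name; awk is used because POSIX sh has no float maths.)
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.0f%%\n", load / cores * 100 }'
}

# Live reading: the first field of /proc/loadavg is the 1-minute load average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one five fifteen rest < /proc/loadavg
    echo "1-min load per core: $(load_per_core "$one" "$(nproc)")"
fi
```

For example, load_per_core 12.0 8 prints 150%. Because load also counts tasks blocked on I/O, treat the result as a pressure indicator and not as a CPU utilisation figure.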

On Linux, load average includes tasks blocked on I/O, not just CPU demand. A process blocked waiting for a slow disk is still counted in load average even though it's not using any CPU. This is why a machine with a dying hard drive can show a load of 20 with CPU usage at just 5%. High load + low CPU = look at disk or network storage (such as NFS) first.

Reading the trend — the numbers tell a story

1-min	15-min	What it means	Action
0.5	0.5	Quiet and stable. System is idle.	No action needed.
3.0	0.5	Recent spike, settling back down.	Check what ran recently: `dmesg -T | tail`, check cron logs.
0.5	3.0	Something busy before, calmed now.	Check logs for earlier in the window. May be resolved.
8.0	8.0	Sustained high load (on an 8-core machine: 100%).	Investigate CPU and I/O immediately.
20.0	2.0	Sudden severe spike right now.	Open `htop` immediately — something just went wrong.
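The comparison in the table can be sketched as a small function. This is an illustration only: the name load_trend and the 1.5x ratio used to separate "rising" from "steady" are assumptions chosen for the example, not established thresholds.

```shell
# load_trend ONE_MIN FIFTEEN_MIN: classify the direction of load.
# The 1.5x ratio is an arbitrary illustrative threshold.
load_trend() {
    awk -v one="$1" -v fifteen="$2" 'BEGIN {
        if (one > fifteen * 1.5)      print "rising"
        else if (one * 1.5 < fifteen) print "falling"
        else                          print "steady"
    }'
}

# Live reading on Linux: fields 1 and 3 of /proc/loadavg.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    echo "load is $(load_trend "$one" "$fifteen") (1-min $one, 15-min $fifteen)"
fi
```

"steady" describes only the direction. A steady 8.0 on an 8-core machine still needs attention, so the result should be read alongside the load-per-core figure.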

The USE Method — A Diagnostic Framework

Created by performance engineer Brendan Gregg, the USE method (Utilisation, Saturation, Errors) gives you a structured way to check every system resource rather than guessing. For each resource (CPU, memory, disk, network) ask three questions:

Utilisation: what percentage of time is this resource busy? High utilisation isn't always a problem, but at 100% there's no headroom left.

CPU: `top` → %CPU
Disk: `iostat -x` → %util
Network: `iftop` → throughput compared with link speed

Saturation: is the resource overloaded, with work queuing up for it? Saturation means the resource can't keep up with demand, even if utilisation looks manageable.

CPU: load average above core count
Disk: `iostat -x` → aqu-sz (avgqu-sz in older sysstat releases) > 1
Memory: swap in use and growing

Errors: are any error conditions being reported? Errors often explain performance problems that utilisation numbers miss entirely.

Kernel: `dmesg -T | tail -20`
System log: `journalctl -p err -n 50`
NIC: `ip -s link` → errors/dropped

Use the USE method as your opening move. When a server is slow, check each major resource (CPU, memory, disk, network) against all three questions before diving into specifics. It prevents the trap of spending an hour optimising the wrong thing.
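As one example of the "errors" question, the NIC counters that `ip -s link` reports can also be read directly from sysfs on Linux. The sketch below assumes the standard /sys/class/net/IFACE/statistics layout; the function name nic_errors is hypothetical.

```shell
# nic_errors [SYSFS_DIR]: print one line per interface in the form
#   iface rx_errors tx_errors rx_dropped tx_dropped
# SYSFS_DIR defaults to /sys/class/net and is a parameter only to ease testing.
nic_errors() {
    base="${1:-/sys/class/net}"
    for dir in "$base"/*/statistics; do
        [ -d "$dir" ] || continue
        iface=$(basename "$(dirname "$dir")")
        printf '%s' "$iface"
        for counter in rx_errors tx_errors rx_dropped tx_dropped; do
            printf ' %s' "$(cat "$dir/$counter" 2>/dev/null || echo '?')"
        done
        printf '\n'
    done
}

nic_errors
```

The counters are cumulative since the interface came up, so a non-zero value matters less than a value that keeps growing between two runs.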

The Core Diagnostic Toolkit

These tools ship with or are available on almost every Linux distribution. Between them they cover the entire USE method across all major resources.

Essential

uptime

Prints system uptime and the three load average figures. The fastest possible first check — single line, immediate output.

No flags needed. The same figures appear in the header of top and htop.
Alternative: cat /proc/loadavg for scripting.

Essential

htop

Interactive process viewer with colour-coded bars, tree view, mouse support, and easy sorting/filtering. The best all-round first look at a system under load.

htop — open interactive view
htop -u username — filter by user
htop -p PID — watch a specific process
Keys: F6 sort · F4 filter · F5 tree · t tree toggle · k send signal

Essential

top

The classic real-time process viewer. Always available, no install needed. Less friendly than htop but works everywhere including minimal servers.

top — default interactive view
top -b -n 1 — single snapshot (good for scripts)
Keys: P sort CPU · M sort memory · 1 show per-core · k kill · q quit

Essential

vmstat

Reports on processes, memory, swap, I/O, interrupts, and CPU — all in one compact line updated at an interval. Excellent for spotting the relationship between resources.

vmstat 1 — update every second
vmstat 1 10 — 10 updates then stop
Key columns: r run queue · b blocked · si/so swap in/out · wa I/O wait

Essential

free

Shows memory and swap usage. The -h flag makes it human-readable. Simple but often misread — see the memory chapter for why "available" is the number to watch, not "free".

free -h — human-readable (MB/GB)
free -h -s 2 — refresh every 2 seconds
Watch: available column, not free. Watch swap used growing over time.
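For scripting, the "available" figure can be read from /proc/meminfo, where it appears as MemAvailable. A minimal sketch, assuming a Linux kernel recent enough to expose that field; the function name mem_available_pct is illustrative.

```shell
# mem_available_pct [FILE]: print MemAvailable as a whole-number percentage
# of MemTotal. FILE defaults to /proc/meminfo and is a parameter only to
# make the function easy to test.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", avail / total * 100 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "memory available: $(mem_available_pct)%"
fi
```

A low percentage here is a more meaningful warning than a low "free" figure, because the page cache counts as used in "free" but can be reclaimed when applications need memory.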

Essential

df

Shows disk space usage per mounted filesystem. The first check for "disk full" problems. Don't forget to check inodes too.

df -h — human-readable sizes
df -i — inode usage (not space)
df -hT — include filesystem type
Red flag: any filesystem at 100% Use%.
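The red-flag check can be automated by filtering df output. The sketch below relies on the POSIX output format of `df -P`, where the fifth column is the usage percentage; the function name full_filesystems and the 90% threshold are assumptions made for the example.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print every
# mount point whose Capacity column is at or above THRESHOLD percent.
# (Hypothetical helper; mount points containing spaces are not handled.)
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        used = $5
        sub(/%/, "", used)
        if (used + 0 >= limit + 0) print $6, $5
    }'
}

if command -v df >/dev/null 2>&1; then
    df -P | full_filesystems 90
fi
```

With GNU df, the same filter should work for inode usage by piping in `df -Pi`, since the percentage stays in the fifth column.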

Useful

dmesg

Prints the kernel ring buffer — messages from the kernel itself about hardware events, errors, and OOM kills. Invaluable for finding hardware problems and driver errors.

dmesg -T — include human-readable timestamps
dmesg -T | tail -30 — recent kernel messages
dmesg -T | grep -i error — filter for errors
dmesg -T | grep -i oom — OOM killer events

Useful

iostat

Reports CPU statistics and I/O statistics for block devices. The go-to tool for identifying disk bottlenecks. Usually needs the sysstat package.

iostat -x 1 — extended stats, every second
iostat -xh 1 — human-readable sizes
Key columns: %util (disk busy %) · await (avg wait ms) · r/s and w/s (ops per sec)

Useful

sar

The historical performance recorder. Logs CPU, memory, disk and network metrics over time. Essential for answering "was it slow yesterday too?" Needs sysstat service running.

sar -u 1 5 — CPU: 5 readings, 1 sec apart
sar -r 1 5 — memory stats
sar -b 1 5 — I/O stats
sar -u -f /var/log/sysstat/sa14 — historical (day 14)
Note: /var/log/sysstat/ is the Debian/Ubuntu location. RHEL-family systems keep the daily files under /var/log/sa/.

Useful

w / who

Shows who is logged in and what they're doing, plus the load average. Quick check for whether a human user is running something unexpected on the server.

w — logged-in users, their processes, load average
who — simpler: just login info
last — login history (who logged in when)
Useful when you see unexpectedly high CPU and want to know if a developer is running a build job.

Reading htop — What Every Column Means

htop is the most information-dense single screen you'll look at when diagnosing a performance issue. Here's what you're seeing:

htop: annotated layout

┌────────────────────────────────────────────────────────────────────────────────────┐
│ CPU[|||||||||||||               45.2%]  Tasks: 142, 312 thr; 2 running             │
│ CPU[||||||||||||||||||||||||||| 98.0%]  Load average: 3.82 2.10 1.45               │
│ CPU[|||                         12.1%]  Uptime: 42 days, 03:21:07                  │
│ CPU[|||                          9.4%]                                             │
│ Mem[|||||||||||||||       11.2G/16.0G]                                             │
│ Swp[                         0K/2.00G]                                             │
├────────────────────────────────────────────────────────────────────────────────────┤
│   PID USER     PRI  NI  VIRT   RES   SHR S  CPU% MEM%     TIME+ Command            │
│  4821 apache    20   0  520M   85M   18M R  98.0  0.5   0:03.12 python3 worker.py  │
│  1234 mysql     20   0  2.5G  800M  200M S   5.0  5.0  45:23.12 mysqld             │
│   891 root      20   0  185M   12M    9M S   0.5  0.1   0:12.05 sshd               │
└────────────────────────────────────────────────────────────────────────────────────┘
▲ CPU bars (default colours): green=normal · blue=low priority · red=kernel · yellow/orange=IRQ (with detailed CPU time enabled)

Column	Meaning	What to watch for
PID	Process ID — unique number assigned by the kernel	Use this to target commands: `kill PID`, `strace -p PID`
USER	The user the process is running as	Unexpected users (root running web processes, or www-data running shells) are a red flag
PRI / NI	Priority and Nice value. NI ranges −20 (highest priority) to +19 (lowest)	NI = 0 is normal. Negative = high priority. Use `renice` to adjust
VIRT	Total virtual memory the process has mapped (includes shared libs, mmap'd files). Usually large, usually misleading.	Rarely the number to worry about
RES	Resident Set Size — RAM actually in physical memory right now. The real memory figure.	This is the one to watch. A process with 8GB RES is using 8GB of RAM
SHR	Shared memory — portion of RES that's shared with other processes (shared libs)	Real exclusive memory ≈ RES − SHR
S	Process state: R running, S sleeping (interruptible), D sleeping (uninterruptible/disk wait), Z zombie, T stopped	Many D state processes = disk or NFS problem. Z zombies = a parent isn't reaping children
CPU%	CPU usage over the last sampling interval, across all cores. Can exceed 100% on multi-core systems (e.g. 800% = 8 full cores)	A single process at 100% is using one full core. Look for processes unexpectedly high
MEM%	Percentage of total physical RAM used by this process (based on RES)	Quick relative view. Absolute RES figure is more useful
TIME+	Total CPU time consumed since the process started (not wall clock time)	A process that's been running 3 hours and shows 2h45m of CPU time is CPU-intensive
Command	The command that launched the process	Press `F5` in htop to switch to tree view — shows parent/child relationships
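When htop isn't installed, the state column can be summarised with ps. This is a rough sketch, assuming a ps that accepts `-eo state=` (procps on Linux does); the function name count_states is illustrative.

```shell
# count_states: read one process state per line on stdin (as printed by
# `ps -eo state=`) and print a sorted count for each state letter.
count_states() {
    awk '{ count[substr($1, 1, 1)]++ }
         END { for (s in count) print s, count[s] }' | sort
}

if command -v ps >/dev/null 2>&1; then
    ps -eo state= | count_states
fi
```

More than a handful of D entries that persist across several runs points towards the disk or NFS problems described in the table, and a growing Z count suggests a parent that isn't reaping its children.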

htop colour legend (CPU bars, default colour scheme): Green = normal user processes. Blue = low-priority (niced) processes. Red = kernel time. Yellow/orange = hardware interrupts (IRQ), shown when detailed CPU time is enabled in htop's setup. A mostly-red CPU bar means the kernel itself is very busy, which is often a driver issue, not an application problem.

vmstat — Seeing Everything at Once

vmstat 1 gives a continuous stream of system-wide stats that lets you see how resources are moving together. The first line after the header is an average since boot — ignore it. Watch the subsequent lines.

$ vmstat 1
#procs --------memory(KB)-------- --swap-- ---io--- -system-- ------cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 4823400 201480 8432100    0    0     0    12  312 1821 12  3 84  1  0
 6  2      0 4012300 201480 8432100    0    0     0  8420 2100 4200 45 12 40  3  0
 2  8   2048  312000 201480 8432100 1200  800  9800  8420 3100 5200 20  8 12 60  0

Column	Meaning	Alert when…
r	Processes runnable (running or waiting for CPU time)	Consistently above number of CPUs = CPU saturation
b	Processes in uninterruptible sleep (blocked on I/O)	Greater than 2–3 sustained = disk or NFS problem
swpd	Swap space in use (KB)	Any value above 0 when memory should be adequate
si / so	Swap in / Swap out (KB/sec)	Any non-zero sustained value = active swap thrashing
bi / bo	Block in / Block out (blocks/sec) — disk reads/writes	Very high sustained bi or bo = disk-heavy workload
wa	% of time CPU spent waiting for I/O	Above 10–15% sustained = I/O is slowing things down
us / sy	User CPU% / System (kernel) CPU%	High sy (kernel%) = driver issue, syscall-heavy app, NFS
id	CPU idle %	If this is high but the server feels slow, the problem is elsewhere (disk, network, lock contention)
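The alert conditions in the table can be applied automatically to vmstat output. The sketch below assumes the default column order shown above (r, b, si, so and wa in fields 1, 2, 7, 8 and 16); the function name vmstat_flags and the flag labels are made up for this example, and the thresholds are taken from the table.

```shell
# vmstat_flags NCPU: read `vmstat` output on stdin and print one line per
# sample, listing which of the alert conditions it trips.
vmstat_flags() {
    awk -v ncpu="$1" '
        $1 !~ /^[0-9]+$/ { next }   # skip the two header lines
        ++n == 1         { next }   # skip the since-boot average
        {
            flags = ""
            if ($1 > ncpu)        flags = flags " cpu-saturated"
            if ($2 > 2)           flags = flags " io-blocked"
            if ($7 > 0 || $8 > 0) flags = flags " swapping"
            if ($16 > 10)         flags = flags " io-wait"
            print "sample " n ":" (flags == "" ? " ok" : flags)
        }'
}

# Typical use (takes about 10 seconds, so it is left commented out here):
#   vmstat 1 10 | vmstat_flags "$(nproc)"
```

A single flagged sample means little. The table's emphasis on "sustained" and "consistently" applies here too, so look for the same flag repeating across most of the samples.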

Scenario — "The Server Feels Slow"

A user contacts you: the web application is responding slowly. Nobody has deployed anything recently. You have SSH access. Here's the systematic first-response approach.

Get the overview in 5 seconds. Run uptime the moment you log in. Note the load average and how it compares to the CPU count (nproc).

Load 0.9 on 8 cores → CPU is idle, look elsewhere.
Load 6.2 on 8 cores → moderate, dig into CPU.
Load 24 on 8 cores → severe saturation, act quickly.

Check for kernel errors immediately. dmesg -T | tail -20. Look for OOM kills, disk errors (I/O errors, sector failures), or driver warnings. If the kernel is reporting errors, that's often the root cause before you look anywhere else.

Open htop and sort by CPU%. Press F6 → CPU%. Is one process consuming an unusual amount? Look at the state column — lots of D (uninterruptible) processes means disk or NFS. Lots of R means CPU contention. Check if the top processes make sense (your app, database) or are unexpected (a cron job, a compiler, a rogue script).

Sort htop by MEM% (press F6 → MEM%). Is any process consuming a disproportionate amount of RAM? Check whether the system is using swap: the Swp bar at the top of htop shows swap usage. Any swap in use on a server that should have enough RAM is a sign of memory pressure.

Run vmstat 1 for 10 seconds. Watch the wa (I/O wait) column. If it's consistently above 10%, the bottleneck is disk — move to the disk chapter's tools (iostat -x 1, iotop). If si/so (swap in/out) are non-zero, the bottleneck is memory.

Check disk space. df -h — any filesystem at 100% will cause write failures and very strange application behaviour that looks like a performance problem but is actually a hard error. Also run df -i to check inodes separately.

Check who else is on the server. w shows logged-in users and what they're running. A developer running a find / -name "*.log" or a database dump at the wrong time can saturate I/O for everyone else. A friendly conversation often resolves this faster than tuning.

Identify the bottleneck type and move to the right chapter. By this point you should know roughly what you're dealing with:

High CPU% → Chapter 2 (CPU)
High MEM / swap activity → Chapter 3 (Memory)
High wa / disk errors → Chapter 4 (Disk)
All normal here but app slow → Chapter 5 (Network)

The diagnostic flow as a one-liner sequence: uptime → dmesg -T | tail -20 → htop (CPU sort, then MEM sort) → vmstat 1 (10 seconds) → df -h && df -i → w. Run these in order and you'll have a confident hypothesis within 3 minutes of logging in.
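The sequence can be captured in a small script for situations where you want the output saved rather than viewed interactively. This is a sketch under stated assumptions: the function names step and first_response are invented here, and `top -b -n 1` and `ps --sort=-rss` (procps) stand in for htop's CPU and memory sorts because htop is interactive.

```shell
# step LABEL CMD [ARGS...]: print a header, then run CMD if it is installed.
step() {
    label=$1; shift
    printf '== %s ==\n' "$label"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "(skipped: $1 not installed)"
    fi
}

# first_response: run the chapter's checks in order. Takes about 10 seconds
# because of the vmstat sampling, so it is defined here but not run.
first_response() {
    step "load average"    uptime
    step "CPU count"       nproc
    step "kernel messages" sh -c 'dmesg -T | tail -20'
    step "top by CPU"      sh -c 'top -b -n 1 | head -20'
    step "top by memory"   sh -c 'ps -eo pid,user,rss,comm --sort=-rss | head -10'
    step "vmstat"          vmstat 1 10
    step "disk space"      df -h
    step "inodes"          df -i
    step "logged-in users" w
}
```

Running first_response > /tmp/first-look.txt keeps a record of the system's state at the time of the complaint, which is useful if the problem has cleared by the time someone else looks. On some systems dmesg needs root, in which case that step prints a permission error and the rest still run.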

Quick Reference — Chapter 1 Commands

Command	Purpose	Key flags
uptime	Load average + uptime at a glance	None needed
nproc	Number of logical CPU cores (context for load average)	None needed
htop	Interactive process viewer — best all-round first look	`-u user` · `-p PID` · F5 tree · F6 sort · F4 filter
top	Classic process viewer — always available	`-b -n 1` (snapshot) · `P` sort CPU · `M` sort mem · `1` per-core
vmstat 1	System-wide stats: CPU, memory, swap, I/O, all in one	`vmstat 1 10` — 10 readings then stop
free -h	Memory and swap usage in human-readable form	`-s 2` refresh every 2s · watch the "available" column
df -h	Disk space per mounted filesystem	`-i` inode usage · `-T` include FS type
dmesg -T	Kernel messages — OOM kills, hardware errors, driver issues	`| tail -30` · `| grep -i oom` · `| grep -i error`
iostat -x 1	Disk I/O stats per device (needs sysstat package)	`-h` human-readable · watch `%util` and `await`
sar -u 1 5	CPU history — 5 readings, 1 second apart (needs sysstat)	`-r` memory · `-b` I/O · `-f /var/log/.../saNN` historical
w	Logged-in users + what they're running + load average	No flags. `last` for login history