Chapter 3 — Memory Bottlenecks
Memory is the most misread resource on Linux. The default output of free routinely alarms people who see only 200 MB "free" on a 16 GB server — when in reality that server has 14 GB readily available and is performing perfectly. This chapter starts by correcting that misreading, then covers the cases where memory really is a problem: swap thrashing, OOM kills, and memory leaks.
What this chapter covers:
- Why "free" memory isn't what it looks like
- The available vs free distinction
- Buffers and page cache explained
- Reading /proc/meminfo
- Swap and vm.swappiness
- Scenario 1: swap climbing overnight
- Scenario 2: the OOM killer fired (reading dmesg)
- Scenario 3: a service's memory keeps growing (identifying a leak)
- The OOM score system
- When (and when not) to clear the page cache
The Fundamental Misreading — "Free" Is Not "Available"
Linux uses all available RAM productively. Memory that isn't being used by processes is used as a disk cache — so pages read from disk recently are kept in RAM and served from there on subsequent reads, making the system faster. This cached memory is immediately reclaimed when a process needs it. It shows up as "used" in many tools, but it's not really in use in a way that matters.
$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       5.2Gi       1.1Gi       342Mi       9.2Gi       9.8Gi
Swap:          2.0Gi          0B       2.0Gi
How 16 GB is actually allocated (example above)
5.2 GB used (processes)
9.2 GB buff/cache (disk cache — reclaimable)
1.1 GB free
What "available" actually looks like
5.2 GB committed to processes
9.8 GB available (free + the reclaimable part of the cache)
The number to watch is "available", not "free". When available drops toward zero (especially below 200–300 MB on a production server), then you have a memory pressure problem. Until then, the kernel is just being efficient with your RAM.
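The rule of thumb above can be turned into a small check. This is a minimal sketch: it reads the MemAvailable field of /proc/meminfo, which is the figure free reports as "available". The function names and the 300 MB default threshold are illustrative assumptions, not kernel or distribution defaults.

```shell
# Print MemAvailable in MB from a meminfo-format file.
# The file argument defaults to /proc/meminfo; it is there so the
# function can also be pointed at a saved copy.
mem_available_mb() {
    awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Warn when available memory is below a threshold given in MB.
# 300 is an assumed default taken from the rule of thumb above.
check_mem_available() {
    threshold_mb=${1:-300}
    avail_mb=$(mem_available_mb "${2:-/proc/meminfo}")
    if [ "$avail_mb" -lt "$threshold_mb" ]; then
        echo "WARNING: only ${avail_mb} MB available (threshold ${threshold_mb} MB)"
        return 1
    fi
    echo "OK: ${avail_mb} MB available"
}

# Usage: check_mem_available 300
```

The non-zero return status on a warning makes the function easy to call from a cron job or a monitoring wrapper.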
What are buffers and cache?
Page Cache
Contents of files that have been read from disk, kept in RAM for fast re-access. If your app reads the same log file 1,000 times, after the first read it comes from RAM at memory speed. The kernel evicts oldest pages when processes need more RAM. This is the majority of "buff/cache" on a typical server.
Buffer Cache
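Raw block-device pages, mostly filesystem metadata blocks (the Buffers field in /proc/meminfo). The in-memory directory entry (dentry) and inode caches are kernel slab objects (SReclaimable), which free also counts in buff/cache. Smaller than the page cache on most systems, but important for workloads that do many small file operations (millions of tiny files, email servers).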
Anonymous Memory
Process heap, stack, and privately allocated memory that isn't backed by a file. This is the memory processes actually "own" and that cannot be reclaimed without going to swap. It makes up most of the RSS of a typical service, although RSS also counts resident file-backed and shared pages. free reports it (roughly) in the "used" column.
/proc/meminfo — The Full Picture
free is a summary. When you need more detail, /proc/meminfo is the authoritative source — the kernel's own memory accounting. The fields that matter most for performance diagnosis:
$ cat /proc/meminfo
MemTotal: 16384000 kB
MemFree: 1126400 kB
MemAvailable: 10035200 kB
Buffers: 206080 kB
Cached: 9214976 kB
SwapCached: 0 kB
SwapTotal: 2097152 kB
SwapFree: 2097152 kB
Active: 6291456 kB
Inactive: 3670016 kB
Dirty: 49152 kB
Writeback: 0 kB
AnonPages: 5324800 kB
Mapped: 1638400 kB
Shmem: 350208 kB
Slab: 524288 kB
SReclaimable: 393216 kB
CommitLimit: 10289152 kB
Committed_AS: 8192000 kB
| Field | What it means | Alert when… |
| MemAvailable | Memory a new process can use without swap. The most useful single number. | Drops below ~200 MB on a production server |
| SwapFree | Unused swap. Swap used = SwapTotal − SwapFree. | Any swap in use: check whether it is growing or active (si/so in vmstat) |
| Dirty | Data written by processes but not yet flushed to disk. Normal in bursts. | Persistently high (GB range) — disk can't keep up with writes |
| AnonPages | Process heap/stack memory. Cannot be reclaimed without swap. | Growing steadily with no new processes starting — possible leak |
| Committed_AS | Total memory promised to all processes (including not-yet-used allocations). | Approaching CommitLimit — system is over-committed, OOM risk |
| Slab | Kernel object caches. Can grow large on servers with many files/sockets. | Several GB with nothing to explain it — possible kernel memory leak |
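The fields in the table can be pulled out and converted in one pass. The sketch below is illustrative: the function name and output keys are assumptions, and the buff/cache line follows the way procps free sums Buffers, Cached and SReclaimable. Note that CommitLimit is only enforced when vm.overcommit_memory is set to 2, so on a default system the commit percentage is a hint rather than a hard ceiling.

```shell
# Summarise the fields from the table above from a meminfo-format file.
# The file argument defaults to /proc/meminfo. All source values are in kB.
meminfo_summary() {
    awk '
        { v[$1] = $2 }      # keys keep their trailing colon, e.g. "Dirty:"
        END {
            printf "available_mb=%d\n",  v["MemAvailable:"] / 1024
            printf "swap_used_mb=%d\n",  (v["SwapTotal:"] - v["SwapFree:"]) / 1024
            printf "buff_cache_mb=%d\n", (v["Buffers:"] + v["Cached:"] + v["SReclaimable:"]) / 1024
            printf "dirty_mb=%d\n",      v["Dirty:"] / 1024
            printf "anon_mb=%d\n",       v["AnonPages:"] / 1024
            if (v["CommitLimit:"] > 0)
                printf "commit_pct=%d\n", 100 * v["Committed_AS:"] / v["CommitLimit:"]
        }
    ' "${1:-/proc/meminfo}"
}

# Usage: meminfo_summary
```

Run against the sample output above, this reports 9800 MB available, 0 MB of swap used, and Committed_AS at 79% of CommitLimit.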
Swap — What It Is and When It Becomes a Problem
Swap is disk space used as overflow when physical RAM is exhausted. When the kernel needs to free RAM for a new allocation and can't reclaim enough page cache, it evicts anonymous process memory (heap/stack pages) to the swap device. If that memory is needed again, it's read back from disk — this is called swapping in.
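The problem: a fast NVMe SSD delivers roughly 7 GB/s of sequential throughput, while RAM delivers roughly 50 GB/s, so swap is at least 7× slower even in the best case. In practice the gap is far larger, because swapping in is a stream of small random reads, and an NVMe read takes tens of microseconds where a RAM access takes around 100 nanoseconds. Swap on a spinning HDD is catastrophic. A system that is actively swapping (non-zero si/so in vmstat) will feel sluggish even if CPU load is low.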
$ free -h | grep Swap
Swap: 2.0Gi 1.4Gi 614Mi
$ vmstat 1 5 | awk 'NR==2 || NR>3'    # keep the column header, drop the banner and the since-boot sample
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 3  1  1433M   614M   201M   2.1G  840  320  9800  8420 3100 5200 20  8 12 60
vm.swappiness — controlling swap eagerness
0: Avoid swap, prefer to reclaim cache
60: Default
200: Swap aggressively (the maximum since kernel 5.8; older kernels stop at 100)
$ cat /proc/sys/vm/swappiness
60
$ sudo sysctl -w vm.swappiness=10                                        # takes effect now, lost on reboot
$ echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-swappiness.conf    # persist across reboots
$ sudo sysctl -p /etc/sysctl.d/99-swappiness.conf                        # load the file immediately
vm.swappiness=0 does not disable swap. It tells the kernel to avoid swapping anonymous memory and prefer to reclaim page cache instead — but the kernel will still swap if there's genuinely no other choice. Setting it to 0 can actually cause OOM kills on workloads where swapping out cold pages would have been better. A value of 10 is a reasonable middle ground for servers that have enough RAM.
Scenario 1 — Swap Is Climbing Overnight
1
Confirm swap is in use and check whether it's actively swapping. Used swap alone isn't urgent. Active swapping (si/so in vmstat) is.
$ free -h
Swap: 2.0Gi 1.8Gi 204Mi
$ vmstat 1 3 | awk 'NR==2 || NR>3'    # keep the column header, drop the banner and the since-boot sample
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs
 1  0  1843M   204M   180M   420M    0    0    12     8  210  840
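The "used versus actively swapping" distinction in step 1 can be scripted. The following is a sketch under stated assumptions: the function name is hypothetical, it expects plain vmstat output on stdin, and it looks the si and so columns up from the header line rather than assuming fixed positions.

```shell
# Read vmstat output on stdin and report whether the system is actively
# swapping. The first data line is skipped because vmstat reports
# averages since boot there.
swap_activity() {
    awk '
        $1 == "r" {                       # column header: locate si and so
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        si && $1 ~ /^[0-9]+$/ {           # data line
            if (++seen == 1) next         # since-boot averages
            n++; in_sum += $si; out_sum += $so
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            state = (in_sum + out_sum > 0) ? "actively swapping" : "not swapping"
            printf "si=%d so=%d (average per second over %d samples): %s\n",
                in_sum / n, out_sum / n, n, state
        }
    '
}

# Usage: vmstat 1 6 | swap_activity
```

Averaging several samples avoids reacting to a single burst; a longer vmstat run gives a steadier answer.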
2
Find the memory hogs with ps, sorted by RSS (resident memory).
$ ps aux --sort=-%mem | head -10
USER PID %CPU %MEM VSZ RSS STAT START TIME COMMAND
postgres 2341 0.5 18.4 850000 3014656 S Mon09 8:42 postgres: worker
app 8821 0.2 12.1 650000 1982464 S Mon09 3:14 node /opt/app/server.js
app 8834 0.1 11.8 640000 1933312 S Mon09 3:05 node /opt/app/server.js
$ ps aux --sort=-%mem | awk 'NR>1 && NR<=11 {printf "%-12s %6s %6.1f MB %s\n", $1, $2, $6/1024, $11}'
3
Use smem for a more accurate per-process view — it shows Unique Set Size (USS), which excludes memory shared with other processes and gives the real exclusive memory cost.
$ smem -r -k | head -15
PID User Command Swap USS PSS RSS
2341 postgres postgres: worker 1.4G 2.8G 2.9G 3.0G
8821 app node /opt/app/server.js 180M 1.7G 1.8G 1.9G
8834 app node /opt/app/server.js 160M 1.6G 1.7G 1.9G
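If smem is not installed, the per-process swap figure is also available from the VmSwap field in /proc/PID/status. This is a rough sketch, not a replacement for smem: the function name is hypothetical, only the first word of a process name is kept, and reading other users' processes may require root.

```shell
# List processes by swap usage, largest first, using the VmSwap field
# of /proc/PID/status. The optional argument is the proc root
# (default /proc). Kernel threads have no VmSwap line and are skipped.
swap_by_process() {
    proc_root=${1:-/proc}
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        awk '
            $1 == "Name:"   { name = $2 }
            $1 == "Pid:"    { pid = $2 }
            $1 == "VmSwap:" { swap_kb = $2 }
            END { if (swap_kb > 0) printf "%d %s %d\n", swap_kb, name, pid }
        ' "$status" 2>/dev/null
    done | sort -rn | awk '{ printf "%8.1f MB  %-16s pid %s\n", $1 / 1024, $2, $3 }'
}

# Usage: swap_by_process | head -10
```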
4
Determine if this is a growth problem or a sizing problem. If processes are at a stable high-water mark and swap is stable, the server may simply need more RAM. If memory is growing, investigate a leak (Scenario 3). Check what ran overnight:
$ grep "$(date -d yesterday '+%b %e')" /var/log/syslog | grep -i cron
$ journalctl --since yesterday --until "6 hours ago" -u cron
5
Resolution options:
- Short term: If the system is stable (si/so = 0), leave it. Pages in swap that aren't needed won't be read back, and the system is coping.
- If actively swapping and sluggish: Identify the largest non-essential process (step 2) and restart it during a maintenance window — this frees its RSS and its swap pages. Do not kill a database or web server without preparation.
- Medium term: Lower vm.swappiness to make the kernel prefer reclaiming page cache over swapping, giving processes more RAM before they're pushed to swap.
- Long term: Add RAM, or reduce the memory footprint of the services (connection pool sizes, worker counts, JVM heap limits).
Never run production without swap entirely. If a workload occasionally spikes, swap is the safety net that prevents an OOM kill. The goal is a system that rarely uses swap, not one that has none.
Scenario 2 — The OOM Killer Fired Overnight
1
Check dmesg for OOM kill events — this is the first thing to look at.
$ dmesg -T | grep -i "oom\|killed process\|out of memory"
[Mon Jun 16 03:47:22 2025] Out of memory: Kill process 8821 (node) score 482 or sacrifice child
[Mon Jun 16 03:47:22 2025] Killed process 8821 (node) total-vm:655360kB, anon-rss:1843200kB, file-rss:90112kB, shmem-rss:0kB
2
Read the full OOM event for context — dmesg logs the memory state of the whole system at the moment of the kill.
$ dmesg -T | grep -A 30 "Out of memory" | head -40
[Mon Jun 16 03:47:21 2025] node invoked oom-killer: gfp_mask=0x..., order=0, oom_score_adj=0
[Mon Jun 16 03:47:21 2025] Mem-Info:
[Mon Jun 16 03:47:21 2025] active_anon:458752 inactive_anon:12288 isolated_anon:0
[Mon Jun 16 03:47:21 2025] active_file:256 inactive_file:384 isolated_file:0
[Mon Jun 16 03:47:21 2025] MemFree: 12288kB ← Only 12 MB truly free at time of kill
3
Also check journalctl — on systemd systems, OOM kills appear in the journal too, often with more context about which service was affected.
$ journalctl -k --since "2025-06-16 03:40" --until "2025-06-16 04:00"
$ journalctl -u myapp.service --since yesterday | grep -i "kill\|oom\|memory"
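When a log contains several kills, it helps to reduce each "Killed process" line to the fields that matter. The sketch below assumes the line format shown in step 1 (the same lines appear in journalctl -k output); the function name is hypothetical, and process names containing spaces are not handled.

```shell
# Extract pid, process name and anonymous RSS (in MB) from kernel
# "Killed process" log lines read on stdin.
oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*anon-rss:\([0-9][0-9]*\)kB.*/\1 \2 \3/p' |
        awk '{ printf "pid=%s name=%s anon_rss_mb=%d\n", $1, $2, $3 / 1024 }'
}

# Usage: dmesg -T | oom_victims
#        journalctl -k --since yesterday | oom_victims
```

For the example kill above, this prints pid=8821 name=node anon_rss_mb=1800.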
4
Understand the OOM score: why was this process chosen? The kernel assigns every process an oom_score from 0–1000. Higher = more likely to be killed. It's based primarily on the process's memory usage (resident memory plus swap) as a fraction of total RAM plus swap, adjusted by oom_score_adj.
$ for pid in $(ps -eo pid --no-headers); do
    # Skip processes that exited between the ps call and the read
    score=$(cat /proc/$pid/oom_score 2>/dev/null) || continue
    cmd=$(cat /proc/$pid/cmdline 2>/dev/null | tr '\0' ' ' | cut -c1-60)
    printf "%6d %5d %s\n" "$pid" "$score" "$cmd"
  done | sort -k2 -rn | head -10
8821 482 node /opt/app/server.js
2341 310 postgres: worker
891 45 sshd: root
5
Protect critical processes from OOM kills with oom_score_adj. Setting this to -1000 makes the process immune; 0 is default; +1000 makes it the first target.
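$ echo -500 | sudo tee /proc/2341/oom_score_adj    # protect this PostgreSQL process (lost when it restarts)
$ sudo systemctl edit postgresql.service           # persistent: set OOMScoreAdjust=-500 under [Service]
$ echo 500 | sudo tee /proc/8821/oom_score_adj     # make the node worker a preferred target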
The OOM killer is a last resort — it only fires when the kernel genuinely cannot find any memory to allocate from any source. If you're seeing regular OOM kills, the fix is either more RAM, reduced process memory footprints, or better swap management — not adjusting oom_score_adj to protect things. That only changes who gets killed, not whether the kill happens.
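For the persistent route in step 5, the drop-in that systemctl edit creates is a two-line file. The sketch below generates it after validating the value. The function name, the drop-in file name oom.conf and the paths in the usage comment are illustrative assumptions; OOMScoreAdjust= is the systemd directive, and it takes effect when the service is next started.

```shell
# Print a systemd drop-in that sets OOMScoreAdjust for a service.
# Valid values range from -1000 to 1000.
oom_dropin() {
    adj=$1
    digits=${adj#-}                    # strip one leading minus sign
    case $digits in
        ''|*[!0-9]*) echo "usage: oom_dropin <-1000..1000>" >&2; return 2 ;;
    esac
    if [ "$adj" -lt -1000 ] || [ "$adj" -gt 1000 ]; then
        echo "oom_score_adj must be between -1000 and 1000" >&2
        return 2
    fi
    printf '[Service]\nOOMScoreAdjust=%d\n' "$adj"
}

# Example (assumes a unit named postgresql.service):
#   sudo mkdir -p /etc/systemd/system/postgresql.service.d
#   oom_dropin -500 | sudo tee /etc/systemd/system/postgresql.service.d/oom.conf
#   sudo systemctl daemon-reload && sudo systemctl restart postgresql.service
```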
Scenario 3 — A Service's Memory Keeps Growing
1
Confirm the growth is real, not just page cache being attributed to the process. Watch RSS (not VSZ) over time — RSS is what's actually in physical memory.
$ watch -n 30 'ps -p 8821 --no-headers -o pid,rss,vsz,etime,comm | awk "{printf \"%s PID:%s RSS:%s MB VSZ:%s MB uptime:%s\n\", \$5,\$1,int(\$2/1024),int(\$3/1024),\$4}"'
$ while true; do
echo "$(date '+%H:%M:%S') $(ps -p 8821 -o rss= | awk '{printf "%.0f MB\n", $1/1024}')"
sleep 60
done | tee /tmp/mem_growth.log
09:00:00 204 MB
09:30:00 261 MB
10:00:00 318 MB
10:30:00 375 MB ← Growing ~57 MB every 30 minutes — linear leak pattern
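The growth rate can be computed from the log rather than read off by eye. This is a minimal sketch: the function name is hypothetical, it expects the "HH:MM:SS n MB" lines written by the sampling loop above, it assumes all samples fall on the same day, and it fits a straight line through the first and last samples only. The limit argument is whatever ceiling matters on the system in question.

```shell
# Estimate the growth rate from "HH:MM:SS <n> MB" lines on stdin and
# project the time until a limit (in MB) is reached.
# Assumes timestamps do not wrap past midnight.
leak_rate() {
    awk -v limit="$1" '
        {
            split($1, t, ":")
            secs = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR == 1) { first_t = secs; first_mb = $2 + 0 }
            last_t = secs; last_mb = $2 + 0
        }
        END {
            hours = (last_t - first_t) / 3600
            if (hours <= 0) { print "need at least two samples"; exit 1 }
            rate = (last_mb - first_mb) / hours
            printf "growth: %.1f MB/hour\n", rate
            if (rate > 0 && limit + 0 > last_mb)
                printf "reaches %d MB in about %.1f hours\n", limit, (limit - last_mb) / rate
        }
    '
}

# Usage: leak_rate 2048 < /tmp/mem_growth.log
```

With the four samples shown above and a 2048 MB limit, the result is 114.0 MB/hour and roughly 14.7 hours of headroom.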
2
Use pmap to inspect the process's memory map — look for anonymous mappings that are growing, which indicates heap or mmap-based allocations not being freed.
$ pmap -x 8821 | tail -20
Address Kbytes RSS Dirty Mode Mapping
00007f8b2c000000 2097152 2097152 2097152 rw--- [ anon ]
00007f8b4c000000 524288 524288 524288 rw--- [ anon ]
...
---------------- ------- ------ ------
total kB 3276800 3145728 3014656
$ pmap -x 8821 | grep anon | awk '{sum += $2} END {print sum/1024 " MB anon"}'
3
Check for open file descriptors accumulating — sometimes what looks like a memory leak is actually file handles or socket connections not being closed, each consuming a small amount of kernel memory.
$ ls /proc/8821/fd | wc -l
4821
$ lsof -p 8821 | awk '{print $5}' | sort | uniq -c | sort -rn | head -10
4201 IPv4 ← Over 4000 open network connections — likely connection leak
300 REG ← Regular files
200 sock ← Sockets
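When lsof is unavailable or slow on a process with thousands of descriptors, the symlinks in /proc/PID/fd carry the same type information. The sketch below classifies their targets; the function name and the category labels are assumptions chosen for illustration.

```shell
# Classify the targets of /proc/PID/fd symlinks read on stdin, one per
# line, and print a count per category, largest first.
fd_types() {
    awk '
        /^socket:/     { n["socket"]++; next }
        /^pipe:/       { n["pipe"]++; next }
        /^anon_inode:/ { n["anon_inode"]++; next }
        /^\//          { n["file"]++; next }
                       { n["other"]++ }
        END { for (k in n) printf "%d %s\n", n[k], k }
    ' | sort -rn
}

# Usage (8821 is the example PID from above; may need sudo):
#   for fd in /proc/8821/fd/*; do readlink "$fd"; done | fd_types
```

Running it every few minutes and comparing the socket count is a quick way to confirm a connection leak before involving the developers.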
4
Document, report, and mitigate. A genuine memory leak is a code bug that requires a fix. In the meantime:
- Document the growth rate — how long before it hits the danger zone? This defines your maintenance window.
- Set up a scheduled restart: systemctl restart myapp.service via cron at off-peak hours buys time while the fix is developed.
- Set a memory limit via systemd: MemoryMax=2G in the service unit will trigger an OOM kill of the leaking service (not the whole system) if it exceeds 2 GB, protecting other services.
- Report to developers — include your pmap output, the growth rate log, and the lsof fd count. This is the evidence needed to find the leak.
Memory leaks in interpreted languages (Node.js, Python, Ruby) are often event listener accumulation, cache objects that grow without bounds, or closures keeping references alive. In C/C++ services, tools like Valgrind or AddressSanitizer are used during development to find the leaking allocation.
OOM Score — Who Gets Killed First
-1000
Never kill
Completely protected from OOM killer. Set for init/systemd, critical infrastructure. Use with extreme care — if this process leaks, the system will OOM-kill everything else first.
-500
Strongly protected
Good for databases (PostgreSQL, MySQL, Redis). Unlikely to be killed unless the system is truly desperate. Set via systemd OOMScoreAdjust or /proc/PID/oom_score_adj.
0
Default
All processes start here. Final oom_score is calculated from this base plus memory usage. Large-RSS processes end up with higher scores.
+500 to +1000
Kill me first
Useful for disposable worker processes or batch jobs — if the system runs low, kill this before touching production services. Systemd sets +100 for most user services by default on some distros.
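To see which processes on a host already carry a non-default adjustment, the values can be read straight from /proc. This is a sketch with a hypothetical function name; it prints the adjustment, the PID and the short command name, most protected first.

```shell
# List processes whose oom_score_adj differs from the default of 0.
# The optional argument is the proc root (default /proc).
oom_adj_overrides() {
    proc_root=${1:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries that vanished or cannot be read
        adj=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        [ "$adj" -ne 0 ] || continue
        printf '%6d %5s %s\n' "$adj" "${dir##*/}" "$(cat "$dir/comm" 2>/dev/null)"
    done | sort -n
}

# Usage: oom_adj_overrides
```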
Page Cache — When to Clear It (Rarely)
The page cache is self-managing. The kernel evicts the oldest, least-used pages automatically when processes need RAM. You almost never need to clear it manually on a production server.
Clearing the page cache on a production server causes a performance cliff. Every file access that was being served from RAM now hits disk — databases, web servers, and application servers all slow dramatically for minutes until the cache warms up again. Only clear it for benchmarking (to get a cold-cache baseline) or on a test system.
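$ sync                                          # flush dirty pages to disk first
$ echo 1 | sudo tee /proc/sys/vm/drop_caches    # drop the page cache only
$ echo 2 | sudo tee /proc/sys/vm/drop_caches    # drop dentries and inodes only
$ echo 3 | sudo tee /proc/sys/vm/drop_caches    # drop page cache, dentries and inodes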
If your monitoring shows "used memory" jumping after a cache clear, that's normal — the graph is just showing the cache being rebuilt. The kernel is not "wasting" memory; it's making your system faster for the next time those files are read.
Quick Reference — Chapter 3 Commands
| Command | Purpose | Key flags / notes |
| free -h | Memory and swap — human-readable. Watch "available", not "free". | -s 2 refresh every 2 seconds |
| cat /proc/meminfo | Full kernel memory accounting — MemAvailable, AnonPages, Committed_AS, Slab | Most reliable source; grep MemAvailable for the key figure |
| vmstat 1 | Watch si/so columns — non-zero = active swapping (bad) | wa column shows % CPU time waiting for I/O (includes swap I/O) |
| ps aux --sort=-%mem | Processes sorted by memory usage (RSS). Quick top-consumers list. | | head -10 · RSS is in KB |
| smem -r -k | Per-process USS/PSS/RSS/Swap — more accurate than ps for real memory cost | -r reverse sort · -k human sizes · may need install |
| pmap -x PID | Memory map of one process — find growing anonymous mappings (leaks) | | grep anon to filter heap · | tail for totals |
| lsof -p PID | Open files and sockets for one process — spot connection or fd leaks | | wc -l for count · | awk '{print $5}' | sort | uniq -c for type breakdown |
| dmesg -T | grep -i oom | Find OOM kill events — shows process killed, its RSS, and oom_score | Also: journalctl -k | grep -i oom |
| cat /proc/PID/oom_score | Current OOM kill score for a process (0–1000, higher = more likely to die) | Loop over all PIDs with ps -eo pid to find highest-scored processes |
| echo N | sudo tee /proc/PID/oom_score_adj | Adjust OOM kill priority: -1000 (immune) to +1000 (kill first) | Persistent via systemd OOMScoreAdjust= in unit file |
| sysctl vm.swappiness | Check swap eagerness (default 60). Lower = prefer reclaiming cache over swapping. | sysctl -w vm.swappiness=10 to change · persist in /etc/sysctl.d/ |
| watch -n 30 'ps -p PID -o rss=' | Monitor RSS of one process every 30s to track memory leak growth rate | For a history, use a while/sleep loop piped to tee /tmp/mem.log instead (watch redraws the screen and does not log cleanly) |