Memory Bottlenecks

Chapter 3 — Memory Bottlenecks

Memory is the most misread resource on Linux. The default output of free routinely alarms people who see only 200 MB "free" on a 16 GB server — when in reality that server has 14 GB readily available and is performing perfectly. This chapter starts by correcting that misreading, then covers the cases where memory really is a problem: swap thrashing, OOM kills, and memory leaks.

What this chapter covers:

Why "free" memory isn't what it looks like
The available vs free distinction
Buffers and page cache explained
Reading /proc/meminfo
Swap and vm.swappiness
Scenario 1: swap climbing overnight
Scenario 2: the OOM killer fired (reading dmesg)
Scenario 3: a service's memory keeps growing (identifying a leak)
The OOM score system
When (and when not) to clear the page cache

The Fundamental Misreading — "Free" Is Not "Available"

Linux puts otherwise idle RAM to work. Memory that isn't being used by processes is used as a disk cache, so pages recently read from disk are kept in RAM and served from there on subsequent reads, which makes the system faster. Clean cached pages can be dropped immediately when a process needs the memory (modified pages are written back to disk first). The cache shows up as "used" in many tools, but it is not in use in a way that limits your applications.

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       5.2Gi       1.1Gi       342Mi       9.2Gi       9.8Gi
Swap:          2.0Gi          0B       2.0Gi
# "used" looks high at 5.2 GB, but that includes process memory AND some overhead
# "free" is only 1.1 GB, which looks alarming
# "buff/cache" is 9.2 GB of disk cache, reclaimed whenever processes need it
# "available" is 9.8 GB: what a new process can actually use without going to swap
# The server has 9.8 GB readily available. It is not in trouble.

How the 16 GB is actually allocated (example above):

5.2 GB used (processes)
9.2 GB buff/cache (disk cache, reclaimable)
1.1 GB free

What "available" actually looks like:

5.2 GB committed to processes
9.8 GB available (free memory plus the reclaimable part of the cache)

The number to watch is "available", not "free". When available drops toward zero (especially below 200–300 MB on a production server), then you have a memory pressure problem. Until then, the kernel is just being efficient with your RAM.
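To make that rule easy to check from a script, here is a minimal sketch that reports MemAvailable as a percentage of MemTotal. The function name mem_available_pct is a hypothetical helper, not a standard tool, and the demo runs on the sample figures from this chapter rather than on a live system.

```shell
#!/bin/sh
# mem_available_pct [FILE]: print MemAvailable as a percentage of MemTotal.
# FILE is /proc/meminfo-style text (defaults to /proc/meminfo).
# Hypothetical helper name, chosen for this illustration only.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) { printf "%.2f\n", avail * 100 / total }
            else { exit 1 }
        }
    ' "${1:-/proc/meminfo}"
}

# Demo on the sample values used in this chapter (kB).
sample=$(mktemp)
cat > "$sample" <<'EOF'
MemTotal:       16384000 kB
MemFree:         1126400 kB
MemAvailable:   10035200 kB
EOF
mem_available_pct "$sample"    # prints 61.25
rm -f "$sample"
```

On a real host, calling mem_available_pct with no argument reads /proc/meminfo directly. A percentage is often easier to alert on than an absolute figure, but the 200 to 300 MB rule of thumb above still applies on small machines.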

What are buffers and cache?

Page Cache

Contents of files that have been read from disk, kept in RAM for fast re-access. If your app reads the same log file 1,000 times, after the first read it comes from RAM at memory speed. The kernel evicts oldest pages when processes need more RAM. This is the majority of "buff/cache" on a typical server.

Buffer Cache

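Raw block device buffers, which mostly hold filesystem metadata blocks (superblocks, inode tables, directory blocks). Smaller than the page cache on most systems, but important for workloads that do many small file operations (millions of tiny files, email servers). The related dentry and inode caches live in the kernel slab (SReclaimable in /proc/meminfo), which free also counts under buff/cache.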

Anonymous Memory

Process heap, stack, and privately allocated memory that isn't backed by a file. This is the memory processes actually "own", and it cannot be reclaimed without going to swap. It usually makes up most of a process's RSS, although RSS also counts resident file-backed and shared pages. It roughly corresponds to the "used" column in free.

/proc/meminfo — The Full Picture

free is a summary. When you need more detail, /proc/meminfo is the authoritative source — the kernel's own memory accounting. The fields that matter most for performance diagnosis:

$ cat /proc/meminfo
MemTotal:       16384000 kB   # Total physical RAM
MemFree:         1126400 kB   # Truly unused pages
MemAvailable:   10035200 kB   # ← The real "can I fit more?" number
Buffers:          206080 kB   # Block device buffer cache
Cached:          9214976 kB   # Page cache (file contents)
SwapCached:            0 kB   # Pages in swap that are also still in RAM
SwapTotal:       2097152 kB
SwapFree:        2097152 kB   # ← Swap used = SwapTotal - SwapFree. Here: 0 used.
Active:          6291456 kB   # Recently used, less likely to be reclaimed
Inactive:        3670016 kB   # Less recently used, first candidates for reclaim
Dirty:             49152 kB   # Written to but not yet flushed to disk
Writeback:             0 kB   # Currently being written to disk
AnonPages:       5324800 kB   # Anonymous process memory (heap/stack); reclaimable only by swapping
Mapped:          1638400 kB   # Files mapped into process address spaces (mmap)
Shmem:            350208 kB   # Shared memory (tmpfs, IPC)
Slab:             524288 kB   # Kernel data structures (inodes, dentries, etc.)
SReclaimable:     393216 kB   # Slab memory that can be reclaimed
CommitLimit:    10289152 kB   # Commit ceiling; enforced only when vm.overcommit_memory=2
Committed_AS:    8192000 kB   # How much has been promised. Near or above CommitLimit: allocation failures (strict mode) or OOM risk.
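The CommitLimit figure in the listing above is not arbitrary. As a rough sketch (ignoring huge pages, and assuming vm.overcommit_ratio is left at the kernel default of 50), it is SwapTotal plus half of MemTotal. The helper name commit_limit_kb below is made up for this example.

```shell
#!/bin/sh
# commit_limit_kb MEMTOTAL_KB SWAPTOTAL_KB [OVERCOMMIT_RATIO]
# Sketch of the kernel's calculation, ignoring huge pages:
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
# The ratio defaults to 50 here, matching the default vm.overcommit_ratio.
# Hypothetical helper name, for illustration only.
commit_limit_kb() {
    echo $(( $2 + $1 * ${3:-50} / 100 ))
}

# MemTotal and SwapTotal from the sample /proc/meminfo output.
commit_limit_kb 16384000 2097152    # prints 10289152
```

The result matches the CommitLimit line in the sample. Keep in mind that the kernel enforces this ceiling only when vm.overcommit_memory is set to 2; in the default heuristic mode the number is informational.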

Field	What it means	Alert when…
MemAvailable	Memory a new process can use without swap. The most useful single number.	Drops below ~200 MB on a production server
SwapFree	Unused swap. Swap used = SwapTotal − SwapFree.	Swap in use at all — investigate why
Dirty	Data written by processes but not yet flushed to disk. Normal in bursts.	Persistently high (GB range) — disk can't keep up with writes
AnonPages	Process heap/stack memory. Cannot be reclaimed without swap.	Growing steadily with no new processes starting — possible leak
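Committed_AS	Total memory promised to all processes (including not-yet-used allocations).	Approaching CommitLimit: with vm.overcommit_memory=2 new allocations start to fail; in the default mode it signals OOM risk if the promised memory is actually touched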
Slab	Kernel object caches. Can grow large on servers with many files/sockets.	Several GB with nothing to explain it — possible kernel memory leak
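Two of the checks in the table are simple derived values: swap in use (SwapTotal minus SwapFree) and Committed_AS as a share of CommitLimit. The sketch below computes both from meminfo-style text. The function name meminfo_pressure and the output keys are assumptions made for this example.

```shell
#!/bin/sh
# meminfo_pressure [FILE]: print swap in use (MB) and Committed_AS as a
# percentage of CommitLimit, from /proc/meminfo-style text.
# Hypothetical helper; the output key names are illustrative.
meminfo_pressure() {
    awk '
        { v[$1] = $2 }    # key keeps its trailing colon, value is in kB
        END {
            printf "swap_used_mb=%d\n", (v["SwapTotal:"] - v["SwapFree:"]) / 1024
            if (v["CommitLimit:"] > 0)
                printf "commit_pct=%d\n", v["Committed_AS:"] * 100 / v["CommitLimit:"]
        }
    ' "${1:-/proc/meminfo}"
}

# Demo on the sample values from this chapter (kB).
sample=$(mktemp)
cat > "$sample" <<'EOF'
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
CommitLimit:    10289152 kB
Committed_AS:    8192000 kB
EOF
meminfo_pressure "$sample"    # prints swap_used_mb=0 then commit_pct=79
rm -f "$sample"
```

Run without an argument, it reads the live /proc/meminfo. The percentages are truncated to whole numbers, which is precise enough for a threshold check.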

Swap — What It Is and When It Becomes a Problem

Swap is disk space used as overflow when physical RAM is exhausted. When the kernel needs to free RAM for a new allocation and can't reclaim enough page cache, it evicts anonymous process memory (heap/stack pages) to the swap device. If that memory is needed again, it's read back from disk — this is called swapping in.

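The problem is speed. A fast NVMe SSD delivers roughly 7 GB/s of sequential throughput, while RAM delivers roughly 50 GB/s, so even on the fastest storage swap I/O is about 7× slower than RAM at best. Latency is worse still: a RAM access takes on the order of 100 nanoseconds, whereas reading a 4 KB page back from NVMe takes tens of microseconds, and swap on a spinning HDD is catastrophic. A system that is actively swapping (sustained non-zero si/so in vmstat) will feel sluggish even if CPU load is low.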

# Quick swap check
$ free -h | grep Swap
Swap:          2.0Gi       1.4Gi       614Mi
# 1.4 GB in use. Now check if it's actively swapping (worse than just being in use).
# The awk filter keeps the column header and drops the first sample (averages since boot):
$ vmstat 1 5 | awk 'NR==2 || NR>3'
 r  b   swpd   free   buff  cache    si   so    bi    bo   in   cs us sy id wa
 3  1  1433M   614M   201M   2.1G   840  320  9800  8420 3100 5200 20  8 12 60
# (memory columns abbreviated here; vmstat prints them in KB by default)
# si=840 KB/s (swapping in) + so=320 KB/s (swapping out) = active thrashing
# wa=60%: the CPU sits idle 60% of the time waiting for I/O, here mostly swap I/O
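Eyeballing si/so works for a one-off check; for a script it helps to average them. The sketch below reads vmstat output on stdin, skips the two header lines and the first sample (which reports averages since boot), and prints the mean si and so. The function name swap_activity and the sample rows are invented for this illustration.

```shell
#!/bin/sh
# swap_activity: read `vmstat INTERVAL COUNT` output on stdin and print the
# average si/so over the real samples. Lines 1-2 are headers and line 3 is
# the since-boot average, so only NR > 3 is counted.
# Hypothetical helper name; columns 7 and 8 are si and so in default vmstat output.
swap_activity() {
    awk '
        NR > 3 { si += $7; so += $8; n++ }
        END {
            if (n == 0) { exit 1 }
            printf "si=%d so=%d KB/s\n", si / n, so / n
        }
    '
}

# Demo on made-up sample output (not taken from a real host).
swap_activity <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 1467392 628736 205824 2202009    2    1    40    35  210  840  5  2 92  1  0
 3  1 1467392 628736 205824 2202009  840  320  9800  8420 3100 5200 20  8 12 60  0
 2  1 1467392 620544 205824 2202009  760  280  9100  8000 3000 5100 22  8 12 58  0
EOF
# prints: si=800 so=300 KB/s
```

On a live system the equivalent call would be vmstat 1 5 | swap_activity. Sustained non-zero averages are the signal that matters; a single small blip is usually harmless.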

vm.swappiness — controlling swap eagerness

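0: avoid swapping anonymous memory, prefer to reclaim cache
60: default
100: anonymous pages and page cache are treated as equally costly to reclaim
200: swap aggressively (values above 100 require kernel 5.8 or later)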

# Check current swappiness
$ cat /proc/sys/vm/swappiness
60
# Reduce it temporarily (lasts until the next reboot)
$ sudo sysctl -w vm.swappiness=10
# Make it permanent across reboots:
$ echo "vm.swappiness=10" | sudo tee /etc/sysctl.d/99-swappiness.conf
$ sudo sysctl -p /etc/sysctl.d/99-swappiness.conf

vm.swappiness=0 does not disable swap. It tells the kernel to avoid swapping anonymous memory and prefer to reclaim page cache instead — but the kernel will still swap if there's genuinely no other choice. Setting it to 0 can actually cause OOM kills on workloads where swapping out cold pages would have been better. A value of 10 is a reasonable middle ground for servers that have enough RAM.

Scenario 1 — Swap Is Climbing Overnight

Confirm swap is in use and check whether it's actively swapping. Used swap alone isn't urgent. Active swapping (si/so in vmstat) is.

$ free -h | grep Swap
Swap:          2.0Gi       1.8Gi       204Mi
# 1.8 GB used, which is significant
$ vmstat 1 3 | awk 'NR==2 || NR>3'
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs
 1  0  1843M   204M   180M   420M    0    0    12     8  210  840
# si=0, so=0: not actively swapping right now. Memory was pushed to swap
# overnight but the system is stable. Now find what consumed the RAM.

Find the memory hogs with ps, sorted by RSS (resident memory).

$ ps aux --sort=-%mem | head -10
USER       PID %CPU %MEM    VSZ     RSS TTY STAT START  TIME COMMAND
postgres  2341  0.5 18.4 850000 3014656 ?   S    Mon09  8:42 postgres: worker
app       8821  0.2 12.1 650000 1982464 ?   S    Mon09  3:14 node /opt/app/server.js
app       8834  0.1 11.8 640000 1933312 ?   S    Mon09  3:05 node /opt/app/server.js
# RSS is in KB. 3,014,656 KB = ~2.9 GB for the postgres worker
# Two node processes each using ~1.9 GB
# These three alone hold ~6.7 GB (about 42% of the 16 GB); add up the rest of the
# list to see where the remainder went and what is pushing pages out to swap
# Human-readable RSS for the top 10 (NR>1 skips the header line):
$ ps aux --sort=-%mem | awk 'NR>1 && NR<=11 {printf "%-12s %6s %8.1f MB %s\n", $1, $2, $6/1024, $11}'

Use smem for a more accurate per-process view — it shows Unique Set Size (USS), which excludes memory shared with other processes and gives the real exclusive memory cost.

# smem may need installing: apt install smem / yum install smem
$ smem -r -k | head -15
  PID User     Command                   Swap    USS    PSS    RSS
 2341 postgres postgres: worker          1.4G   2.8G   2.9G   3.0G
 8821 app      node /opt/app/server.js   180M   1.7G   1.8G   1.9G
 8834 app      node /opt/app/server.js   160M   1.6G   1.7G   1.9G
# Swap column shows how much each process has in swap right now
# USS = Unique Set Size (truly private memory, the real cost of running this process)
# PSS = Proportional Set Size (USS + fair share of shared memory such as libraries)
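If smem cannot be installed, the per-process swap figure is also available from the VmSwap line in /proc/PID/status. The following is a minimal sketch of reading it; the function name proc_swap_kb is an assumption for this example, and the demo uses a small sample file rather than a live process.

```shell
#!/bin/sh
# proc_swap_kb FILE: print the VmSwap value (kB) from a /proc/PID/status-style
# file, or 0 when the line is absent (kernel threads have no VmSwap line).
# Hypothetical helper name, for illustration only.
proc_swap_kb() {
    awk '
        $1 == "VmSwap:" { print $2; found = 1 }
        END { if (!found) print 0 }
    ' "$1"
}

# Demo on a trimmed sample status file (values are illustrative).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Name:   node
VmRSS:   1982464 kB
VmSwap:   184320 kB
EOF
proc_swap_kb "$sample"    # prints 184320
rm -f "$sample"
```

To rank live processes, one option is to loop over /proc/[0-9]*/status, print each value next to its PID, and pipe the result through sort -rn | head. Reading other users' status files generally works without root, but processes may exit mid-loop, so errors should be discarded.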

Determine if this is a growth problem or a sizing problem. If processes are at a stable high-water mark and swap is stable, the server may simply need more RAM. If memory is growing, investigate a leak (Scenario 3). Check what ran overnight:

# Did a cron job run overnight that caused this?
$ grep "$(date -d yesterday '+%b %e')" /var/log/syslog | grep -i cron
# On systemd hosts (the unit is cron on Debian/Ubuntu, crond on RHEL-family systems):
$ journalctl --since yesterday --until "6 hours ago" -u cron

Resolution options:

Short term: If the system is stable (si/so = 0), leave it. Pages in swap that aren't needed won't be read back, and the system is coping.
If actively swapping and sluggish: Identify the largest non-essential process (from the ps or smem listing above) and restart it during a maintenance window. This frees its RSS and its swap pages. Do not kill a database or web server without preparation.
Medium term: Lower vm.swappiness to make the kernel prefer reclaiming page cache over swapping, giving processes more RAM before they're pushed to swap.
Long term: Add RAM, or reduce the memory footprint of the services (connection pool sizes, worker counts, JVM heap limits).

Never run production without swap entirely. If a workload occasionally spikes, swap is the safety net that prevents an OOM kill. The goal is a system that rarely uses swap, not one that has none.

Scenario 2 — The OOM Killer Fired Overnight

Check dmesg for OOM kill events — this is the first thing to look at.

$ dmesg -T | grep -i "oom\|killed process\|out of memory"
[Mon Jun 16 03:47:22 2025] Out of memory: Kill process 8821 (node) score 482 or sacrifice child
[Mon Jun 16 03:47:22 2025] Killed process 8821 (node) total-vm:655360kB, anon-rss:1843200kB, file-rss:90112kB, shmem-rss:0kB
# Time: 03:47, overnight as suspected
# Process: PID 8821, command "node"
# anon-rss: 1.76 GB was in RAM when it was killed
# score 482: OOM score at time of kill (higher = more likely to be killed)
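When a host has had several OOM kills, it can help to reduce each "Killed process" line to the three facts that matter: PID, command, and anonymous RSS. The sketch below does that with awk. The function name oom_victims and its output format are assumptions for this example, and it assumes the command name contains no spaces.

```shell
#!/bin/sh
# oom_victims: read kernel log lines on stdin and summarise each
# "Killed process" entry as pid, command name and anon-rss in MB.
# Hypothetical helper; command names containing spaces are not handled.
oom_victims() {
    awk '
        /Killed process/ {
            pid = ""; name = ""; rss = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "process") { pid = $(i + 1); name = $(i + 2) }
                if ($i ~ /^anon-rss:/) { rss = $i; gsub(/[^0-9]/, "", rss) }
            }
            gsub(/[()]/, "", name)
            printf "pid=%s name=%s anon_rss_mb=%d\n", pid, name, rss / 1024
        }
    '
}

# Demo on the two log lines shown above.
oom_victims <<'EOF'
[Mon Jun 16 03:47:22 2025] Out of memory: Kill process 8821 (node) score 482 or sacrifice child
[Mon Jun 16 03:47:22 2025] Killed process 8821 (node) total-vm:655360kB, anon-rss:1843200kB, file-rss:90112kB, shmem-rss:0kB
EOF
# prints: pid=8821 name=node anon_rss_mb=1800
```

On a live system the input would come from dmesg -T | oom_victims or journalctl -k | oom_victims. The exact wording of OOM log lines varies between kernel versions, so treat the pattern as a starting point.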

Read the full OOM event for context — dmesg logs the memory state of the whole system at the moment of the kill.

$ dmesg -T | grep -A 30 "invoked oom-killer" | head -40
[Mon Jun 16 03:47:21 2025] node invoked oom-killer: gfp_mask=0x..., order=0, oom_score_adj=0
[Mon Jun 16 03:47:21 2025] Mem-Info:
[Mon Jun 16 03:47:21 2025] active_anon:458752 inactive_anon:12288 isolated_anon:0
[Mon Jun 16 03:47:21 2025]  active_file:256 inactive_file:384 isolated_file:0
[Mon Jun 16 03:47:21 2025] MemFree: 12288kB   ← Only 12 MB truly free at time of kill
# The memory dump is logged just before the "Out of memory" line, so the grep
# anchors on "invoked oom-killer" and prints the lines that follow it
# inactive_file (reclaimable page cache) is tiny: 384 pages = 1.5 MB
# The kernel had almost no cache left to reclaim before turning to the OOM killer
# This means the system was already under heavy memory pressure before the kill
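The Mem-Info counters are in pages, not kilobytes, which is easy to misread. As a small worked example, the sketch below converts a page count to MB, assuming 4 KB pages unless a page size is passed in. The helper name pages_to_mb is made up for this illustration.

```shell
#!/bin/sh
# pages_to_mb PAGES [PAGE_SIZE_BYTES]: convert a page count to MB.
# Assumes 4096-byte pages by default; pass "$(getconf PAGESIZE)" as the
# second argument to use the real page size of the host.
# Hypothetical helper name, for illustration only.
pages_to_mb() {
    awk -v p="$1" -v s="${2:-4096}" 'BEGIN { printf "%.1f\n", p * s / 1048576 }'
}

pages_to_mb 384       # inactive_file from the dump: prints 1.5
pages_to_mb 458752    # active_anon from the dump:   prints 1792.0
```

So at the moment of the kill roughly 1.75 GB sat in active anonymous memory while only 1.5 MB of inactive file cache remained, which is the imbalance the comments above describe.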

Also check journalctl — on systemd systems, OOM kills appear in the journal too, often with more context about which service was affected.

$ journalctl -k --since "2025-06-16 03:40" --until "2025-06-16 04:00"
# -k = kernel messages only (same as dmesg)
$ journalctl -u myapp.service --since yesterday | grep -i "kill\|oom\|memory"

Understand the OOM score: why was this process chosen? The kernel assigns every process an oom_score, nominally from 0 to 1000, where higher means more likely to be killed. It is based primarily on the process's memory footprint (resident memory, swap, and page tables) as a fraction of total RAM plus swap, adjusted by oom_score_adj.

# Check the OOM score of running processes
$ for pid in $(ps -eo pid --no-headers); do
    score=$(cat /proc/$pid/oom_score 2>/dev/null) || continue   # process already exited
    cmd=$(cat /proc/$pid/cmdline 2>/dev/null | tr '\0' ' ' | cut -c1-60)
    printf "%6d %5d %s\n" "$pid" "$score" "$cmd"
  done | sort -k2 -rn | head -10
  8821   482 node /opt/app/server.js
  2341   310 postgres: worker
   891    45 sshd: root

Protect critical processes from OOM kills with oom_score_adj. Setting this to -1000 makes the process immune; 0 is default; +1000 makes it the first target.

# Protect a running process (e.g., your database); this lasts only until the process restarts
$ echo -500 | sudo tee /proc/2341/oom_score_adj
# For a systemd service (persistent across restarts):
$ sudo systemctl edit postgresql.service
# Add under [Service]:
#   OOMScoreAdjust=-500
# Make a disposable worker MORE likely to be killed first (protecting everything else)
$ echo 500 | sudo tee /proc/8821/oom_score_adj

The OOM killer is a last resort — it only fires when the kernel genuinely cannot find any memory to allocate from any source. If you're seeing regular OOM kills, the fix is either more RAM, reduced process memory footprints, or better swap management — not adjusting oom_score_adj to protect things. That only changes who gets killed, not whether the kill happens.

Scenario 3 — A Service's Memory Keeps Growing

Confirm the growth is real, not just page cache being attributed to the process. Watch RSS (not VSZ) over time — RSS is what's actually in physical memory.

# Watch RSS of a specific PID every 30 seconds (the trailing = signs suppress the ps header)
$ watch -n 30 'ps -p 8821 -o pid=,rss=,vsz=,etime=,comm= | awk "{printf \"%s PID:%s RSS:%s MB VSZ:%s MB uptime:%s\n\", \$5,\$1,int(\$2/1024),int(\$3/1024),\$4}"'
# Or log it to a file to track the trend:
$ while true; do
    echo "$(date '+%H:%M:%S') $(ps -p 8821 -o rss= | awk '{printf "%.0f MB\n", $1/1024}')"
    sleep 60
  done | tee /tmp/mem_growth.log
# After 90 minutes, sample the log at 30-minute intervals to see the trend:
09:00:00 204 MB
09:30:00 261 MB
10:00:00 318 MB
10:30:00 375 MB   ← Growing ~57 MB every 30 minutes: a linear leak pattern

Use pmap to inspect the process's memory map — look for anonymous mappings that are growing, which indicates heap or mmap-based allocations not being freed.

$ pmap -x 8821 | tail -20
Address           Kbytes     RSS   Dirty Mode  Mapping
00007f8b2c000000 2097152 2097152 2097152 rw---   [ anon ]
00007f8b4c000000  524288  524288  524288 rw---   [ anon ]
...
---------------- ------- ------- -------
total kB         3276800 3145728 3014656
# Large anonymous mappings with RSS == Kbytes == Dirty mean these pages are
# allocated, in memory, and have been modified. On its own that is just a big heap;
# it points to a leak when the totals keep growing between runs.
# Run pmap again 10 minutes later and compare anon totals:
$ pmap -x 8821 | grep anon | awk '{sum += $2} END {print sum/1024 " MB anon"}'
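Comparing two pmap runs by eye is error-prone, so the following sketch sums the resident size of the anonymous mappings in two saved snapshots and prints the difference. The function names and the second snapshot's figures are hypothetical; only the first snapshot reuses the numbers shown above.

```shell
#!/bin/sh
# anon_rss_kb FILE: sum the RSS column (3rd field) of "[ anon ]" mappings
# in saved `pmap -x PID` output.
anon_rss_kb() {
    awk '/\[ anon \]/ { sum += $3 } END { print sum + 0 }' "$1"
}

# anon_growth_mb BEFORE AFTER: growth in resident anonymous memory, in MB.
# Both helper names are hypothetical, chosen for this illustration.
anon_growth_mb() {
    before=$(anon_rss_kb "$1")
    after=$(anon_rss_kb "$2")
    echo $(( (after - before) / 1024 ))
}

# Demo: first snapshot uses the mappings shown above; the second is invented.
snap1=$(mktemp); snap2=$(mktemp)
cat > "$snap1" <<'EOF'
00007f8b2c000000 2097152 2097152 2097152 rw---   [ anon ]
00007f8b4c000000  524288  524288  524288 rw---   [ anon ]
EOF
cat > "$snap2" <<'EOF'
00007f8b2c000000 2097152 2097152 2097152 rw---   [ anon ]
00007f8b4c000000  626688  626688  626688 rw---   [ anon ]
EOF
anon_growth_mb "$snap1" "$snap2"    # prints 100
rm -f "$snap1" "$snap2"
```

In practice the snapshots would come from pmap -x 8821 > /tmp/pmap.1, a wait of ten minutes or so, then pmap -x 8821 > /tmp/pmap.2. Growth that repeats across several intervals is the pattern worth reporting.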

Check for open file descriptors accumulating — sometimes what looks like a memory leak is actually file handles or socket connections not being closed, each consuming a small amount of kernel memory.

# Count open file descriptors for the process
$ ls /proc/8821/fd | wc -l
4821
# 4,821 open file descriptors is suspicious for most apps.
# What are they?
$ lsof -p 8821 | awk '{print $5}' | sort | uniq -c | sort -rn | head -10
   4201 IPv4   ← Over 4000 open network connections, likely a connection leak
    300 REG    ← Regular files
    200 sock   ← Sockets

Document, report, and mitigate. A genuine memory leak is a code bug that requires a fix. In the meantime:

Document the growth rate — how long before it hits the danger zone? This defines your maintenance window.
Set up a scheduled restart — systemctl restart myapp.service via cron at off-peak hours buys time while the fix is developed.
Set a memory limit via systemd — MemoryMax=2G in the service unit will trigger an OOM kill of the leaking service (not the whole system) if it exceeds 2 GB, protecting other services.
Report to developers — include your pmap output, the growth rate log, and the lsof fd count. This is the evidence needed to find the leak.
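The first item, documenting the growth rate, can be turned into a number: how many hours until the process reaches a given limit. The sketch below estimates that from a log in the "HH:MM:SS N MB" format produced by the logging loop earlier in this scenario. The function name leak_eta is an assumption, the estimate is a straight line between the first and last sample, and it assumes the log does not cross midnight.

```shell
#!/bin/sh
# leak_eta LIMIT_MB: read "HH:MM:SS N MB" lines on stdin and estimate the
# linear growth rate plus the hours left until RSS reaches LIMIT_MB.
# Hypothetical helper; uses only the first and last sample, same-day timestamps.
leak_eta() {
    awk -v limit="$1" '
        {
            split($1, t, ":")
            secs = t[1] * 3600 + t[2] * 60 + t[3]
            if (NR == 1) { t0 = secs; m0 = $2 }
            t1 = secs; m1 = $2
        }
        END {
            hours = (t1 - t0) / 3600
            if (hours <= 0 || m1 <= m0) { print "no growth measured"; exit }
            rate = (m1 - m0) / hours
            printf "%.0f MB/h, %.1f h until %d MB\n", rate, (limit - m1) / rate, limit
        }
    '
}

# Demo on the trend shown earlier, against a 2 GB limit.
leak_eta 2048 <<'EOF'
09:00:00 204 MB
09:30:00 261 MB
10:00:00 318 MB
10:30:00 375 MB
EOF
# prints: 114 MB/h, 14.7 h until 2048 MB
```

With the real log the call would be leak_eta 2048 < /tmp/mem_growth.log. Here the answer (under 15 hours) says a nightly restart is not enough on its own if the limit is MemoryMax=2G, which is exactly the kind of fact the maintenance plan needs.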

Memory leaks in interpreted languages (Node.js, Python, Ruby) are often event listener accumulation, cache objects that grow without bounds, or closures keeping references alive. In C/C++ services, tools like Valgrind or AddressSanitizer are used during development to find the leaking allocation.

OOM Score — Who Gets Killed First

oom_score_adj	Label	What it means
-1000	Never kill	Completely protected from the OOM killer. Set for init/systemd and critical infrastructure. Use with extreme care: if this process leaks, the system will OOM-kill everything else first.
-500	Strongly protected	Good for databases (PostgreSQL, MySQL, Redis). Unlikely to be killed unless the system is truly desperate. Set via systemd OOMScoreAdjust or /proc/PID/oom_score_adj.
0	Default	All processes start here. The final oom_score is calculated from this base plus memory usage, so large-RSS processes end up with higher scores.
+500 to +1000	Kill me first	Useful for disposable worker processes or batch jobs: if the system runs low, kill this before touching production services. Systemd sets +100 for most user services by default on some distros.

Page Cache — When to Clear It (Rarely)

The page cache is self-managing. The kernel evicts the oldest, least-used pages automatically when processes need RAM. You almost never need to clear it manually on a production server.

Clearing the page cache on a production server causes a performance cliff. Every file access that was being served from RAM now hits disk — databases, web servers, and application servers all slow dramatically for minutes until the cache warms up again. Only clear it for benchmarking (to get a cold-cache baseline) or on a test system.

# If you genuinely need to clear the page cache (benchmarking, test systems ONLY):
$ sync                                         # Flush dirty pages to disk first
# Then write ONE of the following values:
$ echo 1 | sudo tee /proc/sys/vm/drop_caches   # 1 = page cache only
$ echo 2 | sudo tee /proc/sys/vm/drop_caches   # 2 = dentries + inodes
$ echo 3 | sudo tee /proc/sys/vm/drop_caches   # 3 = everything
# Effect is immediate but temporary: the cache rebuilds as soon as files are accessed again.
# This does NOT help with process memory (AnonPages). Only file cache is affected.

If your monitoring shows "used memory" jumping after a cache clear, that's normal — the graph is just showing the cache being rebuilt. The kernel is not "wasting" memory; it's making your system faster for the next time those files are read.

Quick Reference — Chapter 3 Commands

Command	Purpose	Key flags / notes
free -h	Memory and swap — human-readable. Watch "available", not "free".	`-s 2` refresh every 2 seconds
cat /proc/meminfo	Full kernel memory accounting — MemAvailable, AnonPages, Committed_AS, Slab	Most reliable source; `grep MemAvailable` for the key figure
vmstat 1	Watch si/so columns — non-zero = active swapping (bad)	wa column shows % CPU time waiting for I/O (includes swap I/O)
ps aux --sort=-%mem	Processes sorted by memory usage (RSS). Quick top-consumers list.	`| head -10` · RSS is in KB
smem -r -k	Per-process USS/PSS/RSS/Swap, more accurate than ps for real memory cost	`-r` reverse sort · `-k` human sizes · may need install
pmap -x PID	Memory map of one process, used to find growing anonymous mappings (leaks)	`| grep anon` to filter heap · `| tail` for totals
lsof -p PID	Open files and sockets for one process, used to spot connection or fd leaks	`| wc -l` for count · `| awk '{print $5}' | sort | uniq -c` for type breakdown
dmesg -T | grep -i oom	Find OOM kill events: shows the process killed, its RSS, and oom_score	Also: `journalctl -k | grep -i oom`
cat /proc/PID/oom_score	Current OOM kill score for a process (0–1000, higher = more likely to die)	Loop over all PIDs with `ps -eo pid` to find highest-scored processes
echo N | sudo tee /proc/PID/oom_score_adj	Adjust OOM kill priority: -1000 (immune) to +1000 (kill first)	Persistent via systemd `OOMScoreAdjust=` in unit file
sysctl vm.swappiness	Check swap eagerness (default 60). Lower = prefer reclaiming cache over swapping.	`sysctl -w vm.swappiness=10` to change · persist in `/etc/sysctl.d/`
watch -n 30 'ps -p PID -o rss='	Monitor RSS of one process every 30s to track memory leak growth rate	watch redraws the screen and does not log well; for a history use a `while ...; sleep 60; done` loop piped to `tee /tmp/mem.log`