Chapter 4 — Disk I/O & Storage Bottlenecks

Disk problems come in two distinct flavours that call for different approaches. The first is space: the filesystem is full and nothing can be written. The second is I/O performance: the disk has free space but is so busy that reads and writes queue up, slowing the whole system to a crawl. This chapter covers both, including the trap where df and du disagree because a process is still holding a deleted file open.

What this chapter covers:

Why df and du can disagree: the deleted-but-open file trap
Reading iostat's extended output
Scenario 1: root filesystem at 100%
Scenario 2: inode exhaustion, where the disk has space but files can't be created
Scenario 3: high I/O wait, and finding which process is hammering the disk
Log file management
ionice for throttling disk-intensive processes
Mount options that affect performance

df vs du — Why They Sometimes Disagree

df reads disk usage from the filesystem's own accounting (the superblock). du walks the directory tree and adds up file sizes. These two methods give the same answer unless a file has been deleted from the directory tree while a process still holds it open — in which case df still counts the space (the inode and its blocks are still allocated) while du doesn't (the directory entry is gone, so the walk misses it).

The Deleted-But-Open File Problem

Normal state:

    Directory entry  →  Inode  →  Data blocks on disk
    (du finds this)               (df counts this)

After "rm large_log.log" while nginx still has it open:

    ~~Directory entry~~  →  Inode  →  Data blocks on disk
    (du: file gone!)                  (df: still counted! inode still allocated)

Effect: df says the filesystem is 8 GB used. du / reports only 4 GB. The missing 4 GB = deleted-but-open files.

Fix: find the process holding the file open, then restart it or send it SIGHUP to reopen its log files. The OS then releases the inode and the space is reclaimed.
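The state described above is easy to reproduce safely in a scratch directory. The following is a minimal sketch, assuming Linux with /proc mounted; the function name demo_deleted_but_open and the use of file descriptor 9 are illustrative choices, not something from the scenario itself.

```shell
# Reproduce a deleted-but-open file in a throwaway directory (Linux, needs /proc).
# demo_deleted_but_open is a hypothetical helper name used only for this sketch.
demo_deleted_but_open() (
    dir=$(mktemp -d) || exit 1
    file="$dir/large_log.log"
    head -c 1048576 /dev/zero > "$file"   # 1 MiB stand-in for a log file
    exec 9< "$file"                       # hold the file open on fd 9
    rm "$file"                            # directory entry gone, inode still allocated
    readlink /proc/self/fd/9              # kernel reports the old path plus " (deleted)"
    exec 9<&-                             # closing the last handle releases the blocks
    rmdir "$dir"
)
```

Calling demo_deleted_but_open prints the temporary path followed by "(deleted)", which is the same marker lsof shows for such files.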

    # Step 1: spot the discrepancy
    $ df -h /
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1        50G   48G  2.0G  96% /

    $ du -sh /* 2>/dev/null | sort -h | tail -5
    2G      /opt
    4G      /home
    8G      /var
    12G     /usr
    # Total from du: ~26 GB. df says 48 GB used. ~22 GB unaccounted for.

    # Step 2: find deleted-but-open files and their size
    # ($NF is the "(deleted)" marker, so the path is the field before it)
    $ lsof | grep -i deleted | awk '{print $7, $1, $2, $(NF-1), $NF}' | sort -rn | head -10
    22548342912 nginx 1234 /var/log/nginx/access.log (deleted)
    4194304000 python 5678 /tmp/cache_dump.bin (deleted)
    # nginx is holding a 21 GB deleted log file open. That's the missing space.

    # Step 3a: send SIGHUP to nginx so it reopens its log files, releasing the old inode
    $ kill -HUP $(pgrep nginx)    # or: nginx -s reopen

    # Step 3b: if SIGHUP isn't enough, restart the service
    $ systemctl restart nginx

    # Verify space is reclaimed:
    $ df -h /
    Filesystem      Size  Used Avail Use%
    /dev/sda1        50G   26G   24G  52%
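To put a single number on the space held by deleted files, the lsof output can be totalled. This is a sketch under stated assumptions: it expects lsof's classic nine-column layout (COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME) on standard input, and sum_deleted_bytes is a hypothetical helper name. Because several processes (for example nginx workers) often hold the same file, it counts each device/inode pair only once.

```shell
# Total the bytes held by deleted-but-open regular files.
# Assumes the classic nine-column lsof layout; extra columns (e.g. TID) would
# shift the field numbers. sum_deleted_bytes is a hypothetical helper name.
sum_deleted_bytes() (
    awk '
        # $5 = TYPE, $6 = DEVICE, $7 = SIZE/OFF, $8 = NODE, $NF = "(deleted)" marker
        $5 == "REG" && $NF == "(deleted)" && !seen[$6 "/" $8]++ { total += $7 }
        END { printf "%.0f\n", total }
    '
)
```

A possible invocation is lsof -nP 2>/dev/null | sum_deleted_bytes; the result is in bytes and should roughly match the gap between df and du.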

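Most well-behaved services (nginx, Apache, PostgreSQL, syslog daemons) can reopen their log files on a signal without a full restart. SIGHUP is the common convention, but the exact signal and its side effects vary: nginx, for example, treats SIGHUP as "reload configuration" (which also reopens logs) and uses SIGUSR1, sent by nginx -s reopen, for a pure log reopen, so check the service's documentation before relying on it. Log rotation uses the same mechanism: logrotate renames the log file, then its postrotate script signals the service, and the service starts writing to a new file at the original path. The old file (now renamed) can then be compressed or deleted.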

Reading iostat — The Columns That Matter

iostat -x 1 gives per-device I/O statistics updated every second. It's the primary tool for identifying which disk is saturated and whether the problem is reads, writes, throughput, or latency.

    $ iostat -xh 1
    Device      r/s     w/s   rkB/s    wkB/s  await  r_await  w_await  aqu-sz  %util
    sda        12.0     8.0   480.0    320.0    2.1      1.8      2.5    0.04    3.2
    nvme0n1    45.0   320.0  1800.0  12800.0    4.8      2.1      5.2    0.82   42.0
    sdb         2.0  1840.0     8.0  73600.0  180.0     12.0    182.0    24.8   98.6
    # sda: healthy, with low utilisation and low await
    # nvme0n1: busy but not saturated. 42% util, await <5ms is fine for NVMe
    # sdb: SATURATED. 98.6% util, queue depth 24.8, await 180ms (should be <10ms on SSD)

Column	Name	What it measures	Warning threshold
%util	Utilisation	Percentage of time the device was busy. Near 100% the device is saturated and I/O requests are queuing.	Investigate above 80% sustained on HDD, or 95%+ on SSD
await	Average I/O wait (ms)	Average time from I/O request submission to completion. Includes queue time plus actual disk service time.	HDD: >20 ms. SSD: >5 ms. NVMe: >1 ms. Consistently above these = saturation
aqu-sz	Average queue depth	Average number of I/O requests waiting in the device queue. The clearest saturation signal.	Sustained above 1–2 on a single disk = more requests than the device can handle
r/s, w/s	Operations per second	Read and write IOPS. Spinning disks are limited to ~100–200 IOPS. SSDs handle 10,000–100,000+. Tells you whether the workload is I/O-count bound.	HDD: r/s + w/s consistently near 150–200 = IOPS saturated
rkB/s, wkB/s	Throughput (KB/s)	Actual data transferred per second. Compare against the device's rated throughput. Large sequential files show high kB/s with low IOPS.	HDD: above ~100–150 MB/s. SSD: above rated speed. Sustained = throughput bound
r_await, w_await	Read and write await, separately	Splits await into read-side and write-side. If w_await is high but r_await is fine, writes are queuing. Useful for mixed workloads.	Asymmetry between r_await and w_await points to write-heavy saturation
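The %util and aqu-sz thresholds above can be applied mechanically to iostat output. The following is a rough sketch, not a monitoring tool: flag_saturated is a hypothetical helper, the default limits (80% utilisation, queue depth 2) are taken from the thresholds above, and it assumes the plain iostat -x layout in which each report has a header line starting with "Device" and the device name in the first column. Older sysstat releases name the queue column avgqu-sz, which this sketch does not handle.

```shell
# Print the names of devices whose %util or aqu-sz exceed the given limits.
# Usage: iostat -x 1 3 | flag_saturated [util_limit] [queue_limit]
# flag_saturated is a hypothetical helper; columns are located by header name.
flag_saturated() (
    awk -v util_max="${1:-80}" -v queue_max="${2:-2}" '
        $1 ~ /^Device:?$/ {                      # header line: remember column positions
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        ("%util" in col) && ("aqu-sz" in col) && NF >= col["%util"] && NF >= col["aqu-sz"] {
            if ($(col["%util"]) + 0 > util_max || $(col["aqu-sz"]) + 0 > queue_max)
                print $1
        }
    '
)
```

Because the first iostat report shows averages since boot, it is usually worth reading several intervals and piping the result through sort -u.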

iotop — Finding Which Process Is Doing the I/O

iostat tells you the device is saturated. iotop tells you who is responsible. It requires root (or CAP_NET_ADMIN) to read kernel I/O accounting.

    $ iotop -o    # -o = only show processes with active I/O
    Total DISK READ: 1.20 M/s | Total DISK WRITE: 73.60 M/s
     TID  PRIO  USER      DISK READ  DISK WRITE  SWAPIN      IO>  COMMAND
    9821  be/4  mysql      0.00 B/s   68.40 M/s  0.00 %  98.72 %  mysqld
    4812  be/4  www-data   1.20 M/s    5.20 M/s  0.00 %   1.02 %  php-fpm: pool www
    # mysqld is responsible for 68 MB/s of writes and 98.7% of I/O time
    # IO> column shows % of time the process was blocked waiting on I/O

    # Batch mode: useful for logging or running from scripts
    $ iotop -o -b -n 5    # -b = batch mode, -n 5 = 5 iterations then exit

    # Once you have the PID, find what files it's accessing
    $ lsof -p 9821 | grep -E "REG|DIR" | head -20
    mysqld 9821 mysql 4u REG 8,2 2147483648 /var/lib/mysql/ibdata1
    mysqld 9821 mysql 5u REG 8,2  524288000 /var/lib/mysql/ib_logfile0
    # InnoDB is writing heavily to its redo log. That is normal under high write load,
    # but if await is 180ms, either the disk is inadequate or innodb_flush_log_at_trx_commit=1
    # is flushing on every transaction (safe but expensive)

Scenario 1 — Root Filesystem at 100%

Confirm which filesystem is full and check for the df/du discrepancy first. If they disagree, a deleted-but-open file may be the cause — the fastest fix with no risk of deleting the wrong thing.

    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1        50G   50G     0 100% /
    /dev/sdb1       500G  120G  380G  24% /data

    # $NF is the "(deleted)" marker, so the path is the field before it
    $ lsof | grep -i deleted | awk '{printf "%s MB\t%s\t%s %s\n", $7/1048576, $1, $(NF-1), $NF}' | sort -rn | head -5
    21504 MB    nginx    /var/log/nginx/access.log (deleted)
    # Found it immediately. nginx is holding 21 GB of deleted log space.

    $ kill -HUP $(pgrep -x nginx)
    $ df -h /
    Filesystem      Size  Used Avail Use%
    /dev/sda1        50G   29G   21G  58%    ← Immediate recovery, no deletions needed

If lsof shows nothing significant, find the largest directories. Work top-down — start at root, drill into the biggest directory at each level.

    $ du -sh /* 2>/dev/null | sort -h
    4G      /home
    12G     /usr
    38G     /var          # ← biggest, investigate this next

    $ du -sh /var/* 2>/dev/null | sort -h
    1G      /var/cache
    2G      /var/lib
    35G     /var/log      # ← 35 GB in logs

    $ du -sh /var/log/* 2>/dev/null | sort -h | tail -10
    1G      /var/log/syslog
    30G     /var/log/myapp    # ← application log directory

    $ ls -lh /var/log/myapp/
    -rw-r--r-- 1 app app 30G Jun 14 14:22 debug.log
    # Debug logging left enabled in production. Common cause.

Use ncdu for interactive exploration — faster than repeated du commands when you don't know where to look.

    # ncdu may need: apt install ncdu / yum install ncdu
    $ ncdu /var
    # Opens a curses interface. Arrow keys navigate, Enter descends,
    # d deletes (with confirmation), q quits.
    # Shows size of each directory sorted largest first, which is much faster than du for exploration.

Safe quick wins to recover space — in order of risk:

Zero risk: Clear the systemd journal — journalctl --vacuum-size=500M
Zero risk: Clear package manager cache — apt clean or yum clean all
Zero risk: Remove old kernel packages — apt autoremove (keeps current kernel)
Low risk: Truncate (not delete) the offending log, leaving the file handle intact — truncate -s 0 /var/log/myapp/debug.log
Check before removing: Core dumps in /var/crash or /tmp — find /tmp /var/crash -name "core*" -size +100M
Check before removing: Old log rotations — find /var/log -name "*.gz" -mtime +30

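Truncate vs delete: why truncate is safer for open log files. If you rm a log file that a running process has open, the process continues writing to the now-deleted inode and the space isn't freed (the deleted-but-open problem again). truncate -s 0 discards the file's content while leaving the inode intact, so the process's file handle remains valid and the space is freed immediately. One caveat: if the process did not open the file in append mode, its next write lands at the old offset and the file becomes sparse, so ls -l shows the old size again even though du reports only the newly written blocks.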

    # Safe: truncate the file while the process keeps writing to it
    $ truncate -s 0 /var/log/myapp/debug.log
    # The file now has 0 bytes. The running process continues writing to it normally.
    # Alternatively: > /var/log/myapp/debug.log (shell redirection truncate)

    # Unsafe: rm while process is writing
    # rm /var/log/myapp/debug.log    ← process keeps the deleted inode open; space not freed
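The claim that an open handle survives truncation can be checked in a scratch directory. This is a minimal sketch assuming GNU coreutils (truncate, head -c) and a writer that opens its log in append mode, as most loggers do; demo_truncate_open_file and file descriptor 8 are illustrative names only.

```shell
# Show that a writer's open handle keeps working after truncate -s 0.
# demo_truncate_open_file is a hypothetical helper used only for this sketch.
demo_truncate_open_file() (
    dir=$(mktemp -d) || exit 1
    file="$dir/debug.log"
    head -c 65536 /dev/zero > "$file"   # 64 KiB of existing "log" data
    exec 8>> "$file"                    # the writer holds the file open in append mode
    truncate -s 0 "$file"               # discard the content, keep the same inode
    printf 'hello\n' >&8                # the old handle is still valid
    exec 8>&-
    size=$(wc -c < "$file")
    rm -r "$dir"
    echo "$((size))"                    # bytes in the file after truncate + one write
)
```

The function prints the file size after the post-truncate write: only the six bytes of the new line remain, rather than the original 64 KiB.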

Root cause and prevention. After recovering space, address why it filled up:

Set up logrotate for application logs (or ensure it's configured correctly)
Disable debug logging in production application config
Set a journal size cap: SystemMaxUse=500M in /etc/systemd/journald.conf
Set up disk space monitoring alerts (before it reaches 100%)
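As a starting point for the monitoring item above, a threshold check can be built on df's POSIX output format. This is only a sketch: over_threshold is a hypothetical helper, the 90% default is an arbitrary assumption, and mount points containing spaces are not handled. It reads df -P style output on standard input and prints each mount point at or above the limit.

```shell
# Print mount points whose usage is at or above a percentage limit.
# Usage: df -P | over_threshold 90      (space)
#        df -Pi | over_threshold 90     (inodes; same column positions with GNU df)
# over_threshold is a hypothetical helper name.
over_threshold() (
    awk -v limit="${1:-90}" '
        NR > 1 {                         # skip the header line
            use = $5                     # Capacity / IUse% column, e.g. "96%"
            sub(/%$/, "", use)
            if (use + 0 >= limit) print $6, use "%"
        }
    '
)
```

Run from cron or a monitoring agent, any output from this function would be the trigger for an alert; a real deployment would more likely use the monitoring system's built-in disk checks.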

On systems where / is separate from /var, /tmp, and /home, a full /var won't affect the system's ability to run — only to write logs and package updates. On single-partition systems, a full / can prevent logins and crash running services. Always know your partition layout before an incident.

Scenario 2 — Inode Exhaustion

Check inode usage immediately — this is the tell. Inodes are pre-allocated metadata slots (one per file or directory). When they run out, no new files can be created even if disk space is available.

    $ df -i
    Filesystem      Inodes   IUsed    IFree IUse% Mounted on
    /dev/sda1      6553600 6553600        0  100% /
    /dev/sdb1      6553600  412000  6141600    6% /data
    # / has zero inodes remaining. This explains the "No space left on device" error despite available disk space.

Find which directory contains the most files. Inode exhaustion is almost always caused by one directory containing millions of tiny files — mail queues, PHP session files, cache directories, or a logging system that creates one file per event.

    # Count files per top-level directory (can take a while on a full system)
    $ for dir in /*; do
          count=$(find "$dir" -xdev 2>/dev/null | wc -l)
          echo "$count $dir"
      done | sort -rn | head -10
    5821432 /var
    412000 /usr
    98000 /home

    # Drill into /var
    $ find /var -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -10
    5819200 /var/spool/postfix/deferred
       1200 /var/log
    # The postfix deferred queue has 5.8 million messages. Classic inode exhaustion cause.

Recovery — delete the tiny files. With millions of files, rm -rf /path/* fails ("argument list too long"). Use find instead:

    # For a mail queue: flush or delete deferred messages
    $ postsuper -d ALL deferred    # Postfix: delete all deferred messages properly

    # Generic: delete files in a directory with too many for rm to handle
    $ find /var/spool/postfix/deferred -type f -delete

    # For PHP session files (another common culprit):
    $ find /var/lib/php/sessions -type f -mtime +1 -delete
    # -mtime +1 = only files older than 1 day, which leaves active sessions intact

    # Verify recovery:
    $ df -i /
    Filesystem      Inodes  IUsed    IFree IUse%
    /dev/sda1      6553600 734400  5819200   11%

Prevention. Add inode monitoring to your alerting alongside disk space — many monitoring systems skip it. Also consider: if your application legitimately creates millions of small files, use a directory structure that spreads them across subdirectories (e.g., by first two characters of the filename: ca/cache_abc123) rather than a flat directory. Some filesystems (ext4 with dir_index) handle large directories better than others.
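The subdirectory layout mentioned above (first two characters of the filename) can be sketched in a few lines of POSIX shell. shard_path is a hypothetical helper name, and the "_" fallback for names shorter than two characters is an assumption of this sketch; applications with poorly distributed filenames often shard on a hash of the name instead.

```shell
# Map a filename to a two-character shard directory: cache_abc123 -> ca/cache_abc123
# shard_path is a hypothetical helper name.
shard_path() (
    name=$1
    rest=${name#??}              # name without its first two characters
    prefix=${name%"$rest"}       # the first two characters
    [ -n "$prefix" ] || prefix=_ # fallback bucket for names shorter than two characters
    printf '%s/%s\n' "$prefix" "$name"
)
```

For example, shard_path cache_abc123 prints ca/cache_abc123, so a caller could create the parent directory with mkdir -p before writing the file.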

You cannot add inodes to an existing ext4 filesystem without reformatting. The inode count is set at creation time with mkfs.ext4 -i bytes-per-inode. If inode exhaustion is a recurring problem on a partition you can't reformat, consider moving the high-file-count directory to its own partition or filesystem (like tmpfs for session files).

Scenario 3 — High I/O Wait, Server Sluggish

Confirm it's disk I/O wait and not NFS or swap. High wa covers all uninterruptible I/O — disk, NFS mounts, and swap all contribute.

    $ vmstat 1 5
     r  b  swpd  free  buff  cache  si  so    bi     bo    in    cs  us  sy  id  wa
     1  4     0  2.1G  200M   4.2G   0   0  1200  73800  2100  3400   8   4  43  45
    # b=4: four processes in uninterruptible sleep (D state)
    # si=0 so=0: no swap activity, so not a memory problem
    # bo=73800 KB/s: 72 MB/s of writes, which is high for a spinning disk
    # id=43, wa=45: CPU is idle nearly half the time, but blocked on I/O the other half
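When the same check has to be scripted, the wa and b values can be pulled out of vmstat output by column name rather than by position, since newer vmstat versions append extra columns after wa. This is a small sketch; vmstat_field is a hypothetical helper, and it simply returns the value from the last sample it sees.

```shell
# Print the value of a named vmstat column from the last sample on stdin.
# Usage: vmstat 1 5 | vmstat_field wa
# vmstat_field is a hypothetical helper name.
vmstat_field() (
    awk -v want="$1" '
        {   # header line: find which column carries the requested name
            for (i = 1; i <= NF; i++) if ($i == want) { col = i; next }
        }
        col && NF >= col && $col ~ /^[0-9.]+$/ { last = $col }   # numeric sample line
        END { if (last != "") print last }
    '
)
```

A wrapper could then compare the result against a limit, for instance treating sustained wa above 20 as worth investigating; that limit is an illustrative assumption, not a rule.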

Identify the saturated device with iostat.

    $ iostat -xh 1 3
    Device      r/s     w/s   rkB/s    wkB/s  await  aqu-sz  %util
    nvme0n1     8.0    12.0   320.0    480.0    0.9    0.02    2.1
    sdb         2.0  1840.0     8.0  73600.0  165.0    24.4   98.8
    # sdb: the spinning HDD. 98.8% utilised, 165ms await (should be <20ms), queue 24.
    # 73 MB/s is below its ~100-150 MB/s limit for sequential writes, but 1,840 writes/s
    # (about 40 KB each) is far beyond the ~100-200 IOPS a spinning disk can sustain
    # for random writes, so the device is completely saturated.
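Dividing throughput by operation count gives the average request size, which helps tell a few large sequential transfers from many small ones. The helper below is a sketch with a hypothetical name, avg_request_kb; recent sysstat versions also report this directly in the rareq-sz and wareq-sz columns of iostat -x.

```shell
# Average request size in KB: throughput (kB/s) divided by operations per second.
# Usage: avg_request_kb <kB_per_s> <ops_per_s>
# avg_request_kb is a hypothetical helper name.
avg_request_kb() (
    awk -v kb="$1" -v ops="$2" 'BEGIN {
        if (ops + 0 > 0) printf "%.1f\n", kb / ops
        else print "n/a"          # no operations in the interval
    }'
)
```

With the sdb figures above, avg_request_kb 73600 1840 prints 40.0, that is, roughly 40 KB per write.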

Find which process is responsible with iotop.

    $ iotop -o
    Total DISK READ: 8.0 KB/s | Total DISK WRITE: 73.60 MB/s
      TID  PRIO  USER      DISK READ  DISK WRITE    IO>  COMMAND
    12841  be/4  backup     0.00 B/s  73.50 MB/s  96.2%  rsync -av /data /mnt/backup
     4812  be/4  www-data  8.00 KB/s   0.10 MB/s   1.8%  php-fpm
    # The rsync backup job is consuming essentially all disk write bandwidth.
    # Its reads from /data (on nvme0n1) are cheap; its writes land on sdb (the backup mount)
    # and saturate it, so anything else that touches sdb queues behind rsync.

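Throttle the backup job with ionice without stopping it. ionice changes a process's I/O scheduling class and priority, the disk equivalent of renice for CPU. I/O priorities only take effect when the device's I/O scheduler honours them (CFQ on older kernels, BFQ on current ones); with mq-deadline or none the effect may be small or absent, so check cat /sys/block/sdb/queue/scheduler if ionice appears to change nothing.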

    # Set rsync to idle I/O class: it only gets disk access when nothing else needs it
    $ ionice -c 3 -p 12841
    # Class 3 = idle: process gets I/O only when no other process wants the disk
    # The backup will slow down significantly but production I/O is protected

    # Alternatively: best-effort with low priority (0=high, 7=low)
    $ ionice -c 2 -n 7 -p 12841

    # For future backups, launch rsync with low priority from the start:
    $ ionice -c 3 nice -n 19 rsync -av /data /mnt/backup

    # Verify the change reduced I/O wait:
    $ vmstat 1 3
     r  b  swpd  free  si  so   bi    bo  wa
     0  0     0  2.3G   0   0  800  8400   5    ← wa dropped from 45% to 5%
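For jobs started from cron or scripts, the low-priority launch can be wrapped so it degrades gracefully on hosts where ionice is missing or not permitted. This is a sketch with a hypothetical function name, run_low_priority; the fallback to plain nice is a design assumption of this example.

```shell
# Run a command at idle I/O class and lowest CPU priority when possible.
# Usage: run_low_priority rsync -av /data /mnt/backup
# run_low_priority is a hypothetical helper name.
run_low_priority() {
    # Probe first: ionice may be absent, or setting the idle class may be refused.
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        ionice -c 3 nice -n 19 "$@"
    else
        nice -n 19 "$@"            # CPU priority only
    fi
}
```

The wrapper passes the command's output and exit status through unchanged, so it can be dropped in front of an existing backup command.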

If the I/O comes from a production service rather than a background job, the disk is probably undersized for the workload. Options to investigate:

Move the high-I/O service to an NVMe SSD if it's currently on HDD
Add a read cache (bcache, lvmcache) to the slow disk
For databases: tune innodb_buffer_pool_size (MySQL) or shared_buffers (PostgreSQL) to serve more reads from RAM and reduce disk reads
For write-heavy workloads: check if sync() / fsync() is being called too frequently — sometimes a config change (careful: tradeoff with durability) can dramatically reduce write IOPS

Always run iotop at the same time as iostat when diagnosing I/O — iostat identifies the saturated device, iotop identifies the responsible process. Neither alone gives the full picture.

Log Files — The Most Common Disk-Space Culprit

Logs fill disks more reliably than almost anything else. The systemd journal alone can consume dozens of gigabytes without size limits configured.

    # How much space is the systemd journal using?
    $ journalctl --disk-usage
    Archived and active journals take up 8.3G in the filesystem.

    # Immediate trim: keep only the most recent 500 MB
    $ journalctl --vacuum-size=500M

    # Or: keep only the last 2 weeks
    $ journalctl --vacuum-time=2weeks

    # Make limits permanent in /etc/systemd/journald.conf
    # (the trailing annotations are for this listing only; in the file itself,
    # comments must be on their own lines)
    $ grep -E "SystemMaxUse|MaxRetention" /etc/systemd/journald.conf
    SystemMaxUse=500M         # Hard cap on journal disk usage
    MaxRetentionSec=2week     # Auto-delete entries older than 2 weeks
    $ systemctl restart systemd-journald

    # Find large traditional log files
    $ find /var/log -type f -size +100M -exec ls -lh {} \;
    -rw-r--r-- 1 root root  30G Jun 14 /var/log/myapp/debug.log
    -rw-r--r-- 1 root root 4.2G Jun 12 /var/log/syslog

    # Check logrotate configuration for an application
    $ cat /etc/logrotate.d/nginx
    /var/log/nginx/*.log {
        daily                  # Rotate daily
        missingok              # Don't error if log is missing
        rotate 14              # Keep 14 days of logs
        compress               # Compress old logs (.gz)
        delaycompress          # Compress previous rotation, not current
        notifempty             # Don't rotate empty logs
        sharedscripts
        postrotate
            nginx -s reopen    # Signal nginx to reopen log files after rotation
        endscript
    }

    # Force an immediate logrotate run (for testing or emergency)
    $ logrotate -f /etc/logrotate.d/nginx

Mount Options That Affect Performance

The options used when mounting a filesystem can make a measurable difference, particularly for read-heavy or small-file workloads. They're set in /etc/fstab or via the mount command.

Option	Meaning	What it does	Impact
noatime	No access time updates	Under the traditional strictatime behaviour, Linux updates the "last accessed" timestamp on every file read, so on read-heavy workloads every read also triggers a metadata write. noatime disables atime updates entirely.	Significant: up to 30% I/O reduction on read-heavy small-file workloads. Safe for most servers. May affect backup tools that use atime to detect changed files.
relatime	Relative access time	A compromise: atime is updated only if the previous atime is older than mtime or ctime (the file was modified or changed since it was last read), or is more than 24 hours old. This is the default on modern Linux, so most systems already have this behaviour without explicitly setting it.	Moderate. Good balance: keeps atime meaningful for tools that use it while eliminating most unnecessary write traffic. The safe default.
errors=remount-ro	Remount read-only on error	If the filesystem encounters an error, remount it read-only instead of continuing to allow writes that could corrupt data. Common default for ext4.	Safety improvement with no performance effect. Prevents data corruption on disk errors. Recommended for all production filesystems.
nobarrier	Disable write barriers	Write barriers ensure data is physically on disk before acknowledging a write, protecting against corruption on power loss. Disabling them can improve write performance but risks data loss on power failure.	Dangerous on real hardware. Only safe on VMs where the hypervisor provides equivalent guarantees. Do not use on physical servers without a battery-backed write cache.

    # Check current mount options for a filesystem
    $ findmnt -o TARGET,OPTIONS /
    TARGET OPTIONS
    /      rw,relatime,errors=remount-ro

    # Add noatime to /etc/fstab for the root filesystem:
    $ grep " / " /etc/fstab
    UUID=abc123  /  ext4  rw,relatime,errors=remount-ro  0 1
    # Change relatime to noatime:
    UUID=abc123  /  ext4  rw,noatime,errors=remount-ro  0 1

    # Remount immediately without rebooting:
    $ mount -o remount,noatime /
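To audit several hosts or mount points, the atime policy can be read out of the options string programmatically. The function below is a small sketch with a hypothetical name, atime_mode; it only inspects the comma-separated options it is given and does not query the kernel itself.

```shell
# Report which atime policy appears in a comma-separated mount options string.
# Usage: atime_mode "$(findmnt -no OPTIONS /)"
# atime_mode is a hypothetical helper name.
atime_mode() {
    case ",$1," in                       # pad with commas so only whole options match
        *,noatime,*)     echo noatime ;;
        *,relatime,*)    echo relatime ;;
        *,strictatime,*) echo strictatime ;;
        *)               echo unknown ;;
    esac
}
```

For the root filesystem shown above, atime_mode "rw,relatime,errors=remount-ro" prints relatime.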

Quick Reference — Chapter 4 Commands

Command	Purpose	Key flags / notes
df -h	Disk space per filesystem — human-readable	`-i` inode usage · `-T` show filesystem type · any filesystem at 100% = problem
df -i	Inode usage per filesystem — check when "no space" despite available space	100% inodes = no new files can be created regardless of disk space
du -sh /* 2>/dev/null \| sort -h	Size of each top-level directory, sorted smallest to largest	Drill down iteratively: `du -sh /var/* \| sort -h` etc.
ncdu /path	Interactive disk usage explorer — navigate, sort, delete	Much faster than repeated du. May need: `apt install ncdu`
lsof \| grep deleted	Find deleted-but-open files still consuming disk space	Pipe to `awk '{print $7, $1, $(NF-1)}' \| sort -rn` to show size + command + filename (`$NF` is the "(deleted)" marker)
truncate -s 0 file	Zero a log file's content while leaving file handle intact	Safer than rm for files held open by running processes
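kill -HUP PID	Signal a service to reopen its log files after rotation	Works for syslog daemons and many well-behaved services, but the signal varies: nginx uses `nginx -s reopen` (SIGUSR1) for a pure log reopen, so check the service's documentation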
iostat -xh 1	Per-device I/O stats — find saturated devices	Watch %util (saturation), await (latency ms), aqu-sz (queue depth)
iotop -o	Per-process I/O — find which process is hammering the disk	`-b -n 5` batch mode · `-a` accumulated totals · requires root
ionice -c 3 -p PID	Set process to idle I/O class — only gets disk when nothing else needs it	`-c 2 -n 7` = best-effort low priority · use for backup jobs, batch work
find /dir -type f -delete	Delete all files in a directory when rm -rf fails (too many args)	Add `-mtime +N` to only delete files older than N days
journalctl --disk-usage	Show how much disk the systemd journal is using	`--vacuum-size=500M` trim immediately · `--vacuum-time=2weeks` by age
logrotate -f /etc/logrotate.d/X	Force an immediate logrotate run for a specific application	Check config with `logrotate --debug`
findmnt -o TARGET,OPTIONS /	Show current mount options for a filesystem	Add `noatime` in /etc/fstab + `mount -o remount,noatime /` to apply live