Disk I/O & Storage Bottlenecks

Chapter 4 — Disk I/O & Storage Bottlenecks

Disk problems come in two distinct flavours that require completely different approaches. The first is space: the filesystem is full and nothing can be written. The second is I/O performance: the disk has space but is so busy that read and write operations queue up and the whole system slows to a crawl. This chapter covers both, including the trap where df and du disagree because a running process is still holding a deleted file open.

What this chapter covers:
  • Why df and du can disagree: the deleted-but-open file trap
  • Reading iostat's extended output
  • Scenario 1: root filesystem at 100%
  • Scenario 2: inode exhaustion, where the disk has space but files can't be created
  • Scenario 3: high I/O wait, and finding which process is hammering the disk
  • Log file management
  • ionice for throttling disk-intensive processes
  • Mount options that affect performance

df vs du — Why They Sometimes Disagree

df reads disk usage from the filesystem's own accounting (the superblock). du walks the directory tree and adds up file sizes. These two methods give the same answer unless a file has been deleted from the directory tree while a process still holds it open — in which case df still counts the space (the inode and its blocks are still allocated) while du doesn't (the directory entry is gone, so the walk misses it).

The Deleted-But-Open File Problem

Normal state:
    Directory entry  →  Inode  →  Data blocks on disk
    (du finds this)     (df counts this)

After "rm large_log.log" while nginx still has it open:
    [entry removed]  →  Inode  →  Data blocks on disk
    (du: file gone!)    (df: still counted! inode still allocated)

Effect: df says the filesystem has 8 GB used. du / reports only 4 GB. The missing 4 GB is deleted-but-open files.

Fix: find the process holding the file open, then restart it or send it SIGHUP to reopen its log files. The OS then releases the inode and the space is reclaimed.
    # Step 1: spot the discrepancy
    $ df -h /
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1        50G   48G  2.0G  96% /

    $ du -sh /* 2>/dev/null | sort -rh | head -5
    12G  /usr
    8G   /var
    4G   /home
    2G   /opt
    # Total from du: ~26 GB. df says 48 GB used. ~22 GB unaccounted for.

    # Step 2: find deleted-but-open files and their size
    # (lsof prints "(deleted)" as the last field, so the path is the field before it)
    $ lsof | grep -i deleted | awk '{print $7, $1, $2, $(NF-1), $NF}' | sort -rn | head -10
    22548342912 nginx 1234 /var/log/nginx/access.log (deleted)
    4194304000 python 5678 /tmp/cache_dump.bin (deleted)
    # nginx is holding a 21 GB deleted log file open. That's the missing space.

    # Step 3a: send SIGHUP to nginx so it reopens its log files, releasing the old inode
    $ kill -HUP $(pgrep nginx)      # or: nginx -s reopen

    # Step 3b: if SIGHUP isn't enough, restart the service
    $ systemctl restart nginx

    # Verify space is reclaimed:
    $ df -h /
    Filesystem      Size  Used Avail Use%
    /dev/sda1        50G   26G   24G  52%
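Most well-behaved services can reopen their log files on a signal, commonly SIGHUP, without a full restart. The exact signal and its side effects vary by service, so check the documentation: nginx, for example, reopens logs on SIGUSR1 (which is what nginx -s reopen sends) and treats SIGHUP as a configuration reload that also reopens them. This is the mechanism logrotate relies on: it renames the log file, runs the configured postrotate command (typically a signal or reload), and the service starts writing to a new file at the original path. The old file (now renamed) can then be compressed or deleted.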
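The per-file listing above can also be rolled up per command to show who is pinning the most deleted space. This is a minimal sketch, not a definitive tool: the function name deleted_space_by_command is hypothetical, and it assumes lsof's default column layout, where field 7 is SIZE/OFF and unlinked files end with a "(deleted)" marker.

```shell
# Sum the size of deleted-but-open files per command name.
# Reads lsof-style output on stdin; assumes field 7 is SIZE/OFF (default layout).
deleted_space_by_command() {
    awk '
        $NF == "(deleted)" && $7 ~ /^[0-9]+$/ { bytes[$1] += $7 }
        END {
            for (cmd in bytes)
                printf "%.1f GiB\t%s\n", bytes[cmd] / (1024 * 1024 * 1024), cmd
        }' | sort -rn
}

# Typical use on a live system (root is needed to see every process):
#   lsof -nP 2>/dev/null | deleted_space_by_command
```

A file that is open in several processes (for example an nginx master and its workers) appears once per descriptor, so treat the totals as an upper bound rather than an exact figure.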

Reading iostat — The Columns That Matter

iostat -x 1 gives per-device I/O statistics updated every second. It's the primary tool for identifying which disk is saturated and whether the problem is reads, writes, throughput, or latency.

    $ iostat -xh 1
    Device      r/s     w/s   rkB/s    wkB/s  await  r_await  w_await  aqu-sz  %util
    sda        12.0     8.0   480.0    320.0    2.1      1.8      2.5    0.04    3.2
    nvme0n1    45.0   320.0  1800.0  12800.0    4.8      2.1      5.2    0.82   42.0
    sdb         2.0  1840.0     8.0  73600.0  180.0     12.0    182.0    24.8   98.6

    # sda: healthy. Low utilisation, low await
    # nvme0n1: busy but not saturated. 42% util, await <5ms is fine for NVMe
    # sdb: SATURATED. 98.6% util, queue depth 24.8, await 180ms (should be <10ms on SSD)
%util (Utilisation)
  Percentage of time the device was busy. At 100% the device is saturated and I/O requests are queuing. On devices that serve requests in parallel (SSDs, NVMe, RAID arrays), a high %util alone is weaker evidence, so read it together with await and aqu-sz.
  Investigate when: above 80% sustained on HDD, or 95%+ on SSD.

await (Average I/O wait, ms)
  Average time from I/O request submission to completion. Includes queue time plus actual disk service time.
  Investigate when: HDD >20ms, SSD >5ms, NVMe >1ms. Consistently above these points to saturation.

aqu-sz (Average queue depth)
  Average number of I/O requests waiting in the device queue. The clearest saturation signal.
  Investigate when: sustained above 1-2 on a single disk, which means more requests than the device can handle.

r/s, w/s (Operations per second)
  Read and write IOPS. Spinning disks are limited to ~100-200 IOPS. SSDs handle 10,000-100,000+. Tells you whether the workload is bound by I/O count.
  Investigate when: on HDD, r/s + w/s is consistently near 150-200 (IOPS saturated).

rkB/s, wkB/s (Throughput, KB/s)
  Actual data transferred per second. Compare against the device's rated throughput. Large sequential files show high kB/s with low IOPS.
  Investigate when: HDD above ~100-150 MB/s, or SSD above its rated speed. Sustained means throughput bound.

r_await / w_await (Read and write await, separately)
  Splits await into read-side and write-side. If w_await is high but r_await is fine, writes are queuing. Useful for mixed workloads.
  Investigate when: w_await is far higher than r_await (write-side saturation), or the reverse for read-heavy load.
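The thresholds above can be applied mechanically to iostat output. The sketch below is one hedged way to do it: flag_saturated is a hypothetical helper, its defaults (80% utilisation, queue depth 2) are taken from the guidance above, and it expects plain iostat -x output with the Device column first (so without -h on newer sysstat releases).

```shell
# Print devices whose %util and aqu-sz both meet or exceed the given thresholds.
# Reads `iostat -x` output on stdin and locates columns by header name,
# because column order differs between sysstat versions.
flag_saturated() {
    awk -v util_min="${1:-80}" -v queue_min="${2:-2}" '
        $1 ~ /^Device/ {
            u = 0; q = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "%util")  u = i
                if ($i == "aqu-sz") q = i
            }
            next
        }
        u && q && NF >= u && NF >= q && $u + 0 >= util_min && $q + 0 >= queue_min {
            printf "%s util=%s%% queue=%s\n", $1, $u, $q
        }'
}

# Usage: sample three one-second intervals and flag anything saturated
#   iostat -x 1 3 | flag_saturated 80 2
```

Requiring both signals at once keeps a busy but healthy NVMe device from being flagged on utilisation alone.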

iotop — Finding Which Process Is Doing the I/O

iostat tells you the device is saturated. iotop tells you who is responsible. It requires root (or CAP_NET_ADMIN) to read kernel I/O accounting.

    $ iotop -o      # -o = only show processes with active I/O
    Total DISK READ: 1.20 M/s | Total DISK WRITE: 73.60 M/s
      TID  PRIO  USER      DISK READ  DISK WRITE  SWAPIN      IO>  COMMAND
     9821  be/4  mysql      0.00 B/s   68.40 M/s  0.00 %  98.72 %  mysqld
     4812  be/4  www-data   1.20 M/s    5.20 M/s  0.00 %   1.02 %  php-fpm: pool www
    # mysqld is responsible for 68 MB/s of writes and 98.7% of I/O time
    # IO> column shows % of time the process was blocked waiting on I/O

    # Batch mode: useful for logging or running from scripts
    $ iotop -o -b -n 5      # -b = batch mode, -n 5 = 5 iterations then exit

    # Once you have the PID, find what files it's accessing
    $ lsof -p 9821 | grep -E "REG|DIR" | head -20
    mysqld  9821  mysql  4u  REG  8,2  2147483648  /var/lib/mysql/ibdata1
    mysqld  9821  mysql  5u  REG  8,2   524288000  /var/lib/mysql/ib_logfile0
    # InnoDB is writing heavily to its redo log. That is normal under high write load,
    # but if await is 180ms, either the disk is inadequate or innodb_flush_log_at_trx_commit=1
    # is flushing on every transaction (safe but expensive)
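When iotop is not installed, the kernel's per-process counters in /proc/PID/io give a rougher view of the same information. The helper below is a sketch under that assumption: proc_io_summary is a hypothetical name, and it only reads the read_bytes and write_bytes fields, which are cumulative since the process started.

```shell
# Summarise cumulative storage I/O for one process from /proc/<pid>/io-style input.
# read_bytes / write_bytes count bytes that went to or came from the storage layer.
proc_io_summary() {
    awk -F': *' '
        $1 == "read_bytes"  { r = $2 }
        $1 == "write_bytes" { w = $2 }
        END { printf "read=%.1f MiB write=%.1f MiB\n", r / 1048576, w / 1048576 }'
}

# Usage on a live system (9821 is the mysqld PID from the example above):
#   sudo cat /proc/9821/io | proc_io_summary
```

Because the counters are totals rather than rates, run it twice a few seconds apart and compare the two results to see current activity.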

Scenario 1 — Root Filesystem at 100%

df shows / at 100%. The application is throwing "No space left on device" errors.
1. Confirm which filesystem is full and check for the df/du discrepancy first. If they disagree, a deleted-but-open file may be the cause, and releasing it is the fastest fix with no risk of deleting the wrong thing.

    $ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1        50G   50G     0 100% /
    /dev/sdb1       500G  120G  380G  24% /data

    # lsof prints "(deleted)" as the last field, so the path is the field before it
    $ lsof | grep -i deleted | awk '{printf "%s MB\t%s\t%s %s\n", $7/1048576, $1, $(NF-1), $NF}' | sort -rn | head -5
    21504 MB  nginx  /var/log/nginx/access.log (deleted)
    # Found it immediately. nginx is holding 21 GB of deleted log space.

    $ kill -HUP $(pgrep -x nginx)
    $ df -h /
    Filesystem      Size  Used Avail Use%
    /dev/sda1        50G   29G   21G  58%
    # Immediate recovery, no deletions needed
2. If lsof shows nothing significant, find the largest directories. Work top-down: start at root, then drill into the biggest directory at each level.

    $ du -sh /* 2>/dev/null | sort -h
    4G   /home
    12G  /usr
    38G  /var         # biggest, investigate this next

    $ du -sh /var/* 2>/dev/null | sort -h
    1G   /var/cache
    2G   /var/lib
    35G  /var/log     # 35 GB in logs

    $ du -sh /var/log/* 2>/dev/null | sort -h | tail -10
    1G   /var/log/syslog
    30G  /var/log/myapp    # application log directory

    $ ls -lh /var/log/myapp/
    -rw-r--r-- 1 app app 30G Jun 14 14:22 debug.log
    # Debug logging left enabled in production. Common cause.
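The top-down drill can be wrapped in a small helper so each level is one call. This is a sketch only: biggest_children is a hypothetical name, sizes are printed in KiB, and -x is added so that other mounted filesystems (such as /data) are not counted against the full one.

```shell
# List the N largest immediate children of a directory, largest first (sizes in KiB).
# The function body runs in a subshell so its variables do not leak into the caller.
biggest_children() (
    dir=${1:-/}
    dir=${dir%/}            # "/" becomes "", so the glob below expands to /*
    count=${2:-5}
    for entry in "$dir"/* "$dir"/.[!.]*; do
        [ -e "$entry" ] || continue
        du -skx "$entry" 2>/dev/null
    done | sort -rn | head -n "$count"
)

# Usage: repeat on whichever directory comes out on top
#   biggest_children / 5
#   biggest_children /var 5
#   biggest_children /var/log 5
```

Numeric KiB output is used instead of -h so that a plain sort -rn orders the results correctly on any system.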
3. Use ncdu for interactive exploration. It is faster than repeated du commands when you don't know where to look.

    # ncdu may need: apt install ncdu / yum install ncdu
    $ ncdu /var
    # Opens a curses interface. Arrow keys navigate, Enter descends,
    # d deletes (with confirmation), q quits.
    # Shows size of each directory sorted largest first, much faster than du for exploration.
4. Safe quick wins to recover space, in order of risk:
  • Zero risk: Clear the systemd journal — journalctl --vacuum-size=500M
  • Zero risk: Clear package manager cache — apt clean or yum clean all
  • Zero risk: Remove old kernel packages — apt autoremove (keeps current kernel)
  • Low risk: Truncate (not delete) the offending log, leaving the file handle intact — truncate -s 0 /var/log/myapp/debug.log
  • Check before removing: Core dumps in /var/crash or /tmp: find /tmp /var/crash -name "core*" -size +100M
  • Check before removing: Old log rotations — find /var/log -name "*.gz" -mtime +30
5. Truncate vs delete: why truncate is safer for open log files. If you rm a log file that a running process has open, the process continues writing to the now-deleted inode and the space isn't freed (the deleted-but-open problem again). truncate -s 0 zeros the file's content while leaving the inode intact, so the process's file handle remains valid and the space is freed immediately. One caveat: if the process did not open the file in append mode, its next write lands at the old offset and the file becomes sparse, showing its old apparent size even though the blocks were freed.

    # Safe: truncate the file while the process keeps writing to it
    $ truncate -s 0 /var/log/myapp/debug.log
    # The file now has 0 bytes. The running process continues writing to it normally.
    # Alternatively: > /var/log/myapp/debug.log   (shell redirection truncate)

    # Unsafe: rm while process is writing
    # rm /var/log/myapp/debug.log    # process keeps the deleted inode open; space not freed
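The difference between rm and truncate can be reproduced safely in a scratch directory. The demo below is a sketch, not a test of any real service: a shell file descriptor stands in for the running process, the file names are made up, and the /proc lookup is Linux-specific (the fallback string is used elsewhere).

```shell
demo_dir=$(mktemp -d)
log="$demo_dir/app.log"

# Case 1: rm while a writer still holds the file open.
exec 3>>"$log"                  # fd 3 plays the role of the running service
echo "first line" >&3
rm "$log"
echo "second line" >&3          # still succeeds: the unlinked inode lives on
deleted_target=$(readlink "/proc/$$/fd/3" 2>/dev/null || echo "unavailable")
exec 3>&-                       # only closing the descriptor frees the blocks

# Case 2: truncate while a writer (opened in append mode) holds the file open.
exec 4>>"$log"
echo "first line" >&4
: > "$log"                      # same effect as: truncate -s 0 "$log"
echo "second line" >&4          # lands at offset 0 because fd 4 appends
exec 4>&-
size_after=$(wc -c < "$log")

echo "fd 3 pointed at: $deleted_target"
echo "size after truncate and one more write: $size_after bytes"
rm -rf "$demo_dir"
```

On Linux the first message typically shows the original path followed by "(deleted)", which is the same marker lsof reports.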
6. Root cause and prevention. After recovering space, address why it filled up:
  • Set up logrotate for application logs (or ensure it's configured correctly)
  • Disable debug logging in production application config
  • Set a journal size cap: SystemMaxUse=500M in /etc/systemd/journald.conf
  • Set up disk space monitoring alerts (before it reaches 100%)
On systems where / is separate from /var, /tmp, and /home, a full /var won't affect the system's ability to run — only to write logs and package updates. On single-partition systems, a full / can prevent logins and crash running services. Always know your partition layout before an incident.
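The monitoring item in the prevention list can start as a few lines run from cron. The sketch below makes several assumptions: over_threshold is a hypothetical helper, the 90% default is arbitrary, and mount points are assumed to contain no spaces.

```shell
# Print filesystems whose usage is at or above a threshold (default 90%).
# Reads `df -P` output on stdin; -P keeps each filesystem on a single line.
over_threshold() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            pct = $5
            sub(/%$/, "", pct)
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit)
                printf "%s is at %s%% (%s)\n", $6, pct, $1
        }'
}

# Usage:
#   df -P  | over_threshold 85     # block usage
#   df -Pi | over_threshold 85     # inode usage (GNU df), same column positions
```

Alerting at 85 to 90% rather than 100% leaves time to act before applications start failing.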

Scenario 2 — Inode Exhaustion

"No space left on device" — but df shows 40% free. Files can't be created anywhere.
1. Check inode usage immediately, because this is the tell. Inodes are pre-allocated metadata slots (one per file or directory). When they run out, no new files can be created even if disk space is available.

    $ df -i
    Filesystem      Inodes   IUsed    IFree IUse% Mounted on
    /dev/sda1      6553600 6553600        0  100% /
    /dev/sdb1      6553600  412000  6141600    6% /data
    # / has zero inodes remaining. This explains the error despite available disk space.
2. Find which directory contains the most files. Inode exhaustion is almost always caused by one directory containing millions of tiny files: mail queues, PHP session files, cache directories, or a logging system that creates one file per event.

    # Count files per top-level directory (can take a while on a full system)
    $ for dir in /*; do
        count=$(find "$dir" -xdev 2>/dev/null | wc -l)
        echo "$count $dir"
      done | sort -rn | head -10
    5821432 /var
    412000 /usr
    98000 /home

    # Drill into /var
    $ find /var -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -10
    5819200 /var/spool/postfix/deferred
       1200 /var/log
    # The postfix deferred queue has 5.8 million messages. Classic inode exhaustion cause.
3. Recovery: delete the tiny files. With millions of files, rm -rf /path/* fails ("argument list too long"). Use find instead:

    # For a mail queue: flush or delete deferred messages
    $ postsuper -d ALL deferred      # Postfix: delete all deferred messages properly

    # Generic: delete files in a directory with too many for rm to handle
    $ find /var/spool/postfix/deferred -type f -delete

    # For PHP session files (another common culprit):
    $ find /var/lib/php/sessions -type f -mtime +1 -delete
    # -mtime +1 = only files older than 1 day, which leaves active sessions intact

    # Verify recovery:
    $ df -i /
    Filesystem      Inodes  IUsed    IFree IUse%
    /dev/sda1      6553600 734400  5819200   11%
4. Prevention. Add inode monitoring to your alerting alongside disk space, since many monitoring systems skip it. Also consider: if your application legitimately creates millions of small files, use a directory structure that spreads them across subdirectories (e.g., by first two characters of the filename: ca/cache_abc123) rather than a flat directory. Some filesystems (ext4 with dir_index) handle large directories better than others.
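The two-character sharding scheme mentioned above is simple to sketch. shard_path is a hypothetical helper that only computes the relative path; creating the directory and moving the file is left to the caller.

```shell
# Map a file name to a two-character shard directory: cache_abc123 -> ca/cache_abc123
# The function body runs in a subshell so its variables do not leak into the caller.
shard_path() (
    name=$1
    prefix=$(printf '%s' "$name" | cut -c1-2)
    printf '%s/%s\n' "$prefix" "$name"
)

# Usage:
#   dest="/var/cache/myapp/$(shard_path "$file")"
#   mkdir -p "$(dirname "$dest")" && mv "$file" "$dest"
```

If every name shares a prefix such as cache_, all files land in the same shard, so in practice the prefix is usually taken from a hash or random ID within the name instead. The /var/cache/myapp path in the usage comment is illustrative.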
You cannot add inodes to an existing ext4 filesystem without reformatting. The inode count is set at creation time with mkfs.ext4 -i bytes-per-inode. If inode exhaustion is a recurring problem on a partition you can't reformat, consider moving the high-file-count directory to its own partition or filesystem (like tmpfs for session files).
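The effect of the bytes-per-inode ratio is plain division, which makes it easy to check before formatting. The sketch below is an estimate only: estimate_inodes is a hypothetical helper, the 16384 default reflects the usual inode_ratio in /etc/mke2fs.conf (check your own), and mkfs.ext4 rounds the real figure to fit block groups.

```shell
# Estimate how many inodes mkfs.ext4 would create: size_bytes / bytes_per_inode.
# The function body runs in a subshell so its variables do not leak into the caller.
estimate_inodes() (
    size_bytes=$1
    bytes_per_inode=${2:-16384}
    echo $(( size_bytes / bytes_per_inode ))
)

estimate_inodes 53687091200          # 50 GiB at the usual default ratio
estimate_inodes 53687091200 8192     # 50 GiB with mkfs.ext4 -i 8192
```

Halving the ratio doubles the inode count: for a 50 GiB filesystem the two calls print 3276800 and 6553600, at the cost of more space reserved for inode tables.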

Scenario 3 — High I/O Wait, Server Sluggish

vmstat shows wa=45%. The server is slow but CPU is mostly idle and memory is fine.
1. Confirm it's disk I/O wait and not NFS or swap. High wa covers all uninterruptible I/O: disk, NFS mounts, and swap all contribute.

    $ vmstat 1 5
     r  b  swpd  free  buff  cache  si  so    bi     bo    in    cs  us  sy  id  wa
     1  4     0  2.1G  200M   4.2G   0   0  1200  73800  2100  3400   8   4  43  45
    # b=4: four processes in uninterruptible sleep (D state)
    # si=0 so=0: no swap activity, so not a memory problem
    # bo=73800 KB/s: 72 MB/s of writes, which is high for a spinning disk
    # id=43, wa=45: CPU is idle nearly half the time, but blocked on I/O the other half
2. Identify the saturated device with iostat.

    $ iostat -xh 1 3
    Device      r/s     w/s   rkB/s    wkB/s  await  aqu-sz  %util
    nvme0n1     8.0    12.0   320.0    480.0    0.9    0.02    2.1
    sdb         2.0  1840.0     8.0  73600.0  165.0    24.4   98.8
    # sdb: the spinning HDD. 98.8% utilised, 165ms await (should be <20ms), queue 24.
    # 73 MB/s is below an HDD's ~100-150 MB/s sequential limit, but it arrives as 1,840
    # writes per second, far more than the ~100-200 IOPS a spinning disk can sustain
    # when the writes are not sequential.
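One way to judge whether a device like sdb is bound by throughput or by operation count is the average request size, which is throughput divided by operations per second. The helper below is a small sketch of that arithmetic; avg_request_kb is a hypothetical name, and the inputs are the wkB/s and w/s values from the iostat output above.

```shell
# Average request size in KiB = throughput (kB/s) / operations per second.
avg_request_kb() {
    awk -v kbps="$1" -v iops="$2" 'BEGIN {
        if (iops <= 0) { print "n/a"; exit }
        printf "%.1f\n", kbps / iops
    }'
}

avg_request_kb 73600 1840    # sdb writes: prints 40.0
```

Requests of about 40 KB at 1,840 per second point at an operation-count limit rather than a bandwidth limit, which is why the HDD IOPS ceiling from the iostat column guide is the figure to compare against.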
3. Find which process is responsible with iotop.

    $ iotop -o
    Total DISK READ: 8.0 KB/s | Total DISK WRITE: 73.60 MB/s
      TID  PRIO  USER      DISK READ  DISK WRITE    IO>  COMMAND
    12841  be/4  backup     0.00 B/s  73.50 MB/s  96.2%  rsync -av /data /mnt/backup
     4812  be/4  www-data  8.00 KB/s   0.10 MB/s   1.8%  php-fpm
    # The rsync backup job is consuming essentially all disk write bandwidth.
    # It reads /data from nvme0n1 and writes to sdb (the backup mount). The writes are
    # what saturate sdb, and the bulk reads can also push useful pages out of the page cache.
4. Throttle the backup job with ionice without stopping it. ionice changes a process's I/O scheduling class and is the disk equivalent of renice for CPU. It only has an effect when the device's I/O scheduler honours priorities (BFQ, or CFQ on older kernels); check /sys/block/sdb/queue/scheduler if the change makes no difference.

    # Set rsync to idle I/O class: it only gets disk access when nothing else needs it
    $ ionice -c 3 -p 12841
    # Class 3 = idle: process gets I/O only when no other process wants the disk
    # The backup will slow down significantly but production I/O is protected

    # Alternatively: best-effort with low priority (0=high, 7=low)
    $ ionice -c 2 -n 7 -p 12841

    # For future backups, launch rsync with low priority from the start:
    $ ionice -c 3 nice -n 19 rsync -av /data /mnt/backup

    # Verify the change reduced I/O wait:
    $ vmstat 1 3
     r  b  swpd  free  si  so   bi    bo  wa
     0  0     0  2.3G   0   0  800  8400   5
    # wa dropped from 45% to 5%
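Whether ionice can help depends on the I/O scheduler attached to the device: priorities are honoured by BFQ (and by CFQ on older kernels), while the none scheduler applies no prioritisation. The kernel marks the active scheduler with square brackets in /sys/block/DEVICE/queue/scheduler, and the hypothetical helper below simply extracts that name.

```shell
# Extract the active scheduler (the bracketed name) from queue/scheduler content.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Usage on a live system (sdb is the device from this scenario):
#   active_scheduler < /sys/block/sdb/queue/scheduler
```

If the result is not a priority-aware scheduler, switching to one is an option to evaluate carefully, or the job can be rate-limited at the application level instead (rsync's --bwlimit, for example).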
5. If the I/O comes from a production service rather than a background job, the disk is genuinely undersized for the workload. Options to investigate:
  • Move the high-I/O service to an NVMe SSD if it's currently on HDD
  • Add a read cache (bcache, lvmcache) to the slow disk
  • For databases: tune innodb_buffer_pool_size (MySQL) or shared_buffers (PostgreSQL) to serve more reads from RAM and reduce disk reads
  • For write-heavy workloads: check if sync() / fsync() is being called too frequently — sometimes a config change (careful: tradeoff with durability) can dramatically reduce write IOPS
Always run iotop at the same time as iostat when diagnosing I/O — iostat identifies the saturated device, iotop identifies the responsible process. Neither alone gives the full picture.

Log Files — The Most Common Disk-Space Culprit

Logs fill disks more reliably than almost anything else. The systemd journal alone can consume dozens of gigabytes without size limits configured.

    # How much space is the systemd journal using?
    $ journalctl --disk-usage
    Archived and active journals take up 8.3G in the filesystem.

    # Immediate trim: keep only the most recent 500 MB
    $ journalctl --vacuum-size=500M
    # Or: keep only the last 2 weeks
    $ journalctl --vacuum-time=2weeks

    # Make limits permanent in /etc/systemd/journald.conf:
    # SystemMaxUse is a hard cap on journal disk usage;
    # MaxRetentionSec auto-deletes entries older than the given age.
    $ grep -E "SystemMaxUse|MaxRetention" /etc/systemd/journald.conf
    SystemMaxUse=500M
    MaxRetentionSec=2week
    $ systemctl restart systemd-journald

    # Find large traditional log files
    $ find /var/log -type f -size +100M -exec ls -lh {} \;
    -rw-r--r-- 1 root root  30G Jun 14 /var/log/myapp/debug.log
    -rw-r--r-- 1 root root 4.2G Jun 12 /var/log/syslog

    # Check logrotate configuration for an application
    $ cat /etc/logrotate.d/nginx
    /var/log/nginx/*.log {
        daily              # Rotate daily
        missingok          # Don't error if log is missing
        rotate 14          # Keep 14 days of logs
        compress           # Compress old logs (.gz)
        delaycompress      # Compress previous rotation, not current
        notifempty         # Don't rotate empty logs
        sharedscripts
        postrotate
            nginx -s reopen    # Signal nginx to reopen log files after rotation
        endscript
    }

    # Force an immediate logrotate run (for testing or emergency)
    $ logrotate -f /etc/logrotate.d/nginx
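For an application that cannot reopen its log on a signal, logrotate's copytruncate directive is a common alternative to a postrotate hook. The sketch below writes such a policy; write_myapp_logrotate is a hypothetical helper, and the /var/log/myapp path and every value in the policy are illustrative assumptions rather than recommendations.

```shell
# Write a logrotate policy for an application that cannot reopen its log on a signal.
# The log path and all values below are illustrative assumptions.
write_myapp_logrotate() {
    target=${1:-/etc/logrotate.d/myapp}
    cat > "$target" <<'EOF'
/var/log/myapp/*.log {
    daily
    rotate 7
    maxsize 1G
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF
}

# Usage: write to a scratch file first and dry-run it before installing
#   write_myapp_logrotate /tmp/myapp.logrotate
#   logrotate -d /tmp/myapp.logrotate
```

copytruncate copies the log and then truncates the original in place, so the process keeps its file handle, but lines written between the copy and the truncate can be lost. maxsize only takes effect as often as logrotate itself runs, so a daily cron job will not catch a log that fills the disk within hours.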

Mount Options That Affect Performance

The options used when mounting a filesystem can make a measurable difference, particularly for read-heavy or small-file workloads. They're set in /etc/fstab or via the mount command.

noatime (No access time updates)
  Under strict atime semantics (the historical behaviour, still available as strictatime), Linux updates the "last accessed" timestamp on every file read, so read-heavy workloads also generate a steady stream of inode writes. noatime disables atime updates entirely.
  Impact: significant compared with strict atime, with up to 30% I/O reduction on read-heavy small-file workloads. Safe for most servers. May affect backup tools that use atime to detect changed files.

relatime (Relative access time)
  A compromise: updates atime only if the previous atime is older than the file's mtime or ctime, or more than 24 hours old. This is the default on modern Linux, so most systems already have this behaviour without explicitly setting it.
  Impact: moderate. Good balance that keeps atime meaningful for tools that use it while eliminating most unnecessary write traffic. The safe default.

errors=remount-ro (Remount read-only on error)
  If the filesystem encounters an error, remount it read-only instead of continuing to allow writes that could corrupt data. Common default for ext4.
  Impact: safety improvement. No performance effect. Prevents data corruption on disk errors. Recommended for all production filesystems.

nobarrier (Disable write barriers)
  Write barriers ensure data is physically on disk before acknowledging a write, protecting against corruption on power loss. Disabling them can improve write performance but risks data loss on power failure.
  Impact: dangerous on real hardware. Only safe on VMs where the hypervisor provides equivalent guarantees. Do not use on physical servers without a battery-backed write cache.
    # Check current mount options for a filesystem
    $ findmnt -o TARGET,OPTIONS /
    TARGET OPTIONS
    /      rw,relatime,errors=remount-ro

    # Add noatime to /etc/fstab for the root filesystem:
    $ grep " / " /etc/fstab
    UUID=abc123 / ext4 rw,relatime,errors=remount-ro 0 1
    # Change relatime to noatime:
    UUID=abc123 / ext4 rw,noatime,errors=remount-ro 0 1

    # Remount immediately without rebooting:
    $ mount -o remount,noatime /
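After a remount it is worth confirming that the option actually took effect. The helper below is a minimal sketch: has_mount_option is a hypothetical name, and it does an exact match on a comma-separated option string such as the one findmnt prints, so that atime does not match noatime or relatime.

```shell
# Return success if a comma-separated mount option string contains the exact option.
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Usage on a live system:
#   has_mount_option "$(findmnt -no OPTIONS /)" noatime && echo "noatime is active"
```

Wrapping the option string in commas on both sides is what makes the match exact, including for the first and last options in the list.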

Quick Reference — Chapter 4 Commands

Each entry lists the command, its purpose, and key flags / notes.

df -h
  Purpose: Disk space per filesystem, human-readable
  Notes: -i inode usage · -T show filesystem type · any filesystem at 100% = problem

df -i
  Purpose: Inode usage per filesystem. Check when "no space" is reported despite available space
  Notes: 100% inodes = no new files can be created regardless of disk space

du -sh /* 2>/dev/null | sort -h
  Purpose: Size of each top-level directory, sorted smallest to largest
  Notes: Drill down iteratively: du -sh /var/* | sort -h etc.

ncdu /path
  Purpose: Interactive disk usage explorer: navigate, sort, delete
  Notes: Much faster than repeated du. May need: apt install ncdu

lsof | grep deleted
  Purpose: Find deleted-but-open files still consuming disk space
  Notes: Pipe to awk '{print $7, $1, $(NF-1)}' | sort -rn to show size + owner + filename

truncate -s 0 file
  Purpose: Zero a log file's content while leaving the file handle intact
  Notes: Safer than rm for files held open by running processes

kill -HUP PID
  Purpose: Signal a service to reopen its log files after rotation
  Notes: Works for many well-behaved services; the signal varies (nginx -s reopen sends SIGUSR1), so check the service's documentation

iostat -xh 1
  Purpose: Per-device I/O stats, to find saturated devices
  Notes: Watch %util (saturation), await (latency ms), aqu-sz (queue depth)

iotop -o
  Purpose: Per-process I/O, to find which process is hammering the disk
  Notes: -b -n 5 batch mode · -a accumulated totals · requires root

ionice -c 3 -p PID
  Purpose: Set process to idle I/O class, so it only gets disk when nothing else needs it
  Notes: -c 2 -n 7 = best-effort low priority · use for backup jobs, batch work

find /dir -type f -delete
  Purpose: Delete all files in a directory when rm -rf /dir/* fails (argument list too long)
  Notes: Add -mtime +N to only delete files older than N days

journalctl --disk-usage
  Purpose: Show how much disk the systemd journal is using
  Notes: --vacuum-size=500M trim immediately · --vacuum-time=2weeks by age

logrotate -f /etc/logrotate.d/X
  Purpose: Force an immediate logrotate run for a specific application
  Notes: Check config with logrotate --debug

findmnt -o TARGET,OPTIONS /
  Purpose: Show current mount options for a filesystem
  Notes: Add noatime in /etc/fstab + mount -o remount,noatime / to apply live