Chapter 4 — Disk I/O & Storage Bottlenecks
Disk problems come in two distinct flavours that require completely different approaches. The first is space: the filesystem is full and nothing can be written. The second is I/O performance: the disk has space but is so busy that read and write operations queue up, causing the system to grind. This chapter covers both, including the trap where df and du disagree because a running process is still holding a deleted file open.
What this chapter covers: Why df and du can disagree — the deleted-but-open file trap. Reading iostat's extended output. Scenario 1: root filesystem at 100%. Scenario 2: inode exhaustion — disk has space but files can't be created. Scenario 3: high I/O wait — finding which process is hammering the disk. Log file management. ionice for throttling disk-intensive processes. Mount options that affect performance.
df vs du — Why They Sometimes Disagree
df reads disk usage from the filesystem's own accounting (the superblock). du walks the directory tree and adds up file sizes. These two methods give the same answer unless a file has been deleted from the directory tree while a process still holds it open — in which case df still counts the space (the inode and its blocks are still allocated) while du doesn't (the directory entry is gone, so the walk misses it).
The Deleted-But-Open File Problem
Normal state:
Directory entry → Inode → Data blocks on disk
(du finds this) (df counts this)
After "rm large_log.log" while nginx still has it open:
~~Directory entry~~ → Inode → Data blocks on disk
(du: file gone!) (df: still counted! inode still allocated)
Effect: df says filesystem is 8 GB used.
du / reports only 4 GB.
Missing 4 GB = deleted-but-open files.
Fix: find the process holding the file open, restart it or
send it SIGHUP to reopen its log files. The OS then
releases the inode and the space is reclaimed.
$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 48G 2.0G 96% /
$ du -sh /* 2>/dev/null | sort -h | tail -5
2G /opt
4G /home
8G /var
12G /usr
$ lsof | grep -i deleted | awk '{print $7, $1, $2, $(NF-1), $NF}' | sort -rn | head -10
22548342912 nginx 1234 /var/log/nginx/access.log (deleted)
4194304000 python 5678 /tmp/cache_dump.bin (deleted)
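$ kill -HUP $(pgrep nginx)      # Option 1: signal nginx so it reopens its log files
$ systemctl restart nginx       # Option 2: restart the service if the signal does not release the file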
$ df -h /
Filesystem Size Used Avail Use%
/dev/sda1 50G 26G 24G 52%
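Most well-behaved services (nginx, Apache, PostgreSQL, syslog) can reopen their log files on a signal without dropping connections. SIGHUP is the common convention, though the exact signal varies by service: nginx, for example, reloads its configuration on HUP (which reopens the logs as a side effect) and has a dedicated USR1 signal, sent by nginx -s reopen, for reopening logs alone. This is also how logrotate works: it renames the log file, signals the service from its postrotate script, and the service starts writing to a new file at the original path. The old file (now renamed) can then be compressed or deleted.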
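To put a number on the gap between df and du, the size column of the lsof output can be summed for every entry marked (deleted). The helper below is a minimal sketch rather than a standard tool: sum_deleted_bytes is a hypothetical name, and it assumes lsof's default column layout (size in field 7, no TID column).

```shell
# Sum the SIZE/OFF column (field 7) of lsof-style lines whose NAME ends in "(deleted)".
# Assumed layout: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
# A file held open through several descriptors is counted once per descriptor,
# so treat the result as an upper bound.
sum_deleted_bytes() {
    awk '$NF == "(deleted)" && $7 ~ /^[0-9]+$/ { total += $7 }
         END { printf "%.0f\n", total }'
}

# Typical use (as root, so every process is visible):
#   lsof -nP 2>/dev/null | sum_deleted_bytes
```

If the total is close to the difference between what df and du report, deleted-but-open files explain the discrepancy and no directory hunting is needed.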
Reading iostat — The Columns That Matter
iostat -x 1 gives per-device I/O statistics updated every second. It's the primary tool for identifying which disk is saturated and whether the problem is reads, writes, throughput, or latency.
$ iostat -xh 1
Device r/s w/s rkB/s wkB/s await r_await w_await aqu-sz %util
sda 12.0 8.0 480.0 320.0 2.1 1.8 2.5 0.04 3.2
nvme0n1 45.0 320.0 1800.0 12800.0 4.8 2.1 5.2 0.82 42.0
sdb 2.0 1840.0 8.0 73600.0 180.0 12.0 182.0 24.8 98.6
%util
Utilisation
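Percentage of time the device had at least one request in flight. On a single spinning disk, 100% means the device is saturated and I/O requests are queuing. SSDs, NVMe drives and RAID arrays serve requests in parallel, so a high %util on those is a prompt to check await and aqu-sz rather than proof of saturation.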
Above 80% sustained on HDD, or 95%+ on SSD — investigate
await
Average I/O Wait (ms)
Average time from I/O request submission to completion. Includes queue time plus actual disk service time.
HDD: >20ms. SSD: >5ms. NVMe: >1ms. Consistently above these = saturation
aqu-sz
Average Queue Depth
Average number of I/O requests waiting in the device queue. The clearest saturation signal.
Sustained above 1–2 on a single disk = more requests than the device can handle
r/s, w/s
Operations / second
Read and write IOPS. Spinning disks are limited to ~100–200 IOPS. SSDs handle 10,000–100,000+. Tells you whether the workload is I/O-count bound.
HDD: r/s + w/s consistently near 150–200 = IOPS saturated
rkB/s, wkB/s
Throughput (KB/s)
Actual data transferred per second. Compare against the device's rated throughput. Large sequential files show high kB/s with low IOPS.
HDD: above ~100–150 MB/s. SSD: above rated speed. Sustained = throughput bound.
r_await / w_await
Read / Write await separately
Splits await into read-side and write-side. If w_await is high but r_await is fine, writes are queuing. Useful for mixed workloads.
Asymmetry between r_await and w_await points to write-heavy saturation
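The thresholds above can be applied mechanically to iostat output. The following is a sketch under stated assumptions: flag_saturated is a hypothetical helper, the default limits (80% utilisation, queue depth 2) are taken from the guidance above, and the input is plain iostat -x output without the -h flag, which changes the layout on newer sysstat releases. Columns are looked up by header name because their positions differ between versions.

```shell
# Print devices whose %util or aqu-sz is at or above the given limits.
# usage: iostat -x 1 3 | flag_saturated [util_limit] [queue_limit]
flag_saturated() {
    awk -v util_limit="${1:-80}" -v queue_limit="${2:-2}" '
        # Header row: remember the position of every column name.
        $1 == "Device" || $1 == "Device:" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # A blank line ends the device table; forget the header until the next one.
        NF == 0 { split("", col); next }
        # Device rows: compare against the limits.
        ("%util" in col) && NF >= col["%util"] {
            util  = $(col["%util"]) + 0
            queue = ("aqu-sz" in col) ? $(col["aqu-sz"]) + 0 : 0
            if (util >= util_limit || queue >= queue_limit)
                printf "%s util=%s%% queue=%s\n", $1, util, queue
        }'
}
```

Fed the sample output shown earlier, only sdb is reported; nvme0n1 at 42% with a queue depth of 0.82 stays under both limits.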
iotop — Finding Which Process Is Doing the I/O
iostat tells you the device is saturated. iotop tells you who is responsible. It requires root (or CAP_NET_ADMIN) to read kernel I/O accounting.
$ iotop -o
Total DISK READ: 1.20 M/s | Total DISK WRITE: 73.60 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
9821 be/4 mysql 0.00 B/s 68.40 M/s 0.00 % 98.72 % mysqld
4812 be/4 www-data 1.20 M/s 5.20 M/s 0.00 % 1.02 % php-fpm: pool www
$ iotop -o -b -n 5
$ lsof -p 9821 | grep -E "REG|DIR" | head -20
mysqld 9821 mysql 4u REG 8,2 2147483648 /var/lib/mysql/ibdata1
mysqld 9821 mysql 5u REG 8,2 524288000 /var/lib/mysql/ib_logfile0
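When iotop is not installed, the kernel's per-process counters in /proc/PID/io give the same raw information, provided task I/O accounting is enabled and you are root or own the process. The sketch below only parses that file format; io_bytes is a hypothetical helper name.

```shell
# Print "read_bytes write_bytes" from text in /proc/<pid>/io format.
# These two counters reflect I/O that actually reached the storage layer,
# unlike rchar/wchar, which also count reads and writes served from cache.
io_bytes() {
    awk -F': *' '$1 == "read_bytes"  { r = $2 }
                 $1 == "write_bytes" { w = $2 }
                 END { printf "%.0f %.0f\n", r, w }'
}

# usage: io_bytes < /proc/9821/io
```

The counters are cumulative, so sampling twice a few seconds apart and subtracting gives a write rate comparable to the DISK WRITE column in iotop.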
Scenario 1 — Root Filesystem at 100%
1. Confirm which filesystem is full and check for the df/du discrepancy first. If they disagree, a deleted-but-open file may be the cause, and releasing it is the fastest fix with no risk of deleting the wrong thing.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 50G 0 100% /
/dev/sdb1 500G 120G 380G 24% /data
$ lsof | grep -i deleted | awk '{printf "%d MB\t%s\t%s %s\n", $7/1048576, $1, $(NF-1), $NF}' | sort -rn | head -5
21504 MB nginx /var/log/nginx/access.log (deleted)
$ kill -HUP $(pgrep -x nginx)
$ df -h /
Filesystem Size Used Avail Use%
/dev/sda1 50G 29G 21G 58% ← Immediate recovery, no deletions needed
2. If lsof shows nothing significant, find the largest directories. Work top-down: start at root, then drill into the biggest directory at each level.
$ du -sh /* 2>/dev/null | sort -h
4G /home
12G /usr
38G /var
$ du -sh /var/* 2>/dev/null | sort -h
1G /var/cache
2G /var/lib
35G /var/log
$ du -sh /var/log/* 2>/dev/null | sort -h | tail -10
1G /var/log/syslog
30G /var/log/myapp
$ ls -lh /var/log/myapp/
-rw-r--r-- 1 app app 30G Jun 14 14:22 debug.log
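The drill-down above repeats one operation: find the largest child of the current directory. A minimal sketch of that single step follows, assuming GNU du (for --max-depth and -x); largest_child is a hypothetical helper name.

```shell
# Print the largest immediate subdirectory of DIR by disk usage.
# -x keeps du on one filesystem so other mounts under DIR are not counted.
largest_child() {
    du -x -k --max-depth=1 "$1" 2>/dev/null |
        awk -F'\t' -v top="$1" '$2 != top' |   # drop the total line for DIR itself
        sort -n | tail -n 1 | cut -f2-
}

# usage: largest_child /var
```

Calling it repeatedly on its own output walks the same path as the manual session. Files sitting directly in DIR are not listed, so finish with ls -lhS once the output stops changing.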
3. Use ncdu for interactive exploration. It is faster than repeated du commands when you don't know where to look.
$ ncdu /var
4. Safe quick wins to recover space, in order of risk:
- Zero risk: Clear the systemd journal: journalctl --vacuum-size=500M
- Zero risk: Clear package manager cache: apt clean or yum clean all
- Zero risk: Remove old kernel packages: apt autoremove (keeps current kernel)
- Low risk: Truncate (not delete) the offending log, leaving the file handle intact: truncate -s 0 /var/log/myapp/debug.log
- Check before removing: Core dumps in /var/crash or /tmp: find /tmp /var/crash -name "core*" -size +100M
- Check before removing: Old log rotations: find /var/log -name "*.gz" -mtime +30
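5. Truncate vs delete: why truncate is safer for open log files. If you rm a log file that a running process has open, the process continues writing to the now-deleted inode and the space isn't freed (the deleted-but-open problem again). truncate -s 0 zeros the file's content while leaving the inode intact, so the process's file handle remains valid and the space is freed immediately. One caveat: if the process did not open the log in append mode, its next write lands at the old offset and the file reappears as a sparse file with its previous apparent size, although the truncated blocks stay freed.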
$ truncate -s 0 /var/log/myapp/debug.log
# rm /var/log/myapp/debug.log ← process keeps the deleted inode open; space not freed
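The difference is easy to reproduce in a scratch directory. The script below is an illustrative sketch that assumes Linux (/proc), GNU stat and the coreutils truncate command; file descriptor 3 stands in for a service's open log handle, and all variable names are made up for the demonstration.

```shell
demo_dir=$(mktemp -d)
log="$demo_dir/app.log"

# Open the log for appending on descriptor 3, as a long-running service would.
exec 3>>"$log"
head -c 1048576 /dev/zero >&3            # write 1 MiB through the open handle

truncate -s 0 "$log"                     # space is released, the handle stays valid
size_after_truncate=$(stat -c %s "$log")
echo "after" >&3                         # the writer carries on in the same file
size_after_write=$(stat -c %s "$log")

rm "$log"                                # directory entry gone, inode still held open
link_after_rm=$(readlink /proc/self/fd/3)

exec 3>&-                                # closing the handle finally frees the inode
rm -rf "$demo_dir"

echo "truncate: $size_after_truncate bytes, then $size_after_write bytes after the next write"
echo "rm: descriptor 3 pointed at: $link_after_rm"
```

After truncate the file is empty and keeps receiving writes at its path. After rm the descriptor still resolves to the old path with a (deleted) suffix, which is exactly what lsof reports for the nginx log earlier in the chapter.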
6. Root cause and prevention. After recovering space, address why it filled up:
- Set up logrotate for application logs (or ensure it's configured correctly)
- Disable debug logging in production application config
- Set a journal size cap: SystemMaxUse=500M in /etc/systemd/journald.conf
- Set up disk space monitoring alerts (before it reaches 100%)
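Where no monitoring system is in place yet, a basic alert can be built on df alone. The following is a rough sketch, not a replacement for real monitoring: over_threshold is a hypothetical helper, the 90% default is an arbitrary choice, and mount points containing spaces are not handled.

```shell
# Print mount points at or above LIMIT percent.
# Reads `df -P` (space) or `df -iP` (inodes) output on stdin; both put the
# percentage in column 5 and the mount point in column 6.
over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%$/, "", pct)
        if (pct ~ /^[0-9]+$/ && pct + 0 >= limit) print $6, pct "%"
    }'
}

# usage from cron or a monitoring agent:
#   df -P  | over_threshold 90     # space
#   df -iP | over_threshold 90     # inodes
```

Running the same check against df -iP covers inode exhaustion with no extra code.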
On systems where / is separate from /var, /tmp, and /home, a full /var won't affect the system's ability to run — only to write logs and package updates. On single-partition systems, a full / can prevent logins and crash running services. Always know your partition layout before an incident.
Scenario 2 — Inode Exhaustion
1. Check inode usage immediately: this is the tell. On ext4 and similar filesystems, inodes are pre-allocated metadata slots (one per file or directory). When they run out, no new files can be created even if disk space is available.
$ df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 6553600 6553600 0 100% /
/dev/sdb1 6553600 412000 6141600 6% /data
2. Find which directory contains the most files. Inode exhaustion is almost always caused by one directory tree containing millions of tiny files: mail queues, PHP session files, cache directories, or a logging system that creates one file per event.
$ for dir in /*; do
count=$(find "$dir" -xdev 2>/dev/null | wc -l)
echo "$count $dir"
done | sort -rn | head -10
5821432 /var
412000 /usr
98000 /home
$ find /var -xdev -printf '%h\n' 2>/dev/null | sort | uniq -c | sort -rn | head -10
5819200 /var/spool/postfix/deferred
1200 /var/log
3. Recovery: delete the tiny files. With millions of files, rm -rf /path/* fails ("argument list too long") because the shell expands the glob into more arguments than the kernel accepts. Use find instead:
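$ postsuper -d ALL deferred                               # Postfix: purge the deferred queue with its own tool
$ find /var/spool/postfix/deferred -type f -delete        # Generic fallback: delete files one by one, no glob expansion
$ find /var/lib/php/sessions -type f -mtime +1 -delete    # PHP sessions: remove only files older than one day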
$ df -i /
Filesystem Inodes IUsed IFree IUse%
/dev/sda1 6553600 734400 5819200 11%
4. Prevention. Add inode monitoring to your alerting alongside disk space, since many monitoring systems skip it. Also consider: if your application legitimately creates millions of small files, use a directory structure that spreads them across subdirectories (e.g., by first two characters of the filename: ca/cache_abc123) rather than a flat directory. Some filesystems (ext4 with dir_index) handle large directories better than others.
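The two-character sharding scheme mentioned above takes only a few lines. This is a minimal sketch with a hypothetical helper name, shard_path, intended to show the mapping rather than a production layout.

```shell
# Map a file name to a two-character shard directory,
# e.g. cache_abc123 -> ca/cache_abc123
shard_path() {
    name=$1
    prefix=$(printf '%s' "$name" | cut -c1-2)
    printf '%s/%s\n' "$prefix" "$name"
}

# usage:
#   target=$(shard_path "$file")
#   mkdir -p "$(dirname "$target")" && mv "$file" "$target"
```

Sharding on the leading characters only spreads files evenly when those characters vary. If every name starts with the same prefix, such as cache_, shard on characters from a hash of the name instead.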
You cannot add inodes to an existing ext4 filesystem without reformatting. The inode count is set at creation time with mkfs.ext4 -i bytes-per-inode. If inode exhaustion is a recurring problem on a partition you can't reformat, consider moving the high-file-count directory to its own partition or filesystem (like tmpfs for session files).
Scenario 3 — High I/O Wait, Server Sluggish
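1. Confirm it's disk I/O wait and not NFS or swap. High wa covers all uninterruptible I/O: disk, NFS mounts, and swap all contribute. In the sample below, si and so are 0, which rules out swapping, while b shows four processes blocked on I/O and bo shows heavy block writes.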
$ vmstat 1 5
r b swpd free buff cache si so bi bo in cs us sy id wa
1 4 0 2.1G 200M 4.2G 0 0 1200 73800 2100 3400 8 4 43 45
2. Identify the saturated device with iostat.
$ iostat -xh 1 3
Device r/s w/s rkB/s wkB/s await aqu-sz %util
nvme0n1 8.0 12.0 320.0 480.0 0.9 0.02 2.1
sdb 2.0 1840.0 8.0 73600.0 165.0 24.4 98.8
3. Find which process is responsible with iotop.
$ iotop -o
Total DISK READ: 8.0 KB/s | Total DISK WRITE: 73.60 MB/s
TID PRIO USER DISK READ DISK WRITE IO> COMMAND
12841 be/4 backup 0.00 B/s 73.50 MB/s 96.2% rsync -av /data /mnt/backup
4812 be/4 www-data 8.00 KB/s 0.10 MB/s 1.8% php-fpm
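4. Throttle the backup job with ionice without stopping it. ionice changes a process's I/O scheduling class, the disk equivalent of renice for CPU. It only has an effect when the device's I/O scheduler honours priorities, such as BFQ (or CFQ on older kernels); check the active scheduler with cat /sys/block/sdb/queue/scheduler.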
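$ ionice -c 3 -p 12841                                    # Idle class: only gets disk time when nothing else needs it
$ ionice -c 2 -n 7 -p 12841                               # Alternative: best-effort class at the lowest priority
$ ionice -c 3 nice -n 19 rsync -av /data /mnt/backup      # Next time, start the job already throttled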
$ vmstat 1 3
r b swpd free si so bi bo wa
0 0 0 2.3G 0 0 800 8400 5 ← wa dropped from 45% to 5%
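For jobs launched from cron or scripts, the throttled invocation can be wrapped so it is applied consistently. The wrapper below is a sketch: run_low_priority is a hypothetical name, and the fallback to nice alone is an assumption about what is acceptable when ionice is missing or not permitted (for example in some containers).

```shell
# Run a command at idle I/O class and lowest CPU priority.
# Falls back to nice alone when ionice is unavailable or refused.
run_low_priority() {
    if command -v ionice >/dev/null 2>&1 && ionice -c 3 true 2>/dev/null; then
        ionice -c 3 nice -n 19 "$@"
    else
        nice -n 19 "$@"
    fi
}

# usage: run_low_priority rsync -av /data /mnt/backup
```

The wrapper does not check whether the I/O scheduler honours the idle class, so confirm the effect with vmstat or iostat as shown above.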
5. If the I/O comes from a production service rather than a background job, the disk is genuinely undersized for the workload. Options to investigate:
- Move the high-I/O service to an NVMe SSD if it's currently on HDD
- Add a read cache (bcache, lvmcache) to the slow disk
- For databases: tune innodb_buffer_pool_size (MySQL) or shared_buffers (PostgreSQL) to serve more reads from RAM and reduce disk reads
- For write-heavy workloads: check if sync() / fsync() is being called too frequently. Sometimes a config change (careful: tradeoff with durability) can dramatically reduce write IOPS
Always run iotop at the same time as iostat when diagnosing I/O — iostat identifies the saturated device, iotop identifies the responsible process. Neither alone gives the full picture.
Log Files — The Most Common Disk-Space Culprit
Logs fill disks more reliably than almost anything else. The systemd journal alone can consume dozens of gigabytes without size limits configured.
$ journalctl --disk-usage
Archived and active journals take up 8.3G in the filesystem.
$ journalctl --vacuum-size=500M
$ journalctl --vacuum-time=2weeks
$ grep -E "SystemMaxUse|MaxRetention" /etc/systemd/journald.conf
SystemMaxUse=500M # Hard cap on journal disk usage
MaxRetentionSec=2week # Auto-delete entries older than 2 weeks
$ systemctl restart systemd-journald
$ find /var/log -type f -size +100M -exec ls -lh {} \;
-rw-r--r-- 1 root root 30G Jun 14 /var/log/myapp/debug.log
-rw-r--r-- 1 root root 4.2G Jun 12 /var/log/syslog
$ cat /etc/logrotate.d/nginx
/var/log/nginx/*.log {
daily # Rotate daily
missingok # Don't error if log is missing
rotate 14 # Keep 14 days of logs
compress # Compress old logs (.gz)
delaycompress # Compress previous rotation, not current
notifempty # Don't rotate empty logs
sharedscripts
postrotate
nginx -s reopen # Signal nginx to reopen log files after rotation
endscript
}
$ logrotate -f /etc/logrotate.d/nginx
Mount Options That Affect Performance
The options used when mounting a filesystem can make a measurable difference, particularly for read-heavy or small-file workloads. They're set in /etc/fstab or via the mount command.
noatime
No access time updates
With the traditional strictatime behaviour, Linux updates the "last accessed" timestamp on every file read, so each read can also trigger a metadata write. noatime disables access time updates entirely.
Impact: significant — up to 30% I/O reduction on read-heavy small-file workloads. Safe for most servers. May affect backup tools that use atime to detect changed files.
relatime
Relative access time
A compromise: atime is updated only if the previous atime is older than the file's mtime or ctime, or more than 24 hours old. This is the default on modern Linux, so most systems already have this behaviour without explicitly setting it.
Impact: moderate. Good balance — keeps atime meaningful for tools that use it while eliminating most unnecessary write traffic. The safe default.
errors=remount-ro
Remount read-only on error
If the filesystem encounters an error, remount it read-only instead of continuing to allow writes that could corrupt data. Common default for ext4.
Impact: safety improvement. No performance effect. Prevents data corruption on disk errors — recommended for all production filesystems.
nobarrier
Disable write barriers
Write barriers ensure data is physically on disk before acknowledging a write, protecting against corruption on power loss. Disabling them can improve write performance but risks data loss on power failure.
Impact: dangerous on real hardware. Only safe on VMs where the hypervisor provides equivalent guarantees. Do not use on physical servers without a battery-backed write cache.
$ findmnt -o TARGET,OPTIONS /
TARGET OPTIONS
/ rw,relatime,errors=remount-ro
$ grep " / " /etc/fstab
UUID=abc123 / ext4 rw,relatime,errors=remount-ro 0 1 ← Current entry
UUID=abc123 / ext4 rw,noatime,errors=remount-ro 0 1 ← After editing: relatime replaced with noatime
$ mount -o remount,noatime /
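After a remount it is worth verifying that the option actually took effect. A small sketch follows, with has_mount_option as a hypothetical helper; it matches whole options only, so a search for atime does not match relatime.

```shell
# Succeed if OPTION appears in a comma-separated mount option string.
has_mount_option() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# usage:
#   has_mount_option "$(findmnt -no OPTIONS /)" noatime || echo "/ is not mounted noatime"
```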
Quick Reference — Chapter 4 Commands
| Command | Purpose | Key flags / notes |
| df -h | Disk space per filesystem — human-readable | -i inode usage · -T show filesystem type · any filesystem at 100% = problem |
| df -i | Inode usage per filesystem — check when "no space" despite available space | 100% inodes = no new files can be created regardless of disk space |
| du -sh /* 2>/dev/null | sort -h | Size of each top-level directory, sorted smallest to largest | Drill down iteratively: du -sh /var/* | sort -h etc. |
| ncdu /path | Interactive disk usage explorer — navigate, sort, delete | Much faster than repeated du. May need: apt install ncdu |
| lsof | grep deleted | Find deleted-but-open files still consuming disk space | Pipe to awk '{print $7, $1, $(NF-1), $NF}'| sort -rn to show size + owner + filename |
| truncate -s 0 file | Zero a log file's content while leaving file handle intact | Safer than rm for files held open by running processes |
| kill -HUP PID | Signal a service to reopen its log files after rotation | Works for nginx, Apache, syslog, PostgreSQL and most well-behaved services |
| iostat -xh 1 | Per-device I/O stats — find saturated devices | Watch %util (saturation), await (latency ms), aqu-sz (queue depth) |
| iotop -o | Per-process I/O — find which process is hammering the disk | -b -n 5 batch mode · -a accumulated totals · requires root |
| ionice -c 3 -p PID | Set process to idle I/O class — only gets disk when nothing else needs it | -c 2 -n 7 = best-effort low priority · use for backup jobs, batch work |
| find /dir -type f -delete | Delete all files in a directory when rm -rf fails (too many args) | Add -mtime +N to only delete files older than N days |
| journalctl --disk-usage | Show how much disk the systemd journal is using | --vacuum-size=500M trim immediately · --vacuum-time=2weeks by age |
| logrotate -f /etc/logrotate.d/X | Force an immediate logrotate run for a specific application | Check config with logrotate --debug |
| findmnt -o TARGET,OPTIONS / | Show current mount options for a filesystem | Add noatime in /etc/fstab + mount -o remount,noatime / to apply live |