Chapter 9 — Advanced find and Filesystem Operations
You already know the basics of find from the beginner course. This chapter digs into complex filter expressions, efficient bulk execution, parallel processing with xargs, the diagnostic tools that show who has files open (lsof and fuser), and inotifywait, which reports filesystem events as they happen.
1 — find: Compound Expressions
find builds a boolean expression tree from tests. Understanding operator precedence unlocks searches that would be impossible with a single predicate.
Operators and grouping
| Operator | Meaning | Short form |
|---|---|---|
| -and | Both sides must be true (default) | (space) |
| -or | Either side true | -o |
| -not | Negate | ! |
| ( … ) | Grouping; must be shell-escaped | |
    # Files ending in .log OR .tmp; without grouping, -or has lower precedence
    find /var/app -type f \( -name '*.log' -o -name '*.tmp' \)

    # NOT hidden files (name starts with .)
    find . -not -name '.*'

    # Owned by www-data AND world-writable
    find /srv/web -type f -user www-data -perm -o+w

    # Files modified in the last 7 days but not the last 1 day
    find /data -type f -mtime -7 -not -mtime -1
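To see why the grouping matters, the following minimal sketch builds a throwaway tree and compares the grouped and ungrouped forms. The directory layout and the variable names (demo, ungrouped, grouped) are illustrative assumptions, not part of the chapter's examples.

```bash
# Throwaway tree: one .log file, one .tmp file, and a DIRECTORY named cache.tmp
demo=$(mktemp -d)
touch "$demo/app.log" "$demo/scratch.tmp"
mkdir "$demo/cache.tmp"

# Ungrouped: parsed as (-type f -a -name '*.log') -o (-name '*.tmp'),
# so the directory cache.tmp slips past the -type f test.
ungrouped=$(find "$demo" -type f -name '*.log' -o -name '*.tmp' | wc -l)

# Grouped: -type f applies to both name tests.
grouped=$(find "$demo" -type f \( -name '*.log' -o -name '*.tmp' \) | wc -l)

echo "ungrouped=$ungrouped grouped=$grouped"
rm -rf "$demo"
```

The ungrouped form reports one more match than the grouped form, because the implicit -and binds more tightly than -or.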
Key tests quick-reference
| Test | Meaning |
|---|---|
| -name PATTERN | Basename matches glob (case-sensitive) |
| -iname PATTERN | Same, case-insensitive |
| -type f\|d\|l\|p\|s | File / directory / symlink / FIFO / socket |
| -mtime ±N | Modified N×24 h ago (-N = less than, +N = more than) |
| -mmin ±N | Modified N minutes ago |
| -newer FILE | Modified more recently than FILE |
| -size ±Nc/k/M/G | Exact or range by size unit |
| -empty | Zero-length file or empty directory |
| -perm MODE | Exact permissions; -MODE = at least these bits set |
| -user NAME / -group NAME | Ownership |
| -path PATTERN | Full path matches glob |
| -prune | Do not descend into matched directory |
| -maxdepth N | Limit recursion depth |
| -mindepth N | Skip first N levels |
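The -mtime arithmetic is easy to misread, because find discards the fractional part of the file's age in days before comparing. The sketch below assumes GNU touch (for the relative -d date) and uses a hypothetical 36-hour-old file to show which tests match.

```bash
demo=$(mktemp -d)
# GNU touch accepts relative dates; this file is about 36 hours old
touch -d '36 hours ago' "$demo/old.txt"

# 36 h truncates to 1 whole day, so:
exact=$(find "$demo" -type f -mtime 1  | wc -l)   # exactly 1 whole day: matches
more=$(find "$demo" -type f -mtime +1 | wc -l)    # more than 1 (at least 2 days): no match
less=$(find "$demo" -type f -mtime -1 | wc -l)    # less than 1 (under 24 h): no match
plus0=$(find "$demo" -type f -mtime +0 | wc -l)   # more than 0 (at least 24 h): matches

echo "-mtime 1: $exact, -mtime +1: $more, -mtime -1: $less, -mtime +0: $plus0"
rm -rf "$demo"
```

In practice this means -mtime +N selects files at least N+1 days old, which matters when writing retention rules.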
Pruning subtrees for speed
    # Skip .git directories entirely; note the -prune -o … -print pattern
    find . -name '.git' -prune -o -name '*.py' -print

    # Multiple directories to skip
    find . \( -name '.git' -o -name '.svn' -o -name 'node_modules' \) -prune \
        -o -type f -name '*.js' -print
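The explicit -print at the end of the prune pattern is easy to forget. As a small sketch on a made-up tree (the file names are assumptions for illustration), compare the output with and without it:

```bash
demo=$(mktemp -d)
mkdir -p "$demo/.git/hooks" "$demo/src"
touch "$demo/.git/hooks/pre-commit.py" "$demo/src/main.py"

# With an explicit -print on the right-hand side, only that branch prints.
with_print=$(cd "$demo" && find . -name '.git' -prune -o -name '*.py' -print)

# Without it, find applies an implicit -print to the whole expression,
# so the pruned directory itself is listed as well.
without_print=$(cd "$demo" && find . -name '.git' -prune -o -name '*.py' | sort)

printf 'with -print:\n%s\n\nwithout -print:\n%s\n' "$with_print" "$without_print"
rm -rf "$demo"
```

In both cases nothing below .git is visited; the difference is only whether the .git entry itself appears in the output.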
2 — Executing Commands: -exec vs -exec {} +
One-at-a-time: -exec CMD {} \;
Spawns a new process for every matched file. Simple, but slow for large result sets.
find /tmp -name '*.sock' -exec rm -f {} \;
Batched: -exec CMD {} +
Appends as many matched paths as fit on one command line to each invocation (like xargs built in). Much faster: typically one process, or a handful for very large result sets, instead of N.
    # Delete 10 000 .tmp files: one rm call instead of 10 000
    find /cache -name '*.tmp' -exec rm -f {} +

    # Compress all logs older than 30 days; gzip accepts multiple files
    find /var/log/app -name '*.log' -mtime +30 -exec gzip {} +
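One way to observe the difference is to count how many times the command is actually started. In this sketch each sh invocation prints a single line, so the line count equals the process count; the five file names are arbitrary.

```bash
demo=$(mktemp -d)
touch "$demo/a.tmp" "$demo/b.tmp" "$demo/c.tmp" "$demo/d.tmp" "$demo/e.tmp"

# Each sh invocation prints one line, so counting lines counts processes.
per_file=$(find "$demo" -name '*.tmp' -exec sh -c 'echo run' sh {} \; | wc -l)
batched=$(find "$demo" -name '*.tmp' -exec sh -c 'echo run' sh {} + | wc -l)

echo "per-file invocations: $per_file, batched invocations: $batched"
rm -rf "$demo"
```

With five matches the \; form starts five shells and the + form starts one.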
Note on {} +: {} must be the last argument before +. You cannot write -exec mv {} /dest/ +; use xargs instead when the path needs to appear in the middle of the command.
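Besides xargs, two other workarounds keep {} in last position. The sketch below is illustrative only: the src and dest directories are temporary placeholders, option 1 assumes GNU mv (for -t), and option 2 relies on a small sh wrapper that receives the destination as its first argument.

```bash
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/one.txt" "$src/two words.txt"

# Option 1 (GNU mv only): -t names the target first, so {} can stay last.
# find "$src" -type f -name '*.txt' -exec mv -t "$dest" -- {} +

# Option 2 (portable): the wrapper takes the destination as $1 and the
# batched paths after it, then reorders them for mv.
find "$src" -type f -name '*.txt' \
    -exec sh -c 'dest=$1; shift; mv -- "$@" "$dest"' sh "$dest" {} +

moved=$(find "$dest" -type f | wc -l)
echo "moved $moved files"
rm -rf "$src" "$dest"
```

Both options keep the batching benefit of {} +, which xargs -I would give up because it runs one command per item.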
-execdir: safer for untrusted paths
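-execdir runs the command from the directory that contains the matched file and passes the name as ./filename rather than the full path. Because the path is not resolved again from the starting point, this avoids race conditions (for example, a directory being swapped for a symlink between the match and the command) when processing untrusted trees.
    # Run checksums in each directory rather than with full paths
    find /uploads -type f -execdir sha256sum {} \;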
3 — xargs in Depth
xargs reads items from stdin and builds command lines from them. When -exec {} + can't do what you need, xargs almost certainly can.
Core flags
| Flag | Effect |
|---|---|
| -0 | Items delimited by NUL byte; pair with find -print0 |
| -I REPLACE | Replace occurrences of REPLACE (conventionally {}) in the command; one item per invocation |
| -n N | Maximum N arguments per command |
| -P N | Run up to N processes in parallel |
| -a FILE | Read items from FILE instead of stdin |
| -r | Do not run command if input is empty (GNU xargs) |
| -t | Print each command before executing (trace) |
| -p | Prompt before each execution |
| --max-procs=N | Long form of -P |
Always use -print0 | xargs -0 for filenames
    # Filenames with spaces break plain xargs; use NUL delimiters
    find . -name '*.jpg' -print0 | xargs -0 -I{} convert {} -resize 800x {}   # bad: overwrites
    find . -name '*.jpg' -print0 | xargs -0 -I{} convert {} -resize 800x thumbs/{}
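The effect of the delimiter can be checked directly by counting how many arguments xargs produces. This is a minimal sketch with two made-up file names, one of which contains a space, and it assumes the temporary directory path itself contains no whitespace.

```bash
demo=$(mktemp -d)
touch "$demo/plain.jpg" "$demo/with space.jpg"

# Whitespace splitting: "with space.jpg" is broken into two arguments.
plain=$(find "$demo" -name '*.jpg' | xargs -n1 echo | wc -l)

# NUL splitting: every path stays a single argument.
nul=$(find "$demo" -name '*.jpg' -print0 | xargs -0 -n1 echo | wc -l)

echo "plain xargs saw $plain items, xargs -0 saw $nul items"
rm -rf "$demo"
```

Two files turn into three arguments without -0, which is exactly the kind of error that makes rm or mv act on the wrong paths.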
Parallel processing with -P
    # Compress 8 files simultaneously to saturate CPU cores
    find /archive -name '*.log' -print0 | \
        xargs -0 -P8 -n1 gzip -9

    # Run a custom function in parallel; must use bash -c
    process_file() {
        local f="$1"
        # … do work …
    }
    export -f process_file
    find . -type f -print0 | xargs -0 -P4 -n1 bash -c 'process_file "$@"' _
Reading from a file with -a
    # Process a pre-made list; the paths are NUL-terminated, not one per line
    find /data -name '*.csv' -print0 > /tmp/csv_list.txt
    # … later …
    xargs -0 -a /tmp/csv_list.txt wc -l
Controlling batch size with -n
    # Some commands have argument-count limits; split into batches of 100
    find . -name '*.png' -print0 | xargs -0 -n100 optipng -o2
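To make the batching visible, this small sketch feeds five placeholder items to xargs with -n2 and lets echo stand in for the real command, so each output line corresponds to one invocation.

```bash
# Five NUL-terminated items, at most two per command line.
batches=$(printf '%s\0' a b c d e | xargs -0 -n2 echo)
printf '%s\n' "$batches"
```

The five items are spread over three invocations: two full batches and a final one holding the leftover item.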
4 — lsof: List Open Files
Every open file, socket, pipe, or device in the OS can be queried with lsof. It answers "who has this file open?" and "what does this process have open?"
Common invocations
    # Which process has /var/log/app.log open?
    lsof /var/log/app.log

    # All files open by a specific PID
    lsof -p 1234

    # All files open by process named nginx
    lsof -c nginx

    # All open files in /var/log directory
    lsof +D /var/log

    # All network connections (like netstat)
    lsof -i

    # TCP connections on port 8080
    lsof -i TCP:8080

    # Which process is listening on port 443?
    lsof -i TCP:443 -sTCP:LISTEN

    # Files opened by a specific user
    lsof -u deploy
The deleted-but-still-open pattern
A process that holds a file open keeps the disk space allocated even after rm removes the name. This is the classic "disk is full but df and du disagree" scenario.
    # Find deleted files still consuming disk space
    lsof | grep 'deleted'

    # Typical output, with (deleted) in the NAME column:
    nginx 1823 www 3w REG 8,1 2147483648 ... /var/log/nginx/access.log (deleted)

    # Fix: truncate via the /proc fd, no restart needed
    > /proc/1823/fd/3
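The situation can be reproduced safely in a single shell, without lsof. The sketch below is Linux-only (it relies on /proc and GNU stat) and uses /proc/self instead of a real PID such as 1823; descriptor 3 and the temporary file are arbitrary choices for the demonstration.

```bash
tmp=$(mktemp)
exec 3>"$tmp"                 # keep the file open on descriptor 3
echo "log line" >&3
rm -f "$tmp"                  # the name is gone, the data is not

# /proc/PID/fd/N is a symlink to the original path, suffixed with " (deleted)".
link=$(readlink /proc/self/fd/3)
echo "fd 3 -> $link"

# Truncating through the /proc entry releases the blocks without closing fd 3.
: > /proc/self/fd/3
size=$(stat -L -c '%s' /proc/self/fd/3)
echo "size after truncate: $size"

exec 3>&-                     # closing the last descriptor frees the inode
```

Against a real daemon the same truncation is done with the daemon's PID in place of self, as in the nginx example above.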
Useful lsof flags
| Flag | Meaning |
|---|---|
| -n | No hostname lookup (faster) |
| -P | No port-name lookup (show numbers) |
| -t | Output PIDs only; scriptable |
| -r N | Repeat every N seconds |
| +D DIR | Recurse directory |
| -sTCP:STATE | Filter by TCP state (LISTEN, ESTABLISHED, …) |
    # Kill all processes that have a specific file open
    kill $(lsof -t /mnt/nas/bigfile.dat)

    # Watch open connections, refreshing every 2 seconds
    lsof -i TCP -nP -r2
5 — fuser: Find Processes Using Files or Sockets
fuser is a lighter tool than lsof — it focuses on a single file or socket and returns the PIDs using it. Useful for quickly answering "what is blocking this unmount?"
    # Which processes are using this mount point?
    fuser -m /mnt/usb

    # Which processes are using port 80 (TCP)?
    fuser 80/tcp

    # Verbose: show usernames and access types
    fuser -v /var/log/syslog
                         USER        PID ACCESS COMMAND
    /var/log/syslog:     root        892 F.... rsyslogd

    # Kill all processes using a file (be careful!)
    fuser -k /var/run/app.pid

    # Send a specific signal
    fuser -k -HUP 80/tcp
| Access code | Meaning |
|---|---|
| c | Current directory |
| e | Executable being run |
| f | Open file |
| F | Open file for writing |
| r | Root directory |
| m | mmap'd file or shared library |
6 — inotifywait: Watching for Filesystem Events
inotifywait is part of the inotify-tools package (Linux only). It subscribes to the kernel's inotify API, which delivers events the instant a file changes — no polling, near-zero CPU cost.
One-shot mode
    # Wait until something (here signal.txt) is created in /tmp, then continue.
    # Watch the parent directory: a path that does not exist yet cannot be watched.
    inotifywait -e create /tmp/
    Setting up watches.
    Watches established.
    /tmp/ CREATE signal.txt
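A one-shot watch on a directory returns on the first matching event, whatever file caused it. One way to wait for a specific file is to loop and re-check, as in the sketch below. The helper name wait_for_file and its optional timeout argument are hypothetical; the sketch assumes inotifywait's -t (timeout) option and its documented exit status 2 on timeout.

```bash
# wait_for_file PATH [POLL_SECONDS]  (hypothetical helper, not part of inotify-tools)
wait_for_file() {
    local target="$1"
    local poll="${2:-5}"
    local dir
    dir=$(dirname -- "$target")

    until [ -e "$target" ]; do
        # Block until something is created in or moved into the directory,
        # or until the timeout expires, then re-check for the target itself.
        # Exit status 2 means "timed out", which is expected here; anything
        # else (missing directory, inotifywait not installed) is an error.
        inotifywait -qq -t "$poll" -e create -e moved_to "$dir" \
            || [ $? -eq 2 ] || return 1
    done
}
```

A call such as wait_for_file /tmp/signal.txt then returns once that file exists. The periodic timeout also covers the small window in which the file could appear between the existence check and the watch being established.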
Monitor mode: -m
    # Watch a directory continuously, printing all events
    inotifywait -m -e create,modify,delete,moved_to /var/spool/jobs

    # Parse output in a while-read loop
    inotifywait -m -q --format '%e %f' -e close_write /var/spool/jobs | \
        while read -r event file; do
            echo "Processing: $file ($event)"
            process_job "$file"
        done
Recursive watching: -r
    # Watch an entire directory tree
    inotifywait -m -r -q --format '%w%f %e' -e modify /etc/nginx | \
        while read -r path event; do
            echo "Config changed: $path"
            nginx -t && systemctl reload nginx
        done
Common events
| Event | Triggered when |
|---|---|
| create | File or directory created |
| modify | File contents written to |
| close_write | File written and closed; safer than modify for processing |
| delete | File removed |
| moved_to | File moved into watched directory |
| moved_from | File moved out of watched directory |
| attrib | Permissions, timestamps, or ownership changed |
| access | File read |
Format string tokens
    # Available tokens for --format
    # %w : directory being watched
    # %f : filename (empty if event is on the directory itself)
    # %e : event name(s), comma-separated
    # %T : timestamp (requires --timefmt)
    inotifywait -m -q \
        --timefmt '%Y-%m-%d %H:%M:%S' \
        --format '%T %e %w%f' \
        -e create,close_write \
        /var/spool/incoming
7 — Practical Patterns
Finding and cleaning up old files safely
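    #!/usr/bin/env bash
    cleanup_old_files() {
        local dir="${1:?directory required}"
        local days="${2:-30}"
        local dry_run="${3:-false}"
        local count=0 size=0 file fsize
        local verb='Removed'
        [[ $dry_run != 'false' ]] && verb='Would remove'

        while IFS= read -r -d '' file; do
            fsize=$(stat -c '%s' "$file")
            # Plain assignments stay safe under set -e, unlike (( count++ ))
            size=$(( size + fsize ))
            count=$(( count + 1 ))
            if [[ $dry_run == 'false' ]]; then
                rm -f "$file"
            else
                echo "[dry-run] would delete: $file"
            fi
        # "< <(…)" feeds the loop from find while keeping it in the current
        # shell, so count and size survive after the loop ends
        done < <(find "$dir" -type f -mtime +"$days" -print0)

        printf '%s %d files (%.1f MB)\n' "$verb" "$count" "$(echo "scale=1; $size/1048576" | bc)"
    }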
Parallel checksum generation
    # Generate SHA256 checksums for all files with 8 parallel workers
    find /data/release -type f -print0 | \
        xargs -0 -P8 -n1 sha256sum >> checksums.txt

    # Sort for deterministic output
    sort -k2 checksums.txt -o checksums.txt
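A manifest is only useful if it is checked later. The following sketch generates a manifest for a small throwaway tree and verifies it with sha256sum --check. The directory, file names, and status variables are illustrative assumptions, and the long options assume GNU coreutils.

```bash
release=$(mktemp -d)
manifest=$(mktemp)
printf 'alpha\n' > "$release/a.bin"
printf 'beta\n'  > "$release/b.bin"

# Generate in parallel, then sort by path so the manifest is reproducible.
find "$release" -type f -print0 | xargs -0 -P4 -n1 sha256sum | sort -k2 > "$manifest"

# --quiet prints only failures; the exit status is non-zero on any mismatch.
if sha256sum --check --quiet "$manifest"; then
    verify_status=ok
else
    verify_status=failed
fi

# Change one file and verify again: the check should now fail.
printf 'tampered\n' >> "$release/a.bin"
if sha256sum --check --quiet "$manifest" >/dev/null 2>&1; then
    tamper_status=ok
else
    tamper_status=failed
fi

echo "clean tree: $verify_status, modified tree: $tamper_status"
rm -rf "$release" "$manifest"
```

Keeping the manifest outside the tree being hashed avoids the manifest checksumming itself.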
Auto-reload service on config change
    #!/usr/bin/env bash
    # watch_config.sh: reload a service whenever its config dir changes
    set -euo pipefail

    CONFIG_DIR="${1:?usage: $0 CONFIG_DIR SERVICE}"
    SERVICE="${2:?usage: $0 CONFIG_DIR SERVICE}"

    echo "Watching $CONFIG_DIR for changes…"

    inotifywait -m -r -q \
        --format '%w%f' \
        -e close_write,moved_to,delete \
        "$CONFIG_DIR" | \
        while IFS= read -r changed; do
            echo "Changed: $changed; reloading $SERVICE"
            if systemctl reload "$SERVICE" 2>&1; then
                echo "Reload OK"
            else
                echo "Reload FAILED; check journalctl -u $SERVICE" >&2
            fi
        done
Find duplicate files by checksum
    #!/usr/bin/env bash
    # Caveat: paths are split on whitespace (by awk and by the unquoted $files),
    # so this version only handles filenames without spaces or newlines.
    find_duplicates() {
        local dir="${1:-.}"
        # Group files by size first (cheap), then checksum only same-size groups
        find "$dir" -type f -print0 | \
            xargs -0 -r stat --format '%s %n' | \
            sort -n | \
            awk '{ size[$1] = size[$1] " " $2; count[$1]++ }
                 END { for (s in count) if (count[s]>1) print size[s] }' | \
            while read -r files; do
                # Checksum only the size-matched candidates
                # shellcheck disable=SC2086
                md5sum $files | sort | awk 'prev==$1 { print prev_f; print $2 } { prev=$1; prev_f=$2 }'
            done
    }
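When filenames may contain spaces, a simpler variant is to hash every file and let uniq group identical digests. This is a different technique from the size-first approach above: it trades the size pre-filter for robustness. The function name find_duplicates_by_hash is hypothetical, and the sketch assumes GNU uniq (for -w and --all-repeated).

```bash
find_duplicates_by_hash() {
    # Hash every regular file, sort so equal digests are adjacent, then keep
    # only lines whose first 64 characters (the SHA-256 hex digest) repeat.
    find "${1:-.}" -type f -exec sha256sum {} + |
        sort |
        uniq -w64 --all-repeated=separate
}

# Demo on a throwaway tree: two identical files and one unique file
demo=$(mktemp -d)
printf 'same\n'  > "$demo/copy one.txt"
printf 'same\n'  > "$demo/copy two.txt"
printf 'other\n' > "$demo/unique.txt"

dupes=$(find_duplicates_by_hash "$demo")
printf '%s\n' "$dupes"
rm -rf "$demo"
```

Each group of duplicates is printed as consecutive lines, with groups separated by a blank line. On large trees the size-first filter is still worth having, since this version reads every file in full.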
Exercises
Exercise 1 — Stale file report
Write a script stale_report.sh DIR [DAYS] (default 14 days) that:
- Uses find with NUL delimiters to list files in DIR not modified within DAYS days
- Groups them by extension and prints a summary table: extension, count, total size
- Highlights groups over 100 MB with a warning prefix
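    #!/usr/bin/env bash
    set -euo pipefail

    DIR="${1:?usage: $0 DIR [DAYS]}"
    DAYS="${2:-14}"

    declare -A ext_count=() ext_size=()

    while IFS= read -r -d '' file; do
        base="${file##*/}"                       # ignore dots in directory names
        ext="${base##*.}"
        [[ $ext == "$base" ]] && ext='(none)'    # no extension
        sz=$(stat -c '%s' "$file")
        # Plain assignments: (( x++ )) returns status 1 when x was 0,
        # which would abort the script under set -e
        ext_count[$ext]=$(( ${ext_count[$ext]:-0} + 1 ))
        ext_size[$ext]=$(( ${ext_size[$ext]:-0} + sz ))
    done < <(find "$DIR" -type f -mtime +"$DAYS" -print0)

    printf '%-15s %8s %12s\n' 'Extension' 'Count' 'Total Size'
    printf '%s\n' '---------------------------------------'

    # Sort on the raw numbers first, then format, so the [LARGE] prefix
    # cannot shift the sort column
    for ext in "${!ext_count[@]}"; do
        printf '%d\t%d\t%s\n' "$(( ${ext_size[$ext]} / 1048576 ))" "${ext_count[$ext]}" "$ext"
    done | sort -rn | while IFS=$'\t' read -r mb count ext; do
        prefix=''
        (( mb > 100 )) && prefix='[LARGE] '
        printf '%s%-15s %8d %9d MB\n' "$prefix" ".$ext" "$count" "$mb"
    done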
Exercise 2 — Parallel image resize
Write a script that:
- Accepts a source directory and an output directory as arguments
- Uses find + xargs -P to resize all .jpg and .png files to 800px wide in parallel (use convert from ImageMagick)
- Preserves directory structure under the output directory
- Skips files that already exist in the output directory
    #!/usr/bin/env bash
    set -euo pipefail

    SRC="${1:?usage: $0 SRC_DIR OUT_DIR}"
    OUT="${2:?usage: $0 SRC_DIR OUT_DIR}"
    SRC="${SRC%/}"    # drop a trailing slash so the prefix strip below matches

    resize_one() {
        local src="$1" src_base="$2" out="$3"
        local rel="${src#"$src_base"/}"    # path relative to the source root
        local dest="$out/$rel"
        [[ -f "$dest" ]] && return 0       # already done
        mkdir -p "$(dirname "$dest")"
        convert "$src" -resize 800x "$dest"
        echo "Resized: $rel"
    }
    export -f resize_one    # make the function visible to the bash -c children

    find "$SRC" -type f \( -iname '*.jpg' -o -iname '*.png' \) -print0 | \
        xargs -0 -P"$(nproc)" -I{} bash -c 'resize_one "$@"' _ {} "$SRC" "$OUT"
Exercise 3 — Open-handle diagnostic
Write a function file_users FILE that:
- Uses lsof to list all processes with FILE open
- Prints a formatted table: PID, user, access mode, command name
- Returns exit code 1 if no processes have the file open (safe-to-delete signal)
- Warns if the file shows as deleted but still held open
    file_users() {
        local file="${1:?file required}"
        local raw cmd pid user fd rest mode

        # -n and -P skip host and port name lookups; the default column
        # output is parsed below
        raw=$(lsof -nP "$file" 2>/dev/null) || true

        if [[ -z "$raw" ]]; then
            echo "No processes have $file open. Safe to delete."
            return 1
        fi

        if grep -q '(deleted)' <<<"$raw"; then
            echo "WARNING: file is deleted but still held open; disk space not released!" >&2
        fi

        printf '%-8s %-12s %-6s %s\n' 'PID' 'USER' 'MODE' 'COMMAND'
        printf '%s\n' '------------------------------------------'

        # Skip header line from lsof, parse remaining
        while read -r cmd pid user fd rest; do
            [[ $cmd == 'COMMAND' ]] && continue
            mode="${fd: -1}"    # last char of the FD field: r/w/u for numeric descriptors
            printf '%-8s %-12s %-6s %s\n' "$pid" "$user" "$mode" "$cmd"
        done <<<"$raw"
    }
Exercise 4 — File-event processor
Write a script job_runner.sh SPOOL_DIR that:
- Uses inotifywait in monitor mode to watch SPOOL_DIR for close_write and moved_to events
- On each event, processes the file (e.g. wc -l and move to a done/ subdirectory)
- Logs each action with a timestamp to a logfile in the spool directory
- Handles the case where inotify-tools is not installed (print a clear error and exit 1)
    #!/usr/bin/env bash
    set -euo pipefail

    SPOOL="${1:?usage: $0 SPOOL_DIR}"
    LOG="$SPOOL/job_runner.log"
    DONE="$SPOOL/done"

    # Dependency check
    if ! command -v inotifywait >/dev/null 2>&1; then
        echo "ERROR: inotifywait not found. Install with: apt install inotify-tools" >&2
        exit 1
    fi

    mkdir -p "$DONE"

    log() { printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" | tee -a "$LOG"; }

    process_job() {
        local file="$SPOOL/$1"
        # Return 0 explicitly: a non-zero status here would stop the loop under set -e
        [[ -f "$file" ]] || return 0
        local lines
        lines=$(wc -l < "$file")
        log "Processed $1: $lines lines"
        mv "$file" "$DONE/$1"
    }

    log "Watching $SPOOL for new jobs…"

    inotifywait -m -q \
        --format '%f' \
        -e close_write,moved_to \
        "$SPOOL" | \
        while IFS= read -r fname; do
            # Skip the log file and done directory itself
            [[ $fname == 'job_runner.log' || $fname == 'done' ]] && continue
            process_job "$fname"
        done