Chapter 9 — Advanced find and Filesystem Operations

You already know the basics of find from the beginner course. This chapter digs into complex filter expressions, efficient bulk execution, parallel processing with xargs, and the diagnostic tools that show who has files open (lsof, fuser) and report filesystem events as they happen (inotifywait).

1 — find: Compound Expressions

find builds a boolean expression tree from tests. Understanding operator precedence unlocks searches that would be impossible with a single predicate.

Operators and grouping

Operator   Meaning                             Short form
-and       Both sides must be true (default)   (space)
-or        Either side true                    -o
-not       Negate                              !
( … )      Grouping; must be shell-escaped

# Files ending in .log OR .tmp. The parentheses are required: -o binds more
# loosely than the implicit -and, so without them -type f would apply only to *.log
find /var/app -type f \( -name '*.log' -o -name '*.tmp' \)

# NOT hidden files (name starts with .)
find . -not -name '.*'

# Owned by www-data AND world-writable
find /srv/web -type f -user www-data -perm -o+w

# Files modified in the last 7 days but not the last 1 day
find /data -type f -mtime -7 -not -mtime -1
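
To see why the parentheses in the first example matter, the sketch below runs the same search with and without grouping. It assumes a scratch directory created with mktemp; the file and directory names are made up for illustration.

```shell
# Scratch tree: two regular files plus a DIRECTORY whose name ends in .tmp
demo=$(mktemp -d)
touch "$demo/app.log" "$demo/cache.tmp"
mkdir "$demo/build.tmp"

# Ungrouped: parsed as ( -type f -a -name '*.log' ) -o ( -name '*.tmp' ),
# so -type f does not constrain the *.tmp branch and the directory matches too.
ungrouped=$(find "$demo" -type f -name '*.log' -o -name '*.tmp' | sort)

# Grouped: -type f applies to both name patterns.
grouped=$(find "$demo" -type f \( -name '*.log' -o -name '*.tmp' \) | sort)

echo "ungrouped:"; echo "$ungrouped"
echo "grouped:";   echo "$grouped"
rm -rf "$demo"
```

The ungrouped form lists three entries (including the build.tmp directory); the grouped form lists only the two regular files.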

Key tests quick-reference

Test                       Meaning
-name PATTERN              Basename matches glob (case-sensitive)
-iname PATTERN             Same, case-insensitive
-type f|d|l|p|s            File / directory / symlink / FIFO / socket
-mtime ±N                  Modified N×24 h ago (-N = less than, +N = more than)
-mmin ±N                   Modified N minutes ago
-newer FILE                Modified more recently than FILE
-size ±Nc/k/M/G            Exact or range by size unit
-empty                     Zero-length file or empty directory
-perm MODE                 Exact permissions; -MODE = at least these bits set
-user NAME / -group NAME   Ownership
-path PATTERN              Full path matches glob
-prune                     Do not descend into matched directory
-maxdepth N                Limit recursion depth
-mindepth N                Skip first N levels
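
The difference between the -perm forms is easiest to see side by side. This is a small sketch on a scratch directory (names are illustrative). The third form, -perm /MODE ("any of these bits"), is not in the table above; it is a GNU find extension and is included here as an assumption about your find version.

```shell
demo=$(mktemp -d)
touch "$demo/exact" "$demo/wider" "$demo/private"
chmod 644 "$demo/exact"     # rw-r--r--
chmod 664 "$demo/wider"     # rw-rw-r--
chmod 600 "$demo/private"   # rw-------

# -perm 644: the mode must be exactly 644
exact=$(find "$demo" -type f -perm 644 -exec basename {} \;)

# -perm -644: every bit of 644 must be set; extra bits are allowed
at_least=$(find "$demo" -type f -perm -644 -exec basename {} \; | sort)

# -perm /022: group-write OR other-write is set (GNU find)
any_of=$(find "$demo" -type f -perm /022 -exec basename {} \;)

printf 'exact 644:    %s\n' "$exact"
printf 'at least 644: %s\n' "$(echo $at_least)"
printf 'any of 022:   %s\n' "$any_of"
rm -rf "$demo"
```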

Pruning subtrees for speed

# Skip .git directories entirely — note the -prune -o … -print pattern
find . -name '.git' -prune -o -name '*.py' -print

# Multiple directories to skip
find . \( -name '.git' -o -name '.svn' -o -name 'node_modules' \) -prune \
  -o -type f -name '*.js' -print

Short-circuit evaluation: find evaluates left to right and stops as soon as the result is determined. Put cheap tests (-type, -name) before expensive ones (-newer, stat-based tests) to keep things fast.
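
The explicit -print in the prune pattern is not decoration. The sketch below, on a made-up scratch tree, compares the pattern with and without it: when no action is given, find applies an implicit -print to the whole expression, so the pruned directory itself shows up in the output.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/.git/hooks" "$demo/src"
touch "$demo/.git/hooks/pre-commit.py" "$demo/src/main.py"

# Explicit -print on the right-hand side: only the wanted file is printed.
with_print=$(cd "$demo" && find . -name '.git' -prune -o -name '*.py' -print)

# No action anywhere: the implicit -print covers both branches, so ./.git
# is listed as well (its contents are still skipped).
without_print=$(cd "$demo" && find . -name '.git' -prune -o -name '*.py' | sort)

echo "with -print:";    echo "$with_print"
echo "without -print:"; echo "$without_print"
rm -rf "$demo"
```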

2 — Executing Commands: -exec {} \; vs -exec {} +

One-at-a-time: -exec CMD {} \;

Spawns a new process for every matched file. Simple, but slow for large result sets.

find /tmp -name '*.sock' -exec rm -f {} \;

Batched: -exec CMD {} +

Appends as many matched paths as fit on one command line and runs the command once per batch (like xargs built in). Much faster: typically one process, or a handful for very large result sets, instead of N.

# Delete 10 000 .tmp files: typically one rm call instead of 10 000
find /cache -name '*.tmp' -exec rm -f {} +

# Compress all logs older than 30 days — gzip accepts multiple files
find /var/log/app -name '*.log' -mtime +30 -exec gzip {} +

Limitation of {} +: {} must be the last argument before +. You cannot write -exec mv {} /dest/ +. Use xargs instead when the path needs to appear in the middle of the command.
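
Two ways around that limitation are sketched below on a scratch directory (all names are illustrative). The first keeps {} + and lets a small sh wrapper reorder the arguments; the second hands the paths to xargs -I, which runs one mv per file.

```shell
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/dest"
touch "$demo/src/a.txt" "$demo/src/b c.txt"

# Option 1: {} stays last. The wrapper takes the destination as its first
# argument, shifts it off, and passes the batched paths to mv before it.
find "$demo/src" -type f -name 'a*' \
  -exec sh -c 'dest=$1; shift; mv -- "$@" "$dest"' sh "$demo/dest" {} +

# Option 2: xargs -I puts the placeholder anywhere (one mv per file).
find "$demo/src" -type f -print0 | xargs -0 -I{} mv -- {} "$demo/dest/"

moved=$(ls "$demo/dest")
echo "$moved"
rm -rf "$demo"
```

Option 1 keeps the batching benefit; option 2 is simpler to read but spawns a process per file.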

-execdir: safer for untrusted paths

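-execdir runs the command from the directory that contains the matched file and passes ./filename rather than the full path. Because the path is not resolved again from the starting point, this avoids race conditions in which a directory component is swapped (for a symlink, say) between the match and the command. That makes it the safer choice when processing trees that untrusted users can write to.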

# Run checksums in each directory rather than with full paths
find /uploads -type f -execdir sha256sum {} \;
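
The following sketch shows what the command actually receives under each form. It assumes GNU find (which passes ./name to -execdir and refuses to run it if PATH contains a relative or empty entry); the directory layout is invented for the demonstration.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/uploads/alice"
touch "$demo/uploads/alice/report.txt"

# -exec: the command sees the path as find built it from the start point.
via_exec=$(find "$demo/uploads" -type f -exec echo {} \;)

# -execdir: the command runs inside the file's directory and sees ./name.
via_execdir=$(find "$demo/uploads" -type f \
  -execdir sh -c 'echo "$1 (cwd: $(basename "$(pwd -P)"))"' sh {} \;)

echo "-exec:    $via_exec"
echo "-execdir: $via_execdir"
rm -rf "$demo"
```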

3 — xargs in Depth

xargs reads items from stdin and builds command lines from them. When -exec {} + can't do what you need, xargs almost certainly can.

Core flags

Flag            Effect
-0              Items delimited by NUL byte; pair with find -print0
-I REPLACE      Replace each REPLACE in the command with the item (the string is required, conventionally {}); one item per invocation
-n N            Maximum N arguments per command
-P N            Run up to N processes in parallel
-a FILE         Read items from FILE instead of stdin
-r              Do not run command if input is empty (GNU xargs)
-t              Print each command before executing (trace)
-p              Prompt before each execution
--max-procs=N   Long form of -P

Always use -print0 | xargs -0 for filenames

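# Filenames with spaces break plain xargs, so use NUL delimiters.
# Bad: input and output are the same path, so every original is overwritten
find . -name '*.jpg' -print0 | xargs -0 -I{} convert {} -resize 800x {}
# Better: write to a separate tree. {} expands to ./path/name.jpg, so any
# subdirectories must already exist under thumbs/
find . -name '*.jpg' -print0 | xargs -0 -I{} convert {} -resize 800x thumbs/{}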
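
The breakage is easy to reproduce. The sketch below counts how many arguments xargs hands to the command for a single file whose (made-up) name contains a space. It assumes the mktemp path itself contains no whitespace.

```shell
demo=$(mktemp -d)
touch "$demo/holiday photo.jpg"

# Newline-delimited output into whitespace-splitting xargs:
# the one path is split at the space into two arguments.
plain=$(find "$demo" -name '*.jpg' | xargs -n1 echo | wc -l)

# NUL-delimited on both sides: the path arrives as a single argument.
nul=$(find "$demo" -name '*.jpg' -print0 | xargs -0 -n1 echo | wc -l)

echo "plain xargs saw $((plain)) arguments; xargs -0 saw $((nul))"
rm -rf "$demo"
```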

Parallel processing with -P

# Compress 8 files simultaneously — saturate CPU cores
find /archive -name '*.log' -print0 | \
  xargs -0 -P8 -n1 gzip -9

# Run a custom function in parallel — must use bash -c
process_file() {
  local f="$1"
  # … do work …
}
export -f process_file
find . -type f -print0 | xargs -0 -P4 -n1 bash -c 'process_file "$@"' _
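
A rough way to confirm that -P really overlaps the work is to time a job that only sleeps. The sketch below uses four one-second sleeps as stand-in jobs; timings are whole seconds from date, so treat the numbers as approximate.

```shell
# Four 1-second jobs, one at a time: roughly 4 seconds in total.
start=$(date +%s)
printf '%s\0' 1 1 1 1 | xargs -0 -n1 sleep
serial=$(( $(date +%s) - start ))

# The same four jobs with up to four running at once: roughly 1 second.
start=$(date +%s)
printf '%s\0' 1 1 1 1 | xargs -0 -n1 -P4 sleep
parallel=$(( $(date +%s) - start ))

echo "serial: ${serial}s, parallel: ${parallel}s"
```

Real workloads scale less cleanly than sleep does: once the disk or the CPU is saturated, raising -P further stops helping.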

Reading from a file with -a

# Process a pre-made list. Paths are NUL-terminated (not one per line), so read it back with -0
find /data -name '*.csv' -print0 > /tmp/csv_list.txt
# … later …
xargs -0 -a /tmp/csv_list.txt wc -l

Controlling batch size with -n

# Some commands have argument-count limits — split into batches of 100
find . -name '*.png' -print0 | xargs -0 -n100 optipng -o2
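
To see how -n splits the input, substitute echo for the real command. The sketch below feeds five placeholder items through -n2, and also shows -r suppressing the run on empty input (the -r behaviour shown is that of GNU xargs).

```shell
# Five NUL-delimited items, at most two per command line.
batches=$(printf '%s\0' a b c d e | xargs -0 -n2 echo)
echo "$batches"
# a b
# c d
# e

# Empty input with -r: the command is never started, so nothing is printed.
empty=$(printf '' | xargs -r echo "ran")
echo "output on empty input: '$empty'"
```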

4 — lsof: List Open Files

Every open file, socket, pipe, or device in the OS can be queried with lsof. It answers "who has this file open?" and "what does this process have open?"

Common invocations

# Which process has /var/log/app.log open?
lsof /var/log/app.log

# All files open by a specific PID
lsof -p 1234

# All files open by process named nginx
lsof -c nginx

# All open files in /var/log directory
lsof +D /var/log

# All network connections (like netstat)
lsof -i

# TCP connections on port 8080
lsof -i TCP:8080

# Which process is listening on port 443?
lsof -i TCP:443 -sTCP:LISTEN

# Files opened by a specific user
lsof -u deploy

The deleted-but-still-open pattern

A process that holds a file open keeps the disk space allocated even after rm removes the name. This is the classic "disk is full but df and du disagree" scenario.

# Find deleted files still consuming disk space
lsof | grep '(deleted)'
# Narrower alternative: list only open files whose link count is below 1
lsof +L1

# Typical output — (deleted) in the NAME column:
nginx   1823  www  3w  REG  8,1  2147483648  ... /var/log/nginx/access.log (deleted)

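# Fix: truncate via the /proc fd, no restart needed.
# The leading ':' is a no-op command; a bare '>' redirection behaves differently in some shells.
: > /proc/1823/fd/3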
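
The whole scenario can be reproduced without lsof, using the current shell as the process that holds the file open. This is a sketch that assumes Linux with /proc mounted and GNU stat; the file name is invented, and fd 3 is chosen only to mirror the example above.

```shell
demo=$(mktemp -d)

# Hold the file open on fd 3 in this shell, write to it, then unlink it.
exec 3>"$demo/app.log"
printf 'some log data\n' >&3
rm "$demo/app.log"

# The name is gone, but the descriptor (and the disk space) is still there.
link=$(readlink "/proc/$$/fd/3")             # target ends in " (deleted)"
before=$(stat -L -c '%s' "/proc/$$/fd/3")    # size in bytes via the fd

# Truncate through the /proc entry, exactly as in the fix above.
: > "/proc/$$/fd/3"
after=$(stat -L -c '%s' "/proc/$$/fd/3")

echo "link target: $link"
echo "size before: $before, after: $after"

exec 3>&-          # closing the last descriptor finally frees the inode
rm -rf "$demo"
```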

Useful lsof flags

Flag          Meaning
-n            No hostname lookup (faster)
-P            No port-name lookup (show numbers)
-t            Output PIDs only (scriptable)
-r N          Repeat every N seconds
+D DIR        Recurse directory
-sTCP:STATE   Filter by TCP state (LISTEN, ESTABLISHED, …)

# Kill all processes that have a specific file open
kill $(lsof -t /mnt/nas/bigfile.dat)

# Watch open connections, refreshing every 2 seconds
lsof -i TCP -nP -r2

5 — fuser: Find Processes Using Files or Sockets

fuser is a lighter tool than lsof — it focuses on a single file or socket and returns the PIDs using it. Useful for quickly answering "what is blocking this unmount?"

# Which processes are using this mount point?
fuser -m /mnt/usb

# Which processes are using port 80 (TCP)?
fuser 80/tcp

# Verbose — show usernames and access types
fuser -v /var/log/syslog
                     USER        PID ACCESS COMMAND
/var/log/syslog:     root        892 F....  rsyslogd

# Kill all processes using a file (be careful!)
fuser -k /var/run/app.pid

# Send a specific signal
fuser -k -HUP 80/tcp

Access code   Meaning
c             Current directory
e             Executable being run
f             Open file
F             Open file for writing
r             Root directory
m             mmap'd file or shared library

fuser vs lsof: Use fuser for a quick "what's blocking unmount" check. Use lsof when you need the full picture: network connections, deleted files, or per-process fd lists.

6 — inotifywait: Watching for Filesystem Events

inotifywait is part of the inotify-tools package (Linux only). It subscribes to the kernel's inotify API, which delivers events the instant a file changes — no polling, near-zero CPU cost.

On macOS, fswatch fills the same role; on the BSDs, the equivalent tools are built on the kqueue API. The examples below assume Linux.

One-shot mode

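# Wait until something (for example signal.txt) is created in /tmp, then continue.
# The watch goes on the directory: a file that does not exist yet cannot be watched.
inotifywait -e create /tmp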
Setting up watches.
Watches established.
/tmp/ CREATE signal.txt

Monitor mode: -m

# Watch a directory continuously, printing all events
inotifywait -m -e create,modify,delete,moved_to /var/spool/jobs

# Parse output in a while-read loop
inotifywait -m -q --format '%e %f' -e close_write /var/spool/jobs | \
while read -r event file; do
  echo "Processing: $file ($event)"
  process_job "$file"
done
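
The order of tokens in --format matters for the while-read loop: read gives the first word to the first variable and the remainder of the line, spaces included, to the last one. The sketch below demonstrates this with two hand-written sample lines standing in for inotifywait output (they are not captured from a real run, and sample_events is a hypothetical helper).

```shell
# Stand-in for: inotifywait -m -q --format '%e %f' ...
sample_events() {
  printf '%s\n' 'CLOSE_WRITE,CLOSE job 42.txt' 'MOVED_TO batch.csv'
}

# Event first, filename last: a filename containing spaces stays in one piece.
parsed=$(sample_events | while read -r event file; do
  printf 'event=%s file=%s\n' "$event" "$file"
done)

echo "$parsed"
# event=CLOSE_WRITE,CLOSE file=job 42.txt
# event=MOVED_TO file=batch.csv
```

Putting the filename first would split "job 42.txt" at the space, which is why the event token leads in the format string.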

Recursive watching: -r

# Watch an entire directory tree
# Event first, path last, so that paths containing spaces survive the read
inotifywait -m -r -q --format '%e %w%f' -e modify /etc/nginx | \
while read -r event path; do
  echo "Config changed: $path"
  nginx -t && systemctl reload nginx
done

Common events

Event         Triggered when
create        File or directory created
modify        File contents written to
close_write   File written and closed (safer than modify for processing)
delete        File removed
moved_to      File moved into watched directory
moved_from    File moved out of watched directory
attrib        Permissions, timestamps, or ownership changed
access        File read

Format string tokens

# Available tokens for --format
# %w  — directory being watched
# %f  — filename (empty if event is on the directory itself)
# %e  — event name(s), comma-separated
# %T  — timestamp (requires --timefmt)

inotifywait -m -q \
  --timefmt '%Y-%m-%d %H:%M:%S' \
  --format  '%T %e %w%f' \
  -e create,close_write \
  /var/spool/incoming

7 — Practical Patterns

Finding and cleaning up old files safely

#!/usr/bin/env bash
cleanup_old_files() {
  local dir="${1:?directory required}"
  local days="${2:-30}"
  local dry_run="${3:-false}"

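  local count=0 size=0 file fsize

  # < <(...) feeds find's NUL-delimited output to the loop without a subshell,
  # so count and size keep their values after the loop ends.
  while IFS= read -r -d '' file; do
    fsize=$(stat -c '%s' "$file")
    (( size += fsize ))
    (( count += 1 ))
    if [[ $dry_run == 'false' ]]; then
      rm -f -- "$file"
    else
      echo "[dry-run] would delete: $file"
    fi
  done < <(find "$dir" -type f -mtime +"$days" -print0)

  # Report honestly in dry-run mode: nothing was actually removed.
  local verb='Removed'
  [[ $dry_run == 'false' ]] || verb='Would remove'
  printf '%s %d files (%.1f MB)\n' "$verb" "$count" "$(echo "scale=1; $size/1048576" | bc)"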
}

Parallel checksum generation

# Generate SHA256 checksums for all files with 8 parallel workers.
# '>' rather than '>>' so that a rerun does not append duplicate entries.
find /data/release -type f -print0 | \
  xargs -0 -P8 -n1 sha256sum > checksums.txt

# Sort for deterministic output
sort -k2 checksums.txt -o checksums.txt

Auto-reload service on config change

#!/usr/bin/env bash
# watch_config.sh — reload a service whenever its config dir changes
set -euo pipefail

CONFIG_DIR="${1:?usage: $0 CONFIG_DIR SERVICE}"
SERVICE="${2:?usage: $0 CONFIG_DIR SERVICE}"

echo "Watching $CONFIG_DIR for changes…"

inotifywait -m -r -q \
  --format '%w%f' \
  -e close_write,moved_to,delete \
  "$CONFIG_DIR" | \
while IFS= read -r changed; do
  echo "Changed: $changed — reloading $SERVICE"
  if systemctl reload "$SERVICE" 2>&1; then
    echo "Reload OK"
  else
    echo "Reload FAILED — check journalctl -u $SERVICE" >&2
  fi
done

Find duplicate files by checksum

#!/usr/bin/env bash
find_duplicates() {
  local dir="${1:-.}"
  # Group files by size first (cheap), then checksum only same-size groups.
  # Caveat: the awk and word-splitting steps assume paths contain no whitespace.
  find "$dir" -type f -print0 | \
    xargs -0 stat --format '%s %n' | \
    sort -n | \
    awk '{ size[$1] = size[$1] " " $2; count[$1]++ }
         END { for (s in count) if (count[s]>1) print size[s] }' | \
  while read -r files; do
    # Checksum only the size-matched candidates
    # shellcheck disable=SC2086
    md5sum $files | sort | awk 'prev==$1 { print prev_f; print $2 } { prev=$1; prev_f=$2 }'
  done
}
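
For trees where filenames may contain spaces, a simpler (if slower) variant is sketched below. It is not the size-first approach of the function above: it hashes every file, then lets uniq compare only the digest column. It assumes GNU uniq for -w and --all-repeated; the sample files are created just for the demonstration.

```shell
demo=$(mktemp -d)
printf 'same\n'  > "$demo/a.txt"
printf 'same\n'  > "$demo/copy of a.txt"
printf 'other\n' > "$demo/b.txt"

# Hash everything, sort by digest, and keep only lines whose first
# 64 characters (the SHA-256 hex digest) occur more than once.
dups=$(find "$demo" -type f -exec sha256sum {} + |
         sort |
         uniq -w64 --all-repeated=separate)

echo "$dups"
rm -rf "$demo"
```

Each group of identical files is printed together, with a blank line between groups when there are several.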

Exercises

Exercise 1 — Stale file report

Write a script stale_report.sh DIR [DAYS] (default 14 days) that:

  • Uses find with NUL delimiters to list files in DIR not modified within DAYS days
  • Groups them by extension and prints a summary table: extension, count, total size
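  • Highlights groups over 100 MB with a warning prefix

#!/usr/bin/env bash
set -euo pipefail

DIR="${1:?usage: $0 DIR [DAYS]}"
DAYS="${2:-14}"

declare -A ext_count=() ext_size=()

# < <(...) keeps the loop in the current shell so the arrays survive it.
while IFS= read -r -d '' file; do
  base="${file##*/}"                 # strip directories before looking for a dot
  if [[ $base == ?*.* ]]; then
    ext=".${base##*.}"
  else
    ext='(none)'                     # no extension, or a dotfile such as .bashrc
  fi
  sz=$(stat -c '%s' "$file")
  # Explicit defaults: under set -eu, (( x++ )) on an unset or zero value aborts.
  ext_count[$ext]=$(( ${ext_count[$ext]:-0} + 1 ))
  ext_size[$ext]=$(( ${ext_size[$ext]:-0} + sz ))
done < <(find "$DIR" -type f -mtime +"$DAYS" -print0)

printf '%-15s %8s %12s\n' 'Extension' 'Count' 'Total Size'
printf '%s\n' '---------------------------------------'

# Emit size-first, tab-separated rows, sort by size, then format.
# Sorting before the prefix is added keeps the columns aligned for sort.
for ext in "${!ext_count[@]}"; do
  printf '%d\t%d\t%s\n' "$(( ${ext_size[$ext]} / 1048576 ))" "${ext_count[$ext]}" "$ext"
done | sort -t $'\t' -k1,1nr |
while IFS=$'\t' read -r mb count ext; do
  prefix=''
  (( mb > 100 )) && prefix='[LARGE] '
  printf '%s%-15s %8d %9d MB\n' "$prefix" "$ext" "$count" "$mb"
done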

Exercise 2 — Parallel image resize

Write a script that:

  • Accepts a source directory and an output directory as arguments
  • Uses find + xargs -P to resize all .jpg and .png files to 800px wide in parallel (use convert from ImageMagick)
  • Preserves directory structure under the output directory
  • Skips files that already exist in the output directory

#!/usr/bin/env bash
set -euo pipefail

SRC="${1:?usage: $0 SRC_DIR OUT_DIR}"
OUT="${2:?usage: $0 SRC_DIR OUT_DIR}"
SRC="${SRC%/}"   # drop a trailing slash so the prefix strip below matches

resize_one() {
  local src="$1" src_base="$2" out="$3"
  local rel="${src#"$src_base"/}"   # path relative to the source root
  local dest="$out/$rel"

  [[ -f "$dest" ]] && return 0   # already done

  mkdir -p "$(dirname "$dest")"
  convert "$src" -resize 800x "$dest"
  echo "Resized: $rel"
}

export -f resize_one

find "$SRC" -type f \( -iname '*.jpg' -o -iname '*.png' \) -print0 | \
  xargs -0 -P"$(nproc)" -I{} bash -c 'resize_one "$@"' _ {} "$SRC" "$OUT"

Exercise 3 — Open-handle diagnostic

Write a function file_users FILE that:

  • Uses lsof to list all processes with FILE open
  • Prints a formatted table: PID, user, access mode, command name
  • Returns exit code 1 if no processes have the file open (safe-to-delete signal)
  • Warns if the file shows as deleted but still held open

file_users() {
  local file="${1:?file required}"
  local raw cmd pid user fd rest mode

  # Capture lsof's default column output; -n and -P skip host and port lookups.
  # lsof exits non-zero when nothing matches, hence the || true.
  raw=$(lsof -nP "$file" 2>/dev/null) || true

  if [[ -z "$raw" ]]; then
    echo "No processes have $file open. Safe to delete."
    return 1
  fi

  if grep -q '(deleted)' <<<"$raw"; then
    echo "WARNING: file is deleted but still held open — disk space not released!" >&2
  fi

  printf '%-8s %-12s %-6s %s\n' 'PID' 'USER' 'MODE' 'COMMAND'
  printf '%s\n' '------------------------------------------'

  # Skip header line from lsof, parse remaining
  while read -r cmd pid user fd rest; do
    [[ $cmd == 'COMMAND' ]] && continue
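    # Numeric descriptors carry r, w, or u after the number; entries such as
    # cwd, txt, or mem have no access mode.
    if [[ $fd =~ ^[0-9]+([rwu]) ]]; then mode="${BASH_REMATCH[1]}"; else mode='-'; fi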
    printf '%-8s %-12s %-6s %s\n' "$pid" "$user" "$mode" "$cmd"
  done <<<"$raw"
}

Exercise 4 — File-event processor

Write a script job_runner.sh SPOOL_DIR that:

  • Uses inotifywait in monitor mode to watch SPOOL_DIR for close_write and moved_to events
  • On each event, processes the file (e.g. wc -l and move to a done/ subdirectory)
  • Logs each action with a timestamp to a logfile in the spool directory
  • Handles the case where inotify-tools is not installed (print a clear error and exit 1)

#!/usr/bin/env bash
set -euo pipefail

SPOOL="${1:?usage: $0 SPOOL_DIR}"
LOG="$SPOOL/job_runner.log"
DONE="$SPOOL/done"

# Dependency check
if ! command -v inotifywait >/dev/null 2>&1; then
  echo "ERROR: inotifywait not found. Install with: apt install inotify-tools" >&2
  exit 1
fi

mkdir -p "$DONE"

log() { printf '[%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" | tee -a "$LOG"; }

process_job() {
  local file="$SPOOL/$1"
  [[ -f "$file" ]] || return

  local lines
  lines=$(wc -l < "$file")
  log "Processed $1: $lines lines"
  mv "$file" "$DONE/$1"
}

log "Watching $SPOOL for new jobs…"

inotifywait -m -q \
  --format '%f' \
  -e close_write,moved_to \
  "$SPOOL" | \
while IFS= read -r fname; do
  # Skip the log file and done directory itself
  [[ $fname == 'job_runner.log' || $fname == 'done' ]] && continue
  process_job "$fname"
done