📁 Topic 9 — Working with Files and Text
The shell's real power comes from combining simple text-processing tools into pipelines that transform data. This chapter covers reading and writing files safely, navigating the filesystem with find, and the essential Unix text tools — grep, sed, awk, cut, sort, uniq, wc, and tr — with an emphasis on the patterns you'll actually use in scripts every day.
1 — Reading Files
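The canonical, safe way to read a file line by line is a while IFS= read -r loop. IFS= preserves leading and trailing whitespace, -r keeps backslashes literal, and blank lines pass through unchanged. One caveat: if the last line has no trailing newline, read returns a non-zero status and the loop body skips that line unless you add || [[ -n "$line" ]] to the loop condition.
#!/bin/bash
# Canonical pattern: IFS= keeps whitespace, -r keeps backslashes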
while IFS= read -r line; do
echo "Line: $line"
done < "/path/to/file.txt"
# With line numbers
lineno=0
while IFS= read -r line; do
(( lineno++ ))
printf "%4d %s\n" "$lineno" "$line"
done < file.txt
# Skip blank lines and comments (lines starting with #)
while IFS= read -r line; do
[[ -z "$line" || "$line" == \#* ]] && continue
echo "$line"
done < config.txt
# Read two fields per line (e.g. "name score" format)
while read -r name score; do
printf "%-15s %d\n" "$name" "$score"
done < scores.txt
Avoid for line in $(cat file). It splits on every whitespace character (not just newlines), so any line containing spaces is broken into several words; it also expands glob characters such as * and is slower than a while read loop. Always use the while IFS= read -r pattern for processing files line by line.
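The read loop has one edge case worth seeing in action: a final line with no trailing newline. The sketch below uses a throwaway file from mktemp (the file and the counter names are illustrative only) and compares the plain loop with a guarded variant.

```shell
#!/bin/bash
# Demo file: the last line deliberately has no trailing newline
demo=$(mktemp)
printf 'first\nsecond\nlast-without-newline' > "$demo"

# Plain loop: read returns non-zero on the unterminated last line, so it is skipped
plain_count=0
while IFS= read -r line; do
  plain_count=$(( plain_count + 1 ))
done < "$demo"

# Guarded loop: the extra test keeps the partial last line
guarded_count=0
while IFS= read -r line || [[ -n "$line" ]]; do
  guarded_count=$(( guarded_count + 1 ))
done < "$demo"

echo "plain: $plain_count, guarded: $guarded_count"
rm -f "$demo"
```

The plain loop counts 2 lines and the guarded loop counts 3, so the guard is worth adding whenever the input may come from an editor or tool that omits the final newline.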
Reading a File into a Variable or Array
# Read entire file into a single variable
content=$(<file.txt) # faster than $(cat file.txt)
# Read all lines into an array (one element per line)
lines=()
while IFS= read -r line; do
lines+=( "$line" )
done < file.txt
# Or with mapfile / readarray (bash 4+, most concise)
mapfile -t lines < file.txt
# -t strips the trailing newline from each element
echo "Total lines: ${#lines[@]}"
echo "First line : ${lines[0]}"
echo "Last line : ${lines[-1]}"
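The two approaches treat trailing blank lines differently, which matters if you later write the content back out. A minimal sketch, assuming bash 4 or newer for mapfile and using a scratch file created only for the demonstration:

```shell
#!/bin/bash
demo=$(mktemp)
printf 'alpha\nbeta\n\n\n' > "$demo"     # two text lines followed by two blank lines

content=$(<"$demo")                      # command substitution drops ALL trailing newlines
mapfile -t lines < "$demo"               # mapfile keeps the blank lines as empty elements

echo "chars in content: ${#content}"
echo "array elements  : ${#lines[@]}"
rm -f "$demo"
```

Here content holds only "alpha", a newline, and "beta" (10 characters), while the array has 4 elements, the last two empty.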
2 — Writing Files
# Overwrite (create or truncate)
echo "Hello" > output.txt
# Append
echo "World" >> output.txt
# Write multiple lines with a here-document
cat > config.ini <<'EOF'
host=localhost
port=8080
debug=false
EOF
# Write with variable expansion in here-doc (no quotes on delimiter)
app_name="myapp"
version="1.0"
cat > version.txt <<EOF
Application: $app_name
Version : $version
Built : $(date '+%Y-%m-%d')
EOF
# Write stdout AND stderr to the same log file
logfile="app.log"
{
echo "Starting process..."
some_command
echo "Done."
} &> "$logfile"
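# Atomic write: write to a temp file first, then rename
# (prevents partial reads if another process opens the file mid-write)
# Create the temp file beside the target so both are on the same filesystem
tmpfile=$(mktemp "final_output.txt.XXXXXX")
generate_data > "$tmpfile"
mv "$tmpfile" "final_output.txt"
Write to a temporary file, then mv it into place. A rename on the same filesystem is atomic; a plain > redirect is not, so a reader could see a half-written file. A bare mktemp usually creates its file under /tmp, which may be a different filesystem, so give it a template next to the target.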
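A slightly fuller sketch of the same idea adds cleanup on failure. The names workdir, target, and generate_data are hypothetical stand-ins for your own paths and producer command.

```shell
#!/bin/bash
# Scratch directory so the demo does not touch real files
workdir=$(mktemp -d)
target="$workdir/final_output.txt"

generate_data() { printf 'line %d\n' 1 2 3; }   # stand-in for the real producer

# Create the temp file next to the target so the rename stays on one filesystem
tmpfile=$(mktemp "${target}.XXXXXX")
trap 'rm -f "$tmpfile"' EXIT            # remove the temp file if we exit before the rename

if generate_data > "$tmpfile"; then
  mv -f "$tmpfile" "$target"            # readers see the old file or the new one, never half
else
  echo "generation failed; $target left untouched" >&2
  exit 1
fi

written=$(wc -l < "$target")
echo "wrote $written lines"
rm -rf "$workdir"
```

Only a successful run replaces the target; a failed producer leaves the previous version in place and the trap removes the leftover temp file.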
3 — Finding Files with find
find is the standard tool for locating files by name, type, size, age, permissions, or any combination. It recursively traverses the directory tree and can execute actions on matching files.
# Find by name (case-sensitive)
find /var/log -name "*.log"
# Find by name (case-insensitive)
find . -iname "*.jpg"
# Find only files (not directories)
find . -type f -name "*.sh"
# Find only directories
find . -type d -name "config"
# Find files modified in the last 7 days
find . -type f -mtime -7
# Find files larger than 100 MB
find / -type f -size +100M
# Limit search depth (don't recurse deeper than 2 levels)
find . -maxdepth 2 -name "*.conf"
# Run a command on each found file (-exec ... {} \;)
find . -name "*.sh" -exec chmod +x {} \;
# Safer: use -print0 | xargs -0 to handle spaces in names
find . -name "*.log" -print0 | xargs -0 rm -f
# Delete empty directories
find . -type d -empty -delete
# Find and loop in bash (safest — handles all filenames)
while IFS= read -r -d '' file; do
echo "Processing: $file"
done < <(find . -name "*.txt" -print0)
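If you want the results in an array rather than a loop, mapfile can read the same null-delimited stream. This is a sketch that assumes bash 4.4 or newer (for mapfile -d ''); the directory and file names are created just for the demo.

```shell
#!/bin/bash
# Demo tree with an awkward name
root=$(mktemp -d)
touch "$root/plain.txt" "$root/with space.txt" "$root/skip.log"

# Read null-delimited paths straight into an array; -t drops the delimiter
mapfile -d '' -t txt_files < <(find "$root" -type f -name '*.txt' -print0)

echo "Found ${#txt_files[@]} .txt files"
for f in "${txt_files[@]}"; do
  printf '  %s\n' "${f##*/}"           # strip the directory part for display
done
rm -rf "$root"
```

On older bash versions, fall back to the while IFS= read -r -d '' loop and append to the array inside the body.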
The -print0 + read -d '' combination uses null bytes as the record separator, making it safe for filenames that contain spaces, newlines, or other special characters.
4 — Searching Text with grep
grep searches for lines matching a pattern. In scripts you use it both to filter output in pipelines and to test whether a match exists at all (via its exit code).
# Basic search — print matching lines
grep "error" app.log
# Case-insensitive
grep -i "error" app.log
# Show line numbers
grep -n "TODO" *.py
# Invert match — lines that do NOT match
grep -v "^#" config.txt # strip comment lines
# Count matching lines
grep -c "FAIL" results.txt
# Show only the matched part, not the whole line
grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" access.log # extract IPs
# Extended regex (no need to escape + ? | ( ) )
grep -E "^(ERROR|WARN)" app.log
# Recursive search in a directory tree
grep -r "password" /etc/
# Show N lines of context before/after the match
grep -A 3 "Exception" app.log # 3 lines after
grep -B 2 "Exception" app.log # 2 lines before
grep -C 2 "Exception" app.log # 2 lines either side
# Use exit code in a script (0 = found, 1 = not found, 2 = error)
if grep -q "CRITICAL" app.log; then # -q = quiet, no output
echo "Critical errors found!" >&2
exit 1
fi
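Because grep uses a separate exit status for errors, a script can tell "no match" apart from "could not search". The helper name check below is hypothetical, and the log file is a scratch file built for the example.

```shell
#!/bin/bash
log=$(mktemp)
printf 'INFO start\nERROR disk full\n' > "$log"

check() {
  # $1 = pattern, $2 = file; prints a label for grep's exit status
  local status=0
  grep -q -- "$1" "$2" 2>/dev/null || status=$?
  case $status in
    0) echo "found" ;;
    1) echo "not found" ;;
    *) echo "grep failed" ;;             # 2 or higher: unreadable file, bad regex, ...
  esac
}

r_found=$(check "ERROR" "$log")
r_missing=$(check "CRITICAL" "$log")
r_error=$(check "ERROR" "$log.does-not-exist")
echo "$r_found / $r_missing / $r_error"
rm -f "$log"
```

A plain if grep -q treats the error case the same as "not found", which can hide a mistyped path; checking the status explicitly avoids that.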
5 — Stream Editing with sed
sed (stream editor) processes text line by line, making substitutions, deletions, and other edits. The substitution command s/pattern/replacement/ is by far the most used.
# Replace first occurrence on each line
sed 's/foo/bar/' input.txt
# Replace ALL occurrences on each line (g = global)
sed 's/foo/bar/g' input.txt
# Case-insensitive replacement (the I flag is a GNU sed extension)
sed 's/error/ERROR/gI' app.log
# Edit in place (modify the file directly)
sed -i 's/localhost/192.168.1.1/g' config.ini
# -i.bak makes a backup: config.ini.bak
sed -i.bak 's/localhost/192.168.1.1/g' config.ini
# Delete lines matching a pattern
sed '/^#/d' config.txt # delete comment lines
sed '/^[[:space:]]*$/d' file.txt # delete blank lines
# Print only specific lines (suppress default output with -n)
sed -n '5p' file.txt # print line 5
sed -n '5,10p' file.txt # print lines 5–10
sed -n '/START/,/END/p' file.txt # print between markers
# Multiple expressions with -e
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# Use & to refer to the whole matched text (here: wrap every number in brackets)
echo "Price 42 for 5 items" | sed 's/[0-9]\+/[&]/g'
Price [42] for [5] items
# Strip leading and trailing whitespace
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' file.txt
On macOS and BSD, sed -i requires an explicit backup suffix argument, so an in-place edit without a backup is written sed -i '' 's/a/b/' file (empty string). On Linux (GNU sed), sed -i 's/a/b/' file works without a suffix. For portability in scripts, use -i.bak (creates a backup that you can then delete).
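The portable form can be wrapped so the backup is removed only after a successful edit. A minimal sketch on a scratch config file; the restore step in the else branch is an assumed recovery policy, not something sed does for you.

```shell
#!/bin/bash
cfg=$(mktemp)
printf 'host=localhost\nport=8080\n' > "$cfg"

# -i.bak (suffix attached, no space) is accepted by both GNU and BSD sed
if sed -i.bak 's/localhost/192.168.1.1/' "$cfg"; then
  rm -f "$cfg.bak"                      # edit succeeded, the backup is no longer needed
else
  [[ -f "$cfg.bak" ]] && mv -f "$cfg.bak" "$cfg"   # restore the original if a backup exists
fi

new_host=$(head -n 1 "$cfg")
echo "$new_host"
rm -f "$cfg"
```

The first line of the file now reads host=192.168.1.1 and no .bak file is left behind.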
6 — Field Processing with awk
awk splits each input line into fields and lets you apply rules to each line. It's ideal for columnar data: log files, CSV, /etc/passwd, command output.
# Built-in variables:
# $0 — entire line $1 $2 ... — individual fields
# NR — current line number NF — number of fields on this line
# FS — field separator OFS — output field separator
# Print specific fields (default delimiter: any whitespace)
awk '{print $1, $3}' data.txt
# Print last field
awk '{print $NF}' data.txt
# Use a custom field separator
awk -F: '{print $1, $3}' /etc/passwd # username and UID
awk -F, '{print $2}' data.csv
# Print lines where a field matches a pattern
awk '/ERROR/ {print NR, $0}' app.log
awk '$3 > 100 {print $1, $3}' scores.txt # numeric comparison
# Sum a column
awk '{sum += $2} END {print "Total:", sum}' sales.txt
# Count lines matching a pattern
awk '/FAIL/ {count++} END {print count + 0, "failures"}' results.txt # + 0 prints 0 instead of an empty count
# BEGIN and END blocks run before/after all input
awk 'BEGIN {print "Name", "Score"} {print $1, $2} END {print "Done"}' scores.txt
# Reformatting: change delimiter in output
awk -F, 'BEGIN {OFS="|"} {print $1, $2, $3}' data.csv
# Capture awk output in a variable
total=$(awk '{sum += $1} END {print sum}' numbers.txt)
echo "Total: $total"
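To use a shell value inside an awk program, pass it with -v rather than splicing it into the quoted program text. The sketch below uses a made-up scores file and a hypothetical pass mark of 50.

```shell
#!/bin/bash
scores=$(mktemp)
printf 'alice 82\nbob 47\ncarol 91\n' > "$scores"

threshold=50                             # hypothetical pass mark

# -v copies the shell value into an awk variable; the awk program stays single-quoted
passed=$(awk -v min="$threshold" '$2 >= min {n++} END {print n + 0}' "$scores")

# NR in the END block is the total number of lines read
average=$(awk '{sum += $2} END {if (NR) printf "%.1f", sum / NR}' "$scores")

echo "passed: $passed, average: $average"
rm -f "$scores"
```

This prints passed: 2, average: 73.3. Keeping the program in single quotes means the shell never expands $2 by accident.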
7 — The Supporting Cast: cut, sort, uniq, wc, tr
cut — extract columns
# Cut by delimiter and field number
cut -d: -f1 /etc/passwd # usernames
cut -d, -f2,4 data.csv # columns 2 and 4
cut -d, -f2- data.csv # columns 2 to end
# Cut by character position
cut -c1-8 timestamps.txt # first 8 characters
cut -c9- timestamps.txt # from character 9 to end
sort — order lines
sort names.txt # alphabetical
sort -r names.txt # reverse alphabetical
sort -n numbers.txt # numeric
sort -nr numbers.txt # numeric descending (largest first)
sort -u names.txt # sort and remove duplicates
sort -t, -k2,2n data.csv # sort CSV by 2nd column numerically
sort -k1,1 -k2,2n data.txt # primary sort col 1, secondary col 2
sort -h sizes.txt # human-numeric (10K before 2M)
uniq — remove adjacent duplicate lines
sort items.txt | uniq # remove duplicates
sort items.txt | uniq -c # prefix each line with its count
sort items.txt | uniq -d # print only lines that appeared more than once
sort items.txt | uniq -u # print only lines that appeared exactly once
# Top 10 most frequent lines
sort access.log | uniq -c | sort -rn | head -10
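The reason every example pipes through sort first is that uniq only compares neighbouring lines. A small sketch with an invented item list shows the difference:

```shell
#!/bin/bash
items=$(mktemp)
printf 'apple\nbanana\napple\napple\nbanana\n' > "$items"

unsorted=$(uniq "$items" | wc -l)        # only collapses the adjacent "apple apple" pair
sorted=$(sort "$items" | uniq | wc -l)   # sorting first brings all duplicates together

echo "uniq alone     : $unsorted lines"
echo "sort then uniq : $sorted lines"
rm -f "$items"
```

uniq alone leaves 4 lines (apple, banana, apple, banana), while sort followed by uniq leaves the expected 2.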
wc — count lines, words, characters
wc -l file.txt # number of lines
wc -w file.txt # number of words
wc -c file.txt # number of bytes
wc -m file.txt # number of characters (multi-byte aware)
# Capture line count cleanly in a variable
count=$(wc -l < file.txt) # redirect avoids filename in output
echo "Lines: $count"
tr — translate or delete characters
# Convert to uppercase / lowercase
echo "hello world" | tr '[:lower:]' '[:upper:]'
HELLO WORLD
echo "HELLO" | tr 'A-Z' 'a-z'
hello
# Delete specific characters
echo "h3ll0 w0rld" | tr -d '0-9'
hll wrld
# Squeeze repeated characters (e.g. collapse multiple spaces)
echo "too    many    spaces" | tr -s ' '
too many spaces
# Replace colons with newlines (e.g. expand PATH for readability)
echo "$PATH" | tr ':' '\n'
# Remove Windows carriage returns from a file
tr -d '\r' < windows.txt > unix.txt
8 — Building Pipelines
The real power is combining these tools. Each tool does one job well; the pipe | connects them into a transformation chain.
# Top 5 client IP addresses in an access log (IP is the first field)
awk '{print $1}' access.log \
| sort \
| uniq -c \
| sort -rn \
| head -5
523 192.168.1.105
311 10.0.0.22
198 172.16.0.4
145 192.168.1.200
89 10.0.0.1
# Top 10 source addresses of failed SSH logins
grep "Failed password" /var/log/auth.log \
| awk '{print $(NF-3)}' \
| sort | uniq -c | sort -rn \
| head -10
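# Input: date,region,amount e.g. 2026-01-05,North,1500
# Skip the header, total per region, sort on the plain number, then add the currency symbol
tail -n +2 sales.csv \
| awk -F, '{region[$2] += $3} END {for (r in region) print r, region[r]}' \
| sort -k2,2rn \
| awk '{printf "%-10s £%d\n", $1, $2}'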
South £48200
North £35700
East £29100
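One thing to keep in mind when a script depends on a pipeline succeeding: by default bash reports only the exit status of the last stage. The sketch below uses false and true as stand-ins for real commands to show the default behaviour, the PIPESTATUS array, and the pipefail option.

```shell
#!/bin/bash
# By default a pipeline's exit status is that of its LAST command
false | cat
default_status=$?

# PIPESTATUS holds the status of every stage of the most recent pipeline
false | true | cat
stages=( "${PIPESTATUS[@]}" )

# With pipefail, the pipeline fails if any stage fails
pipefail_status=0
set -o pipefail
false | cat || pipefail_status=$?
set +o pipefail

echo "default: $default_status, stages: ${stages[*]}, pipefail: $pipefail_status"
```

This prints default: 0, stages: 1 0 0, pipefail: 1. Copy PIPESTATUS into your own array immediately, because the next command overwrites it.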
tee — split a pipeline to a file and stdout
# Log pipeline output to a file while still printing to screen
some_command | tee output.log | grep "ERROR"
# Append with -a
some_command | tee -a logfile.log >/dev/null # log only, suppress screen
9 — Process Substitution
Process substitution — <(command) — lets you feed the output of a command to another command that expects a filename. It's the clean way to use diff, while read, and other tools with live command output.
# diff two commands' output without temp files
diff <(sort file1.txt) <(sort file2.txt)
# Read the output of a command safely in a while loop
# (a plain pipe would run the loop body in a subshell)
while IFS= read -r line; do
echo "$line"
done < <(grep "ERROR" app.log)
# Compare sorted lists from two directories
diff <(ls dir1/) <(ls dir2/)
# Write to a process (less common)
tee >(gzip > backup.gz) > plain_copy.txt < source.txt
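The subshell difference is easiest to see with a counter. This sketch builds a small scratch log and counts ERROR lines both ways; it assumes bash's default settings (the lastpipe option is off).

```shell
#!/bin/bash
log=$(mktemp)
printf 'ERROR one\nINFO two\nERROR three\n' > "$log"

# Pipe: the loop runs in a subshell, so the increments are lost when it ends
piped=0
grep "ERROR" "$log" | while IFS= read -r line; do
  piped=$(( piped + 1 ))
done

# Process substitution: the loop runs in the current shell, so the count survives
substituted=0
while IFS= read -r line; do
  substituted=$(( substituted + 1 ))
done < <(grep "ERROR" "$log")

echo "after pipe: $piped, after process substitution: $substituted"
rm -f "$log"
```

This prints after pipe: 0, after process substitution: 2. Note the space between the two < characters: the first is an ordinary redirect, the second begins the <(...) substitution.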
Prefer done < <(command) over piping into a while loop: variables set inside the loop body remain visible after the loop ends, because <() runs the command in a separate process but keeps the while loop in the current shell.
10 — Quick Reference
| Tool / Pattern | What it does | Key flags |
|---|---|---|
| while IFS= read -r line; do ... done < file | Safe line-by-line file reading | -r no backslash processing |
| mapfile -t arr < file | Read all lines into an array | -t strips trailing newline |
| content=$(<file) | Slurp whole file into variable | (none) |
| find dir -name "*.ext" -type f | Locate files recursively | -mtime -7, -size +100M, -exec, -print0 |
| grep -E "pattern" file | Print matching lines | -i case-insensitive, -v invert, -q silent, -c count, -n line numbers |
| sed 's/old/new/g' file | Stream substitution | -i in-place, -n suppress output, /d delete lines |
| awk -F: '{print $1}' file | Field extraction / processing | NR line no., NF field count, BEGIN/END |
| cut -d, -f2 file | Extract columns by delimiter | -c character positions |
| sort -n -k2 file | Sort lines | -r reverse, -u unique, -h human sizes |
| uniq -c | Remove adjacent duplicates / count | -d duplicates only, -u unique only |
| wc -l < file | Count lines (words, bytes) | -w words, -c bytes, -m chars |
| tr 'a-z' 'A-Z' | Translate characters | -d delete, -s squeeze repeats |
| tee file | Copy stdin to file and stdout | -a append |
| diff <(cmd1) <(cmd2) | Compare command output with process substitution | (none) |
✏️ Exercises
Apply what you have learned. Try writing the script yourself before looking at the sample solution.
Write a script log_report.sh that accepts a log file as its first argument and prints: (a) total number of lines, (b) number of lines containing ERROR, (c) number of lines containing WARN, and (d) the 5 most frequent words in ERROR lines, with their counts.
Hint: use wc -l < file for counts, grep -c for pattern counts, and grep "ERROR" | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -5 for frequent words.
Sample solution:
#!/bin/bash
# log_report.sh — usage: ./log_report.sh app.log
logfile="${1:?Usage: $0 <logfile>}"
[[ -f "$logfile" ]] || { echo "File not found: $logfile" >&2; exit 1; }
total=$(wc -l < "$logfile")
# grep -c already prints 0 when nothing matches (and exits 1), so "|| echo 0" would emit a second 0
errors=$(grep -c "ERROR" "$logfile" || true)
warns=$(grep -c "WARN" "$logfile" || true)
printf "Log file : %s\n" "$logfile"
printf "Total : %d lines\n" "$total"
printf "ERROR : %d lines\n" "$errors"
printf "WARN : %d lines\n" "$warns"
echo
echo "Top 5 words in ERROR lines:"
grep "ERROR" "$logfile" \
| tr -s '[:space:]' '\n' \
| tr '[:upper:]' '[:lower:]' \
| grep -v '^$' \
| sort | uniq -c | sort -rn | head -5 \
| awk '{printf " %4d %s\n", $1, $2}'
Write a script csv_filter.sh that reads a CSV file (with a header row), takes a column number and a search term as arguments, and prints all rows where that column matches the term. Also print the header. Example: ./csv_filter.sh sales.csv 2 North prints all rows where column 2 is "North".
Hint: use head -1 to print the header, then tail -n +2 to skip it and pipe to awk -F, with a condition on $colnum.
Sample solution:
#!/bin/bash
# csv_filter.sh — usage: ./csv_filter.sh file.csv COLUMN TERM
file="${1:?Usage: $0 <file.csv> <column> <term>}"
col="${2:?column number required}"
term="${3:?search term required}"
[[ -f "$file" ]] || { echo "File not found: $file" >&2; exit 1; }
# Print header
head -1 "$file"
# Filter rows
tail -n +2 "$file" | awk -F, -v c="$col" -v t="$term" '$c == t'
Write a script find_large.sh that accepts a directory and a size threshold (in MB) as arguments, finds all files larger than that threshold, and outputs a formatted table showing the filename (base name only) and size in MB, sorted largest first. At the end, print the total size of all matched files.
Hint: use find dir -type f -size +NNM to select the files. Pipe to du -m or use stat to get sizes. Sort with sort -rn and sum the sizes as you go.
Sample solution:
#!/bin/bash
# find_large.sh — usage: ./find_large.sh /path/to/dir SIZE_MB
dir="${1:?Usage: $0 <directory> <size_MB>}"
threshold="${2:?size threshold in MB required}"
[[ -d "$dir" ]] || { echo "Not a directory: $dir" >&2; exit 1; }
printf "\nFiles larger than %dMB in %s:\n\n" "$threshold" "$dir"
printf "%-40s %8s\n" "Filename" "Size(MB)"
printf '%.0s─' {1..50}; echo
total=0
found=0
while IFS= read -r -d '' filepath; do
size_bytes=$(stat --format='%s' "$filepath" 2>/dev/null)
[[ -z "$size_bytes" ]] && continue
size_mb=$(echo "scale=1; $size_bytes / 1048576" | bc)
total=$(echo "$total + $size_mb" | bc)
(( found++ ))
printf "%-40s %8.1f\n" "$(basename "$filepath")" "$size_mb"
done < <(find "$dir" -type f -size +"${threshold}"M -print0 \
| xargs -0 -I{} stat --format='%s %n' {} 2>/dev/null \
| sort -rn \
| cut -d' ' -f2- \
| tr '\n' '\0')
printf '%.0s─' {1..50}; echo
printf "%-40s %8.1f MB (%d files)\n" "TOTAL" "$total" "$found"
Write a script replace_in_files.sh that accepts three arguments: a directory, a search string, and a replacement string. It should find all .txt files in that directory tree containing the search string, report how many files were found, show a preview of the first match in each file, and then (after confirmation) perform the replacement in all files using sed -i. Make a .bak backup of each file before modifying it.
Hint: use grep -rl to find files containing the pattern. Use grep -m1 for a single-line preview. Use read -r -p "Proceed? [y/N]" for confirmation. Use sed -i.bak for in-place replacement with backup.
Sample solution:
#!/bin/bash
# replace_in_files.sh — usage: ./replace_in_files.sh DIR SEARCH REPLACE
dir="${1:?Usage: $0 <dir> <search> <replace>}"
search="${2:?search string required}"
replace="${3?replacement string required}" # note: allows empty string
[[ -d "$dir" ]] || { echo "Not a directory: $dir" >&2; exit 1; }
# Find matching files
matches=()
while IFS= read -r -d '' f; do
matches+=( "$f" )
done < <(grep -rl --include='*.txt' -Z -- "$search" "$dir")
if [[ "${#matches[@]}" -eq 0 ]]; then
echo "No files found containing: $search"
exit 0
fi
printf "Found %d file(s) containing '%s':\n\n" "${#matches[@]}" "$search"
for f in "${matches[@]}"; do
preview=$(grep -m1 -n "$search" "$f")
printf " %s\n ↳ %s\n" "$f" "$preview"
done
echo
read -r -p "Replace '$search' → '$replace' in all files? [y/N] " confirm
[[ "$confirm" != [yY] ]] && { echo "Aborted."; exit 0; }
for f in "${matches[@]}"; do
# Note: search is treated as a regex; a | or & in either string would need escaping first
sed -i.bak "s|${search}|${replace}|g" "$f"
printf " ✓ Updated: %s (backup: %s.bak)\n" "$f" "$f"
done
echo "Done."