Working with Files and Text

📁 Topic 9 — Working with Files and Text

The shell's real power comes from combining simple text-processing tools into pipelines that transform data. This chapter covers reading and writing files safely, navigating the filesystem with find, and the essential Unix text tools — grep, sed, awk, cut, sort, uniq, wc, and tr — with an emphasis on the patterns you'll actually use in scripts every day.

1 — Reading Files

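The canonical, safe way to read a file line by line is a while IFS= read -r loop. It preserves blank lines, leading and trailing whitespace, and backslashes exactly as they appear in the file. One caveat: if the last line has no trailing newline, read returns non-zero and the loop skips that line unless you add || [[ -n "$line" ]] to the loop condition.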

🐧 Reading a file line by line

#!/bin/bash

# Canonical pattern: preserves whitespace and backslashes in every line
while IFS= read -r line; do
    echo "Line: $line"
done < "/path/to/file.txt"

# With line numbers
lineno=0
while IFS= read -r line; do
    (( lineno++ ))
    printf "%4d  %s\n" "$lineno" "$line"
done < file.txt

# Skip blank lines and comments (lines starting with #)
while IFS= read -r line; do
    [[ -z "$line" || "$line" == \#* ]] && continue
    echo "$line"
done < config.txt

# Read two fields per line (e.g. "name score" format)
while read -r name score; do
    printf "%-15s %d\n" "$name" "$score"
done < scores.txt

IFS= prevents leading/trailing whitespace being stripped from each line. -r prevents backslash escape sequences being interpreted. Both are almost always what you want.
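
The effect of each option is easiest to see on a small input. The sketch below is only an illustration: it builds a throwaway file with mktemp (the file and its contents are made up for the demo) whose first line has leading spaces and backslashes and whose last line lacks a trailing newline, then reads it with and without the options. It also shows the || [[ -n "$line" ]] guard that picks up an unterminated final line.

```bash
#!/bin/bash
# Demo file: line 1 has leading spaces and backslashes; line 2 has no trailing newline
demo=$(mktemp)
printf '  C:\\temp\\new\nlast line' > "$demo"

# Plain read: trims leading whitespace and consumes the backslashes
read line_plain < "$demo"

# IFS= read -r: keeps the line exactly as written
IFS= read -r line_raw < "$demo"

printf 'plain: [%s]\n' "$line_plain"
printf 'raw  : [%s]\n' "$line_raw"

# The test after || lets the loop body run for a final line with no newline
count=0
while IFS= read -r line || [[ -n "$line" ]]; do
    count=$(( count + 1 ))
done < "$demo"
echo "lines seen: $count"

rm -f "$demo"
```

With this input the plain read yields C:tempnew, the raw read keeps the two leading spaces and both backslashes, and the guarded loop sees 2 lines.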

⚠️ Don't do: for line in $(cat file)
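This splits on every whitespace character (not just newlines), so a line containing spaces arrives as several separate words. The unquoted expansion also undergoes filename globbing, and the whole file is read into memory before the loop starts. Always use the while IFS= read -r pattern for processing files line by line.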
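
A quick way to convince yourself is to count iterations. The following sketch uses a made-up two-line file and assumes the default IFS; it compares how many times each loop body runs.

```bash
#!/bin/bash
# Demo file with two lines, each containing a space
demo=$(mktemp)
printf 'first line\nsecond line\n' > "$demo"

for_count=0
for item in $(cat "$demo"); do        # word splitting: one iteration per word
    for_count=$(( for_count + 1 ))
done

while_count=0
while IFS= read -r line; do           # one iteration per line
    while_count=$(( while_count + 1 ))
done < "$demo"

echo "for loop saw $for_count items; while loop saw $while_count lines"
rm -f "$demo"
```

The for loop runs four times (once per word) while the while loop runs twice (once per line).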

Reading a File into a Variable or Array

🐧 Slurp an entire file or split into an array

# Read entire file into a single variable
content=$(<file.txt)         # faster than $(cat file.txt)

# Read all lines into an array (one element per line)
lines=()
while IFS= read -r line; do
    lines+=( "$line" )
done < file.txt

# Or with mapfile / readarray (bash 4+, most concise)
mapfile -t lines < file.txt
# -t strips the trailing newline from each element

echo "Total lines: ${#lines[@]}"
echo "First line : ${lines[0]}"
echo "Last line  : ${lines[-1]}"
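
The two approaches treat blank lines differently, which matters if you later compare lengths or counts. As a small sketch on a made-up file: command substitution drops all trailing newlines, while mapfile keeps every line, including empty ones, as its own element.

```bash
#!/bin/bash
# Demo file: "alpha", a blank line, "beta", then two more blank lines
demo=$(mktemp)
printf 'alpha\n\nbeta\n\n\n' > "$demo"

content=$(<"$demo")            # trailing newlines are removed by command substitution
mapfile -t lines < "$demo"     # every line is kept, including the blank ones

echo "content length: ${#content}"
echo "array elements: ${#lines[@]}"
rm -f "$demo"
```

Here content holds 11 characters (the interior blank line survives, the trailing ones do not) and the array holds 5 elements.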

2 — Writing Files

🐧 Output redirection patterns

# Overwrite (create or truncate)
echo "Hello" > output.txt

# Append
echo "World" >> output.txt

# Write multiple lines with a here-document
cat > config.ini <<'EOF'
host=localhost
port=8080
debug=false
EOF

# Write with variable expansion in here-doc (no quotes on delimiter)
app_name="myapp"
version="1.0"
cat > version.txt <<EOF
Application: $app_name
Version    : $version
Built      : $(date '+%Y-%m-%d')
EOF

# Write stdout AND stderr to the same log file
logfile="app.log"
{
    echo "Starting process..."
    some_command
    echo "Done."
} &> "$logfile"

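# Atomic write: write to a temp file first, then rename
# (prevents partial reads if another process opens the file mid-write)
# The temp file is created next to the target so that mv is a same-filesystem rename
tmpfile=$(mktemp final_output.txt.XXXXXX)
generate_data > "$tmpfile"
mv "$tmpfile" "final_output.txt"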

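Tip (atomic writes with mktemp): When a script updates a file that another process might be reading (like a status file or config), always write to a temporary file first and then mv it into place. A rename on the same filesystem is atomic; a plain > redirect is not, so a reader could see a half-written file. Create the temporary file in the same directory as the target: mktemp with no arguments typically uses /tmp, which may be a different filesystem, in which case mv falls back to a non-atomic copy.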
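
A slightly fuller sketch of the same idea follows. The names workdir, target, and generate_data are hypothetical stand-ins, and the trap and chmod lines are optional extras rather than part of the pattern itself.

```bash
#!/bin/bash
# Hypothetical names for illustration: workdir, target, generate_data
workdir=$(mktemp -d)
target="$workdir/status.txt"
generate_data() { printf 'status=ok\n'; }

# Create the temp file in the target's directory so mv is a same-filesystem rename
tmpfile=$(mktemp "$target.XXXXXX") || exit 1
trap 'rm -f "$tmpfile"' EXIT      # clean up if the script stops before the rename

generate_data > "$tmpfile"
chmod 644 "$tmpfile"              # mktemp creates files readable by the owner only
mv "$tmpfile" "$target"

cat "$target"
```

Readers of status.txt see either the old contents or the complete new contents, never a partial write. The demo directory is left in place so the result can be inspected.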

3 — Finding Files with `find`

find is the standard tool for locating files by name, type, size, age, permissions, or any combination. It recursively traverses the directory tree and can execute actions on matching files.

🐧 find — common patterns

# Find by name (case-sensitive)
find /var/log -name "*.log"

# Find by name (case-insensitive)
find . -iname "*.jpg"

# Find only files (not directories)
find . -type f -name "*.sh"

# Find only directories
find . -type d -name "config"

# Find files modified in the last 7 days
find . -type f -mtime -7

# Find files larger than 100 MB
find / -type f -size +100M

# Limit search depth (don't recurse deeper than 2 levels)
find . -maxdepth 2 -name "*.conf"

# Run a command on each found file (-exec ... {} \;)
find . -name "*.sh" -exec chmod +x {} \;

# Safer: use -print0 | xargs -0 to handle spaces in names
find . -name "*.log" -print0 | xargs -0 rm -f

# Delete empty directories
find . -type d -empty -delete

# Find and loop in bash (safest — handles all filenames)
while IFS= read -r -d '' file; do
    echo "Processing: $file"
done < <(find . -name "*.txt" -print0)

The -print0 + read -d '' combination uses null bytes as the record separator, making it safe for filenames that contain spaces, newlines, or other special characters.
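
To see the pattern cope with awkward names, the sketch below creates three made-up files in a scratch directory, one with a space and one with an embedded newline, and collects them into an array.

```bash
#!/bin/bash
# Scratch directory with deliberately awkward filenames (all made up for the demo)
workdir=$(mktemp -d)
touch "$workdir/plain.txt" "$workdir/with space.txt" "$workdir/"$'new\nline.txt'

files=()
while IFS= read -r -d '' file; do
    files+=( "$file" )
done < <(find "$workdir" -name "*.txt" -print0)

echo "found ${#files[@]} files"
rm -rf "$workdir"
```

All three files arrive as exactly three array elements; a newline-delimited loop would have split the third name in two.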

4 — Searching Text with `grep`

grep searches for lines matching a pattern. In scripts you use it both to filter output in pipelines and to test whether a match exists at all (via its exit code).

🐧 grep patterns

# Basic search — print matching lines
grep "error" app.log

# Case-insensitive
grep -i "error" app.log

# Show line numbers
grep -n "TODO" *.py

# Invert match — lines that do NOT match
grep -v "^#" config.txt          # strip comment lines

# Count matching lines
grep -c "FAIL" results.txt

# Show only the matched part, not the whole line
grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" access.log  # extract IPs

# Extended regex (no need to escape + ? | ( ) )
grep -E "^(ERROR|WARN)" app.log

# Recursive search in a directory tree
grep -r "password" /etc/

# Show N lines of context before/after the match
grep -A 3 "Exception" app.log    # 3 lines after
grep -B 2 "Exception" app.log    # 2 lines before
grep -C 2 "Exception" app.log    # 2 lines either side

# Use exit code in a script (0 = found, 1 = not found)
if grep -q "CRITICAL" app.log; then   # -q = quiet, no output
    echo "Critical errors found!" >&2
    exit 1
fi
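
grep's exit status distinguishes three cases, which is worth knowing when "not found" and "could not read the file" need different handling. A minimal sketch on a made-up log file (the status for errors is 2 with GNU grep; POSIX only promises a value greater than 1):

```bash
#!/bin/bash
demo=$(mktemp)
printf 'INFO start\nERROR disk full\n' > "$demo"

# status=0; cmd || status=$?  records the exit code without tripping set -e
found_status=0;   grep -q "ERROR" "$demo"    || found_status=$?     # match found
missing_status=0; grep -q "CRITICAL" "$demo" || missing_status=$?   # no match
error_status=0;   grep -q "ERROR" "$demo.missing" 2>/dev/null || error_status=$?   # unreadable file

echo "found=$found_status missing=$missing_status error=$error_status"
rm -f "$demo"
```

The first two values are 0 and 1; the third is greater than 1 because grep could not open the file.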

5 — Stream Editing with `sed`

sed (stream editor) processes text line by line, making substitutions, deletions, and other edits. The substitution command s/pattern/replacement/ is by far the most used.

🐧 sed — substitution and deletion

# Replace first occurrence on each line
sed 's/foo/bar/' input.txt

# Replace ALL occurrences on each line (g = global)
sed 's/foo/bar/g' input.txt

# Case-insensitive replacement (the I flag is not in POSIX; GNU sed supports it)
sed 's/error/ERROR/gI' app.log

# Edit in place (modify the file directly)
sed -i 's/localhost/192.168.1.1/g' config.ini
# -i.bak makes a backup: config.ini.bak
sed -i.bak 's/localhost/192.168.1.1/g' config.ini

# Delete lines matching a pattern
sed '/^#/d' config.txt             # delete comment lines
sed '/^[[:space:]]*$/d' file.txt   # delete blank lines

# Print only specific lines (suppress default output with -n)
sed -n '5p' file.txt               # print line 5
sed -n '5,10p' file.txt            # print lines 5–10
sed -n '/START/,/END/p' file.txt   # print between markers

# Multiple expressions with -e
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Use & to refer to the whole matched text
sed 's/[0-9]\+/[&]/g' file.txt    # wrap every number in brackets
# e.g. "Price 42 for 5 items" becomes "Price [42] for [5] items"

# Strip leading and trailing whitespace
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' file.txt

⚠️ macOS sed vs GNU sed: On macOS, -i requires an explicit backup suffix — sed -i '' 's/a/b/' file (empty string). On Linux, sed -i 's/a/b/' file works without a suffix. For portability in scripts, use -i.bak (creates a backup that you can then delete).
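
As a sketch of that portable form, the following edits a made-up config file in a scratch directory and removes the backup only if sed succeeded. The paths and contents are assumptions for the demo.

```bash
#!/bin/bash
# Hypothetical demo file; the path and contents are made up for illustration
workdir=$(mktemp -d)
conf="$workdir/config.ini"
printf 'host=localhost\nport=8080\n' > "$conf"

# The attached-suffix form (-i.bak, no space) is accepted by GNU and BSD/macOS sed
sed -i.bak 's/localhost/192.168.1.1/g' "$conf" && rm -f "$conf.bak"

first_line=$(head -n 1 "$conf")
echo "$first_line"
rm -rf "$workdir"
```

Chaining rm with && means the backup is kept whenever sed fails, so the original is still recoverable.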

6 — Field Processing with `awk`

awk splits each input line into fields and lets you apply rules to each line. It's ideal for columnar data: log files, CSV, /etc/passwd, command output.

🐧 awk — essential patterns

# Built-in variables:
#   $0  — entire line        $1 $2 ... — individual fields
#   NR  — current line number NF — number of fields on this line
#   FS  — field separator    OFS — output field separator

# Print specific fields (default delimiter: any whitespace)
awk '{print $1, $3}' data.txt

# Print last field
awk '{print $NF}' data.txt

# Use a custom field separator
awk -F: '{print $1, $3}' /etc/passwd     # username and UID
awk -F, '{print $2}' data.csv

# Print lines where a field matches a pattern
awk '/ERROR/ {print NR, $0}' app.log
awk '$2 > 100 {print $1, $2}' scores.txt   # numeric comparison on the score field

# Sum a column
awk '{sum += $2} END {print "Total:", sum}' sales.txt

# Count lines matching a pattern
awk '/FAIL/ {count++} END {print count+0 " failures"}' results.txt   # +0 prints 0 when nothing matched

# BEGIN and END blocks run before/after all input
awk 'BEGIN {print "Name", "Score"} {print $1, $2} END {print "Done"}' scores.txt

# Reformatting: change delimiter in output
awk -F, 'BEGIN {OFS="|"} {print $1, $2, $3}' data.csv

# Capture awk output in a variable
total=$(awk '{sum += $1} END {print sum}' numbers.txt)
echo "Total: $total"
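
Scripts often need to pass a shell value into an awk program. One common approach, sketched here on made-up "name score" data, is awk's -v option, which keeps the awk program single-quoted and avoids splicing shell variables into it. The threshold value is an assumption for the demo.

```bash
#!/bin/bash
# Made-up "name score" data
demo=$(mktemp)
printf 'alice 82\nbob 45\ncarol 91\ndave 66\n' > "$demo"

threshold=60   # hypothetical pass mark

# -v copies the shell value into an awk variable named min
passed=$(awk -v min="$threshold" '$2 >= min {n++} END {print n+0}' "$demo")

# Guard against dividing by zero on empty input
average=$(awk '{sum += $2} END {if (NR) printf "%.1f", sum/NR}' "$demo")

echo "passed: $passed, average: $average"
rm -f "$demo"
```

Three of the four scores meet the threshold, and the average of 284 over 4 rows is 71.0.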

7 — The Supporting Cast: `cut`, `sort`, `uniq`, `wc`, `tr`

cut — extract columns

🐧 cut

# Cut by delimiter and field number
cut -d: -f1 /etc/passwd             # usernames
cut -d, -f2,4 data.csv             # columns 2 and 4
cut -d, -f2- data.csv              # columns 2 to end

# Cut by character position
cut -c1-8 timestamps.txt           # first 8 characters
cut -c9- timestamps.txt            # from character 9 to end

sort — order lines

🐧 sort

sort names.txt                     # alphabetical
sort -r names.txt                  # reverse alphabetical
sort -n numbers.txt                # numeric
sort -nr numbers.txt               # numeric descending (largest first)
sort -u names.txt                  # sort and remove duplicates
sort -t, -k2,2n data.csv          # sort CSV by 2nd column numerically
sort -k1,1 -k2,2n data.txt        # primary sort col 1, secondary col 2
sort -h sizes.txt                  # human-numeric (10K before 2M)

uniq — remove adjacent duplicate lines

🐧 uniq (input must be sorted first)

sort items.txt | uniq             # remove duplicates
sort items.txt | uniq -c          # prefix each line with its count
sort items.txt | uniq -d          # print only lines that appeared more than once
sort items.txt | uniq -u          # print only lines that appeared exactly once

# Top 10 most frequent lines
sort access.log | uniq -c | sort -rn | head -10
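
The "adjacent" part is what catches people out. This sketch, using a made-up list, shows uniq with and without a preceding sort.

```bash
#!/bin/bash
demo=$(mktemp)
printf 'apple\nbanana\napple\napple\nbanana\n' > "$demo"

unsorted=$(uniq "$demo" | wc -l)         # only the adjacent "apple" pair collapses
sorted=$(sort "$demo" | uniq | wc -l)    # every duplicate collapses
top=$(sort "$demo" | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "without sort: $unsorted lines; with sort: $sorted lines; most frequent: $top"
rm -f "$demo"
```

Without sort, four lines survive; with sort, two do, and apple (three occurrences) comes out on top.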

wc — count lines, words, characters

🐧 wc

wc -l file.txt                     # number of lines
wc -w file.txt                     # number of words
wc -c file.txt                     # number of bytes
wc -m file.txt                     # number of characters (multi-byte aware)

# Capture line count cleanly in a variable
count=$(wc -l < file.txt)           # redirect avoids filename in output
echo "Lines: $count"

tr — translate or delete characters

🐧 tr (reads from stdin only)

# Convert to uppercase / lowercase
echo "hello world" | tr '[:lower:]' '[:upper:]'
HELLO WORLD
echo "HELLO" | tr 'A-Z' 'a-z'
hello

# Delete specific characters
echo "h3ll0 w0rld" | tr -d '0-9'
hll wrld

# Squeeze repeated characters (e.g. collapse multiple spaces)
echo "too   many   spaces" | tr -s ' '
too many spaces

# Replace colons with newlines (e.g. expand PATH for readability)
echo "$PATH" | tr ':' '\n'

# Remove Windows carriage returns from a file
tr -d '\r' < windows.txt > unix.txt
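
To check that a carriage-return cleanup did what you expected, comparing byte counts is a simple test. A sketch on a made-up two-line file with Windows line endings:

```bash
#!/bin/bash
demo=$(mktemp)
printf 'one\r\ntwo\r\n' > "$demo"

bytes_before=$(wc -c < "$demo")
bytes_after=$(tr -d '\r' < "$demo" | wc -c)

echo "before: $bytes_before bytes, after: $bytes_after bytes"
rm -f "$demo"
```

The count drops from 10 to 8, one byte per line.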

8 — Building Pipelines

The real power is combining these tools. Each tool does one job well; the pipe | connects them into a transformation chain.

Pipeline: top 5 IP addresses in an Apache log

awk '{print $1}' access.log \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -5
    523 192.168.1.105
    311 10.0.0.22
    198 172.16.0.4
    145 192.168.1.200
     89 10.0.0.1

Pipeline: extract failed logins from auth.log

grep "Failed password" /var/log/auth.log \
  | awk '{print $(NF-3)}' \
  | sort | uniq -c | sort -rn \
  | head -10

Pipeline: CSV summary — total sales per region

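# Input: date,region,amount  e.g.  2026-01-05,North,1500
# tail -n +2 skips the header; sorting happens before the £ is added
# so that sort sees plain numbers
tail -n +2 sales.csv \
  | awk -F, '{region[$2] += $3}
         END {for (r in region) print r, region[r]}' \
  | sort -k2,2rn \
  | awk '{printf "%-10s £%d\n", $1, $2}'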

South      £48200
North      £35700
East       £29100
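
A self-contained version of the sales summary can be tried on a few made-up rows. This is a sketch rather than a general CSV parser: it assumes region names contain no spaces and fields contain no quoted commas, and it leaves the currency symbol out so the output is plain ASCII.

```bash
#!/bin/bash
# Made-up sample data in date,region,amount form
demo=$(mktemp)
cat > "$demo" <<'EOF'
date,region,amount
2026-01-05,North,1500
2026-01-06,South,2200
2026-01-07,North,300
2026-01-08,East,900
EOF

# Aggregate first, sort the plain numbers, and format last
summary=$(tail -n +2 "$demo" \
  | awk -F, '{region[$2] += $3} END {for (r in region) print r, region[r]}' \
  | sort -k2,2rn \
  | awk '{printf "%-10s %d\n", $1, $2}')

echo "$summary"
rm -f "$demo"
```

Formatting after sorting matters: once a non-numeric prefix is attached to the amount, sort -n can no longer compare the values.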

tee — split a pipeline to a file and stdout

🐧 tee

# Save the full output to a file while the rest of the pipeline still filters it for the screen
some_command | tee output.log | grep "ERROR"

# Append with -a
some_command | tee -a logfile.log >/dev/null  # log only, suppress screen
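
Here is the same pair of commands as a runnable sketch. The some_command function is a hypothetical stand-in that prints three log lines, and the log lives in a scratch directory.

```bash
#!/bin/bash
workdir=$(mktemp -d)
log="$workdir/output.log"

# Hypothetical stand-in for some_command
some_command() { printf 'INFO boot\nERROR disk\nINFO done\n'; }

# First run: the full output goes to the log, only ERROR lines continue downstream
errors=$(some_command | tee "$log" | grep "ERROR")

# Second run: append to the same log and show nothing
some_command | tee -a "$log" > /dev/null

log_lines=$(wc -l < "$log")
echo "errors: $errors"
echo "lines in log: $log_lines"
rm -rf "$workdir"
```

The log ends up with six lines (three from each run), while only the single ERROR line reached the rest of the pipeline.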

9 — Process Substitution

Process substitution — <(command) — lets you feed the output of a command to another command that expects a filename. It's the clean way to use diff, while read, and other tools with live command output.

🐧 Process substitution patterns

# diff two commands' output without temp files
diff <(sort file1.txt) <(sort file2.txt)

# Read the output of a command safely in a while loop
# (a plain pipe would run the loop body in a subshell)
while IFS= read -r line; do
    echo "$line"
done < <(grep "ERROR" app.log)

# Compare sorted lists from two directories
diff <(ls dir1/) <(ls dir2/)

# Write to a process (less common)
tee >(gzip > backup.gz) > plain_copy.txt < source.txt

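The key advantage over a pipe in a while loop: variables set inside the loop body remain visible after the loop ends, because <() runs the command in a separate process but keeps the while loop in the current shell. When feeding a loop, write done < <(command): the first < is an ordinary input redirection and <(command) supplies the file to read from.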
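
The difference shows up as soon as the loop updates a counter. This sketch runs the same loop both ways on a made-up log file, assuming bash's default settings (the lastpipe option is off).

```bash
#!/bin/bash
demo=$(mktemp)
printf 'ERROR a\nINFO b\nERROR c\n' > "$demo"

piped=0
grep "ERROR" "$demo" | while IFS= read -r line; do
    piped=$(( piped + 1 ))               # incremented in a subshell, lost afterwards
done

substituted=0
while IFS= read -r line; do
    substituted=$(( substituted + 1 ))   # incremented in the current shell
done < <(grep "ERROR" "$demo")

echo "pipe: $piped, process substitution: $substituted"
rm -f "$demo"
```

After the piped loop the counter is still 0; after the process-substitution loop it is 2.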

10 — Quick Reference

Tool / Pattern	What it does	Key flags
`while IFS= read -r line; do ... done < file`	Safe line-by-line file reading	`-r` no backslash processing
`mapfile -t arr < file`	Read all lines into an array	`-t` strips trailing newline
`content=$(<file)`	Slurp whole file into variable	—
`find dir -name "*.ext" -type f`	Locate files recursively	`-mtime -7`, `-size +100M`, `-exec`, `-print0`
`grep -E "pattern" file`	Print matching lines	`-i` case-insensitive, `-v` invert, `-q` silent, `-c` count, `-n` line numbers
`sed 's/old/new/g' file`	Stream substitution	`-i` in-place, `-n` suppress output, `/d` delete lines
`awk -F: '{print $1}' file`	Field extraction / processing	`NR` line no., `NF` field count, `BEGIN/END`
`cut -d, -f2 file`	Extract columns by delimiter	`-c` character positions
`sort -n -k2 file`	Sort lines	`-r` reverse, `-u` unique, `-h` human sizes
`uniq -c`	Remove adjacent duplicates / count	`-d` duplicates only, `-u` unique only
`wc -l < file`	Count lines (words, bytes)	`-w` words, `-c` bytes, `-m` chars
`tr 'a-z' 'A-Z'`	Translate characters	`-d` delete, `-s` squeeze repeats
`tee file`	Copy stdin to file and stdout	`-a` append
`diff <(cmd1) <(cmd2)`	Compare command output with process substitution	—

✏️ Exercises

Apply what you have learned. Try writing the script yourself before looking at the sample solution.

Exercise 1

Write a script called log_report.sh that accepts a log file as its first argument and prints: (a) total number of lines, (b) number of lines containing ERROR, (c) number of lines containing WARN, and (d) the 5 most frequent words in ERROR lines, with their counts.

Sample Solution

#!/bin/bash
# log_report.sh — usage: ./log_report.sh app.log

logfile="${1:?Usage: $0 <logfile>}"
[[ -f "$logfile" ]] || { echo "File not found: $logfile" >&2; exit 1; }

total=$(wc -l < "$logfile")
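# grep -c already prints 0 when nothing matches (and exits 1), so adding
# "|| echo 0" would emit a second 0; "|| true" just ignores the exit status
errors=$(grep -c "ERROR" "$logfile" || true)
warns=$(grep -c "WARN" "$logfile" || true)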

printf "Log file : %s\n" "$logfile"
printf "Total    : %d lines\n" "$total"
printf "ERROR    : %d lines\n" "$errors"
printf "WARN     : %d lines\n" "$warns"

echo
echo "Top 5 words in ERROR lines:"
grep "ERROR" "$logfile" \
  | tr -s '[:space:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | grep -v '^$' \
  | sort | uniq -c | sort -rn | head -5 \
  | awk '{printf "  %4d  %s\n", $1, $2}'

Exercise 2

Write a script called csv_filter.sh that reads a CSV file (with a header row), takes a column number and a search term as arguments, and prints all rows where that column matches the term. Also print the header. Example: ./csv_filter.sh sales.csv 2 North prints all rows where column 2 is "North".

Hint: use head -1 to print the header, then tail -n +2 to skip it and pipe to awk -F, with a condition on $colnum.

Sample Solution

#!/bin/bash
# csv_filter.sh — usage: ./csv_filter.sh file.csv COLUMN TERM

file="${1:?Usage: $0 <file.csv> <column> <term>}"
col="${2:?column number required}"
term="${3:?search term required}"

[[ -f "$file" ]] || { echo "File not found: $file" >&2; exit 1; }

# Print header
head -1 "$file"

# Filter rows
tail -n +2 "$file" | awk -F, -v c="$col" -v t="$term" '$c == t'

Exercise 3

Write a script called find_large.sh that accepts a directory and a size threshold (in MB) as arguments, finds all files larger than that threshold, and outputs a formatted table showing the filename (base name only) and size in MB, sorted largest first. At the end, print the total size of all matched files.

Hint: use find dir -type f -size +NNM, where NN is the threshold in MB. Pipe to du -m or use stat to get sizes. Collect results into an array, sort with sort -rn, and sum with awk.

Sample Solution

#!/bin/bash
# find_large.sh — usage: ./find_large.sh /path/to/dir SIZE_MB

dir="${1:?Usage: $0 <directory> <size_MB>}"
threshold="${2:?size threshold in MB required}"

[[ -d "$dir" ]] || { echo "Not a directory: $dir" >&2; exit 1; }

printf "\nFiles larger than %dMB in %s:\n\n" "$threshold" "$dir"
printf "%-40s %8s\n" "Filename" "Size(MB)"
printf '%.0s─' {1..50}; echo

total=0
found=0

# GNU find prints "SIZE PATH" records separated by null bytes;
# sort -z keeps the null separators while ordering largest first
while IFS= read -r -d '' entry; do
    size_bytes=${entry%% *}     # everything before the first space
    filepath=${entry#* }        # everything after the first space
    size_mb=$(echo "scale=1; $size_bytes / 1048576" | bc)
    total=$(echo "$total + $size_mb" | bc)
    (( found++ ))
    printf "%-40s %8.1f\n" "$(basename "$filepath")" "$size_mb"
done < <(find "$dir" -type f -size +"${threshold}"M -printf '%s %p\0' \
         | sort -z -rn)

printf '%.0s─' {1..50}; echo
printf "%-40s %8.1f MB  (%d files)\n" "TOTAL" "$total" "$found"

Exercise 4

Write a script called replace_in_files.sh that accepts three arguments: a directory, a search string, and a replacement string. It should find all .txt files in that directory tree containing the search string, report how many files were found, show a preview of the first match in each file, and then (after confirmation) perform the replacement in all files using sed -i. Make a .bak backup of each file before modifying it.

Hint: use grep -rl to find files containing the pattern. Use grep -m1 for a single-line preview. Use read -r -p "Proceed? [y/N]" for confirmation. Use sed -i.bak for in-place replacement with a backup.

Sample Solution

#!/bin/bash
# replace_in_files.sh — usage: ./replace_in_files.sh DIR SEARCH REPLACE

dir="${1:?Usage: $0 <dir> <search> <replace>}"
search="${2:?search string required}"
replace="${3?replacement string required}"    # note: allows empty string

[[ -d "$dir" ]] || { echo "Not a directory: $dir" >&2; exit 1; }

# Find matching files
matches=()
while IFS= read -r -d '' f; do
    matches+=( "$f" )
done < <(grep -rl --include='*.txt' -Z "$search" "$dir")

if [[ "${#matches[@]}" -eq 0 ]]; then
    echo "No files found containing: $search"
    exit 0
fi

printf "Found %d file(s) containing '%s':\n\n" "${#matches[@]}" "$search"
for f in "${matches[@]}"; do
    preview=$(grep -m1 -n "$search" "$f")
    printf "  %s\n    ↳ %s\n" "$f" "$preview"
done

echo
read -r -p "Replace '$search' → '$replace' in all files? [y/N] " confirm
[[ "$confirm" != [yY] ]] && { echo "Aborted."; exit 0; }

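# Note: grep and sed both treat $search as a regular expression, and a | or &
# in either string would need escaping before this substitution is safe
for f in "${matches[@]}"; do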
    sed -i.bak "s|${search}|${replace}|g" "$f"
    printf "  ✓ Updated: %s  (backup: %s.bak)\n" "$f" "$f"
done
echo "Done."
