📁 Topic 9 — Working with Files and Text

The shell's real power comes from combining simple text-processing tools into pipelines that transform data. This chapter covers reading and writing files safely, navigating the filesystem with find, and the essential Unix text tools — grep, sed, awk, cut, sort, uniq, wc, and tr — with an emphasis on the patterns you'll actually use in scripts every day.

1 — Reading Files

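The canonical, safe way to read a file line by line is a while IFS= read -r loop. It preserves blank lines, leading and trailing whitespace, and backslashes exactly as they appear in the file. One caveat: if the last line has no trailing newline, read returns a non-zero status and the loop skips that line unless you add || [[ -n "$line" ]] to the loop condition.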

🐧 Reading a file line by line
#!/bin/bash

# Canonical pattern: preserves whitespace and backslashes on every line
while IFS= read -r line; do
    echo "Line: $line"
done < "/path/to/file.txt"

# With line numbers
lineno=0
while IFS= read -r line; do
    (( lineno++ ))
    printf "%4d %s\n" "$lineno" "$line"
done < file.txt

# Skip blank lines and comments (lines starting with #)
while IFS= read -r line; do
    [[ -z "$line" || "$line" == \#* ]] && continue
    echo "$line"
done < config.txt

# Read two fields per line (e.g. "name score" format)
while read -r name score; do
    printf "%-15s %d\n" "$name" "$score"
done < scores.txt

IFS= prevents leading/trailing whitespace being stripped from each line. -r prevents backslash escape sequences being interpreted. Both are almost always what you want.
⚠️ Don't do: for line in $(cat file)
This splits on every whitespace character (not just newlines), expands glob characters such as * in the resulting words, and reads the whole file into memory before the loop starts. Always use the while IFS= read -r pattern for processing files line by line.
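
The difference between the two loops is easiest to see on a small file. The following is a minimal sketch, assuming a made-up three-line sample whose last line has no trailing newline; the variable names are illustrative only.

```shell
#!/bin/bash
# Sketch: compare the two loops on a small sample file (sample data is made up).
sample=$(mktemp)
printf '  indented line\nback\\slash\nlast line, no newline' > "$sample"

# Anti-pattern: unquoted $(cat ...) splits on every space and newline.
words=0
for w in $(cat "$sample"); do
    words=$(( words + 1 ))
done

# Safe pattern; the || guard keeps a final line that lacks a trailing newline.
lines=()
while IFS= read -r line || [[ -n "$line" ]]; do
    lines+=( "$line" )
done < "$sample"

echo "for loop saw $words items; while loop saw ${#lines[@]} lines"
# prints: for loop saw 7 items; while loop saw 3 lines
rm -f "$sample"
```

The for loop also loses the leading spaces of the first line, while the array built by the while loop holds each line unchanged.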

Reading a File into a Variable or Array

🐧 Slurp an entire file or split into an array
# Read entire file into a single variable
content=$(<file.txt)          # faster than $(cat file.txt)

# Read all lines into an array (one element per line)
lines=()
while IFS= read -r line; do
    lines+=( "$line" )
done < file.txt

# Or with mapfile / readarray (bash 4+, most concise)
mapfile -t lines < file.txt   # -t strips the trailing newline from each element

echo "Total lines: ${#lines[@]}"
echo "First line : ${lines[0]}"
echo "Last line  : ${lines[-1]}"

2 — Writing Files

🐧 Output redirection patterns
# Overwrite (create or truncate)
echo "Hello" > output.txt

# Append
echo "World" >> output.txt

# Write multiple lines with a here-document
cat > config.ini <<'EOF'
host=localhost
port=8080
debug=false
EOF

# Write with variable expansion in here-doc (no quotes on delimiter)
app_name="myapp"
version="1.0"
cat > version.txt <<EOF
Application: $app_name
Version    : $version
Built      : $(date '+%Y-%m-%d')
EOF

# Write stdout AND stderr to the same log file
logfile="app.log"
{
    echo "Starting process..."
    some_command
    echo "Done."
} &> "$logfile"

# Atomic write: write to a temp file first, then rename
# (prevents partial reads if another process opens the file mid-write)
# The temp file is created next to the target so mv is a same-filesystem rename
tmpfile=$(mktemp final_output.txt.XXXXXX)
generate_data > "$tmpfile"
mv "$tmpfile" "final_output.txt"

Tip: atomic writes with mktemp. When a script updates a file that another process might be reading (like a status file or config), always write to a temporary file first and then mv it into place. A rename on the same filesystem is atomic; a plain > redirect is not, so a reader could see a half-written file. Create the temporary file in the target's directory, because mv between filesystems falls back to a copy.
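
The tip above can be wrapped in a small function. This is a sketch only: atomic_write is a hypothetical helper name, not a standard command, and it assumes the target's directory is writable.

```shell
#!/bin/bash
# Sketch: atomic_write is a hypothetical helper, not a standard command.
# It copies stdin to a temp file beside the target, then renames it into place.
atomic_write() {
    local target=$1 tmp
    # Temp file lives in the target's directory, so mv is a same-filesystem rename.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if cat > "$tmp"; then
        mv -f -- "$tmp" "$target"
    else
        rm -f -- "$tmp"    # do not leave a partial temp file behind
        return 1
    fi
}

workdir=$(mktemp -d)
printf 'status=ok\n' | atomic_write "$workdir/status.txt"
cat "$workdir/status.txt"   # prints: status=ok
rm -rf "$workdir"
```

One thing to keep in mind: mktemp creates the file readable and writable by its owner only, so add a chmod before the mv if other users need to read the result.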

3 — Finding Files with find

find is the standard tool for locating files by name, type, size, age, permissions, or any combination. It recursively traverses the directory tree and can execute actions on matching files.

🐧 find — common patterns
# Find by name (case-sensitive)
find /var/log -name "*.log"

# Find by name (case-insensitive)
find . -iname "*.jpg"

# Find only files (not directories)
find . -type f -name "*.sh"

# Find only directories
find . -type d -name "config"

# Find files modified in the last 7 days
find . -type f -mtime -7

# Find files larger than 100 MB
find / -type f -size +100M

# Limit search depth (don't recurse deeper than 2 levels)
find . -maxdepth 2 -name "*.conf"

# Run a command on each found file (-exec ... {} \;)
find . -name "*.sh" -exec chmod +x {} \;

# Batch processing: -print0 | xargs -0 handles spaces in names
find . -name "*.log" -print0 | xargs -0 rm -f

# Delete empty directories
find . -type d -empty -delete

# Find and loop in bash (safest: handles all filenames)
# Note the space in "< <(...)": stdin is redirected from a process substitution
while IFS= read -r -d '' file; do
    echo "Processing: $file"
done < <(find . -name "*.txt" -print0)

The -print0 + read -d '' combination uses null bytes as the record separator, making it safe for filenames that contain spaces, newlines, or other special characters.
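
To see the null-byte technique cope with awkward names, here is a minimal sketch that builds a scratch directory and collects the matches into an array. The filenames are made-up sample data.

```shell
#!/bin/bash
# Sketch: build a scratch directory with awkward names (made-up sample data).
demo=$(mktemp -d)
touch "$demo/plain.txt" "$demo/with space.txt" "$demo/"$'two\nlines.txt' "$demo/skip.log"

# Null-delimited read: each array element is one complete filename.
files=()
while IFS= read -r -d '' f; do
    files+=( "$f" )
done < <(find "$demo" -type f -name '*.txt' -print0)

echo "Found ${#files[@]} .txt files"   # prints: Found 3 .txt files
rm -rf "$demo"
```

A newline-delimited loop over the same directory would report four entries, because the name containing a newline would be read as two lines.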

4 — Searching Text with grep

grep searches for lines matching a pattern. In scripts you use it both to filter output in pipelines and to test whether a match exists at all (via its exit code).

🐧 grep patterns
# Basic search: print matching lines
grep "error" app.log

# Case-insensitive
grep -i "error" app.log

# Show line numbers
grep -n "TODO" *.py

# Invert match: lines that do NOT match
grep -v "^#" config.txt     # strip comment lines

# Count matching lines
grep -c "FAIL" results.txt

# Show only the matched part, not the whole line
grep -o "[0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+" access.log   # extract IPs

# Extended regex (no need to escape + ? | ( ) )
grep -E "^(ERROR|WARN)" app.log

# Recursive search in a directory tree
grep -r "password" /etc/

# Show N lines of context before/after the match
grep -A 3 "Exception" app.log   # 3 lines after
grep -B 2 "Exception" app.log   # 2 lines before
grep -C 2 "Exception" app.log   # 2 lines either side

# Use exit code in a script (0 = found, 1 = not found)
if grep -q "CRITICAL" app.log; then   # -q = quiet, no output
    echo "Critical errors found!" >&2
    exit 1
fi
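
Besides 0 and 1, grep returns a status greater than 1 when something goes wrong, for example when the file cannot be read. The sketch below separates the three cases; check_log is a hypothetical helper and the log content is made-up sample data.

```shell
#!/bin/bash
# Sketch: check_log is a hypothetical helper; the log content is made up.
check_log() {
    local pattern=$1 file=$2
    grep -q -- "$pattern" "$file" 2>/dev/null
    case $? in
        0) echo "found" ;;
        1) echo "clean" ;;
        *) echo "error" ;;   # 2 or higher: unreadable file, bad pattern, etc.
    esac
}

log=$(mktemp)
printf 'INFO start\nCRITICAL disk full\n' > "$log"

check_log "CRITICAL" "$log"           # prints: found
check_log "SEGFAULT" "$log"           # prints: clean
check_log "CRITICAL" "$log.missing"   # prints: error
rm -f "$log"
```

Treating "not found" and "could not search" differently matters in monitoring scripts, where a missing log file should not be reported as a clean result.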

5 — Stream Editing with sed

sed (stream editor) processes text line by line, making substitutions, deletions, and other edits. The substitution command s/pattern/replacement/ is by far the most used.

🐧 sed — substitution and deletion
# Replace first occurrence on each line
sed 's/foo/bar/' input.txt

# Replace ALL occurrences on each line (g = global)
sed 's/foo/bar/g' input.txt

# Case-insensitive replacement
sed 's/error/ERROR/gI' app.log

# Edit in place (modify the file directly)
sed -i 's/localhost/192.168.1.1/g' config.ini

# -i.bak makes a backup: config.ini.bak
sed -i.bak 's/localhost/192.168.1.1/g' config.ini

# Delete lines matching a pattern
sed '/^#/d' config.txt              # delete comment lines
sed '/^[[:space:]]*$/d' file.txt    # delete blank lines

# Print only specific lines (suppress default output with -n)
sed -n '5p' file.txt                # print line 5
sed -n '5,10p' file.txt             # print lines 5–10
sed -n '/START/,/END/p' file.txt    # print between markers

# Multiple expressions with -e
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Use & to refer to the whole matched text
sed 's/[0-9]\+/[&]/g' file.txt      # wrap every number in brackets
# e.g. "Price 42 for 5 items" becomes "Price [42] for [5] items"

# Strip leading and trailing whitespace
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' file.txt

⚠️ macOS sed vs GNU sed: On macOS, -i requires an explicit backup suffix — sed -i '' 's/a/b/' file (empty string). On Linux, sed -i 's/a/b/' file works without a suffix. For portability in scripts, use -i.bak (creates a backup that you can then delete).
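
The portable form can be hidden behind a small function that removes the backup once sed succeeds. This is a sketch under that assumption; sed_inplace is a hypothetical wrapper name, not part of sed.

```shell
#!/bin/bash
# Sketch: sed_inplace is a hypothetical wrapper, not part of sed.
# -i.bak (suffix attached, no space) is accepted by both GNU and BSD sed.
sed_inplace() {
    local expr=$1 file=$2
    # Delete the backup only if sed exited successfully.
    sed -i.bak "$expr" "$file" && rm -f -- "$file.bak"
}

cfg=$(mktemp)
printf 'host=localhost\nport=8080\n' > "$cfg"
sed_inplace 's/localhost/192.168.1.1/' "$cfg"
head -n 1 "$cfg"   # prints: host=192.168.1.1
rm -f "$cfg"
```

If sed fails, the .bak file is left in place, which is usually what you want when an edit goes wrong.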

6 — Field Processing with awk

awk splits each input line into fields and lets you apply rules to each line. It's ideal for columnar data: log files, CSV, /etc/passwd, command output.

🐧 awk — essential patterns
# Built-in variables:
#   $0 = entire line             $1 $2 ... = individual fields
#   NR = current line number     NF = number of fields on this line
#   FS = field separator         OFS = output field separator

# Print specific fields (default delimiter: any whitespace)
awk '{print $1, $3}' data.txt

# Print last field
awk '{print $NF}' data.txt

# Use a custom field separator
awk -F: '{print $1, $3}' /etc/passwd    # username and UID
awk -F, '{print $2}' data.csv

# Print lines where a field matches a pattern
awk '/ERROR/ {print NR, $0}' app.log
awk '$3 > 100 {print $1, $3}' scores.txt   # numeric comparison

# Sum a column
awk '{sum += $2} END {print "Total:", sum}' sales.txt

# Count lines matching a pattern
awk '/FAIL/ {count++} END {print count " failures"}' results.txt

# BEGIN and END blocks run before/after all input
awk 'BEGIN {print "Name", "Score"} {print $1, $2} END {print "Done"}' scores.txt

# Reformatting: change delimiter in output
awk -F, 'BEGIN {OFS="|"} {print $1, $2, $3}' data.csv

# Capture awk output in a variable
total=$(awk '{sum += $1} END {print sum}' numbers.txt)
echo "Total: $total"
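
Several of these rules can live in one awk program, and shell values can be passed in with -v. The following is a minimal sketch, assuming made-up "name score" data and a hypothetical min_score threshold.

```shell
#!/bin/bash
# Sketch: the scores below are made-up sample data in "name score" format.
scores=$(mktemp)
printf 'alice 90\nbob 72\ncarol 85\n' > "$scores"

min_score=80   # hypothetical threshold, passed into awk with -v

report=$(awk -v min="$min_score" '
    $2 >= min { passed++; print $1 }    # runs only on lines that meet the condition
    { sum += $2 }                       # runs on every line
    END { if (NR > 0) printf "passed=%d avg=%.1f\n", passed, sum / NR }
' "$scores")

echo "$report"
# prints:
# alice
# carol
# passed=2 avg=82.3
rm -f "$scores"
```

Passing the threshold with -v keeps the awk program in single quotes, so the shell never expands anything inside it.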

7 — The Supporting Cast: cut, sort, uniq, wc, tr

cut — extract columns

🐧 cut
# Cut by delimiter and field number
cut -d: -f1 /etc/passwd     # usernames
cut -d, -f2,4 data.csv      # columns 2 and 4
cut -d, -f2- data.csv       # columns 2 to end

# Cut by character position
cut -c1-8 timestamps.txt    # first 8 characters
cut -c9- timestamps.txt     # from character 9 to end

sort — order lines

🐧 sort
sort names.txt               # alphabetical
sort -r names.txt            # reverse alphabetical
sort -n numbers.txt          # numeric
sort -nr numbers.txt         # numeric descending (largest first)
sort -u names.txt            # sort and remove duplicates
sort -t, -k2,2n data.csv     # sort CSV by 2nd column numerically
sort -k1,1 -k2,2n data.txt   # primary sort col 1, secondary col 2
sort -h sizes.txt            # human-numeric (10K before 2M)

uniq — remove adjacent duplicate lines

🐧 uniq (input must be sorted first)
sort items.txt | uniq       # remove duplicates
sort items.txt | uniq -c    # prefix each line with its count
sort items.txt | uniq -d    # print only lines that appeared more than once
sort items.txt | uniq -u    # print only lines that appeared exactly once

# Top 10 most frequent lines
sort access.log | uniq -c | sort -rn | head -10
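
Because uniq only compares neighbouring lines, skipping the sort silently leaves duplicates behind. A minimal sketch on made-up data shows the difference:

```shell
#!/bin/bash
# Sketch: made-up sample data with a duplicate that is not adjacent.
items=$(mktemp)
printf 'apple\nbanana\napple\n' > "$items"

# $(( ... )) normalises any padding that wc adds on some systems.
unsorted=$(( $(uniq "$items" | wc -l) ))          # duplicates not adjacent, nothing removed
sorted=$(( $(sort "$items" | uniq | wc -l) ))     # sort groups them, uniq collapses them

echo "uniq alone: $unsorted lines; sort | uniq: $sorted lines"
# prints: uniq alone: 3 lines; sort | uniq: 2 lines
rm -f "$items"
```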

wc — count lines, words, characters

🐧 wc
wc -l file.txt    # number of lines
wc -w file.txt    # number of words
wc -c file.txt    # number of bytes
wc -m file.txt    # number of characters (multi-byte aware)

# Capture line count cleanly in a variable
count=$(wc -l < file.txt)   # redirect avoids filename in output
echo "Lines: $count"

tr — translate or delete characters

🐧 tr (reads from stdin only)
# Convert to uppercase / lowercase
echo "hello world" | tr '[:lower:]' '[:upper:]'
# prints: HELLO WORLD
echo "HELLO" | tr 'A-Z' 'a-z'
# prints: hello

# Delete specific characters
echo "h3ll0 w0rld" | tr -d '0-9'
# prints: hll wrld

# Squeeze repeated characters (e.g. collapse multiple spaces)
echo "too    many    spaces" | tr -s ' '
# prints: too many spaces

# Replace colons with newlines (e.g. expand PATH for readability)
echo "$PATH" | tr ':' '\n'

# Remove Windows carriage returns from a file
tr -d '\r' < windows.txt > unix.txt

8 — Building Pipelines

The real power is combining these tools. Each tool does one job well; the pipe | connects them into a transformation chain.

Pipeline: top 5 IP addresses in an Apache log
awk '{print $1}' access.log \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -5

Sample output:
    523 192.168.1.105
    311 10.0.0.22
    198 172.16.0.4
    145 192.168.1.200
     89 10.0.0.1

Pipeline: extract failed logins from auth.log
grep "Failed password" /var/log/auth.log \
  | awk '{print $(NF-3)}' \
  | sort | uniq -c | sort -rn \
  | head -10

Pipeline: CSV summary — total sales per region
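# Input: date,region,amount   e.g. 2026-01-05,North,1500
# tail -n +2 skips the header row.
# Sorting happens before the £ sign is added, so sort -n sees plain numbers.
# (A comment cannot follow a trailing backslash, so the notes live up here.)
tail -n +2 sales.csv \
  | awk -F, '{region[$2] += $3} END {for (r in region) print r, region[r]}' \
  | sort -k2,2rn \
  | awk '{printf "%-10s £%d\n", $1, $2}'

Sample output:
    South      £48200
    North      £35700
    East       £29100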

tee — split a pipeline to a file and stdout

🐧 tee
# Save the full output to a file while passing it on down the pipeline
some_command | tee output.log | grep "ERROR"

# Append with -a
some_command | tee -a logfile.log >/dev/null   # log only, suppress screen

9 — Process Substitution

Process substitution — <(command) — lets you feed the output of a command to another command that expects a filename. It's the clean way to use diff, while read, and other tools with live command output.

🐧 Process substitution patterns
# diff two commands' output without temp files
diff <(sort file1.txt) <(sort file2.txt)

# Read the output of a command safely in a while loop
# (a plain pipe would run the loop body in a subshell)
# Note the space in "< <(...)": a redirect followed by a process substitution
while IFS= read -r line; do
    echo "$line"
done < <(grep "ERROR" app.log)

# Compare sorted lists from two directories
diff <(ls dir1/) <(ls dir2/)

# Write to a process (less common)
tee >(gzip > backup.gz) > plain_copy.txt < source.txt

The key advantage over a pipe in a while loop: variables set inside the loop body remain visible after the loop ends, because <() runs the command in a separate process but keeps the while loop in the current shell.
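
The scoping difference is easy to demonstrate by counting lines both ways. This sketch assumes bash with its default settings (the lastpipe option is off).

```shell
#!/bin/bash
# Sketch: count lines two ways to show where the loop body runs.
count_pipe=0
printf 'a\nb\nc\n' | while IFS= read -r line; do
    count_pipe=$(( count_pipe + 1 ))         # runs in a subshell; the change is lost
done

count_procsub=0
while IFS= read -r line; do
    count_procsub=$(( count_procsub + 1 ))   # runs in the current shell
done < <(printf 'a\nb\nc\n')

echo "pipe: $count_pipe, process substitution: $count_procsub"
# prints: pipe: 0, process substitution: 3
```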

10 — Quick Reference

| Tool / Pattern | What it does | Key flags |
| --- | --- | --- |
| while IFS= read -r line; do ... done < file | Safe line-by-line file reading | -r no backslash processing |
| mapfile -t arr < file | Read all lines into an array | -t strips trailing newline |
| content=$(<file) | Slurp whole file into variable | |
| find dir -name "*.ext" -type f | Locate files recursively | -mtime -7, -size +100M, -exec, -print0 |
| grep -E "pattern" file | Print matching lines | -i case-insensitive, -v invert, -q silent, -c count, -n line numbers |
| sed 's/old/new/g' file | Stream substitution | -i in-place, -n suppress output, /d delete lines |
| awk -F: '{print $1}' file | Field extraction / processing | NR line no., NF field count, BEGIN/END |
| cut -d, -f2 file | Extract columns by delimiter | -c character positions |
| sort -n -k2 file | Sort lines | -r reverse, -u unique, -h human sizes |
| uniq -c | Remove adjacent duplicates / count | -d duplicates only, -u unique only |
| wc -l < file | Count lines (words, bytes) | -w words, -c bytes, -m chars |
| tr 'a-z' 'A-Z' | Translate characters | -d delete, -s squeeze repeats |
| tee file | Copy stdin to file and stdout | -a append |
| diff <(cmd1) <(cmd2) | Compare command output with process substitution | |

✏️ Exercises

Apply what you have learned. Try writing the script yourself before looking at the sample solution.

Exercise 1
Write a script called log_report.sh that accepts a log file as its first argument and prints: (a) total number of lines, (b) number of lines containing ERROR, (c) number of lines containing WARN, and (d) the 5 most frequent words in ERROR lines, with their counts.
Hint: use wc -l < file for counts, grep -c for pattern counts, and grep "ERROR" | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -5 for frequent words.
Sample Solution
#!/bin/bash
# log_report.sh - usage: ./log_report.sh app.log

logfile="${1:?Usage: $0 <logfile>}"
[[ -f "$logfile" ]] || { echo "File not found: $logfile" >&2; exit 1; }

total=$(wc -l < "$logfile")
# grep -c already prints 0 when nothing matches (it only exits with status 1),
# so an "|| echo 0" fallback would produce a second 0 and break printf %d
errors=$(grep -c "ERROR" "$logfile")
warns=$(grep -c "WARN" "$logfile")

printf "Log file : %s\n" "$logfile"
printf "Total    : %d lines\n" "$total"
printf "ERROR    : %d lines\n" "$errors"
printf "WARN     : %d lines\n" "$warns"
echo
echo "Top 5 words in ERROR lines:"

grep "ERROR" "$logfile" \
  | tr -s '[:space:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | grep -v '^$' \
  | sort | uniq -c | sort -rn | head -5 \
  | awk '{printf "  %4d %s\n", $1, $2}'

Exercise 2
Write a script called csv_filter.sh that reads a CSV file (with a header row), takes a column number and a search term as arguments, and prints all rows where that column matches the term. Also print the header. Example: ./csv_filter.sh sales.csv 2 North prints all rows where column 2 is "North".
Hint: use head -1 to print the header, then tail -n +2 to skip it and pipe to awk -F, with a condition on $colnum.
Sample Solution
#!/bin/bash
# csv_filter.sh - usage: ./csv_filter.sh file.csv COLUMN TERM

file="${1:?Usage: $0 <file.csv> <column> <term>}"
col="${2:?column number required}"
term="${3:?search term required}"

[[ -f "$file" ]] || { echo "File not found: $file" >&2; exit 1; }

# Print header
head -1 "$file"

# Filter rows
tail -n +2 "$file" | awk -F, -v c="$col" -v t="$term" '$c == t'

Exercise 3
Write a script called find_large.sh that accepts a directory and a size threshold (in MB) as arguments, finds all files larger than that threshold, and outputs a formatted table showing the filename (base name only) and size in MB, sorted largest first. At the end, print the total size of all matched files.
Hint: use find dir -type f -size +NNM to select the files. Pipe to du -m or use stat to get sizes. Collect the results, sort with sort -rn, and sum with awk or bc.
Sample Solution
#!/bin/bash
# find_large.sh - usage: ./find_large.sh /path/to/dir SIZE_MB
# Note: stat --format is GNU coreutils syntax (Linux)

dir="${1:?Usage: $0 <directory> <size_MB>}"
threshold="${2:?size threshold in MB required}"

[[ -d "$dir" ]] || { echo "Not a directory: $dir" >&2; exit 1; }

printf "\nFiles larger than %dMB in %s:\n\n" "$threshold" "$dir"
printf "%-40s %8s\n" "Filename" "Size(MB)"
printf '%.0s─' {1..50}; echo

total=0
found=0

# The pipeline lists "size path" pairs, sorts them largest first, then
# cut drops the size and keeps the rest of the line, so paths with spaces
# survive (paths containing newlines are still not handled)
while IFS= read -r -d '' filepath; do
    size_bytes=$(stat --format='%s' "$filepath" 2>/dev/null)
    [[ -z "$size_bytes" ]] && continue
    size_mb=$(echo "scale=1; $size_bytes / 1048576" | bc)
    total=$(echo "$total + $size_mb" | bc)
    (( found++ ))
    printf "%-40s %8.1f\n" "$(basename "$filepath")" "$size_mb"
done < <(find "$dir" -type f -size +"${threshold}"M -print0 \
    | xargs -0 -I{} stat --format='%s %n' {} 2>/dev/null \
    | sort -rn \
    | cut -d' ' -f2- \
    | tr '\n' '\0')

printf '%.0s─' {1..50}; echo
printf "%-40s %8.1f MB (%d files)\n" "TOTAL" "$total" "$found"

Exercise 4
Write a script called replace_in_files.sh that accepts three arguments: a directory, a search string, and a replacement string. It should find all .txt files in that directory tree containing the search string, report how many files were found, show a preview of the first match in each file, and then (after confirmation) perform the replacement in all files using sed -i. Make a .bak backup of each file before modifying it.
Hint: use grep -rl to find files containing the pattern. Use grep -m1 for a single-line preview. Use read -r -p "Proceed? [y/N]" for confirmation. Use sed -i.bak for in-place replacement with a backup.
Sample Solution
#!/bin/bash
# replace_in_files.sh - usage: ./replace_in_files.sh DIR SEARCH REPLACE
# Note: SEARCH is used as a regular expression by both grep and sed, and
# | (the sed delimiter), & and \ in the arguments are treated specially by sed.

dir="${1:?Usage: $0 <dir> <search> <replace>}"
search="${2:?search string required}"
replace="${3?replacement string required}"   # note: allows empty string

[[ -d "$dir" ]] || { echo "Not a directory: $dir" >&2; exit 1; }

# Find matching files (-Z prints null-terminated names for read -d '')
matches=()
while IFS= read -r -d '' f; do
    matches+=( "$f" )
done < <(grep -rl --include='*.txt' -Z "$search" "$dir")

if [[ "${#matches[@]}" -eq 0 ]]; then
    echo "No files found containing: $search"
    exit 0
fi

printf "Found %d file(s) containing '%s':\n\n" "${#matches[@]}" "$search"
for f in "${matches[@]}"; do
    preview=$(grep -m1 -n "$search" "$f")
    printf "  %s\n    ↳ %s\n" "$f" "$preview"
done

echo
read -r -p "Replace '$search' → '$replace' in all files? [y/N] " confirm
[[ "$confirm" != [yY] ]] && { echo "Aborted."; exit 0; }

for f in "${matches[@]}"; do
    sed -i.bak "s|${search}|${replace}|g" "$f"
    printf "  ✓ Updated: %s (backup: %s.bak)\n" "$f" "$f"
done

echo "Done."