Structured Data: CSV, TSV, and INI

📊 Intermediate Topic 6 — Structured Data: CSV, TSV, and INI

Shell scripts constantly consume and produce structured text: exports from databases, configuration files, reports. The tools are right there — awk, IFS splitting, pattern matching — but the edge cases bite hard. A CSV file with quoted fields containing commas will silently corrupt your data if you split naively. An INI file with inline comments or trailing spaces will hand you the wrong value. This chapter covers the full picture: how the formats actually work, where the traps are, and the robust patterns used in production scripts.

1 — CSV Fundamentals: Why It's Harder Than It Looks

CSV sounds simple — values separated by commas. The problem is that real CSV (RFC 4180) allows fields to contain commas, newlines, and double-quote characters, all of which break naive parsing. A field that contains a comma must be wrapped in double-quotes. A double-quote inside a quoted field must be escaped as two consecutive double-quotes ("").

❌ Naive — breaks on quoted fields
# Input: name,city,note # Alice,"New York",no comma # Bob,"Paris, France",has comma while IFS=',' read -r name city note; do echo "city=$city" done < data.csv # Bob's city = '"Paris' ← WRONG # note = ' France"' ← WRONG
✓ Real-world CSV has quoting rules
# RFC 4180 rules: # 1. Fields MAY be quoted with " # 2. A quoted field may contain , \n " # 3. A " inside quotes = "" # 4. Leading/trailing spaces inside # quotes are part of the value # # "Alice","New York","says ""hi""" # → Alice | New York | says "hi" # # Need a state-machine parser. # awk's FPAT is the cleanest option.
The IFS comma split is only safe when you control the data. If your CSV comes from a spreadsheet export, a database dump, or any system that follows RFC 4180, you must handle quoted fields. Use the awk FPAT approach or pipe through a purpose-built tool like csvkit (csvcut, csvstat). For data you generate yourself — logs, reports — you can choose to avoid values that need quoting, or use TSV instead.

2 — Robust CSV Parsing with awk and FPAT

GNU awk (gawk) provides a special variable FPAT — Field PATtern — which instead of defining the separator defines what a field looks like. Setting it to a pattern that matches either a quoted field or an unquoted field gives you a proper CSV parser in one line.

🐧 FPAT-based CSV parsing in gawk
# FPAT pattern breakdown: # ([^,]*) — any unquoted field (zero or more non-comma chars) # | — OR # ("([^"]|"")*") — a double-quoted field (handles "" escape inside) FPAT_CSV='([^,]*)|("([^"]|"")*")' # ── Print every field on a separate line ────────────────────── gawk -v FPAT='([^,]*)|("([^"]|"")*")' ' { for (i=1; i<=NF; i++) print i, $i } ' data.csv # ── Strip surrounding quotes from a field value ─────────────── # Fields from FPAT still include the outer quotes — strip them: csv_unquote() { # Strips surrounding quotes and unescapes "" → " awk 'BEGIN { s = ARGV[1] if (substr(s,1,1) == "\"") s = substr(s, 2, length(s)-2) gsub(/""/, "\"", s) print s }' "$1" } # ── Practical: process a CSV with header row ────────────────── # File: employees.csv # id,name,department,salary # 1,Alice,"Engineering",95000 # 2,Bob,"Sales, East",72000 # 3,"O'Brien, Carol","HR",68000 gawk -v FPAT='([^,]*)|("([^"]|"")*")' ' NR == 1 { next } # skip header { id = $1 name = $2; gsub(/^"|"$/, "", name) # strip quotes dept = $3; gsub(/^"|"$/, "", dept) sal = $4 printf "%-4s %-20s %-20s %s\n", id, name, dept, sal } ' employees.csv # ── Filter: only Engineering employees ──────────────────────── gawk -v FPAT='([^,]*)|("([^"]|"")*")' ' NR == 1 { print; next } { dept = $3; gsub(/^"|"$/, "", dept) if (dept == "Engineering") print } ' employees.csv # ── Extract a single column by header name ──────────────────── gawk -v FPAT='([^,]*)|("([^"]|"")*")' \ -v col="salary" ' NR == 1 { for (i=1; i<=NF; i++) { h = $i; gsub(/^"|"$/, "", h) if (h == col) colidx = i } next } colidx { v = $colidx; gsub(/^"|"$/, "", v); print v } ' employees.csv
FPAT is a GNU awk extension — not available in POSIX awk or mawk. Check with gawk --version. On systems where only awk is available, fall back to a Python one-liner or csvkit.

Building a reusable csv_parse() library function

🐧 A shell function that turns a CSV row into a bash array
# csv_split ROW → prints fields one per line, with quotes stripped # Pipe its output to mapfile to get an array csv_split() { gawk 'BEGIN { FPAT = "([^,]*)|(\\"([^\\"]|\\"\\")* \\")" $0 = ARGV[1] for (i=1; i<=NF; i++) { v = $i if (substr(v,1,1) == "\"") v = substr(v, 2, length(v)-2) gsub(/""/, "\"", v) print v } }' "$1" } # Usage: turn a CSV row into a bash array row='42,"Smith, John","New York","says ""hello"""' mapfile -t fields < <(csv_split "$row") echo "ID: ${fields[0]}" → 42 echo "Name: ${fields[1]}" → Smith, John echo "City: ${fields[2]}" → New York echo "Note: ${fields[3]}" → says "hello" # Process a whole file row by row process_csv() { local file="$1" local header_done=0 local row while IFS= read -r row; do (( header_done == 0 )) && { header_done=1; continue; } mapfile -t f < <(csv_split "$row") # Now f[0], f[1], f[2]... hold the clean field values printf 'id=%s name=%s\n' "${f[0]}" "${f[1]}" done < "$file" }

3 — Generating CSV Output

Writing CSV is easier than reading it — you control the quoting. The safest rule: always quote fields that may contain commas, double-quotes, or newlines; double up any literal double-quotes inside them. For simple data you know is clean, quoting only when necessary is fine.

🐧 Writing safe CSV from shell scripts
# ── csv_field: quote a single value for CSV output ──────────── csv_field() { local val="$1" # Always quote — simplest and safest val="${val//\"/\"\"}" # escape " → "" printf '"%s"' "$val" } # ── csv_row: turn args into a CSV line ──────────────────────── csv_row() { local first=1 for arg in "$@"; do (( first )) || printf ',' csv_field "$arg" first=0 done printf '\n' } # ── Generate a CSV report ───────────────────────────────────── { csv_row "hostname" "cpu_pct" "mem_pct" "disk_pct" "timestamp" while IFS= read -r host; do # Collect metrics for $host (simplified) cpu=$(ssh "$host" "top -bn1 | awk '/Cpu/ {print 100-\$8}'" 2/dev/null) ts=$(date '+%Y-%m-%d %H:%M:%S') csv_row "$host" "${cpu:-N/A}" "N/A" "N/A" "$ts" done < hosts.txt } > report.csv # ── Append to an existing CSV safely ────────────────────────── csv_append() { local file="$1"; shift if [[ ! -f "$file" ]]; then # Create with header on first call if header provided as first arg touch "$file" fi csv_row "$@" >> "$file" } # ── Column extraction without gawk (safe only for simple CSV) ─ # cut -d, -f2 — works only when no quoted commas exist cut -d',' -f2 employees.csv # With gawk FPAT — handles quoted commas gawk -v FPAT='([^,]*)|("([^"]|"")*")' '{v=$2; gsub(/^"|"$/,"",v); print v}' employees.csv

4 — TSV: Tab-Separated Values

TSV uses a tab character as the delimiter. Tabs almost never appear in real data, which means quoting rules are usually unnecessary and parsing with IFS=$'\t' is safe. Many tools (sort, join, awk) integrate cleanly with tab-delimited data. When you control the format, TSV is often the better choice over CSV for shell processing.

🐧 TSV reading, writing, and conversion
# ── Reading TSV ─────────────────────────────────────────────── while IFS=$'\t' read -r id name dept salary; do printf '%-4s %-20s %-15s %s\n' "$id" "$name" "$dept" "$salary" done < employees.tsv # ── Writing TSV ─────────────────────────────────────────────── tsv_row() { local IFS=$'\t' printf '%s\n' "$*" } tsv_row "id" "name" "department" "salary" tsv_row "1" "Alice" "Engineering" "95000" # ── Converting CSV to TSV ───────────────────────────────────── # Simple (only safe when no embedded commas/newlines in fields) tr ',' $'\t' < simple.csv > simple.tsv # Robust (uses gawk FPAT to parse, then re-joins with tab) gawk -v FPAT='([^,]*)|("([^"]|"")*")' ' { for (i=1; i<=NF; i++) { v = $i gsub(/^"|"$/, "", v) # strip surrounding quotes gsub(/""/, "\"", v) # unescape doubled quotes printf "%s%s", (i>1?"\t":""), v } print "" } ' input.csv > output.tsv # ── Sort TSV by column ──────────────────────────────────────── sort -t$'\t' -k3,3 employees.tsv # sort by col 3 (alphabetic) sort -t$'\t' -k4,4n -r employees.tsv # sort by col 4 (numeric, desc) # ── join two TSV files on a common key ──────────────────────── # employees.tsv: id name dept_id # departments.tsv: dept_id dept_name budget # Both must be sorted on the join field first join -t$'\t' -1 3 -2 1 \ <(sort -t$'\t' -k3,3 employees.tsv) \ <(sort -t$'\t' -k1,1 departments.tsv)

5 — INI and Config File Parsing

INI files are ubiquitous: /etc/mysql/my.cnf, ~/.gitconfig, php.ini, smb.conf, countless application configs. The format has no strict standard but the conventions are consistent: [section] headers, key=value or key = value pairs, and # or ; line comments. The traps are whitespace around the =, inline comments, multi-line values, and section awareness.

🐧 Reading INI files with awk
# Sample config.ini: # [database] # host = db.example.com # port = 5432 # name = myapp_prod # production DB # # [cache] # host = redis.example.com # port = 6379 # ── Get a single value: ini_get FILE SECTION KEY ────────────── ini_get() { local file="$1" section="$2" key="$3" awk -v s="$section" -v k="$key" ' /^\[/ { gsub(/[\[\]]/, "") cur = $0 next } cur == s && /=/ { # Split on first = only eq = index($0, "=") key_part = substr($0, 1, eq-1) val_part = substr($0, eq+1) # Trim whitespace gsub(/^[[:space:]]+|[[:space:]]+$/, "", key_part) gsub(/^[[:space:]]+/, "", val_part) # Strip inline comment (# or ;) sub(/[[:space:]]+[#;].*$/, "", val_part) # Trim trailing whitespace gsub(/[[:space:]]+$/, "", val_part) if (key_part == k) { print val_part; exit } } ' "$file" } db_host=$(ini_get config.ini database host) db_port=$(ini_get config.ini database port) db_name=$(ini_get config.ini database name) echo "Connecting to $db_host:$db_port/$db_name" # ── Load an entire section into bash variables ───────────────── ini_load_section() { local file="$1" section="$2" prefix="${3:-}" local k v while IFS='=' read -r k v; do # Strip whitespace and inline comments k="${k// /}" v="${v#"${v%%[![:space:]]*}"}" # ltrim v="${v%%[[:space:]]*#*}" # strip inline comment v="${v%"${v##*[![:space:]]}"}" # rtrim [[ -z "$k" ]] && continue declare -g "${prefix}${k}=${v}" done < <(awk -v s="$section" ' /^\[/ { gsub(/[\[\]]/,""); cur=$0; next } cur==s && /=/ { print } cur==s && /^\s*($|[#;])/ { next } ' "$file") } # Loads database.host, database.port, etc. as shell vars with prefix ini_load_section config.ini database db_ echo "host=$db_host port=$db_port"

Writing and updating INI values

🐧 Updating a key in an INI file in-place
# ini_set FILE SECTION KEY VALUE # Updates the key if it exists; inserts it if it doesn't ini_set() { local file="$1" section="$2" key="$3" value="$4" if grep -q "^\[${section}\]" "$file"; then # Section exists — try to update existing key if awk -v s="$section" -v k="$key" ' /^\[/ { cur=substr($0,2,length($0)-2) } cur==s && $0 ~ "^[[:space:]]*"k"[[:space:]]*=" { found=1 } END { exit !found } ' "$file"; then # Key exists — update it with sed sed -i "/^\[${section}\]/,/^\[/" \ -e "/^\[${section}\]/,/^\[/ s|^\([[:space:]]*${key}[[:space:]]*=\).*|\1 ${value}|" \ "$file" else # Key missing — append to section sed -i "/^\[${section}\]/a ${key} = ${value}" "$file" fi else # Section missing — append at end of file printf '\n[%s]\n%s = %s\n' "$section" "$key" "$value" >> "$file" fi } ini_set config.ini database port 5433 ini_set config.ini database timeout 30 # new key ini_set config.ini newapp debug true # new section + key # ── List all sections in a file ─────────────────────────────── ini_sections() { grep -oP '(?<=^\[)[^\]]*' "$1" } # ── List all keys in a section ──────────────────────────────── ini_keys() { awk -v s="$2" ' /^\[/ { gsub(/[\[\]]/,""); cur=$0; next } cur==s && /=/ { eq=index($0,"="); k=substr($0,1,eq-1) gsub(/^[[:space:]]+|[[:space:]]+$/,"",k) print k } ' "$1" } ini_sections config.ini → database cache ini_keys config.ini database → host port name

6 — Key=Value and .env Files

The simpler cousin of INI: a flat file of KEY=VALUE pairs with no sections. Docker Compose, systemd unit EnvironmentFile=, and countless deployment tools use this format. Bash can source it directly — but only after sanitising it, because a malicious or malformed .env file can execute arbitrary code on source.

🐧 Safely loading .env files
# ── UNSAFE: direct source runs any code in the file ─────────── # source .env ← NEVER do this with untrusted files # . .env ← same risk # ── Safe: parse and export only valid KEY=VALUE lines ───────── load_env() { local envfile="${1:-.env}" [[ -f "$envfile" ]] || { echo "$envfile not found" >&2; return 1; } local key val line while IFS= read -r line; do # Skip blanks and comments [[ "$line" =~ ^[[:space:]]*($|#) ]] && continue # Must match KEY=VALUE with valid identifier key [[ "$line" =~ ^([A-Za-z_][A-Za-z0-9_]*)=(.*)$ ]] || continue key="${BASH_REMATCH[1]}" val="${BASH_REMATCH[2]}" # Strip optional surrounding quotes [[ "$val" =~ ^\"(.*)\"$ ]] && val="${BASH_REMATCH[1]}" [[ "$val" =~ ^\'(.*)\'$ ]] && val="${BASH_REMATCH[1]}" # Strip inline comment (only outside quotes) val="${val%%[[:space:]]*\#*}" export "${key}=${val}" done < "$envfile" } load_env .env.production echo "DB_HOST=$DB_HOST" # ── source is fine when you wrote the file yourself ─────────── # In CI or deploy scripts, sourcing a file you generated is OK # Just never source files from user input or external sources # ── Generate a .env file ────────────────────────────────────── write_env() { local file="$1"; shift # Args: KEY VALUE KEY VALUE ... { printf '# Generated %s\n' "$(date -Is)" while (( $# >= 2 )); do k="$1" v="$2"; shift 2 # Quote if value contains spaces, #, or special chars [[ "$v" =~ [[:space:]#\"] ]] && v='"'"${v//\"/\\\"}"'"' printf '%s=%s\n' "$k" "$v" done } > "$file" chmod 600 "$file" # env files often contain secrets } write_env .env.staging \ DB_HOST staging-db.internal \ DB_PORT 5432 \ APP_SECRET "my secret value" \ DEBUG false

7 — Generating Structured Reports

A common pattern in ops scripts: collect data from multiple sources, assemble it into a structured table, and emit it as either a human-readable aligned table or a machine-readable CSV. The key is separating data collection from formatting.

🐧 Aligned console table + CSV output from the same data
#!/usr/bin/env bash # disk_report.sh — show disk usage for a list of paths set -euo pipefail FORMAT="${FORMAT:-table}" # table | csv | tsv PATHS=( "${@:-/}" ) # ── Column definitions: name, width, awk-expression ─────────── declare -A COL_W=( [mount]=20 [total]=8 [used]=8 [avail]=8 [pct]=6 ) # ── Collect data from df ─────────────────────────────────────── # df -h outputs: Filesystem Size Used Avail Use% Mounted-on collect_data() { df -h "${PATHS[@]}" 2>/dev/null | tail -n +2 | \ awk '{print $6, $2, $3, $4, $5}' # mount total used avail pct } # ── Render as aligned table ──────────────────────────────────── render_table() { local w_m=${COL_W[mount]} w_t=${COL_W[total]} local w_u=${COL_W[used]} w_a=${COL_W[avail]} w_p=${COL_W[pct]} local sep; sep=$(printf '%*s' $(( w_m + w_t + w_u + w_a + w_p + 5 )) '' | tr ' ' '─') printf "\033[1m%-${w_m}s %${w_t}s %${w_u}s %${w_a}s %${w_p}s\033[0m\n" \ MOUNT TOTAL USED AVAIL PCT printf '%s\n' "$sep" while read -r mount total used avail pct; do # Colour-code by usage percentage pct_n=${pct//%/} local colour='\033[0m' (( pct_n >= 90 )) && colour='\033[31m' (( pct_n >= 75 && pct_n < 90 )) && colour='\033[33m' printf "%-${w_m}s %${w_t}s %${w_u}s %${w_a}s ${colour}%${w_p}s\033[0m\n" \ "$mount" "$total" "$used" "$avail" "$pct" done < <(collect_data) } # ── Render as CSV ───────────────────────────────────────────── render_csv() { csv_row mount total used avail pct while read -r mount total used avail pct; do csv_row "$mount" "$total" "$used" "$avail" "$pct" done < <(collect_data) } # ── Dispatch ────────────────────────────────────────────────── case "$FORMAT" in table) render_table ;; csv) render_csv ;; tsv) render_csv | tr ',' $'\t' ;; # quick for clean data *) echo "Unknown FORMAT: $FORMAT" >&2; exit 1 ;; esac # Usage: # ./disk_report.sh /var /home /tmp # FORMAT=csv ./disk_report.sh / /var

8 — Quick Reference

TaskTool / PatternNotes
Parse CSV with quoted fieldsgawk FPAT='([^,]*)|("([^"]|"")*")'GNU awk only
Simple CSV split (no quotes)IFS=',' read -r f1 f2 f3Only safe when no quoted commas
Extract CSV column by namegawk -v FPAT=... -v col="name" 'NR==1{...} {print $colidx}'
Sort TSV by column Nsort -t$'\t' -kN,NAdd n flag for numeric sort
Join two TSV filesjoin -t$'\t' -1 N -2 M file1 file2Files must be pre-sorted
Read INI valueini_get FILE SECTION KEYawk-based function
Update INI valueini_set FILE SECTION KEY VALUEUses sed in-place
Load .env safelyRegex-validate keys, then export "$key=$val"Never source untrusted files
Quote a CSV fieldval="${val//\"/\"\"}"; printf '"%s"' "$val"Always safe to over-quote
CSV → TSV (no embedded commas)tr ',' $'\t'
CSV → TSV (with quoted commas)gawk FPAT=... + gsub + printf with \t
When to use a real tool instead: For anything beyond simple filtering and extraction, reach for csvkit (pip install csvkit) — it provides csvcut, csvjoin, csvstat, csvsql, and handles encoding, BOM markers, and edge cases you haven't thought of yet. For Python users, pandas and its read_csv() are the standard. Shell CSV parsing is the right tool for quick one-offs and environments where you can't install extra packages.

✏️ Exercises

Each exercise builds a utility you'd actually reach for. Test with edge cases: values containing commas, empty fields, sections with missing keys, files with Windows-style CRLF line endings.

Exercise 1
Write a function called csv_summary() that accepts a CSV file path and prints a summary of its structure: number of rows (excluding header), number of columns, each column name with its index, and for numeric columns the min, max, and average. Use gawk with FPAT so it handles quoted fields correctly.
Hint: in the NR==1 block, capture column names into an array. For data rows, try to detect numeric columns by checking if the value matches /^-?[0-9.]+$/. Track running sum, min, and max per column in arrays. Print the summary in the END block.
Sample Solution
csv_summary() { local file="${1:?csv_summary: file required}" [[ -f "$file" ]] || { echo "Not found: $file" >&2; return 1; } gawk ' BEGIN { FPAT = "([^,]*)(\"([^\"]|\"\")*\")" } function strip(s) { gsub(/^[[:space:]"]+|[[:space:]"]+$/, "", s) return s } NR == 1 { ncols = NF for (i=1; i<=NF; i++) hdr[i] = strip($i) next } { rows++ for (i=1; i<=NF; i++) { v = strip($i) if (v ~ /^-?[0-9]+(\.[0-9]+)?$/) { is_num[i] = 1 sum[i] += v if (!(i in mn) || v+0 < mn[i]) mn[i] = v+0 if (!(i in mx) || v+0 > mx[i]) mx[i] = v+0 } } } END { print "File: " FILENAME print "Rows: " rows print "Columns: " ncols print "" printf "%-4s %-20s %-10s %s\n", "IDX", "COLUMN", "TYPE", "STATS" printf "%s\n", "─────────────────────────────────────────────────────" for (i=1; i<=ncols; i++) { if (is_num[i]) { avg = (rows > 0) ? sum[i] / rows : 0 printf "%-4d %-20s %-10s min=%-8g max=%-8g avg=%.2f\n", i, hdr[i], "numeric", mn[i], mx[i], avg } else { printf "%-4d %-20s %-10s\n", i, hdr[i], "text" } } } ' "$file" } # Test: printf 'id,name,score,city\n1,Alice,92,"New York"\n2,Bob,78,"Paris"\n3,Carol,85,London\n' > /tmp/test.csv csv_summary /tmp/test.csv
Exercise 2
Write a script called csv_filter.sh that filters rows from a CSV file based on a column name and a value. Usage: csv_filter.sh FILE COLUMN VALUE. It should print the header row, then all rows where the named column exactly matches VALUE (case-insensitive). Handle quoted fields correctly and strip surrounding quotes when comparing. Support an optional --contains flag that does a substring match instead of an exact match.
Hint: in NR==1 find the column index by name. In data rows, strip quotes from the target field before comparing. Store the --contains flag in a variable passed via -v. Use tolower() in awk for case-insensitive comparison.
Sample Solution
#!/usr/bin/env bash # csv_filter.sh [--contains] FILE COLUMN VALUE set -euo pipefail MATCH_TYPE="exact" if [[ "${1:-}" == "--contains" ]]; then MATCH_TYPE="contains" shift fi FILE="${1:?Usage: csv_filter.sh [--contains] FILE COLUMN VALUE}" COL="${2:?column name required}" VAL="${3:?value required}" [[ -f "$FILE" ]] || { echo "Error: $FILE not found" >&2; exit 1; } gawk -v col="$COL" -v val="$VAL" -v mtype="$MATCH_TYPE" ' BEGIN { FPAT = "([^,]*)(\"([^\"]|\"\")*\")" colidx = 0 val_l = tolower(val) } function strip(s) { gsub(/^[[:space:]"]+|[[:space:]"]+$/, "", s) gsub(/""/, "\"", s) return s } NR == 1 { for (i=1; i<=NF; i++) { h = strip($i) if (tolower(h) == tolower(col)) { colidx = i } } if (!colidx) { print "Error: column \"" col "\" not found" > "/dev/stderr" exit 1 } print # always print header next } { if (!colidx) next field = strip($colidx) field_l = tolower(field) if (mtype == "exact" && field_l == val_l) print else if (mtype == "contains" && index(field_l, val_l)) print } ' "$FILE"
Exercise 3
Write a complete config library in a file called lib_config.sh designed to be sourced by other scripts. It should provide: config_load FILE (loads all sections into CONF[section.key] associative array), config_get SECTION KEY [DEFAULT] (retrieves a value, returns DEFAULT if missing), config_require SECTION KEY (dies with an error if the key is absent), and config_dump (prints all loaded values sorted). Include a self-test that runs when the file is executed directly.
Hint: use declare -gA CONF to create a global associative array. The key format is "${section}.${key}". In config_get, use "${CONF[$section.$key]:-$default}". The ${BASH_SOURCE[0]} == $0 trick from Topic 4 triggers the self-test.
Sample Solution
#!/usr/bin/env bash # lib_config.sh — INI config library # Source this file; if executed directly it runs self-tests declare -gA CONF=() _CONF_LOADED="" # config_load FILE — parse INI into CONF[section.key] config_load() { local file="${1:?config_load: file required}" [[ -f "$file" ]] || { printf 'config_load: %s not found\n' "$file" >&2; return 1; } CONF=() # reset _CONF_LOADED="$file" local cur_section="" local line k v while IFS= read -r line; do # Remove CR (handle Windows CRLF) line="${line//$'\r'/}" # Skip blank lines and comments [[ "$line" =~ ^[[:space:]]*($|[#;]) ]] && continue # Section header if [[ "$line" =~ ^\[([A-Za-z0-9_.-]+)\] ]]; then cur_section="${BASH_REMATCH[1]}" continue fi # key = value (only inside a section) [[ -z "$cur_section" ]] && continue [[ "$line" =~ ^[[:space:]]*([A-Za-z0-9_./-]+)[[:space:]]*=[[:space:]]*(.*) ]] || continue k="${BASH_REMATCH[1]}" v="${BASH_REMATCH[2]}" # Strip inline comment v="${v%%[[:space:]][#;]*}" # Strip trailing whitespace v="${v%"${v##*[![:space:]]}"}" # Strip surrounding quotes [[ "$v" =~ ^\"(.*)\"$ ]] && v="${BASH_REMATCH[1]}" [[ "$v" =~ ^\'(.*)\'$ ]] && v="${BASH_REMATCH[1]}" CONF["${cur_section}.${k}"]="$v" done < "$file" } # config_get SECTION KEY [DEFAULT] config_get() { local sec="$1" key="$2" default="${3:-}" local ckey="${sec}.${key}" if [[ -v "CONF[$ckey]" ]]; then printf '%s' "${CONF[$ckey]}" else printf '%s' "$default" fi } # config_require SECTION KEY — exits with error if missing config_require() { local sec="$1" key="$2" [[ -v "CONF[${sec}.${key}]" ]] || { printf 'config_require: [%s] %s is required but not set\n' "$sec" "$key" >&2 return 1 } printf '%s' "${CONF[${sec}.${key}]}" } # config_dump — print all loaded values config_dump() { printf '# Config loaded from: %s\n' "$_CONF_LOADED" local k for k in $(printf '%s\n' "${!CONF[@]}" | sort); do printf '%-40s = %s\n' "$k" "${CONF[$k]}" done } # ── Self-test (runs only when executed, not sourced) ─────────── if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then tmpf=$(mktemp /tmp/lib_config_test.XXXXXX.ini) trap "rm -f '$tmpf'" EXIT cat > "$tmpf" <<'EOF' [database] host = db.example.com # primary DB port = 5432 name = "myapp prod" [cache] host = redis.example.com port = 6379 EOF pass=0; fail=0 assert_eq() { if [[ "$1" == "$2" ]]; then printf '\033[32m✓\033[0m %s\n' "$3"; (( pass++ )) else printf '\033[31m✗\033[0m %s (got "%s", want "%s")\n' "$3" "$1" "$2"; (( fail++ )) fi } config_load "$tmpf" assert_eq "$(config_get database host)" "db.example.com" "host without inline comment" assert_eq "$(config_get database port)" "5432" "port as string" assert_eq "$(config_get database name)" "myapp prod" "quoted value" assert_eq "$(config_get cache host)" "redis.example.com" "cache host" assert_eq "$(config_get database missing default_val)" \ "default_val" "missing key returns default" assert_eq "$(config_get nosec nokey fallback)" \ "fallback" "missing section returns default" echo; config_dump printf '\nResults: %d passed, %d failed\n' "$pass" "$fail" (( fail == 0 )) fi
Exercise 4
Write a script called csv_join.sh that performs a left join on two CSV files, matching rows by a common key column. Usage: csv_join.sh FILE1 COL1 FILE2 COL2. The output should have all columns from FILE1, followed by all non-key columns from FILE2. Rows in FILE1 with no match in FILE2 should still appear, with empty strings for the FILE2 columns. Handle quoted fields throughout.
Hint: the cleanest approach is a two-pass awk: first pass reads FILE2 into an associative array keyed by the join field; second pass processes FILE1 and looks up each row's key. Pass both FPAT and the column indices via -v. To find column indices, process NR==1 of each file separately using BEGINFILE or by running two awk programs.
Sample Solution
#!/usr/bin/env bash # csv_join.sh FILE1 COL1 FILE2 COL2 # Left join: all rows from FILE1, matched with FILE2 on key columns set -euo pipefail F1="${1:?Usage: csv_join.sh FILE1 COL1 FILE2 COL2}" C1="${2:?col1 required}" F2="${3:?file2 required}" C2="${4:?col2 required}" for f in "$F1" "$F2"; do [[ -f "$f" ]] || { printf 'Not found: %s\n' "$f" >&2; exit 1; } done gawk \ -v f1="$F1" -v c1="$C1" \ -v f2="$F2" -v c2="$C2" \ ' BEGIN { FPAT = "([^,]*)(\"([^\"]|\"\")*\")" FS = "," } function strip(s, v) { v = s gsub(/^[[:space:]"]+|[[:space:]"]+$/, "", v) gsub(/""/, "\"", v) return v } function csv_field(s, v) { # Always quote output for safety v = s; gsub(/"/, "\"\"", v) return "\"" v "\"" } # ── Pass 1: load FILE2 into memory ───────────────────────── BEGINFILE { if (FILENAME == f2) in_f2 = 1 else in_f2 = 0 } in_f2 && FNR == 1 { for (i=1; i<=NF; i++) { h = strip($i) if (tolower(h) == tolower(c2)) { f2_key_idx = i } f2_hdr[i] = h } f2_ncols = NF next } in_f2 { key = strip($f2_key_idx) row = "" for (i=1; i<=NF; i++) { if (i == f2_key_idx) continue # exclude key col from f2 output row = row (row ? SUBSEP : "") strip($i) } f2_data[key] = row next } # ── Pass 2: process FILE1, join with FILE2 data ───────────── !in_f2 && FNR == 1 { for (i=1; i<=NF; i++) { h = strip($i) if (tolower(h) == tolower(c1)) f1_key_idx = i f1_hdr[i] = h } f1_ncols = NF # Print merged header out = "" for (i=1; i<=f1_ncols; i++) out = out (out?",":"") csv_field(f1_hdr[i]) for (i=1; i<=f2_ncols; i++) { if (i == f2_key_idx) continue out = out "," csv_field(f2_hdr[i]) } print out next } !in_f2 { key = strip($f1_key_idx) out = "" for (i=1; i<=f1_ncols; i++) out = out (out?",":"") csv_field(strip($i)) if (key in f2_data) { n = split(f2_data[key], parts, SUBSEP) for (i=1; i<=n; i++) out = out "," csv_field(parts[i]) } else { # No match: fill with empty fields empties = f2_ncols - 1 # -1 for the key col we excluded for (i=1; i<=empties; i++) out = out "," } print out } ' "$F2" "$F1" # F2 first so it is loaded before F1 is processed # Example: # employees.csv: id,name,dept_id # departments.csv: dept_id,dept_name,budget # ./csv_join.sh employees.csv dept_id departments.csv dept_id