Structured Data: CSV, TSV, and INI

📊 Intermediate Topic 6 — Structured Data: CSV, TSV, and INI

Shell scripts constantly consume and produce structured text: exports from databases, configuration files, reports. The tools are right there — awk, IFS splitting, pattern matching — but the edge cases bite hard. A CSV file with quoted fields containing commas will silently corrupt your data if you split naively. An INI file with inline comments or trailing spaces will hand you the wrong value. This chapter covers the full picture: how the formats actually work, where the traps are, and the robust patterns used in production scripts.

1 — CSV Fundamentals: Why It's Harder Than It Looks

CSV sounds simple — values separated by commas. The problem is that real CSV (RFC 4180) allows fields to contain commas, newlines, and double-quote characters, all of which break naive parsing. A field that contains a comma must be wrapped in double-quotes. A double-quote inside a quoted field must be escaped as two consecutive double-quotes ("").

❌ Naive — breaks on quoted fields

# Input: name,city,note
# Alice,"New York",no comma
# Bob,"Paris, France",has comma

while IFS=',' read -r name city note; do
    echo "city=$city"
done < data.csv
# Bob's city = '"Paris'   ← WRONG
# note       = ' France"' ← WRONG

✓ Real-world CSV has quoting rules

# RFC 4180 rules:
# 1. Fields MAY be quoted with "
# 2. A quoted field may contain , \n "
# 3. A " inside quotes = ""
# 4. Leading/trailing spaces inside
#    quotes are part of the value
#
# "Alice","New York","says ""hi"""
# → Alice | New York | says "hi"
#
# Need a state-machine parser.
# awk's FPAT is the cleanest option.

The IFS comma split is only safe when you control the data. If your CSV comes from a spreadsheet export, a database dump, or any system that follows RFC 4180, you must handle quoted fields. Use the awk FPAT approach or pipe through a purpose-built tool like csvkit (csvcut, csvstat). For data you generate yourself — logs, reports — you can choose to avoid values that need quoting, or use TSV instead.

2 — Robust CSV Parsing with `awk` and `FPAT`

GNU awk (gawk) provides a special variable FPAT — Field PATtern — which instead of defining the separator defines what a field looks like. Setting it to a pattern that matches either a quoted field or an unquoted field gives you a proper CSV parser in one line.

🐧 FPAT-based CSV parsing in gawk

# FPAT pattern breakdown:
#   ([^,]*)           — any unquoted field (zero or more non-comma chars)
#   |                 — OR
#   ("([^"]|"")*")    — a double-quoted field (handles "" escape inside)
FPAT_CSV='([^,]*)|("([^"]|"")*")'

# ── Print every field on a separate line ──────────────────────
gawk -v FPAT='([^,]*)|("([^"]|"")*")' '
{ for (i=1; i<=NF; i++) print i, $i }
' data.csv

# ── Strip surrounding quotes from a field value ───────────────
# Fields from FPAT still include the outer quotes — strip them:
csv_unquote() {
    # Strips surrounding quotes and unescapes "" → "
    awk 'BEGIN {
        s = ARGV[1]
        if (substr(s,1,1) == "\"") s = substr(s, 2, length(s)-2)
        gsub(/""/, "\"", s)
        print s
    }' "$1"
}

# ── Practical: process a CSV with header row ──────────────────
# File: employees.csv
# id,name,department,salary
# 1,Alice,"Engineering",95000
# 2,Bob,"Sales, East",72000
# 3,"O'Brien, Carol","HR",68000

gawk -v FPAT='([^,]*)|("([^"]|"")*")' '
NR == 1 { next }            # skip header
{
    id   = $1
    name = $2; gsub(/^"|"$/, "", name)   # strip quotes
    dept = $3; gsub(/^"|"$/, "", dept)
    sal  = $4

    printf "%-4s %-20s %-20s %s\n", id, name, dept, sal
}
' employees.csv

# ── Filter: only Engineering employees ────────────────────────
gawk -v FPAT='([^,]*)|("([^"]|"")*")' '
NR == 1 { print; next }
{
    dept = $3; gsub(/^"|"$/, "", dept)
    if (dept == "Engineering") print
}
' employees.csv

# ── Extract a single column by header name ────────────────────
gawk -v FPAT='([^,]*)|("([^"]|"")*")' \
     -v col="salary" '
NR == 1 {
    for (i=1; i<=NF; i++) {
        h = $i; gsub(/^"|"$/, "", h)
        if (h == col) colidx = i
    }
    next
}
colidx { v = $colidx; gsub(/^"|"$/, "", v); print v }
' employees.csv

FPAT is a GNU awk extension — not available in POSIX awk or mawk. Check with gawk --version. On systems where only awk is available, fall back to a Python one-liner or csvkit.

Building a reusable csv_parse() library function

🐧 A shell function that turns a CSV row into a bash array

# csv_split ROW → prints fields one per line, with quotes stripped
# Pipe its output to mapfile to get an array
csv_split() {
    gawk 'BEGIN {
        FPAT = "([^,]*)|(\\"([^\\"]|\\"\\")* \\")"
        $0 = ARGV[1]
        for (i=1; i<=NF; i++) {
            v = $i
            if (substr(v,1,1) == "\"") v = substr(v, 2, length(v)-2)
            gsub(/""/, "\"", v)
            print v
        }
    }' "$1"
}

# Usage: turn a CSV row into a bash array
row='42,"Smith, John","New York","says ""hello"""'
mapfile -t fields < <(csv_split "$row")

echo "ID:   ${fields[0]}"   → 42
echo "Name: ${fields[1]}"   → Smith, John
echo "City: ${fields[2]}"   → New York
echo "Note: ${fields[3]}"   → says "hello"

# Process a whole file row by row
process_csv() {
    local file="$1"
    local header_done=0
    local row

    while IFS= read -r row; do
        (( header_done == 0 )) && { header_done=1; continue; }
        mapfile -t f < <(csv_split "$row")
        # Now f[0], f[1], f[2]... hold the clean field values
        printf 'id=%s name=%s\n' "${f[0]}" "${f[1]}"
    done < "$file"
}

3 — Generating CSV Output

Writing CSV is easier than reading it — you control the quoting. The safest rule: always quote fields that may contain commas, double-quotes, or newlines; double up any literal double-quotes inside them. For simple data you know is clean, quoting only when necessary is fine.

🐧 Writing safe CSV from shell scripts

# ── csv_field: quote a single value for CSV output ────────────
csv_field() {
    local val="$1"
    # Always quote — simplest and safest
    val="${val//\"/\"\"}"   # escape " → ""
    printf '"%s"' "$val"
}

# ── csv_row: turn args into a CSV line ────────────────────────
csv_row() {
    local first=1
    for arg in "$@"; do
        (( first )) || printf ','
        csv_field "$arg"
        first=0
    done
    printf '\n'
}

# ── Generate a CSV report ─────────────────────────────────────
{
    csv_row "hostname" "cpu_pct" "mem_pct" "disk_pct" "timestamp"

    while IFS= read -r host; do
        # Collect metrics for $host (simplified)
        cpu=$(ssh "$host" "top -bn1 | awk '/Cpu/ {print 100-\$8}'" 2/dev/null)
        ts=$(date '+%Y-%m-%d %H:%M:%S')
        csv_row "$host" "${cpu:-N/A}" "N/A" "N/A" "$ts"
    done < hosts.txt
} > report.csv

# ── Append to an existing CSV safely ──────────────────────────
csv_append() {
    local file="$1"; shift
    if [[ ! -f "$file" ]]; then
        # Create with header on first call if header provided as first arg
        touch "$file"
    fi
    csv_row "$@" >> "$file"
}

# ── Column extraction without gawk (safe only for simple CSV) ─
# cut -d, -f2 — works only when no quoted commas exist
cut -d',' -f2 employees.csv

# With gawk FPAT — handles quoted commas
gawk -v FPAT='([^,]*)|("([^"]|"")*")' '{v=$2; gsub(/^"|"$/,"",v); print v}' employees.csv

4 — TSV: Tab-Separated Values

TSV uses a tab character as the delimiter. Tabs almost never appear in real data, which means quoting rules are usually unnecessary and parsing with IFS=$'\t' is safe. Many tools (sort, join, awk) integrate cleanly with tab-delimited data. When you control the format, TSV is often the better choice over CSV for shell processing.

🐧 TSV reading, writing, and conversion

# ── Reading TSV ───────────────────────────────────────────────
while IFS=$'\t' read -r id name dept salary; do
    printf '%-4s %-20s %-15s %s\n' "$id" "$name" "$dept" "$salary"
done < employees.tsv

# ── Writing TSV ───────────────────────────────────────────────
tsv_row() {
    local IFS=$'\t'
    printf '%s\n' "$*"
}
tsv_row "id" "name" "department" "salary"
tsv_row "1"  "Alice" "Engineering" "95000"

# ── Converting CSV to TSV ─────────────────────────────────────
# Simple (only safe when no embedded commas/newlines in fields)
tr ',' $'\t' < simple.csv > simple.tsv

# Robust (uses gawk FPAT to parse, then re-joins with tab)
gawk -v FPAT='([^,]*)|("([^"]|"")*")' '
{
    for (i=1; i<=NF; i++) {
        v = $i
        gsub(/^"|"$/, "", v)   # strip surrounding quotes
        gsub(/""/, "\"", v)    # unescape doubled quotes
        printf "%s%s", (i>1?"\t":""), v
    }
    print ""
}
' input.csv > output.tsv

# ── Sort TSV by column ────────────────────────────────────────
sort -t$'\t' -k3,3       employees.tsv   # sort by col 3 (alphabetic)
sort -t$'\t' -k4,4n -r  employees.tsv   # sort by col 4 (numeric, desc)

# ── join two TSV files on a common key ────────────────────────
# employees.tsv:  id  name  dept_id
# departments.tsv: dept_id  dept_name  budget
# Both must be sorted on the join field first
join -t$'\t' -1 3 -2 1 \
    <(sort -t$'\t' -k3,3 employees.tsv) \
    <(sort -t$'\t' -k1,1 departments.tsv)

5 — INI and Config File Parsing

INI files are ubiquitous: /etc/mysql/my.cnf, ~/.gitconfig, php.ini, smb.conf, countless application configs. The format has no strict standard but the conventions are consistent: [section] headers, key=value or key = value pairs, and # or ; line comments. The traps are whitespace around the =, inline comments, multi-line values, and section awareness.

🐧 Reading INI files with awk

# Sample config.ini:
# [database]
# host = db.example.com
# port = 5432
# name = myapp_prod   # production DB
#
# [cache]
# host = redis.example.com
# port = 6379

# ── Get a single value: ini_get FILE SECTION KEY ──────────────
ini_get() {
    local file="$1" section="$2" key="$3"
    awk -v s="$section" -v k="$key" '
    /^\[/ {
        gsub(/[\[\]]/, "")
        cur = $0
        next
    }
    cur == s && /=/ {
        # Split on first = only
        eq = index($0, "=")
        key_part = substr($0, 1, eq-1)
        val_part = substr($0, eq+1)
        # Trim whitespace
        gsub(/^[[:space:]]+|[[:space:]]+$/, "", key_part)
        gsub(/^[[:space:]]+/, "", val_part)
        # Strip inline comment (# or ;)
        sub(/[[:space:]]+[#;].*$/, "", val_part)
        # Trim trailing whitespace
        gsub(/[[:space:]]+$/, "", val_part)
        if (key_part == k) { print val_part; exit }
    }
    ' "$file"
}

db_host=$(ini_get config.ini database host)
db_port=$(ini_get config.ini database port)
db_name=$(ini_get config.ini database name)
echo "Connecting to $db_host:$db_port/$db_name"

# ── Load an entire section into bash variables ─────────────────
ini_load_section() {
    local file="$1" section="$2" prefix="${3:-}"
    local k v
    while IFS='=' read -r k v; do
        # Strip whitespace and inline comments
        k="${k// /}"
        v="${v#"${v%%[![:space:]]*}"}"   # ltrim
        v="${v%%[[:space:]]*#*}"          # strip inline comment
        v="${v%"${v##*[![:space:]]}"}"   # rtrim
        [[ -z "$k" ]] && continue
        declare -g "${prefix}${k}=${v}"
    done < <(awk -v s="$section" '
        /^\[/ { gsub(/[\[\]]/,""); cur=$0; next }
        cur==s && /=/ { print }
        cur==s && /^\s*($|[#;])/ { next }
    ' "$file")
}

# Loads database.host, database.port, etc. as shell vars with prefix
ini_load_section config.ini database db_
echo "host=$db_host  port=$db_port"

Writing and updating INI values

🐧 Updating a key in an INI file in-place

# ini_set FILE SECTION KEY VALUE
# Updates the key if it exists; inserts it if it doesn't
ini_set() {
    local file="$1" section="$2" key="$3" value="$4"

    if grep -q "^\[${section}\]" "$file"; then
        # Section exists — try to update existing key
        if awk -v s="$section" -v k="$key" '
            /^\[/ { cur=substr($0,2,length($0)-2) }
            cur==s && $0 ~ "^[[:space:]]*"k"[[:space:]]*=" { found=1 }
            END { exit !found }
        ' "$file"; then
            # Key exists — update it with sed
            sed -i "/^\[${section}\]/,/^\[/" \
                -e "/^\[${section}\]/,/^\[/ s|^\([[:space:]]*${key}[[:space:]]*=\).*|\1 ${value}|" \
                "$file"
        else
            # Key missing — append to section
            sed -i "/^\[${section}\]/a ${key} = ${value}" "$file"
        fi
    else
        # Section missing — append at end of file
        printf '\n[%s]\n%s = %s\n' "$section" "$key" "$value" >> "$file"
    fi
}

ini_set config.ini database port 5433
ini_set config.ini database timeout 30    # new key
ini_set config.ini newapp debug true       # new section + key

# ── List all sections in a file ───────────────────────────────
ini_sections() {
    grep -oP '(?<=^\[)[^\]]*' "$1"
}

# ── List all keys in a section ────────────────────────────────
ini_keys() {
    awk -v s="$2" '
        /^\[/ { gsub(/[\[\]]/,""); cur=$0; next }
        cur==s && /=/ {
            eq=index($0,"="); k=substr($0,1,eq-1)
            gsub(/^[[:space:]]+|[[:space:]]+$/,"",k)
            print k
        }
    ' "$1"
}

ini_sections config.ini          → database
                                      cache
ini_keys     config.ini database  → host
                                      port
                                      name

6 — Key=Value and `.env` Files

The simpler cousin of INI: a flat file of KEY=VALUE pairs with no sections. Docker Compose, systemd unit EnvironmentFile=, and countless deployment tools use this format. Bash can source it directly — but only after sanitising it, because a malicious or malformed .env file can execute arbitrary code on source.

🐧 Safely loading .env files

# ── UNSAFE: direct source runs any code in the file ───────────
# source .env        ← NEVER do this with untrusted files
# . .env             ← same risk

# ── Safe: parse and export only valid KEY=VALUE lines ─────────
load_env() {
    local envfile="${1:-.env}"
    [[ -f "$envfile" ]] || { echo "$envfile not found" >&2; return 1; }

    local key val line
    while IFS= read -r line; do
        # Skip blanks and comments
        [[ "$line" =~ ^[[:space:]]*($|#) ]] && continue
        # Must match KEY=VALUE with valid identifier key
        [[ "$line" =~ ^([A-Za-z_][A-Za-z0-9_]*)=(.*)$ ]] || continue
        key="${BASH_REMATCH[1]}"
        val="${BASH_REMATCH[2]}"
        # Strip optional surrounding quotes
        [[ "$val" =~ ^\"(.*)\"$ ]] && val="${BASH_REMATCH[1]}"
        [[ "$val" =~ ^\'(.*)\'$ ]] && val="${BASH_REMATCH[1]}"
        # Strip inline comment (only outside quotes)
        val="${val%%[[:space:]]*\#*}"
        export "${key}=${val}"
    done < "$envfile"
}

load_env .env.production
echo "DB_HOST=$DB_HOST"

# ── source is fine when you wrote the file yourself ───────────
# In CI or deploy scripts, sourcing a file you generated is OK
# Just never source files from user input or external sources

# ── Generate a .env file ──────────────────────────────────────
write_env() {
    local file="$1"; shift
    # Args: KEY VALUE KEY VALUE ...
    {
        printf '# Generated %s\n' "$(date -Is)"
        while (( $# >= 2 )); do
            k="$1" v="$2"; shift 2
            # Quote if value contains spaces, #, or special chars
            [[ "$v" =~ [[:space:]#\"] ]] && v='"'"${v//\"/\\\"}"'"'
            printf '%s=%s\n' "$k" "$v"
        done
    } > "$file"
    chmod 600 "$file"   # env files often contain secrets
}
write_env .env.staging \
    DB_HOST staging-db.internal \
    DB_PORT 5432 \
    APP_SECRET "my secret value" \
    DEBUG false

7 — Generating Structured Reports

A common pattern in ops scripts: collect data from multiple sources, assemble it into a structured table, and emit it as either a human-readable aligned table or a machine-readable CSV. The key is separating data collection from formatting.

🐧 Aligned console table + CSV output from the same data

#!/usr/bin/env bash
# disk_report.sh — show disk usage for a list of paths
set -euo pipefail

FORMAT="${FORMAT:-table}"   # table | csv | tsv
PATHS=( "${@:-/}" )

# ── Column definitions: name, width, awk-expression ───────────
declare -A COL_W=( [mount]=20 [total]=8 [used]=8 [avail]=8 [pct]=6 )

# ── Collect data from df ───────────────────────────────────────
# df -h outputs: Filesystem Size Used Avail Use% Mounted-on
collect_data() {
    df -h "${PATHS[@]}" 2>/dev/null | tail -n +2 | \
    awk '{print $6, $2, $3, $4, $5}'  # mount total used avail pct
}

# ── Render as aligned table ────────────────────────────────────
render_table() {
    local w_m=${COL_W[mount]} w_t=${COL_W[total]}
    local w_u=${COL_W[used]}  w_a=${COL_W[avail]} w_p=${COL_W[pct]}
    local sep; sep=$(printf '%*s' $(( w_m + w_t + w_u + w_a + w_p + 5 )) '' | tr ' ' '─')

    printf "\033[1m%-${w_m}s %${w_t}s %${w_u}s %${w_a}s %${w_p}s\033[0m\n" \
        MOUNT TOTAL USED AVAIL PCT
    printf '%s\n' "$sep"

    while read -r mount total used avail pct; do
        # Colour-code by usage percentage
        pct_n=${pct//%/}
        local colour='\033[0m'
        (( pct_n >= 90 )) && colour='\033[31m'
        (( pct_n >= 75 && pct_n < 90 )) && colour='\033[33m'

        printf "%-${w_m}s %${w_t}s %${w_u}s %${w_a}s ${colour}%${w_p}s\033[0m\n" \
            "$mount" "$total" "$used" "$avail" "$pct"
    done < <(collect_data)
}

# ── Render as CSV ─────────────────────────────────────────────
render_csv() {
    csv_row mount total used avail pct
    while read -r mount total used avail pct; do
        csv_row "$mount" "$total" "$used" "$avail" "$pct"
    done < <(collect_data)
}

# ── Dispatch ──────────────────────────────────────────────────
case "$FORMAT" in
    table) render_table ;;
    csv)   render_csv   ;;
    tsv)   render_csv | tr ',' $'\t' ;;  # quick for clean data
    *)     echo "Unknown FORMAT: $FORMAT" >&2; exit 1 ;;
esac

# Usage:
# ./disk_report.sh /var /home /tmp
# FORMAT=csv ./disk_report.sh / /var

8 — Quick Reference

Task	Tool / Pattern	Notes
Parse CSV with quoted fields	`gawk FPAT='([^,])\|("([^"]\|"")")'`	GNU awk only
Simple CSV split (no quotes)	`IFS=',' read -r f1 f2 f3`	Only safe when no quoted commas
Extract CSV column by name	`gawk -v FPAT=... -v col="name" 'NR==1{...} {print $colidx}'`
Sort TSV by column N	`sort -t$'\t' -kN,N`	Add `n` flag for numeric sort
Join two TSV files	`join -t$'\t' -1 N -2 M file1 file2`	Files must be pre-sorted
Read INI value	`ini_get FILE SECTION KEY`	awk-based function
Update INI value	`ini_set FILE SECTION KEY VALUE`	Uses sed in-place
Load .env safely	Regex-validate keys, then `export "$key=$val"`	Never `source` untrusted files
Quote a CSV field	`val="${val//\"/\"\"}"; printf '"%s"' "$val"`	Always safe to over-quote
CSV → TSV (no embedded commas)	`tr ',' $'\t'`
CSV → TSV (with quoted commas)	`gawk FPAT=... + gsub + printf with \t`

When to use a real tool instead: For anything beyond simple filtering and extraction, reach for csvkit (pip install csvkit) — it provides csvcut, csvjoin, csvstat, csvsql, and handles encoding, BOM markers, and edge cases you haven't thought of yet. For Python users, pandas and its read_csv() are the standard. Shell CSV parsing is the right tool for quick one-offs and environments where you can't install extra packages.

✏️ Exercises

Each exercise builds a utility you'd actually reach for. Test with edge cases: values containing commas, empty fields, sections with missing keys, files with Windows-style CRLF line endings.

Exercise 1

Write a function called csv_summary() that accepts a CSV file path and prints a summary of its structure: number of rows (excluding header), number of columns, each column name with its index, and for numeric columns the min, max, and average. Use gawk with FPAT so it handles quoted fields correctly.

Hint: in the NR==1 block, capture column names into an array. For data rows, try to detect numeric columns by checking if the value matches /^-?[0-9.]+$/. Track running sum, min, and max per column in arrays. Print the summary in the END block.

Sample Solution

csv_summary() {
    local file="${1:?csv_summary: file required}"
    [[ -f "$file" ]] || { echo "Not found: $file" >&2; return 1; }

    gawk '
    BEGIN { FPAT = "([^,]*)(\"([^\"]|\"\")*\")" }
    function strip(s) {
        gsub(/^[[:space:]"]+|[[:space:]"]+$/, "", s)
        return s
    }
    NR == 1 {
        ncols = NF
        for (i=1; i<=NF; i++) hdr[i] = strip($i)
        next
    }
    {
        rows++
        for (i=1; i<=NF; i++) {
            v = strip($i)
            if (v ~ /^-?[0-9]+(\.[0-9]+)?$/) {
                is_num[i] = 1
                sum[i] += v
                if (!(i in mn) || v+0 < mn[i]) mn[i] = v+0
                if (!(i in mx) || v+0 > mx[i]) mx[i] = v+0
            }
        }
    }
    END {
        print "File:    " FILENAME
        print "Rows:    " rows
        print "Columns: " ncols
        print ""
        printf "%-4s %-20s %-10s %s\n", "IDX", "COLUMN", "TYPE", "STATS"
        printf "%s\n", "─────────────────────────────────────────────────────"
        for (i=1; i<=ncols; i++) {
            if (is_num[i]) {
                avg = (rows > 0) ? sum[i] / rows : 0
                printf "%-4d %-20s %-10s min=%-8g max=%-8g avg=%.2f\n",
                    i, hdr[i], "numeric", mn[i], mx[i], avg
            } else {
                printf "%-4d %-20s %-10s\n", i, hdr[i], "text"
            }
        }
    }
    ' "$file"
}

# Test:
printf 'id,name,score,city\n1,Alice,92,"New York"\n2,Bob,78,"Paris"\n3,Carol,85,London\n' > /tmp/test.csv
csv_summary /tmp/test.csv

Exercise 2

Write a script called csv_filter.sh that filters rows from a CSV file based on a column name and a value. Usage: csv_filter.sh FILE COLUMN VALUE. It should print the header row, then all rows where the named column exactly matches VALUE (case-insensitive). Handle quoted fields correctly and strip surrounding quotes when comparing. Support an optional --contains flag that does a substring match instead of an exact match.

Hint: in NR==1 find the column index by name. In data rows, strip quotes from the target field before comparing. Store the --contains flag in a variable passed via -v. Use tolower() in awk for case-insensitive comparison.

Sample Solution

#!/usr/bin/env bash
# csv_filter.sh [--contains] FILE COLUMN VALUE
set -euo pipefail

MATCH_TYPE="exact"
if [[ "${1:-}" == "--contains" ]]; then
    MATCH_TYPE="contains"
    shift
fi

FILE="${1:?Usage: csv_filter.sh [--contains] FILE COLUMN VALUE}"
COL="${2:?column name required}"
VAL="${3:?value required}"

[[ -f "$FILE" ]] || { echo "Error: $FILE not found" >&2; exit 1; }

gawk -v col="$COL" -v val="$VAL" -v mtype="$MATCH_TYPE" '
BEGIN {
    FPAT = "([^,]*)(\"([^\"]|\"\")*\")"
    colidx = 0
    val_l = tolower(val)
}

function strip(s) {
    gsub(/^[[:space:]"]+|[[:space:]"]+$/, "", s)
    gsub(/""/, "\"", s)
    return s
}

NR == 1 {
    for (i=1; i<=NF; i++) {
        h = strip($i)
        if (tolower(h) == tolower(col)) { colidx = i }
    }
    if (!colidx) {
        print "Error: column \"" col "\" not found" > "/dev/stderr"
        exit 1
    }
    print   # always print header
    next
}

{
    if (!colidx) next
    field = strip($colidx)
    field_l = tolower(field)

    if (mtype == "exact" && field_l == val_l) print
    else if (mtype == "contains" && index(field_l, val_l)) print
}
' "$FILE"

Exercise 3

Write a complete config library in a file called lib_config.sh designed to be sourced by other scripts. It should provide: config_load FILE (loads all sections into CONF[section.key] associative array), config_get SECTION KEY [DEFAULT] (retrieves a value, returns DEFAULT if missing), config_require SECTION KEY (dies with an error if the key is absent), and config_dump (prints all loaded values sorted). Include a self-test that runs when the file is executed directly.

Hint: use declare -gA CONF to create a global associative array. The key format is "${section}.${key}". In config_get, use "${CONF[$section.$key]:-$default}". The ${BASH_SOURCE[0]} == $0 trick from Topic 4 triggers the self-test.

Sample Solution

#!/usr/bin/env bash
# lib_config.sh — INI config library
# Source this file; if executed directly it runs self-tests

declare -gA CONF=()
_CONF_LOADED=""

# config_load FILE — parse INI into CONF[section.key]
config_load() {
    local file="${1:?config_load: file required}"
    [[ -f "$file" ]] || { printf 'config_load: %s not found\n' "$file" >&2; return 1; }
    CONF=()   # reset
    _CONF_LOADED="$file"

    local cur_section=""
    local line k v

    while IFS= read -r line; do
        # Remove CR (handle Windows CRLF)
        line="${line//$'\r'/}"
        # Skip blank lines and comments
        [[ "$line" =~ ^[[:space:]]*($|[#;]) ]] && continue
        # Section header
        if [[ "$line" =~ ^\[([A-Za-z0-9_.-]+)\] ]]; then
            cur_section="${BASH_REMATCH[1]}"
            continue
        fi
        # key = value (only inside a section)
        [[ -z "$cur_section" ]] && continue
        [[ "$line" =~ ^[[:space:]]*([A-Za-z0-9_./-]+)[[:space:]]*=[[:space:]]*(.*) ]] || continue
        k="${BASH_REMATCH[1]}"
        v="${BASH_REMATCH[2]}"
        # Strip inline comment
        v="${v%%[[:space:]][#;]*}"
        # Strip trailing whitespace
        v="${v%"${v##*[![:space:]]}"}"
        # Strip surrounding quotes
        [[ "$v" =~ ^\"(.*)\"$ ]] && v="${BASH_REMATCH[1]}"
        [[ "$v" =~ ^\'(.*)\'$ ]] && v="${BASH_REMATCH[1]}"
        CONF["${cur_section}.${k}"]="$v"
    done < "$file"
}

# config_get SECTION KEY [DEFAULT]
config_get() {
    local sec="$1" key="$2" default="${3:-}"
    local ckey="${sec}.${key}"
    if [[ -v "CONF[$ckey]" ]]; then
        printf '%s' "${CONF[$ckey]}"
    else
        printf '%s' "$default"
    fi
}

# config_require SECTION KEY — exits with error if missing
config_require() {
    local sec="$1" key="$2"
    [[ -v "CONF[${sec}.${key}]" ]] || {
        printf 'config_require: [%s] %s is required but not set\n' "$sec" "$key" >&2
        return 1
    }
    printf '%s' "${CONF[${sec}.${key}]}"
}

# config_dump — print all loaded values
config_dump() {
    printf '# Config loaded from: %s\n' "$_CONF_LOADED"
    local k
    for k in $(printf '%s\n' "${!CONF[@]}" | sort); do
        printf '%-40s = %s\n' "$k" "${CONF[$k]}"
    done
}

# ── Self-test (runs only when executed, not sourced) ───────────
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
    tmpf=$(mktemp /tmp/lib_config_test.XXXXXX.ini)
    trap "rm -f '$tmpf'" EXIT

    cat > "$tmpf" <<'EOF'
[database]
host = db.example.com  # primary DB
port = 5432
name = "myapp prod"

[cache]
host = redis.example.com
port = 6379
EOF

    pass=0; fail=0
    assert_eq() {
        if [[ "$1" == "$2" ]]; then
            printf '\033[32m✓\033[0m %s\n' "$3"; (( pass++ ))
        else
            printf '\033[31m✗\033[0m %s (got "%s", want "%s")\n' "$3" "$1" "$2"; (( fail++ ))
        fi
    }

    config_load "$tmpf"

    assert_eq "$(config_get database host)"    "db.example.com"  "host without inline comment"
    assert_eq "$(config_get database port)"    "5432"            "port as string"
    assert_eq "$(config_get database name)"    "myapp prod"      "quoted value"
    assert_eq "$(config_get cache host)"       "redis.example.com" "cache host"
    assert_eq "$(config_get database missing default_val)" \
               "default_val"  "missing key returns default"
    assert_eq "$(config_get nosec nokey fallback)" \
               "fallback"     "missing section returns default"

    echo; config_dump
    printf '\nResults: %d passed, %d failed\n' "$pass" "$fail"
    (( fail == 0 ))
fi

Exercise 4

Write a script called csv_join.sh that performs a left join on two CSV files, matching rows by a common key column. Usage: csv_join.sh FILE1 COL1 FILE2 COL2. The output should have all columns from FILE1, followed by all non-key columns from FILE2. Rows in FILE1 with no match in FILE2 should still appear, with empty strings for the FILE2 columns. Handle quoted fields throughout.

Hint: the cleanest approach is a two-pass awk: first pass reads FILE2 into an associative array keyed by the join field; second pass processes FILE1 and looks up each row's key. Pass both FPAT and the column indices via -v. To find column indices, process NR==1 of each file separately using BEGINFILE or by running two awk programs.

Sample Solution

#!/usr/bin/env bash
# csv_join.sh FILE1 COL1 FILE2 COL2
# Left join: all rows from FILE1, matched with FILE2 on key columns
set -euo pipefail

F1="${1:?Usage: csv_join.sh FILE1 COL1 FILE2 COL2}"
C1="${2:?col1 required}"
F2="${3:?file2 required}"
C2="${4:?col2 required}"

for f in "$F1" "$F2"; do
    [[ -f "$f" ]] || { printf 'Not found: %s\n' "$f" >&2; exit 1; }
done

gawk \
    -v f1="$F1" -v c1="$C1" \
    -v f2="$F2" -v c2="$C2" \
    '
BEGIN {
    FPAT = "([^,]*)(\"([^\"]|\"\")*\")"
    FS  = ","
}

function strip(s,    v) {
    v = s
    gsub(/^[[:space:]"]+|[[:space:]"]+$/, "", v)
    gsub(/""/, "\"", v)
    return v
}

function csv_field(s,    v) {
    # Always quote output for safety
    v = s; gsub(/"/, "\"\"", v)
    return "\"" v "\""
}

# ── Pass 1: load FILE2 into memory ─────────────────────────
BEGINFILE {
    if (FILENAME == f2) in_f2 = 1
    else                in_f2 = 0
}

in_f2 && FNR == 1 {
    for (i=1; i<=NF; i++) {
        h = strip($i)
        if (tolower(h) == tolower(c2)) { f2_key_idx = i }
        f2_hdr[i] = h
    }
    f2_ncols = NF
    next
}

in_f2 {
    key = strip($f2_key_idx)
    row = ""
    for (i=1; i<=NF; i++) {
        if (i == f2_key_idx) continue   # exclude key col from f2 output
        row = row (row ? SUBSEP : "") strip($i)
    }
    f2_data[key] = row
    next
}

# ── Pass 2: process FILE1, join with FILE2 data ─────────────
!in_f2 && FNR == 1 {
    for (i=1; i<=NF; i++) {
        h = strip($i)
        if (tolower(h) == tolower(c1)) f1_key_idx = i
        f1_hdr[i] = h
    }
    f1_ncols = NF
    # Print merged header
    out = ""
    for (i=1; i<=f1_ncols; i++) out = out (out?",":"") csv_field(f1_hdr[i])
    for (i=1; i<=f2_ncols; i++) {
        if (i == f2_key_idx) continue
        out = out "," csv_field(f2_hdr[i])
    }
    print out
    next
}

!in_f2 {
    key = strip($f1_key_idx)
    out = ""
    for (i=1; i<=f1_ncols; i++) out = out (out?",":"") csv_field(strip($i))

    if (key in f2_data) {
        n = split(f2_data[key], parts, SUBSEP)
        for (i=1; i<=n; i++) out = out "," csv_field(parts[i])
    } else {
        # No match: fill with empty fields
        empties = f2_ncols - 1   # -1 for the key col we excluded
        for (i=1; i<=empties; i++) out = out ","
    }
    print out
}
' "$F2" "$F1"   # F2 first so it is loaded before F1 is processed

# Example:
# employees.csv: id,name,dept_id
# departments.csv: dept_id,dept_name,budget
# ./csv_join.sh employees.csv dept_id departments.csv dept_id

Structured Data: CSV, TSV, and INI

📊 Intermediate Topic 6 — Structured Data: CSV, TSV, and INI

1 — CSV Fundamentals: Why It's Harder Than It Looks

2 — Robust CSV Parsing with awk and FPAT

Building a reusable csv_parse() library function

3 — Generating CSV Output

4 — TSV: Tab-Separated Values

5 — INI and Config File Parsing

Writing and updating INI values

6 — Key=Value and .env Files

7 — Generating Structured Reports

8 — Quick Reference

✏️ Exercises

2 — Robust CSV Parsing with `awk` and `FPAT`

6 — Key=Value and `.env` Files