Chapter 5 — Network Performance
Network problems are the easiest bottleneck to miss. When CPU and memory look healthy, many administrators declare the server fine, but a 200 ms DNS lookup on every database query, or a NIC saturated by a background rsync, can make an application feel broken without leaving any obvious trace in the usual tools. This chapter covers how to systematically rule the network in or out as the cause, and how to find and fix the problem when the network is the culprit.
What this chapter covers: Bandwidth vs latency — two separate problems. Diagnosing with ping, mtr, and traceroute. Reading ss (the modern netstat). TCP connection states — ESTABLISHED, TIME_WAIT, CLOSE_WAIT and what each means. Scenario 1: app is slow but CPU/mem are idle. Scenario 2: thousands of TIME_WAIT connections. Scenario 3: NIC saturation — finding the process and throttling it. /proc/net/dev for NIC errors and drops.
Bandwidth vs Latency — Two Different Problems
🚿
Bandwidth (Throughput)
The maximum data transfer rate: how wide the pipe is. Measured in bits per second (Mbps, Gbps), though many tools report bytes per second instead (1 MB/s = 8 Mbps). Symptoms when limited: large file transfers are slow, video streams buffer, bulk API calls take long. Diagnosed with: iftop, nethogs, /proc/net/dev. The NIC itself has a hard limit (1 Gbps, 10 Gbps etc.).
⚡
Latency (Round-Trip Time)
The time for a single packet to travel and return — how fast the pipe responds. Measured in milliseconds. Symptoms when high: interactive apps feel laggy, many small API calls are slow even though bandwidth is fine, database queries with many round-trips are sluggish. Diagnosed with: ping, mtr, curl timing.
📦
Packet Loss
A percentage of packets that never arrive. Even 1% loss causes TCP to retransmit, adding latency and throttling throughput dramatically. Symptoms: connections work but are unreliable and slow, timeouts appear intermittently. Diagnosed with: mtr (shows per-hop loss%), ping with count.
🔍
DNS Resolution Time
Often invisible in monitoring but catastrophic in impact. If your app resolves a hostname on every request and DNS takes 200ms, that's 200ms of latency on every operation. Symptoms: app slow, server-to-server calls slow, intermittent timeouts. Diagnosed with: dig +stats, strace on a running process.
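The claim under Packet Loss, that even 1% loss throttles throughput dramatically, can be made concrete with the Mathis approximation for classic loss-based TCP: rate ≤ (MSS / RTT) × (1.22 / √p). The sketch below is a rough ceiling only. The function name, the 1460-byte MSS, and the Reno-style congestion control it assumes are illustrative assumptions, and modern algorithms such as CUBIC or BBR will behave differently.

```shell
#!/bin/sh
# Rough ceiling on single-flow TCP throughput under random loss
# (Mathis approximation; assumes Reno-style congestion control).
# Usage: tcp_loss_ceiling_mbit MSS_BYTES RTT_MS LOSS_FRACTION
tcp_loss_ceiling_mbit() {
    awk -v mss="$1" -v rtt_ms="$2" -v p="$3" 'BEGIN {
        bits_per_sec = (mss * 8 / (rtt_ms / 1000)) * (1.22 / sqrt(p))
        printf "%.1f\n", bits_per_sec / 1000000
    }'
}

# 12 ms RTT and 1% loss, with an assumed 1460-byte MSS
tcp_loss_ceiling_mbit 1460 12 0.01
```

With these inputs the estimate is about 11.9 Mbit/s for a single connection, whether the NIC is rated at 1 Gbps or 10 Gbps.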
Connectivity Diagnostics — ping, traceroute, mtr
$ ping -c 10 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=12.4 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=11.9 ms
64 bytes from 8.8.8.8: icmp_seq=8 ttl=116 time=245.1 ms ← spike
--- 8.8.8.8 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9013ms
rtt min/avg/max/mdev = 11.9/38.2/245.1/69.4 ms
$ traceroute -n 8.8.8.8
1 192.168.1.1 1.2 ms 1.1 ms 1.0 ms ← your router
2 10.0.0.1 4.8 ms 4.9 ms 4.7 ms ← ISP edge
3 * * * ← hop drops ICMP (normal)
4 72.14.215.100 185.0 ms 182.0 ms 190.0 ms ← slow reply from this hop
5 8.8.8.8 12.3 ms 12.1 ms 12.4 ms ← destination is fast, so hop 4 is only slow to answer probes, not slow to forward
mtr — the best tool for diagnosing path problems
mtr (Matt's Traceroute) combines ping and traceroute into a live view that shows round-trip time and packet loss at every hop simultaneously. It's the single most useful tool for diagnosing whether a network problem is in your server, your network, or somewhere on the internet path.
$ mtr --report --report-cycles 20 -n 8.8.8.8
Start: 2025-06-14T15:22:10+0100
HOST: myserver Loss% Snt Last Avg Best Wrst StDev
1. 192.168.1.1 0.0% 20 1.2 1.1 1.0 1.5 0.1
2. 10.0.0.1 0.0% 20 4.8 4.7 4.6 5.1 0.2
3. ??? 100.0% 20 0.0 0.0 0.0 0.0 0.0
4. 72.14.215.100 0.0% 20 12.0 12.1 11.8 12.6 0.2
5. 8.8.8.8 0.0% 20 12.3 12.2 12.0 12.8 0.2
100% loss at an intermediate hop is not packet loss — many routers de-prioritise or block ICMP TTL-exceeded messages (what traceroute/mtr uses) while still forwarding packets normally. If all subsequent hops are reachable with 0% loss, the "100%" hop is just filtering probes. Only worry if loss appears at your destination or persists across multiple subsequent hops.
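The rule above (judge the path by the destination, not by an intermediate hop that filters probes) can be sketched as a small filter over an mtr report. The function name is hypothetical, and the parser only assumes that each hop line starts with a hop number and contains one field ending in %.

```shell
#!/bin/sh
# Print the loss percentage of the final hop in an `mtr --report` listing.
# Intermediate hops are ignored: each hop line overwrites the previous value.
mtr_destination_loss() {
    awk '/^[ \t]*[0-9]+\./ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9.]+%$/) { loss = $i; sub(/%$/, "", loss) }
    }
    END { if (loss != "") print loss }'
}

# Lines from the sample report: hop 3 filters probes, the destination is clean.
mtr_destination_loss <<'EOF'
HOST: myserver Loss% Snt Last Avg Best Wrst StDev
1. 192.168.1.1 0.0% 20 1.2 1.1 1.0 1.5 0.1
3. ??? 100.0% 20 0.0 0.0 0.0 0.0 0.0
5. 8.8.8.8 0.0% 20 12.3 12.2 12.0 12.8 0.2
EOF
```

Against a live path this would be used as `mtr --report --report-cycles 20 -n 8.8.8.8 | mtr_destination_loss`.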
Testing DNS resolution time
$ dig google.com | grep "Query time"
;; Query time: 2 msec
$ dig google.com | grep "Query time"
;; Query time: 342 msec
$ dig @8.8.8.8 google.com | grep "Query time"
$ curl -o /dev/null -s -w "DNS: %{time_namelookup}s\nConnect: %{time_connect}s\nTTFB: %{time_starttransfer}s\nTotal: %{time_total}s\n" https://example.com
DNS: 0.342s ← DNS lookup taking 342ms — this is the problem
Connect: 0.344s
TTFB: 0.489s
Total: 0.490s
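For repeated checks, the "Query time" line can be turned into a simple pass/fail signal. This is a minimal sketch: the function name is hypothetical and the 100 ms default threshold is an arbitrary assumption, not a recommendation from this chapter.

```shell
#!/bin/sh
# Read dig output on stdin and flag slow lookups.
# Usage: dig example.com | dns_latency_check [THRESHOLD_MS]
dns_latency_check() {
    awk -v limit="${1:-100}" '/^;; Query time:/ {
        status = ($4 + 0 > limit + 0) ? "SLOW" : "ok"
        printf "%s %d ms\n", status, $4
    }'
}

# The two sample results shown above: a fast answer and a slow one.
echo ";; Query time: 2 msec"   | dns_latency_check
echo ";; Query time: 342 msec" | dns_latency_check
```

The first call reports `ok 2 ms` and the second `SLOW 342 ms`.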
ss — The Modern netstat
ss (socket statistics) replaced netstat as the recommended tool for inspecting network connections. It's faster, more informative, and available on all modern Linux systems. The flags are similar to netstat but the output is richer.
$ ss -tulpn
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=891))
tcp LISTEN 0 511 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1234))
tcp LISTEN 0 128 127.0.0.1:5432 0.0.0.0:* users:(("postgres",pid=2341))
$ ss -s
Total: 4821
TCP: 4701 (estab 241, closed 4120, orphaned 0, timewait 4112)
$ ss -tp state established
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 10.0.0.5:44321 10.0.0.10:5432 users:(("python3",pid=8821))
0 0 10.0.0.5:44322 10.0.0.10:5432 users:(("python3",pid=8821))
$ ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
4112 TIME-WAIT
241 ESTAB
8 LISTEN
$ ss -tp dst 10.0.0.10:5432
$ ss -tp sport :80
Recv-Q and Send-Q in ss output: For LISTEN sockets, Recv-Q is the number of connections waiting to be accepted (should be near 0; a large value means the application isn't calling accept() fast enough). For established sockets, Send-Q is data buffered waiting to be sent to the remote end — a large Send-Q means the remote side is reading slowly or the connection is congested.
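The LISTEN-socket check described above can be sketched as a filter over `ss -ltn` output. The function name and the sample data are hypothetical. On a LISTEN socket, Send-Q holds the configured backlog limit, so the sketch prints both values side by side.

```shell
#!/bin/sh
# Read `ss -ltn` output on stdin; report listeners whose accept queue
# (Recv-Q) exceeds a threshold. Send-Q on a LISTEN socket is the backlog limit.
# Usage: ss -ltn | listen_backlog_check [THRESHOLD]
listen_backlog_check() {
    awk -v limit="${1:-0}" 'NR > 1 && $1 == "LISTEN" && $2 + 0 > limit + 0 {
        printf "%s recv-q=%d backlog=%d\n", $4, $2, $3
    }'
}

# Hypothetical sample: port 80 has 37 connections waiting for accept().
listen_backlog_check <<'EOF'
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
LISTEN 37     511    0.0.0.0:80         0.0.0.0:*
EOF
```

A Recv-Q that stays close to the backlog value suggests new connections are about to be dropped or delayed.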
TCP Connection States
LISTEN Normal
Server socket waiting for incoming connections. Should exist for every service port. The Recv-Q shows the backlog queue — connections waiting to be accepted by the application.
ESTABLISHED Normal
Active two-way connection. Both sides can send data. The number of established connections reflects your actual active users or service-to-service connections right now.
TIME_WAIT Watch
Connection has been closed. The kernel holds the socket for 60 seconds (2×MSL) to absorb any late-arriving packets. Normal in small numbers; thousands indicates high connection turnover or missing keep-alive.
CLOSE_WAIT App Bug
Remote end sent FIN (closed its side), but the local application hasn't called close() yet. A large and growing CLOSE_WAIT count is almost always an application bug — sockets not being closed after use.
SYN_SENT / SYN_RECV
TCP handshake in progress. SYN_SENT = local side waiting for remote to respond. SYN_RECV = server received SYN, waiting for ACK. Many SYN_RECV can indicate a SYN flood attack.
FIN_WAIT1 / FIN_WAIT2
Connection teardown in progress — local side initiated the close. Brief transitional states. Many FIN_WAIT2 with no progression to TIME_WAIT can indicate the remote end is not responding to the close sequence.
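The state descriptions above can be turned into a quick triage pass over `ss -tan` output. Note that ss prints hyphenated, abbreviated state names (ESTAB, TIME-WAIT, CLOSE-WAIT, SYN-RECV) rather than the underscore spellings used in the headings. The function name and the default thresholds are illustrative assumptions, not tuning advice.

```shell
#!/bin/sh
# Read `ss -tan` output on stdin and print a hint for each state whose
# count exceeds a threshold. Thresholds are illustrative, not recommendations.
# Usage: ss -tan | tcp_state_hints [CLOSE_WAIT_MAX] [TIME_WAIT_MAX] [SYN_RECV_MAX]
tcp_state_hints() {
    awk -v cw="${1:-100}" -v tw="${2:-10000}" -v sr="${3:-256}" '
        NR > 1 { count[$1]++ }
        END {
            if (count["CLOSE-WAIT"] > cw + 0)
                printf "CLOSE-WAIT %d: application is probably not closing sockets\n", count["CLOSE-WAIT"]
            if (count["TIME-WAIT"] > tw + 0)
                printf "TIME-WAIT %d: high connection turnover\n", count["TIME-WAIT"]
            if (count["SYN-RECV"] > sr + 0)
                printf "SYN-RECV %d: many half-open handshakes\n", count["SYN-RECV"]
        }'
}

# Hypothetical sample with a low CLOSE-WAIT threshold of 2.
tcp_state_hints 2 <<'EOF'
State      Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB      0      0      10.0.0.5:44321     10.0.0.10:5432
CLOSE-WAIT 1      0      10.0.0.5:44400     10.0.0.20:443
CLOSE-WAIT 1      0      10.0.0.5:44401     10.0.0.20:443
CLOSE-WAIT 1      0      10.0.0.5:44402     10.0.0.20:443
EOF
```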
iftop and nethogs — Bandwidth by Connection and by Process
$ nethogs -d 1 eth0
NetHogs version 0.8.5
PID USER PROGRAM DEV SENT RECEIVED
8821 backup /usr/bin/rsync eth0 94.2 0.1 MB/s
891 www /usr/sbin/nginx eth0 8.4 2.1 MB/s
2341 mysql /usr/sbin/mysqld eth0 0.8 0.3 MB/s
/proc/net/dev — NIC Errors and Drops
NIC-level errors are distinct from application-level bandwidth saturation. Hardware errors (CRC failures, framing errors) indicate a physical problem — bad cable, faulty NIC, misconfigured duplex. Drops indicate the kernel couldn't process packets fast enough.
$ cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eth0: 82341M 61234K 0 0 0 0 0 4821K 52341M 48923K 0 0 0 0 0 0
eth1: 12341M 18234K 142 891 0 12 0 0 8234M 12821K 0 0 0 0 0 0
$ ip -s link show eth1
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP
RX: bytes packets errors dropped missed mcast
12341M 18234K 142 891 0 0
TX: bytes packets errors dropped carrier collisions
8234M 12821K 0 0 0 0
| Field | Meaning | Non-zero means… |
|---|---|---|
| RX errors | Packets received with hardware errors (CRC, framing, length) | Physical problem: bad cable, NIC fault, duplex mismatch. Replace cable first. |
| RX dropped | Received packets discarded by the kernel before processing | Often ring buffer overflow: the NIC received faster than the kernel could process. Check the current and maximum sizes with ethtool -g eth0, then increase within that limit, e.g. ethtool -G eth0 rx 4096 |
| RX missed | Packets missed by the NIC hardware before they reached the ring buffer | NIC is hardware-saturated. Interrupt coalescing or RSS (Receive Side Scaling) may help. |
| TX carrier | Lost carrier signal during transmit — link went down mid-send | Physical link instability — cable, switch port, or NIC issue |
| TX collisions | Ethernet collisions (half-duplex only) | Should be zero on modern full-duplex links. Non-zero = duplex mismatch with switch. |
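To scan every interface at once, the counters in /proc/net/dev can be filtered for non-zero error and drop fields. This is a minimal sketch with a hypothetical function name. It assumes the standard 16-column layout shown above, and the sample uses raw integers because the real file never prints K or M suffixes.

```shell
#!/bin/sh
# Read /proc/net/dev-format text on stdin; list interfaces with any
# non-zero error or drop counter.
# Usage: nic_error_report < /proc/net/dev
nic_error_report() {
    awk 'NR > 2 {
        sub(/:/, " ")   # split "eth0:" (or "eth0:123") into name and counters
        if ($4 + $5 + $12 + $13 > 0)
            printf "%s rx_errs=%d rx_drop=%d tx_errs=%d tx_drop=%d\n",
                   $1, $4, $5, $12, $13
    }'
}

# Sample modelled on the listing above, with raw counter values.
nic_error_report <<'EOF'
Inter-|   Receive                                    |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0: 82341000 61234 0 0 0 0 0 4821 52341000 48923 0 0 0 0 0 0
  eth1: 12341000 18234 142 891 0 12 0 0 8234000 12821 0 0 0 0 0 0
EOF
```

Only eth1 is printed, with rx_errs=142 and rx_drop=891. Because the counters are cumulative since boot, running the check twice a few minutes apart shows whether they are still climbing.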
Scenario 1 — App Is Slow but CPU and Memory Are Idle
1. Confirm with vmstat that it's not disguised I/O wait. Network blocking shows as idle CPU (id, not wa), making it look like the system is doing nothing when it's actually waiting on network responses.
$ vmstat 1 5
r b swpd free si so bi bo wa us sy id
0 0 0 8.2G 0 0 0 0 0 5 2 93
2. Test DNS resolution time. This is the most commonly missed culprit.
$ time dig db-server.internal A
;; Query time: 380 msec
;; SERVER: 10.0.0.53#53
$ grep db-server /etc/hosts
$ echo "10.0.0.10 db-server.internal" | sudo tee -a /etc/hosts
3. Use curl timing to break down an HTTP request into phases.
$ curl -o /dev/null -s -w \
"DNS lookup: %{time_namelookup}s\nTCP connect: %{time_connect}s\nSSL handshake: %{time_appconnect}s\nTime to first byte: %{time_starttransfer}s\nTotal: %{time_total}s\n" \
https://api.example.com/health
DNS lookup: 0.002s ← fast, cached
TCP connect: 0.015s ← 13ms to connect, reasonable
SSL handshake: 1.240s ← 1.2 seconds for TLS — this is the bottleneck
Time to first byte: 1.250s
Total: 1.252s
4. Check for CLOSE_WAIT accumulation, a sign the app isn't closing connections.
$ ss -tan | awk '{print $1}' | sort | uniq -c
820 CLOSE-WAIT
12 ESTAB
4 LISTEN
$ ss -tanp state close-wait | awk 'NR>1 {print $NF}' | sed 's/,fd=[0-9]*//' | sort | uniq -c | sort -rn | head -5
820 users:(("node",pid=8821))
5. Check NIC errors. Hardware problems can cause intermittent slowness.
$ ip -s link show eth0 | grep -A2 "RX:"
RX: bytes packets errors dropped
82341M 61234K 1842 234
$ ethtool eth0 | grep -E "Speed|Duplex"
Speed: 100Mb/s ← Should be 1000Mb/s — NIC negotiated at wrong speed
Duplex: Half ← Half duplex — bad, should be Full. Explains collisions.
The curl timing breakdown is one of the most productive 30-second investments when diagnosing slow HTTP applications. DNS → Connect → TLS → TTFB each map to a specific layer you can investigate independently.
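One detail worth keeping in mind when reading the breakdown: curl's timers are cumulative from the start of the request, so the time spent in each phase is the difference between adjacent values. The helper below is a sketch with a hypothetical function name. For plain HTTP, time_appconnect is 0 and the tls figure is not meaningful.

```shell
#!/bin/sh
# Convert curl's cumulative timers into per-phase durations in milliseconds.
# Arguments, in order: time_namelookup time_connect time_appconnect
#                      time_starttransfer time_total
curl_phase_durations() {
    awk -v dns="$1" -v tcp="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
        printf "dns=%.0f tcp=%.0f tls=%.0f wait=%.0f transfer=%.0f\n",
            dns * 1000, (tcp - dns) * 1000, (tls - tcp) * 1000,
            (ttfb - tls) * 1000, (total - ttfb) * 1000
    }'
}

# Values from the sample output in step 3.
curl_phase_durations 0.002 0.015 1.240 1.250 1.252
```

For the step 3 sample this prints `dns=2 tcp=13 tls=1225 wait=10 transfer=2`, which makes it plain that the TLS handshake accounts for almost all of the 1.25 seconds.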
Scenario 2 — Thousands of TIME_WAIT Connections
1. Understand why TIME_WAIT exists before deciding to fight it. TIME_WAIT is TCP's safety mechanism: after a connection closes, the kernel holds the address and port combination for 60 seconds to absorb any late-arriving packets that were delayed in transit. This prevents a new connection on the same ports from receiving old data. Removing it entirely is dangerous.
2. Determine whether it's actually causing a problem. TIME_WAIT is only a problem if you're running out of ephemeral ports, the pool of local ports used for outgoing connections.
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768 60999
$ dmesg -T | grep -i "port\|connect\|socket"
$ netstat -s | grep -i "failed\|refused\|exhausted"
3. If TIME_WAIT is genuinely causing port exhaustion, tune carefully.
$ sudo sysctl -w net.ipv4.tcp_tw_reuse=1
$ sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
keepalive_timeout 65; # in nginx.conf http {} block
$ echo "net.ipv4.tcp_tw_reuse=1" | sudo tee -a /etc/sysctl.d/99-network.conf
$ sudo sysctl -p /etc/sysctl.d/99-network.conf
4. Do not use tcp_tw_recycle. This parameter was removed in Linux kernel 4.12 because it broke connections from clients behind NAT (a common scenario with load balancers and mobile networks). If you see advice recommending it for TIME_WAIT, the advice is outdated and potentially harmful.
8,000 TIME_WAIT sockets with a 28,000-port range is perfectly healthy — you have headroom. TIME_WAIT only warrants action when either the count approaches your port range limit, or you're seeing actual "cannot assign requested address" errors in application logs or dmesg.
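The headroom arithmetic above can be sketched as a small calculation. The function name is hypothetical. The ratio is a deliberately pessimistic indicator, since the kernel only needs a unique local port per destination address and port, not across all connections.

```shell
#!/bin/sh
# Compare a TIME_WAIT count against the ephemeral port range.
# Usage: port_headroom LOW HIGH TIME_WAIT_COUNT
port_headroom() {
    awk -v low="$1" -v high="$2" -v tw="$3" 'BEGIN {
        ports = high - low + 1
        printf "%d ports, %d in TIME_WAIT (%.1f%%)\n", ports, tw, tw * 100 / ports
    }'
}

# Default range from step 2, with the 8,000-socket example.
port_headroom 32768 60999 8000
```

This prints `28232 ports, 8000 in TIME_WAIT (28.3%)`. On a live system the inputs could come from /proc/sys/net/ipv4/ip_local_port_range and from `ss -tan state time-wait | tail -n +2 | wc -l`.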
Scenario 3 — One Process Is Saturating the NIC
1. Confirm NIC saturation and identify the interface.
$ watch -n 1 'awk "/eth0:/ {print \"RX: \" \$2/1048576 \" MB total | TX: \" \$10/1048576 \" MB total\"}" /proc/net/dev'
$ ip -s -s link show eth0 | grep -A4 "TX:"
$ iftop -i eth0 -n -B -P
12.5MB 25.0MB 37.5MB 50MB 62.4MB
10.0.0.5 => 192.168.50.20:873 89.4MB 91.2MB 87.8MB
<= 0.12MB 0.14MB 0.13MB
2. Confirm the responsible process with nethogs.
$ nethogs -d 1 eth0
PID USER PROGRAM SENT RECEIVED
9821 backup /usr/bin/rsync 94.2 MB/s 0.1 MB/s
891 www nginx 8.4 MB/s 2.1 MB/s
3. Throttle rsync's bandwidth directly. rsync has a built-in bandwidth limit flag (--bwlimit, in KB/s). If you control how it's invoked, this is the cleanest solution.
$ kill 9821
$ rsync -av --bwlimit=20480 /data /mnt/backup
4. For processes without built-in throttling, use trickle (per process) or tc (which shapes all traffic leaving the interface).
$ trickle -u 20480 rsync -av /data /mnt/backup
$ sudo tc qdisc add dev eth0 root tbf rate 100mbit burst 128kb latency 400ms
$ tc qdisc show dev eth0
$ sudo tc qdisc del dev eth0 root
5. Long-term fix: schedule bandwidth-heavy jobs during off-peak hours.
0 2 * * * /usr/bin/rsync -av --bwlimit=51200 /data /mnt/backup >> /var/log/backup.log 2>&1
If the high-bandwidth process is a production service rather than a backup job, investigate whether it's doing unnecessary data transfer (missing caching, missing compression, pulling full datasets when it only needs diffs) before throttling it — throttling a production service degrades its performance for users.
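The counters in /proc/net/dev are running totals, so judging saturation needs a rate: two readings and the time between them. The helper below is a sketch with a hypothetical function name. It reports decimal megabits per second so the result can be compared directly with the NIC's rated speed.

```shell
#!/bin/sh
# Convert two byte-counter readings into an average rate in Mbit/s.
# Usage: counter_rate_mbit BYTES_BEFORE BYTES_AFTER SECONDS
counter_rate_mbit() {
    awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
        printf "%.1f\n", (b - a) * 8 / s / 1000000
    }'
}

# 942,000,000 bytes sent in 10 s is 94.2 MB/s, the rsync figure above.
counter_rate_mbit 0 942000000 10
```

The result, 753.6 Mbit/s, is roughly three quarters of a 1 Gbps link from a single backup job. For live readings, take the TX bytes column (field 10 of the interface's line in /proc/net/dev) before and after a sleep of known length.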
Quick Reference — Chapter 5 Commands
| Command | Purpose | Key flags / notes |
|---|---|---|
| ping -c 10 host | Basic connectivity test and round-trip time. Watch for packet loss and jitter (mdev). | High mdev = jitter = congestion somewhere in path |
| mtr --report -n host | Combined traceroute + ping — shows loss% at every hop over multiple cycles | --report-cycles 20 for 20 probes · 100% loss at intermediate hop is usually normal |
| dig hostname | DNS lookup — check "Query time:" for DNS latency | dig @8.8.8.8 host to test a specific resolver |
| curl -w "..." url | Break HTTP request into phases: DNS / Connect / TLS / TTFB / Total | Use the timing format string from Scenario 1 step 3 |
| ss -tulpn | Listening ports with owning process — the first check for "what's running" | -t TCP · -u UDP · -l listening · -p process · -n no DNS |
| ss -s | Connection summary: totals for established, closed, orphaned, and timewait sockets | Many timewait = high connection turnover. ss -s does not break out CLOSE_WAIT; use the per-state count in the next row for that. |
| ss -tan \| awk '{print $1}' \| sort \| uniq -c | Count connections per state | Add state close-wait after the flags to filter to one state (ss expects hyphenated state names) |
| ip -s link show eth0 | NIC statistics including RX/TX errors, drops, missed packets | Non-zero errors = physical problem. Non-zero drops = ring buffer overflow. |
| ethtool eth0 | NIC link status — speed and duplex negotiation | Speed should be 1000Mb/s+, Duplex should be Full on modern links |
| iftop -i eth0 -n | Live bandwidth by source→destination connection pair | -B bytes · -P show ports · -n no DNS |
| nethogs eth0 | Live bandwidth by process — the iotop equivalent for network | -d 1 update every 1s · requires root |
| rsync --bwlimit=20480 | Limit rsync bandwidth to 20 MB/s (20480 KB/s) | Built-in to rsync — cleanest solution when rsync is the culprit |
| trickle -u 20480 cmd | Per-process bandwidth cap for any command — no root needed | -u upload limit · -d download limit (in KB/s) |
| sysctl net.ipv4.tcp_tw_reuse=1 | Allow reuse of TIME_WAIT sockets for new outgoing connections | Persist via /etc/sysctl.d/ · safe, unlike the removed tcp_tw_recycle |