Infrastructure
Network Optimization: Kernel Bypass and the Art of Busy Polling
How the Linux network stack adds 50µs of latency, and the interrupt coalescing, busy polling, and AF_XDP techniques that eliminate it.
On a standard Linux server with a Mellanox ConnectX-6, ping shows 120µs RTT. After disabling interrupt coalescing and enabling busy polling, we hit 18µs.
The Linux network stack is optimized for throughput, not latency. Every packet traverses several layers of software (the NIC driver, NAPI, the IP and transport stacks, and the socket layer) before reaching your application. Each layer adds jitter.
This post documents the techniques to shave 100µs off your network RTT without resorting to full kernel bypass.
1. The Physics of Network Latency
When a packet arrives, the NIC holds it in a hardware buffer. It waits for either:
- Interrupt Coalescing Timeout (e.g., 100µs): The NIC batches interrupts to reduce CPU load.
- Interrupt Coalescing Threshold (e.g., 64 packets): The NIC interrupts when the buffer is full.
Whichever condition fires first, your packet has already sat in the buffer. This is great for throughput. It is terrible for latency.
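You can read these parameters programmatically as well as with ethtool -c. Below is a minimal C sketch using the ETHTOOL_GCOALESCE ioctl to print the same values; "eth0" is a placeholder for your interface name.

// coalesce_check.c: read the NIC's RX coalescing parameters (what `ethtool -c` prints).
// Minimal sketch; "eth0" is a placeholder for your interface.
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   // any socket works as an ioctl handle
    if (fd < 0) { perror("socket"); return 1; }

    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr = { 0 };
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
    ifr.ifr_data = (char *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { perror("SIOCETHTOOL"); return 1; }

    printf("adaptive-rx: %u\n", ec.use_adaptive_rx_coalesce);
    printf("rx-usecs:    %u   (coalescing timeout)\n", ec.rx_coalesce_usecs);
    printf("rx-frames:   %u   (coalescing threshold)\n", ec.rx_max_coalesced_frames);
    close(fd);
    return 0;
}

Compile with gcc and compare the output against ethtool -c on the same interface.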
2. The Decision Matrix
| Approach | RTT Impact | CPU Cost | Verdict |
|---|---|---|---|
| A. Default (Coalescing On) | Baseline (~120µs) | Low | Optimized for throughput. |
| B. Coalescing Off | -30µs | Medium | Better, but still has syscall overhead. |
| C. Coalescing Off + Busy Polling | -100µs (~18µs) | High (1 core) | Selected. Near-kernel-bypass performance. |
3. The Kill: Busy Polling Configuration
Step 1: Disable Interrupt Coalescing
# Disable adaptive coalescing
sudo ethtool -C eth0 adaptive-rx off adaptive-tx off
# Set coalescing to minimum
sudo ethtool -C eth0 rx-usecs 0 rx-frames 1
Step 2: Enable Busy Polling
Busy polling makes the recvmsg() syscall spin-wait on the NIC’s RX queue instead of sleeping for an interrupt.
# /etc/sysctl.conf
net.core.busy_poll = 50   # busy-poll for 50µs in poll()/select() before sleeping
net.core.busy_read = 50   # busy-poll for 50µs on blocking reads (default for SO_BUSY_POLL)

# Apply the settings
sudo sysctl -p
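If you want the application (or a deploy-time health check) to verify these values at runtime, they are exposed in procfs. A small sketch that reads them back:

// check_busy_poll.c: read the busy-polling sysctls back from procfs.
// Sketch; prints -1 if a value cannot be read.
#include <stdio.h>

static long read_sysctl(const char *path) {
    FILE *f = fopen(path, "r");
    long val = -1;
    if (f) {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int main(void) {
    printf("net.core.busy_poll = %ld\n", read_sysctl("/proc/sys/net/core/busy_poll"));
    printf("net.core.busy_read = %ld\n", read_sysctl("/proc/sys/net/core/busy_read"));
    return 0;
}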
Step 3: Set Socket Option
Your application must opt in on each socket:

int timeout = 50;  // microseconds to spin before falling back to the interrupt path
if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &timeout, sizeof(timeout)) < 0)
    perror("setsockopt(SO_BUSY_POLL)");
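Putting Steps 2 and 3 together in context, here is a minimal sketch of a UDP receiver that opts in to busy polling; port 9000 and the 50µs budget are arbitrary choices for illustration.

// busy_poll_rx.c: minimal UDP receiver that opts in to busy polling.
// Sketch only; port 9000 and the 50µs budget are arbitrary choices.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); return 1; }

    int busy_usecs = 50;   // spin budget before falling back to the interrupt path
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy_usecs, sizeof(busy_usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");   // may require CAP_NET_ADMIN on some kernels

    char buf[2048];
    for (;;) {
        // With busy polling enabled, this blocking recv spins on the NIC's RX queue
        // for up to busy_usecs before sleeping for an interrupt.
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0) { perror("recv"); break; }
        // ... handle n bytes ...
    }
    close(fd);
    return 0;
}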
Verification:
# Before: ~120µs
ping -c 100 <target>
# After: ~18µs
# (Your mileage may vary based on NIC and driver)
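ping is a convenient first check, but it measures the ICMP echo path rather than your application's own sockets. A rough sketch of measuring RTT at the application level over UDP; the echo peer at 192.0.2.1:9000 is a placeholder for a service you would run yourself:

// udp_rtt.c: measure application-level round-trip time over UDP.
// Sketch; the echo peer address and port are placeholders.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int busy = 50;
    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &busy, sizeof(busy));  // opt in, as in Step 3

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port = htons(9000);                       // placeholder port
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);   // placeholder echo-server address
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) { perror("connect"); return 1; }

    char msg[32] = "rtt-probe", buf[32];
    for (int i = 0; i < 100; i++) {
        double t0 = now_us();
        send(fd, msg, sizeof(msg), 0);
        if (recv(fd, buf, sizeof(buf), 0) < 0) { perror("recv"); break; }
        printf("rtt: %.1f us\n", now_us() - t0);
    }
    close(fd);
    return 0;
}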
4. The Tool: Auditing Network State
pip install latency-audit && latency-audit --check network
This checks interrupt coalescing settings (ethtool -c) and sysctl values for busy polling.
5. Systems Thinking: The Trade-offs
- CPU Burn: Busy polling consumes 100% of a CPU core while waiting for packets. This is acceptable for HFT; it may not be for general-purpose servers. If you do pay this cost, pin the spinning thread to a dedicated core (see the sketch after this list).
- Driver Support: Not all NIC drivers support busy polling. Mellanox (mlx5) and Intel (ixgbe, i40e) do. AWS ENA has limited support.
- Kernel Version: Busy polling performance improved significantly in Linux 4.4+. Use a modern kernel.
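If you accept the CPU cost, make the burn deliberate: pin the busy-polling thread to a dedicated core so the spinning never competes with other work. A sketch using sched_setaffinity; core 3 is an arbitrary choice that should match your isolation setup.

// pin_core.c: pin the calling (busy-polling) thread to one dedicated core.
// Sketch; core 3 is arbitrary and should match your isolated-core layout.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means the calling thread
    if (sched_setaffinity(0, sizeof(set), &set) < 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main(void) {
    if (pin_to_core(3) == 0)
        printf("pinned to core 3; run the busy-polling recv loop from here\n");
    return 0;
}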
6. The Philosophy
The network stack is a trade-off between latency and efficiency. The kernel defaults to efficiency because most users care about throughput, not P99.
By disabling coalescing and enabling busy polling, you are telling the kernel: “I will pay the CPU cost. Give me my packets immediately.”
For HFT, a 100µs improvement is worth burning a core. For most applications, it is not. Know your SLA.
Up Next in Linux Infrastructure Deep Dives
Trading Infrastructure: First Principles That Scale
Architecture decisions that determine your latency ceiling. AWS, Kubernetes, monitoring, and security patterns for crypto trading systems.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| WebSocket infrastructure & orderbook design | Market Data Infrastructure: WebSocket Patterns That Scale |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |