The Physics of Data: Kernel Bypass, SoftIRQs & Ring Buffers
Why the Linux Kernel is too slow for 10Gbps. The physics of DMA Ring Buffers, SoftIRQ Latency, and bypassing the OS with AF_XDP.
🎯 What You'll Learn
- Deconstruct a Network Packet's Journey (NIC -> RAM)
- Measure Interrupt Coalescing Latency (The 50µs Tax)
- Tune Ring Buffers for Zero-Loss vs Zero-Latency
- Implement Kernel Bypass with `AF_XDP`
- Trace SoftIRQ CPU usage (`si` in top)
📚 Prerequisites
Before this lesson, you should understand:
Introduction
At 10Gbps line rate, a minimum-size packet arrives roughly every 67 nanoseconds. The Linux Kernel takes around 2,000 nanoseconds just to handle an interrupt. Do the math: it doesn't work.
Standard Linux Networking is “Interrupt Driven.” High-Frequency Trading Networking is “Polling Driven.” This lesson explores the Physics of the NIC (Network Interface Card) and how to bypass the Kernel entirely.
The Physics: DMA Ring Buffers
The NIC does not “give” the packet to the CPU. It writes it to RAM using DMA (Direct Memory Access). It writes into a circular queue called a Ring Buffer.
- RX Ring: NIC writes packets here.
- TX Ring: CPU writes packets here.
- The Doorbell: A register on the NIC that the CPU “rings” to say “I added data”.
Latency Physics: If the Ring is too small -> Packet Loss (Microbursts). If the Ring is too big -> Bufferbloat (Old data). For HFT, we want Small Rings serviced extremely fast.
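To make the producer/consumer mechanics concrete, here is a minimal C sketch of an RX ring's index arithmetic. It is illustrative only: the descriptor fields, the 512-entry size, and the helper names are simplified assumptions, not any real NIC's descriptor layout.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative descriptor: real layouts are hardware-specific. */
struct rx_desc {
    uint64_t dma_addr;  /* buffer address the NIC DMA-writes the packet into */
    uint16_t len;       /* bytes written by the NIC */
    uint16_t flags;     /* e.g. a "descriptor done" bit */
};

#define RING_SIZE 512                 /* power of two so masking works */
#define RING_MASK (RING_SIZE - 1)

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint32_t producer;  /* advanced by the NIC as packets land */
    uint32_t consumer;  /* advanced by the CPU as it drains packets */
};

/* Occupancy: how close the ring is to overflowing during a microburst. */
static inline uint32_t rx_ring_used(const struct rx_ring *r)
{
    return (r->producer - r->consumer) & RING_MASK;
}

/* Full ring: the next arriving packet has nowhere to go and is dropped. */
static inline bool rx_ring_full(const struct rx_ring *r)
{
    return rx_ring_used(r) == RING_MASK;
}

int main(void)
{
    struct rx_ring ring = { .producer = 70, .consumer = 3 };
    printf("used=%u full=%d\n", rx_ring_used(&ring), rx_ring_full(&ring));
    return 0;
}
The small-vs-large trade-off above is exactly this occupancy number: a burst fills the ring faster than the consumer drains it, and a full ring means immediate drops.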
Interrupt Coalescing: The 50µs Tax
To save CPU, NICs wait to accumulate a batch of packets before interrupting the CPU.
Typical default: rx-usecs: 50 (wait up to 50µs before raising the interrupt).
This is unacceptable.
# Check current settings
ethtool -c eth0
# HFT Setting: Interrupt immediately on every frame (pure polling setups avoid IRQs entirely)
ethtool -C eth0 rx-usecs 0 rx-frames 1
Consequence: CPU usage spikes to 100% processing Interrupts.
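A middle ground between pure interrupt mode and pure polling is per-socket busy polling. Below is a hedged sketch using the kernel's SO_BUSY_POLL socket option; the UDP socket setup, port 5000, and the 50µs budget are illustrative assumptions, and setting the option may require CAP_NET_ADMIN plus driver support for busy polling.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative UDP receiver; in practice this would be the
     * multicast market-data socket. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);               /* example port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    /* Spin on the device queue for up to 50µs on a blocking receive
     * before falling back to sleeping on an interrupt. */
    int busy_usecs = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_usecs, sizeof(busy_usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof(buf), 0); /* polls first, then blocks */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}
The trade-off is the same as rx-usecs 0: you buy back wakeup latency by burning the receiving core.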
SoftIRQs: The Bottom Half
When the Hard IRQ fires, the CPU acknowledges it.
The “Real Work” (TCP/IP processing) happens in a SoftIRQ (Software Interrupt).
In top, this is the %si column.
If %si hits 100% on a core, you are dropping packets.
Tuning: Multi-Queue Hashing (RSS). Distribute the load across cores by hashing each flow's tuple (source/destination IP and port) to an RX queue.
ethtool -X eth0 equal 4 # Spread across 4 RX Queues
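To see why this balances SoftIRQ load, here is a toy C sketch of the flow-hash idea. Real RSS hardware uses a Toeplitz hash over the 4-tuple with a per-device key; the mixing constants, addresses, and ports below are made-up stand-ins that only demonstrate the property that matters: packets of one flow always land on one queue (and therefore one core).
#include <stdint.h>
#include <stdio.h>

/* Toy flow hash -- NOT the real Toeplitz hash used by RSS. */
static uint32_t toy_flow_hash(uint32_t src_ip, uint32_t dst_ip,
                              uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip * 2654435761u;          /* multiplicative mix */
    h ^= dst_ip * 2246822519u;
    h ^= ((uint32_t)src_port << 16) | dst_port;
    h ^= h >> 16;
    return h;
}

int main(void)
{
    const uint32_t n_queues = 4;                /* matches `ethtool -X ... equal 4` */
    uint32_t a1 = toy_flow_hash(0x0A000001, 0x0A000002, 40000, 5000);
    uint32_t a2 = toy_flow_hash(0x0A000001, 0x0A000002, 40000, 5000);
    uint32_t b  = toy_flow_hash(0x0A000003, 0x0A000002, 40001, 5000);
    /* Same flow -> same queue every time; other flows spread out. */
    printf("flow A -> queue %u (again: %u), flow B -> queue %u\n",
           (unsigned)(a1 % n_queues), (unsigned)(a2 % n_queues),
           (unsigned)(b % n_queues));
    return 0;
}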
Kernel Bypass: AF_XDP
Why let the Kernel process TCP/IP if we just want the raw UDP multicast packet? AF_XDP (XDP Socket) allows userspace to read directly from the DMA Ring Buffer.
Performance:
- Standard Socket: ~15µs Latency.
- AF_XDP: ~2µs Latency.
// Concept Code: AF_XDP (libbpf/libxdp xsk helpers)
// 1. Create an XSK (XDP Socket) bound to one RX queue of eth0.
//    rx/tx are the userspace descriptor rings; umem is the packet memory.
xsk_socket__create(&xsk, "eth0", queue_id, umem, &rx, &tx, &cfg);
// 2. Load a BPF program of type XDP to redirect packets into the socket.
//    This program runs INSIDE the NIC driver, before the kernel stack.
bpf_program__set_type(prog, BPF_PROG_TYPE_XDP);
Physics: The data is never copied into kernel memory; it stays in the UMEM (Userspace Memory) region registered with the NIC. Zero Copy (when the driver supports XDP zero-copy mode; otherwise AF_XDP falls back to copy mode).
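For context, here is a hedged sketch of what the userspace receive loop looks like once the socket above exists. It assumes the rx ring and the umem_area mapping came from an earlier setup step, uses the libbpf/libxdp xsk helper calls, and omits fill-ring replenishment and error handling; handle_market_data is a hypothetical application callback.
// Sketch of the userspace RX busy-poll loop (xsk helpers from libxdp,
// formerly <bpf/xsk.h> in libbpf). Assumes `rx` and `umem_area` exist.
for (;;) {
    __u32 idx_rx = 0;
    // How many descriptors has the driver produced on the RX ring?
    unsigned int rcvd = xsk_ring_cons__peek(&rx, 64, &idx_rx);
    if (!rcvd)
        continue;                                   // spin: no interrupt, no syscall

    for (unsigned int i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx, idx_rx + i);
        // The packet already sits in UMEM (userspace memory): zero copy.
        void *pkt = xsk_umem__get_data(umem_area, desc->addr);
        handle_market_data(pkt, desc->len);         // hypothetical handler
    }
    // Hand the descriptors back so the driver can reuse them.
    xsk_ring_cons__release(&rx, rcvd);
}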
Practice Exercises
Exercise 1: The Coalescing Experiment (Beginner)
Task: Run a sockperf ping-pong test between two hosts.
Action: Measure RTT with rx-usecs 50 versus rx-usecs 0.
Result: Expect roughly a 40µs improvement in RTT with coalescing disabled.
Exercise 2: Ring Buffer Tuning (Intermediate)
Task: ethtool -g eth0.
Action: Compare ethtool -G eth0 rx 4096 (max throughput) against ethtool -G eth0 rx 64 (low latency).
Risk: Low buffer size risks drops during bursts. Check ethtool -S eth0 | grep drops.
Exercise 3: SoftIRQ Profiling (Advanced)
Task: Run high bandwidth traffic (iperf).
Action: Run mpstat -P ALL 1. Watch the %soft column.
Goal: Ensure it is balanced across cores (RSS is working).
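For Exercise 3, mpstat gives you the percentage view; the raw counters behind it live in /proc/softirqs as per-CPU NET_RX counts. A minimal C sketch that prints just the relevant lines (sample it twice, a second apart, and diff the numbers to get a rate):
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f) { perror("/proc/softirqs"); return 1; }

    char line[4096];
    while (fgets(line, sizeof(line), f)) {
        /* Keep the CPU header row and the NET_RX row only. */
        if (strstr(line, "CPU0") || strstr(line, "NET_RX"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
If one CPU column grows much faster than the others, RSS is not spreading your flows and that core's %soft will be the one that saturates.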
Knowledge Check
- What is the “Doorbell” in NIC terminology?
- Why does rx-usecs 0 increase CPU usage?
- What does %si mean in top?
- How does AF_XDP achieve Zero Copy?
- What happens if the RX Ring is full?
Answers
- A memory-mapped register on the NIC that signals new data is ready.
- More Interrupts. The CPU wakes up for every single packet.
- SoftIRQ Time. Time spent processing Protocol Stacks (TCP/IP).
- UMEM. The NIC writes directly to userspace-registered memory.
- Packet Drop. The NIC discards the packet immediately.
Summary
- Interrupts: Too slow for 10GbE.
- Polling: The only way to win.
- Ring Buffers: The queue between NIC and RAM.
- AF_XDP: The modern way to bypass the OS.
Pro Version: For production-grade implementation details, see the full research article: Network Optimization for Linux Latency (network-optimization-linux-latency)