The Physics of Data: Kernel Bypass, SoftIRQs & Ring Buffers
Why the Linux Kernel is too slow for 10Gbps. The physics of DMA Ring Buffers, SoftIRQ Latency, and bypassing the OS with AF_XDP.
🎯 What You'll Learn
- Deconstruct a Network Packet's Journey (NIC -> RAM)
- Measure Interrupt Coalescing Latency (The 50µs Tax)
- Tune Ring Buffers for Zero-Loss vs Zero-Latency
- Implement Kernel Bypass with `AF_XDP`
- Trace SoftIRQ CPU usage (`si` in top)
📚 Prerequisites
Before this lesson, you should understand:
Introduction
At 10Gbps line rate, a minimum-size packet arrives roughly every 67 nanoseconds. The Linux Kernel takes around 2,000 nanoseconds just to handle an interrupt. Do the math: it doesn't work.
Standard Linux Networking is “Interrupt Driven.” High-Frequency Trading Networking is “Polling Driven.” This lesson explores the Physics of the NIC (Network Interface Card) and how to bypass the Kernel entirely.
The Physics: DMA Ring Buffers
The NIC does not “give” the packet to the CPU. It writes it to RAM using DMA (Direct Memory Access). It writes into a circular queue called a Ring Buffer.
- RX Ring: NIC writes packets here.
- TX Ring: CPU writes packets here.
- The Doorbell: A register on the NIC that the CPU “rings” to say “I added data”.
Latency Physics: If the Ring is too small -> Packet Loss (Microbursts). If the Ring is too big -> Bufferbloat (Old data). For HFT, we want Small Rings serviced extremely fast.
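To make the producer/consumer mechanics concrete, here is a minimal C sketch of an RX ring's index arithmetic. It is illustrative only: the descriptor fields, the 512-entry size, and the helper names are simplified assumptions, not any real NIC's descriptor layout.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative descriptor: real layouts are hardware-specific. */
struct rx_desc {
    uint64_t dma_addr;  /* buffer address the NIC DMA-writes the packet into */
    uint16_t len;       /* bytes written by the NIC */
    uint16_t flags;     /* e.g. a "descriptor done" bit */
};

#define RING_SIZE 512                 /* power of two so masking works */
#define RING_MASK (RING_SIZE - 1)

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint32_t producer;  /* advanced by the NIC as packets land */
    uint32_t consumer;  /* advanced by the CPU as it drains packets */
};

/* Occupancy: how close the ring is to overflowing during a microburst. */
static inline uint32_t rx_ring_used(const struct rx_ring *r)
{
    return (r->producer - r->consumer) & RING_MASK;
}

/* Full ring: the next arriving packet has nowhere to go and is dropped. */
static inline bool rx_ring_full(const struct rx_ring *r)
{
    return rx_ring_used(r) == RING_MASK;
}

int main(void)
{
    struct rx_ring ring = { .producer = 70, .consumer = 3 };
    printf("used=%u full=%d\n", rx_ring_used(&ring), rx_ring_full(&ring));
    return 0;
}
The small-vs-large trade-off above is exactly this occupancy number: a burst fills the ring faster than the consumer drains it, and a full ring means immediate drops.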
Interrupt Coalescing: The 50µs Tax
To save CPU, NICs wait to accumulate a batch of packets before interrupting the CPU.
Typical default: rx-usecs: 50 (wait up to 50µs before raising the interrupt).
This is unacceptable.
# Check current settings
ethtool -c eth0
# HFT Setting: Interrupt immediately on every frame (pure polling setups avoid IRQs entirely)
ethtool -C eth0 rx-usecs 0 rx-frames 1
Consequence: CPU usage spikes to 100% processing Interrupts.
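A middle ground between pure interrupt mode and pure polling is per-socket busy polling. Below is a hedged sketch using the kernel's SO_BUSY_POLL socket option; the UDP socket setup, port 5000, and the 50µs budget are illustrative assumptions, and setting the option may require CAP_NET_ADMIN plus driver support for busy polling.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative UDP receiver; in practice this would be the
     * multicast market-data socket. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);               /* example port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    /* Spin on the device queue for up to 50µs on a blocking receive
     * before falling back to sleeping on an interrupt. */
    int busy_usecs = 50;
    if (setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL,
                   &busy_usecs, sizeof(busy_usecs)) < 0)
        perror("setsockopt(SO_BUSY_POLL)");

    char buf[2048];
    ssize_t n = recv(fd, buf, sizeof(buf), 0); /* polls first, then blocks */
    printf("received %zd bytes\n", n);
    close(fd);
    return 0;
}
The trade-off is the same as rx-usecs 0: you buy back wakeup latency by burning the receiving core.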
SoftIRQs: The Bottom Half
When the Hard IRQ fires, the CPU acknowledges it.
The “Real Work” (TCP/IP processing) happens in a SoftIRQ (Software Interrupt).
In top, this is the %si column.
If %si hits 100% on a core, you are dropping packets.
Tuning: Multi-Queue Hashing (RSS). Distribute the load across cores by hashing each flow's tuple (source/destination IP and port) to an RX queue.
ethtool -X eth0 equal 4 # Spread across 4 RX Queues
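To see why this balances SoftIRQ load, here is a toy C sketch of the flow-hash idea. Real RSS hardware uses a Toeplitz hash over the 4-tuple with a per-device key; the mixing constants, addresses, and ports below are made-up stand-ins that only demonstrate the property that matters: packets of one flow always land on one queue (and therefore one core).
#include <stdint.h>
#include <stdio.h>

/* Toy flow hash -- NOT the real Toeplitz hash used by RSS. */
static uint32_t toy_flow_hash(uint32_t src_ip, uint32_t dst_ip,
                              uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = src_ip * 2654435761u;          /* multiplicative mix */
    h ^= dst_ip * 2246822519u;
    h ^= ((uint32_t)src_port << 16) | dst_port;
    h ^= h >> 16;
    return h;
}

int main(void)
{
    const uint32_t n_queues = 4;                /* matches `ethtool -X ... equal 4` */
    uint32_t a1 = toy_flow_hash(0x0A000001, 0x0A000002, 40000, 5000);
    uint32_t a2 = toy_flow_hash(0x0A000001, 0x0A000002, 40000, 5000);
    uint32_t b  = toy_flow_hash(0x0A000003, 0x0A000002, 40001, 5000);
    /* Same flow -> same queue every time; other flows spread out. */
    printf("flow A -> queue %u (again: %u), flow B -> queue %u\n",
           (unsigned)(a1 % n_queues), (unsigned)(a2 % n_queues),
           (unsigned)(b % n_queues));
    return 0;
}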
Kernel Bypass: AF_XDP
Why let the Kernel process TCP/IP if we just want the raw UDP multicast packet? AF_XDP (XDP Socket) allows userspace to read directly from the DMA Ring Buffer.
Performance:
- Standard Socket: ~15µs Latency.
- AF_XDP: ~2µs Latency.
// Concept Code: AF_XDP (libbpf/libxdp xsk helpers)
// 1. Create an XSK (XDP Socket) bound to one RX queue of eth0.
//    rx/tx are the userspace descriptor rings; umem is the packet memory.
xsk_socket__create(&xsk, "eth0", queue_id, umem, &rx, &tx, &cfg);
// 2. Load a BPF program of type XDP to redirect packets into the socket.
//    This program runs INSIDE the NIC driver, before the kernel stack.
bpf_program__set_type(prog, BPF_PROG_TYPE_XDP);
Physics: The data is never copied into kernel memory; it stays in the UMEM (Userspace Memory) region registered with the NIC. Zero Copy (when the driver supports XDP zero-copy mode; otherwise AF_XDP falls back to copy mode).
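For context, here is a hedged sketch of what the userspace receive loop looks like once the socket above exists. It assumes the rx ring and the umem_area mapping came from an earlier setup step, uses the libbpf/libxdp xsk helper calls, and omits fill-ring replenishment and error handling; handle_market_data is a hypothetical application callback.
// Sketch of the userspace RX busy-poll loop (xsk helpers from libxdp,
// formerly <bpf/xsk.h> in libbpf). Assumes `rx` and `umem_area` exist.
for (;;) {
    __u32 idx_rx = 0;
    // How many descriptors has the driver produced on the RX ring?
    unsigned int rcvd = xsk_ring_cons__peek(&rx, 64, &idx_rx);
    if (!rcvd)
        continue;                                   // spin: no interrupt, no syscall

    for (unsigned int i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx, idx_rx + i);
        // The packet already sits in UMEM (userspace memory): zero copy.
        void *pkt = xsk_umem__get_data(umem_area, desc->addr);
        handle_market_data(pkt, desc->len);         // hypothetical handler
    }
    // Hand the descriptors back so the driver can reuse them.
    xsk_ring_cons__release(&rx, rcvd);
}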
Practice Exercises
Exercise 1: The Coalescing Experiment (Beginner)
Task: Run a sockperf ping-pong test between two hosts.
Action: Measure RTT with rx-usecs 50 versus rx-usecs 0.
Result: Expect roughly a 40µs improvement in RTT with coalescing disabled.
Exercise 2: Ring Buffer Tuning (Intermediate)
Task: ethtool -g eth0.
Action: Compare ethtool -G eth0 rx 4096 (max throughput) against ethtool -G eth0 rx 64 (low latency).
Risk: Low buffer size risks drops during bursts. Check ethtool -S eth0 | grep drops.
Exercise 3: SoftIRQ Profiling (Advanced)
Task: Run high bandwidth traffic (iperf).
Action: Run mpstat -P ALL 1. Watch the %soft column.
Goal: Ensure it is balanced across cores (RSS is working).
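For Exercise 3, mpstat gives you the percentage view; the raw counters behind it live in /proc/softirqs as per-CPU NET_RX counts. A minimal C sketch that prints just the relevant lines (sample it twice, a second apart, and diff the numbers to get a rate):
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f) { perror("/proc/softirqs"); return 1; }

    char line[4096];
    while (fgets(line, sizeof(line), f)) {
        /* Keep the CPU header row and the NET_RX row only. */
        if (strstr(line, "CPU0") || strstr(line, "NET_RX"))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
If one CPU column grows much faster than the others, RSS is not spreading your flows and that core's %soft will be the one that saturates.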
Knowledge Check
- What is the “Doorbell” in NIC terminology?
- Why does rx-usecs 0 increase CPU usage?
- What does %si mean in top?
- How does AF_XDP achieve Zero Copy?
- What happens if the RX Ring is full?
Answers
- A memory-mapped register on the NIC that signals new data is ready.
- More Interrupts. The CPU wakes up for every single packet.
- SoftIRQ Time. Time spent processing Protocol Stacks (TCP/IP).
- UMEM. The NIC writes directly to userspace-registered memory.
- Packet Drop. The NIC discards the packet immediately.
Summary
- Interrupts: Too slow for 10GbE.
- Polling: The only way to win.
- Ring Buffers: The queue between NIC and RAM.
- AF_XDP: The modern way to bypass the OS.
Pro Version: For production-grade implementation details, see the full research article: Network Optimization for Linux Latency (network-optimization-linux-latency)