The Physics of Data: Kernel Bypass, SoftIRQs & Ring Buffers

Why the Linux Kernel is too slow for 10Gbps. The physics of DMA Ring Buffers, SoftIRQ Latency, and bypassing the OS with AF_XDP.


🎯 What You'll Learn

  • Deconstruct a Network Packet's Journey (NIC -> RAM)
  • Measure Interrupt Coalescing Latency (The 50µs Tax)
  • Tune Ring Buffers for Zero-Loss vs Zero-Latency
  • Implement Kernel Bypass with `AF_XDP`
  • Trace SoftIRQ CPU usage (`si` in top)

Introduction

At 10Gbps, a minimum-size packet arrives every ~67 nanoseconds (a 64-byte frame plus preamble and inter-frame gap is 84 bytes on the wire: 672 bits / 10Gbps ≈ 67ns). The Linux Kernel takes roughly 2,000 nanoseconds just to handle an interrupt. Do the math. The math doesn’t work.

Standard Linux Networking is “Interrupt Driven.” High-Frequency Trading Networking is “Polling Driven.” This lesson explores the Physics of the NIC (Network Interface Card) and how to bypass the Kernel entirely.


The Physics: DMA Ring Buffers

The NIC does not “give” the packet to the CPU. It writes the packet straight into RAM using DMA (Direct Memory Access), into a circular queue called a Ring Buffer.

  • RX Ring: NIC writes packets here.
  • TX Ring: CPU writes packets here.
  • The Doorbell: A register on the NIC that the CPU “rings” to say “I added data”.

Latency Physics: If the Ring is too small -> Packet Loss (microbursts overflow it). If the Ring is too big -> Bufferbloat (you are processing stale data). For HFT, we want Small Rings serviced extremely fast.
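
To make that concrete, here is a conceptual sketch of an RX ring in C. It is illustrative only; the struct layout, the DONE bit, and the doorbell register are invented for clarity and do not match any specific driver.

// Conceptual RX descriptor ring (illustrative only; real drivers differ)
#include <stdint.h>

#define RING_SIZE 512                 /* power of two: wrap with a cheap mask */

struct rx_desc {
    uint64_t dma_addr;                /* buffer the NIC will DMA the packet into */
    uint16_t len;                     /* filled in by the NIC on completion */
    uint16_t status;                  /* bit 0 = DONE, set by the NIC */
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint32_t head;                    /* next slot the NIC writes (NIC-owned) */
    uint32_t tail;                    /* next slot the CPU reads  (CPU-owned) */
};

/* CPU side: consume completed descriptors, then "ring the doorbell", i.e.
 * write a memory-mapped NIC register to say those slots are free again. */
static void rx_poll(struct rx_ring *r, volatile uint32_t *doorbell)
{
    while (r->desc[r->tail].status & 1) {             /* DONE bit set? */
        /* process packet at desc[r->tail].dma_addr, desc[r->tail].len ... */
        r->desc[r->tail].status = 0;
        r->tail = (r->tail + 1) & (RING_SIZE - 1);
    }
    *doorbell = r->tail;              /* tell the NIC how far we consumed */
}

The two ownership domains are the whole story: the NIC advances head as it DMAs packets in, the CPU advances tail as it drains them, and a full ring (head catching tail) means the NIC has nowhere to write, so it drops.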


Interrupt Coalescing: The 50µs Tax

To save CPU, NICs wait for a batch of packets before interrupting the CPU. A typical default is rx-usecs: 50 (wait up to 50µs before raising the interrupt). For HFT, this is unacceptable.

# Check current settings
ethtool -c eth0

# HFT Setting: Interrupt immediately (or, better, disable IRQs entirely and busy-poll)
ethtool -C eth0 rx-usecs 0 rx-frames 1

Consequence: CPU usage spikes to 100% processing Interrupts.
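
The same settings can also be read programmatically. Below is a minimal sketch using the ETHTOOL_GCOALESCE ioctl; eth0 is an assumed interface name, and error handling is kept to the bare minimum.

// Read interrupt-coalescing settings via the ethtool ioctl interface
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr = { 0 };

    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* assumed interface */
    ifr.ifr_data = (char *)&ec;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);      /* any socket works as a handle */
    if (fd < 0 || ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
        perror("ethtool ioctl");
        return 1;
    }
    printf("rx-usecs: %u  rx-frames: %u\n",
           ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);
    close(fd);
    return 0;
}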


SoftIRQs: The Bottom Half

When the Hard IRQ fires, the CPU acknowledges it. The “Real Work” (TCP/IP processing) happens in a SoftIRQ (Software Interrupt). In top, this is the %si column. If %si hits 100% on a core, you are dropping packets.

Tuning: Multi-Queue Hashing (RSS). Distribute the load across cores by hashing the flow tuple (Source/Destination IP and Port) to pick an RX queue.

ethtool -X eth0 equal 4 # Spread across 4 RX Queues
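
To check the spread without top, you can read the per-CPU SoftIRQ counters the kernel exposes. A minimal sketch that prints the CPU header row and the NET_RX row from /proc/softirqs:

// Print per-CPU NET_RX SoftIRQ counts from /proc/softirqs
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/softirqs", "r");
    if (!f) { perror("/proc/softirqs"); return 1; }

    char line[4096];
    int first = 1;
    while (fgets(line, sizeof(line), f)) {
        if (first) { fputs(line, stdout); first = 0; continue; }  /* CPU header */
        if (strstr(line, "NET_RX:")) { fputs(line, stdout); break; }
    }
    fclose(f);
    return 0;
}

If one column dwarfs the rest, all your flows are hashing onto a single queue (common when everything arrives on one multicast group), and RSS alone will not balance it.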

Kernel Bypass: AF_XDP

Why let the Kernel process TCP/IP if we just want the raw UDP multicast packet? AF_XDP (XDP Socket) allows userspace to read directly from the DMA Ring Buffer.

Performance:

  • Standard Socket: ~15µs Latency.
  • AF_XDP: ~2µs Latency.

// Concept Code: AF_XDP (libxdp / libbpf)
// 1. Create an XSK (XDP Socket) bound to one RX queue of eth0
xsk_socket__create(&xsk, "eth0", queue_id, umem, &rx, &tx, &cfg);

// 2. Load a BPF program to redirect matching packets to the socket.
//    This runs INSIDE the NIC driver, before the kernel stack sees the packet.
bpf_program__set_type(prog, BPF_PROG_TYPE_XDP);

Physics: The packet data is never copied into Kernel memory. It stays in the “UMEM” (Userspace Memory) region that the NIC DMAs into and your process reads directly. Zero Copy.
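
Once the socket exists, the hot path is a busy-poll loop over the RX ring. Here is a minimal sketch using libxdp’s ring accessors from <xdp/xsk.h>; the packet handler is hypothetical, and the FILL-ring refill step (returning frames to the NIC) is omitted for brevity.

// Minimal RX busy-poll sketch (libxdp). Assumes xsk_socket__create() above
// succeeded and umem_area points at the mmap'd UMEM region.
#include <xdp/xsk.h>
#include <linux/if_xdp.h>

static void rx_loop(struct xsk_ring_cons *rx, void *umem_area)
{
    for (;;) {                                   /* poll forever: no interrupts, no syscalls */
        __u32 idx;
        __u32 rcvd = xsk_ring_cons__peek(rx, 64, &idx);
        if (!rcvd)
            continue;                            /* spin until the NIC DMAs new descriptors */

        for (__u32 i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx + i);
            void *pkt = xsk_umem__get_data(umem_area, desc->addr);
            /* handle_packet(pkt, desc->len);       hypothetical handler */
            (void)pkt;
        }
        xsk_ring_cons__release(rx, rcvd);
        /* A real app must also recycle these frames to the FILL ring,
         * or the NIC runs out of buffers to DMA into. */
    }
}

Pin this loop to an isolated core (isolcpus) that owns the queue_id you bound to; the whole point is that it never yields to the scheduler.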


Practice Exercises

Exercise 1: The Coalescing Experiment (Beginner)

Task: Run a sockperf ping-pong test. Action: Measure RTT with rx-usecs 50 vs rx-usecs 0. Result: Expect to see a ~40µs improvement.

Exercise 2: Ring Buffer Tuning (Intermediate)

Task: Inspect ring sizes with ethtool -g eth0. Action: Compare ethtool -G eth0 rx 4096 (Max Throughput) vs rx 64 (Low Latency). Risk: A small ring risks drops during bursts; check ethtool -S eth0 | grep drops.

Exercise 3: SoftIRQ Profiling (Advanced)

Task: Run high bandwidth traffic (iperf). Action: Run mpstat -P ALL 1. Watch the %soft column. Goal: Ensure it is balanced across cores (RSS is working).


Knowledge Check

  1. What is the “Doorbell” in NIC terminology?
  2. Why does rx-usecs 0 increase CPU usage?
  3. What does %si mean in top?
  4. How does AF_XDP achieve Zero Copy?
  5. What happens if the RX Ring is full?

Answers
  1. A memory-mapped register on the NIC that signals new data is ready.
  2. More Interrupts. The CPU wakes up for every single packet.
  3. SoftIRQ Time. Time spent processing Protocol Stacks (TCP/IP).
  4. UMEM. The NIC writes directly to userspace-registered memory.
  5. Packet Drop. The NIC discards the packet immediately.

Summary

  • Interrupts: Too slow for 10GbE.
  • Polling: The only way to win.
  • Ring Buffers: The queue between NIC and RAM.
  • AF_XDP: The modern way to bypass the OS.


Pro Version: For production-grade implementation details, see the full research article: Network Optimization for Linux Latency (network-optimization-linux-latency)
