The Physics of Latency: Jitter & Coordinated Omission
Why your benchmarks are lying. Open Loop vs Closed Loop testing, System Management Interrupts (SMIs), and the physics of P99.
🎯 What You'll Learn
- Deconstruct 'Coordinated Omission' in Load Testing
- Differentiate Open Loop (Poisson) vs Closed Loop (Wait) Systems
- Analyze Hardware Jitter (SMIs and C-States)
- Trace a 'Hiccup' using `hdrhistogram`
- Calculate the cost of a 1ms GC Pause in an HFT system
📚 Prerequisites
Before this lesson, you should understand:
Introduction
“Average Latency” is a vanity metric. If you check Twitter while driving, your “average speed” still looks fine; the crash is your tail latency, and it is fatal.
In HFT, Jitter (variance) is worse than latency itself. A consistent 10ms beats a random 1ms-or-100ms. Why? Because jitter breaks Determinism: you cannot model or hedge a delay you cannot predict.
This lesson explores why 99% of benchmarks are wrong due to Coordinated Omission.
The Physics: Coordinated Omission
Most benchmarks (JMeter, ab) are Closed Loop:
Send Request -> Wait for Reply -> Send Next Request.
The Lie: If the server freezes for 10 seconds, the benchmark pauses. It stops sending requests. Result: The benchmark reports “10ms Latency” because it omitted the thousands of requests that should have been sent during the freeze.
The Truth (Open Loop):
Real traffic is a Poisson process: users don’t wait for you to wake up before sending the next request.
Send Request -> Wait 1ms -> Send Request (regardless of reply).
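To make the difference concrete, here is a minimal sketch (the 5ms service time, the 10-second freeze, and the 10ms send interval are all invented for illustration): the closed-loop driver records one slow sample and then goes quiet, while the open-loop driver keeps firing on schedule and records every request that should have been sent.

import statistics

def service_latency_ms(t_ms):
    # Hypothetical server: 5ms normally, frozen from t=2,000ms to t=12,000ms.
    # A request issued during the freeze is stuck until the freeze ends.
    # (Backlog drain after the stall is ignored to keep the sketch short.)
    if 2_000 <= t_ms < 12_000:
        return (12_000 - t_ms) + 5
    return 5

def closed_loop(duration_ms=20_000):
    # Send -> wait for reply -> send next: the stall suppresses load.
    t, samples = 0, []
    while t < duration_ms:
        lat = service_latency_ms(t)
        samples.append(lat)
        t += lat  # next request goes out only after the reply arrives
    return samples

def open_loop(duration_ms=20_000, interval_ms=10):
    # Fire on a fixed schedule, regardless of replies.
    return [service_latency_ms(t) for t in range(0, duration_ms, interval_ms)]

for name, s in (("closed loop", closed_loop()), ("open loop", open_loop())):
    s.sort()
    p99 = s[int(0.99 * len(s))]
    print(f"{name}: n={len(s)} avg={statistics.mean(s):.0f}ms p99={p99}ms")

The closed-loop run reports roughly a 10ms average and a 5ms P99 across a 10-second outage; the open-loop run reports the multi-second latencies that real users would have experienced.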
Deep Dive: Hardware Jitter (SMIs)
You pin your thread. You isolate the CPU. You bypass the Kernel. And yet, every 5 minutes, you lose 100 microseconds. Why? System Management Interrupts (SMIs).
The BIOS (Firmware) steals the CPU to check thermal sensors or ECC RAM.
The OS doesn’t even know it happened.
Fix: Disable or minimize SMI sources in the BIOS (many server boards expose this). Use hwlatdetect (from the rt-tests package) to find them.
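You can also confirm SMIs are the culprit by sampling the CPU’s SMI counter. A minimal sketch, assuming an Intel CPU (which exposes a running count in MSR 0x34, MSR_SMI_COUNT), root privileges, and the msr kernel module loaded (modprobe msr):

import os
import struct
import time

MSR_SMI_COUNT = 0x34  # Intel MSR: number of SMIs since reset

def read_smi_count(cpu=0):
    # The msr driver maps the file offset to the MSR address.
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, MSR_SMI_COUNT))[0]
    finally:
        os.close(fd)

before = read_smi_count()
time.sleep(60)
print(f"SMIs in the last 60s: {read_smi_count() - before}")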
The Math: P99 vs P99.99
P99 means “1 in 100 requests is slow.” That sounds rare, but if a webpage loads 100 assets, the chance that at least one of them hits the P99 tail is 1 − 0.99^100 ≈ 63%. P99 is closer to the typical page-load experience than to the tail: nearly everyone sees the tail.
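A quick check of the compounding (assuming the 100 fetches are independent, which real pages only approximate):

def p_at_least_one_slow(percentile=0.99, assets=100):
    # Probability that at least one of `assets` independent requests
    # falls beyond the given percentile.
    return 1 - percentile ** assets

print(f"{p_at_least_one_slow(0.99, 100):.0%}")    # ~63% for P99
print(f"{p_at_least_one_slow(0.9999, 100):.0%}")  # ~1% for P99.99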
Hiccups: Gil Tene defines a hiccup as “a distinct period of time where the system is unresponsive.” If your GC pauses for 10ms, that is a 10ms hiccup. For an HFT algo doing 100 round-trips per second (one every 10ms), that is one missed trade.
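The same arithmetic scales to any rate and pause length (the 1,000 round-trips-per-second figure below is illustrative):

def missed_round_trips(rate_per_sec, pause_ms):
    # Opportunities that fall inside a pause of `pause_ms` at `rate_per_sec`.
    return rate_per_sec * pause_ms / 1_000

print(missed_round_trips(100, 10))   # 10ms hiccup at 100 rt/s   -> 1.0 missed
print(missed_round_trips(1_000, 1))  # 1ms GC pause at 1,000 rt/s -> 1.0 missed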
Code: Measuring Jitter (Histogram)
HdrHistogram captures the full dynamic range of latency in fixed memory, so you can read P99.99 without storing every sample. The sketch below measures hiccups in the spirit of jHiccup, waking on a fixed schedule and recording how late each wakeup is; feeding the deltas into a histogram is shown after the code.
import time

# Simulating open-loop measurement: wake every `interval` ns and record how
# late we are. A long stall shows up as a run of shrinking deltas, which is
# the coordinated-omission-free view of the pause.
def measure_hiccups(duration_sec):
    start = time.perf_counter_ns()
    end = start + duration_sec * 1_000_000_000
    expected_next = start
    interval = 1_000_000  # 1ms expected interval
    deltas = []
    while time.perf_counter_ns() < end:
        now = time.perf_counter_ns()
        # Did we oversleep past the next scheduled tick?
        if now > expected_next + interval:
            jitter = now - expected_next
            deltas.append(jitter)  # record the hiccup, in nanoseconds
        expected_next += interval
        # Busy-spin to stay precise (no sleep, no scheduler hand-off)
        while time.perf_counter_ns() < expected_next:
            pass
    return deltas  # list of "lost time" events
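To turn the raw deltas into percentiles, one option is the hdrhistogram package from PyPI (imported as hdrh); the constructor arguments and method names below reflect that package and should be checked against the version you install:

from hdrh.histogram import HdrHistogram

deltas = measure_hiccups(10)  # probe for 10 seconds

# Track 1us .. 60s (values in ns) with 3 significant digits, in fixed memory.
hist = HdrHistogram(1_000, 60_000_000_000, 3)
for d in deltas:
    hist.record_value(d)

for p in (50.0, 99.0, 99.99):
    print(f"P{p}: {hist.get_value_at_percentile(p) / 1e6:.3f} ms")
print(f"Max hiccup: {hist.get_max_value() / 1e6:.3f} ms")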
Practice Exercises
Exercise 1: The Omission Trap (Beginner)
Scenario: The server stalls for 1 second. A closed-loop benchmark sends 0 requests during the stall and reports a 10ms average; an open-loop benchmark sends 1,000 requests during the stall and reports a 500ms average. Task: Which one reflects user reality?
Exercise 2: C-States (Intermediate)
Scenario: Your CPU goes to sleep (C-State C6) to save power.
Task: A packet arrives while the core is asleep. The wakeup from C6 alone costs about 20 microseconds before your code even starts running.
Fix: Write 0 to /dev/cpu_dma_latency (and keep the file descriptor open) to tell the kernel’s PM QoS layer you tolerate zero wakeup latency, pinning the cores in C0 at the cost of maximum power draw.
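A minimal sketch of holding that request from Python (requires root; the kernel honours the request only while the file stays open):

import os
import struct

# Ask the PM QoS layer for 0us of tolerated wakeup latency: cores stay in C0.
fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
os.write(fd, struct.pack("i", 0))
# ... run the latency-critical workload; do not close fd until you are done ...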
Exercise 3: Noise Floor (Advanced)
Task: Run cyclictest on a standard Linux kernel vs a PREEMPT_RT kernel.
Compare the “Max Latency” (jitter): the standard kernel will typically land around ~100us, the PREEMPT_RT kernel around ~10us.
Knowledge Check
- What is Coordinated Omission?
- Why does Open Loop testing reveal the truth?
- What is an SMI?
- Why is P99 relevant for a webpage with 100 resources?
- How do C-States affect latency?
Answers
- Benchmarking Bias. The test tool unintentionally coordinates with the system under test to hide slowness by backing off.
- Independence. It generates load regardless of system health, mimicking real-world traffic jams.
- System Management Interrupt. Hardware-level interrupt that bypasses the OS (The “God Mode” of interrupts).
- Probability. 1 − 0.99^100 ≈ 0.63, so there is roughly a 63% chance a user hits at least one slow request.
- Wakeup Delay. Deeper sleep states take longer to wake up from.
Summary
- Benchmark: Open Loop > Closed Loop.
- Metric: P99 > Average.
- Enemy: SMIs and Power Saving.