The Physics of Latency: Jitter & Coordinated Omission

Why your benchmarks are lying. Open Loop vs Closed Loop testing, System Management Interrupts (SMIs), and the physics of P99.

Beginner, 45 min read

🎯 What You'll Learn

  • Deconstruct 'Coordinated Omission' in Load Testing
  • Differentiate Open Loop (Poisson) vs Closed Loop (Wait) Systems
  • Analyze Hardware Jitter (SMIs and C-States)
  • Trace a 'Hiccup' using `hdrhistogram`
  • Calculate the cost of a 1ms GC Pause in an HFT system

Introduction

“Average Latency” is a vanity metric. If you check Twitter while driving, your “Average Speed” is fine. But your “Tail Latency” (the crash) is fatal.

In HFT, Jitter (Variance) is worse than Latency. A consistent 10ms is better than a random 1ms-or-100ms. Why? Because Jitter breaks Determinism.

This lesson explores why most closed-loop benchmarks understate latency: Coordinated Omission.


The Physics: Coordinated Omission

Most benchmarks (JMeter, ab) are Closed Loop: Send Request -> Wait for Reply -> Send Next Request.

The Lie: If the server freezes for 10 seconds, the benchmark pauses too. It stops sending requests. Result: The benchmark records one slow sample and still reports “10ms Latency”, because it omitted the thousands of requests that should have been sent during the freeze.

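The Truth (Open Loop): Real traffic behaves like a Poisson process: arrivals are independent of how the server is doing. Users don’t wait for you to wake up. Send Request -> Wait 1ms -> Send Request (regardless of reply), and time each request from the moment it was scheduled to be sent.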
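
The difference is easy to see in a toy simulation. The sketch below is not a real load generator: it assumes a single FIFO server with a fixed service time that freezes for one window, and every name and parameter in it is made up for illustration. Times are in milliseconds.

```python
import math

def server_finish(arrival, free_at, service, stall_start, stall_end):
    """Completion time on a single FIFO server frozen during [stall_start, stall_end)."""
    begin = max(arrival, free_at)
    if stall_start <= begin < stall_end:
        begin = stall_end  # nothing starts until the freeze is over
    return begin + service

def closed_loop(duration, interval, service, stall_start, stall_end):
    """Send, wait for the reply, then send again (at most once per interval)."""
    latencies, t, free_at = [], 0, 0
    while t < duration:
        done = server_finish(t, free_at, service, stall_start, stall_end)
        latencies.append(done - t)
        free_at = done
        t = max(t + interval, done)  # the client backs off while the server is frozen
    return latencies

def open_loop(duration, interval, service, stall_start, stall_end):
    """Send on a fixed schedule; time each request from its scheduled send time."""
    latencies, free_at = [], 0
    for t in range(0, duration, interval):
        done = server_finish(t, free_at, service, stall_start, stall_end)
        latencies.append(done - t)
        free_at = done
    return latencies

def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]

if __name__ == "__main__":
    # 5 s run, 1 request per 10 ms, 1 ms service time, 1 s freeze.
    args = dict(duration=5000, interval=10, service=1, stall_start=1000, stall_end=2000)
    for name, run in (("closed", closed_loop), ("open", open_loop)):
        lat = run(**args)
        slow = sum(1 for x in lat if x > 100)
        print(f"{name:>6}: samples={len(lat)} slow={slow} "
              f"mean={sum(lat) / len(lat):.1f}ms p99={percentile(lat, 99)}ms")
```

With these parameters the closed-loop run records a single slow sample, so its P99 never sees the freeze. The open-loop run records roughly a hundred slow samples, one for every request that should have gone out while the server was frozen or draining its backlog.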


Deep Dive: Hardware Jitter (SMIs)

You pin your thread. You isolate the CPU. You bypass the Kernel. And yet, every 5 minutes, you lose 100 microseconds. Why? System Management Interrupts (SMIs).

The BIOS (Firmware) steals the CPU to check thermal sensors or ECC RAM. The OS doesn’t even know it happened. Fix: Use hwlatdetect to find them, then disable the SMI sources your BIOS exposes (not every SMI can be turned off).
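
One way to confirm that SMIs are firing is to read the hardware SMI counter before and after a measurement window. The sketch below assumes an Intel CPU that exposes MSR_SMI_COUNT at address 0x34, a Linux host with the `msr` kernel module loaded, and root privileges. The function name is made up for illustration, and other CPUs may not have this register at all.

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # Assumption: Intel-specific register, not present on every CPU

def read_smi_count(msr_path="/dev/cpu/0/msr"):
    """Return the SMI count for one core (needs root and the `msr` module)."""
    fd = os.open(msr_path, os.O_RDONLY)
    try:
        # The msr device returns 8 bytes when read at offset == MSR address.
        raw = os.pread(fd, 8, MSR_SMI_COUNT)
    finally:
        os.close(fd)
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF  # the counter lives in the low 32 bits
```

Call it once before a test run and once after. If the number went up, firmware took the CPU away during your measurement.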


The Math: P99 vs P99.99

P99 means “1 in 100 requests is slow.” If a webpage loads 100 assets, the P99 is no longer a rare event: most page loads include at least one request from the tail. Everyone sees the tail.
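
The arithmetic behind that claim is short. This sketch assumes the requests are independent, which real page loads only approximate; the function name is made up for illustration.

```python
def p_at_least_one_slow(n_requests, percentile=0.99):
    """Chance that at least one of n independent requests lands above the percentile."""
    return 1.0 - percentile ** n_requests

if __name__ == "__main__":
    for pct in (0.99, 0.9999):
        chance = p_at_least_one_slow(100, pct)
        print(f"100 requests, P{pct * 100:g}: {chance:.1%} of page loads hit the tail")
```

At P99 the tail hits most page loads. At P99.99 it hits about one in a hundred, which is why the extra nines matter.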

Hiccups: The Gil Tene definition of a Hiccup: “A distinct period of time where the system is unresponsive.” If your GC pauses for 10ms, that is a 10ms Hiccup. For an HFT Algo doing 100 round-trips per second, that is 1 missed trade.


Code: Measuring Jitter (Histogram)

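HdrHistogram captures the full range of latency without massive memory overhead. The simplified loop below does not call the library itself; it collects the raw hiccup samples that you would then record into a histogram.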

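import time

# Simulated open-loop measurement: follow a fixed schedule and
# record how far behind it we fall.
def measure_hiccups(duration_sec):
    start = time.perf_counter_ns()
    end = start + int(duration_sec * 1_000_000_000)  # stay in integer nanoseconds

    expected_next = start
    interval = 1_000_000  # 1ms expected interval, in nanoseconds

    deltas = []

    while time.perf_counter_ns() < end:
        now = time.perf_counter_ns()

        # Are we more than one interval behind schedule?
        if now > expected_next + interval:
            jitter = now - expected_next
            deltas.append(jitter)  # Record the hiccup (ns behind schedule)

        # Advance the schedule by one tick, whether or not we are on time.
        # After a long stall the loop replays each missed tick, so one
        # stall produces a run of shrinking deltas, one per missed tick.
        expected_next += interval

        # Busy spin to stay precise (burns one core at 100%)
        while time.perf_counter_ns() < expected_next:
            pass

    return deltas  # List of "Lost Time" events, in nanoseconds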
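
If your samples come from a closed-loop recorder instead, they can be repaired after the fact. The sketch below is a plain-Python rendering of the idea behind HdrHistogram's expected-interval correction: for every sample longer than the expected interval, add the samples that would have been taken had the recorder kept to its schedule. The function name is made up for illustration, and this is not the library's API.

```python
def correct_for_omission(latencies_ns, expected_interval_ns):
    """Backfill the samples a stalled, closed-loop recorder failed to take."""
    if expected_interval_ns <= 0:
        return list(latencies_ns)  # no schedule known, nothing to correct

    corrected = []
    for value in latencies_ns:
        corrected.append(value)
        # A request scheduled one interval later would have waited
        # one interval less, and so on until the stall is used up.
        missing = value - expected_interval_ns
        while missing >= expected_interval_ns:
            corrected.append(missing)
            missing -= expected_interval_ns
    return corrected
```

A single 10ms sample at a 1ms expected interval becomes ten samples (10ms, 9ms, down to 1ms), which is much closer to what an open-loop test would have seen.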

Practice Exercises

Exercise 1: The Omission Trap (Beginner)

Scenario: Server stalls for 1 second.

  • Closed Loop Benchmark: Sends 0 requests during stall. Reports 10ms avg.
  • Open Loop Benchmark: Sends 1000 requests during stall. Reports 500ms avg.

Task: Which one reflects user reality?

Exercise 2: C-States (Intermediate)

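Scenario: Your CPU goes to sleep (C-State C6) to save power. Task: A packet arrives and the wakeup takes 20 microseconds. Eliminate that delay. Fix: Write 0 to /dev/cpu_dma_latency and keep the file open to hold the CPU in C0 (Max Power). The setting is dropped as soon as the file is closed.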
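
The fix can be sketched in a few lines. This assumes Linux and root privileges; the function name is made up for illustration. The kernel only honors the request while the file descriptor stays open, so the caller has to hold on to it.

```python
import os
import struct

def hold_cpu_dma_latency(target_us=0, path="/dev/cpu_dma_latency"):
    """Request a maximum wakeup latency and return the open file descriptor.

    The request is active only while the descriptor stays open.
    """
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, struct.pack("=i", target_us))  # kernel expects a 32-bit integer
    return fd
```

Typical use: call `fd = hold_cpu_dma_latency(0)` at startup, run the latency-critical work, then `os.close(fd)` to let the CPU save power again.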

Exercise 3: Noise Floor (Advanced)

Task: Run cyclictest on a standard Linux kernel vs a PREEMPT_RT kernel. Compare the “Max Latency” (Jitter). Expect roughly 100us on the standard kernel and roughly 10us on RT; exact numbers depend on hardware and load.


Knowledge Check

  1. What is Coordinated Omission?
  2. Why does Open Loop testing reveal the truth?
  3. What is an SMI?
  4. Why is P99 relevant for a webpage with 100 resources?
  5. How do C-States affect latency?
Answers
  1. Benchmarking Bias. The test tool unintentionally coordinates with the system under test to hide slowness by backing off.
  2. Independence. It generates load regardless of system health, mimicking real-world traffic jams.
  3. System Management Interrupt. Hardware-level interrupt that bypasses the OS (The “God Mode” of interrupts).
  4. Probability. $0.99^{100} \approx 36.6\%$ of page loads see no slow request. There is roughly a 63% chance a user hits at least one slow request.
  5. Wakeup Delay. Deeper sleep states take longer to wake up from.

Summary

  • Benchmark: Open Loop > Closed Loop.
  • Metric: P99 > Average.
  • Enemy: SMIs and Power Saving.
