The Physics of CPU Latency: Caches, Context Switches & Isolation
Why your code is slow. The physics of CPU Caches (L1/L2/L3), the 4µs cost of a Context Switch, and the `isolcpus` kernel boot parameter.
🎯 What You'll Learn
- Deconstruct the CPU Memory Hierarchy (L1 vs RAM)
- Measure the exact cost of a Context Switch (Syscall Physics)
- Configure Kernel Isolation (`isolcpus`, `nohz_full`)
- Pin processes to specific cores using `taskset`
- Analyze False Sharing (Cache Coherency Physics)
Introduction
In High-Frequency Trading (HFT), we don’t think in milliseconds. We think in Clock Cycles. A 4GHz CPU executes 4 Billion cycles per second. 1 Cycle = 0.25 nanoseconds.
When your code waits for RAM, it wastes ~400 cycles. When the OS switches tasks, it wastes ~12,000 cycles. This lesson explores the Physics of the CPU: how to keep data hot in L1 cache and how to banish the Kernel Scheduler from your trading cores.
The Speed of Light: Cache Physics
Data does not move instantly. It travels through silicon.
| Storage | Latency (ns) | Cycles (4GHz) | Physics Metaphor |
|---|---|---|---|
| L1 Cache | 1 ns | 4 | Picking a pen from your desk. |
| L2 Cache | 4 ns | 16 | Picking a book from the shelf. |
| L3 Cache | 12 ns | 48 | Walking to the next room. |
| RAM | 100 ns | 400 | Walking to the warehouse. |
The Goal: Stick to L1. If you access random memory addresses (a Linked List), you hit RAM. If you access contiguous memory (an Array), the CPU Prefetcher pulls the next elements into L1 before you ask for them, as the sketch below demonstrates.
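To see this yourself, here is a minimal C sketch (file and variable names are illustrative): it sums the same values twice, once streaming through a contiguous array, once pointer-chasing through a deliberately shuffled linked list. On typical hardware the array pass is several times faster because the Prefetcher keeps it in L1.

```c
// cache_demo.c -- compile: gcc -O2 cache_demo.c -o cache_demo
// Illustrative sketch: contiguous (prefetch-friendly) vs pointer-chasing access.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

struct node { struct node *next; long value; };

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    long *arr = malloc(N * sizeof *arr);           // contiguous layout
    struct node *nodes = malloc(N * sizeof *nodes);
    long *order = malloc(N * sizeof *order);
    for (long i = 0; i < N; i++) { arr[i] = i; order[i] = i; }

    // Fisher-Yates shuffle of the link order to defeat the prefetcher.
    srand(42);
    for (long i = N - 1; i > 0; i--) {
        long j = rand() % (i + 1);
        long t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (long i = 0; i < N - 1; i++) {
        nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order[i]].value = 1;
    }
    nodes[order[N - 1]].next = NULL;
    nodes[order[N - 1]].value = 1;

    double t0 = now_sec();
    long sum = 0;
    for (long i = 0; i < N; i++) sum += arr[i];    // streams through L1
    double t1 = now_sec();
    for (struct node *p = &nodes[order[0]]; p; p = p->next)
        sum += p->value;                           // random jumps -> RAM hits
    double t2 = now_sec();

    printf("sum=%ld array=%.3fs list=%.3fs\n", sum, t1 - t0, t2 - t1);
    return 0;
}
```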
Context Switches: The Invisible Tax
A Context Switch is when the CPU stops your code to run something else (another app, or the Kernel). It is catastrophic for latency.
The Physics:
- Save Registers: the CPU burns cycles storing your register state and loading the other task's.
- Pollute L1 Cache: the new process evicts your hot data from L1.
- TLB Flush: the Translation Lookaside Buffer (the virtual-memory map) is wiped.
Cost: ~2-4 microseconds (roughly 12,000 cycles at 4GHz). Solution: CPU Pinning. You can measure the cost yourself with the ping-pong sketch below.
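A classic back-of-the-envelope measurement (a sketch, not a rigorous benchmark): two processes bounce one byte across a pair of pipes, so every round trip forces roughly two context switches plus syscall overhead. Dividing total time by round trips gives a rough per-switch cost.

```c
// ctxswitch.c -- compile: gcc -O2 ctxswitch.c -o ctxswitch
// Rough sketch: each round trip across the pipes forces ~2 context switches.
// Pin both processes to one core (e.g. taskset -c 2) so they must alternate.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void) {
    int ab[2], ba[2];              // parent->child and child->parent pipes
    char buf = 'x';
    if (pipe(ab) || pipe(ba)) { perror("pipe"); return 1; }

    if (fork() == 0) {             // child: echo every byte back
        for (int i = 0; i < ROUNDS; i++) {
            read(ab[0], &buf, 1);
            write(ba[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        write(ab[1], &buf, 1);     // wake the child (switch #1)
        read(ba[0], &buf, 1);      // block until it answers (switch #2)
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per round trip (~2 switches)\n", ns / ROUNDS);
    return 0;
}
```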
Code: CPU Pinning & Isolation
We tell the Linux Scheduler: "Do not touch CPUs 2 and 3."
1. Boot Parameters (The Nuclear Option)
Edit /etc/default/grub:
```bash
# isolcpus:  remove CPUs from scheduler balancing
# nohz_full: stop scheduling-clock ticks (1000 Hz -> 1 Hz)
# rcu_nocbs: move RCU callbacks to housekeeping cores
GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
```
Note: Run update-grub and reboot. Afterwards, cat /sys/devices/system/cpu/isolated should report 2-3.
2. Runtime Pinning (taskset)
Now CPUs 2 and 3 are empty. The scheduler will never place tasks there on its own, so you must manually force your app onto them.
```bash
# Launch a Python script on CPU 2
taskset -c 2 python3 my_trading_algo.py

# Check affinity (physics verification)
pid=$(pgrep -f my_trading_algo)
taskset -p $pid
# output: pid 1234's current affinity mask: 4 (binary 100 -> CPU 2)
```
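If you would rather pin from inside the program than wrap it in taskset, Linux exposes the same mechanism via sched_setaffinity(2). A minimal sketch, assuming CPU 2 is the core isolated above:

```c
// pin_self.c -- compile: gcc -O2 pin_self.c -o pin_self
// Pins the calling process to CPU 2, the core isolated via GRUB above.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);              // affinity mask: only CPU 2 (mask = 4)

    // 0 = current process; the kernel migrates us to CPU 2 immediately.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("running on CPU %d\n", sched_getcpu());
    // ... hot trading loop would run here ...
    return 0;
}
```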
False Sharing: The Concurrency Killer
Imagine two threads on different Cores writing to variables that sit next to each other in RAM. CPUs cache data in 64-byte Cache Lines.
- Thread A writes to Variable X.
- Thread B writes to Variable Y.
- If X and Y are in the same 64-byte line, Core A and Core B fight over ownership of that line.
- Physics: the Cache Coherency Protocol (MESI) forces constant L1 invalidations. Code slows down by up to 50x.
Fix: Pad (or align) your data structures so hot variables land on separate cache lines, as in the sketch below.
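Here is a minimal C11 sketch of the fix (struct and thread names are illustrative). alignas(64) forces each hot counter onto its own cache line, so the two writer threads stop invalidating each other's L1:

```c
// false_sharing.c -- compile: gcc -O2 -pthread false_sharing.c -o fs
// Illustrative sketch: padding hot counters onto separate cache lines.
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

// BAD: x and y share one 64-byte cache line -> cores ping-pong the line.
struct hot_pair_bad { long x; long y; };

// GOOD: each field gets its own 64-byte cache line.
struct hot_pair_good {
    alignas(64) long x;
    alignas(64) long y;
};

static struct hot_pair_good g;

static void *bump_x(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++) g.x++;   // Core A hammers line 1
    return NULL;
}

static void *bump_y(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++) g.y++;   // Core B hammers line 2
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump_x, NULL);
    pthread_create(&b, NULL, bump_y, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x=%ld y=%ld\n", g.x, g.y);
    return 0;
}
```

Swap hot_pair_good for hot_pair_bad and time both runs to watch the MESI ping-pong show up in the numbers.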
Practice Exercises
Exercise 1: The Context Switch Cost (Beginner)
Task: Use perf to measure context switches.
Action: perf stat -e context-switches ./my_script.
Goal: Drive this number to zero.
Exercise 2: Cache Miss Profiling (Intermediate)
Task: Run perf stat -e L1-dcache-load-misses ./my_script.
Action: Change a Linked List to an Array. Watch misses drop.
Exercise 3: Full Isolation (Advanced)
Task: Isolate CPU 3 via GRUB.
Action: Run a busy-loop pinned to CPU 3 (see the sketch below).
Observation: Use htop. CPU 3 stays at 100% usage, yet the Load Average barely moves, because regular scheduler tasks never queue up behind your busy-loop on the isolated core.
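A minimal busy-loop for this exercise (a sketch; any infinite spin works):

```c
// busy.c -- compile: gcc -O2 busy.c -o busy ; run: taskset -c 3 ./busy
// Trivial spin loop for the isolation experiment.
#include <stdint.h>

int main(void) {
    volatile uint64_t counter = 0;   // volatile stops the compiler from
    for (;;) counter++;              // optimizing the loop away
}
```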
Knowledge Check
- How many cycles does a RAM access cost?
- What is a Cache Line size?
- What does isolcpus do?
- Why is a Linked List slower than an Array?
- What is False Sharing?
Answers
- ~400 cycles.
- 64 Bytes.
- Removes a CPU from the kernel scheduler’s balancing algorithms.
- Pointer chasing. Arrays are contiguous and prefetch-friendly; Linked Lists are random memory jumps (RAM hits).
- Two cores fighting over the same Cache Line due to proximity of variables.
Summary
- L1 Cache: The only fast storage.
- Context Switch: A ~12,000 cycle penalty.
- isolcpus: Evicting the Scheduler.
- False Sharing: The invisible concurrency bug.
Pro Version: For production-grade implementation details, see the full research article: cpu-optimization-linux-latency