The Physics of CPU Latency: Caches, Context Switches & Isolation
Why your code is slow. The physics of CPU Caches (L1/L2/L3), the 4µs cost of a Context Switch, and the `isolcpus` kernel boot parameter.
🎯 What You'll Learn
- Deconstruct the CPU Memory Hierarchy (L1 vs RAM)
- Measure the exact cost of a Context Switch (Syscall Physics)
- Configure Kernel Isolation (`isolcpus`, `nohz_full`)
- Pin processes to specific cores using `taskset`
- Analyze False Sharing (Cache Coherency Physics)
Introduction
In High-Frequency Trading (HFT), we don’t think in milliseconds. We think in Clock Cycles. A 4GHz CPU executes 4 Billion cycles per second. 1 Cycle = 0.25 nanoseconds.
When your code waits for RAM, it wastes ~400 cycles. When the OS switches tasks, it wastes ~12,000 cycles. This lesson explores the Physics of the CPU: how to keep data hot in L1 cache and how to banish the Kernel Scheduler from your trading cores.
The Speed of Light: Cache Physics
Data does not move instantly. It travels through silicon.
| Storage | Latency (ns) | Cycles (4GHz) | Physics Metaphor |
|---|---|---|---|
| L1 Cache | 1 ns | 4 | Picking a pen from your desk. |
| L2 Cache | 4 ns | 16 | Picking a book from the shelf. |
| L3 Cache | 12 ns | 48 | Walking to the next room. |
| RAM | 100 ns | 400 | Walking to the warehouse. |
The Goal: Stick to L1. If you access random memory addresses (a Linked List), you hit RAM. If you access contiguous memory (an Array), the CPU Prefetcher pulls the next elements into L1 before you ask for them, as the sketch below demonstrates.
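To see this yourself, here is a minimal C sketch (file and variable names are illustrative): it sums the same values twice, once streaming through a contiguous array, once pointer-chasing through a deliberately shuffled linked list. On typical hardware the array pass is several times faster because the Prefetcher keeps it in L1.

```c
// cache_demo.c -- compile: gcc -O2 cache_demo.c -o cache_demo
// Illustrative sketch: contiguous (prefetch-friendly) vs pointer-chasing access.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

struct node { struct node *next; long value; };

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    long *arr = malloc(N * sizeof *arr);           // contiguous layout
    struct node *nodes = malloc(N * sizeof *nodes);
    long *order = malloc(N * sizeof *order);
    for (long i = 0; i < N; i++) { arr[i] = i; order[i] = i; }

    // Fisher-Yates shuffle of the link order to defeat the prefetcher.
    srand(42);
    for (long i = N - 1; i > 0; i--) {
        long j = rand() % (i + 1);
        long t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (long i = 0; i < N - 1; i++) {
        nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order[i]].value = 1;
    }
    nodes[order[N - 1]].next = NULL;
    nodes[order[N - 1]].value = 1;

    double t0 = now_sec();
    long sum = 0;
    for (long i = 0; i < N; i++) sum += arr[i];    // streams through L1
    double t1 = now_sec();
    for (struct node *p = &nodes[order[0]]; p; p = p->next)
        sum += p->value;                           // random jumps -> RAM hits
    double t2 = now_sec();

    printf("sum=%ld array=%.3fs list=%.3fs\n", sum, t1 - t0, t2 - t1);
    return 0;
}
```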
Context Switches: The Invisible Tax
A Context Switch is when the CPU stops your code to run something else (another app, or the Kernel). It is catastrophic for latency.
The Physics:
- Save Registers: the CPU burns cycles storing your register state and loading the other task's.
- Pollute L1 Cache: the new process evicts your hot data from L1.
- TLB Flush: the Translation Lookaside Buffer (the virtual-memory map) is wiped.
Cost: ~2-4 microseconds (roughly 12,000 cycles at 4GHz). Solution: CPU Pinning. You can measure the cost yourself with the ping-pong sketch below.
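A classic back-of-the-envelope measurement (a sketch, not a rigorous benchmark): two processes bounce one byte across a pair of pipes, so every round trip forces roughly two context switches plus syscall overhead. Dividing total time by round trips gives a rough per-switch cost.

```c
// ctxswitch.c -- compile: gcc -O2 ctxswitch.c -o ctxswitch
// Rough sketch: each round trip across the pipes forces ~2 context switches.
// Pin both processes to one core (e.g. taskset -c 2) so they must alternate.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void) {
    int ab[2], ba[2];              // parent->child and child->parent pipes
    char buf = 'x';
    if (pipe(ab) || pipe(ba)) { perror("pipe"); return 1; }

    if (fork() == 0) {             // child: echo every byte back
        for (int i = 0; i < ROUNDS; i++) {
            read(ab[0], &buf, 1);
            write(ba[1], &buf, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {
        write(ab[1], &buf, 1);     // wake the child (switch #1)
        read(ba[0], &buf, 1);      // block until it answers (switch #2)
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per round trip (~2 switches)\n", ns / ROUNDS);
    return 0;
}
```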
Code: CPU Pinning & Isolation
We tell the Linux Scheduler: "Do not touch CPUs 2 and 3."
1. Boot Parameters (The Nuclear Option)
Edit /etc/default/grub:
```bash
# isolcpus:  remove CPUs from scheduler balancing
# nohz_full: stop scheduling-clock ticks (1000 Hz -> 1 Hz)
# rcu_nocbs: move RCU callbacks to housekeeping cores
GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
```
Note: Run update-grub and reboot. Afterwards, cat /sys/devices/system/cpu/isolated should report 2-3.
2. Runtime Pinning (taskset)
Now CPUs 2 and 3 are empty. The scheduler will never place tasks there on its own, so you must manually force your app onto them.
```bash
# Launch a Python script on CPU 2
taskset -c 2 python3 my_trading_algo.py

# Check affinity (physics verification)
pid=$(pgrep -f my_trading_algo)
taskset -p $pid
# output: pid 1234's current affinity mask: 4 (binary 100 -> CPU 2)
```
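If you would rather pin from inside the program than wrap it in taskset, Linux exposes the same mechanism via sched_setaffinity(2). A minimal sketch, assuming CPU 2 is the core isolated above:

```c
// pin_self.c -- compile: gcc -O2 pin_self.c -o pin_self
// Pins the calling process to CPU 2, the core isolated via GRUB above.
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);              // affinity mask: only CPU 2 (mask = 4)

    // 0 = current process; the kernel migrates us to CPU 2 immediately.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("running on CPU %d\n", sched_getcpu());
    // ... hot trading loop would run here ...
    return 0;
}
```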
False Sharing: The Concurrency Killer
Imagine two threads on different Cores writing to variables that sit next to each other in RAM. CPUs cache data in 64-byte Cache Lines.
- Thread A writes to Variable X.
- Thread B writes to Variable Y.
- If X and Y are in the same 64-byte line, Core A and Core B fight over ownership of that line.
- Physics: the Cache Coherency Protocol (MESI) forces constant L1 invalidations. Code slows down by up to 50x.
Fix: Pad (or align) your data structures so hot variables land on separate cache lines, as in the sketch below.
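Here is a minimal C11 sketch of the fix (struct and thread names are illustrative). alignas(64) forces each hot counter onto its own cache line, so the two writer threads stop invalidating each other's L1:

```c
// false_sharing.c -- compile: gcc -O2 -pthread false_sharing.c -o fs
// Illustrative sketch: padding hot counters onto separate cache lines.
#include <pthread.h>
#include <stdalign.h>
#include <stdio.h>

// BAD: x and y share one 64-byte cache line -> cores ping-pong the line.
struct hot_pair_bad { long x; long y; };

// GOOD: each field gets its own 64-byte cache line.
struct hot_pair_good {
    alignas(64) long x;
    alignas(64) long y;
};

static struct hot_pair_good g;

static void *bump_x(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++) g.x++;   // Core A hammers line 1
    return NULL;
}

static void *bump_y(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000; i++) g.y++;   // Core B hammers line 2
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump_x, NULL);
    pthread_create(&b, NULL, bump_y, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x=%ld y=%ld\n", g.x, g.y);
    return 0;
}
```

Swap hot_pair_good for hot_pair_bad and time both runs to watch the MESI ping-pong show up in the numbers.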
Practice Exercises
Exercise 1: The Context Switch Cost (Beginner)
Task: Use perf to measure context switches.
Action: perf stat -e context-switches ./my_script.
Goal: Drive this number to zero.
Exercise 2: Cache Miss Profiling (Intermediate)
Task: Run perf stat -e L1-dcache-load-misses ./my_script.
Action: Change a Linked List to an Array. Watch misses drop.
Exercise 3: Full Isolation (Advanced)
Task: Isolate CPU 3 via GRUB.
Action: Run a busy-loop pinned to CPU 3 (see the sketch below).
Observation: Use htop. CPU 3 stays at 100% usage, yet the Load Average barely moves, because regular scheduler tasks never queue up behind your busy-loop on the isolated core.
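A minimal busy-loop for this exercise (a sketch; any infinite spin works):

```c
// busy.c -- compile: gcc -O2 busy.c -o busy ; run: taskset -c 3 ./busy
// Trivial spin loop for the isolation experiment.
#include <stdint.h>

int main(void) {
    volatile uint64_t counter = 0;   // volatile stops the compiler from
    for (;;) counter++;              // optimizing the loop away
}
```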
Knowledge Check
- How many cycles does a RAM access cost?
- What is a Cache Line size?
- What does isolcpus do?
- Why is a Linked List slower than an Array?
- What is False Sharing?
Answers
- ~400 cycles.
- 64 Bytes.
- Removes a CPU from the kernel scheduler’s balancing algorithms.
- Pointer chasing. Arrays are contiguous and prefetch-friendly; Linked Lists are random memory jumps (RAM hits).
- Two cores fighting over the same Cache Line due to proximity of variables.
Summary
- L1 Cache: The only fast storage.
- Context Switch: A ~12,000 cycle penalty.
- isolcpus: Evicting the Scheduler.
- False Sharing: The invisible concurrency bug.
Pro Version: For production-grade implementation details, see the full research article: cpu-optimization-linux-latency