The Physics of Matching Engines: Speed & Determinism
Why the fastest engines in the world are Single-Threaded. Ring Buffers, CPU Cache alignment, and the LMAX Disruptor.
🎯 What You'll Learn
- Deconstruct the 'Single-Threaded Determinism' rule
- Analyze the LMAX Disruptor Pattern (Ring Buffers)
- Calculate Cache Miss penalties (`std::map` vs Arrays)
- Trace a packet from NIC to Matching Logic (Kernel Bypass)
- Architect a Lock-Free Order Processing Pipeline
Introduction
In Web Development, we scale by adding threads. In High-Frequency Trading, adding a thread is a death sentence.
Why? Locks. A context switch costs on the order of 2 microseconds. Lock contention can cost ~5 microseconds. A modern Matching Engine matches an order in ~50 nanoseconds. Take a lock on the hot path and you are roughly 100x slower.
This lesson explores the physics of building the fastest software on Earth.
The Physics: Single-Threaded Determinism
Markets must be Fair.
If Client A sends an order at t=1 and Client B sends at t=2, A must execute before B.
If matching is multi-threaded, Thread B might win a race condition against Thread A and execute B's order first. That breaks price-time priority, and in regulated markets it is illegal.
The Solution: The Sequencer. All incoming packets are serialized into a single stream. A Single Thread processes this stream. No locks are needed because only one thread owns the memory.
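A minimal sketch of this pipeline in Python (the `Sequencer` and `MatchingEngine` names are illustrative, not any real exchange's API): the sequencer stamps each inbound message with a monotonic sequence number, and a single thread consumes the stream in stamped order.

    import itertools

    class Sequencer:
        """Stamps every inbound message with a monotonic sequence number,
        producing the single totally-ordered stream the engine consumes."""

        def __init__(self):
            self._next_seq = itertools.count(1)

        def stamp(self, message):
            return (next(self._next_seq), message)

    class MatchingEngine:
        """Owned by exactly ONE thread, so its state needs no locks."""

        def __init__(self):
            self.book = []     # stand-in for a real order book
            self.last_seq = 0

        def process(self, event):
            seq_no, order = event
            # A gap means lost input: halt and recover, never guess.
            assert seq_no == self.last_seq + 1
            self.last_seq = seq_no
            self.book.append(order)  # real matching logic would go here

    # Client A was stamped before Client B, so A executes first. Always.
    seq, engine = Sequencer(), MatchingEngine()
    for order in ["A: BUY 100 @ 10.00", "B: SELL 100 @ 10.00"]:
        engine.process(seq.stamp(order))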
Deep Dive: The Ring Buffer (LMAX Disruptor)
How do you pass data between the Network Thread (Writer) and the Matching Thread (Reader) without a lock? The Ring Buffer.
Imagine a circular array of 1 million slots.
- Writer: “I wrote to slot 100. I am moving my cursor to 101.”
- Reader: “I see the Writer is at 101. I can safely read 100.”
Physics:
- Memory Barrier: Used instead of Locks (Cost: ~10ns vs ~5000ns).
- Cache Locality: The array is contiguous in RAM, so the CPU pre-fetcher pulls the next order into L1 Cache before you even ask for it.
The Enemy: Cache Misses
Why `std::map` (a Red-Black Tree) is banned on HFT hot paths:
It is a "Node-Based" container. Node A lives at address 0x100; Node B lives at 0x9000.
To traverse the tree, the CPU must fetch 0x100, stall ~100ns waiting on RAM, then chase the pointer to 0x9000 and stall again. Every hop is a potential cache miss.
The Fix: Flat Vectors.
We pre-allocate a `std::vector<Order>` once, at startup.
Scanning a vector can be ~100x faster than walking a tree, even when the complexity is O(n) vs O(log n), because Hardware Prefetching streams contiguous memory into cache.
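Python cannot expose cache lines, but the pre-allocation discipline translates directly. A hypothetical sketch of a flat, fixed-capacity order pool (the `OrderPool` name and parallel-array layout are illustrative): one allocation at startup, and the hot path only recycles slots.

    class OrderPool:
        """Flat, pre-allocated storage: every order lives in contiguous
        arrays allocated once, so the hot path never calls the allocator."""

        def __init__(self, capacity=1_000_000):
            # One up-front allocation; the C++ analogue is a
            # std::vector<Order> with reserve(capacity).
            self.price = [0] * capacity
            self.qty = [0] * capacity
            self.live = [False] * capacity
            self.free_list = list(range(capacity - 1, -1, -1))

        def insert(self, price, qty):
            slot = self.free_list.pop()   # O(1), no malloc, no kernel
            self.price[slot] = price
            self.qty[slot] = qty
            self.live[slot] = True
            return slot

        def remove(self, slot):
            self.live[slot] = False
            self.free_list.append(slot)   # slot is recycled, never freed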
Code: The Disruptor Pattern
A conceptual implementation of a lock-free, single-producer/single-consumer ring buffer, in Python for readability (a production engine would use C++ with atomics).
class RingBuffer:
    def __init__(self, size=1024):
        self.buffer = [None] * size
        self.size = size
        self.writer_cursor = 0  # written only by the writer thread
        self.reader_cursor = 0  # written only by the reader thread

    def write(self, data):
        # In C++, the cursor store would use memory_order_release
        next_slot = (self.writer_cursor + 1) % self.size
        if next_slot == self.reader_cursor:
            raise Exception("Buffer Full")  # one slot stays empty by design
        self.buffer[self.writer_cursor] = data
        self.writer_cursor = next_slot  # Atomic commit: publishes the slot

    def read(self):
        # In C++, the cursor load would use memory_order_acquire
        if self.reader_cursor == self.writer_cursor:
            return None  # Empty
        data = self.buffer[self.reader_cursor]
        self.reader_cursor = (self.reader_cursor + 1) % self.size
        return data
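A quick usage sketch of the buffer above. Note the design choice it exposes: one slot is always left empty, so cursor equality means "empty" and never "full".

    rb = RingBuffer(size=4)        # 4 slots, 3 usable

    rb.write("order-1")
    rb.write("order-2")
    rb.write("order-3")
    # rb.write("order-4")          # would raise "Buffer Full": the empty
                                   # slot distinguishes full from empty

    print(rb.read())               # order-1 (FIFO: arrival order preserved)
    print(rb.read())               # order-2
    rb.write("order-4")            # reader freed a slot, writer proceeds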
Practice Exercises
Exercise 1: Cache Miss Math (Beginner)
Scenario: L1 Cache hit = 1ns. RAM access = 100ns. Task: Algorithm A does 100 L1 hits. Algorithm B does 2 RAM accesses. Which is faster? (A = 100ns. B = 200ns. The “slower” algorithm wins if it fits in cache).
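The same arithmetic as code:

    L1_HIT_NS = 1              # given: L1 cache hit
    RAM_NS = 100               # given: main-memory access

    algo_a = 100 * L1_HIT_NS   # 100 cache-resident operations
    algo_b = 2 * RAM_NS        # 2 operations, each a cache miss
    print(algo_a, algo_b)      # 100 vs 200 ns: A wins despite doing 50x the work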
Exercise 2: Lock Contention (Intermediate)
Scenario: Two threads fight over a Mutex. Task: Why does the OS put the waiting thread to sleep? What happens when it wakes (a Context Switch)? Why is this unacceptable in HFT? (Sleeping and waking costs >10us of latency.)
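You can feel this from Python, with the caveat that interpreter overhead inflates the absolute numbers; only the orders of magnitude matter. The contended acquire blocks in the kernel until the holder releases, so its latency is hostage to the scheduler.

    import threading
    import time

    lock = threading.Lock()

    # Uncontended: nobody else wants the lock, so no context switch.
    t0 = time.perf_counter_ns()
    for _ in range(100_000):
        lock.acquire()
        lock.release()
    per_acquire = (time.perf_counter_ns() - t0) / 100_000

    # Contended: the holder sleeps inside the critical section, so our
    # acquire blocks and the OS deschedules this thread.
    def holder():
        with lock:
            time.sleep(0.001)  # 1 ms critical section

    t = threading.Thread(target=holder)
    t.start()
    time.sleep(0.0001)         # let the holder grab the lock first
    t0 = time.perf_counter_ns()
    lock.acquire()             # blocks: sleep in kernel, wake on release
    blocked_ns = time.perf_counter_ns() - t0
    lock.release()
    t.join()

    # The blocked time is dominated by the holder's critical section plus
    # the wake-up context switch: your latency now depends on the OS.
    print(f"uncontended ~{per_acquire:.0f} ns/acquire, contended ~{blocked_ns} ns")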
Exercise 3: Sequence Numbers (Advanced)
Task: The Matching Engine crashes. How do you recover? Solution: Replay the Journal. Since the engine is deterministic, replaying the same input stream produces the exact same state.
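A sketch of the recovery path, reusing the illustrative `Sequencer` and `MatchingEngine` from the earlier sketch: journal every stamped event before processing it, then replay the journal into a fresh engine.

    # Normal operation: journal first (write-ahead), then process.
    journal = []
    seq, engine = Sequencer(), MatchingEngine()
    for order in ["A: BUY 100 @ 10.00", "B: SELL 100 @ 10.00", "C: BUY 50 @ 9.99"]:
        event = seq.stamp(order)
        journal.append(event)
        engine.process(event)

    # Crash. Recovery: replay the identical ordered stream into a fresh engine.
    recovered = MatchingEngine()
    for event in journal:
        recovered.process(event)

    # Determinism guarantees identical state.
    assert recovered.book == engine.book
    assert recovered.last_seq == engine.last_seq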
Knowledge Check
- Why are Matching Engines single-threaded?
- What is a Ring Buffer?
- Why do HFTs avoid `new`/`malloc` at runtime?
- What is a "Memory Barrier"?
- Why is a flat array faster than a tree for small datasets?
Answers
- Determinism and Speed. No race conditions, no lock overhead.
- Circular Queue. A fixed-size array used to pass data between threads lock-free.
- OS Overhead. Allocation can drop into the Kernel and trigger Page Faults, so its latency is unpredictable.
- CPU Instruction. It forces the CPU to flush write buffers, ensuring other cores see the data. Cheaper than a lock.
- Cache Locality. Arrays are contiguous; Trees are scattered in RAM.
Summary
- Logic: Single-Threaded > Multi-Threaded.
- Data: Arrays > Trees.
- Communication: Ring Buffers > Queues.
Questions about this lesson? Working on related infrastructure?
Let's discuss