First Principles of Trading Infrastructure
Why N+1 Redundancy kills your PnL. The physics of Amdahl's Law, Capacity Planning (Microbursts), and the Buy vs Build Matrix.
🎯 What You'll Learn
- Deconstruct Amdahl's Law (Why Network > C++)
- Analyze Capacity Planning (Designing for Microbursts)
- Trace a Trade: The Critical Path vs Async Path
- Calculate the Cost of Serialization
- Audit a Build vs Buy Decision
Introduction
Most trading systems are over-engineered in the wrong places. Engineers spend months optimizing C++ loops (10us gain) while running on AWS (10ms penalty). This ignores Amdahl’s Law.
This lesson explores the First Principles of building a system that can handle $10B of flow without exploding.
The Physics: Amdahl’s Law
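Amdahl’s Law states that the overall speedup from improving one part of a task is limited by the fraction of total time that part accounts for. In its classic form, the part that cannot be parallelized caps the gain from adding processors. In Trading, it means: Optimize the Bottleneck First.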
The Physics:
- Network: 10,000 microseconds (10ms).
- Risk Check: 100 microseconds.
- Strategy Logic: 10 microseconds.
Improving Strategy Logic by 50% saves 5 microseconds. Improving Network by 50% saves 5,000 microseconds. Action: Move servers to Colocation before you rewrite the risk engine in Rust.
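The arithmetic above can be checked with a short sketch. It uses the latency budget from this lesson and the standard Amdahl formula, speedup = 1 / ((1 - p) + p / s), where p is a component's share of total latency and s is how much faster that component becomes. The function and variable names are illustrative, and "improving by 50%" is read here as halving the component's latency (s = 2).

```python
# Latency budget from the lesson, in microseconds.
budget_us = {"network": 10_000, "risk_check": 100, "strategy_logic": 10}


def amdahl_speedup(budget, component, local_speedup):
    """Overall speedup when one component becomes `local_speedup` times faster."""
    total = sum(budget.values())
    p = budget[component] / total  # share of total latency spent in this component
    return 1.0 / ((1.0 - p) + p / local_speedup)


def time_saved_us(budget, component, local_speedup):
    """Microseconds saved when one component becomes `local_speedup` times faster."""
    return budget[component] * (1.0 - 1.0 / local_speedup)


for name in budget_us:
    saved = time_saved_us(budget_us, name, 2.0)
    overall = amdahl_speedup(budget_us, name, 2.0)
    print(f"{name}: saves {saved:.0f}us, end-to-end speedup {overall:.4f}x")
```

Halving the network latency nearly doubles end-to-end speed, while halving the strategy logic changes the total by about 0.05%.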
Deep Dive: Capacity Planning (Microbursts)
Trading volume is not a flat line. Arrivals are random and clustered: a Poisson Process is the usual first approximation, and real order flow is burstier still. Systems fail during Microbursts, for example 10,000 orders in 1 millisecond. If your system is designed for “Average Load”, it will die during the crash, exactly when you need it most.
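The Physics: Queues grow nonlinearly, and without bound, as Utilization approaches 100%. In the simplest queueing model (M/M/1), the mean number of messages in the system is ρ / (1 − ρ), where ρ is utilization: about 9 at 90% and 99 at 99%. Rule: Design for 10x Average Load. Run at 10% utilization during normal hours.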
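To make the utilization rule concrete, here is a minimal sketch. It assumes the simplest textbook queueing model (M/M/1: Poisson arrivals, exponential service times, one server), which is an assumption and not something this lesson specifies. Under that model the mean number of messages in the system is ρ / (1 − ρ). Real feeds are burstier than Poisson, so treat these numbers as an optimistic floor.

```python
def mm1_mean_in_system(utilization):
    """Mean number of messages in an M/M/1 system: rho / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


for rho in (0.10, 0.50, 0.90, 0.99):
    backlog = mm1_mean_in_system(rho)
    print(f"utilization {rho:.0%}: mean messages in system = {backlog:.2f}")
```

Going from 50% to 90% utilization multiplies the expected backlog by nine, and going from 90% to 99% multiplies it by eleven again. This is why running at 10% utilization in normal hours is headroom rather than waste.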
Strategy: Build vs Buy Matrix
Should you build your own Matching Engine? No. Unless you are NASDAQ.
The Matrix:
- Commodity: FIX Parsers, Normalized Data, Connectivity. BUY. (Exegy, Vela).
- Secret Sauce: Strategy Logic, Alpha Models, Risk Parameters. BUILD.
- Hybrid: Order Management System (OMS). Buy the base and use its extension points for custom logic.
Physics of “Not Invented Here”: Every line of code you write is a line of code you must debug at 3 AM.
Architecture: Critical Path vs Async
Not everything needs to happen before the order is sent.
Critical Path (Blocking):
- Market Data In.
- Alpha Signal.
- Pre-Trade Risk (Fat Finger).
- Order Out.
Async Path (Non-Blocking):
- Logging to Disk.
- Updating PnL Dashboard.
- Writing to Database.
- Post-Trade Analysis.
Safety: If the Logging Disk fills up, the Strategy should keep trading. Never block the Critical Path on disk or database I/O.
Code: Separating Critical Path
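import queue
import threading

# Bounded queue between the Critical Path and the Async Worker
log_queue = queue.Queue(maxsize=10_000)

def write_to_disk(msg):
    # Placeholder for the slow I/O call (file, database, ...)
    pass

def send_order():
    # Placeholder for the order gateway call
    pass

# The Async Worker
def logger_thread():
    while True:
        msg = log_queue.get()  # Blocks only this worker thread
        write_to_disk(msg)

# The Critical Path
def on_tick(tick):
    # 1. Logic (Fast)
    if tick.price > 100:
        send_order()
        # 2. Logging (Offloaded)
        # Enqueueing is orders of magnitude cheaper than a ~1ms disk write.
        try:
            log_queue.put_nowait("Order Sent")
        except queue.Full:
            pass  # Drop the log line rather than block trading

# Daemon thread: it does not keep the process alive at shutdown
threading.Thread(target=logger_thread, daemon=True).start()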
Practice Exercises
Exercise 1: The Bottleneck Hunt (Beginner)
Scenario: App takes 50ms to trade. Logic is 1ms. Database write is 49ms. Fix: Move Database write to background thread. Latency drops to 1ms.
Exercise 2: Redundancy Cost (Intermediate)
Scenario: You run 2 servers for redundancy (Active-Active). Problem: They must synchronize state, and each sync takes 2ms. Tradeoff: You are now 2ms slower than a single server. Is the reliability worth the latency? (For HFT: No. Run Active-Passive with a warm standby.)
Exercise 3: The Burst (Advanced)
Scenario: Market crashes. 1M messages/sec. System: The queue fills up in 50ms, then the process dies with an OOM (Out of Memory) crash. Fix: Backpressure. Drop messages at the ingress once the queue is more than 50% full. It’s better to process some current data than all old data.
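The fix in Exercise 3 can be sketched as a bounded ingress queue that sheds load above a threshold. This is a minimal illustration, not a production design: the `Ingress` class, its method names, and the `shed_threshold` parameter are hypothetical, and the 50% default simply mirrors the exercise.

```python
import queue


class Ingress:
    """Bounded ingress queue that drops new messages once it is too full."""

    def __init__(self, capacity, shed_threshold=0.5):
        self._queue = queue.Queue(maxsize=capacity)
        self._shed_above = int(capacity * shed_threshold)
        self.dropped = 0  # count of messages shed so far

    def offer(self, msg):
        """Enqueue msg unless the queue is above the threshold. Never blocks."""
        if self._queue.qsize() > self._shed_above:
            self.dropped += 1
            return False
        try:
            self._queue.put_nowait(msg)
        except queue.Full:
            self.dropped += 1
            return False
        return True

    def poll(self):
        """Return the next message, or None if the queue is empty. Never blocks."""
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


# A burst of 100 messages hits a queue of capacity 10 with no consumer running.
ingress = Ingress(capacity=10)
accepted = sum(ingress.offer(i) for i in range(100))
print(f"accepted={accepted} dropped={ingress.dropped}")
```

Memory stays bounded no matter how large the burst is, and the `dropped` counter gives you something to alert on. This sketch rejects the newest arrivals. A drop-oldest policy is an alternative when only the latest state matters, as with quote updates.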
Knowledge Check
- What is Amdahl’s Law?
- Why is “Average Load” a bad metric?
- What belongs on the Critical Path?
- Should you build your own FIX Engine?
- What is Backpressure?
Answers
- Optimization limit. The overall gain from speeding up one component is capped by that component’s share of total time, so optimize the slowest component (Bottleneck) first.
- Bursts. Systems fail at peak load, not average load.
- Core Logic. Only steps strictly necessary to generate the order.
- No. It’s a commodity. Buy one. Focus on Alpha.
- Load Shedding. Pushing overload back to the ingress, here by dropping inputs, to prevent a system crash.
Summary
- Bottlenecks: Find them (Network).
- Capacity: 10x Average Load.
- Critical Path: Zero blocking disk or database I/O.