First Principles of Trading Infrastructure

Why N+1 Redundancy kills your PnL. The physics of Amdahl's Law, Capacity Planning (Microbursts), and the Buy vs Build Matrix.

Intermediate · 40 min read

🎯 What You'll Learn

  • Deconstruct Amdahl's Law (Why Network > C++)
  • Analyze Capacity Planning (Designing for Microbursts)
  • Trace a Trade: The Critical Path vs Async Path
  • Calculate the Cost of Serialization
  • Audit a Build vs Buy Decision

Introduction

Most trading systems are over-engineered in the wrong places. Engineers spend months optimizing C++ loops (a 10 µs gain) while running on AWS (a 10 ms penalty). This ignores Amdahl’s Law.

This lesson explores the First Principles of building a system that can handle $10B of flow without exploding.


The Physics: Amdahl’s Law

Amdahl’s Law states that the overall speedup of a task is limited by the part of the task you do not improve (in its original form, the part that cannot be parallelized). In Trading, it means: Optimize the Bottleneck First.

The Physics:

  • Network: 10,000 microseconds (10ms).
  • Risk Check: 100 microseconds.
  • Strategy Logic: 10 microseconds.

Improving Strategy Logic by 50% saves 5 microseconds. Improving Network by 50% saves 5,000 microseconds. Action: Move servers to Colocation before you rewrite the risk engine in Rust.
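
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the three latency figures from the list; the function name `latency_after` is made up for illustration.

```python
# Latency budget from the list above, in microseconds.
budget_us = {"network": 10_000, "risk_check": 100, "strategy_logic": 10}

def latency_after(budget, component, improvement):
    """Total latency after cutting one component's time by `improvement` (0..1)."""
    improved = dict(budget)
    improved[component] = budget[component] * (1 - improvement)
    return sum(improved.values())

baseline = sum(budget_us.values())
for name in budget_us:
    total = latency_after(budget_us, name, 0.5)
    print(f"{name}: saves {baseline - total:.0f} us, speedup {baseline / total:.3f}x")
```

Halving the network time nearly doubles end-to-end speed, while halving the strategy logic changes the total by well under 0.1%.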


Deep Dive: Capacity Planning (Microbursts)

Trading volume is not a flat line. Arrivals are often modeled as a Poisson Process, and real order flow is typically burstier still. Systems fail during Microbursts: 10,000 orders in 1 millisecond. If your system is designed for “Average Load”, it will die during the crash, exactly when you need it most.
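
To see why the average hides the burst, the sketch below builds a one-second trace and compares the average rate with the busiest millisecond. The trace is hypothetical: the steady rate of 1 message per millisecond is an assumption for illustration, and only the 10,000-in-1-ms burst comes from the text.

```python
from collections import Counter

# Hypothetical one-second trace of message timestamps in microseconds:
# a steady 1 message per millisecond, plus a burst of 10,000 messages
# packed into the single millisecond starting at t = 500 ms.
steady = [ms * 1_000 for ms in range(1_000)]
burst = [500_000 + i // 10 for i in range(10_000)]  # 10 messages per microsecond
timestamps_us = steady + burst

# Count messages per 1 ms window.
per_ms = Counter(t // 1_000 for t in timestamps_us)

average_per_ms = len(timestamps_us) / 1_000
peak_per_ms = max(per_ms.values())

print(f"average: {average_per_ms:.0f} msgs/ms, peak: {peak_per_ms} msgs/ms")
```

Under these assumptions the average is 11 messages per millisecond while the peak window holds roughly 900 times that, so a system sized for the average sees the burst as an overload.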

The Physics: Queueing delay grows without bound as Utilization approaches 100% ($\text{Delay} = \frac{1}{1-\rho}$, where $\rho$ is utilization and delay is measured in multiples of the service time). Rule: Design for 10x Average Load. Run at 10% utilization during normal hours.
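
The formula is easiest to appreciate with numbers plugged in. The sketch below assumes the simple single-server model (M/M/1) that this formula describes; `delay_factor` is an illustrative name, not part of any library.

```python
def delay_factor(utilization):
    """Relative delay 1 / (1 - rho) for utilization rho in [0, 1)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - utilization)

for rho in (0.10, 0.50, 0.90, 0.99):
    print(f"rho={rho:.2f} -> delay factor {delay_factor(rho):.1f}x")
```

At 10% utilization the delay is barely above the raw service time; at 90% it is about ten times the service time, and at 99% about a hundred times.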


Strategy: Build vs Buy Matrix

Should you build your own Matching Engine? No. Unless you are NASDAQ.

The Matrix:

  1. Commodity: FIX Parsers, Normalized Data, Connectivity. BUY. (Exegy, Vela).
  2. Secret Sauce: Strategy Logic, Alpha Models, Risk Parameters. BUILD.
  3. Hybrid: Order Management System (OMS). Buy the base and use its extension points for custom logic.

Physics of “Not Invented Here”: Every line of code you write is a line of code you must debug at 3 AM.


Architecture: Critical Path vs Async

Not everything needs to happen before the order is sent.

Critical Path (Blocking):

  1. Market Data In.
  2. Alpha Signal.
  3. Pre-Trade Risk (Fat Finger).
  4. Order Out.

Async Path (Non-Blocking):

  1. Logging to Disk.
  2. Updating PnL Dashboard.
  3. Writing to Database.
  4. Post-Trade Analysis.

Safety: If the Logging Disk fills up, the Strategy should keep trading. Never block the Critical Path on I/O.


Code: Separating Critical Path

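import queue
import threading

# Bounded so a stalled disk cannot grow memory without limit.
log_queue = queue.Queue(maxsize=10_000)

def write_to_disk(msg):
    # Placeholder sink; swap in your real logger.
    with open("orders.log", "a") as f:
        f.write(msg + "\n")

def send_order():
    # Placeholder for the real order gateway call.
    pass

# The Async Worker
def logger_thread():
    while True:
        msg = log_queue.get()
        write_to_disk(msg)

# The Critical Path
def on_tick(tick):
    # 1. Logic (Fast)
    if tick.price > 100:
        send_order()

        # 2. Logging (Offloaded)
        # Enqueueing costs on the order of a microsecond. Writing to disk is ~1ms.
        try:
            log_queue.put_nowait("Order Sent")
        except queue.Full:
            pass  # Drop the log line rather than block trading.

# Daemon thread so the logger does not keep the process alive on exit.
threading.Thread(target=logger_thread, daemon=True).start()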

Practice Exercises

Exercise 1: The Bottleneck Hunt (Beginner)

Scenario: App takes 50ms to trade. Logic is 1ms. Database write is 49ms. Fix: Move Database write to background thread. Latency drops to 1ms.

Exercise 2: Redundancy Cost (Intermediate)

Scenario: You run 2 servers for redundancy (Active-Active). Problem: They must synchronize state. Sync takes 2ms. Tradeoff: You are now 2ms slower than a single server. Is the reliability worth the latency? (For HFT: No. Run Active-Passive with warm standby).

Exercise 3: The Burst (Advanced)

Scenario: Market crashes. 1M messages/sec. System: Queue fills up in 50ms. OOM (Out of Memory) Crash. Fix: Backpressure. You cannot slow the exchange down, so shed load: drop messages at the ingress once the queue is more than 50% full. It’s better to process some current data than all old data.
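
The 50% rule from Exercise 3 can be sketched as a small ingress queue. This is a single-threaded illustration, not a production design: the class `SheddingQueue`, its methods, and the capacity of 1,000 are all hypothetical.

```python
from collections import deque

class SheddingQueue:
    """Bounded ingress queue that drops new messages above a depth threshold."""

    def __init__(self, capacity, shed_fraction=0.5):
        self.capacity = capacity
        self.shed_depth = int(capacity * shed_fraction)
        self.items = deque()
        self.dropped = 0

    def offer(self, msg):
        """Enqueue msg, or drop it if the queue is deeper than the threshold."""
        if len(self.items) > self.shed_depth:
            self.dropped += 1
            return False
        self.items.append(msg)
        return True

    def poll(self):
        """Return the oldest queued message, or None if empty."""
        return self.items.popleft() if self.items else None

q = SheddingQueue(capacity=1_000)
for seq in range(5_000):  # burst arrives with no consumer draining
    q.offer(seq)
print(len(q.items), q.dropped)
```

Memory stays bounded no matter how long the burst lasts. Note that this version keeps the oldest messages and rejects the newest; a drop-oldest policy, for example `deque(maxlen=...)`, is an alternative that favors fresh data.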


Knowledge Check

  1. What is Amdahl’s Law?
  2. Why is “Average Load” a bad metric?
  3. What belongs on the Critical Path?
  4. Should you build your own FIX Engine?
  5. What is Backpressure?
Answers
  1. Optimization limit. Overall speedup is capped by the part you did not improve, so optimize the largest component (the Bottleneck) first.
  2. Bursts. Systems fail at peak load, not average load.
  3. Core Logic. Only steps strictly necessary to generate the order.
  4. No. It’s a commodity. Buy one. Focus on Alpha.
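  5. Load Shedding. Strictly, backpressure means signaling upstream to slow down; when the source cannot be slowed (a market data feed), it becomes dropping inputs to prevent a system crash during overload.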

Summary

  • Bottlenecks: Find them (Network).
  • Capacity: Design for 10x Average Load.
  • Critical Path: No blocking disk or database I/O.
