Infrastructure

Trading Infrastructure: First Principles That Scale

Architecture decisions that determine your latency ceiling. AWS, Kubernetes, monitoring, and security patterns for crypto trading systems.

6 min
#trading #infrastructure #aws #kubernetes #architecture #sre #crypto

Infrastructure decisions made in month one determine your latency ceiling for years.

I’ve built trading infrastructure from scratch at multiple firms: HFT shops, crypto exchanges, DeFi protocols. Teams obsess over algorithm optimization while running on misconfigured infrastructure. The algorithm saves 10µs; the infrastructure costs 100µs.

This post covers the foundational decisions: AWS architecture, Kubernetes patterns, monitoring, and security. Not tweaks, but first principles.

The Problem {#the-problem}

Crypto trading infrastructure faces unique challenges:

| Challenge | Traditional HFT | Crypto Trading |
|---|---|---|
| Location | Colocated, bare-metal | Cloud (AWS) required |
| Latency target | <10µs | 100µs-5ms acceptable |
| Protocol | Proprietary, FIX | WebSocket, REST, FIX (varies by venue) |
| Uptime | Market hours | 24/7/365 |
| Key management | HSMs | Hot wallets, MPC |

The Physics: Traditional HFT optimizes for nanoseconds using kernel bypass (DPDK, RDMA) on dedicated hardware. Crypto trading operates on different physics: the limiting factor is network RTT to exchanges (50-200ms), not local processing. This means we optimize for reliability and observability first, then latency.
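
A quick way to see this for yourself is to time the TCP+TLS setup to an exchange gateway from the region you trade in. The sketch below is illustrative (endpoint and sample count are assumptions, and a handshake spans multiple round trips, so this is an upper bound on RTT):

# Rough RTT check: TCP connect + TLS handshake to an exchange gateway.
# Run from the instance/region you actually trade from.
import socket
import ssl
import time

HOST, PORT = "stream.binance.com", 9443  # illustrative endpoint

def handshake_ms(host: str, port: int) -> float:
    start = time.monotonic()
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host):
            pass  # connection established and TLS handshake complete
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    samples = sorted(handshake_ms(HOST, PORT) for _ in range(5))
    print(f"median TCP+TLS setup: {samples[len(samples) // 2]:.1f} ms")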

For kernel-level optimizations, see the deep dive series in the Reading Path at the end of this post.

AWS Architecture for Trading {#aws}

VPC Design

Trading VPCs need:

  1. Private subnets for trading engines (no public IPs)
  2. NAT gateways for outbound exchange connectivity
  3. VPC endpoints for AWS services (no internet traversal)
# Terraform: Trading VPC
resource "aws_vpc" "trading" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

# AZ list referenced by the subnets below
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.trading.id
  cidr_block        = cidrsubnet(aws_vpc.trading.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "trading-private-${count.index}"
  }
}

# VPC endpoint for Secrets Manager (no internet)
resource "aws_vpc_endpoint" "secrets" {
  vpc_id              = aws_vpc.trading.id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
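
Item 2 (NAT gateways for outbound exchange connectivity) isn’t shown above. A minimal sketch, assuming a public subnet (`aws_subnet.public`) already exists in the same module:

# NAT gateway path for outbound exchange traffic (sketch)
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "trading" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id  # assumed public subnet
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.trading.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.trading.id
  }
}

resource "aws_route_table_association" "private" {
  count          = 2
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}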

Instance Selection

| Use Case | Instance Type | Why |
|---|---|---|
| Trading engine | c6in.xlarge | Network-optimized (c6in family scales to 200 Gbps) |
| Market data | c6in.xlarge | Same family; the network is the bottleneck |
| Risk engine | r6i.xlarge | Memory-optimized for state |
| Monitoring | t3.large | Cost-effective, not latency-critical |

Citation: AWS Instance Types.

Placement Groups

Critical for inter-instance latency:

resource "aws_placement_group" "trading" {
  name     = "trading-cluster"
  strategy = "cluster"  # Same rack
}

resource "aws_instance" "trading" {
  instance_type   = "c6in.xlarge"
  placement_group = aws_placement_group.trading.id
  subnet_id       = aws_subnet.private[0].id
}

Cluster placement packs instances close together in a single AZ, often on the same rack. Inter-instance latency: ~50µs, vs ~500µs with default (unconstrained) placement.

Trade-off: single AZ = single point of failure. Acceptable for trading engines; DR is handled at the application level.

Kubernetes Patterns {#kubernetes}

Why StatefulSets

Trading workloads need:

  • Persistent identity (pod-0 handles BTC, pod-1 handles ETH)
  • Ordered scaling (risk engines start before trading)
  • Persistent storage (state survives restarts)

Deployments don’t provide these. See Kubernetes Deep Dive for full details.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine  # must match the selector above
    spec:
      containers:
      - name: engine
        image: trading-engine:latest  # placeholder image
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # POD_NAME = trading-engine-0, trading-engine-1, etc.

Resource Configuration

resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"

Requests vs limits: Requests are guaranteed. Limits are maximums. For trading:

  • Set request = expected usage
  • Set limit = 2x request (room for bursts without OOM kill)

Node Affinity

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - trading

Dedicated node groups prevent resource contention with other workloads.
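
Affinity only pulls trading pods onto those nodes; it doesn’t keep other workloads off them. A common complement, sketched here with an assumed `dedicated=trading` taint, is to taint the node group and add a matching toleration:

# Taint the dedicated node group, e.g.:
#   kubectl taint nodes <node-name> dedicated=trading:NoSchedule
# then let trading pods tolerate it:
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "trading"
    effect: "NoSchedule"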

Multi-Exchange Connectivity {#exchanges}

Architecture: Connector Per Exchange

Each exchange has different:

  • Rate limits
  • Message formats
  • Authentication
  • Reconnection behavior

Design principle: Fault isolation. Binance down shouldn’t affect Coinbase.

# Per-exchange deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: connector-binance
spec:
  replicas: 2  # Hot standby
  selector:
    matchLabels:
      app: connector-binance
  template:
    metadata:
      labels:
        app: connector-binance
    spec:
      containers:
      - name: connector
        image: exchange-connector:latest  # placeholder image
        env:
        - name: EXCHANGE
          value: "binance"
        - name: WS_ENDPOINT
          value: "wss://stream.binance.com:9443"

WebSocket Reliability

WebSockets silently disconnect, and TCP keepalive isn’t reliable enough to catch it. See Orderbook Infrastructure for the full resilience patterns; a minimal reconnection sketch follows the list:

  • Heartbeat monitoring
  • Staleness detection
  • Automatic reconnection with backoff
  • Prometheus metrics for reliability
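
A minimal sketch of the reconnection and staleness pieces, assuming the `websockets` library; the endpoint and thresholds are illustrative, and `handle_message` is a hypothetical hook:

# Reconnection loop with exponential backoff and staleness detection (sketch)
import asyncio

import websockets

WS_ENDPOINT = "wss://stream.binance.com:9443/ws/btcusdt@depth"  # illustrative
STALE_AFTER_S = 5    # no message for this long => treat the feed as dead
MAX_BACKOFF_S = 30

def handle_message(msg: str) -> None:
    # Hypothetical hook: parse and forward to the orderbook builder
    pass

async def run_connector() -> None:
    backoff = 1
    while True:
        try:
            async with websockets.connect(WS_ENDPOINT, ping_interval=20) as ws:
                backoff = 1  # reset backoff after a successful connect
                while True:
                    try:
                        msg = await asyncio.wait_for(ws.recv(), timeout=STALE_AFTER_S)
                    except asyncio.TimeoutError:
                        break  # stale feed: force a reconnect rather than trust the socket
                    handle_message(msg)
        except (websockets.ConnectionClosed, OSError):
            pass  # fall through to backoff and reconnect
        await asyncio.sleep(backoff)
        backoff = min(backoff * 2, MAX_BACKOFF_S)

if __name__ == "__main__":
    asyncio.run(run_connector())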

Monitoring That Matters {#monitoring}

The Mistake: Infrastructure Metrics

CPU, memory, disk: these are necessary but not sufficient. A server can sit at 5% CPU while:

  • Fills are 10x slower than expected
  • Positions have drifted from exchange
  • Market data is stale

Trading-Specific SLOs

See Monitoring Deep Dive for complete details.

Essential metrics:

from prometheus_client import Histogram, Gauge, Counter

# Fill latency (P50, P95, P99)
FILL_LATENCY = Histogram(
    'trading_fill_latency_seconds',
    'Order submission to fill confirmation',
    ['exchange'],
    buckets=[0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
)

# Position drift (internal vs exchange)
POSITION_DRIFT = Gauge(
    'trading_position_drift_percent',
    'Difference between calculated and exchange position',
    ['exchange', 'symbol']
)

# Market data staleness
MARKET_DATA_AGE = Gauge(
    'trading_market_data_age_seconds',
    'Time since last orderbook update',
    ['exchange', 'symbol']
)
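
These metrics only matter if something updates them. A sketch of how they might be wired into the order path and a periodic reconciliation job, using the metric objects defined above; `submit_order` and `fetch_exchange_position` are hypothetical helpers:

# Wiring the metrics into the hot path and a reconciliation loop (sketch)
import time

def submit_and_track(exchange: str, order) -> None:
    start = time.monotonic()
    submit_order(exchange, order)  # hypothetical: returns once the fill is confirmed
    FILL_LATENCY.labels(exchange=exchange).observe(time.monotonic() - start)

def reconcile_position(exchange: str, symbol: str, internal_qty: float) -> None:
    exchange_qty = fetch_exchange_position(exchange, symbol)  # hypothetical
    drift_pct = abs(internal_qty - exchange_qty) / max(abs(exchange_qty), 1e-9) * 100
    POSITION_DRIFT.labels(exchange=exchange, symbol=symbol).set(drift_pct)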

Alert Hierarchy

PAGE IMMEDIATELY (wake me at 2am):
├── Position drift > 1%
├── No fills in 5 minutes (market hours)
├── WebSocket down > 30 seconds
└── Loss limit exceeded

SLACK (business hours):
├── Fill latency P99 > 1s
├── Rejection rate > 3%
└── Rate limit > 70%
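
As one concrete example, the position-drift page could be expressed as a Prometheus alerting rule against the metrics defined above (a sketch; routing page vs. Slack severities is handled in Alertmanager):

# prometheus-rules.yaml (sketch)
groups:
- name: trading-critical
  rules:
  - alert: PositionDriftHigh
    expr: trading_position_drift_percent > 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Position drift >1% on {{ $labels.exchange }}/{{ $labels.symbol }}"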

Security at Scale {#security}

Defense in Depth

Layers:

  1. VPC isolation (private subnets)
  2. Security groups (minimal ports)
  3. Secrets management (Secrets Manager, no env vars)
  4. Key rotation (automated)
  5. Audit logging (CloudTrail)

API Key Management

# AWS Secrets Manager
resource "aws_secretsmanager_secret" "exchange_keys" {
  name = "trading/exchange-api-keys"
}

# Automatic rotation
resource "aws_secretsmanager_secret_rotation" "api_keys" {
  secret_id           = aws_secretsmanager_secret.exchange_keys.id
  rotation_lambda_arn = aws_lambda_function.rotator.arn
  
  rotation_rules {
    automatically_after_days = 30
  }
}
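
At runtime the engines pull keys from that secret instead of reading environment variables. A minimal sketch with boto3, assuming the secret holds a JSON map of per-exchange credentials:

# Load exchange API keys from Secrets Manager at startup (no env vars)
import json

import boto3

def load_exchange_keys(region: str) -> dict:
    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId="trading/exchange-api-keys")
    return json.loads(resp["SecretString"])

keys = load_exchange_keys("us-east-1")  # illustrative region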

Hot Wallet Security

Principle: Minimize hot wallet exposure.

  • Cold storage: 95%+ of funds, air-gapped
  • Hot wallets: Trading capital only
  • MPC: No single point of compromise

Track record: $500M+ capital managed, zero breaches.

CI/CD for Trading {#cicd}

Zero-Downtime Deployments

ArgoCD rollout strategy:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: trading-engine
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: trading-health
      - setWeight: 50
      - pause: {duration: 5m}

Pre-Deployment Checks

# GitHub Actions
jobs:
  pre-deploy:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Run latency-audit
      run: pip install latency-audit && latency-audit --json

    - name: Test exchange connectivity
      run: ./scripts/test_exchanges.sh

    - name: Validate configs
      run: ./scripts/validate_configs.py

Design Philosophy {#design-philosophy}

First Principles

1. Latency is a system property, not a code property.

Your algorithm runs in 10µs. But:

  • Network adds 50µs (kernel stack)
  • Memory adds 100µs (THP compaction)
  • CPU adds 50µs (C-state wake)

Total: 210µs. The algorithm is 5% of the problem.

2. Reliability enables performance.

You can’t optimize a system that’s down. Build reliable first, fast second.

3. Observability drives optimization.

You can’t fix what you can’t measure. Instrument everything.

4. Security is non-negotiable.

One breach erases years of profits. Defense in depth, always.

When to Break Rules

These patterns are for production trading systems. For other environments:

  • Development: Defaults are fine
  • Backtesting: Throughput matters more
  • Paper trading: Reliability testing, not latency

Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
|---|---|
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |