Infrastructure
Trading Infrastructure: First Principles That Scale
Architecture decisions that determine your latency ceiling. AWS, Kubernetes, monitoring, and security patterns for crypto trading systems.
Infrastructure decisions made in month one determine your latency ceiling for years.
I’ve built trading infrastructure from scratch at multiple firms: HFT shops, crypto exchanges, DeFi protocols. Teams obsess over algorithm optimization while running on misconfigured infrastructure. The algorithm saves 10µs. The infrastructure costs 100µs.
This post covers the foundational decisions: AWS architecture, Kubernetes patterns, monitoring, and security. Not tweaks, but first principles.
The Problem {#the-problem}
Crypto trading infrastructure faces unique challenges:
| Challenge | Traditional HFT | Crypto Trading |
|---|---|---|
| Location | Colocated, bare-metal | Cloud (AWS) required |
| Latency target | <10µs | 100µs-5ms acceptable |
| Protocol | Proprietary, FIX | WebSocket, REST, FIX varies |
| Uptime | Market hours | 24/7/365 |
| Key management | HSMs | Hot wallets, MPC |
The Physics: Traditional HFT optimizes for nanoseconds using kernel bypass (DPDK, RDMA) on dedicated hardware. Crypto trading operates on different physics: the limiting factor is network RTT to exchanges (50-200ms), not local processing. This means we optimize for reliability and observability first, then latency.
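A quick way to ground that claim is to measure the network leg yourself. A minimal sketch that times TCP handshakes to an exchange API endpoint as a rough RTT proxy (the hostname is illustrative; run it from inside the trading VPC):
import socket
import statistics
import time

def tcp_connect_rtt(host: str, port: int = 443, samples: int = 5) -> list[float]:
    """Time TCP handshakes in milliseconds as a proxy for network RTT."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        rtts.append((time.perf_counter() - start) * 1000)
    return rtts

if __name__ == "__main__":
    rtts = tcp_connect_rtt("api.binance.com")  # example endpoint
    print(f"median connect RTT: {statistics.median(rtts):.1f} ms")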
For kernel-level optimizations, see the deep dive series linked in the Reading Path at the end of this post.
AWS Architecture for Trading {#aws}
VPC Design
Trading VPCs need:
- Private subnets for trading engines (no public IPs)
- NAT gateways for outbound exchange connectivity
- VPC endpoints for AWS services (no internet traversal)
# Terraform: Trading VPC
resource "aws_vpc" "trading" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.trading.id
  cidr_block        = cidrsubnet(aws_vpc.trading.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "trading-private-${count.index}"
  }
}

# VPC endpoint for Secrets Manager (no internet)
resource "aws_vpc_endpoint" "secrets" {
  vpc_id              = aws_vpc.trading.id
  service_name        = "com.amazonaws.${var.region}.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
Instance Selection
| Use Case | Instance Type | Why |
|---|---|---|
| Trading engine | c6in.xlarge | Network-optimized family (up to 200 Gbps on the largest sizes) |
| Market data | c6in.xlarge | Same choice; the network is the bottleneck |
| Risk engine | r6i.xlarge | Memory-optimized for state |
| Monitoring | t3.large | Cost-effective, not latency-critical |
Citation: AWS Instance Types.
Placement Groups
Critical for inter-instance latency:
resource "aws_placement_group" "trading" {
name = "trading-cluster"
strategy = "cluster" # Same rack
}
resource "aws_instance" "trading" {
instance_type = "c6in.xlarge"
placement_group = aws_placement_group.trading.id
subnet_id = aws_subnet.private[0].id
}
Cluster placement packs instances onto the same high-bandwidth network segment within one AZ. Inter-instance latency: ~50µs (vs ~500µs with default placement).
Trade-off: Single AZ = single point of failure. Acceptable for trading engines; DR handled at application level.
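Whatever the placement strategy, measure inter-instance latency yourself rather than trusting nominal figures. A rough TCP echo timer sketch (the port, message size, and sample count are arbitrary; run the server on one instance, the client on the other, with the port open in the security group):
import socket
import sys
import time

PORT = 9999  # arbitrary test port

def server() -> None:
    with socket.create_server(("0.0.0.0", PORT)) as srv:
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(64):
                conn.sendall(data)  # echo back immediately

def client(host: str, samples: int = 1000) -> None:
    with socket.create_connection((host, PORT)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        rtts = []
        for _ in range(samples):
            start = time.perf_counter()
            sock.sendall(b"ping")
            sock.recv(64)
            rtts.append((time.perf_counter() - start) * 1e6)  # microseconds
        rtts.sort()
        print(f"p50={rtts[len(rtts) // 2]:.0f}µs  p99={rtts[int(len(rtts) * 0.99)]:.0f}µs")

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])  # usage: python rtt_echo.py client <server-ip>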
Kubernetes Patterns {#kubernetes}
Why StatefulSets
Trading workloads need:
- Persistent identity (pod-0 handles BTC, pod-1 handles ETH)
- Ordered scaling (risk engines start before trading)
- Persistent storage (state survives restarts)
Deployments don’t provide these. See Kubernetes Deep Dive for full details.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine  # must match the selector above
    spec:
      containers:
      - name: engine
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # POD_NAME = trading-engine-0, trading-engine-1, etc.
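The stable ordinal in POD_NAME is what makes deterministic work assignment possible. A small sketch of how an engine might map its ordinal to a market (the symbol list is illustrative; a real deployment would load it from config):
import os

SYMBOLS = ["BTC-USDT", "ETH-USDT", "SOL-USDT"]  # one entry per replica (illustrative)

def assigned_symbol() -> str:
    pod_name = os.environ["POD_NAME"]          # e.g. "trading-engine-1"
    ordinal = int(pod_name.rsplit("-", 1)[1])  # stable across restarts and rescheduling
    return SYMBOLS[ordinal]

if __name__ == "__main__":
    print(f"{os.environ['POD_NAME']} handles {assigned_symbol()}")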
Resource Configuration
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"
Requests vs limits: Requests are guaranteed. Limits are maximums. For trading:
- Set request = expected usage
- Set limit = 2x request (room for bursts without OOM kill)
Node Affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - trading
Dedicated node groups prevent resource contention with other workloads.
Multi-Exchange Connectivity {#exchanges}
Architecture: Connector Per Exchange
Each exchange has different:
- Rate limits
- Message formats
- Authentication
- Reconnection behavior
Design principle: Fault isolation. Binance down shouldn’t affect Coinbase.
# Per-exchange deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: connector-binance
spec:
  replicas: 2  # Hot standby
  selector:
    matchLabels:
      app: connector-binance
  template:
    metadata:
      labels:
        app: connector-binance
    spec:
      containers:
      - name: connector
        env:
        - name: EXCHANGE
          value: "binance"
        - name: WS_ENDPOINT
          value: "wss://stream.binance.com:9443"
WebSocket Reliability
WebSockets disconnect silently, and TCP keepalive alone won’t catch it. See Orderbook Infrastructure for the full resilience patterns; a minimal reconnection sketch follows the list below:
- Heartbeat monitoring
- Staleness detection
- Automatic reconnection with backoff
- Prometheus metrics for reliability
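A minimal sketch of those patterns combined, using the websockets library (the stream URL, staleness threshold, and backoff ceiling are illustrative):
import asyncio
import random
import time

import websockets

WS_URL = "wss://stream.binance.com:9443/ws/btcusdt@depth"  # example stream
STALE_AFTER_S = 5.0   # no message for this long => treat the feed as dead
MAX_BACKOFF_S = 30.0

async def run_feed() -> None:
    backoff = 1.0
    while True:
        try:
            async with websockets.connect(WS_URL, ping_interval=20) as ws:
                backoff = 1.0  # a healthy connection resets the backoff
                last_msg = time.monotonic()
                while True:
                    try:
                        msg = await asyncio.wait_for(ws.recv(), timeout=1.0)
                        last_msg = time.monotonic()
                        handle(msg)
                    except asyncio.TimeoutError:
                        if time.monotonic() - last_msg > STALE_AFTER_S:
                            raise ConnectionError("stale feed, forcing reconnect")
        except Exception as exc:
            print(f"ws error: {exc!r}; reconnecting in {backoff:.1f}s")
            await asyncio.sleep(backoff + random.random())  # backoff with jitter
            backoff = min(backoff * 2, MAX_BACKOFF_S)

def handle(msg) -> None:
    pass  # parse and hand off to the orderbook pipeline (placeholder)

if __name__ == "__main__":
    asyncio.run(run_feed())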
Monitoring That Matters {#monitoring}
The Mistake: Infrastructure Metrics
CPU, memory, disk: these are necessary but not sufficient. A server can sit at 5% CPU while:
- Fills are 10x slower than expected
- Positions have drifted from exchange
- Market data is stale
Trading-Specific SLOs
See Monitoring Deep Dive for complete details.
Essential metrics:
from prometheus_client import Histogram, Gauge, Counter

# Fill latency (P50, P95, P99)
FILL_LATENCY = Histogram(
    'trading_fill_latency_seconds',
    'Order submission to fill confirmation',
    ['exchange'],
    buckets=[0.001, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
)

# Position drift (internal vs exchange)
POSITION_DRIFT = Gauge(
    'trading_position_drift_percent',
    'Difference between calculated and exchange position',
    ['exchange', 'symbol']
)

# Market data staleness
MARKET_DATA_AGE = Gauge(
    'trading_market_data_age_seconds',
    'Time since last orderbook update',
    ['exchange', 'symbol']
)
Alert Hierarchy
PAGE IMMEDIATELY (wake me at 2am):
├── Position drift > 1%
├── No fills in 5 minutes (while strategies are live)
├── WebSocket down > 30 seconds
└── Loss limit exceeded
SLACK (business hours):
├── Fill latency P99 > 1s
├── Rejection rate > 3%
└── Rate limit > 70%
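The drift page is only as useful as the reconciliation behind it. A sketch of the step that updates the gauge defined earlier, assuming the >1% page fires from an alerting rule on trading_position_drift_percent rather than in application code:
def drift_percent(internal: float, on_exchange: float) -> float:
    # Drift relative to the exchange-reported position; a non-zero internal
    # position against a flat exchange position counts as full drift.
    if on_exchange == 0:
        return 0.0 if internal == 0 else 100.0
    return abs(internal - on_exchange) / abs(on_exchange) * 100.0

def reconcile(exchange: str, symbol: str, internal: float, on_exchange: float) -> None:
    # An alerting rule pages when this gauge exceeds 1 (see the hierarchy above).
    POSITION_DRIFT.labels(exchange=exchange, symbol=symbol).set(
        drift_percent(internal, on_exchange)
    )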
Security at Scale {#security}
Defense in Depth
Layers:
- VPC isolation (private subnets)
- Security groups (minimal ports)
- Secrets management (Secrets Manager, no env vars)
- Key rotation (automated)
- Audit logging (CloudTrail)
API Key Management
# AWS Secrets Manager
resource "aws_secretsmanager_secret" "exchange_keys" {
  name = "trading/exchange-api-keys"
}

# Automatic rotation
resource "aws_secretsmanager_secret_rotation" "api_keys" {
  secret_id           = aws_secretsmanager_secret.exchange_keys.id
  rotation_lambda_arn = aws_lambda_function.rotator.arn

  rotation_rules {
    automatically_after_days = 30
  }
}
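On the application side, connectors fetch keys from Secrets Manager at startup instead of reading them from environment variables. A sketch with boto3 (the JSON layout of the secret value is an assumption):
import json
import boto3

_client = boto3.client("secretsmanager")  # resolves via the VPC endpoint's private DNS

def load_exchange_keys(secret_id: str = "trading/exchange-api-keys") -> dict:
    resp = _client.get_secret_value(SecretId=secret_id)
    return json.loads(resp["SecretString"])  # e.g. {"binance": {"key": "...", "secret": "..."}}

# With 30-day rotation, long-lived processes should re-fetch on auth failures
# rather than caching keys for the life of the process.
keys = load_exchange_keys()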
Hot Wallet Security
Principle: Minimize hot wallet exposure.
- Cold storage: 95%+ of funds, air-gapped
- Hot wallets: Trading capital only
- MPC: No single point of compromise
Track record: $500M+ capital managed, zero breaches.
CI/CD for Trading {#cicd}
Zero-Downtime Deployments
ArgoCD rollout strategy:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: trading-engine
spec:
  strategy:
    canary:
      steps:
      - setWeight: 10
      - pause: {duration: 5m}
      - analysis:
          templates:
          - templateName: trading-health
      - setWeight: 50
      - pause: {duration: 5m}
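The trading-health analysis step is where a bad canary should get caught before it takes 50% of flow. A sketch of the kind of gate it might run against Prometheus (the URL, query, and threshold are illustrative):
import json
import sys
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.monitoring:9090/api/v1/query"
QUERY = 'histogram_quantile(0.99, rate(trading_fill_latency_seconds_bucket[5m]))'
THRESHOLD_S = 1.0

def p99_fill_latency() -> float:
    url = f"{PROM_URL}?{urllib.parse.urlencode({'query': QUERY})}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        result = json.load(resp)["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    p99 = p99_fill_latency()
    print(f"fill latency p99 = {p99:.3f}s (threshold {THRESHOLD_S}s)")
    sys.exit(0 if p99 <= THRESHOLD_S else 1)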
Pre-Deployment Checks
# GitHub Actions
jobs:
  pre-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Run latency-audit
        run: pip install latency-audit && latency-audit --json
      - name: Test exchange connectivity
        run: ./scripts/test_exchanges.sh
      - name: Validate configs
        run: ./scripts/validate_configs.py
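What validate_configs.py actually checks is system-specific; a hypothetical sketch, assuming strategy configs are YAML files with a few required risk fields:
import sys
from pathlib import Path

import yaml  # PyYAML

REQUIRED = {"exchange", "symbols", "max_position", "loss_limit"}

def validate(path: Path) -> list[str]:
    cfg = yaml.safe_load(path.read_text()) or {}
    errors = [f"{path}: missing {key}" for key in REQUIRED - cfg.keys()]
    if cfg.get("loss_limit", 0) <= 0:
        errors.append(f"{path}: loss_limit must be positive")
    return errors

if __name__ == "__main__":
    errors = [e for p in Path("configs").glob("*.yaml") for e in validate(p)]
    print("\n".join(errors) or "all configs valid")
    sys.exit(1 if errors else 0)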
Design Philosophy {#design-philosophy}
First Principles
1. Latency is a system property, not a code property.
Your algorithm runs in 10µs. But:
- Network adds 50µs (kernel stack)
- Memory adds 100µs (THP compaction)
- CPU adds 50µs (C-state wake)
Total: 210µs. The algorithm is 5% of the problem.
2. Reliability enables performance.
You can’t optimize a system that’s down. Build reliable first, fast second.
3. Observability drives optimization.
You can’t fix what you can’t measure. Instrument everything.
4. Security is non-negotiable.
One breach erases years of profits. Defense in depth, always.
When to Break Rules
These patterns are for production trading systems. Other environments have different priorities:
- Development: Defaults are fine
- Backtesting: Throughput matters more
- Paper trading: Reliability testing, not latency
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |