Infrastructure

Kubernetes StatefulSets: Why Trading Systems Need State

Deep dive into StatefulSets vs Deployments, pod identity, PersistentVolumes, and graceful shutdown patterns for trading infrastructure.

6 min
#kubernetes #trading #statefulsets #eks #infrastructure #devops

Deployments assume pods are fungible. Any instance can handle any request. Trading systems are the opposite.

Your trading engine holds state: exchange connections, position tracking, order IDs. Kill a pod, lose the state, lose money. Restart with a different identity, create duplicate orders.

This post covers why StatefulSets are essential for trading, how they work internally, and the complete configuration pattern.

The Problem {#the-problem}

Deployment failure mode:

# WRONG
apiVersion: apps/v1
kind: Deployment
metadata:
  name: trading-bot
spec:
  replicas: 3

What happens:

  1. All 3 pods connect to Binance
  2. All 3 receive same market data
  3. All 3 try to execute same trade
  4. 2/3 rejected as duplicates
  5. Rate limits exhausted

Root cause: Pods have random names (trading-bot-7d8f9-xyz). No identity assignment. No leader election. No state persistence.
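
To see why identity matters, look at what the work-partitioning pattern later in this post actually needs: a pod name it can turn into a deterministic assignment. An illustrative sketch (pod names are hypothetical, mirroring the examples above):

def ordinal_of(pod_name: str):
    # Return the trailing ordinal if the name has one, else None
    suffix = pod_name.rsplit('-', 1)[-1]
    return int(suffix) if suffix.isdigit() else None

print(ordinal_of('trading-bot-7d8f9-xyz'))  # None: random Deployment suffix, nothing to assign by
print(ordinal_of('trading-engine-1'))       # 1: stable StatefulSet ordinal, always maps to the same markets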

For the broader architecture context, see First Principles. For kernel-level tuning on Kubernetes nodes, see CPU Optimization.

Background: Kubernetes Scheduling {#background}

How Deployments Work

Deployments manage ReplicaSets (controller source):

Deployment → ReplicaSet → Pods

Key behaviors:

  • Pods get random suffixes
  • Any pod can be killed first during scale-down
  • PersistentVolumeClaims are shared (if any)
  • No ordering guarantees

How StatefulSets Work

StatefulSets provide ordered, persistent identity (controller source):

StatefulSet → Pods with stable names

    pod-0, pod-1, pod-2 (always)

Key behaviors:

  • Pods get ordinal names: {statefulset}-0, {statefulset}-1
  • Ordered creation (with the default OrderedReady policy): pod 0 must be Running and Ready before pod 1 starts
  • Ordered deletion: pod N-1 is removed before pod N-2
  • Stable network identity via headless service
  • Per-pod PersistentVolumeClaims

Why This Matters for Trading

| Requirement | Deployment | StatefulSet |
|---|---|---|
| Stable identity | No | Yes |
| Per-pod storage | Shared only | Per-pod |
| Ordered scaling | No | Yes |
| Network identity | Random | Stable DNS |

Fix 1: StatefulSets for Identity {#statefulsets}

The Pattern

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel  # Start all pods together; ordering isn't needed since each pod trades independent markets
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
    spec:
      containers:
      - name: engine
        image: trading-engine:latest
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        # No ASSIGNED_MARKET env var is set: the application reads POD_NAME
        # and derives its assignment from the ordinal:
        #   trading-engine-0 → BTC
        #   trading-engine-1 → ETH
        #   trading-engine-2 → SOL

Application Logic

import os

# StatefulSet pod names end in a stable integer ordinal: trading-engine-0, -1, -2
POD_NAME = os.environ.get('POD_NAME', 'trading-engine-0')
POD_ORDINAL = int(POD_NAME.split('-')[-1])

MARKET_ASSIGNMENTS = {
    0: ['BTCUSDT', 'BTCUSD'],
    1: ['ETHUSDT', 'ETHUSD'],
    2: ['SOLUSDT', 'SOLUSD'],
}

my_markets = MARKET_ASSIGNMENTS.get(POD_ORDINAL, [])
print(f"Pod {POD_ORDINAL} handling markets: {my_markets}")

Expected Behavior

| Event | Result |
|---|---|
| Pod-0 crashes | Pod-0 restarts with the same identity and the same markets |
| Scale to 4 | Pod-3 is created; it picks up whatever ordinal 3 maps to (extend the assignment map) |
| Scale to 2 | Pod-2 is deleted first (reverse ordinal order) |

Fix 2: Headless Services {#headless}

The Problem

ClusterIP services load-balance. You can’t connect to a specific pod.

The Fix

apiVersion: v1
kind: Service
metadata:
  name: trading-headless
spec:
  clusterIP: None  # Headless
  selector:
    app: trading-engine
  ports:
  - port: 8080
    name: http
  - port: 9090
    name: metrics

How It Works

With headless service, each pod gets stable DNS:

  • trading-engine-0.trading-headless.trading.svc.cluster.local
  • trading-engine-1.trading-headless.trading.svc.cluster.local

Your risk engine can connect directly to each trading engine:

TRADING_ENGINES = [
    "trading-engine-0.trading-headless.trading.svc.cluster.local:8080",
    "trading-engine-1.trading-headless.trading.svc.cluster.local:8080",
]

# get_position() stands in for the application's own query call (HTTP/gRPC) against each pod
for engine in TRADING_ENGINES:
    position = get_position(engine)

No load balancer in the path. Direct TCP connections.
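
Under the hood there's no magic: the headless service publishes a DNS A record per pod, so an ordinary lookup returns that pod's IP directly. A quick sanity check from any pod in the cluster (a sketch using Python's standard library; the hostnames assume the service and namespace names from the manifests above):

import socket

pods = [
    "trading-engine-0.trading-headless.trading.svc.cluster.local",
    "trading-engine-1.trading-headless.trading.svc.cluster.local",
]

for host in pods:
    # Resolves to the individual pod IP, not a load-balanced ClusterIP
    print(host, "->", socket.gethostbyname(host))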

Fix 3: Persistent Volumes {#pv}

The Problem

Trading engines need persistent state:

  • Order history (for reconciliation)
  • Position snapshots (for crash recovery)
  • WAL logs (for replay)

Without persistence, restart = lost state.
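
Concretely, the state in question is unglamorous: append-only files the engine writes as it trades. A sketch of what that write path looks like, assuming a /data volume to land it on (paths and record fields are illustrative):

import json
import os
import time

WAL_PATH = "/data/orders.wal"  # /data is the PVC mount path from the StatefulSet spec

def record_order(order: dict) -> None:
    # Append the order event and fsync so it survives a crash immediately after
    with open(WAL_PATH, "a") as f:
        f.write(json.dumps({"ts": time.time(), **order}) + "\n")
        f.flush()
        os.fsync(f.fileno())

record_order({"id": "abc-123", "symbol": "BTCUSDT", "side": "buy", "qty": 0.01})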

The Fix

volumeClaimTemplates:
- metadata:
    name: trading-data
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "gp3-encrypted"
    resources:
      requests:
        storage: 50Gi

StorageClass for EKS:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-encrypted
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"

How It Works

Each pod gets its own PVC:

  • trading-data-trading-engine-0
  • trading-data-trading-engine-1

PVCs persist across pod restarts. Delete pod → PVC remains → New pod gets same PVC.
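
Recovery on restart is then just replaying that file: the replacement pod re-attaches the same volume and rebuilds its in-memory view before reconnecting to the exchange. Continuing the illustrative WAL sketch from above:

import json
import os

WAL_PATH = "/data/orders.wal"  # same PVC, re-attached to the restarted pod

def replay_wal() -> list:
    # Rebuild in-memory order state from the log left by the previous run
    if not os.path.exists(WAL_PATH):
        return []  # first boot: nothing to replay
    with open(WAL_PATH) as f:
        return [json.loads(line) for line in f if line.strip()]

print(f"Recovered {len(replay_wal())} order events after restart")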

For EBS optimization, see Storage Deep Dive.

Fix 4: Graceful Shutdown {#shutdown}

The Problem

Default termination: SIGTERM → wait 30s → SIGKILL.

Trading needs:

  1. Cancel open orders (5-30s)
  2. Wait for exchange confirmations (10s)
  3. Flush state (1s)

30 seconds isn't enough if the exchange is slow.

The Fix

spec:
  terminationGracePeriodSeconds: 120  # 2 minutes
  containers:
  - name: engine
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - |
            # Signal application to stop trading
            curl -X POST http://localhost:8080/shutdown
            
            # Wait for order cancellations
            sleep 60
            
            # Final state flush happens in SIGTERM handler

Application Pattern

import signal
import sys
import time

# Flag checked by the main trading loop so no new orders are placed once shutdown starts
shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True
    
    # Cancel all open orders
    for order in get_open_orders():
        cancel_order(order)
    
    # Wait for confirmations
    while get_open_orders():
        time.sleep(1)
    
    # Flush state
    save_state_to_disk()
    
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
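
The preStop hook shown earlier curls a /shutdown endpoint before the sleep; its job is only to stop new trading activity, while the SIGTERM handler above does the actual teardown. A minimal sketch of such an endpoint (the route and port are assumptions matching the preStop hook; a real engine would hang this off its existing HTTP server):

import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

shutdown_requested = False  # shared with the SIGTERM handler above

class ShutdownHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global shutdown_requested
        if self.path == "/shutdown":
            shutdown_requested = True  # trading loop sees this and stops placing orders
            self.send_response(202)
        else:
            self.send_response(404)
        self.end_headers()

# Serve alongside the trading loop so preStop's curl can reach it on :8080
threading.Thread(
    target=HTTPServer(("0.0.0.0", 8080), ShutdownHandler).serve_forever,
    daemon=True,
).start()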

Fix 5: Pod Disruption Budgets {#pdb}

The Problem

Kubernetes can evict pods during:

  • Node upgrades
  • Cluster autoscaler decisions
  • Spot instance reclaims

Without protection, all pods could be evicted simultaneously.

The Fix

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trading-pdb
spec:
  minAvailable: 2  # At least 2 pods always running
  selector:
    matchLabels:
      app: trading-engine

How It Works

Voluntary disruptions (upgrades, autoscaler) respect PDB:

  • Want to evict pod-0
  • Check PDB: 3 running, need 2 minimum
  • Eviction allowed (3-1=2 ≥ 2)

Involuntary disruptions (node crash) don’t check PDB. You need multi-AZ for that.

Complete StatefulSet Example

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trading-engine
  namespace: trading
spec:
  serviceName: "trading-headless"
  replicas: 3
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  selector:
    matchLabels:
      app: trading-engine
  template:
    metadata:
      labels:
        app: trading-engine
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      terminationGracePeriodSeconds: 120
      
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - trading
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: trading-engine
              topologyKey: topology.kubernetes.io/zone
      
      containers:
      - name: engine
        image: trading-engine:v1.2.3
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
        
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics
        
        volumeMounts:
        - name: trading-data
          mountPath: /data
        
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        
        lifecycle:
          preStop:
            exec:
              command: 
              - /bin/sh
              - -c
              - "curl -X POST localhost:8080/shutdown && sleep 60"
  
  volumeClaimTemplates:
  - metadata:
      name: trading-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: "gp3-encrypted"
      resources:
        requests:
          storage: 50Gi

Design Philosophy {#design-philosophy}

Stateless vs Stateful

Kubernetes was designed for stateless workloads. Its original patterns assumed:

  • Ephemeral pods
  • Shared state in databases
  • Any pod handles any request

Trading is inherently stateful:

  • Exchange connections are stateful (WebSocket)
  • Position tracking requires memory
  • Order IDs need persistence

StatefulSets bridge this gap.

The Tradeoff

| Deployment | StatefulSet |
|---|---|
| Simple scaling | Ordered scaling |
| Fast rollouts | Careful rollouts |
| No identity | Stable identity |
| Shared state | Per-pod state |

StatefulSets are more complex. That complexity is the cost of correctness.


Audit Your Infrastructure

Running trading on Kubernetes? The underlying nodes still need kernel tuning. Run latency-audit on your node pools to check CPU governors, memory settings, and network configurations.

pip install latency-audit && latency-audit

Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
|---|---|
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |