Trading Metrics: What SRE Dashboards Miss
Fill latency, position drift, market data staleness. The SLOs that prevent losses, not just track uptime. Prometheus, Grafana, and alerting patterns.
Your trading system has 99.99% uptime. Congratulations. You’re measuring the wrong thing.
I’ve seen systems with perfect infrastructure dashboards lose $50K in a day. CPU was at 5%. Memory was fine. All services green. But fill latency had degraded from 50ms to 500ms, and nobody noticed until the PnL reconciliation.
This post covers the metrics that actually matter for trading, how to instrument them, and how to alert before money is lost.
The Problem {#the-problem}
Standard SRE metrics:
- CPU usage
- Memory utilization
- Disk I/O
- Network throughput
- Service uptime
These are necessary but not sufficient. A server can have 5% CPU while:
- Fills are taking 10x longer than expected
- Positions have drifted from exchange state
- Market data is 5 seconds stale
- Rate limits are exhausted
The cost: By the time infrastructure metrics alert, you’ve already lost money.
For the infrastructure that these metrics run on, see:
- First Principles - Architecture
- Kubernetes for Trading - Pod monitoring
- The Observer Effect - Low-overhead measurement
Fill Latency {#fill-latency}
What It Measures
Time from order submission to fill confirmation.
Why it matters: Fill latency directly impacts execution quality. In volatile markets, a 500ms delay on a sizable order can easily mean $500+ in slippage.
Implementation
```python
from prometheus_client import Histogram
import time

FILL_LATENCY = Histogram(
    'trading_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    ['exchange', 'symbol'],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]
)

class OrderManager:
    def __init__(self, exchange_client):
        self.exchange = exchange_client  # exchange API client, injected
        self.pending_orders = {}

    def submit_order(self, order):
        order.submitted_at = time.time()
        self.pending_orders[order.id] = order
        self.exchange.submit(order)

    def on_fill(self, fill):
        order = self.pending_orders.pop(fill.order_id, None)
        if order:
            latency = time.time() - order.submitted_at
            FILL_LATENCY.labels(
                exchange=order.exchange,
                symbol=order.symbol
            ).observe(latency)
```
Prometheus Recording Rules
```yaml
groups:
  - name: trading_slos
    rules:
      - record: trading:fill_latency:p50_5m
        expr: histogram_quantile(0.50, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
      - record: trading:fill_latency:p95_5m
        expr: histogram_quantile(0.95, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
      - record: trading:fill_latency:p99_5m
        expr: histogram_quantile(0.99, sum(rate(trading_fill_latency_seconds_bucket[5m])) by (le, exchange))
```
Thresholds
| Environment | P50 | P95 | P99 |
|---|---|---|---|
| Cloud (acceptable) | <100ms | <500ms | <2s |
| Cloud (good) | <50ms | <200ms | <500ms |
| Colocated | <1ms | <5ms | <20ms |
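
To make these thresholds actionable, you can alert on the recorded P99. A minimal sketch, assuming the recording rules above and the cloud "acceptable" ceiling of 2s:

```yaml
- alert: HighFillLatency
  expr: trading:fill_latency:p99_5m > 2
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Fill latency P99 > 2s on {{ $labels.exchange }}"
    description: "P99 fill latency is {{ $value | humanizeDuration }} over the last 5m"
```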
Position Drift {#position-drift}
What It Measures
Difference between your calculated position and exchange’s recorded position.
Why it matters: Drift means:
- Missed fill messages
- Network packet loss
- Race conditions in reconciliation
- Wrong risk calculations
Implementation
```python
from prometheus_client import Gauge
import asyncio

POSITION_DRIFT = Gauge(
    'trading_position_drift_percent',
    'Difference between calculated and exchange position (stored as a fraction, 0.01 = 1%)',
    ['exchange', 'symbol']
)
POSITION_DRIFT_ABS = Gauge(
    'trading_position_drift_absolute',
    'Absolute difference in position',
    ['exchange', 'symbol']
)

async def reconcile_positions():
    # trading_symbols, database, and client are assumed to be set up elsewhere
    while True:
        for symbol in trading_symbols:
            calculated = database.get_position(symbol)
            exchange_pos = await client.get_position(symbol)
            if calculated != 0:
                drift_pct = abs(calculated - exchange_pos) / abs(calculated)
            else:
                # Any position when expecting flat counts as 100% drift
                drift_pct = 1.0 if exchange_pos != 0 else 0.0
            drift_abs = abs(calculated - exchange_pos)
            POSITION_DRIFT.labels(
                exchange='binance',
                symbol=symbol
            ).set(drift_pct)
            POSITION_DRIFT_ABS.labels(
                exchange='binance',
                symbol=symbol
            ).set(drift_abs)
        await asyncio.sleep(60)  # Reconcile every minute
```
Alert Rules
```yaml
- alert: PositionDrift
  expr: trading_position_drift_percent > 0.01  # 1%
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Position drift > 1% on {{ $labels.symbol }}"
    description: "Calculated and exchange positions differ by {{ $value | humanizePercentage }}"
```
Thresholds
- Normal: drift = 0%
- Warning: drift > 0.5%
- Page immediately: drift > 1%
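
The 0.5% band can get its own Slack-level rule, mirroring the critical alert above (the 2m hold time here is an assumption):

```yaml
- alert: PositionDriftWarning
  expr: trading_position_drift_percent > 0.005  # 0.5%, stored as a fraction
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Position drift > 0.5% on {{ $labels.symbol }}"
```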
Market Data Staleness {#staleness}
What It Measures
Time since last orderbook update.
Why it matters: Trading on stale data = trading blind. Common causes:
- WebSocket silently disconnected
- Exchange rate limiting
- Network congestion
- Parser crash
Implementation
```python
from prometheus_client import Gauge
import asyncio
import time

MARKET_DATA_AGE = Gauge(
    'trading_market_data_age_seconds',
    'Time since last orderbook update',
    ['exchange', 'symbol']
)

last_update = {}

def on_orderbook_update(exchange: str, symbol: str, data: dict):
    last_update[(exchange, symbol)] = time.time()
    # Process orderbook...

async def staleness_monitor():
    while True:
        now = time.time()
        for (exchange, symbol), ts in last_update.items():
            age = now - ts
            MARKET_DATA_AGE.labels(
                exchange=exchange,
                symbol=symbol
            ).set(age)
        await asyncio.sleep(1)
```
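
An alternative sketch (not required by the setup above): export the last-update Unix timestamp instead of an age, and let PromQL compute staleness as `time() - trading_market_data_last_update_timestamp_seconds`. This drops the separate monitor loop; the metric name here is an assumption.

```python
from prometheus_client import Gauge
import time

# Hypothetical timestamp metric; staleness is derived in PromQL, not in the exporter
LAST_UPDATE_TS = Gauge(
    'trading_market_data_last_update_timestamp_seconds',
    'Unix timestamp of the last orderbook update',
    ['exchange', 'symbol']
)

def on_orderbook_update(exchange: str, symbol: str, data: dict):
    LAST_UPDATE_TS.labels(exchange=exchange, symbol=symbol).set(time.time())
    # Process orderbook...
```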
Alert Rules
```yaml
- alert: StaleMarketData
  expr: trading_market_data_age_seconds > 5
  for: 10s
  labels:
    severity: critical
  annotations:
    summary: "No market data for 5s on {{ $labels.symbol }}"
    runbook: "Check WebSocket connection, reconnect if needed"
```
Order Rejection Rate {#rejections}
What It Measures
Percentage of orders rejected by exchange.
Why it matters: High rejections indicate:
- Insufficient margin
- Invalid order sizes
- Rate limits exceeded
- Exchange maintenance
- Bugs in order construction
Implementation
```python
from prometheus_client import Counter

ORDERS_SUBMITTED = Counter(
    'trading_orders_submitted_total',
    'Total orders submitted',
    ['exchange', 'type']
)
ORDERS_REJECTED = Counter(
    'trading_orders_rejected_total',
    'Total orders rejected',
    ['exchange', 'reason']
)
ORDERS_FILLED = Counter(
    'trading_orders_filled_total',
    'Total orders filled',
    ['exchange']
)

def submit_order(order):
    ORDERS_SUBMITTED.labels(
        exchange=order.exchange,
        type=order.type
    ).inc()
    # Submit...

def on_rejection(order, reason):
    ORDERS_REJECTED.labels(
        exchange=order.exchange,
        reason=reason
    ).inc()

def on_fill(fill):
    ORDERS_FILLED.labels(exchange=fill.exchange).inc()
```
Recording Rules
```yaml
- record: trading:order_rejection_rate:5m
  expr: |
    sum(rate(trading_orders_rejected_total[5m])) by (exchange)
    /
    sum(rate(trading_orders_submitted_total[5m])) by (exchange)
```
Thresholds
- Normal: <1%
- Warning: >3%
- Critical: >5%
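
A matching alert on the recorded rate, sketched against the critical threshold above:

```yaml
- alert: HighOrderRejectionRate
  expr: trading:order_rejection_rate:5m > 0.05  # 5%
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Order rejection rate > 5% on {{ $labels.exchange }}"
    description: "{{ $value | humanizePercentage }} of orders rejected over the last 5m"
```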
Rate Limit Headroom {#rate-limits}
What It Measures
How close you are to exchange rate limits.
Why it matters: Hit rate limits = orders rejected = missed opportunities.
Implementation
```python
from prometheus_client import Gauge
import time

RATE_LIMIT_USED = Gauge(
    'trading_rate_limit_used_percent',
    'Percentage of rate limit consumed',
    ['exchange', 'endpoint']
)

class RateLimitedClient:
    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.requests = []  # timestamps of requests in the last minute

    def request(self, endpoint: str):
        now = time.time()
        # Drop timestamps older than the 60-second window
        self.requests = [t for t in self.requests if now - t < 60]
        used_pct = len(self.requests) / self.limit
        RATE_LIMIT_USED.labels(
            exchange='binance',
            endpoint=endpoint
        ).set(used_pct * 100)
        self.requests.append(now)
        # Make request...
```
Thresholds
- Normal: <50%
- Warning: >70%
- Critical: >90%
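
The corresponding warning rule, assuming the 70% threshold above:

```yaml
- alert: RateLimitHeadroomLow
  expr: trading_rate_limit_used_percent > 70
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "Rate limit usage > 70% on {{ $labels.exchange }} ({{ $labels.endpoint }})"
```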
Alerting Strategy {#alerting}
Alert Hierarchy
```yaml
# AlertManager config
route:
  receiver: 'slack-trading'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-trading'
      continue: true
    - match:
        severity: warning
      receiver: 'slack-trading'

receivers:
  - name: 'pagerduty-trading'
    pagerduty_configs:
      - service_key: '{{ PAGERDUTY_KEY }}'
        severity: critical
  - name: 'slack-trading'
    slack_configs:
      - channel: '#trading-alerts'
        send_resolved: true
```
What Wakes You Up
```text
PAGE IMMEDIATELY (2am wake-up):
├── Position drift > 1%
├── No fills in 5 minutes (market hours)
├── Market data stale > 30 seconds
├── Loss limit exceeded
└── WebSocket down all exchanges

SLACK (check in morning):
├── Fill latency P99 > 1s
├── Rejection rate > 3%
├── Rate limit > 70%
└── High reconnection rate
```
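
Two of the page-worthy conditions above aren't backed by a rule in this post. A hedged sketch: the no-fills check reuses `trading_orders_filled_total`, while the loss-limit rule assumes a hypothetical `trading_daily_pnl_usd` gauge that your system would have to export.

```yaml
- alert: NoFillsDuringMarketHours
  # Gate this on market hours via a time-based expression or an inhibition rule
  expr: sum(increase(trading_orders_filled_total[5m])) == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No fills in the last 5 minutes"

- alert: LossLimitExceeded
  # trading_daily_pnl_usd and the -10000 limit are placeholders for your own PnL export
  expr: trading_daily_pnl_usd < -10000
  labels:
    severity: critical
  annotations:
    summary: "Daily loss limit exceeded"
```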
Runbook Pattern
```yaml
annotations:
  summary: "Position drift detected"
  description: "{{ $labels.symbol }}: {{ $value | humanizePercentage }} drift"
  runbook: |
    1. Check /debug/positions endpoint for details
    2. Compare with exchange API positions
    3. If exchange is source of truth, force reconciliation
    4. If our system is wrong, investigate fill processing
```
Grafana Dashboard
```json
{
  "title": "Trading SLOs",
  "panels": [
    {
      "title": "Fill Latency P99 by Exchange",
      "type": "graph",
      "targets": [
        {"expr": "trading:fill_latency:p99_5m"}
      ]
    },
    {
      "title": "Position Drift",
      "type": "gauge",
      "targets": [
        {"expr": "max(trading_position_drift_percent) * 100"}
      ],
      "fieldConfig": {
        "defaults": {
          "max": 5,
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 0.5, "color": "yellow"},
              {"value": 1, "color": "red"}
            ]
          }
        }
      }
    },
    {
      "title": "Market Data Staleness",
      "type": "stat",
      "targets": [
        {"expr": "max(trading_market_data_age_seconds)"}
      ]
    }
  ]
}
```
Design Philosophy {#design-philosophy}
Infrastructure vs Business Metrics
| Metric Type | Example | What It Tells You |
|---|---|---|
| Infrastructure | CPU = 50% | System is working |
| Application | Requests/sec = 1000 | System is handling load |
| Business | Fill latency P99 = 500ms | Money is at risk |
Prioritize business metrics. Infrastructure metrics are necessary but not sufficient.
The SLO Hierarchy
1. Don’t lose money (position drift, loss limits)
2. Don’t miss opportunities (fill latency, staleness)
3. Don’t get blocked (rate limits, rejections)
4. Stay operational (uptime, errors)
Most teams monitor #4 first. Start with #1.
Up Next in Linux Infrastructure Deep Dives
eBPF Profiling: Nanoseconds Without Adding Any
Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| StatefulSets, pod placement, EKS patterns | Kubernetes StatefulSets: Why Trading Systems Need State |