Trading System Metrics That Actually Matter
Fill latency, position drift, market data staleness. The SLOs that prevent losses, not just track uptime.
🎯 What You'll Learn
- Identify trading-specific metrics beyond standard SRE
- Define SLOs for fill latency and market data staleness
- Configure Prometheus/Grafana for trading dashboards
- Build alerts that prevent losses, not just track outages
Beyond Uptime: Trading SLOs
Standard SRE dashboards track uptime, error rates, and latency. For trading, that’s not enough.
Web app SLO: 99.9% availability, p99 < 200ms
Trading SLO: 99.99% availability, p99 < 500µs,
fill rate > 95%, market data staleness < 1ms
If your market data is 5ms stale, you’re trading on old prices. That’s not an “outage”, but it costs money.
What You’ll Learn
By the end of this lesson, you’ll understand:
- Trading-specific metrics - What to measure that SRE dashboards miss
- Latency SLOs - Why p99 isn’t enough and you need p99.9
- Market data staleness - The hidden cause of bad trades
- Position drift - Detecting when expected ≠ actual
The Foundation: The Four Trading Metrics
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Fill Latency | Time from order to execution | Slower = worse prices |
| Market Data Staleness | Age of latest price data | Stale data = wrong decisions |
| Position Drift | Expected vs. actual positions | Detects execution failures |
| Quote-to-Trade Ratio | Orders sent per executed trade | Indicates strategy health |
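The first three metrics get code further down, but the quote-to-trade ratio does not. As a minimal sketch, assuming you already count orders sent and trades executed over the same window, the ratio could be computed like this. The function name and the handling of the zero-trade case are illustrative choices, not a standard definition.

```python
def quote_to_trade_ratio(orders_sent: int, trades_executed: int) -> float:
    """Orders (quotes) sent per executed trade over one window."""
    if trades_executed == 0:
        # Nothing filled: report infinity if we were quoting, 0.0 if idle.
        return float("inf") if orders_sent > 0 else 0.0
    return orders_sent / trades_executed


ratio = quote_to_trade_ratio(orders_sent=200, trades_executed=10)
```

A ratio that climbs while order flow stays constant suggests fewer quotes are turning into fills, which is worth a look before it shows up in PnL.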
The “Aha!” Moment
Here’s what separates trading observability from generic SRE:
In trading, performance IS correctness. A web page that loads in 1s vs 100ms annoys users. A trade that executes in 10ms vs 100µs means you got a worse price, or no fill at all.
Your monitoring system needs to treat latency like a business metric, not only a technical one.
Let’s See It In Action: Key Metrics
1. Fill Latency
# Prometheus metrics (Python example)
from prometheus_client import Histogram

fill_latency = Histogram(
    'order_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    buckets=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],  # 100µs to 100ms
    labelnames=['exchange', 'instrument']
)

# In your order handler
with fill_latency.labels(exchange='binance', instrument='BTC-USD').time():
    submit_and_wait_for_fill(order)
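A single timer around the whole call tells you the total, but not which stage was slow. The following is a minimal sketch of per-stage timing using only the standard library. `OrderTimeline` and the stage names are hypothetical helpers for illustration, and the four `mark` calls stand in for the points in your own order flow where each event happens.

```python
import time
from dataclasses import dataclass, field


@dataclass
class OrderTimeline:
    """Hypothetical helper: records one monotonic timestamp per order stage."""
    marks: dict = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        # perf_counter_ns is a monotonic clock, so durations never go negative
        self.marks[stage] = time.perf_counter_ns()

    def elapsed_seconds(self, start: str, end: str) -> float:
        return (self.marks[end] - self.marks[start]) / 1e9


timeline = OrderTimeline()
timeline.mark("order_created")
timeline.mark("order_sent")       # after the order is written to the socket
timeline.mark("exchange_ack")     # when the exchange acknowledges the order
timeline.mark("fill_confirmed")   # when the fill message arrives

send_latency = timeline.elapsed_seconds("order_created", "order_sent")
order_to_fill = timeline.elapsed_seconds("order_sent", "fill_confirmed")
```

Each elapsed value could then be recorded with `fill_latency.labels(...).observe(value)`, ideally with a separate histogram or label per stage.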
2. Market Data Staleness
from prometheus_client import Gauge
import time

data_staleness = Gauge(
    'market_data_staleness_seconds',
    'Age of latest market data',
    labelnames=['exchange', 'symbol']
)

# In your market data handler
def on_tick(symbol, exchange_timestamp):
    # Assumes exchange_timestamp is in epoch seconds and that the local
    # clock is synchronized with the exchange clock
    staleness = time.time() - exchange_timestamp
    data_staleness.labels(exchange='binance', symbol=symbol).set(staleness)
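Two details can quietly break this measurement: the timestamp unit and clock skew. Many feeds publish timestamps in epoch milliseconds rather than seconds, and if the local clock runs behind the exchange clock the computed age goes negative. The sketch below assumes a millisecond timestamp and clamps negative ages to zero. The function name and the `now_s` parameter are hypothetical, added so the logic can be tested deterministically.

```python
import time
from typing import Optional


def staleness_seconds(exchange_ts_ms: int, now_s: Optional[float] = None) -> float:
    """Age of a tick in seconds, given an exchange timestamp in epoch
    milliseconds (an assumed unit; check your feed's documentation)."""
    if now_s is None:
        now_s = time.time()
    age = now_s - exchange_ts_ms / 1000.0
    # A negative age means the local clock is behind the exchange clock.
    # Clamp to zero rather than exporting a misleading negative value.
    return max(age, 0.0)


age = staleness_seconds(1_700_000_000_000, now_s=1_700_000_000.25)
skewed = staleness_seconds(1_700_000_000_500, now_s=1_700_000_000.0)
```

Note that a wall-clock comparison is only as accurate as the synchronization between your host and the exchange, so sub-millisecond thresholds are only meaningful with tightly synchronized clocks.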
3. Position Drift
position_drift = Gauge(
    'position_drift_absolute',
    'Difference between expected and actual position',
    labelnames=['symbol']
)

# Periodic reconciliation
def reconcile_positions():
    for symbol in tracked_symbols:
        expected = internal_position[symbol]
        actual = query_exchange_position(symbol)
        drift = abs(expected - actual)
        position_drift.labels(symbol=symbol).set(drift)
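Because the drift alert fires on any value above zero, the numeric type used for positions matters. If positions are accumulated as binary floats, rounding error alone can produce a non-zero "drift" even when the books agree. The sketch below uses hypothetical fill sizes to show the effect and how `decimal.Decimal` avoids it.

```python
from decimal import Decimal

# Three hypothetical fills of 0.1 units each, tracked internally as floats
float_position = 0.1 + 0.1 + 0.1
float_drift = abs(float_position - 0.3)   # tiny but non-zero

# The same fills tracked as decimals built from strings
decimal_position = Decimal("0.1") + Decimal("0.1") + Decimal("0.1")
decimal_drift = abs(decimal_position - Decimal("0.3"))   # exactly zero
```

One option is to keep positions as `Decimal` (or as integers in the instrument's smallest unit) and convert to `float` only at the point where the value is exported to the gauge.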
SLO Definitions
Define specific, measurable targets:
| Metric | Target | Alert Threshold | Business Impact |
|---|---|---|---|
| Fill Latency p99 | < 1ms | > 2ms for 1min | $10K/day lost |
| Fill Latency p99.9 | < 10ms | > 20ms for 30s | Strategy disabled |
| Market Data Staleness | < 500µs | > 2ms for 10s | Wrong pricing |
| Position Drift | 0 | > 0 for 5min | Inventory risk |
Prometheus Queries
Fill Latency Percentiles
# P99 fill latency by exchange
histogram_quantile(0.99,
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)

# P99.9 for catching tail latency
histogram_quantile(0.999,
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)
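Keep in mind that `histogram_quantile` estimates the quantile by interpolating inside a bucket, so its precision depends on where the bucket boundaries sit. With the buckets defined earlier, a 2ms threshold falls between the 1ms and 5ms boundaries. The sketch below is a simplified model of linear interpolation over cumulative buckets, not Prometheus's actual implementation, and the observation counts are hypothetical.

```python
def estimate_quantile(q, cumulative_buckets):
    """Simplified linear-interpolation quantile estimate.

    cumulative_buckets: list of (upper_bound_seconds, cumulative_count),
    sorted by upper bound. The first bucket is assumed to start at 0.
    """
    total = cumulative_buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in cumulative_buckets:
        if count >= rank:
            in_bucket = count - prev_count
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return cumulative_buckets[-1][0]


# 1000 fills: 985 under 1ms, the other 15 all around 1.2ms
coarse = estimate_quantile(0.99, [(0.001, 985), (0.005, 1000)])
fine = estimate_quantile(0.99, [(0.001, 985), (0.002, 1000), (0.005, 1000)])
```

With only the 1ms and 5ms boundaries the estimate lands above 2ms even though every observation was below it. Adding a boundary at the alert threshold (here 0.002) keeps the estimate on the correct side.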
Market Data Freshness
# Alert if any symbol is stale > 2ms
max(market_data_staleness_seconds) by (exchange) > 0.002
Position Reconciliation
# Any position drift is a problem
sum(position_drift_absolute) > 0
Common Misconceptions
Myth: “p99 latency is enough.”
Reality: p99 hides your worst 1%. In trading, one 100ms spike during volatile markets can cause significant losses. Track p99.9 or even p99.99.
Myth: “We monitor latency, so we’re fine.”
Reality: Latency to where? You need to measure the complete path: tick-to-trade (market data received to order sent) and order-to-fill (order sent to fill confirmed).
Myth: “Position drift can wait for daily reconciliation.”
Reality: If you’re trading 1000 orders/day and one fails silently, you could have a large unhedged position for hours. Real-time reconciliation is mandatory.
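To make the first misconception above concrete, here is a small sketch with made-up numbers: 10,000 fills in which 15 are 100ms spikes. It uses a nearest-rank percentile, one of several common percentile definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical day of fills: 9,985 at 200µs and 15 spikes at 100ms
latencies = [0.0002] * 9985 + [0.1] * 15

p99 = percentile(latencies, 99)      # the spikes are invisible here
p999 = percentile(latencies, 99.9)   # the spikes show up here
```

In this data set p99 reports 200µs while p99.9 reports 100ms, so a dashboard showing only p99 would look healthy on a day with 15 very slow fills.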
Grafana Dashboard Layout
Recommended panels:
+------------------+------------------+------------------+
| Fill Latency | Market Data | Position Drift |
| (Heatmap) | Staleness | (per symbol) |
+------------------+------------------+------------------+
| Fill Rate % | Order Flow | PnL Real-time |
| (by venue) | (orders/sec) | (streaming) |
+------------------+------------------+------------------+
| Error Rate | Quote/Trade | System Alerts |
| (by type) | Ratio | (last 24h) |
+------------------+------------------+------------------+
Alerting Rules
# prometheus/alerts.yml
groups:
  - name: trading
    rules:
      - alert: FillLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(order_fill_latency_seconds_bucket[5m])) by (le)) > 0.002
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Fill latency p99 above 2ms"
      # Threshold matches the SLO table: > 2ms for 10s
      - alert: MarketDataStale
        expr: max(market_data_staleness_seconds) > 0.002
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Market data > 2ms stale"
      - alert: PositionDrift
        expr: sum(position_drift_absolute) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Position mismatch detected"
Practice Exercises
Exercise 1: Implement Fill Latency
# Add timing around your order flow
# Measure: order_created → order_sent → exchange_ack → fill_confirmed
Exercise 2: Set Up Staleness Check
# Compare exchange timestamp to local time
# Alert if difference > 2ms
Exercise 3: Build a Grafana Dashboard
# Create panels for:
# - Fill latency heatmap (last 1 hour)
# - Staleness gauge (current value)
# - Position reconciliation status
Key Takeaways
- Trading metrics ≠ SRE metrics - Latency is a business metric, not just technical
- p99.9 matters more than p99 - Tail latency costs money
- Market data staleness is invisible - You’re trading on old prices without knowing
- Position drift = silent failure - Real-time reconciliation is mandatory
What’s Next?
🎯 Continue learning: eBPF Profiling
🔬 Expert version: Trading Metrics: What SRE Dashboards Miss
Now you know what to put on your trading dashboard. 📈
Pro Version: For production implementation, see Monitoring Trading Systems