Trading System Metrics That Actually Matter

Fill latency, position drift, market data staleness: the SLOs that prevent losses, not just track uptime.

Intermediate · 20 min read

🎯 What You'll Learn

  • Identify trading-specific metrics beyond standard SRE dashboards
  • Define SLOs for fill latency and market data staleness
  • Configure Prometheus/Grafana for trading dashboards
  • Build alerts that prevent losses, not just track outages

Beyond Uptime: Trading SLOs

Standard SRE dashboards track uptime, error rates, and latency. For trading, that’s not enough.

Web app SLO:  99.9% availability, p99 < 200ms
Trading SLO: 99.99% availability, p99 < 500µs, 
             fill rate > 95%, market data staleness < 1ms

If your market data is 5ms stale, you’re trading on old prices. That’s not an “outage”, but it costs money.


What You’ll Learn

By the end of this lesson, you’ll understand:

  1. Trading-specific metrics - What to measure that SRE dashboards miss
  2. Latency SLOs - Why p99 isn’t enough and you need p99.9
  3. Market data staleness - The hidden cause of bad trades
  4. Position drift - Detecting when expected ≠ actual

The Foundation: The Four Trading Metrics

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Fill Latency | Time from order to execution | Slower = worse prices |
| Market Data Staleness | Age of latest price data | Stale data = wrong decisions |
| Position Drift | Expected vs. actual positions | Detects execution failures |
| Quote-to-Trade Ratio | Orders / trades | Indicates strategy health |
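
Quote-to-trade ratio is the one metric in this table that the lesson does not instrument later, so here is a minimal sketch of how it could be computed from two running counts. The `OrderFlowStats` class and its method names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OrderFlowStats:
    """Running counts for one strategy or venue (hypothetical helper)."""
    orders_sent: int = 0
    trades: int = 0

    def on_order_sent(self) -> None:
        self.orders_sent += 1

    def on_trade(self) -> None:
        self.trades += 1

    def quote_to_trade_ratio(self) -> float:
        # Orders sent per executed trade. With no trades yet, report
        # infinity if orders were sent and 0.0 if nothing happened at all.
        if self.trades == 0:
            return float("inf") if self.orders_sent else 0.0
        return self.orders_sent / self.trades


stats = OrderFlowStats()
for _ in range(50):
    stats.on_order_sent()
for _ in range(4):
    stats.on_trade()
print(stats.quote_to_trade_ratio())  # 12.5
```

A ratio that climbs while order volume stays flat suggests the strategy is quoting more and trading less, which is the kind of drift in "strategy health" the table refers to.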

The “Aha!” Moment

Here’s what separates trading observability from generic SRE:

In trading, performance IS correctness. A web page that loads in 1s vs 100ms annoys users. A trade that executes in 10ms vs 100µs means you got a worse price, or no fill at all.

Your monitoring system needs to treat latency like a business metric, not only a technical one.


Let’s See It In Action: Key Metrics

1. Fill Latency

# Prometheus metrics (Python example)
from prometheus_client import Histogram

fill_latency = Histogram(
    'order_fill_latency_seconds',
    'Time from order submission to fill confirmation',
    buckets=[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1],  # 100µs to 100ms
    labelnames=['exchange', 'instrument']
)

# In your order handler
with fill_latency.labels(exchange='binance', instrument='BTC-USD').time():
    submit_and_wait_for_fill(order)
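
The context-manager form above fits a blocking call. When submission and fill confirmation arrive on separate callbacks, the two timestamps have to be paired by order ID instead. The sketch below shows one way to do that. `FillLatencyTracker` is a hypothetical helper, and it takes an `observe` callable so it does not depend on any metrics library; in practice that callable could be the histogram's `observe` method for the right labels.

```python
import time
from typing import Callable, Dict


class FillLatencyTracker:
    """Pairs order submissions with fills that arrive on a separate callback."""

    def __init__(self, observe: Callable[[float], None],
                 clock_ns: Callable[[], int] = time.perf_counter_ns) -> None:
        self._observe = observe      # e.g. fill_latency.labels(...).observe
        self._clock_ns = clock_ns    # monotonic clock, injectable for tests
        self._submitted: Dict[str, int] = {}

    def on_submit(self, order_id: str) -> None:
        self._submitted[order_id] = self._clock_ns()

    def on_fill(self, order_id: str) -> None:
        start = self._submitted.pop(order_id, None)
        if start is None:
            return  # unknown or already-recorded order; nothing to observe
        self._observe((self._clock_ns() - start) / 1e9)  # seconds


# Usage with a fake clock so the result is deterministic.
ticks = iter([1_000_000, 1_250_000])  # nanoseconds
recorded = []
tracker = FillLatencyTracker(recorded.append, clock_ns=lambda: next(ticks))
tracker.on_submit("order-1")
tracker.on_fill("order-1")
print(recorded)  # [0.00025]
```

This version records only the first fill per order. Partial fills would need a policy decision (first fill, last fill, or each fill) that the sketch leaves open.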

2. Market Data Staleness

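from prometheus_client import Gauge
import time

data_staleness = Gauge(
    'market_data_staleness_seconds',
    'Age of latest market data',
    labelnames=['exchange', 'symbol']
)

# In your market data handler.
# exchange_timestamp must be in epoch seconds (convert if the venue sends ms or ns),
# and the host clock must be synchronized (PTP or NTP) for sub-millisecond readings
# to be meaningful.
def on_tick(symbol, exchange_timestamp, exchange='binance'):
    staleness = time.time() - exchange_timestamp
    data_staleness.labels(exchange=exchange, symbol=symbol).set(staleness)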
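
One caveat with the gauge above: it is set only when a tick arrives, so it records the delay of that tick and then holds its last value if the feed goes silent. A sketch of an alternative is to store the last exchange timestamp and compute the age at read time, so the value keeps growing during a feed outage. `TickAgeTracker` is a hypothetical helper; with `prometheus_client`, a callable like `age` could be registered through `Gauge.set_function` so it is evaluated on each scrape.

```python
import time
from typing import Callable, Dict


class TickAgeTracker:
    """Tracks the last exchange timestamp per symbol and reports its age."""

    def __init__(self, now: Callable[[], float] = time.time) -> None:
        self._now = now                      # wall clock in seconds, injectable
        self._last_ts: Dict[str, float] = {}

    def on_tick(self, symbol: str, exchange_timestamp: float) -> None:
        # exchange_timestamp is assumed to be epoch seconds.
        self._last_ts[symbol] = exchange_timestamp

    def age(self, symbol: str) -> float:
        # Age keeps growing while no new tick arrives. Clamp small negative
        # values caused by clock skew between exchange and host.
        return max(0.0, self._now() - self._last_ts[symbol])


fake_now = [100.000]
tracker = TickAgeTracker(now=lambda: fake_now[0])
tracker.on_tick("BTC-USD", 99.9995)
print(round(tracker.age("BTC-USD"), 6))  # 0.0005
fake_now[0] = 103.0                      # feed goes silent for 3 seconds
print(round(tracker.age("BTC-USD"), 6))  # 3.0005
```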

3. Position Drift

from prometheus_client import Gauge

position_drift = Gauge(
    'position_drift_absolute',
    'Difference between expected and actual position',
    labelnames=['symbol']
)

# Periodic reconciliation.
# tracked_symbols, internal_position, and query_exchange_position are
# placeholders for your own state and exchange API client.
def reconcile_positions():
    for symbol in tracked_symbols:
        expected = internal_position[symbol]
        actual = query_exchange_position(symbol)
        drift = abs(expected - actual)
        position_drift.labels(symbol=symbol).set(drift)
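
Because the alert later in this lesson fires on any drift above zero, the subtraction needs to be exact. If positions are accumulated as binary floats, rounding noise alone can produce a nonzero drift. The sketch below assumes quantities are kept as `Decimal`; the `position_drift` function and the sample symbols are illustrative only.

```python
from decimal import Decimal
from typing import Dict


def position_drift(expected: Dict[str, Decimal],
                   actual: Dict[str, Decimal]) -> Dict[str, Decimal]:
    """Absolute drift per symbol; a symbol missing on one side counts as 0."""
    symbols = set(expected) | set(actual)
    zero = Decimal("0")
    return {s: abs(expected.get(s, zero) - actual.get(s, zero))
            for s in sorted(symbols)}


# Three fills of 0.1 BTC: exact with Decimal, noisy with float.
fills = [Decimal("0.1")] * 3
expected = {"BTC-USD": sum(fills, Decimal("0"))}
actual = {"BTC-USD": Decimal("0.3"), "ETH-USD": Decimal("2")}

drift = position_drift(expected, actual)
print(drift["BTC-USD"] == 0)               # True: no false alarm
print(drift["ETH-USD"])                    # 2: a position we did not expect
print(abs((0.1 + 0.1 + 0.1) - 0.3) > 0)    # True: float noise would fire the alert
```

Taking the union of symbols also surfaces positions that exist at the exchange but not in the internal book, which a loop over tracked symbols alone would miss.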

SLO Definitions

Define specific, measurable targets:

| Metric | Target | Alert Threshold | Business Impact |
|---|---|---|---|
| Fill Latency p99 | < 1ms | > 2ms for 1min | $10K/day lost |
| Fill Latency p99.9 | < 10ms | > 20ms for 30s | Strategy disabled |
| Market Data Staleness | < 500µs | > 2ms for 10s | Wrong pricing |
| Position Drift | 0 | > 0 for 5min | Inventory risk |
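
Each alert threshold pairs a level with a hold duration ("> 2ms for 1min"), so a single spike does not page anyone but a sustained breach does. The sketch below illustrates that semantics in simplified form; it is similar in spirit to the `for:` clause in Prometheus alerting rules, not a reimplementation of it. The function name and sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value in seconds)


def sustained_breach(samples: List[Sample], threshold: float,
                     hold_seconds: float) -> bool:
    """True if the value stayed above threshold for at least hold_seconds,
    measured back from the most recent sample."""
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards while samples keep breaching the threshold.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    return breach_start is not None and latest_ts - breach_start >= hold_seconds


# p99 fill latency sampled every 15 s; threshold 2 ms, hold 60 s.
spike = [(0, 0.0008), (15, 0.0031), (30, 0.0009), (45, 0.0007), (60, 0.0008)]
sustained = [(0, 0.0025), (15, 0.0031), (30, 0.0027), (45, 0.0024), (60, 0.0029)]
print(sustained_breach(spike, 0.002, 60))      # False
print(sustained_breach(sustained, 0.002, 60))  # True
```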

Prometheus Queries

Fill Latency Percentiles

# P99 fill latency by exchange
histogram_quantile(0.99, 
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)

# P99.9 for catching tail latency
histogram_quantile(0.999, 
  sum(rate(order_fill_latency_seconds_bucket[5m])) by (le, exchange)
)
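
These queries return estimates, not measured values: `histogram_quantile` interpolates linearly inside whichever bucket contains the target rank. The sketch below is a simplified model of that calculation for classic cumulative buckets, using the bucket boundaries from the fill latency histogram earlier. The counts are synthetic, and `estimate_quantile` is an illustrative name, not a Prometheus API.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets,
    sorted by upper bound and ending with the +Inf bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Rank falls past the last finite bucket: report its upper bound.
                return buckets[-2][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return float("nan")


# Synthetic cumulative counts for 10,000 fills.
buckets = [(0.0001, 6000), (0.0005, 9500), (0.001, 9890), (0.005, 9950),
           (0.01, 9985), (0.05, 9996), (0.1, 10000), (float("inf"), 10000)]

print(round(estimate_quantile(0.99, buckets), 6))   # 0.001667
print(round(estimate_quantile(0.999, buckets), 6))  # 0.028182
```

In this example the p99.9 estimate of roughly 28ms only tells you the true value lies somewhere between 10ms and 50ms. If the SLO threshold sits inside a wide bucket, adding boundaries near that threshold makes the alert more trustworthy.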

Market Data Freshness

# Alert if any symbol is stale > 2ms
max(market_data_staleness_seconds) by (exchange) > 0.002

Position Reconciliation

# Any position drift is a problem
sum(position_drift_absolute) > 0

Common Misconceptions

Myth: “p99 latency is enough.”
Reality: p99 hides your worst 1%. In trading, one 100ms spike during volatile markets can cause significant losses. Track p99.9 or even p99.99.

Myth: “We monitor latency, so we’re fine.”
Reality: Latency to where? You need to measure the complete path: tick-to-trade (market data received to order sent) and order-to-fill (order sent to fill confirmed).

Myth: “Position drift can wait for daily reconciliation.”
Reality: If you’re trading 1000 orders/day and one fails silently, you could have a large unhedged position for hours. Real-time reconciliation is mandatory.
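
To make the first myth concrete, here is a small sketch with synthetic data in which 0.5% of fills take 100ms. The p99 looks healthy while the p99.9 exposes the tail. The `percentile` helper uses the nearest-rank method and is written here for illustration.

```python
import math
from typing import List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic data: 10,000 fills, 9,950 at 200 µs and 50 (0.5%) at 100 ms.
latencies = [0.0002] * 9950 + [0.1] * 50

print(percentile(latencies, 99))    # 0.0002
print(percentile(latencies, 99.9))  # 0.1
```

A dashboard showing only p99 would report 200µs for this data set even though 50 orders waited 500 times longer.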


Grafana Dashboard Layout

Recommended panels:

+------------------+------------------+------------------+
|  Fill Latency    |  Market Data     |  Position Drift  |
|  (Heatmap)       |  Staleness       |  (per symbol)    |
+------------------+------------------+------------------+
|  Fill Rate %     |  Order Flow      |  PnL Real-time   |
|  (by venue)      |  (orders/sec)    |  (streaming)     |
+------------------+------------------+------------------+
|  Error Rate      |  Quote/Trade     |  System Alerts   |
|  (by type)       |  Ratio           |  (last 24h)      |
+------------------+------------------+------------------+

Alerting Rules

# prometheus/alerts.yml
groups:
  - name: trading
    rules:
      - alert: FillLatencyHigh
        expr: histogram_quantile(0.99, sum(rate(order_fill_latency_seconds_bucket[5m])) by (le)) > 0.002
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Fill latency p99 above 2ms"
      
      - alert: MarketDataStale
        expr: max(market_data_staleness_seconds) > 0.002
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Market data > 2ms stale"
      
      - alert: PositionDrift
        expr: sum(position_drift_absolute) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Position mismatch detected"

Practice Exercises

Exercise 1: Implement Fill Latency

# Add timing around your order flow
# Measure: order_created → order_sent → exchange_ack → fill_confirmed

Exercise 2: Set Up Staleness Check

# Compare exchange timestamp to local time
# Alert if difference > 2ms

Exercise 3: Build a Grafana Dashboard

# Create panels for:
# - Fill latency heatmap (last 1 hour)
# - Staleness gauge (current value)
# - Position reconciliation status

Key Takeaways

  1. Trading metrics ≠ SRE metrics - Latency is a business metric, not just technical
  2. p99.9 matters more than p99 - Tail latency costs money
  3. Market data staleness is invisible - You’re trading on old prices without knowing
  4. Position drift = silent failure - Real-time reconciliation is mandatory

Now you know what to put on your trading dashboard. 📈

