Infrastructure

eBPF Profiling: Nanoseconds Without Adding Any

Deep dive into eBPF, bpftrace, and kernel tracing. How to measure latency at nanosecond precision without the observer effect.

6 min
#ebpf #profiling #observability #linux #latency #kernel #monitoring

You cannot debug a latency spike if your debugger causes the spike.

Traditional profiling tools add significant overhead: strace costs 100-500µs per syscall, perf sampling uses 5-20% CPU. In a trading system processing 100K messages/second, these tools destroy the behavior you’re trying to measure.

eBPF changes this. Run sandboxed programs in the kernel, measure at nanosecond precision, add nearly zero overhead.

The Problem {#the-problem}

Traditional tools and their costs:

| Tool | Overhead | Why |
|------|----------|-----|
| strace | 100-500 µs/syscall | ptrace stops the process, copies data |
| ltrace | 200-800 µs/call | Library interposition |
| perf (sampling) | 5-20% CPU | Interrupt-based sampling |
| gdb | Process stopped | Breakpoint traps |

The observer effect: The measurement changes the behavior. At trading speeds, this makes traditional tools useless for production debugging.

eBPF overhead: 100-500 nanoseconds per probe. 1000x less than strace.

For what to measure, see Monitoring Metrics. For kernel tuning that affects what you’ll observe, see the deep dive series linked in the Reading Path at the end of this post.

Background: What Is eBPF? {#background}

The Architecture

eBPF (extended Berkeley Packet Filter) runs sandboxed programs in the Linux kernel (kernel source: kernel/bpf/).

  1. User space program (bpftrace, BCC tool)
  2. BPF bytecode compilation
  3. Kernel verifier (safety checks)
  4. JIT compilation to native code
  5. Attached to a kernel hook (kprobe, tracepoint, etc.)
  6. Runs on every event, collects data
  7. Maps (hash tables, arrays) export results to user space

Why It’s Fast

  1. In-kernel execution: No context switches to user space
  2. JIT compiled: Runs at near-native speed
  3. Verified: Can’t crash the kernel
  4. No locks: Lock-free data structures for metrics
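
To make the in-kernel aggregation concrete, here is a minimal sketch: the per-process counts live in a BPF map inside the kernel and are only copied to user space when you hit Ctrl-C, so nothing crosses the kernel boundary per event.

# Count syscalls per process entirely in the kernel; bpftrace prints the map on exit
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @calls[comm] = count(); }'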

Available Hooks

| Hook Type | What It Traces | Example |
|-----------|----------------|---------|
| kprobe | Function entry | kprobe:tcp_sendmsg |
| kretprobe | Function return | kretprobe:tcp_sendmsg |
| tracepoint | Static kernel events | tracepoint:sched:sched_switch |
| uprobe | User-space function | uprobe:/usr/bin/python:main |
| USDT | User-defined traces | Application-specific |

Citation: the eBPF architecture is documented under Documentation/bpf/ in the kernel source tree.
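
Before writing a probe, check which hooks your kernel actually exposes; bpftrace can list them with -l and a search pattern (the binary path below is just an example):

# List available hooks on this machine (output differs per kernel and build)
sudo bpftrace -l 'tracepoint:sched:*'      # static scheduler tracepoints
sudo bpftrace -l 'kprobe:tcp_*'            # kernel functions matching tcp_*
sudo bpftrace -l 'uprobe:/bin/bash:*'      # user-space symbols in a binary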

Tool 1: bpftrace One-Liners {#bpftrace}

Installation

# Ubuntu/Debian
sudo apt install bpftrace

# Amazon Linux 2023
sudo dnf install bpftrace
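
A quick sanity check that the toolchain works (needs root; prints a message and exits immediately):

# Verify bpftrace can compile, load, and run a program
sudo bpftrace -e 'BEGIN { printf("bpftrace is alive\n"); exit(); }'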

Syscall Latency Histogram

# Distribution of syscall latencies (microseconds)
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'

Output:

@usecs:
[0]                  12453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                   4521 |@@@@@@@@@@@@@@@@@                                   |
[2, 4)                1234 |@@@@@                                               |
[4, 8)                 342 |@                                                   |
[8, 16)                 45 |                                                    |

Per-Syscall Latency

# Which syscalls are slow?
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { 
  @start[tid] = nsecs; 
  @syscall[tid] = args->id;
}
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
  @latency[@syscall[tid]] = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
  delete(@syscall[tid]);
}'
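
The histogram above is keyed by numeric syscall ID (map IDs to names with a tool such as ausyscall from the audit package). If you already know which syscall you care about, the per-syscall tracepoints give you the name directly; a sketch for read(2), assuming your kernel exposes the syscalls:sys_enter_read tracepoint:

# Latency of read() only, in microseconds
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/ {
  @read_usecs = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'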

Tool 2: Network Tracing {#network}

TCP Send Latency

# How long does tcp_sendmsg take in kernel?
sudo bpftrace -e '
kprobe:tcp_sendmsg { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @tcp_send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
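
On a shared host you usually want this scoped to a single process. A sketch filtered on the process name; "trading-engine" is a placeholder for your binary's comm:

# tcp_sendmsg latency for one process only
sudo bpftrace -e '
kprobe:tcp_sendmsg /comm == "trading-engine"/ { @start[tid] = nsecs; }
kretprobe:tcp_sendmsg /@start[tid]/ {
  @tcp_send_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'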

Packet Receive Path

# Latency from NIC to socket
sudo bpftrace -e '
tracepoint:net:netif_receive_skb { @rx_start[args->skbaddr] = nsecs; }
tracepoint:net:consume_skb /@rx_start[args->skbaddr]/ {
  @rx_time_ns = hist(nsecs - @rx_start[args->skbaddr]);
  delete(@rx_start[args->skbaddr]);
}'

Softirq Time

# Network softirq (vec 3 = NET_RX) processing time
sudo bpftrace -e '
tracepoint:irq:softirq_entry /args->vec == 3/ { @start[tid] = nsecs; }
tracepoint:irq:softirq_exit /args->vec == 3 && @start[tid]/ {
  @net_softirq_ns = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
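
If you are not yet sure the network softirq is the one hurting you, a broader sketch breaks the time down per vector (0=HI, 1=TIMER, 2=NET_TX, 3=NET_RX, 4=BLOCK, 6=TASKLET, 7=SCHED, 9=RCU):

# Softirq processing time, keyed by vector number
sudo bpftrace -e '
tracepoint:irq:softirq_entry { @start[tid, args->vec] = nsecs; }
tracepoint:irq:softirq_exit {
  $ts = @start[tid, args->vec];
  if ($ts) {
    @softirq_ns[args->vec] = hist(nsecs - $ts);
    delete(@start[tid, args->vec]);
  }
}'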

For network tuning that affects these measurements, see Network Deep Dive.

Tool 3: Scheduler Latency {#scheduler}

Run Queue Latency

# How long do threads wait for CPU?
sudo bpftrace -e '
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch {
  // a task preempted while still runnable goes straight back on the run queue
  if (args->prev_state == 0) { @qtime[args->prev_pid] = nsecs; }
  if (@qtime[args->next_pid]) {
    @runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]);
  }
}'

Interpretation:

  • Spikes at 4ms = timer tick preemption (CONFIG_HZ=250)
  • Long tail = CPU contention
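
Often only one process matters. The same measurement scoped by comm (the name "trading-engine" is a placeholder):

# Run queue latency for a single process
sudo bpftrace -e '
tracepoint:sched:sched_wakeup /str(args->comm) == "trading-engine"/ { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch /str(args->next_comm) == "trading-engine"/ {
  if (@qtime[args->next_pid]) {
    @runq_latency_us = hist((nsecs - @qtime[args->next_pid]) / 1000);
    delete(@qtime[args->next_pid]);
  }
}'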

Off-CPU Time

# Why is my thread not running?
sudo bpftrace -e '
tracepoint:sched:sched_switch {
  @off_start[args->prev_pid] = nsecs;
  if (@off_start[args->next_pid]) {
    @off_cpu_us[args->next_comm] = hist((nsecs - @off_start[args->next_pid]) / 1000);
    delete(@off_start[args->next_pid]);
  }
}'

For scheduler tuning, see CPU Deep Dive.

Tool 4: Memory Allocation {#memory}

Malloc Latency

# How long does malloc take?
sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { 
  @start[tid] = nsecs; 
  @size[tid] = arg0;
}
uretprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc /@start[tid]/ {
  @malloc_ns[@size[tid]] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
  delete(@size[tid]);
}'
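
uprobes cost more than kernel probes, so if you only need allocation sizes (not timing), a single-probe sketch is cheaper; the libc path is the Debian/Ubuntu one from the example above:

# Allocation size distribution only (one probe, no entry/return pairing)
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @alloc_bytes = hist(arg0); }'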

Page Faults

# Minor and major page fault latency
sudo bpftrace -e '
kprobe:handle_mm_fault { @start[tid] = nsecs; }
kretprobe:handle_mm_fault /@start[tid]/ {
  @page_fault_us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
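
For a quick first pass, it is often enough to see who is faulting before measuring how long each fault takes:

# Which processes are taking page faults?
sudo bpftrace -e 'kprobe:handle_mm_fault { @faults[comm] = count(); }'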

For memory tuning, see Memory Deep Dive.

Tool 5: Block I/O {#block-io}

Request Latency

# Block I/O request latency (issue to complete)
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @issue[args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@issue[args->sector]/ {
  @io_latency_us = hist((nsecs - @issue[args->sector]) / 1000);
  delete(@issue[args->sector]);
}'

I/O Size Distribution

# What sizes are we reading/writing?
sudo bpftrace -e '
tracepoint:block:block_rq_issue {
  @io_size_bytes = hist(args->bytes);
}'
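
It is also worth knowing who issues the I/O; a sketch keyed by the issuing task (writeback often shows up as kworker threads rather than your application):

# I/O size distribution per issuing process
sudo bpftrace -e 'tracepoint:block:block_rq_issue { @bytes[comm] = hist(args->bytes); }'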

For storage tuning, see Storage Deep Dive.

Production Deployment {#production}

Continuous Monitoring with BCC

#!/usr/bin/env python3
# syscall_latency_exporter.py
from bcc import BPF
from prometheus_client import start_http_server, Histogram
import time

bpf_program = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}

TRACEPOINT_PROBE(raw_syscalls, sys_exit) {
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&pid);
    if (tsp) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));  // microseconds
        start.delete(&pid);
    }
    return 0;
}
"""

SYSCALL_LATENCY = Histogram(
    'syscall_latency_microseconds',
    'Syscall latency',
    buckets=[1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
)

def main():
    b = BPF(text=bpf_program)
    start_http_server(9101)
    
    while True:
        time.sleep(10)
        for k, v in b["dist"].items():
            bucket_us = 2 ** k.value              # lower bound of the log2 bucket, in µs
            for _ in range(v.value):              # one observation per event in this bucket
                SYSCALL_LATENCY.observe(bucket_us)
        b["dist"].clear()

if __name__ == "__main__":
    main()
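
A minimal way to run it, assuming the Debian/Ubuntu package names for the BCC Python bindings and the Prometheus client, and the filename from the header comment:

# Install dependencies, start the exporter, and check the metrics endpoint
sudo apt install python3-bpfcc python3-prometheus-client
sudo python3 syscall_latency_exporter.py &
curl -s localhost:9101/metrics | grep syscall_latency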

Kubernetes DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ebpf-exporter
spec:
  selector:
    matchLabels:
      app: ebpf-exporter
  template:
    metadata:
      labels:
        app: ebpf-exporter
    spec:
      hostPID: true
      hostNetwork: true
      containers:
      - name: exporter
        image: ebpf-exporter:latest
        securityContext:
          privileged: true
        volumeMounts:
        - name: bpf
          mountPath: /sys/fs/bpf
        - name: debug
          mountPath: /sys/kernel/debug
      volumes:
      - name: bpf
        hostPath:
          path: /sys/fs/bpf
      - name: debug
        hostPath:
          path: /sys/kernel/debug

Design Philosophy {#design-philosophy}

Measure, Don’t Assume

Before optimizing:

  1. Measure current state (where is latency?)
  2. Make change
  3. Measure again (did it help?)

The pattern: Most optimization efforts target the wrong thing. Measurement reveals the true bottleneck.

The Observer Effect Quantified

| Tool | Overhead/Event | Impact at high event rates |
|------|----------------|----------------------------|
| printf | 10-50 µs | Severe |
| strace | 100-500 µs | Unusable |
| perf | 0.1-1 µs | 5-20% CPU |
| eBPF | 0.1-0.5 µs | <1% CPU |

The tradeoff: eBPF requires kernel knowledge to write and interpret. Worth it for production observability.

When to Use Each Tool

| Situation | Tool |
|-----------|------|
| Quick debugging | bpftrace one-liner |
| Production monitoring | BCC + Prometheus |
| Detailed analysis | perf + flame graphs |
| Syscall list | strace (development only) |

Audit Your Infrastructure

Profiling is only half the battle. Run latency-audit to verify your kernel is configured for low-latency workloads before you start measuring.

pip install latency-audit && latency-audit

Reading Path

Continue exploring with these related deep dives:

| Topic | Next Post |
|-------|-----------|
| SLOs, metrics that matter, alerting | Trading Metrics: What SRE Dashboards Miss |
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| THP, huge pages, memory locking, pre-allocation | Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery |
| NIC offloads, IRQ affinity, kernel bypass | Network Optimization: Kernel Bypass and the Art of Busy Polling |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |