eBPF Profiling Without the Observer Effect

How to measure latency at nanosecond precision without slowing down your application. eBPF, bpftrace, and kernel tracing.

Intermediate · 25 min read · Expert Version →

🎯 What You'll Learn

  • Understand why traditional profilers add latency
  • Use bpftrace for nanosecond-precision tracing
  • Write custom eBPF programs for application tracing
  • Avoid the observer effect in production profiling

📚 Prerequisites

Before this lesson, you should understand:

The Profiling Paradox

You want to measure your application’s latency. You attach a profiler. Suddenly, your application is 10-100x slower.

Without profiler: 10µs per operation
With strace:      500µs per operation (50x slower!)
With gdb:         Don't even try in production

This is the observer effect: measuring the system changes the system.
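
You can reproduce this on your own machine. A rough sanity check, using dd purely as a hypothetical syscall-heavy stand-in for your application (any workload that makes many small syscalls shows the same pattern):

# Baseline: ~100k reads + ~100k writes, no tracer attached
time dd if=/dev/zero of=/dev/null bs=64 count=100000

# Same workload under strace; watch the wall-clock time blow up
time strace -c -o /dev/null dd if=/dev/zero of=/dev/null bs=64 count=100000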


What You’ll Learn

By the end of this lesson, you’ll understand:

  1. Why profilers slow things down - Ptrace, context switches, memory overhead
  2. How eBPF bypasses the problem - Kernel-space tracing with JIT compilation
  3. Using bpftrace - One-liners for latency profiling
  4. Writing eBPF programs - Custom tracing for your application

The Foundation: Why Traditional Profilers Hurt

Traditional profilers use ptrace() to intercept syscalls:

Application enters syscall → context switch to profiler → profiler logs data → context switch back → syscall executes

ptrace stops the tracee at both syscall entry and exit, so that dance happens twice per syscall: four extra context switches, each costing 1-5µs.
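
You can watch those context switches pile up while a tracer is attached. A minimal sketch, assuming pidstat (from the sysstat package) is available and <pid> is your application's process ID:

# Per-second voluntary (cswch/s) and involuntary (nvcswch/s) context switches
pidstat -w -p <pid> 1

# In another terminal, attach strace and watch cswch/s jump
sudo strace -p <pid> -o /dev/null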


The “Aha!” Moment

Here’s what makes eBPF different:

eBPF programs run inside the kernel, not in a separate process. There’s no context switch. The tracing code is JIT-compiled to native machine code and executes in nanoseconds, not microseconds.

eBPF adds ~100ns overhead. Traditional profilers add ~100µs. That’s a 1000x difference.
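
One quick way to confirm the JIT part on your own kernel: it is exposed as a sysctl (1 means eBPF programs are compiled to native code instead of being interpreted).

# 1 = JIT enabled (the default on most modern distro kernels)
sysctl net.core.bpf_jit_enable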


Let’s See It In Action: Compare strace vs bpftrace

# The slow way: strace (don't use in production)
strace -tt -T ./your_app 2>&1 | head -20
# Note: Your app is now 50x slower

# The fast way: bpftrace (production-safe)
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_write /pid == 12345/ { @writes = count(); }'
# Your app runs at normal speed
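
By default that counting one-liner only prints its map when you hit Ctrl-C. A small variation, still using pid 12345 as a placeholder, prints and resets the counter every second instead:

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write /pid == 12345/ { @writes = count(); }
interval:s:1 { print(@writes); clear(@writes); }'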

Measuring syscall latency with bpftrace

# Histogram of write() syscall latency
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write /pid == 12345/ { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_write /@start[tid]/ {
  @latency = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'

Output:

@latency:
[1K, 2K)    5234 |@@@@@@@@@@@@@@@@@@@@    |
[2K, 4K)    8123 |@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)    2341 |@@@@@@@                 |
[8K, 16K)    423 |@                       |
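
Hard-coding a pid is awkward for services that restart. A variant of the same histogram filtered by process name instead; here "your_app" is a placeholder for your binary's comm value (the task name, truncated to 15 characters):

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write /comm == "your_app"/ { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_write /@start[tid]/ {
  @latency[comm] = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'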

bpftrace One-Liners for Trading

1. Network latency per packet

# Time from driver receive to the data being copied into the app's buffer
# (approximate: keyed on the sk_buff pointer, and GRO may coalesce packets)
sudo bpftrace -e '
kprobe:__netif_receive_skb { @start[arg0] = nsecs; }
kprobe:skb_copy_datagram_iter /@start[arg0]/ {
  @latency = hist(nsecs - @start[arg0]);
  delete(@start[arg0]);
}'

2. Scheduler latency (time waiting to run)

# How long threads wait for CPU
sudo bpftrace -e '
tracepoint:sched:sched_wakeup { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch /@qtime[args->next_pid]/ {
  @latency = hist(nsecs - @qtime[args->next_pid]);
  delete(@qtime[args->next_pid]);
}'
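
That one-liner aggregates every thread on the box. If you want to see which process is waiting, key the histogram by task name, using the same two tracepoints:

sudo bpftrace -e '
tracepoint:sched:sched_wakeup { @qtime[args->pid] = nsecs; }
tracepoint:sched:sched_switch /@qtime[args->next_pid]/ {
  @latency[args->next_comm] = hist(nsecs - @qtime[args->next_pid]);
  delete(@qtime[args->next_pid]);
}'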

3. Lock contention

# Time spent waiting for a mutex
# Note: pthread_mutex_lock usually resolves in libc/libpthread rather than in
# your binary; point the uprobe at whichever object actually provides the
# symbol (or at your binary if it is statically linked)
sudo bpftrace -e '
uprobe:/path/to/app:pthread_mutex_lock { @start[tid] = nsecs; }
uretprobe:/path/to/app:pthread_mutex_lock /@start[tid]/ {
  @lock_time = hist(nsecs - @start[tid]);
  delete(@start[tid]);
}'
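
Those uprobes add a small cost to every pthread_mutex_lock call, contended or not. An alternative sketch that only sees contended locks: time the futex() syscall that a contested mutex falls back to (again using pid 12345 as a placeholder):

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex /pid == 12345/ { @fstart[tid] = nsecs; }
tracepoint:syscalls:sys_exit_futex /@fstart[tid]/ {
  @futex_wait = hist(nsecs - @fstart[tid]);
  delete(@fstart[tid]);
}'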

Writing Custom eBPF Programs

For complex tracing, write eBPF in C:

// latency_trace.bpf.c
// vmlinux.h supplies the kernel types used below (u32, u64,
// struct trace_event_raw_sys_enter); generate it once with:
//   bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);
    __type(value, u64);
} start_time SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_write")
int trace_write_entry(struct trace_event_raw_sys_enter *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&start_time, &tid, &ts, BPF_ANY);
    return 0;
}

SEC("tracepoint/syscalls/sys_exit_write")
int trace_write_exit(struct trace_event_raw_sys_exit *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    u64 *ts = bpf_map_lookup_elem(&start_time, &tid);
    if (ts) {
        u64 latency = bpf_ktime_get_ns() - *ts;
        // For a real tool, push latency to a ring/perf buffer or a histogram
        // map; bpf_printk is enough to see it working
        bpf_printk("write latency: %llu ns", latency);
        bpf_map_delete_elem(&start_time, &tid);
    }
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Compile and load:

# -g emits BTF type info (needed for CO-RE and useful for bpftool introspection)
clang -g -O2 -target bpf -c latency_trace.bpf.c -o latency_trace.o
sudo bpftool prog load latency_trace.o /sys/fs/bpf/latency_trace
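
Loading pins the program but does not attach it to the tracepoints; attaching is usually done by a small libbpf loader, which is beyond this lesson. Two quick checks once it is loaded (paths are the defaults on most distros):

# The pinned program should show up, verifier-approved and JIT-compiled
sudo bpftool prog show pinned /sys/fs/bpf/latency_trace

# Once attached, bpf_printk output appears in the kernel trace buffer
sudo cat /sys/kernel/debug/tracing/trace_pipe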

Common Misconceptions

Myth: “eBPF is only for network tracing.”
Reality: eBPF can trace any kernel or user-space function: syscalls, scheduler events, locks, memory allocation, file I/O, and more.

Myth: “eBPF is unsafe and can crash the kernel.”
Reality: The eBPF verifier checks every program before loading. It ensures programs terminate, don’t access invalid memory, and don’t cause kernel panics.

Myth: “I need to learn C to use eBPF.”
Reality: bpftrace provides a high-level scripting language. You can do most tracing without writing any C.


Overhead Comparison

Tool        Overhead           Use Case
bpftrace    ~100ns/event       Production tracing
perf        ~500ns/event       Sampling profiler
strace      ~50-100µs/event    Development only
gdb         Stops process      Development only

Practice Exercises

Exercise 1: Trace Your Shell

# Count syscalls by type
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_* /pid == '$BASHPID'/ { @[probe] = count(); }'

Exercise 2: Find Slow Writes

# Writes taking >1ms
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_write { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_write /@start[tid]/ {
  $lat = nsecs - @start[tid];
  if ($lat > 1000000) { printf("pid %d: %d ns\n", pid, $lat); }
  delete(@start[tid]);
}'

Exercise 3: Profile Network Path

# Time from NIC receive to copy into the application's buffer
# (approximate: GRO may coalesce packets, so not every skb matches)
sudo bpftrace -e '
kprobe:napi_gro_receive { @recv[arg1] = nsecs; }
kprobe:skb_copy_datagram_iter /@recv[arg0]/ {
  printf("network->app: %d ns\n", nsecs - @recv[arg0]);
  delete(@recv[arg0]);
}'

Key Takeaways

  1. Traditional profilers add 50-100µs overhead - Context switches kill latency
  2. eBPF runs in kernel space - ~100ns overhead, 1000x faster
  3. bpftrace for quick wins - One-liners for common traces
  4. Custom eBPF for production - Full control with C programs

What’s Next?

🎯 Continue learning: Monitoring Trading Systems

🔬 Expert version: eBPF Profiling: Nanoseconds Without Adding Any

Now you know how to measure without disturbing. 🔬

Questions about this lesson? Working on related infrastructure?

Let's discuss