Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery

A trading engine at a crypto exchange exhibited a perfect 50µs P99-until memory demand crossed 70% of available RAM. Then, P99 spiked to 15ms.

The root cause: Transparent Huge Pages (THP). The kernel was “helpfully” defragmenting memory in the background, stealing CPU cycles from the trading thread.

This post documents how to disable THP and pre-allocate explicit HugePages for deterministic memory access.

1. The Physics of Page Tables

When your application accesses memory, the CPU must translate a virtual address to a physical address. This translation uses the Translation Lookaside Buffer (TLB), a hardware cache.

A TLB miss is expensive: 10-100 CPU cycles.

The THP Trap

THP attempts to reduce TLB misses by using 2MB pages instead of 4KB pages. In theory, this is good. In practice, the compaction daemon (khugepaged) runs in the kernel, consuming CPU and holding spinlocks while it defragments memory.

App Allocates → THP Enabled? → khugepaged Runs → Spinlocks (1-15ms) → App Stalls

Source: Red Hat Performance Tuning Guide - THP

2. The Decision Matrix

Approach	TLB Pressure	Compaction Stalls	Verdict
A. THP Enabled (Default)	Low	High (Bad)	Unpredictable. Rejected.
B. THP Disabled, No HugePages	High	None	Stable but slower.
C. THP Disabled + Explicit HugePages	Low	None	Selected. Best of both worlds.

3. The Kill: Explicit HugePage Reservation

Step 1: Disable THP Globally

# /etc/default/grub
GRUB_CMDLINE_LINUX="transparent_hugepage=never"

sudo update-grub && sudo reboot

Verification:

cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: [always] madvise [never]
#           The 'never' should have brackets.

Step 2: Reserve Explicit HugePages at Boot

Calculate how many 2MB pages you need. For a 32GB heap: 32 * 1024 / 2 = 16384 pages

# /etc/sysctl.conf
vm.nr_hugepages = 16384

sudo sysctl -p

Step 3: Use HugePages in Your Application

Java (JVM):

java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -jar myapp.jar

C/Rust (mmap):

void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

4. The Tool: Auditing HugePage State

pip install latency-audit && latency-audit --check memory

This checks that THP is disabled and that vm.nr_hugepages is set to a non-zero value.

5. Systems Thinking: The Trade-offs

Memory Fragmentation: Explicit HugePages reserve physical memory at boot. If you over-allocate, you starve other processes.
Non-Swappable: HugePages cannot be swapped to disk. Ensure you have enough physical RAM.
Application Support: Not all applications support HugePages. Test thoroughly before production.

6. The Philosophy

The kernel’s “smart” features assume you value average throughput. In HFT, you value worst-case latency. These are opposing goals.

THP is a prime example: it optimizes the common case (less TLB pressure) at the cost of the rare case (catastrophic compaction stalls). For HFT, the rare case is your SLA.

Disable the kernel’s intelligence. Replace it with your own, explicit configuration.

Topic	Next Post
CPU governors, C-states, NUMA, isolation	CPU Isolation for HFT: The isolcpus Lie and What Actually Works
I/O schedulers, Direct I/O, EBS tuning	I/O Schedulers: Why the Kernel Reorders Your Writes
The 5 kernel settings that cost you latency	The $2M Millisecond: Linux Defaults That Cost You Money
Measuring without overhead using eBPF	eBPF Profiling: Nanoseconds Without Adding Any
Design philosophy & architecture decisions	Trading Infrastructure: First Principles That Scale

Nikhil Padala