Infrastructure
Memory Tuning for Low-Latency: The THP Trap and HugePage Mastery
How Transparent Huge Pages cause unpredictable latency spikes, and the explicit HugePage reservation strategy that eliminates memory stalls.
A trading engine at a crypto exchange exhibited a perfect 50µs P99-until memory demand crossed 70% of available RAM. Then, P99 spiked to 15ms.
The root cause: Transparent Huge Pages (THP). The kernel was “helpfully” defragmenting memory in the background, stealing CPU cycles from the trading thread.
This post documents how to disable THP and pre-allocate explicit HugePages for deterministic memory access.
1. The Physics of Page Tables
When your application accesses memory, the CPU must translate a virtual address to a physical address. This translation uses the Translation Lookaside Buffer (TLB), a hardware cache.
A TLB miss is expensive: 10-100 CPU cycles.
The THP Trap
THP attempts to reduce TLB misses by using 2MB pages instead of 4KB pages. In theory, this is good. In practice, the compaction daemon (khugepaged) runs in the kernel, consuming CPU and holding spinlocks while it defragments memory.
2. The Decision Matrix
| Approach | TLB Pressure | Compaction Stalls | Verdict |
|---|---|---|---|
| A. THP Enabled (Default) | Low | High (Bad) | Unpredictable. Rejected. |
| B. THP Disabled, No HugePages | High | None | Stable but slower. |
| C. THP Disabled + Explicit HugePages | Low | None | Selected. Best of both worlds. |
3. The Kill: Explicit HugePage Reservation
Step 1: Disable THP Globally
# /etc/default/grub
GRUB_CMDLINE_LINUX="transparent_hugepage=never"
sudo update-grub && sudo reboot
Verification:
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: [always] madvise [never]
# The 'never' should have brackets.
Step 2: Reserve Explicit HugePages at Boot
Calculate how many 2MB pages you need. For a 32GB heap:
32 * 1024 / 2 = 16384 pages
# /etc/sysctl.conf
vm.nr_hugepages = 16384
sudo sysctl -p
Step 3: Use HugePages in Your Application
Java (JVM):
java -XX:+UseLargePages -XX:LargePageSizeInBytes=2m -jar myapp.jar
C/Rust (mmap):
void* ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
4. The Tool: Auditing HugePage State
pip install latency-audit && latency-audit --check memory
This checks that THP is disabled and that vm.nr_hugepages is set to a non-zero value.
5. Systems Thinking: The Trade-offs
- Memory Fragmentation: Explicit HugePages reserve physical memory at boot. If you over-allocate, you starve other processes.
- Non-Swappable: HugePages cannot be swapped to disk. Ensure you have enough physical RAM.
- Application Support: Not all applications support HugePages. Test thoroughly before production.
6. The Philosophy
The kernel’s “smart” features assume you value average throughput. In HFT, you value worst-case latency. These are opposing goals.
THP is a prime example: it optimizes the common case (less TLB pressure) at the cost of the rare case (catastrophic compaction stalls). For HFT, the rare case is your SLA.
Disable the kernel’s intelligence. Replace it with your own, explicit configuration.
Up Next in Linux Infrastructure Deep Dives
Network Optimization: Kernel Bypass and the Art of Busy Polling
How the Linux network stack adds 50µs of latency, and the interrupt coalescing, busy polling, and AF_XDP techniques that eliminate it.
Reading Path
Continue exploring with these related deep dives:
| Topic | Next Post |
|---|---|
| CPU governors, C-states, NUMA, isolation | CPU Isolation for HFT: The isolcpus Lie and What Actually Works |
| I/O schedulers, Direct I/O, EBS tuning | I/O Schedulers: Why the Kernel Reorders Your Writes |
| The 5 kernel settings that cost you latency | The $2M Millisecond: Linux Defaults That Cost You Money |
| Measuring without overhead using eBPF | eBPF Profiling: Nanoseconds Without Adding Any |
| Design philosophy & architecture decisions | Trading Infrastructure: First Principles That Scale |