Low-Latency Infrastructure FAQ

Expert answers to common questions about HFT systems, kernel tuning, and DeFi execution.

Kernel & OS

What is kernel bypass (DPDK) and why does it reduce latency?

Kernel bypass with DPDK (Data Plane Development Kit) eliminates the overhead of the Linux network stack by allowing applications to directly poll the NIC from user space. A normal packet traverses 5+ layers: NIC → Driver → Kernel Space → TCP/IP Stack → Socket Buffer → User Space. Each transition costs 1-5µs due to context switches and memory copies. DPDK maps the NIC directly to user-space memory, reducing per-packet latency from ~15µs to ~1µs. The trade-off is you lose the kernel's TCP implementation and must handle protocol logic yourself.
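A minimal sketch of the resulting receive loop, assuming the usual EAL, port, and mempool initialization (rte_eal_init, rte_eth_dev_configure, RX queue setup, rte_eth_dev_start) has already run; port 0, queue 0, and handle_packet are placeholders:

    // Busy-poll receive loop: no interrupts, no syscalls, no kernel copies.
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    static void rx_poll_loop(uint16_t port_id) {
        struct rte_mbuf* bufs[32];
        for (;;) {
            // Poll the NIC directly from user space; up to 32 packets per call.
            const uint16_t n = rte_eth_rx_burst(port_id, /*queue=*/0, bufs, 32);
            for (uint16_t i = 0; i < n; i++) {
                // handle_packet(bufs[i]);   // application-defined parsing
                rte_pktmbuf_free(bufs[i]);   // return the mbuf to its mempool
            }
        }
    }

The loop burns a full core spinning, which is the point: the core is always awake and the packet never touches the kernel.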

What are isolcpus and nohz_full kernel parameters?

isolcpus removes the listed CPUs from the general scheduler, so nothing runs on them unless explicitly pinned there. nohz_full disables the kernel's periodic timer tick (commonly 250Hz) on those CPUs while a single task is running, eliminating tick interruptions; rcu_nocbs additionally offloads RCU callback processing to other cores. Combined, they let a pinned trading thread run essentially uninterrupted, which in practice can cut P99 latency by an order of magnitude (e.g., from ~50µs to ~5µs). Typical usage: GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"
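Note that isolcpus only removes cores from the scheduler; your trading thread still has to be pinned onto one explicitly. A minimal sketch using pthread_setaffinity_np, with CPU 2 matching the GRUB example above:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one isolated core (CPU 2 here).
    static int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }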

How do CPU C-states affect trading latency?

C-states are CPU power-saving sleep states. When a CPU is idle, it enters deeper C-states to save power: C1 has 2µs wake-up time, C3 has 50µs, C6 has 100µs+. If your trading thread is waiting for a packet and the CPU enters C3, you pay 50µs just to wake up before processing. For HFT, disable C-states via BIOS or kernel parameters: intel_idle.max_cstate=0 processor.max_cstate=0
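Besides BIOS and boot parameters, Linux exposes a runtime knob: holding /dev/cpu_dma_latency open with a value of 0 asks the kernel's PM QoS layer to keep CPUs out of deep C-states for as long as the file descriptor stays open. A sketch:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdint>

    // Keep CPUs in shallow C-states while the returned fd stays open;
    // closing it restores the default power-saving behavior.
    static int forbid_deep_cstates() {
        int fd = open("/dev/cpu_dma_latency", O_RDWR);
        if (fd < 0) return -1;
        int32_t max_latency_us = 0;  // 0 = demand C0/C1 wake-up latency
        write(fd, &max_latency_us, sizeof(max_latency_us));
        return fd;                   // hold for the life of the process
    }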

What are huge pages and why do they matter for trading?

Huge pages (2MB or 1GB) reduce TLB (Translation Lookaside Buffer) misses. With standard 4KB pages, a 1GB dataset needs 262,144 pages, but a typical TLB holds only ~1,500 entries, causing constant misses. With 2MB pages, the same dataset uses only 512 pages, which fits comfortably in the TLB. Each TLB miss costs 10-100 cycles (5-50ns). For memory-intensive HFT workloads like order books, that overhead can add 1-5µs per operation. Enable with: echo 1024 > /proc/sys/vm/nr_hugepages (reserves 1024 × 2MB = 2GB)
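For explicit control, the buffer can be mapped with MAP_HUGETLB; a sketch assuming the nr_hugepages reservation above succeeded:

    #include <sys/mman.h>
    #include <cstddef>

    // Back a buffer (e.g., the order book) with explicit 2MB huge pages.
    // bytes must be a multiple of the 2MB huge-page size.
    static void* alloc_hugepage_buffer(size_t bytes) {
        void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return (p == MAP_FAILED) ? nullptr : p;
    }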

Network & Hardware

What is NIC IRQ affinity and how do I configure it?

IRQ affinity controls which CPU handles interrupts from a network card. By default, Linux (or the irqbalance daemon) can route NIC interrupts to any CPU. If your trading thread is on CPU 2 and the IRQ lands on CPU 5, packet data bounces between caches and may cross the socket interconnect, adding latency and jitter on every arrival. Worse, if the IRQ lands on the same CPU as your trading thread, it interrupts your critical path. The fix: pin NIC IRQs to dedicated cores separate from trading threads. The smp_affinity file takes a hexadecimal CPU bitmask (bit n selects CPU n), so echo 2 > /proc/irq/42/smp_affinity pins IRQ 42 to CPU 1, and echo 4 pins it to CPU 2 (where 42 is the IRQ number).
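For deployment tooling, the same write can be scripted; a sketch in C++ where IRQ 42 is the placeholder from above and the caller needs root:

    #include <cstdio>

    // smp_affinity is a hex CPU bitmask: bit n selects CPU n.
    static bool set_irq_affinity(int irq, unsigned long cpu_mask) {
        char path[64];
        std::snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        FILE* f = std::fopen(path, "w");
        if (!f) return false;
        std::fprintf(f, "%lx", cpu_mask);
        return std::fclose(f) == 0;  // write errors surface at close
    }

    // set_irq_affinity(42, 1UL << 3);  // pin IRQ 42 to CPU 3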

What is hardware timestamping and why is it more accurate?

Hardware timestamping uses the NIC's internal clock to stamp packets at nanosecond precision, directly in hardware. A software timestamp is applied only after the packet has climbed through the driver and kernel stack, so when the system is busy it inherits microseconds of queuing and scheduling jitter. Hardware timestamps have <50ns error vs 1-10µs error for software timestamps. This is essential for proving execution time to regulators, detecting MEV manipulation, and accurate latency measurement. Check support with: ethtool -T eth0
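On Linux, hardware timestamps are requested per socket via SO_TIMESTAMPING; a sketch, assuming a NIC that ethtool -T reports as capable (drivers typically also need a one-time SIOCSHWTSTAMP ioctl on the interface to switch stamping on):

    #include <linux/net_tstamp.h>
    #include <sys/socket.h>

    // Ask for RX timestamps taken in the NIC and reported from its raw clock.
    static int enable_hw_rx_timestamps(int fd) {
        int flags = SOF_TIMESTAMPING_RX_HARDWARE
                  | SOF_TIMESTAMPING_RAW_HARDWARE;
        return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
                          &flags, sizeof(flags));
    }
    // Timestamps then arrive as SCM_TIMESTAMPING control messages via recvmsg().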

What is NUMA and how does it affect trading systems?

NUMA (Non-Uniform Memory Access) means that on servers with 2+ CPU sockets, each socket has its own local memory. Accessing local memory takes ~70ns, but reaching memory attached to another socket takes ~130ns (the cross-socket penalty). If your trading thread on Socket 0 reads order book data from Socket 1's memory, you pay an extra ~60ns per access; at billions of accesses per day, this adds up to significant latency. Fix: use numactl --membind=0 --cpunodebind=0 to pin the process and its memory to the same socket.
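The same pinning can be done programmatically with libnuma (link with -lnuma); a sketch assuming a two-socket box where the NIC and trading thread both live on node 0:

    #include <numa.h>
    #include <cstddef>

    // Keep both execution and the order book's memory on NUMA node 0.
    static void* numa_local_setup(size_t book_bytes) {
        if (numa_available() < 0) return nullptr;   // no NUMA support
        numa_run_on_node(0);                        // restrict CPUs to node 0
        return numa_alloc_onnode(book_bytes, 0);    // allocate on node 0's RAM
    }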

Architecture Patterns

Why is single-threaded design better for low-latency trading?

A single-threaded hot path eliminates mutex contention, which is what causes unpredictable tail latency. An uncontended mutex acquisition takes ~20ns, but a contended one can spike to 10,000-100,000ns (10-100µs), destroying P99 latency. The pattern: run order handling in a single-threaded event loop and communicate with other threads via lock-free SPSC (Single-Producer Single-Consumer) queues. This is how LMAX's Disruptor-based architecture sustained millions of operations per second with predictable latency.
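A minimal sketch of such an SPSC ring buffer, assuming exactly one producer thread, exactly one consumer thread, and a power-of-two capacity:

    #include <atomic>
    #include <cstddef>

    // Lock-free SPSC ring: push() from one thread, pop() from one other.
    template <typename T, size_t N>
    class SpscQueue {
        static_assert((N & (N - 1)) == 0, "N must be a power of two");
        T buf_[N];
        std::atomic<size_t> head_{0};  // advanced by the consumer
        std::atomic<size_t> tail_{0};  // advanced by the producer
    public:
        bool push(const T& v) {
            size_t t = tail_.load(std::memory_order_relaxed);
            if (t - head_.load(std::memory_order_acquire) == N)
                return false;                       // full
            buf_[t & (N - 1)] = v;
            tail_.store(t + 1, std::memory_order_release);
            return true;
        }
        bool pop(T& out) {
            size_t h = head_.load(std::memory_order_relaxed);
            if (h == tail_.load(std::memory_order_acquire))
                return false;                       // empty
            out = buf_[h & (N - 1)];
            head_.store(h + 1, std::memory_order_release);
            return true;
        }
    };

Production versions also pad head_ and tail_ onto separate cache lines to avoid false sharing; that detail is omitted here for brevity.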

What is pre-allocation and why should trading systems avoid malloc?

malloc() and new are not constant-time operations: they search free lists, may call mmap(), and can trigger page faults. A typical malloc takes ~50ns, but P99 can spike to 100,000ns when memory is fragmented. The fix: pre-allocate everything at startup using object pools or arena allocators, so the hot path never calls malloc. Pattern: allocate one large buffer at startup and slice from it as needed, as sketched below.
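A toy arena allocator illustrating the idea: one allocation at startup, pointer-bump allocation on the hot path, O(1) reset instead of per-object frees:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class Arena {
        std::vector<uint8_t> buf_;
        size_t off_ = 0;
    public:
        explicit Arena(size_t bytes) : buf_(bytes) {}  // heap touched once, here
        // align must be a power of two.
        void* alloc(size_t n, size_t align = 8) {
            size_t p = (off_ + align - 1) & ~(align - 1);
            if (p + n > buf_.size()) return nullptr;   // out of arena: fail loudly
            off_ = p + n;
            return buf_.data() + p;
        }
        void reset() { off_ = 0; }                     // O(1), no per-object free
    };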

What latency can I expect from colocation vs cloud?

Colocated bare-metal (3m from the exchange matcher): 300-800 nanoseconds round-trip. Optimized cloud (AWS Local Zones): 18-45 microseconds. Standard cloud: 2-20 milliseconds. The physics is simple: light in fiber travels ~200km per millisecond (about two-thirds of its vacuum speed). If you're 5km from the exchange, that's ~25µs each way, an absolute floor of 50µs round-trip before your code runs at all. Top HFT firms pay $100k+/month for colo space within 100 meters of matching engines. Anything above 100µs is not competitive HFT; it is regular trading.

DeFi & Web3

What is MEV and how can I protect against it?

MEV (Maximal Extractable Value) is profit extracted by reordering, inserting, or censoring transactions. Common attacks: frontrunning (seeing your pending transaction and trading ahead of it), sandwiching (buying before you, then selling right after), and backrunning. Protection methods: use private mempools (e.g., Flashbots Protect) so your transaction never sits in the public mempool, set a tight slippage tolerance (max 0.5% for large orders), set a short deadline so the swap reverts if it isn't mined promptly, and use MEV-aware DEX aggregators that route through protected relays.
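As a concrete illustration of slippage tolerance, a hypothetical helper that converts an aggregator quote into the minimum-output floor submitted on-chain (the name and basis-point convention are mine, not any particular router's API):

    #include <cstdint>

    // tolerance_bps = 50 means 0.5%: a quote of 1,000,000 units becomes a
    // floor of 995,000; anything worse reverts on-chain. Assumes the
    // product quoted_out * tolerance_bps fits in 64 bits.
    static uint64_t min_amount_out(uint64_t quoted_out, uint32_t tolerance_bps) {
        return quoted_out - (quoted_out * tolerance_bps) / 10000;
    }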

How do DEX aggregators find optimal swap routes?

DEX aggregators model all liquidity pools as a weighted directed graph where edges represent swap paths weighted by price impact. They run a modified Dijkstra's algorithm to find multi-hop routes (USDC→ETH→WBTC can be cheaper than USDC→WBTC) and dynamic programming to optimize split routes across multiple pools. For a $1M order, splitting across 3 pools might reduce slippage from 2.5% to 0.8%. The best aggregators also factor in gas costs: more hops means higher gas, so one hop is sometimes optimal for small orders despite a worse quoted price.
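A worked example of why splitting helps, under an assumed fee-free constant-product (x·y = k) model with hypothetical reserves: routing $1M through one pool versus splitting it 2:1 across a deep pool and one half its size:

    #include <cstdio>

    // Output for input dx against constant-product reserves (x, y), fees ignored.
    static double cp_out(double x, double y, double dx) {
        return y * dx / (x + dx);
    }

    int main() {
        double full  = cp_out(50e6, 50e6, 1e6);     // whole $1M into pool A
        double split = cp_out(50e6, 50e6, 0.67e6)   // 2/3 into deep pool A
                     + cp_out(25e6, 25e6, 0.33e6);  // 1/3 into shallow pool B
        std::printf("one pool: %.0f  split: %.0f\n", full, split);
    }

With these made-up reserves, the single route returns ~980,392 units (~1.96% price impact) while the split returns ~986,841 (~1.32%), because splitting in proportion to pool depth equalizes marginal price impact across pools.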

What are bitmap storage patterns in Solidity and how do they save gas?

Bitmap storage packs 256 boolean values into a single uint256 storage slot. A standard mapping(address => bool) costs 20k gas per write because each entry occupies a full storage slot. With bitmaps, you divide a key's index into a bucket (which storage slot) and a position (which bit), then use bitwise operations to flip one bit in an already-nonzero slot: ~5k gas per write. For a voting contract processing 10,000 votes, this saves roughly 150 million gas (7.5 ETH at 50 gwei). Uniswap V3 uses this pattern for tick initialization.
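The bucket/position arithmetic, sketched here in C++ with 64-bit words for readability; Solidity's version is identical with 256-bit words (index >> 8 selects the slot, index & 0xff selects the bit):

    #include <cstdint>

    static uint64_t bitmap[1024];  // each word packs 64 flags

    static void set_flag(uint32_t index) {
        bitmap[index >> 6] |= uint64_t{1} << (index & 63);  // bucket, then bit
    }

    static bool get_flag(uint32_t index) {
        return (bitmap[index >> 6] >> (index & 63)) & 1;
    }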

Want to dive deeper? Check out the full technical articles.