The Physics of FPGA: Hardware Acceleration

Why Software is too slow. The physics of Tick-to-Trade, Logic Gates, and Pipeline Determinism.

Beginner • 50 min read

🎯 What You'll Learn

Deconstruct Tick-to-Trade Latency (Wire-to-Wire)
Analyze the von Neumann Bottleneck (why CPUs are slow)
Trace a packet through a Pipelined FPGA Architecture
Calculate the throughput of a 300MHz FPGA core
Audit a Verilog Module for Order Triggering

📚 Prerequisites

Before this lesson, you should understand:

Introduction

In the Nanosecond Economy, the CPU is the bottleneck. A CPU is a “Juggler”: It handles thousands of tasks (OS, Network, Logic) by switching between them very fast. An FPGA is an “Assembly Line”: It does one thing, perfectly, in parallel, with zero interruptions.

This lesson explores why trading firms move logic out of software and into FPGA hardware to save a few microseconds per trade.

The Physics: Tick-to-Trade (T2T)

The Metric: Time from “First Bit of Market Data In” to “First Bit of Order Out”.

Software (C++): ~2-5 microseconds.
FPGA (Hardware): ~40-100 nanoseconds.
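
To put the two figures side by side, here is a minimal arithmetic sketch in Python. The ranges are the approximate ones quoted above; the variable names are illustrative only.

```python
# Approximate tick-to-trade ranges quoted in this lesson, in nanoseconds.
SOFTWARE_NS = (2_000, 5_000)   # ~2-5 microseconds
FPGA_NS = (40, 100)            # ~40-100 nanoseconds

# Smallest gap: fastest software vs. slowest FPGA. Largest gap: the reverse.
min_speedup = SOFTWARE_NS[0] / FPGA_NS[1]
max_speedup = SOFTWARE_NS[1] / FPGA_NS[0]
min_saved_ns = SOFTWARE_NS[0] - FPGA_NS[1]
max_saved_ns = SOFTWARE_NS[1] - FPGA_NS[0]

print(f"Speedup: {min_speedup:.0f}x to {max_speedup:.0f}x")
print(f"Time saved per trade: {min_saved_ns} ns to {max_saved_ns} ns")
```

With these ranges the FPGA path is roughly 20x to 125x faster, which is a saving of about 2 to 5 microseconds per trade.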

Physics: Software pays the “von Neumann Tax” on every packet. On a conventional kernel network stack, the path looks like this:

Interrupt fires.
Context Switch to Kernel Mode.
Copy Packet to RAM.
Context Switch to User Mode.
CPU reads RAM into Cache (Cache Miss?).
CPU executes instruction.

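FPGA pays almost none of this tax. There is no OS, no interrupt, and no instruction fetch: the packet bits move through dedicated logic gates, one clocked stage at a time, like water through a pipe.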

Deep Dive: Pipelining (The Assembly Line)

How does an FPGA process 100 Million Insert messages per second? Pipelining.

The Physics: Imagine a packet takes 100 clock cycles to process.

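CPU (one packet at a time): Must finish Packet A before starting Packet B. Throughput = 1 packet per 100 cycles.
FPGA: Splits the task into 100 stages, each taking one cycle.
- Cycle 1: Stage 1 processes Packet A.
- Cycle 2: Stage 2 processes Packet A. Stage 1 processes Packet B.
- Result: Once the pipeline is full, throughput = 1 packet per cycle. Each packet still takes 100 cycles end to end; the gain comes from 100 packets being in flight at once.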
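
The cycle counts can be checked with a small model. This is a sketch in Python under the idealized assumptions of the example above (every stage takes exactly one cycle, no stalls); the function names are made up for illustration.

```python
def sequential_cycles(num_packets: int, stages: int) -> int:
    """One packet at a time: each packet occupies the unit for `stages` cycles."""
    return num_packets * stages


def pipelined_cycles(num_packets: int, stages: int) -> int:
    """Fill the pipeline once (`stages` cycles), then one packet completes per cycle."""
    if num_packets == 0:
        return 0
    return stages + (num_packets - 1)


STAGES = 100
for n in (1, 100, 1_000_000):
    seq = sequential_cycles(n, STAGES)
    pipe = pipelined_cycles(n, STAGES)
    print(f"{n:>9} packets: sequential={seq} cycles, pipelined={pipe} cycles")
```

For a single packet both designs need 100 cycles, so pipelining does not shorten per-packet latency. For a million packets the pipelined design needs 1,000,099 cycles instead of 100,000,000, which is where the "1 packet per cycle" figure comes from.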

Architecture: Hybrid Systems (Solarflare)

Most firms don’t run the whole strategy on an FPGA. They pair a CPU with a SmartNIC (e.g., Solarflare X3522), where the FPGA sits on the Network Card.

Filtering: FPGA drops 99% of “Noise” packets.
Forwarding: Sends only “Signal” packets to the CPU over PCIe.
Result: CPU load drops, and latency on the critical path improves.
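
The filter-then-forward split can be modeled in a few lines. This is a behavioral sketch in Python, not vendor code: the `Packet` type, the `WATCHLIST`, and the rule "forward only watched symbols" are all hypothetical, and a real card applies its filter in hardware at line rate.

```python
from typing import Iterable, Iterator, NamedTuple


class Packet(NamedTuple):
    symbol: str      # hypothetical field: the instrument this update refers to
    payload: bytes


# Hypothetical watchlist: the only instruments the strategy trades.
WATCHLIST = frozenset({"AAPL", "MSFT"})


def forward_to_cpu(packets: Iterable[Packet]) -> Iterator[Packet]:
    """Model of the on-card filter: yield only packets the CPU needs to see."""
    for pkt in packets:
        if pkt.symbol in WATCHLIST:
            yield pkt  # "Signal": would be sent to the host over PCIe
        # Everything else is "Noise" and is dropped on the card.


feed = [
    Packet("AAPL", b"\x00"),
    Packet("XYZ", b"\x00"),
    Packet("QQQ", b"\x00"),
    Packet("MSFT", b"\x00"),
]
forwarded = list(forward_to_cpu(feed))
print(f"forwarded {len(forwarded)} of {len(feed)} packets")
```

The CPU only ever sees the forwarded packets, so its work scales with the size of the watchlist rather than with the size of the whole feed.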

Code: Verifying an Order (Verilog)

In Software, if (market_price < limit_price) is compiled to assembly instructions that the CPU fetches and executes. In Hardware, the same comparison becomes a physical comparator circuit whose output drives the “Buy” wire.

module OrderTrigger (
    input wire clk,
    input wire [31:0] market_price,
    input wire [31:0] limit_price,
    output reg buy_signal
);

    always @(posedge clk) begin
        // The comparator output is captured on the rising clock edge,
        // so the result is available after 1 cycle (~3.3 ns at 300 MHz)
        if (market_price < limit_price) begin
            buy_signal <= 1'b1;
        end else begin
            buy_signal <= 1'b0;
        end
    end

endmodule
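
When auditing a module like this, it helps to have a software reference model to compare against a simulation. The following is a sketch in Python, assuming prices are encoded as unsigned 32-bit integers (for example, a count of ticks); that encoding and the function name are assumptions, not something the module itself states.

```python
MASK32 = 0xFFFFFFFF  # the Verilog inputs are 32-bit vectors, compared as unsigned


def simulate_order_trigger(samples):
    """Cycle-by-cycle model of OrderTrigger.

    `samples` is a list of (market_price, limit_price) pairs, one per rising
    clock edge. Returns the value of buy_signal after each edge.
    """
    outputs = []
    for market_price, limit_price in samples:
        # Unsigned 32-bit comparison, registered on the clock edge.
        buy_signal = 1 if (market_price & MASK32) < (limit_price & MASK32) else 0
        outputs.append(buy_signal)
    return outputs


print(simulate_order_trigger([(99, 100), (100, 100), (101, 100)]))  # prints [1, 0, 0]
```

One thing the model makes visible: because the comparison is unsigned, a negative price stored in two's complement would look like a very large number and never trigger a buy.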

Practice Exercises

Exercise 1: The Jitter (Beginner)

Scenario: Measure the latency of 1000 orders.
Software: Min 5 µs, Max 50 µs (OS Jitter).
FPGA: Min 80 ns, Max 82 ns (Deterministic).
Lesson: FPGA wins on Consistency (Standard Deviation).
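
A minimal way to quantify the jitter, sketched in Python with the standard `statistics` module. The samples are synthetic, chosen only so that their min and max match the scenario; a real measurement would use all 1000 recorded latencies.

```python
import statistics

# Synthetic samples in nanoseconds, chosen only to match the min/max above.
software_ns = [5_000, 6_200, 5_400, 50_000, 7_100, 5_900]
fpga_ns = [80, 81, 82, 80, 81, 82]

for name, samples in (("software", software_ns), ("fpga", fpga_ns)):
    print(
        f"{name}: min={min(samples)} max={max(samples)} "
        f"mean={statistics.mean(samples):.1f} "
        f"stdev={statistics.pstdev(samples):.1f}"
    )
```

A single 50 µs outlier is enough to push the software standard deviation into the thousands of nanoseconds, while the FPGA samples stay within about one nanosecond of each other.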

Exercise 2: Throughput Math (Intermediate)

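Scenario: FPGA Clock = 300 MHz (about 3.33 ns per cycle).
Pipeline: Can accept 1 packet every cycle.
Throughput: $300,000,000 \text{ packets/sec}$.
Bandwidth: If a packet is 64 bytes: $300M \times 64B = 19.2 \text{ GB/s} \approx 153.6 \text{ Gbit/s}$.
Lesson: That is several times a 25 Gbps line rate, so the pipeline keeps up even when the link is saturated. The network, not the FPGA, is the limit.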
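
The same arithmetic as a short Python sketch. It assumes the 64-byte packet is a minimum-size Ethernet frame and adds the standard 20 bytes of per-frame wire overhead (8-byte preamble plus 12-byte inter-frame gap) to estimate how many such packets a 25 Gbps link can actually deliver; the constant names are illustrative.

```python
CLOCK_HZ = 300_000_000           # 300 MHz, one packet accepted per cycle
PACKET_BYTES = 64                # assumed to be a minimum-size Ethernet frame
LINE_RATE_BPS = 25_000_000_000   # 25 Gbit/s link
WIRE_OVERHEAD_BYTES = 20         # 8-byte preamble/SFD + 12-byte inter-frame gap

cycle_ns = 1e9 / CLOCK_HZ
pipeline_pps = CLOCK_HZ                        # packets per second the pipeline accepts
pipeline_bps = CLOCK_HZ * PACKET_BYTES * 8     # bits per second the pipeline can absorb
line_rate_pps = LINE_RATE_BPS / ((PACKET_BYTES + WIRE_OVERHEAD_BYTES) * 8)

print(f"cycle time:        {cycle_ns:.2f} ns")
print(f"pipeline capacity: {pipeline_pps / 1e6:.0f} Mpps, {pipeline_bps / 1e9:.1f} Gbit/s")
print(f"25G link delivers: {line_rate_pps / 1e6:.1f} Mpps of 64-byte frames")
```

Under these assumptions the link can deliver only about 37 million 64-byte packets per second, well below the 300 million the pipeline can accept.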

Exercise 3: Development Cost (Advanced)

Scenario: A bug is found in the algo logic.
Software: Fix + Compile = 30 seconds.
FPGA: Fix + Synthesis + Place & Route = 4 to 12 hours.
Tradeoff: FPGAs are slow to iterate on. Only use them for logic that rarely changes (e.g., Feed Parsing, Risk Checks).

Knowledge Check

What is the “von Neumann Tax”?
Why is FPGA latency “Deterministic”?
What is Pipelining?
Why is FPGA development slower than C++?
What is a SmartNIC?

Answers

Software Overhead. Interrupts, context switches, packet copies, and fetching instructions and data from RAM (the von Neumann bottleneck) dominate execution time.
No OS. No scheduling, no interrupts. Every clock cycle does exactly the same work.
Parallel Stages. Processing different parts of multiple packets simultaneously.
Synthesis and Place & Route. Mapping code onto a physical circuit layout is a computationally hard optimization problem (NP-Hard) that can take hours per build.
FPGA + NIC. A network card with a programmable chip for offloading tasks.

Summary

Software: Juggling.
FPGA: Assembly Line.
Latency: Nanoseconds.
