Load Balancing: The Physics of Queues
Why adding servers doesn't always make things faster. Little's Law, the Thundering Herd, and Layer 7 Traffic Shaping.
🎯 What You'll Learn
- Apply Little's Law ($L = \lambda W$) to system capacity
- Differentiate L4 (Packet) vs L7 (Request) Load Balancing
- Mitigate the 'Thundering Herd' problem
- Configure Nginx Upstream blocks
- Analyze Sticky Sessions vs Stateless Routing
📚 Prerequisites
Before this lesson, you should understand:
Introduction
A Load Balancer is not just a traffic cop. It is a Queue Manager.
Every server is a queue.
- The CPU has a Run Queue.
- The Network Card has a Ring Buffer.
- The Database has a Lock Queue.
If you understand Queueing Theory, you understand Load Balancing. If you don’t, you add servers until you go bankrupt.
The Physics: Little’s Law
The fundamental law of system capacity is:
$L = \lambda W$
- $L$: Average number of items in the system (Queue Length).
- $\lambda$: Average arrival rate (Requests per Second).
- $W$: Average wait time (Latency).
The Insight: If Latency ($W$) doubles (e.g., the Database slows down), then Queue Length ($L$) doubles, even if traffic ($\lambda$) stays constant. Your Load Balancer's job is to detect this and stop sending requests to the slow server before it crashes.
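To make this concrete, here is a back-of-the-envelope check in Python (toy numbers chosen for illustration, not a benchmark):

```python
# Little's Law: L = lambda * W
arrival_rate = 200          # lambda: requests per second
latency = 0.05              # W: average seconds per request

print(arrival_rate * latency)        # L = 10 requests in flight

# The database slows down: W doubles while traffic stays constant.
print(arrival_rate * (latency * 2))  # L = 20, the queue doubles too
```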
L4 vs. L7: The Layers of Traffic
How deep does the Load Balancer look?
Layer 4 (Transport)
- What it sees: IP + Port. “Packet from 1.2.3.4 to 5.6.7.8”.
- Action: Forwards packets. Fast. Dumb.
- Example: LVS, Maglev.
Layer 7 (Application)
- What it sees: HTTP Headers, Cookies, URL. “GET /api/user?id=5”.
- Action: Terminates TCP, reads request, opens new TCP to backend. Smart. Slow.
- Example: Nginx, HAProxy, AWS ALB.
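The difference is easiest to see in code. Below is a minimal sketch in Python, not a real proxy; the backend pool and the `/api/` routing rule are invented for illustration:

```python
import hashlib

BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical pool

def l4_pick(src_ip: str, src_port: int) -> str:
    """L4: route on the connection tuple alone. No parsing, no buffering."""
    key = f"{src_ip}:{src_port}".encode()
    return BACKENDS[int(hashlib.md5(key).hexdigest(), 16) % len(BACKENDS)]

def l7_pick(raw_request: bytes) -> str:
    """L7: read and parse the request line, then route on the URL.
    Smarter decisions, but strictly more work per request."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode()
    method, path, version = request_line.split(" ")
    return BACKENDS[0] if path.startswith("/api/") else BACKENDS[1]

print(l4_pick("1.2.3.4", 54321))
print(l7_pick(b"GET /api/user?id=5 HTTP/1.1\r\nHost: example.com\r\n\r\n"))
```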
Deep Dive: The Thundering Herd
Imagine 10,000 users are waiting for a cache entry. The entry expires. Suddenly, 10,000 requests hit the Backend DB simultaneously. The DB crashes. The LB retries. The DB stays dead.
Solution: Shepherd the Herd.
- Request Coalescing: The LB holds 9,999 requests, sends one to the backend, and serves the result to all 10,000 (sketched below).
- Jitter: Process requests with slight random delays to desynchronize spikes.
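Here is a minimal request-coalescing sketch using only Python's standard library; the `fetch` callable stands in for your real backend call:

```python
import threading
from concurrent.futures import Future

class Coalescer:
    """Collapse concurrent identical requests into a single backend call."""

    def __init__(self, fetch):
        self.fetch = fetch            # the expensive call (e.g., a DB query)
        self.in_flight = {}           # key -> Future shared by all waiters
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            fut = self.in_flight.get(key)
            leader = fut is None
            if leader:
                fut = self.in_flight[key] = Future()
        if leader:
            try:
                fut.set_result(self.fetch(key))   # one request reaches the DB
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self.lock:
                    del self.in_flight[key]
        return fut.result()           # the other 9,999 callers share this result
```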
Code: Weighted Round Robin
A simple Round Robin is often not enough. You need Weights.
```python
class WeightedRR:
    def __init__(self, servers):
        # servers = {"srv1": 5, "srv2": 1, "srv3": 1}
        self.servers = servers
        self.state = {k: 0 for k in servers}  # current weight per server

    def get_server(self):
        # Smooth weighted round robin (a simplified Nginx-like algorithm):
        # add each server's configured weight to its current weight, pick
        # the server with the highest current weight, then penalize the
        # winner by the total weight so it cannot win every round.
        best = None
        total_weight = 0
        for srv, weight in self.servers.items():
            self.state[srv] += weight
            total_weight += weight
            if best is None or self.state[srv] > self.state[best]:
                best = srv
        self.state[best] -= total_weight
        return best

# Physics: this ensures a "smooth" distribution, not a "bursty" one.
# srv1 doesn't get 5 requests in a row; its picks are interleaved.
```
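For example, seven consecutive picks with the weights above come out interleaved:

```python
rr = WeightedRR({"srv1": 5, "srv2": 1, "srv3": 1})
print([rr.get_server() for _ in range(7)])
# ['srv1', 'srv1', 'srv2', 'srv1', 'srv3', 'srv1', 'srv1']
```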
Practice Exercises
Exercise 1: Capacity Planning (Beginner)
Scenario:
- You process 1000 Req/Sec ($\lambda$).
- Avg Latency is 0.5 Sec ($W$).
Task: According to Little's Law, how many concurrent connections ($L$) must your server support?
Exercise 2: Nginx Config (Intermediate)
Task: Configure an Nginx upstream block that:
- Load balances 3 servers.
- Sends 2x traffic to srv_heavy.
- Marks a server "down" if it fails 3 times.
Exercise 3: Sticky Sessions (Advanced)
Scenario: A user logs in on Server A. Their session is in Server A’s RAM.
Task: Why does Round Robin break this? How does ip_hash fix it? What is the downside of ip_hash during a DDoS attack?
Knowledge Check
- What happens to system capacity if latency increases?
- Why is L7 Load Balancing slower than L4?
- What is the “Thundering Herd” problem?
- Why do we need health checks?
- Does adding more servers always fix high latency?
Answers
- Capacity drops (or, at constant traffic, Queue Length explodes). $L = \lambda W$: if $W$ goes up, $L$ goes up.
- Per-request overhead. L7 requires terminating the TCP connection, buffering the request, parsing headers, and opening a new connection to the backend. L4 just rewrites packet headers and forwards.
- Synchronized demand. When a hot cache entry expires, every concurrent request misses and hits the DB at once.
- To avoid Black Holes. Sending traffic to a dead server results in 100% error rates for those requests (see the sketch after this list).
- No. If the bottleneck is the Database (shared resource), adding more Web Servers just increases the queue pressure on the DB.
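To tie answer 4 to code: a minimal passive health-check tracker in Python (the threshold and names are illustrative, echoing Nginx's max_fails idea):

```python
class HealthTracker:
    """Passive health check: eject a backend after max_fails consecutive errors."""

    def __init__(self, servers, max_fails=3):
        self.max_fails = max_fails
        self.fails = {srv: 0 for srv in servers}

    def healthy(self):
        # Only route to servers below the failure threshold: no black holes.
        return [s for s, n in self.fails.items() if n < self.max_fails]

    def report(self, server, ok):
        # A success resets the counter; a failure increments it.
        self.fails[server] = 0 if ok else self.fails[server] + 1
```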
Summary
- Little's Law: At a fixed concurrency limit, rising latency kills throughput.
- Algorithms: Use Weighted Round Robin for heterogeneity.
- Layers: L4 for speed, L7 for intelligence.