The Physics of Processes: Life, Death, and Zombies
Why fork() is fast (COW), why Context Switching is slow (TLB Flush), and how the Kernel manages the illusion of multitasking.
🎯 What You'll Learn
- Deconstruct the `task_struct` (Process Control Block)
- Analyze Copy-On-Write (COW) memory physics during `fork()`
- Trace a Context Switch (Register Save + TLB Flush)
- Debug Zombie Processes (`Z+`) using `pstree`
- Differentiate `exec()` system calls
📚 Prerequisites
Before this lesson, you should understand:
Introduction
To a user, a Process is a window.
To the Kernel, a Process is a data structure (struct task_struct) containing:
- files: Open file descriptors.
- mm: Memory pointers.
- thread_info: Low-level, architecture-specific thread state and flags.
When you run htop, it reads /proc, and the kernel fills in those files by walking its list of these structures.
The Physics: Fork and Copy-On-Write (COW)
In early Unix, fork() was slow because it copied all memory.
In Linux, fork() is fast: its cost scales with the size of the page tables, not with the amount of RAM in use.
The Trick: The Kernel copies the Page Tables, not the RAM. Both Parent and Child point to the same physical pages, which are marked “Read-Only”.
The Event: When either process (say, the Child) tries to write to a variable:
- CPU traps (Page Fault).
- Kernel pauses the writing process.
- Kernel copies only that specific page (typically 4KB) to a new location.
- Kernel maps the copy “Read-Write” for the writer; the other process keeps the original page.
- Child resumes.
Physics: This is why creating 1000 processes (e.g., Nginx workers) consumes little extra RAM until they start diverging. Each one needs only its own kernel bookkeeping (task_struct, kernel stack, page tables).
Deep Dive: Context Switching Overhead
Multitasking is an illusion. Each CPU core runs one thing at a time. Switching from Process A to Process B typically costs on the order of 1-5 microseconds in direct overhead. Why?
- Register Swap: Save A’s registers to RAM, load B’s. (Fast).
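- Cache Pollution: L1/L2 caches contain A’s data, so B starts with a run of cache misses. (Slow).
- TLB Flush: The CPU discards A’s cached virtual-to-physical address mappings, so B must walk the page tables again. Address-space tags (PCID on x86, ASID on ARM) can reduce this cost. (Very Slow).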
Impact: This is why “High Frequency Trading” avoids context switches like the plague (using CPU Pinning).
Code: The Zombie Maker
A Zombie (Z) is not a process. It is an entry in the process table.
It is the Kernel saying: “This child died, and its parent hasn’t asked ‘How?’ yet.”
```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (pid > 0) {
        // Parent: sleeps forever, never calling wait().
        // The child will die, but stay a zombie until the parent exits.
        while (1) sleep(1);
    } else {
        // Child: exits immediately.
        exit(0);
    }
    return 0;
}
```
Observation:
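Run this, then from another terminal check `ps aux | awk '$8 ~ /^Z/'` (the STAT column shows `Z` or `Z+`).
You cannot kill -9 a Zombie. It is already dead. To clean it up, either the Parent must call wait(), or you must kill the Parent so the Zombie is re-parented to Process 1, which reaps it.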
Practice Exercises
Exercise 1: COW Verification (Beginner)
Scenario: Fork a process. In both parent and child, print the address of the same variable. Task: Is the child's address the same as the parent's? (Yes, the virtual address is identical). Does it point to the same Physical RAM? (Yes, initially).
Exercise 2: PID Exhaustion (Intermediate)
Scenario: sysctl kernel.pid_max is often 32768 by default (many 64-bit distributions raise it, e.g. to 4194304).
Task: Write a fork bomb (safely, in a VM!) that exhausts PIDs. What specific error does the shell return?
Exercise 3: Context Switch Profiling (Advanced)
Task: Use perf stat -e context-switches ./myprogram to measure how many times your program was kicked off the CPU.
Knowledge Check
- Why is `fork()` faster than `malloc()`ing the same amount of memory?
- What actually happens when you Copy-On-Write?
- Why can’t you kill a Zombie process?
- How does CPU Pinning improve performance?
- What does Process 1 (init/systemd) do with orphans?
Answers
- It doesn’t allocate RAM. It only copies metadata (Page Tables).
- Page Fault. The CPU traps into the kernel, which lazily duplicates the 4KB page involved.
- It’s already dead. It has no code, no memory, no CPU time. It is a line in a ledger.
- Avoids migration between cores. By staying on one core, the cache and TLB remain “warm” with relevant data.
- Adoption. It calls `wait()` on them, cleaning up their exit status and removing the zombie entry.
Summary
- Fork: Lazy copying (COW).
- Context Switch: The cost of multitasking (Cache Penalties).
- Zombies: Administrative leftovers, not memory leaks.