Crypto Exchange Architecture on AWS

How to build crypto exchange infrastructure on AWS: VPC design, security groups, HSM integration, and disaster recovery.

Intermediate · 25 min read

🎯 What You'll Learn

  • Design a secure VPC for exchange infrastructure
  • Implement proper security group rules
  • Integrate AWS HSM for key management
  • Plan for disaster recovery and failover

Why AWS for Crypto Exchanges?

Despite latency disadvantages compared with colocated bare-metal servers, many crypto exchanges use AWS because:

  • Fast iteration - Go live in days, not months
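  • Security certifications - AWS infrastructure is covered by SOC 2 reports and ISO 27001 certification, a head start for your own audits (your exchange is not certified automatically)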
  • Global presence - Regions near major crypto markets
  • Managed services - Less operational burden

This lesson covers architecture patterns for exchange infrastructure on AWS.


What You’ll Learn

By the end of this lesson, you’ll understand:

  1. VPC architecture - Network isolation and segmentation
  2. Security groups - Principle of least privilege
  3. Key management - HSM integration for crypto operations
  4. Disaster recovery - Multi-region failover strategies

The Foundation: VPC Design

A proper exchange VPC has multiple layers:

┌─────────────────────────────────────────────────────────────┐
│                         VPC (10.0.0.0/16)                   │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Public Subnet (10.0.1.0/24)                            │ │
│  │   [ALB] [NAT Gateway] [Bastion]                        │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Private Subnet - App (10.0.2.0/24)                     │ │
│  │   [API Servers] [Matching Engine] [Order Manager]      │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Private Subnet - Data (10.0.3.0/24)                    │ │
│  │   [RDS] [ElastiCache] [DocumentDB]                     │ │
│  └────────────────────────────────────────────────────────┘ │
│                              │                               │
│  ┌────────────────────────────────────────────────────────┐ │
│  │ Private Subnet - HSM (10.0.4.0/24)                     │ │
│  │   [CloudHSM] [Key Management]                          │ │
│  └────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
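
Before turning a subnet plan like this into infrastructure, it helps to check it mechanically. The sketch below uses Python's standard ipaddress module to confirm that every subnet from the diagram sits inside the VPC range and that no two subnets overlap. The function name validate_plan is made up for this example.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Subnet plan copied from the diagram above.
SUBNETS = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "app": ipaddress.ip_network("10.0.2.0/24"),
    "data": ipaddress.ip_network("10.0.3.0/24"),
    "hsm": ipaddress.ip_network("10.0.4.0/24"),
}


def validate_plan(vpc, subnets):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    names = sorted(subnets)
    # Every subnet must fall inside the VPC range.
    for name in names:
        if not subnets[name].subnet_of(vpc):
            problems.append(f"{name} ({subnets[name]}) is outside {vpc}")
    # No two subnets may share addresses.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    print(validate_plan(VPC_CIDR, SUBNETS) or "subnet plan is consistent")
```

A check like this is cheap to run in CI whenever the CIDR plan changes.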

The “Aha!” Moment

Here’s what separates secure exchanges from hacked ones:

The matching engine should NEVER be directly accessible from the internet. All external traffic enters through the load balancer in the public subnet and reaches the API servers from there. The matching engine lives in a private subnet with NO inbound rules except from the API layer.

Network segmentation is your first defense.
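
One way to make this rule testable is to write the allowed flows down as data and assert on them. The sketch below is a toy model, not an AWS API: the tier names and the ALLOWED_FLOWS set are illustrative assumptions that mirror the layering described above.

```python
from collections import deque

# Allowed flows as (source, destination) pairs.
# Tier names are illustrative labels, not AWS resource names.
ALLOWED_FLOWS = {
    ("internet", "alb"),
    ("alb", "api"),
    ("api", "matching"),
    ("matching", "database"),
}


def directly_exposed(flows, source="internet"):
    """Tiers that accept connections straight from `source`."""
    return {dst for src, dst in flows if src == source}


def hops_from(flows, source, target):
    """Fewest hops from source to target, or None if unreachable (breadth-first search)."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for src, dst in flows:
            if src == node and dst not in seen:
                seen.add(dst)
                queue.append((dst, dist + 1))
    return None


if __name__ == "__main__":
    print("directly exposed:", directly_exposed(ALLOWED_FLOWS))
    print("hops to matching engine:", hops_from(ALLOWED_FLOWS, "internet", "matching"))
```

If someone later adds a flow from the internet straight to the matching engine, directly_exposed reports it immediately.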


Let’s See It In Action: Terraform VPC

# exchange-vpc.tf
resource "aws_vpc" "exchange" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name        = "exchange-vpc"
    Environment = "production"
  }
}

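# Public subnet for ALB, NAT, Bastion.
# Single AZ shown for brevity: an ALB needs subnets in at least two AZs,
# so production should repeat every tier in a second AZ.
# The internet gateway and route tables that make this subnet "public"
# are omitted here.
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.exchange.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true
}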

# Private subnet for application servers
resource "aws_subnet" "app" {
  vpc_id            = aws_vpc.exchange.id
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1a"
}

# Private subnet for databases
resource "aws_subnet" "data" {
  vpc_id            = aws_vpc.exchange.id
  cidr_block        = "10.0.3.0/24"
  availability_zone = "us-east-1a"
}

# Private subnet for HSM
resource "aws_subnet" "hsm" {
  vpc_id            = aws_vpc.exchange.id
  cidr_block        = "10.0.4.0/24"
  availability_zone = "us-east-1a"
}

Security Groups: Least Privilege

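# Terraform removes the default allow-all egress rule from every security
# group it manages, so each tier below declares the egress it needs.

# ALB security group - only public entry point
resource "aws_security_group" "alb" {
  name        = "exchange-alb"
  description = "Allow HTTPS from internet"
  vpc_id      = aws_vpc.exchange.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Forward only to the API port in the app subnet
  egress {
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = [aws_subnet.app.cidr_block]
  }
}

# App server - only from ALB
resource "aws_security_group" "app" {
  name        = "exchange-app"
  description = "API servers"
  vpc_id      = aws_vpc.exchange.id

  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]  # Only ALB!
  }

  # Without this rule the API servers could not reach the matching engine
  egress {
    from_port       = 9000
    to_port         = 9000
    protocol        = "tcp"
    security_groups = [aws_security_group.matching.id]
  }
}

# Matching engine - only from app servers.
# Its rules are standalone resources: inline rules here would reference
# the app and rds groups, which reference this group back, and Terraform
# rejects that as a dependency cycle.
resource "aws_security_group" "matching" {
  name        = "exchange-matching"
  description = "Matching engine - no internet access"
  vpc_id      = aws_vpc.exchange.id
}

resource "aws_security_group_rule" "matching_from_app" {
  type                     = "ingress"
  from_port                = 9000
  to_port                  = 9000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.matching.id
  source_security_group_id = aws_security_group.app.id  # Only app servers!
}

# No egress to internet
resource "aws_security_group_rule" "matching_to_rds" {
  type                     = "egress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.matching.id
  source_security_group_id = aws_security_group.rds.id  # Only DB
}

# Database - only from the matching engine
resource "aws_security_group" "rds" {
  name        = "exchange-rds"
  description = "PostgreSQL - matching engine only"
  vpc_id      = aws_vpc.exchange.id

  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.matching.id]
  }
}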

CloudHSM for Key Management

Crypto exchanges need an HSM for:

  • Hot wallet signing keys
  • API key encryption
  • User password hashing

# Simplified CloudHSM integration
import os

import boto3
# Assumed to be an in-house PKCS#11 wrapper, not a published package
from cloudhsm_pkcs11 import sign_transaction

class SecureWallet:
    def __init__(self, hsm_cluster_id: str):
        # The cloudhsmv2 API manages clusters (describe, backup); signing
        # itself goes through the PKCS#11 library on the client host
        self.client = boto3.client('cloudhsmv2')
        self.cluster_id = hsm_cluster_id

    def sign_withdrawal(self, transaction: bytes, key_label: str) -> bytes:
        """Sign transaction using key stored in HSM."""
        # Key never leaves the HSM
        signature = sign_transaction(
            pkcs11_library="/opt/cloudhsm/lib/libcloudhsm_pkcs11.so",
            # CloudHSM PKCS#11 PINs take the form "<crypto user>:<password>"
            pin=os.environ['HSM_PIN'],
            key_label=key_label,
            data=transaction
        )
        return signature

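Cost: CloudHSM is billed per HSM-hour, and HA needs at least 2 HSMs in different AZs. Budget on the order of a few thousand dollars per month for a cluster, and confirm current pricing for your region.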
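
Because application code only ever handles key labels, you can put the HSM behind a small interface and test everything around it without hardware. The sketch below is one possible shape, not CloudHSM's API: Signer, FakeSigner, and WithdrawalService are hypothetical names, and HMAC-SHA256 is only a stand-in for whatever signature algorithm your chain requires.

```python
import hashlib
import hmac
from typing import Protocol


class Signer(Protocol):
    """Anything that can sign bytes with a key identified only by its label."""

    def sign(self, key_label: str, data: bytes) -> bytes: ...


class FakeSigner:
    """Test double. HMAC-SHA256 stands in for the HSM's signature algorithm."""

    def __init__(self, keys: dict[str, bytes]):
        self._keys = keys  # held privately, like key material inside an HSM

    def sign(self, key_label: str, data: bytes) -> bytes:
        return hmac.new(self._keys[key_label], data, hashlib.sha256).digest()


class WithdrawalService:
    """Application code: it holds key labels, never key bytes."""

    def __init__(self, signer: Signer, key_label: str):
        self._signer = signer
        self._key_label = key_label

    def sign_withdrawal(self, transaction: bytes) -> bytes:
        if not transaction:
            raise ValueError("refusing to sign an empty transaction")
        return self._signer.sign(self._key_label, transaction)


if __name__ == "__main__":
    service = WithdrawalService(FakeSigner({"hot-wallet": b"test-only-key"}), "hot-wallet")
    print(service.sign_withdrawal(b"example-transaction").hex())
```

In production you would swap FakeSigner for a class that calls the PKCS#11 library, and nothing in WithdrawalService would change.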


Common Misconceptions

Myth: “Security groups are like firewalls: set once and forget.”
Reality: Security groups should be audited monthly. Developers add rules for debugging and forget to remove them. Use AWS Config to detect violations.

Myth: “Multi-AZ RDS is enough for disaster recovery.”
Reality: Multi-AZ protects against AZ failure, not region failure. For a crypto exchange, you need cross-region replication and a DR runbook.

Myth: “AWS manages security, so I don’t need to.”
Reality: AWS secures the infrastructure; you secure the configuration. Many cloud breaches trace back to misconfiguration, such as public S3 buckets or overly permissive security groups.
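
The security group audit from the first myth is easy to start automating. The sketch below is a minimal example that flags ingress rules open to the whole internet on anything other than an allowed port. It expects input shaped like the "SecurityGroups" list returned by the EC2 DescribeSecurityGroups API. The allowed-port set and the function name are assumptions for illustration.

```python
ALLOWED_PUBLIC_PORTS = {443}  # assumption: only HTTPS may be open to the world


def world_open_rules(security_groups, allowed_ports=ALLOWED_PUBLIC_PORTS):
    """Return (group_id, from_port, to_port) for each ingress rule open to any address."""
    findings = []
    for group in security_groups:
        for perm in group.get("IpPermissions", []):
            cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
            cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
            if "0.0.0.0/0" not in cidrs and "::/0" not in cidrs:
                continue  # not open to the world
            # Rules with protocol "-1" (all traffic) carry no port fields.
            from_port = perm.get("FromPort")
            to_port = perm.get("ToPort")
            if from_port is not None and from_port == to_port and from_port in allowed_ports:
                continue  # a single explicitly allowed port
            findings.append((group["GroupId"], from_port, to_port))
    return findings


if __name__ == "__main__":
    sample = [
        {"GroupId": "sg-alb", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
        {"GroupId": "sg-debug", "IpPermissions": [
            {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
             "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    ]
    print(world_open_rules(sample))
```

To run it against a real account, pass in boto3.client("ec2").describe_security_groups()["SecurityGroups"] (pagination omitted here). AWS Config managed rules cover similar ground continuously.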


Disaster Recovery Strategy

Tier               RTO        RPO        Strategy                       Cost
Backup & Restore   Hours      Hours      S3 cross-region                $
Pilot Light        Minutes    Seconds    Standby DB in DR region        $$
Warm Standby       Seconds    Seconds    Scaled-down DR region          $$$
Active-Active      ~0         ~0         Full production in 2 regions   $$$$

For an exchange, treat Warm Standby as the minimum. Serious operations should run Active-Active.
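
Choosing a tier amounts to picking the cheapest row whose RTO and RPO both fit your targets. The sketch below encodes the table's coarse windows as an ordered enum and walks the tiers in cost order. The Window enum and the cheapest_tier function are illustrative names, and the windows are the table's rough categories, not measured values.

```python
from enum import IntEnum


class Window(IntEnum):
    """Coarse recovery windows from the table, ordered from tightest to loosest."""
    ZERO = 0      # effectively zero
    SECONDS = 1
    MINUTES = 2
    HOURS = 3


# (tier, RTO, RPO) in ascending cost order, as listed in the table above.
TIERS = [
    ("Backup & Restore", Window.HOURS, Window.HOURS),
    ("Pilot Light", Window.MINUTES, Window.SECONDS),
    ("Warm Standby", Window.SECONDS, Window.SECONDS),
    ("Active-Active", Window.ZERO, Window.ZERO),
]


def cheapest_tier(max_rto, max_rpo):
    """Return the first (cheapest) tier whose RTO and RPO fit within the targets."""
    for name, rto, rpo in TIERS:
        if rto <= max_rto and rpo <= max_rpo:
            return name
    return None


if __name__ == "__main__":
    # An exchange that can tolerate seconds of downtime and seconds of data loss:
    print(cheapest_tier(Window.SECONDS, Window.SECONDS))
```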


High-Availability Architecture

Region: us-east-1                    Region: us-west-2 (DR)
┌─────────────────────┐              ┌─────────────────────┐
│ [ALB] ─── [API]     │              │ [ALB] ─── [API]     │
│      ├── [Match]    │    ────────▶ │      ├── [Match]    │
│      └── [RDS-Pri]  │   Replication│      └── [RDS-Read] │
└─────────────────────┘              └─────────────────────┘
         │                                     │
         └──────── Route 53 Health Checks ─────┘
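
The failover decision behind the health checks in this diagram can be reasoned about as a small state machine. The sketch below is not Route 53's implementation. It is a toy model that assumes a threshold of three consecutive failed checks and a manual failback, both of which are design choices you would tune.

```python
class FailoverMonitor:
    """Switch to the DR region after `threshold` consecutive failed checks."""

    def __init__(self, primary="us-east-1", standby="us-west-2", threshold=3):
        self.primary = primary
        self.standby = standby
        self.threshold = threshold
        self.active = primary
        self._consecutive_failures = 0

    def record(self, primary_healthy: bool) -> str:
        """Feed in one health-check result; return the region that should serve traffic."""
        if primary_healthy:
            self._consecutive_failures = 0
        else:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.threshold:
                self.active = self.standby
        # Failback is deliberately manual: a recovered primary does not
        # reclaim traffic until an operator calls fail_back().
        return self.active

    def fail_back(self) -> None:
        """Operator action: return traffic to the primary region."""
        self._consecutive_failures = 0
        self.active = self.primary


if __name__ == "__main__":
    monitor = FailoverMonitor()
    for healthy in (True, False, False, False):
        print(healthy, "->", monitor.record(healthy))
```

Requiring consecutive failures keeps a single dropped probe from triggering a region switch. Keeping failback manual avoids flapping while the primary database is still catching up.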

Practice Exercises

Exercise 1: Draw Your VPC

Create a VPC diagram for your exchange:
- How many subnets?
- What goes in each?
- Which can reach the internet?

Exercise 2: Security Group Audit

For each security group, answer:
1. Who/what can initiate connections?
2. To what ports?
3. Why?

If you can't answer "why," the rule shouldn't exist.

Exercise 3: DR Runbook

Write steps for region failover:
1. How do you detect the outage?
2. How do you fail over DNS?
3. How do you promote the DR database?
4. How do you fail back?

Key Takeaways

  1. Network segmentation is fundamental - Public, app, data, HSM subnets
  2. Security groups = least privilege - Only what’s needed, nothing more
  3. HSM for critical keys - Signing keys never leave hardware
  4. Plan for failure - DR strategy before you need it

What’s Next?

🎯 Continue learning: Security Architecture for Trading

🔬 Expert version: Building a Crypto Exchange on AWS

Now you can architect secure exchange infrastructure. 🏗️
