Rate Limiting

Rate limiting controls how many requests a client can make in a given time window. It protects systems from abuse, DDoS attacks, and accidental overload while ensuring fair access for all users.

Why Rate Limit?

  • Prevent abuse: Stop malicious users from hammering your API
  • Control costs: Prevent runaway clients from exhausting your resources
  • Ensure fairness: Give all users a fair share of capacity
  • SLA enforcement: Different user tiers get different quotas (free vs. paid)
  • DDoS mitigation: Shed excess load before it reaches your servers

Rate Limiting Algorithms

Token Bucket

Each user has a "bucket" with a maximum capacity of N tokens. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Request cost: 1 token

→ User can burst up to 100 req immediately
→ Then sustains 10 req/sec long-term

Pros: Allows bursting. Smooth, intuitive behavior. Cons: Slightly more complex to implement in a distributed setting.
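The refill-and-consume logic above can be sketched as a small in-memory, single-process class. The injectable clock is only there to make the demo deterministic; a real implementation would use `time.monotonic()` directly:

```python
import time

class TokenBucket:
    """Token bucket: holds at most `capacity` tokens, refilled at `refill_rate` tokens/sec."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity          # start full, so clients can burst immediately
        self.clock = clock
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Lazy refill: add tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Deterministic demo with a fake clock (100-token capacity, 10 tokens/sec):
t = [0.0]
bucket = TokenBucket(capacity=100, refill_rate=10, clock=lambda: t[0])
print(sum(bucket.allow() for _ in range(150)))  # → 100: the burst; the other 50 are rejected
t[0] = 1.0                                      # one second later: 10 tokens refilled
print(sum(bucket.allow() for _ in range(50)))   # → 10: back to the sustained rate
```

Note the lazy refill: tokens are computed on demand from elapsed time, so no background timer is needed.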

Leaky Bucket

Requests enter a queue (the bucket). They're processed at a fixed rate. If the bucket is full, new requests are dropped.

Pros: Enforces a perfectly smooth output rate — good for protecting downstream services. Cons: Adds latency (requests queue instead of being served immediately).
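A minimal single-process sketch of the queue-and-drain behavior, again with an injected clock for determinism:

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests wait in a bounded queue and drain at a fixed rate."""

    def __init__(self, capacity: int, leak_rate: float, clock):
        self.capacity = capacity
        self.leak_rate = leak_rate      # requests processed per second
        self.queue = deque()
        self.clock = clock
        self.last_leak = clock()

    def offer(self, request) -> bool:
        now = self.clock()
        # Drain however many requests the fixed output rate permits since the last check.
        drained = int((now - self.last_leak) * self.leak_rate)
        for _ in range(min(drained, len(self.queue))):
            self.queue.popleft()        # in a real system: hand the request to a worker
        self.last_leak += drained / self.leak_rate
        if len(self.queue) >= self.capacity:
            return False                # bucket full: drop the request
        self.queue.append(request)
        return True
```

Unlike the token bucket, an accepted request is not served immediately: it sits in the queue until the fixed drain rate reaches it, which is exactly the latency cost noted above.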

Fixed Window Counter

Count requests in fixed time windows (e.g., 0–59 seconds, 60–119 seconds). Reject if count > limit.

Pros: Simple to implement. Cons: Boundary problem — a user can send 2x the limit by sending requests at the end of one window and start of the next.
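The boundary problem is easy to demonstrate. In this sketch, a 100-requests-per-minute limit admits 200 requests within about one second by straddling the window edge (old window counts are left unevicted for brevity):

```python
class FixedWindowCounter:
    """Fixed-window limiter: at most `limit` requests per `window`-second window."""

    def __init__(self, limit: int, window: float, clock):
        self.limit, self.window, self.clock = limit, window, clock
        self.counts = {}                # window index -> request count

    def allow(self) -> bool:
        win = int(self.clock() // self.window)
        self.counts[win] = self.counts.get(win, 0) + 1
        return self.counts[win] <= self.limit

t = [59.0]
f = FixedWindowCounter(limit=100, window=60, clock=lambda: t[0])
print(sum(f.allow() for _ in range(100)))  # → 100, all at t=59 (end of window 0)
t[0] = 60.0
print(sum(f.allow() for _ in range(100)))  # → 100 more at t=60 (start of window 1)
```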

Sliding Window Log

Keep a log of request timestamps. On each request, count how many timestamps are within the last N seconds.

Pros: Accurate. Cons: Memory-intensive for high-volume clients.
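A sketch of the log approach; the memory cost is visible directly — one stored timestamp per accepted request in the window (some variants log rejected requests too; this one does not):

```python
from collections import deque

class SlidingWindowLog:
    """Exact limiter: log request timestamps, count those inside the sliding window."""

    def __init__(self, limit: int, window: float, clock):
        self.limit, self.window, self.clock = limit, window, clock
        self.log = deque()              # timestamps of accepted requests, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Evict timestamps that have fallen out of the sliding window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```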

Sliding Window Counter

Hybrid: use fixed windows but weight by how much of the window has elapsed. Approximates the sliding window log with much less memory.

Most common production choice — accurate and memory-efficient.
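The estimate is `prev_window_count * (1 - fraction_of_current_window_elapsed) + current_window_count` — only two counters per client instead of a full timestamp log. A single-process sketch:

```python
class SlidingWindowCounter:
    """Approximate sliding window: weight the previous fixed window by its overlap."""

    def __init__(self, limit: int, window: float, clock):
        self.limit, self.window, self.clock = limit, window, clock
        self.curr_win, self.curr_count = None, 0
        self.prev_count = 0

    def allow(self) -> bool:
        now = self.clock()
        win = int(now // self.window)
        if win != self.curr_win:
            # Roll windows; the previous count only matters if it's the adjacent window.
            self.prev_count = self.curr_count if win - 1 == self.curr_win else 0
            self.curr_win, self.curr_count = win, 0
        elapsed = (now % self.window) / self.window
        # Weighted estimate of requests in the trailing `window` seconds.
        estimated = self.prev_count * (1 - elapsed) + self.curr_count
        if estimated < self.limit:
            self.curr_count += 1
            return True
        return False

t = [0.0]
s = SlidingWindowCounter(limit=100, window=60, clock=lambda: t[0])
print(sum(s.allow() for _ in range(120)))  # → 100: limit reached in the first window
t[0] = 90.0                                # halfway through the next window
print(sum(s.allow() for _ in range(120)))  # → 50: previous 100 weighted by 0.5 leaves room for 50
```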

Rate Limiting Dimensions

  • By user/API key: Limit individual users
  • By IP: Limit unauthenticated requests
  • By endpoint: Limit expensive endpoints more aggressively
  • Global: Limit total system throughput

Distributed Rate Limiting

Single-server rate limiting is easy. Multi-server is harder: each server needs to know the total request count across all servers.

Redis + Lua scripts is the standard solution:

  • Redis: atomic INCR/EXPIRE operations count requests across all servers
  • Lua scripts: ensure atomicity (check + increment in one operation)
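A sketch of this pattern for a fixed-window counter. The Lua script runs atomically inside Redis; the key name and limit here are hypothetical, and the dict-backed function below models the same check-and-increment locally so the logic is self-contained:

```python
# Lua executed server-side by Redis: INCR the per-window counter and set its TTL
# the first time the key is created. Atomic because Redis runs scripts serially.
FIXED_WINDOW_SCRIPT = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""

# With redis-py this would be invoked roughly as (key scheme is an assumption):
#   count = r.eval(FIXED_WINDOW_SCRIPT, 1, "rl:user42:" + str(window_id), 60)
#   allowed = count <= limit

def window_incr(store: dict, key: str, ttl: float, now: float) -> int:
    """Dict-backed model of the script's check-and-increment, for local reasoning."""
    count, expires_at = store.get(key, (0, now + ttl))
    if now >= expires_at:               # simulate Redis key expiry
        count, expires_at = 0, now + ttl
    store[key] = (count + 1, expires_at)
    return count + 1
```

Because every application server calls the same script against the same key, the count is global: no server needs to gossip its local totals.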

Managed API gateways and proxies (AWS API Gateway, Kong, Nginx) handle this out of the box.

Response Headers

Always return rate limit info to clients:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 750
X-RateLimit-Reset: 1735689600
Retry-After: 30  (when limit is exceeded)

Return HTTP 429 Too Many Requests when the limit is exceeded.
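A small helper illustrating how these headers might be assembled from limiter state (the function name is illustrative, not from any library):

```python
def rate_limit_headers(limit, remaining, reset_epoch, retry_after=None):
    """Build the informational rate-limit headers for a response."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),
    }
    if retry_after is not None:         # only meaningful on a 429 response
        headers["Retry-After"] = str(retry_after)
    return headers

print(rate_limit_headers(1000, 750, 1735689600))
print(rate_limit_headers(1000, 0, 1735689600, retry_after=30))
```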

Rate Limiting Location

| Layer | Pros | Cons |
|---|---|---|
| API Gateway | Centralized, easy | Additional hop |
| Application code | Flexible, no infra | Must implement yourself |
| Load balancer | Very fast | Limited flexibility |
| CDN edge | Global DDoS protection | Limited to L7 |

Interview Tips

  • Mention rate limiting when designing any public API — it's a sign of production thinking
  • Know that Token Bucket is usually the best answer: it's flexible, allows bursts, and is widely used
  • Redis is the standard distributed rate limiting backend — INCR is atomic and very fast
  • Discuss granularity: per-user, per-IP, and per-endpoint limits serve different purposes
  • Mention circuit breakers as a related pattern for service-to-service rate protection