Problem
Design a rate limiter that works across multiple application server nodes behind a load balancer. The rate limiter must enforce global rate limits (e.g., 100 requests per minute per API key) regardless of which server handles the request.
Requirements
- Global Rate Limiting: Enforce a single rate limit across all server nodes. If the limit is 100 req/min, the total requests from a client across all nodes must not exceed 100.
- Multiple Algorithms: Support fixed-window, sliding-window, and token-bucket rate limiting.
- Tiered Limits: Different API keys may have different rate limits (free: 60/min, pro: 600/min, enterprise: 6000/min).
- Burst Handling: Allow short bursts above the sustained rate (e.g., token bucket with burst capacity).
- Response Headers: Return standard rate limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset).
- Graceful Degradation: If the centralized rate limit store is unavailable, the system should degrade gracefully (e.g., fail open or fall back to per-node limits) rather than blocking all requests.
Constraints
- 20 application server nodes behind a load balancer.
- Combined throughput: 50,000 requests per second.
- Rate limit check must add less than 2ms of latency (p99).
- Rate limit state should not consume more than 10 GB of memory.
- The system must handle clock skew between nodes (up to 100ms).
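A quick back-of-envelope check shows the constraints above are workable. The per-key state size is an assumption (not given in the problem): roughly 128 bytes covers a key string, counters, and timestamps for a token-bucket entry.

```python
# Back-of-envelope check against the stated constraints.
# Assumption (not from the problem statement): ~128 bytes of rate limit
# state per API key (key string, counters, timestamps).
bytes_per_key = 128
memory_budget = 10 * 1024**3                 # 10 GB memory constraint
max_tracked_keys = memory_budget // bytes_per_key
print(max_tracked_keys)                      # → 83886080 (~84M keys fit)

# Per-node load if the load balancer spreads traffic evenly.
total_rps, nodes = 50_000, 20
print(total_rps // nodes)                    # → 2500 req/s per node
```

So even tens of millions of active API keys fit within the 10 GB budget, and each node sees only a few thousand rate limit checks per second.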
What to Design
- The algorithm for each rate limiting strategy (fixed window, sliding window, token bucket)
- How state is synchronized across nodes
- The data structure in the centralized store
- Your graceful degradation strategy when the store is unavailable
- How you handle race conditions in concurrent rate limit checks