Designing a Practical Rate Limiter
TL;DR
Rate limiters protect systems from excessive traffic by restricting request volumes within defined time periods. The optimal design isn't about adding more components but about achieving maximum efficiency with minimal overhead. A well-designed rate limiter uses Redis for in-memory key-value storage with expiration capabilities, implements the rate limiting logic as a library rather than a standalone service to reduce network hops, and scales through sharding rather than read replicas because the workload is write-heavy.
Three Interesting Things
Library over Service Architecture: One of the most counter-intuitive yet brilliant insights from this design is treating the rate limiter as a library rather than a service. In many system designs, the default approach is to create a dedicated service with its own API. However, this introduces additional network hops that increase latency for every request. By embedding the rate limiting logic directly into the service or proxy as a library, we eliminate these network hops, resulting in significantly lower latency—a critical requirement for rate limiters that must process every incoming request.
Redis with Key Expiration for Automatic Counter Reset: Using Redis's built-in key expiration feature provides an elegant solution to the time-window problem in rate limiting. Instead of explicitly tracking time windows and manually resetting counters, the system simply sets an expiration time on each counter key. When the time period expires, Redis automatically removes the key, effectively resetting the counter to zero when a new request comes in. This approach simplifies the implementation while leveraging Redis's optimized time-based operations.
Scaling Strategy Determined by Workload Characteristics: The design demonstrates how understanding workload characteristics drives architecture decisions. Since rate limiters perform a write operation (incrementing a counter) for every request, they create a write-heavy workload. This insight leads to choosing sharding over read replicas for scaling, as adding read replicas wouldn't address the bottleneck. This illustrates how proper system design flows from deeply understanding your data access patterns rather than applying generic scaling solutions.
Notes and Quick Explanation
Rate limiters serve as defensive mechanisms that protect backend services from being overwhelmed by traffic. They ensure system stability by enforcing request quotas, typically defined as a maximum number of requests allowed within a specific time period. When implemented effectively, they provide a first line of defense against both malicious attacks and unexpected traffic spikes that could otherwise cause system outages.
The presentation emphasizes creating designs that are "extremely realistic" rather than overly complex. The central philosophy is that adding more components doesn't necessarily create better systems—in fact, it often introduces unnecessary overhead and potential points of failure. This pragmatic approach drives all the subsequent design decisions.
Core Requirements and Principles
The rate limiter design addresses three fundamental requirements:
- Limit requests based on configuration: The system must enforce configurable thresholds for request volume.
- Support granular threshold configuration: Developers need the ability to set rate limits at individual API endpoint levels using simple annotations or declarations.
- Minimal overhead: Checking rate limits should add negligible latency to requests, as every request must pass through this check.
The rate limiter serves as the first evaluation point for incoming requests. When a request is received, it's immediately checked against configured limits. If allowed, the request proceeds to the backend service; if denied, it's rejected with an HTTP 429 "Too Many Requests" status code, preventing the backend service from even receiving the request.
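To make the "simple annotations or declarations" idea concrete, here is a minimal Python sketch of a decorator that attaches a per-endpoint threshold and rejects over-limit calls with a 429-style response. The names used here (rate_limit, is_allowed, AlwaysAllowLimiter, get_profile) are illustrative assumptions, not part of the original design.

```python
# Minimal sketch of endpoint-level rate limit declarations via a decorator.
# All names here are illustrative, not from the original design.
import functools

def rate_limit(limiter, limit: int, window_seconds: int):
    """Declare a per-endpoint threshold; reject over-limit calls with a 429-style response."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(client_id, *args, **kwargs):
            key = f"{handler.__name__}:{client_id}"
            if not limiter.is_allowed(key, limit, window_seconds):
                return {"status": 429, "body": "Too Many Requests"}
            return handler(client_id, *args, **kwargs)
        return wrapper
    return decorator

class AlwaysAllowLimiter:
    """Stand-in for the real Redis-backed limiter described later in these notes."""
    def is_allowed(self, key: str, limit: int, window_seconds: int) -> bool:
        return True

limiter = AlwaysAllowLimiter()

@rate_limit(limiter, limit=100, window_seconds=60)
def get_profile(client_id):
    return {"status": 200, "body": f"profile for {client_id}"}
```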
Placement Options
Rate limiters can be positioned at two primary locations in the request flow:
At the frontend proxy/load balancer level: The proxy intercepts all incoming requests, consults the rate limiter, and either forwards allowed requests or immediately rejects those that exceed limits. This approach prevents excessive requests from ever reaching backend services.
Within the backend service: The service itself checks request rates before processing, typically implemented as middleware that executes before the main request handler. Requests exceeding limits are rejected at this point.
The placement choice depends on your architecture. If you have a centralized entry point like a proxy, the first approach provides better protection. Without such a gateway, embedding rate limiting directly in each service offers a more straightforward implementation.
Rate Limiter Implementation
The core of the rate limiter is remarkably simple:
- For each request, identify the key to rate limit against (user ID, IP address, or API token)
- Increment a counter for that key in the current time window
- If the counter exceeds the configured threshold, reject the request
- Otherwise, allow the request to proceed
This implementation uses a "fixed window" algorithm—one of several possible approaches—where counters are reset at fixed time intervals.
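As a rough sketch of the fixed window algorithm, the following Python keeps counters in an in-memory dictionary. The class and method names are assumptions, and a production limiter would store the counters in Redis as described in the next section.

```python
# Minimal in-memory sketch of the fixed-window algorithm.
# A production limiter would keep these counters in Redis (see below).
import time

class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = {}  # (key, window index) -> request count

    def is_allowed(self, key: str) -> bool:
        # Bucket the current time into a fixed window (e.g. 10:00:00-10:00:59).
        window = int(time.time()) // self.window_seconds
        bucket = (key, window)
        count = self.counters.get(bucket, 0) + 1
        self.counters[bucket] = count
        # Old buckets are never cleaned up here; Redis key expiration handles
        # that automatically in the actual design.
        return count <= self.limit

limiter = FixedWindowLimiter(limit=5, window_seconds=60)
print(all(limiter.is_allowed("user-42") for _ in range(5)))  # True: within the limit
print(limiter.is_allowed("user-42"))                         # False: sixth request in the window
```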
Data Storage Requirements
The rate limiter's storage needs are driven by three operations:
- Writing/updating counters for each request (write-intensive)
- Reading current counter values to check against thresholds
- Tracking time windows and resetting counters when periods expire
These requirements point to a key-value store with:
- Fast read/write operations
- Built-in expiration capabilities
- In-memory performance for minimal latency
Redis emerges as the ideal solution, offering all these features plus atomic increment operations. By storing mappings like {user_id: request_count} with appropriate expiration times, Redis handles both counting and time window management efficiently.
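A minimal sketch of that pattern, assuming the redis-py client: a small Lua script increments the counter and sets the key's expiration only on the first request of a window, so the increment-and-expire step stays atomic. The key prefix, class name, and parameters are illustrative.

```python
# Sketch of the Redis-backed counter: atomic INCR plus key expiration.
# Assumes the redis-py client; key naming and parameter names are illustrative.
import redis

# Set the TTL only when the key is first created, so the window
# does not slide forward on every request.
LUA_INCR_WITH_TTL = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""

class RedisFixedWindowLimiter:
    def __init__(self, client: redis.Redis, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self._incr = client.register_script(LUA_INCR_WITH_TTL)

    def is_allowed(self, key: str) -> bool:
        count = self._incr(keys=[f"ratelimit:{key}"], args=[self.window_seconds])
        return count <= self.limit

# Usage (assumes a local Redis instance):
# limiter = RedisFixedWindowLimiter(redis.Redis(host="localhost", port=6379),
#                                   limit=100, window_seconds=60)
# limiter.is_allowed("user-42")
```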
The Library Approach
The most innovative aspect of this design is implementing the rate limiter as a library rather than a service. Traditional designs might create a rate limiter service with its own API, but this introduces additional network hops:
- Request → Frontend Proxy → Load Balancer → Rate Limiter Service → Redis → Back through the chain
- Then if approved: Frontend Proxy → Backend Service
This approach adds substantial latency to every request. Instead, the design proposes embedding rate limiting logic directly into either the proxy or the backend service as a library. This eliminates network hops, allowing direct communication with Redis:
- Request → Frontend Proxy (with rate limiter library) → Redis
- Then if approved: Proxy → Backend Service
The library contains all the business logic for rate limiting decisions, while Redis serves simply as the data store. This approach dramatically reduces latency while maintaining the same functionality.
Scaling the Rate Limiter
Scaling the rate limiter effectively means scaling Redis to handle growing request volumes. Three scaling strategies are considered:
- Vertical scaling: Simply increasing Redis server capacity works initially but has limits.
- Read replicas: Since rate limiting is write-heavy (incrementing counters for each request), read replicas provide little benefit.
- Sharding: Distributing data across multiple Redis instances based on key hash ensures even distribution of load.
Sharding emerges as the optimal scaling approach. When implemented, each service or proxy with the rate limiter library connects to a central configuration store that maintains information about available Redis shards. The library then uses consistent hashing to determine which shard should handle each key, directing requests appropriately.
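A minimal consistent-hashing sketch in Python: shard names are placed on a hash ring (with virtual nodes for smoother distribution), and each key maps to the first shard clockwise from its hash. The shard names, virtual-node count, and hash function are illustrative choices, not prescribed by the design.

```python
# Minimal consistent-hashing sketch for picking a Redis shard per key.
# Shard names, virtual-node count, and hash function are illustrative.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, shard_names, virtual_nodes: int = 100):
        self._ring = []  # sorted list of (hash, shard_name)
        for name in shard_names:
            for i in range(virtual_nodes):
                self._ring.append((self._hash(f"{name}#{i}"), name))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["redis-shard-0", "redis-shard-1", "redis-shard-2"])
print(ring.shard_for("user-42"))  # the same key always maps to the same shard
```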
Implementation Considerations
The actual implementation uses two primary components:
- Redis database cluster: Stores counters with appropriate expiration times
- Rate limiter library: Embedded in services/proxies, contains rate limiting logic
When a service starts up, the library:
- Reads shard configuration from a central source
- Establishes connections to all Redis shards
Then, for each incoming request, it:
- Determines the appropriate shard by hashing the key
- Performs an atomic increment-and-check operation on that shard (a sketch of this flow follows below)
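Here is a simplified sketch of that startup and routing flow, assuming a static JSON config file listing shard addresses and using a plain hash-modulo in place of the consistent hashing described above; all names and the config schema are illustrative.

```python
# Sketch of the library's startup and per-request routing.
# Assumes a static JSON shard config; names and schema are illustrative,
# and hash-modulo stands in for the consistent hashing described above.
import hashlib
import json
import redis

class ShardedRateLimiter:
    def __init__(self, config_path: str, limit: int, window_seconds: int):
        # Startup: read shard addresses once and open a client per shard.
        with open(config_path) as f:
            shards = json.load(f)["shards"]  # e.g. [{"host": "10.0.0.1", "port": 6379}, ...]
        self._clients = [redis.Redis(host=s["host"], port=s["port"]) for s in shards]
        self.limit = limit
        self.window_seconds = window_seconds

    def _client_for(self, key: str) -> redis.Redis:
        # Per request: pick a shard with a stable hash of the key.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self._clients[digest % len(self._clients)]

    def is_allowed(self, key: str) -> bool:
        client = self._client_for(key)
        redis_key = f"ratelimit:{key}"
        pipe = client.pipeline()
        pipe.incr(redis_key)
        # Simplification: this refreshes the TTL on every request; the Lua
        # script shown earlier sets it only once per window.
        pipe.expire(redis_key, self.window_seconds)
        count, _ = pipe.execute()
        return count <= self.limit
```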
This architecture can be supplemented with a simple admin interface for visibility and management, allowing internal teams to monitor rate limiting activity and adjust configurations as needed.
System Design and Implementation
The final architecture represents a minimalist yet highly effective approach. By eliminating unnecessary components and focusing on the core requirements, the system achieves maximum performance with minimal complexity.
The rate limiter's storage requirements are modest—approximately 20 bytes per entry (combining user ID/IP/token and counter value). Even with 100 million users, total storage would be around 2GB. The challenge isn't storage capacity but compute throughput—handling increments for every incoming request across distributed nodes.
The system design makes several key trade-offs:
- Accepting code duplication (the library is embedded in multiple services) in exchange for reduced latency (no extra network hops)
- Accepting potential consistency issues during configuration changes for improved performance
- Choosing sharding over other scaling techniques based on workload characteristics
These decisions reflect a deep understanding of the problem domain and pragmatic engineering.
Key Lessons / Takeaways
Simplicity Trumps Complexity: The design demonstrates that simpler architectures often outperform more complex ones. By resisting the urge to add unnecessary components and focusing on the essential requirements, we create systems that are both more efficient and easier to maintain.
Understand Your Workload: The decision to use sharding rather than read replicas comes from understanding the write-heavy nature of rate limiting workloads. System design should always be driven by a deep understanding of access patterns rather than applying generic scaling solutions.
Libraries vs. Services: There's a tendency in microservice architectures to create a service for every piece of functionality. This design shows that libraries can often be more efficient, especially for functionality that needs to execute on every request path. The choice between library and service should consider performance requirements, not just architectural purity.
Pragmatic Engineering: The design consistently prioritizes practical concerns like latency and throughput over theoretical elegance. This approach—focusing on real-world requirements rather than architectural ideals—produces systems that perform better in production environments.
Questions to Consider
How would you implement more sophisticated rate limiting algorithms like sliding window or leaky bucket using Redis?
What strategies would ensure consistency across Redis shards during configuration changes or shard rebalancing?
How might the design change if rate limits needed to be shared across multiple data centers or regions?
What monitoring and alerting would you implement to detect when rate limiting is actively protecting your services?
How would you handle rate limit exhaustion gracefully from a user experience perspective? Are there alternatives to simple rejection?
Could machine learning be incorporated to create adaptive rate limits based on historical traffic patterns?
References
- Redis documentation on key expiration: https://redis.io/commands/expire
- Rate limiting algorithms: Leaky Bucket, Fixed Window, and Sliding Window implementations
- Consistent hashing for distributed systems
- Microservice vs. library architecture trade-offs