# Rate Limits Guide

Protect your API from abuse by configuring rate limits per virtual key.
## Overview
Rate limits prevent abuse and ensure fair usage. Each virtual key can have:
- Request limits - Maximum requests per time period
- Token limits - Maximum tokens per time period
- Reset periods - Hourly, daily, weekly, or monthly
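The mechanics behind these limits can be sketched as a minimal fixed-window counter. This is illustrative only, not the gateway's actual implementation; the class and method names are made up:

```python
import time


class FixedWindowLimiter:
    """Track request and token usage against per-window maximums."""

    def __init__(self, request_max, token_max, window_seconds):
        self.request_max = request_max
        self.token_max = token_max
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.requests = 0
        self.tokens = 0

    def allow(self, tokens=0):
        """Return True if one more request (consuming `tokens`) fits in the window."""
        now = time.monotonic()
        if now - self.window_start >= self.window_seconds:
            # Reset period elapsed: start a fresh window
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        if self.requests + 1 > self.request_max or self.tokens + tokens > self.token_max:
            return False
        self.requests += 1
        self.tokens += tokens
        return True
```

Both counters reset together when the window elapses; a request is rejected if it would exceed either the request or the token maximum.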
## Default Limits
| Plan | Requests/Hour | Tokens/Hour |
|---|---|---|
| Free | 100 | 100,000 |
| Pro | 1,000 | 1,000,000 |
| Enterprise | Custom | Custom |
## Configuration

### Per Virtual Key
```json
{
  "governance": {
    "rate_limits": [
      {
        "id": "standard-limit",
        "name": "Standard Rate Limit",
        "request_max_limit": 1000,
        "request_reset_duration": "1h",
        "token_max_limit": 1000000,
        "token_reset_duration": "1h"
      }
    ],
    "virtual_keys": [
      {
        "id": "key-1",
        "name": "Production Key",
        "rate_limit_id": "standard-limit"
      }
    ]
  }
}
```
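Virtual keys reference a rate limit by `rate_limit_id`, so one limit definition can be shared by many keys. A small sketch of that lookup (the gateway does this internally; the function here is illustrative):

```python
def resolve_rate_limit(config, key_id):
    """Find the rate-limit entry that a virtual key points to via rate_limit_id."""
    governance = config["governance"]
    limits = {rl["id"]: rl for rl in governance["rate_limits"]}
    for vk in governance["virtual_keys"]:
        if vk["id"] == key_id:
            return limits.get(vk["rate_limit_id"])
    return None
```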
## Rate Limit Behavior

### When Limits Are Exceeded

Requests over the limit fail with an error like:
```json
{
  "error": {
    "message": "Rate limit exceeded: 1000 requests/hour",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 3600
  }
}
```
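The `retry_after` field tells you how many seconds to wait before retrying. A minimal sketch that extracts it from an error body of the shape shown above, with a fallback when the payload is malformed:

```python
import json


def seconds_to_wait(error_body, default=60):
    """Extract retry_after from a rate-limit error payload, with a fallback."""
    try:
        error = json.loads(error_body)["error"]
        if error.get("code") == "rate_limit_exceeded":
            return int(error.get("retry_after", default))
    except (json.JSONDecodeError, KeyError):
        pass
    return default
```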
### Response Headers

All responses include rate limit info:
```
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 523
X-RateLimit-Reset: 1707456000
```
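`X-RateLimit-Reset` is a Unix timestamp. These headers are enough to compute how long until the window resets and whether you should slow down. A sketch, assuming the header names above (the 10% threshold is an arbitrary choice):

```python
import time


def time_until_reset(headers):
    """Seconds until the current window resets, based on X-RateLimit-Reset."""
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    return max(0, reset_at - int(time.time()))


def should_throttle(headers, threshold=0.1):
    """True when fewer than `threshold` of the window's requests remain."""
    limit = int(headers.get("X-RateLimit-Limit", 1))
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    return remaining / limit < threshold
```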
## Handling Rate Limits

### Exponential Backoff
```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY"
)

def make_request_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="anthropic/claude-sonnet-4-5-20250929",
                messages=messages
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, 8s, ...
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

# Usage
response = make_request_with_backoff([{"role": "user", "content": "Hello"}])
```
### Check Rate Limit Before Request
```python
import requests

def check_rate_limit(api_key):
    response = requests.get(
        "http://localhost:8081/api/governance/virtual-keys/limits",
        headers={"x-bf-vk": api_key}
    )
    return response.json()

# Usage
limits = check_rate_limit("sk-bf-YOUR_KEY")
if limits["remaining_requests"] < 10:
    print("Warning: Approaching rate limit")
```
## Best Practices

### 1. Implement Queuing
```python
import queue
import threading
import time

from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY"
)

request_queue = queue.Queue()

def worker():
    while True:
        messages = request_queue.get()
        try:
            response = client.chat.completions.create(
                model="anthropic/claude-sonnet-4-5-20250929",
                messages=messages
            )
            print(response.choices[0].message.content)
        except RateLimitError:
            request_queue.put(messages)  # Re-queue and back off
            time.sleep(60)
        finally:
            request_queue.task_done()

# Start workers
for _ in range(5):
    threading.Thread(target=worker, daemon=True).start()

# Queue requests
request_queue.put([{"role": "user", "content": "Hello"}])
```
### 2. Use Batch Processing
```python
# Instead of multiple small requests:
for question in questions:
    client.chat.completions.create(
        model="anthropic/claude-sonnet-4-5-20250929",
        messages=[{"role": "user", "content": question}]
    )

# Batch them into a single request:
batched = "\n\n".join(questions)
client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": f"Answer each:\n{batched}"}]
)
```
### 3. Monitor Rate Limits
The parsed completion object does not expose HTTP headers; use the OpenAI SDK's `with_raw_response` to read them:

```python
def monitor_rate_limit(headers):
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    limit = int(headers.get("X-RateLimit-Limit", 1))
    usage_percent = (limit - remaining) / limit * 100
    if usage_percent > 80:
        print(f"Warning: {usage_percent:.0f}% of rate limit used")
    return usage_percent

# Usage: the raw response exposes headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello"}]
)
usage = monitor_rate_limit(raw.headers)
response = raw.parse()
```
## Rate Limit Strategies

### Strategy 1: Conservative
Set low limits to prevent unexpected costs:
```json
{
  "request_max_limit": 100,
  "request_reset_duration": "1h",
  "token_max_limit": 100000,
  "token_reset_duration": "1h"
}
```
### Strategy 2: Balanced
For production applications:
```json
{
  "request_max_limit": 1000,
  "request_reset_duration": "1h",
  "token_max_limit": 1000000,
  "token_reset_duration": "1h"
}
```
### Strategy 3: Aggressive
For high-volume applications:
```json
{
  "request_max_limit": 10000,
  "request_reset_duration": "1h",
  "token_max_limit": 10000000,
  "token_reset_duration": "1h"
}
```
## Increasing Limits
To increase your rate limits:
- Contact support@korad.ai
- Provide justification for higher limits
- Enterprise plans get custom limits
## Monitoring

### Dashboard
View rate limit usage in real-time at: https://dashboard.korad.ai/rate-limits
### API

```bash
curl http://localhost:8081/api/governance/virtual-keys/key-1/rate-limit
```
Response:

```json
{
  "request_limit": 1000,
  "request_remaining": 523,
  "request_reset_at": "2025-02-08T12:00:00Z",
  "token_limit": 1000000,
  "token_remaining": 523456,
  "token_reset_at": "2025-02-08T12:00:00Z"
}
```
Protect your API with appropriate rate limits.