Rate Limits Guide

Protect your API from abuse by configuring rate limits per virtual key.

Overview

Rate limits prevent abuse and ensure fair usage. Each virtual key can have:

  • Request limits - Maximum requests per time period
  • Token limits - Maximum tokens per time period
  • Reset periods - Hourly, daily, weekly, or monthly
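Conceptually, a request limit with a reset period behaves like a fixed-window counter: requests increment a counter, and the counter clears when the period elapses. The sketch below is illustrative only, not the gateway's actual implementation:

```python
import time

class FixedWindowLimiter:
    """Illustrative fixed-window limiter: allow `max_requests` per `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # Reset period elapsed: start a new window
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(max_requests=3, window_seconds=3600)
print([limiter.allow() for _ in range(4)])  # → [True, True, True, False]
```

Token limits work the same way, except the counter advances by the number of tokens consumed rather than by one per request.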

Default Limits

Plan         Requests/Hour   Tokens/Hour
Free         100             100,000
Pro          1,000           1,000,000
Enterprise   Custom          Custom
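To sanity-check whether a plan covers your expected traffic, you can compare projected hourly volume against these defaults (a rough sketch; plan numbers taken from the table above):

```python
# Default plan limits from the table above
PLANS = {
    "free": {"requests_per_hour": 100, "tokens_per_hour": 100_000},
    "pro": {"requests_per_hour": 1_000, "tokens_per_hour": 1_000_000},
}

def fits_plan(plan, requests_per_hour, avg_tokens_per_request):
    """True when projected traffic stays under both the request and token limits."""
    limits = PLANS[plan]
    projected_tokens = requests_per_hour * avg_tokens_per_request
    return (requests_per_hour <= limits["requests_per_hour"]
            and projected_tokens <= limits["tokens_per_hour"])

print(fits_plan("free", 80, 500))    # 40,000 tokens/h → True
print(fits_plan("free", 80, 2_000))  # 160,000 tokens/h exceeds 100,000 → False
```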

Configuration

Per Virtual Key

{
  "governance": {
    "rate_limits": [
      {
        "id": "standard-limit",
        "name": "Standard Rate Limit",
        "request_max_limit": 1000,
        "request_reset_duration": "1h",
        "token_max_limit": 1000000,
        "token_reset_duration": "1h"
      }
    ],
    "virtual_keys": [
      {
        "id": "key-1",
        "name": "Production Key",
        "rate_limit_id": "standard-limit"
      }
    ]
  }
}

Rate Limit Behavior

When Limits Are Exceeded

{
  "error": {
    "message": "Rate limit exceeded: 1000 requests/hour",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 3600
  }
}
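The `retry_after` field (seconds) tells clients how long to wait before retrying. A minimal sketch of reading it from an error body like the one above:

```python
def retry_delay(error_body, default=60):
    """Return the gateway-suggested wait in seconds, falling back to `default`."""
    return error_body.get("error", {}).get("retry_after", default)

error_body = {
    "error": {
        "message": "Rate limit exceeded: 1000 requests/hour",
        "type": "rate_limit_error",
        "code": "rate_limit_exceeded",
        "retry_after": 3600,
    }
}
print(retry_delay(error_body))  # → 3600
```

In practice you would `time.sleep(retry_delay(body))` before retrying, ideally capped at some sensible maximum.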

Response Headers

All responses include rate limit info:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 523
X-RateLimit-Reset: 1707456000
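A small helper can parse these headers into integers regardless of how you made the request (header names as shown above; a sketch):

```python
def parse_rate_limit_headers(headers):
    """Extract the rate-limit headers as ints (reset is epoch seconds)."""
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        "reset": int(headers["X-RateLimit-Reset"]),
    }

info = parse_rate_limit_headers({
    "X-RateLimit-Limit": "1000",
    "X-RateLimit-Remaining": "523",
    "X-RateLimit-Reset": "1707456000",
})
print(info)  # → {'limit': 1000, 'remaining': 523, 'reset': 1707456000}
```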

Handling Rate Limits

Exponential Backoff

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY"
)

def make_request_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="anthropic/claude-sonnet-4-5-20250929",
                messages=messages
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

# Usage
response = make_request_with_backoff([{"role": "user", "content": "Hello"}])

Check Rate Limit Before Request

import requests

def check_rate_limit(api_key):
    response = requests.get(
        "http://localhost:8081/api/governance/virtual-keys/limits",
        headers={"x-bf-vk": api_key}
    )
    return response.json()

# Usage
limits = check_rate_limit("sk-bf-YOUR_KEY")
if limits['remaining_requests'] < 10:
    print("Warning: Approaching rate limit")

Best Practices

1. Implement Queuing

import queue
import threading
import time

from openai import RateLimitError

request_queue = queue.Queue()

def worker():
    while True:
        messages = request_queue.get()
        try:
            response = client.chat.completions.create(
                model="anthropic/claude-sonnet-4-5-20250929",
                messages=messages
            )
            print(response.choices[0].message.content)
        except RateLimitError:
            request_queue.put(messages)  # Re-queue and back off
            time.sleep(60)
        finally:
            request_queue.task_done()

# Start workers
for _ in range(5):
    threading.Thread(target=worker, daemon=True).start()

# Queue requests
request_queue.put([{"role": "user", "content": "Hello"}])

2. Use Batch Processing

# Instead of many small requests:
for question in questions:
    client.chat.completions.create(
        model="anthropic/claude-sonnet-4-5-20250929",
        messages=[{"role": "user", "content": question}]
    )

# Batch them into one request:
batched = "\n\n".join(questions)
client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": f"Answer each:\n{batched}"}]
)

3. Monitor Rate Limits

def monitor_rate_limit(headers):
    remaining = int(headers.get('X-RateLimit-Remaining', 0))
    limit = int(headers.get('X-RateLimit-Limit', 1))
    usage_percent = (limit - remaining) / limit * 100

    if usage_percent > 80:
        print(f"Warning: {usage_percent:.0f}% of rate limit used")

    return usage_percent

# Usage: with_raw_response exposes the HTTP headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello"}]
)
usage = monitor_rate_limit(raw.headers)
response = raw.parse()

Rate Limit Strategies

Strategy 1: Conservative

Set low limits to prevent unexpected costs:

{
  "request_max_limit": 100,
  "request_reset_duration": "1h",
  "token_max_limit": 100000,
  "token_reset_duration": "1h"
}

Strategy 2: Balanced

For production applications:

{
  "request_max_limit": 1000,
  "request_reset_duration": "1h",
  "token_max_limit": 1000000,
  "token_reset_duration": "1h"
}

Strategy 3: Aggressive

For high-volume applications:

{
  "request_max_limit": 10000,
  "request_reset_duration": "1h",
  "token_max_limit": 10000000,
  "token_reset_duration": "1h"
}
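As a back-of-the-envelope comparison, all three strategies leave the same average token budget per request; they differ only in hourly volume:

```python
# (requests/hour, tokens/hour) for each strategy above
strategies = {
    "conservative": (100, 100_000),
    "balanced": (1_000, 1_000_000),
    "aggressive": (10_000, 10_000_000),
}

for name, (requests, tokens) in strategies.items():
    print(f"{name}: {tokens // requests} tokens/request on average")
# Each prints 1000 tokens/request
```

If your prompts routinely exceed ~1,000 tokens, you will hit the token limit before the request limit, so size the token limit first.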

Increasing Limits

To increase your rate limits:

  1. Contact support@korad.ai
  2. Provide justification for higher limits
  3. Enterprise plans get custom limits

Monitoring

Dashboard

View rate limit usage in real time at: https://dashboard.korad.ai/rate-limits

API

curl http://localhost:8081/api/governance/virtual-keys/key-1/rate-limit

Response:

{
  "request_limit": 1000,
  "request_remaining": 523,
  "request_reset_at": "2025-02-08T12:00:00Z",
  "token_limit": 1000000,
  "token_remaining": 523456,
  "token_reset_at": "2025-02-08T12:00:00Z"
}
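A small helper can turn that response into a throttle decision before the window is exhausted (the 10% threshold is an arbitrary choice; a sketch):

```python
def should_throttle(status, threshold=0.1):
    """True when remaining requests or tokens drop below `threshold` of the limit."""
    return (status["request_remaining"] < status["request_limit"] * threshold
            or status["token_remaining"] < status["token_limit"] * threshold)

status = {
    "request_limit": 1000,
    "request_remaining": 523,
    "token_limit": 1000000,
    "token_remaining": 523456,
}
print(should_throttle(status))  # → False (over half the window remains)
```

Polling this endpoint periodically and pausing producers when `should_throttle` returns True avoids burning retries against a hard 429.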

Protect your API with appropriate rate limits.