Rate Limits Guide

Protect your API from abuse by configuring rate limits per virtual key.

Overview

Rate limits prevent abuse and ensure fair usage. Each virtual key can have:

  • Request limits - Maximum requests per time period
  • Token limits - Maximum tokens per time period
  • Reset periods - Hourly, daily, weekly, or monthly
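Conceptually, a request limit with a reset period behaves like a fixed-window counter: requests increment a counter, and the counter clears when the period elapses. The sketch below is illustrative only, not the gateway's actual implementation:

```python
import time

class FixedWindowLimiter:
    """Illustrative fixed-window limiter: allow `max_requests` per `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            # Reset period elapsed: start a new window
            self.window_start = now
            self.count = 0
        if self.count < self.max_requests:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(max_requests=3, window_seconds=3600)
print([limiter.allow() for _ in range(4)])  # → [True, True, True, False]
```

Token limits work the same way, except the counter advances by the number of tokens consumed rather than by one per request.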

Default Limits

Plan         Requests/Hour   Tokens/Hour
Free         100             100,000
Pro          1,000           1,000,000
Enterprise   Custom          Custom
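To sanity-check whether a plan covers your expected traffic, you can compare projected hourly volume against these defaults (a rough sketch; plan numbers taken from the table above):

```python
# Default plan limits from the table above
PLANS = {
    "free": {"requests_per_hour": 100, "tokens_per_hour": 100_000},
    "pro": {"requests_per_hour": 1_000, "tokens_per_hour": 1_000_000},
}

def fits_plan(plan, requests_per_hour, avg_tokens_per_request):
    """True when projected traffic stays under both the request and token limits."""
    limits = PLANS[plan]
    projected_tokens = requests_per_hour * avg_tokens_per_request
    return (requests_per_hour <= limits["requests_per_hour"]
            and projected_tokens <= limits["tokens_per_hour"])

print(fits_plan("free", 80, 500))    # 40,000 tokens/h → True
print(fits_plan("free", 80, 2_000))  # 160,000 tokens/h exceeds 100,000 → False
```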

Configuration

Per Virtual Key

{
  "governance": {
    "rate_limits": [
      {
        "id": "standard-limit",
        "name": "Standard Rate Limit",
        "request_max_limit": 1000,
        "request_reset_duration": "1h",
        "token_max_limit": 1000000,
        "token_reset_duration": "1h"
      }
    ],
    "virtual_keys": [
      {
        "id": "key-1",
        "name": "Production Key",
        "rate_limit_id": "standard-limit"
      }
    ]
  }
}

Rate Limit Behavior

When Limits Are Exceeded

{
  "error": {
    "message": "Rate limit exceeded: 1000 requests/hour",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded",
    "retry_after": 3600
  }
}
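The `retry_after` field (seconds) tells clients how long to wait before retrying. A minimal sketch of reading it from an error body like the one above:

```python
def retry_delay(error_body, default=60):
    """Return the gateway-suggested wait in seconds, falling back to `default`."""
    return error_body.get("error", {}).get("retry_after", default)

error_body = {
    "error": {
        "message": "Rate limit exceeded: 1000 requests/hour",
        "type": "rate_limit_error",
        "code": "rate_limit_exceeded",
        "retry_after": 3600,
    }
}
print(retry_delay(error_body))  # → 3600
```

In practice you would `time.sleep(retry_delay(body))` before retrying, ideally capped at some sensible maximum.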

Response Headers

All responses include rate limit info:

X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 523
X-RateLimit-Reset: 1707456000
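A small helper can parse these headers into integers regardless of how you made the request (header names as shown above; a sketch):

```python
def parse_rate_limit_headers(headers):
    """Extract the rate-limit headers as ints (reset is epoch seconds)."""
    return {
        "limit": int(headers["X-RateLimit-Limit"]),
        "remaining": int(headers["X-RateLimit-Remaining"]),
        "reset": int(headers["X-RateLimit-Reset"]),
    }

info = parse_rate_limit_headers({
    "X-RateLimit-Limit": "1000",
    "X-RateLimit-Remaining": "523",
    "X-RateLimit-Reset": "1707456000",
})
print(info)  # → {'limit': 1000, 'remaining': 523, 'reset': 1707456000}
```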

Handling Rate Limits

Exponential Backoff

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY"
)

def make_request_with_backoff(messages, max_retries=5):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="anthropic/claude-sonnet-4-5-20250929",
                messages=messages
            )
        except RateLimitError:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise

# Usage
response = make_request_with_backoff([{"role": "user", "content": "Hello"}])

Check Rate Limit Before Request

import requests

def check_rate_limit(api_key):
    response = requests.get(
        "http://localhost:8081/api/governance/virtual-keys/limits",
        headers={"x-bf-vk": api_key}
    )
    return response.json()

# Usage
limits = check_rate_limit("sk-bf-YOUR_KEY")
if limits['remaining_requests'] < 10:
    print("Warning: Approaching rate limit")

Best Practices

1. Implement Queuing

import queue
import threading
import time

from openai import RateLimitError

request_queue = queue.Queue()

def worker():
    while True:
        messages = request_queue.get()
        try:
            response = client.chat.completions.create(
                model="anthropic/claude-sonnet-4-5-20250929",
                messages=messages
            )
            print(response.choices[0].message.content)
        except RateLimitError:
            request_queue.put(messages)  # Re-queue and back off
            time.sleep(60)
        finally:
            request_queue.task_done()

# Start workers
for _ in range(5):
    threading.Thread(target=worker, daemon=True).start()

# Queue requests
request_queue.put([{"role": "user", "content": "Hello"}])

2. Use Batch Processing

# Instead of many small requests:
for question in questions:
    client.chat.completions.create(
        model="anthropic/claude-sonnet-4-5-20250929",
        messages=[{"role": "user", "content": question}]
    )

# Batch them into one request:
batched = "\n\n".join(questions)
client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": f"Answer each:\n{batched}"}]
)

3. Monitor Rate Limits

def monitor_rate_limit(headers):
    remaining = int(headers.get('X-RateLimit-Remaining', 0))
    limit = int(headers.get('X-RateLimit-Limit', 1))
    usage_percent = (limit - remaining) / limit * 100

    if usage_percent > 80:
        print(f"Warning: {usage_percent:.0f}% of rate limit used")

    return usage_percent

# Usage: with_raw_response exposes the HTTP headers alongside the parsed body
raw = client.chat.completions.with_raw_response.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello"}]
)
usage = monitor_rate_limit(raw.headers)
response = raw.parse()

Rate Limit Strategies

Strategy 1: Conservative

Set low limits to prevent unexpected costs:

{
  "request_max_limit": 100,
  "request_reset_duration": "1h",
  "token_max_limit": 100000,
  "token_reset_duration": "1h"
}

Strategy 2: Balanced

For production applications:

{
  "request_max_limit": 1000,
  "request_reset_duration": "1h",
  "token_max_limit": 1000000,
  "token_reset_duration": "1h"
}

Strategy 3: Aggressive

For high-volume applications:

{
  "request_max_limit": 10000,
  "request_reset_duration": "1h",
  "token_max_limit": 10000000,
  "token_reset_duration": "1h"
}
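As a back-of-the-envelope comparison, all three strategies leave the same average token budget per request; they differ only in hourly volume:

```python
# (requests/hour, tokens/hour) for each strategy above
strategies = {
    "conservative": (100, 100_000),
    "balanced": (1_000, 1_000_000),
    "aggressive": (10_000, 10_000_000),
}

for name, (requests, tokens) in strategies.items():
    print(f"{name}: {tokens // requests} tokens/request on average")
# Each prints 1000 tokens/request
```

If your prompts routinely exceed ~1,000 tokens, you will hit the token limit before the request limit, so size the token limit first.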

Increasing Limits

To increase your rate limits:

  1. Contact support@korad.ai
  2. Provide justification for higher limits
  3. Enterprise plans get custom limits

Monitoring

Dashboard

View rate limit usage in real time at: https://dashboard.korad.ai/rate-limits

API

curl http://localhost:8081/api/governance/virtual-keys/key-1/rate-limit

Response:

{
  "request_limit": 1000,
  "request_remaining": 523,
  "request_reset_at": "2025-02-08T12:00:00Z",
  "token_limit": 1000000,
  "token_remaining": 523456,
  "token_reset_at": "2025-02-08T12:00:00Z"
}
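A small helper can turn that response into a throttle decision before the window is exhausted (the 10% threshold is an arbitrary choice; a sketch):

```python
def should_throttle(status, threshold=0.1):
    """True when remaining requests or tokens drop below `threshold` of the limit."""
    return (status["request_remaining"] < status["request_limit"] * threshold
            or status["token_remaining"] < status["token_limit"] * threshold)

status = {
    "request_limit": 1000,
    "request_remaining": 523,
    "token_limit": 1000000,
    "token_remaining": 523456,
}
print(should_throttle(status))  # → False (over half the window remains)
```

Polling this endpoint periodically and pausing producers when `should_throttle` returns True avoids burning retries against a hard 429.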

Protect your API with appropriate rate limits.