# Cost Optimization Guide

Get the most out of your LLM spend: practical strategies for maximizing cost savings.
## Understanding the Savings Waterfall
Korad.AI automatically applies optimizations in this priority order:

1. Savings Slider (if set) - hard context cap
2. Vanishing Context (if enabled) - document QA optimization
3. Recursive RLM (if enabled, or >80k tokens) - complex reasoning
4. Family-Locked Summary (default, >20k tokens) - chat compression
5. Semantic Cache (built-in) - duplicate request detection
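The priority order above can be sketched as a simple dispatch function. This is an illustrative assumption about the routing logic, not the actual engine; the header names are the ones used throughout this guide:

```python
def select_strategy(headers: dict, prompt_tokens: int) -> str:
    """Sketch of the savings waterfall: first matching rule wins."""
    if "X-Savings-Level" in headers:
        return "savings_slider"
    if headers.get("X-Vanishing-Context") == "true":
        return "vanishing_context"
    if headers.get("X-Korad-RLM") == "true" or prompt_tokens > 80_000:
        return "recursive_rlm"
    if prompt_tokens > 20_000:
        return "family_locked_summary"
    return "semantic_cache"
```

Note that an explicit Savings Slider setting short-circuits everything below it, which is why the combined-strategy examples later in this guide treat the slider as a fallback cap.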
## Optimization Strategies
### Strategy 1: Use Savings Slider for Cost Control

For budget-constrained applications:

```python
headers = {
    "X-Savings-Level": "extreme"  # 89% savings, 16k context cap
}
```
Use cases:
- Free tier users
- Cost-sensitive features
- Testing/development
### Strategy 2: Vanishing Context for Document QA

For document-heavy applications:

```python
headers = {
    "X-Vanishing-Context": "true"
}
```
Use cases:
- Legal document analysis
- Technical documentation queries
- Research paper synthesis
- Contract review
Savings: Up to 99% on large documents
### Strategy 3: RLM for Complex Reasoning

For multi-step analysis:

```python
headers = {
    "X-Korad-RLM": "true"
}
```
Use cases:
- Codebase analysis
- Multi-document synthesis
- Complex reasoning tasks
Savings: 50-97% depending on complexity
### Strategy 4: Combine Strategies

Layer optimizations for maximum savings:

```python
headers = {
    "X-Vanishing-Context": "true",  # Try this first
    "X-Savings-Level": "med"        # Fallback cap
}
```
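These header dictionaries are attached per request. With the official `openai` Python SDK (or any OpenAI-compatible client), the standard mechanism is the `extra_headers` parameter on `create()`. A small helper (hypothetical, not part of Korad.AI) keeps the combinations consistent:

```python
def korad_headers(savings_level=None, vanishing=False, rlm=False) -> dict:
    """Build optimization headers for a single request."""
    headers = {}
    if savings_level:
        headers["X-Savings-Level"] = savings_level
    if vanishing:
        headers["X-Vanishing-Context"] = "true"
    if rlm:
        headers["X-Korad-RLM"] = "true"
    return headers

# Usage with an OpenAI-compatible client:
# client.chat.completions.create(
#     model="anthropic/claude-sonnet-4-5-20250929",
#     messages=[{"role": "user", "content": prompt}],
#     extra_headers=korad_headers(vanishing=True, savings_level="med"),
# )
```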
## Model Selection Strategies

### Budget-Friendly Models

| Model | Cost/1M Input | Best For |
|---|---|---|
| claude-haiku | $0.25 | Fast responses |
| gpt-4o-mini | $0.15 | General tasks |
| deepseek-chat | $0.14 | Lowest cost |
| gemini-2.0-flash | $0.075 | Fastest |
### Quality-Optimized Models

| Model | Cost/1M Input | Best For |
|---|---|---|
| claude-opus | $15.00 | Complex reasoning |
| gpt-4o | $2.50 | Multimodal |
| gemini-2.5-pro | $1.25 | Large context |
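If you route between models programmatically, the tables above translate directly into a lookup. The helper below is a sketch (prices as listed; the function name is illustrative):

```python
# Cost per 1M input tokens, from the tables above
MODEL_COSTS = {
    "claude-haiku": 0.25,
    "gpt-4o-mini": 0.15,
    "deepseek-chat": 0.14,
    "gemini-2.0-flash": 0.075,
    "claude-opus": 15.00,
    "gpt-4o": 2.50,
    "gemini-2.5-pro": 1.25,
}

def cheapest_model(candidates: list[str]) -> str:
    """Pick the lowest input-cost model from a candidate set."""
    return min(candidates, key=MODEL_COSTS.__getitem__)
```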
### Hybrid Approach

Use cheap models for initial processing and expensive models for the final output:

```python
# Step 1: Use Haiku to summarize (cheap)
summary = client.chat.completions.create(
    model="anthropic/claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": large_context}],
    max_tokens=500
)
summary_text = summary.choices[0].message.content

# Step 2: Use Sonnet for the final output (quality)
final = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[
        {"role": "user", "content": f"Based on this summary: {summary_text}"}
    ]
)
```
## Application-Level Optimizations

### 1. Implement Semantic Cache

A local exact-match cache catches verbatim repeats before they ever reach the gateway (Korad.AI's built-in semantic cache still handles near-duplicates):

```python
cache = {}

def cached_completion(prompt):
    # Check the local cache first
    if prompt in cache:
        return cache[prompt]

    # Make the request
    response = client.chat.completions.create(...)

    # Cache the result
    cache[prompt] = response
    return response
```
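Keying on the raw prompt string means trivial whitespace or casing differences cause cache misses. A slightly more forgiving sketch normalizes the prompt before hashing it; this is still exact-match after normalization, and the class name is illustrative:

```python
import hashlib

class PromptCache:
    """Local cache keyed on a normalized prompt hash."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Collapse whitespace and casing so near-identical prompts hit
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response) -> None:
        self._store[self._key(prompt)] = response
```

Hashing also keeps memory bounded per entry when prompts are large documents.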
### 2. Batch Similar Requests

```python
# Instead of one request per question:
# for question in questions:
#     answer(question)

# Batch them into a single request:
batched = "\n\n".join(questions)
response = client.chat.completions.create(
    messages=[{"role": "user", "content": f"Answer each:\n{batched}"}]
)
```
### 3. Use Streaming for Early Termination

```python
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": question}],
    stream=True
)

full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is None:  # some chunks carry no content
        continue
    full_response += delta
    # Stop early if satisfied
    if "answer:" in full_response.lower():
        break
```
## Monitoring and Alerts

### Track Your Savings

```python
def track_savings(response):
    return {
        "original": response.response_headers.get('X-Korad-Original-Tokens'),
        "optimized": response.response_headers.get('X-Korad-Optimized-Tokens'),
        "savings": response.response_headers.get('X-Korad-Savings-USD'),
        "strategy": response.response_headers.get('X-Korad-Strategy')
    }

# Use in your app
response = client.chat.completions.create(...)
stats = track_savings(response)
print(f"Saved {stats['savings']} using {stats['strategy']}")
```
### Set Budget Alerts

```python
class BudgetExceededError(Exception):
    pass

monthly_budget = 100.00  # $100/month
current_spend = 0.0

def check_budget(cost):
    global current_spend
    current_spend += cost
    if current_spend > monthly_budget * 0.9:
        send_alert(f"90% of budget used: ${current_spend:.2f}")  # your notification hook
    if current_spend >= monthly_budget:
        raise BudgetExceededError("Monthly budget exceeded")
```
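In larger apps, the same logic is easier to test and reuse without module-level globals. A stateful sketch (the class name and threshold parameter are assumptions, not part of any Korad.AI SDK):

```python
class BudgetTracker:
    """Track cumulative spend against a monthly budget."""

    def __init__(self, monthly_budget: float, warn_at: float = 0.9):
        self.budget = monthly_budget
        self.warn_at = warn_at
        self.spent = 0.0

    def record(self, cost: float) -> str:
        """Add a request's cost; return 'ok', 'warning', or 'exceeded'."""
        self.spent += cost
        if self.spent >= self.budget:
            return "exceeded"
        if self.spent > self.budget * self.warn_at:
            return "warning"
        return "ok"
```

The caller decides whether "exceeded" means raising, queueing, or downgrading to a cheaper model.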
## Cost Calculator

Estimate your costs before implementation:

```python
def estimate_cost(input_tokens, output_tokens, model, optimization_tier=None):
    """Estimate cost for a request."""
    # Pricing per 1M tokens
    pricing = {
        "claude-opus": {"input": 15.00, "output": 75.00},
        "claude-sonnet": {"input": 3.00, "output": 15.00},
        "claude-haiku": {"input": 0.25, "output": 1.25},
        "deepseek-chat": {"input": 0.14, "output": 0.28},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    }

    # Optimization savings
    tier_savings = {
        "tier_2": 0.99,  # 99% savings
        "tier_3": 0.97,  # 97% savings
        "tier_4": 0.30,  # 30% savings
        "tier_5_extreme": 0.89,
        "tier_5_max": 0.78,
        "tier_5_med": 0.56,
        "tier_5_min": 0.12,
    }

    # Match on the model-family prefix, so a versioned ID like
    # "claude-sonnet-4-5-20250929" resolves to "claude-sonnet"
    model_name = model.split("/")[-1]
    base_pricing = next(
        (price for family, price in pricing.items() if model_name.startswith(family)),
        {"input": 3.00, "output": 15.00}  # default for unknown models
    )

    theoretical_cost = (
        (input_tokens / 1_000_000) * base_pricing["input"] +
        (output_tokens / 1_000_000) * base_pricing["output"]
    )

    if optimization_tier:
        savings_rate = tier_savings.get(optimization_tier, 0)
        actual_cost = theoretical_cost * (1 - savings_rate)
    else:
        actual_cost = theoretical_cost

    return {
        "theoretical_cost": theoretical_cost,
        "actual_cost": actual_cost,
        "savings": theoretical_cost - actual_cost
    }

# Example
print(estimate_cost(100000, 5000, "anthropic/claude-sonnet-4-5-20250929", "tier_3"))
# {'theoretical_cost': 0.375, 'actual_cost': 0.01125, 'savings': 0.36375}
```
## Real-World Examples

### Example 1: Customer Support Chatbot

Challenge: 50,000 conversations/day, averaging 5,000 tokens each

Solution:

```python
headers = {
    "X-Savings-Level": "med",      # 56% savings
    "X-Vanishing-Context": "true"  # For knowledge base queries
}
```
Result: 60% cost reduction, from $750/day to $300/day
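A quick back-of-envelope check of those numbers, counting input tokens only and assuming Sonnet-class pricing of $3.00/1M (the example does not name the model, so the rate is an assumption):

```python
conversations_per_day = 50_000
tokens_per_conversation = 5_000

# $3.00 per 1M input tokens
daily_cost = conversations_per_day * tokens_per_conversation * 3.00 / 1_000_000
after_optimization = daily_cost * (1 - 0.60)  # the observed 60% reduction

print(f"${daily_cost:.0f}/day -> ${after_optimization:.0f}/day")
# $750/day -> $300/day
```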
### Example 2: Document Analysis Platform

Challenge: Legal contract analysis, 100k+ token documents

Solution:

```python
headers = {
    "X-Vanishing-Context": "true"  # 99% savings on large docs
}
```
Result: 97% cost reduction on document uploads
### Example 3: Code Assistant

Challenge: Codebase analysis across multiple files

Solution:

```python
headers = {
    "X-Korad-RLM": "true",     # Recursive summarization
    "X-Savings-Level": "max"   # Prevent runaway costs
}
```
Result: 85% cost reduction on large codebase queries
Maximize your savings with intelligent optimization strategies.