# Cost Optimization Guide

Get the most out of your LLM spend: practical strategies for maximizing cost savings.
## Understanding the Savings Waterfall
Korad.AI automatically applies optimizations in this priority order:

1. Savings Slider (if set) - hard context cap
2. Vanishing Context (if enabled) - document QA optimization
3. Recursive RLM (if enabled, or >80k tokens) - complex reasoning
4. Family-Locked Summary (default, >20k tokens) - chat compression
5. Semantic Cache (built-in) - duplicate request detection
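The priority order above can be sketched as a simple dispatch function. This is an illustrative assumption about the routing logic, not the actual engine; the header names are the ones used throughout this guide:

```python
def select_strategy(headers: dict, prompt_tokens: int) -> str:
    """Sketch of the savings waterfall: first matching rule wins."""
    if "X-Savings-Level" in headers:
        return "savings_slider"
    if headers.get("X-Vanishing-Context") == "true":
        return "vanishing_context"
    if headers.get("X-Korad-RLM") == "true" or prompt_tokens > 80_000:
        return "recursive_rlm"
    if prompt_tokens > 20_000:
        return "family_locked_summary"
    return "semantic_cache"
```

Note that an explicit Savings Slider setting short-circuits everything below it, which is why the combined-strategy examples later in this guide treat the slider as a fallback cap.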
## Optimization Strategies
### Strategy 1: Use Savings Slider for Cost Control

For budget-constrained applications:

```python
headers = {
    "X-Savings-Level": "extreme"  # 89% savings, 16k context cap
}
```
Use cases:
- Free tier users
- Cost-sensitive features
- Testing/development
### Strategy 2: Vanishing Context for Document QA

For document-heavy applications:

```python
headers = {
    "X-Vanishing-Context": "true"
}
```
Use cases:
- Legal document analysis
- Technical documentation queries
- Research paper synthesis
- Contract review
Savings: Up to 99% on large documents
### Strategy 3: RLM for Complex Reasoning

For multi-step analysis:

```python
headers = {
    "X-Korad-RLM": "true"
}
```
Use cases:
- Codebase analysis
- Multi-document synthesis
- Complex reasoning tasks
Savings: 50-97% depending on complexity
### Strategy 4: Combine Strategies

Layer optimizations for maximum savings:

```python
headers = {
    "X-Vanishing-Context": "true",  # Try this first
    "X-Savings-Level": "med"        # Fallback cap
}
```
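These header dictionaries are attached per request. With the official `openai` Python SDK (or any OpenAI-compatible client), the standard mechanism is the `extra_headers` parameter on `create()`. A small helper (hypothetical, not part of Korad.AI) keeps the combinations consistent:

```python
def korad_headers(savings_level=None, vanishing=False, rlm=False) -> dict:
    """Build optimization headers for a single request."""
    headers = {}
    if savings_level:
        headers["X-Savings-Level"] = savings_level
    if vanishing:
        headers["X-Vanishing-Context"] = "true"
    if rlm:
        headers["X-Korad-RLM"] = "true"
    return headers

# Usage with an OpenAI-compatible client:
# client.chat.completions.create(
#     model="anthropic/claude-sonnet-4-5-20250929",
#     messages=[{"role": "user", "content": prompt}],
#     extra_headers=korad_headers(vanishing=True, savings_level="med"),
# )
```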
## Model Selection Strategies

### Budget-Friendly Models

| Model | Cost/1M Input | Best For |
|---|---|---|
| claude-haiku | $0.25 | Fast responses |
| gpt-4o-mini | $0.15 | General tasks |
| deepseek-chat | $0.14 | Lowest cost |
| gemini-2.0-flash | $0.075 | Fastest |
### Quality-Optimized Models

| Model | Cost/1M Input | Best For |
|---|---|---|
| claude-opus | $15.00 | Complex reasoning |
| gpt-4o | $2.50 | Multimodal |
| gemini-2.5-pro | $1.25 | Large context |
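If you route between models programmatically, the tables above translate directly into a lookup. The helper below is a sketch (prices as listed; the function name is illustrative):

```python
# Cost per 1M input tokens, from the tables above
MODEL_COSTS = {
    "claude-haiku": 0.25,
    "gpt-4o-mini": 0.15,
    "deepseek-chat": 0.14,
    "gemini-2.0-flash": 0.075,
    "claude-opus": 15.00,
    "gpt-4o": 2.50,
    "gemini-2.5-pro": 1.25,
}

def cheapest_model(candidates: list[str]) -> str:
    """Pick the lowest input-cost model from a candidate set."""
    return min(candidates, key=MODEL_COSTS.__getitem__)
```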
### Hybrid Approach

Use cheap models for initial processing and expensive models for the final output:

```python
# Step 1: Use Haiku to summarize (cheap)
summary = client.chat.completions.create(
    model="anthropic/claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": large_context}],
    max_tokens=500
)
summary_text = summary.choices[0].message.content

# Step 2: Use Sonnet for the final output (quality)
final = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[
        {"role": "user", "content": f"Based on this summary: {summary_text}"}
    ]
)
```
## Application-Level Optimizations

### 1. Implement Semantic Cache

A local exact-match cache catches verbatim repeats before they ever reach the gateway (Korad.AI's built-in semantic cache still handles near-duplicates):

```python
cache = {}

def cached_completion(prompt):
    # Check the local cache first
    if prompt in cache:
        return cache[prompt]

    # Make the request
    response = client.chat.completions.create(...)

    # Cache the result
    cache[prompt] = response
    return response
```
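Keying on the raw prompt string means trivial whitespace or casing differences cause cache misses. A slightly more forgiving sketch normalizes the prompt before hashing it; this is still exact-match after normalization, and the class name is illustrative:

```python
import hashlib

class PromptCache:
    """Local cache keyed on a normalized prompt hash."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt: str) -> str:
        # Collapse whitespace and casing so near-identical prompts hit
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response) -> None:
        self._store[self._key(prompt)] = response
```

Hashing also keeps memory bounded per entry when prompts are large documents.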
### 2. Batch Similar Requests

```python
# Instead of one request per question:
# for question in questions:
#     answer(question)

# Batch them into a single request:
batched = "\n\n".join(questions)
response = client.chat.completions.create(
    messages=[{"role": "user", "content": f"Answer each:\n{batched}"}]
)
```
### 3. Use Streaming for Early Termination

```python
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": question}],
    stream=True
)

full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta is None:  # some chunks carry no content
        continue
    full_response += delta
    # Stop early if satisfied
    if "answer:" in full_response.lower():
        break
```
## Monitoring and Alerts

### Track Your Savings

```python
def track_savings(response):
    return {
        "original": response.response_headers.get('X-Korad-Original-Tokens'),
        "optimized": response.response_headers.get('X-Korad-Optimized-Tokens'),
        "savings": response.response_headers.get('X-Korad-Savings-USD'),
        "strategy": response.response_headers.get('X-Korad-Strategy')
    }

# Use in your app
response = client.chat.completions.create(...)
stats = track_savings(response)
print(f"Saved {stats['savings']} using {stats['strategy']}")
```
### Set Budget Alerts

```python
class BudgetExceededError(Exception):
    pass

monthly_budget = 100.00  # $100/month
current_spend = 0.0

def check_budget(cost):
    global current_spend
    current_spend += cost
    if current_spend > monthly_budget * 0.9:
        send_alert(f"90% of budget used: ${current_spend:.2f}")  # your notification hook
    if current_spend >= monthly_budget:
        raise BudgetExceededError("Monthly budget exceeded")
```
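In larger apps, the same logic is easier to test and reuse without module-level globals. A stateful sketch (the class name and threshold parameter are assumptions, not part of any Korad.AI SDK):

```python
class BudgetTracker:
    """Track cumulative spend against a monthly budget."""

    def __init__(self, monthly_budget: float, warn_at: float = 0.9):
        self.budget = monthly_budget
        self.warn_at = warn_at
        self.spent = 0.0

    def record(self, cost: float) -> str:
        """Add a request's cost; return 'ok', 'warning', or 'exceeded'."""
        self.spent += cost
        if self.spent >= self.budget:
            return "exceeded"
        if self.spent > self.budget * self.warn_at:
            return "warning"
        return "ok"
```

The caller decides whether "exceeded" means raising, queueing, or downgrading to a cheaper model.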
## Cost Calculator

Estimate your costs before implementation:

```python
def estimate_cost(input_tokens, output_tokens, model, optimization_tier=None):
    """Estimate cost for a request."""
    # Pricing per 1M tokens
    pricing = {
        "claude-opus": {"input": 15.00, "output": 75.00},
        "claude-sonnet": {"input": 3.00, "output": 15.00},
        "claude-haiku": {"input": 0.25, "output": 1.25},
        "deepseek-chat": {"input": 0.14, "output": 0.28},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    }

    # Optimization savings
    tier_savings = {
        "tier_2": 0.99,  # 99% savings
        "tier_3": 0.97,  # 97% savings
        "tier_4": 0.30,  # 30% savings
        "tier_5_extreme": 0.89,
        "tier_5_max": 0.78,
        "tier_5_med": 0.56,
        "tier_5_min": 0.12,
    }

    # Match on the model-family prefix, so a versioned ID like
    # "claude-sonnet-4-5-20250929" resolves to "claude-sonnet"
    model_name = model.split("/")[-1]
    base_pricing = next(
        (price for family, price in pricing.items() if model_name.startswith(family)),
        {"input": 3.00, "output": 15.00}  # default for unknown models
    )

    theoretical_cost = (
        (input_tokens / 1_000_000) * base_pricing["input"] +
        (output_tokens / 1_000_000) * base_pricing["output"]
    )

    if optimization_tier:
        savings_rate = tier_savings.get(optimization_tier, 0)
        actual_cost = theoretical_cost * (1 - savings_rate)
    else:
        actual_cost = theoretical_cost

    return {
        "theoretical_cost": theoretical_cost,
        "actual_cost": actual_cost,
        "savings": theoretical_cost - actual_cost
    }

# Example
print(estimate_cost(100000, 5000, "anthropic/claude-sonnet-4-5-20250929", "tier_3"))
# {'theoretical_cost': 0.375, 'actual_cost': 0.01125, 'savings': 0.36375}
```
## Real-World Examples

### Example 1: Customer Support Chatbot

Challenge: 50,000 conversations/day, averaging 5,000 tokens each

Solution:

```python
headers = {
    "X-Savings-Level": "med",      # 56% savings
    "X-Vanishing-Context": "true"  # For knowledge base queries
}
```
Result: 60% cost reduction, from $750/day to $300/day
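A quick back-of-envelope check of those numbers, counting input tokens only and assuming Sonnet-class pricing of $3.00/1M (the example does not name the model, so the rate is an assumption):

```python
conversations_per_day = 50_000
tokens_per_conversation = 5_000

# $3.00 per 1M input tokens
daily_cost = conversations_per_day * tokens_per_conversation * 3.00 / 1_000_000
after_optimization = daily_cost * (1 - 0.60)  # the observed 60% reduction

print(f"${daily_cost:.0f}/day -> ${after_optimization:.0f}/day")
# $750/day -> $300/day
```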
### Example 2: Document Analysis Platform

Challenge: Legal contract analysis, 100k+ token documents

Solution:

```python
headers = {
    "X-Vanishing-Context": "true"  # 99% savings on large docs
}
```
Result: 97% cost reduction on document uploads
### Example 3: Code Assistant

Challenge: Codebase analysis across multiple files

Solution:

```python
headers = {
    "X-Korad-RLM": "true",     # Recursive summarization
    "X-Savings-Level": "max"   # Prevent runaway costs
}
```
Result: 85% cost reduction on large codebase queries
Maximize your savings with intelligent optimization strategies.