
Cost Optimization Guide

Get the most out of your LLM spend: practical strategies for maximizing cost savings.

Understanding the Savings Waterfall

Korad.AI automatically applies optimizations in this priority order:

  1. Savings Slider (if set) - Hard context cap
  2. Vanishing Context (if enabled) - Document QA optimization
  3. Recursive RLM (if enabled or >80k tokens) - Complex reasoning
  4. Family-Locked Summary (default, >20k tokens) - Chat compression
  5. Semantic Cache (built-in) - Duplicate request detection
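As an illustrative sketch (not Korad.AI's actual dispatch code), the priority order above can be expressed in Python using the request headers documented in the strategies below. The token thresholds mirror the list; treating RLM and Family-Locked Summary as mutually exclusive is an assumption here.

```python
def select_strategy(headers: dict, token_count: int) -> list:
    """Illustrative sketch of the savings waterfall, in priority order."""
    applied = []
    if "X-Savings-Level" in headers:
        applied.append("savings_slider")          # 1. hard context cap
    if headers.get("X-Vanishing-Context") == "true":
        applied.append("vanishing_context")       # 2. document QA
    if headers.get("X-Korad-RLM") == "true" or token_count > 80_000:
        applied.append("recursive_rlm")           # 3. complex reasoning
    elif token_count > 20_000:
        applied.append("family_locked_summary")   # 4. chat compression (default)
    applied.append("semantic_cache")              # 5. always built in
    return applied
```

For example, a plain 25k-token chat with no headers set would fall through to Family-Locked Summary plus the built-in cache.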

Optimization Strategies

Strategy 1: Use Savings Slider for Cost Control

For budget-constrained applications:

headers = {
    "X-Savings-Level": "extreme"  # 89% savings, 16k context cap
}

Use cases:

  • Free tier users
  • Cost-sensitive features
  • Testing/development

Strategy 2: Vanishing Context for Document QA

For document-heavy applications:

headers = {
    "X-Vanishing-Context": "true"
}

Use cases:

  • Legal document analysis
  • Technical documentation queries
  • Research paper synthesis
  • Contract review

Savings: Up to 99% on large documents

Strategy 3: RLM for Complex Reasoning

For multi-step analysis:

headers = {
    "X-Korad-RLM": "true"
}

Use cases:

  • Codebase analysis
  • Multi-document synthesis
  • Complex reasoning tasks

Savings: 50-97% depending on complexity

Strategy 4: Combine Strategies

Layer optimizations for maximum savings:

headers = {
    "X-Vanishing-Context": "true",  # Try this first
    "X-Savings-Level": "med"        # Fallback cap
}

Model Selection Strategies

Budget-Friendly Models

| Model | Cost/1M Input | Best For |
|---|---|---|
| claude-haiku | $0.25 | Fast responses |
| gpt-4o-mini | $0.15 | General tasks |
| deepseek-chat | $0.14 | Lowest cost |
| gemini-2.0-flash | $0.075 | Fastest |

Quality-Optimized Models

| Model | Cost/1M Input | Best For |
|---|---|---|
| claude-opus | $15.00 | Complex reasoning |
| gpt-4o | $2.50 | Multimodal |
| gemini-2.5-pro | $1.25 | Large context |

Hybrid Approach

Use cheap models for initial processing, expensive models for final output:

# Step 1: Use Haiku to summarize (cheap)
summary = client.chat.completions.create(
    model="anthropic/claude-haiku-4-5-20251001",
    messages=[{"role": "user", "content": large_context}],
    max_tokens=500
)
summary_text = summary.choices[0].message.content

# Step 2: Use Sonnet for the final output (quality)
final = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5-20250929",
    messages=[
        {"role": "user", "content": f"Based on this summary: {summary_text}"}
    ]
)

Application-Level Optimizations

1. Implement Semantic Cache

cache = {}

def cached_completion(prompt):
    # Check the local cache first
    if prompt in cache:
        return cache[prompt]

    # Make the request
    response = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}]
    )

    # Cache the result
    cache[prompt] = response
    return response
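An exact-match cache misses prompts that differ only in casing or whitespace. A minimal sketch of key normalization narrows that gap (`cache_key`, `get_cached`, and `put_cached` are illustrative helpers, not a Korad.AI API):

```python
import hashlib

def cache_key(prompt: str) -> str:
    """Normalize the prompt so trivially different duplicates share a key."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

cache = {}

def get_cached(prompt):
    # Returns None on a cache miss
    return cache.get(cache_key(prompt))

def put_cached(prompt, response):
    cache[cache_key(prompt)] = response
```

With this scheme, "What is  RLM?" and "what is rlm?" resolve to the same entry.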

2. Batch Similar Requests

# Instead of multiple small requests:
# for question in questions:
#     answer(question)

# Batch them into one call:
batched = "\n\n".join(questions)
response = client.chat.completions.create(
    messages=[{"role": "user", "content": f"Answer each:\n{batched}"}]
)
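One practical wrinkle with batching is mapping the model's combined reply back to individual questions. A hedged sketch, numbering questions on the way in and splitting on those numbers on the way out (`build_batch_prompt` and `split_answers` are illustrative helpers, not part of any SDK):

```python
import re

def build_batch_prompt(questions):
    # Number each question so answers can be matched back reliably
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return f"Answer each question, prefixing each answer with its number:\n{numbered}"

def split_answers(text, n):
    # Split the reply on leading "1.", "2.", ... markers; re.split keeps
    # the captured numbers, so pairs of (number, answer) follow the preamble
    parts = re.split(r"(?m)^\s*(\d+)\.\s*", text)
    answers = {}
    for i in range(1, len(parts) - 1, 2):
        answers[int(parts[i])] = parts[i + 1].strip()
    return [answers.get(i, "") for i in range(1, n + 1)]
```

Models do not always follow numbering instructions exactly, so the parse can return empty strings for answers it cannot locate.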

3. Use Streaming for Early Termination

stream = client.chat.completions.create(
    messages=[{"role": "user", "content": question}],
    stream=True
)

full_response = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk's delta content can be None
        full_response += delta
    # Stop early if satisfied
    if "answer:" in full_response.lower():
        break

Monitoring and Alerts

Track Your Savings

def track_savings(response):
    return {
        "original": response.response_headers.get('X-Korad-Original-Tokens'),
        "optimized": response.response_headers.get('X-Korad-Optimized-Tokens'),
        "savings": response.response_headers.get('X-Korad-Savings-USD'),
        "strategy": response.response_headers.get('X-Korad-Strategy')
    }

# Use in your app
response = client.chat.completions.create(...)
stats = track_savings(response)
print(f"Saved {stats['savings']} using {stats['strategy']}")

Set Budget Alerts

class BudgetExceededError(Exception):
    pass

monthly_budget = 100.00  # $100/month
current_spend = 0.0

def check_budget(cost):
    global current_spend
    current_spend += cost

    if current_spend > monthly_budget * 0.9:
        send_alert(f"90% of budget used: ${current_spend:.2f}")

    if current_spend >= monthly_budget:
        raise BudgetExceededError("Monthly budget exceeded")

Cost Calculator

Estimate your costs before implementation:

def estimate_cost(input_tokens, output_tokens, model, optimization_tier=None):
    """Estimate cost for a request."""

    # Pricing per 1M tokens (input and output)
    pricing = {
        "claude-opus": {"input": 15.00, "output": 75.00},
        "claude-sonnet": {"input": 3.00, "output": 15.00},
        "claude-haiku": {"input": 0.25, "output": 1.25},
        "deepseek-chat": {"input": 0.14, "output": 0.28},
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    }

    # Optimization savings
    tier_savings = {
        "tier_2": 0.99,  # 99% savings
        "tier_3": 0.97,  # 97% savings
        "tier_4": 0.30,  # 30% savings
        "tier_5_extreme": 0.89,
        "tier_5_max": 0.78,
        "tier_5_med": 0.56,
        "tier_5_min": 0.12,
    }

    # Match the base model family (e.g. "claude-sonnet" matches
    # "anthropic/claude-sonnet-4-5-20250929"). Longest prefix wins, so
    # "gpt-4o-mini" is not mistaken for "gpt-4o"; unknown models fall back
    # to sonnet-level pricing.
    model_name = model.split("/")[-1]
    base_pricing = next(
        (price for family, price in sorted(pricing.items(),
                                           key=lambda kv: -len(kv[0]))
         if model_name.startswith(family)),
        {"input": 3.00, "output": 15.00},
    )

    theoretical_cost = (
        (input_tokens / 1_000_000) * base_pricing["input"] +
        (output_tokens / 1_000_000) * base_pricing["output"]
    )

    if optimization_tier:
        savings_rate = tier_savings.get(optimization_tier, 0)
        actual_cost = theoretical_cost * (1 - savings_rate)
    else:
        actual_cost = theoretical_cost

    return {
        "theoretical_cost": theoretical_cost,
        "actual_cost": actual_cost,
        "savings": theoretical_cost - actual_cost
    }

# Example
print(estimate_cost(100000, 5000, "anthropic/claude-sonnet-4-5-20250929", "tier_3"))
# {'theoretical_cost': 0.375, 'actual_cost': 0.01125, 'savings': 0.36375}

Real-World Examples

Example 1: Customer Support Chatbot

Challenge: 50,000 conversations/day, average 5,000 tokens each

Solution:

headers = {
    "X-Savings-Level": "med",       # 56% savings
    "X-Vanishing-Context": "true"   # For knowledge base queries
}

Result: 60% cost reduction, from $750/day to $300/day
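The baseline figure checks out arithmetically if we assume claude-sonnet input pricing ($3.00/1M tokens) and count each conversation's 5,000 tokens as input:

```python
conversations_per_day = 50_000
tokens_per_conversation = 5_000
input_price_per_million = 3.00  # assumed claude-sonnet input rate

daily_tokens = conversations_per_day * tokens_per_conversation  # 250M tokens/day
daily_cost = daily_tokens / 1_000_000 * input_price_per_million
print(daily_cost)  # 750.0
```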

Example 2: Document Analysis Platform

Challenge: Legal contract analysis, 100k+ token documents

Solution:

headers = {
    "X-Vanishing-Context": "true"  # 99% savings on large docs
}

Result: 97% cost reduction on document uploads

Example 3: Code Assistant

Challenge: Codebase analysis across multiple files

Solution:

headers = {
    "X-Korad-RLM": "true",      # Recursive summarization
    "X-Savings-Level": "max"    # Prevent runaway costs
}

Result: 85% cost reduction on large codebase queries


Maximize your savings with intelligent optimization strategies.