The Savings Waterfall

The complete optimization pipeline: from 100% savings on duplicate requests to 97% savings on 200k-token contexts.

Overview

The Korad.AI Gen 3 platform implements a 5-tier Savings Waterfall that automatically applies the most cost-effective optimization strategy based on request characteristics. Each tier is designed for specific use cases, ensuring maximum savings without compromising quality.

The Waterfall

┌─────────────────────────┐
│    INCOMING REQUEST     │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│      TIER 1 CHECK       │
│     Semantic Cache      │  Bifrost Gateway
│      (Duplicate?)       │  100% Savings
└────────────┬────────────┘
             │ Not duplicate
             ▼
┌─────────────────────────┐
│      TIER 5 CHECK       │
│     Savings Slider      │  X-Savings-Level
│       (Hard Cap?)       │  89% (extreme)
└────────────┬────────────┘
             │ No cap set
             ▼
┌─────────────────────────┐
│      TIER 2 CHECK       │
│    Vanishing Context    │  X-Vanishing-Context
│     (>50k tokens?)      │  99% Savings
└────────────┬────────────┘
             │ Not enabled / smaller
             ▼
┌─────────────────────────┐
│      TIER 3 CHECK       │
│      Recursive RLM      │  X-Korad-RLM
│     (>80k tokens?)      │  97% Savings
└────────────┬────────────┘
             │ Smaller context
             ▼
┌─────────────────────────┐
│     TIER 4 PROCESS      │
│      Family-Locked      │  Default
│      Summarization      │  30% Savings
│     (>20k tokens?)      │
└────────────┬────────────┘
             │ Forward optimized
             ▼
┌─────────────────────────┐
│      BILLING LAYER      │
│       Profit 1.5x       │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│        RESPONSE         │
└─────────────────────────┘

Tier Details

Tier 1: Semantic Cache (Built into Bifrost)

Trigger: Duplicate or semantically similar requests

Savings: 100% (cached response)

Description: The Bifrost Gateway automatically detects duplicate or semantically similar requests using vector embeddings. Cached responses are returned instantly without any LLM calls.

Use Cases:

  • Repeated questions in customer support
  • Batch processing with similar prompts
  • A/B testing with variations

Configuration: Built into Bifrost, no additional configuration needed.
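
Conceptually, the lookup is a nearest-neighbor search over prompt embeddings rather than an exact-text match. A minimal Python sketch under that assumption; the threshold, data layout, and names here are illustrative, not Bifrost's actual internals:

import numpy as np

SIMILARITY_THRESHOLD = 0.97  # illustrative value, not Bifrost's actual setting

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cache_lookup(cache: list, query_embedding: np.ndarray):
    """Return a cached response when a stored prompt embedding is close enough.

    `cache` is a list of (embedding, response) pairs; hypothetical layout."""
    best_score, best_response = 0.0, None
    for embedding, response in cache:
        score = cosine(embedding, query_embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None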


Tier 2: Vanishing Context

Trigger: X-Vanishing-Context: true OR contexts > 50k tokens

Savings: ~99%

Description: Stores the full context in Redis with a 1-hour TTL, then uses the cheapest model (Haiku) to generate a search query and retrieve only the relevant sections. Ideal for document QA and large knowledge-base queries.

Use Cases:

  • Document QA (>100 pages)
  • Legal document analysis
  • Technical manual queries
  • Research paper analysis

Headers:

X-Vanishing-Context: true

How It Works:

  1. Store full conversation to Redis
  2. Use Haiku to generate search query
  3. Retrieve relevant sections via keyword search
  4. Rewrite prompt with only relevant context
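
The steps above amount to a park-and-fetch loop. A minimal Python sketch, assuming redis-py; the keyword helper is a placeholder, and all names are illustrative rather than the production implementation:

import json
import uuid

import redis

r = redis.Redis()  # assumes a reachable Redis instance

def cheap_model_keywords(query: str) -> list:
    """Placeholder for the Haiku call that distills the query into keywords."""
    raise NotImplementedError("wire this to the cheapest model in your gateway")

def vanishing_context(messages: list, query: str) -> list:
    # 1. Park the full conversation in Redis with a 1-hour TTL.
    key = f"ctx:{uuid.uuid4()}"
    r.setex(key, 3600, json.dumps(messages))

    # 2. Have the cheapest model turn the user query into search keywords.
    keywords = cheap_model_keywords(query)

    # 3. Keyword search over the stored messages.
    stored = json.loads(r.get(key))
    relevant = [m for m in stored
                if any(k.lower() in m.get("content", "").lower() for k in keywords)]

    # 4. Rewrite the prompt with only the relevant context.
    return relevant + [{"role": "user", "content": query}]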

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "X-Vanishing-Context: true" \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 150k token document
  }'

# Response:
# X-Korad-Strategy: Vanishing-Context (redis peek+fetch)
# X-Korad-Original-Tokens: 152890
# X-Korad-Optimized-Tokens: 1066
# X-Korad-Reduction-Pct: 99%

Tier 3: Recursive RLM 🧠

Trigger: X-Korad-RLM: true OR contexts > 80k tokens

Savings: ~97% (50-97% depending on complexity)

Description: Based on Recursive Language Models (arXiv:2512.24601). Uses a cheaper "Controller" model (e.g., Haiku) to recursively summarize large contexts before sending them to the "Reasoning" model (e.g., Sonnet).

Use Cases:

  • Complex reasoning tasks (>80k tokens)
  • Multi-step document analysis
  • Evolving contexts where caching fails
  • Codebase analysis across files
  • Contract review with dependencies

Headers:

X-Korad-RLM: true

Controller Model Mapping:

User Model          Controller Model
claude-opus         claude-haiku
claude-sonnet       claude-haiku
gpt-4o              gpt-4o-mini
gemini-2.5-pro      gemini-2.0-flash
deepseek-reasoner   deepseek-chat

How It Works:

  1. Detect user's model family
  2. Select cheaper "Controller" model
  3. Split context into 10k token chunks
  4. Recursively summarize each chunk
  5. Synthesize summaries into condensed context
  6. Send optimized prompt to user's model
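
A minimal Python sketch of this recursion, using the controller mapping above. summarize() is a placeholder for the actual controller call, and the word-based token estimate is a deliberate simplification:

CHUNK_TOKENS = 10_000  # chunk size from the pipeline description

CONTROLLERS = {  # same mapping as the table above
    "claude-opus": "claude-haiku",
    "claude-sonnet": "claude-haiku",
    "gpt-4o": "gpt-4o-mini",
    "gemini-2.5-pro": "gemini-2.0-flash",
    "deepseek-reasoner": "deepseek-chat",
}

def summarize(text: str, model: str) -> str:
    """Placeholder for the summarization call to the controller model."""
    raise NotImplementedError

def rlm_condense(context: str, user_model: str,
                 depth: int = 0, max_depth: int = 3) -> str:
    """Recursively condense a large context before the reasoning model sees it."""
    controller = CONTROLLERS.get(user_model, user_model)
    words = context.split()
    chunk_words = int(CHUNK_TOKENS * 0.75)  # crude words-per-token estimate
    if len(words) <= chunk_words or depth >= max_depth:
        return summarize(context, controller)
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    summaries = [summarize(chunk, controller) for chunk in chunks]
    # The joined summaries may still be large, so recurse on them.
    return rlm_condense(" ".join(summaries), user_model, depth + 1, max_depth)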

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "X-Korad-RLM: true" \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 200k token context
  }'

# Response:
# X-Korad-Strategy: Recursive-RLM (controller: claude-haiku-4-5-20251001)
# X-Korad-Original-Tokens: 203881
# X-Korad-Optimized-Tokens: 5606
# X-Korad-Reduction-Pct: 97%

Tier 4: Family-Locked Summarization

Trigger: Contexts > 20k tokens (default fallback)

Savings: ~30% (typical; varies with how much of the history can be summarized)

Description: Uses a cheap model from the same provider to summarize conversation history, preserving data sovereignty (DeepSeek data stays with DeepSeek, Anthropic data with Anthropic).

Use Cases:

  • Chat history compression
  • Long-running conversations
  • Session context management
  • Default optimization for most use cases

Family Mapping:

Provider Family     Summarizer Model
anthropic           claude-haiku-4-5-20251001
deepseek-chat       deepseek-chat
deepseek-reasoner   deepseek-chat
openai              gpt-4o-mini
google-gemini       gemini-2.0-flash

How It Works:

  1. Detect provider family from model name
  2. Split messages (keep last 5 recent)
  3. Summarize older messages with same-family cheap model
  4. Combine summary + recent messages
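
A minimal Python sketch of this split-and-summarize flow; the family detection and the summarize_with() helper are illustrative placeholders, not the actual plugin code:

FAMILY_SUMMARIZERS = {  # same mapping as the table above
    "anthropic": "claude-haiku-4-5-20251001",
    "deepseek-chat": "deepseek-chat",
    "deepseek-reasoner": "deepseek-chat",
    "openai": "gpt-4o-mini",
    "google-gemini": "gemini-2.0-flash",
}

KEEP_RECENT = 5  # the last 5 messages are kept verbatim

def summarize_with(model: str, messages: list) -> str:
    """Placeholder for the same-family summarization call."""
    raise NotImplementedError

def family_lock(messages: list, model_name: str) -> list:
    # 1. Detect the provider family from a "family/model" name.
    family = model_name.split("/")[0]
    summarizer = FAMILY_SUMMARIZERS.get(family)
    if summarizer is None or len(messages) <= KEEP_RECENT:
        return messages  # nothing to compress
    # 2. Split: older history vs. the most recent messages.
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    # 3. Summarize the older history with the same-family cheap model.
    summary = summarize_with(summarizer, older)
    # 4. Recombine: compact summary + recent messages.
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent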

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 30k token chat history
  }'

# Response:
# X-Korad-Strategy: Family-Lock (claude-haiku-4-5-20251001 -> claude-sonnet-4-5-20250929)
# X-Korad-Original-Tokens: 30450
# X-Korad-Optimized-Tokens: 12450
# X-Korad-Reduction-Pct: 59%

Tier 5: Savings Slider

Trigger: X-Savings-Level header (explicit user control)

Savings: 89% (extreme) to 12% (min)

Description: Applies hard context caps with middle-out truncation: the system prompt and user query are preserved while messages are dropped from the middle.

Use Cases:

  • Cost control for budget-constrained applications
  • Testing with reduced context
  • Tiered pricing based on context size
  • Safety limits for very large requests

Levels:

Level     Cap    Savings   Use Case
extreme   16k    89%       Maximum compression
max       32k    78%       DeepSeek stable
med       64k    56%       Balanced
min       128k   12%       High fidelity
off       ∞      0%        No optimization
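
A minimal Python sketch of the middle-out truncation described above, using the caps from this table. The 32k/64k/128k caps are read as powers of two (only the extreme cap of 16384 is confirmed by the example below), and the token estimate is a rough heuristic:

LEVEL_CAPS = {  # caps from the Levels table, in tokens (assumed powers of two)
    "extreme": 16_384,
    "max": 32_768,
    "med": 65_536,
    "min": 131_072,
}

def estimate_tokens(message: dict) -> int:
    return len(message.get("content", "")) // 4  # rough chars-per-token heuristic

def middle_out(messages: list, level: str) -> list:
    """Drop messages from the middle until the context fits the level's cap,
    preserving the system prompt (head) and the user query (tail)."""
    cap = LEVEL_CAPS.get(level)
    if cap is None:  # "off" or unknown level: no truncation
        return messages
    kept = list(messages)
    while sum(estimate_tokens(m) for m in kept) > cap and len(kept) > 2:
        kept.pop(len(kept) // 2)
    return kept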

Headers:

X-Savings-Level: extreme|max|med|min|off

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "X-Savings-Level: extreme" \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 150k token context
  }'

# Response:
# X-Korad-Strategy: Savings-Slider (extreme: 16384 tokens)
# X-Korad-Original-Tokens: 150006
# X-Korad-Optimized-Tokens: 16359
# X-Korad-Reduction-Pct: 89%

Profit Interceptor (Billing Layer)

Applied to: All tiers

Description: Computes the billed amount by applying a profit margin to the theoretical cost (what the user would have paid without optimization).

Formula:

billed_amount = theoretical_cost × profit_margin
profit_margin = 1.5  (default, configurable)
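
As a worked sketch of that formula in Python (the header names come from the billing examples in this doc; the function itself is illustrative, not the actual interceptor):

PROFIT_MARGIN = 1.5  # default, configurable

def billing_headers(theoretical_cost: float, actual_cost: float) -> dict:
    """Compute the billing headers from the formula above."""
    billed = theoretical_cost * PROFIT_MARGIN
    return {
        "X-Korad-Theoretical-Cost": f"${theoretical_cost:.6f}",
        "X-Korad-Actual-Cost": f"${actual_cost:.6f}",
        "X-Korad-Savings-USD": f"${theoretical_cost - actual_cost:.6f}",
        "X-Korad-Billed-Amount": f"${billed:.6f}",
        "X-Korad-Profit-Margin": f"{PROFIT_MARGIN:.2f}x",
    }

# billing_headers(0.038223, 0.000821) reproduces the example values below.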

Billing Headers:

X-Korad-Theoretical-Cost: $0.038223   # Cost without optimization
X-Korad-Actual-Cost: $0.000821        # What we actually paid
X-Korad-Savings-USD: $0.037402        # User's "savings"
X-Korad-Billed-Amount: $0.057334      # What we charge (theoretical × 1.5)
X-Korad-Profit-Margin: 1.50x          # Our margin

Strategic Value:

  • Show users how much they "saved" relative to the theoretical cost
  • Bill 1.5× the theoretical cost while paying only the much smaller optimized cost
  • Keep billing transparent with a full cost breakdown in the response headers

Quick Reference

Choosing the Right Tier

Scenario                          Use Tier      Header
Document QA (>100 pages)          Tier 2        X-Vanishing-Context: true
Complex reasoning (>80k tokens)   Tier 3        X-Korad-RLM: true
Chat history compression          Tier 4        (automatic)
Cost-sensitive application        Tier 5        X-Savings-Level: extreme
Maximum quality                   Tier 1 only   (default)

Header Priority

Headers are checked in this order:

  1. X-Savings-Level - Tier 5 (highest priority)
  2. X-Vanishing-Context - Tier 2
  3. X-Korad-RLM - Tier 3
  4. (No header) - Tier 4 (default)
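
A compact sketch of this dispatch order in Python, combining the header checks with the token-count triggers from the waterfall (the function and its signature are illustrative; the Tier 1 cache check happens earlier, inside Bifrost):

def choose_tier(headers: dict, token_count: int) -> int:
    """Pick a waterfall tier: explicit headers first, then size triggers."""
    if "X-Savings-Level" in headers:
        return 5  # hard cap requested (highest priority)
    if headers.get("X-Vanishing-Context") == "true" or token_count > 50_000:
        return 2  # vanishing context
    if headers.get("X-Korad-RLM") == "true" or token_count > 80_000:
        return 3  # recursive RLM
    return 4  # default: family-locked summarization (>20k tokens)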

Response Headers

All optimized requests include:

X-Korad-Original-Tokens: 203881
X-Korad-Optimized-Tokens: 5606
X-Korad-Tokens-Saved: 198275
X-Korad-Reduction-Pct: 97%
X-Korad-Savings-USD: $0.050050
X-Korad-Strategy: Recursive-RLM (controller: claude-haiku-4-5-20251001)
X-Korad-Theoretical-Cost: $0.050953
X-Korad-Actual-Cost: $0.000903
X-Korad-Billed-Amount: $0.076429
X-Korad-Profit-Margin: 1.50x

Research & References

Tier 3: Recursive RLM

Based on the paper "Recursive Language Models" (arXiv:2512.24601)

Abstract: We introduce Recursive Language Models (RLMs), a class of models that recursively summarize long contexts using a cheaper "controller" model before reasoning. This approach achieves 50-97% cost savings on complex reasoning tasks while maintaining output quality.

Key Insights:

  • Recursive summarization preserves reasoning chains
  • Controller model choice is critical for quality
  • Optimal chunk size: 10k tokens
  • Maximum depth: 3-5 levels

The Savings Waterfall: From monolith arbitrage to microservices optimization.