The Savings Waterfall
The complete optimization pipeline: from 100% savings on duplicate requests to 97% savings on 200k-token contexts.
Overview
The Korad.AI Gen 3 platform implements a 5-tier Savings Waterfall that automatically applies the most cost-effective optimization strategy based on request characteristics. Each tier is designed for specific use cases, ensuring maximum savings without compromising quality.
The Waterfall

┌─────────────────────────────────────────────┐
│              INCOMING REQUEST               │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
             ┌───────────────────┐
             │   TIER 1 CHECK    │
             │  Semantic Cache   │  Bifrost Gateway
             │   (Duplicate?)    │  100% Savings
             └─────────┬─────────┘
                       │ Not duplicate
                       ▼
             ┌───────────────────┐
             │   TIER 5 CHECK    │
             │  Savings Slider   │  X-Savings-Level
             │   (Hard Cap?)     │  89% (extreme)
             └─────────┬─────────┘
                       │ No cap set
                       ▼
             ┌───────────────────┐
             │   TIER 2 CHECK    │
             │     Vanishing     │  X-Vanishing-Context
             │      Context      │  99% Savings
             │  (>50k tokens?)   │
             └─────────┬─────────┘
                       │ Not enabled / smaller
                       ▼
             ┌───────────────────┐
             │   TIER 3 CHECK    │
             │   Recursive RLM   │  X-Korad-RLM
             │  (>80k tokens?)   │  97% Savings
             └─────────┬─────────┘
                       │ Smaller context
                       ▼
             ┌───────────────────┐
             │  TIER 4 PROCESS   │
             │   Family-Locked   │  Default
             │   Summarization   │  30% Savings
             │  (>20k tokens?)   │
             └─────────┬─────────┘
                       │ Forward optimized
                       ▼
             ┌───────────────────┐
             │   BILLING LAYER   │
             │    Profit 1.5x    │
             └─────────┬─────────┘
                       │
                       ▼
                 ┌──────────┐
                 │ RESPONSE │
                 └──────────┘
Tier Details
Tier 1: Semantic Cache (Built into Bifrost)
Trigger: Duplicate or semantically similar requests
Savings: 100% (cached response)
Description: The Bifrost Gateway automatically detects duplicate or semantically similar requests using vector embeddings. Cached responses are returned instantly without any LLM calls.
Use Cases:
- Repeated questions in customer support
- Batch processing with similar prompts
- A/B testing with variations
Configuration: Built into Bifrost, no additional configuration needed.
Tier 2: Vanishing Context
Trigger: X-Vanishing-Context: true OR contexts > 50k tokens
Savings: ~99%
Description: Stores full context to Redis with 1-hour TTL, then uses the cheapest model (Haiku) to generate a search query and retrieve only relevant sections. Ideal for document QA and large knowledge base queries.
Use Cases:
- Document QA (>100 pages)
- Legal document analysis
- Technical manual queries
- Research paper analysis
Headers:
X-Vanishing-Context: true
How It Works:
1. Store the full conversation to Redis
2. Use Haiku to generate a search query
3. Retrieve relevant sections via keyword search
4. Rewrite the prompt with only the relevant context
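The steps above can be sketched end to end. This is an illustrative sketch, not Korad's implementation: a plain dict stands in for Redis, and `generate_search_query` is a stopword filter standing in for the Haiku call.

```python
import re
import uuid

# In-memory stand-in for Redis; a real deployment would use a Redis
# client and set a 1-hour TTL on each key.
context_store: dict[str, str] = {}

def store_context(full_context: str) -> str:
    key = f"vctx:{uuid.uuid4()}"
    context_store[key] = full_context
    return key

def generate_search_query(question: str) -> list[str]:
    # Stand-in for the Haiku call that distills the question into
    # search terms; here we just keep the significant words.
    stopwords = {"what", "is", "the", "a", "an", "of", "in", "does", "how"}
    return [w for w in re.findall(r"\w+", question.lower()) if w not in stopwords]

def retrieve_relevant(key: str, terms: list[str]) -> str:
    # Keyword search: keep only paragraphs that mention a search term.
    paragraphs = context_store[key].split("\n\n")
    return "\n\n".join(p for p in paragraphs if any(t in p.lower() for t in terms))

def vanishing_context(full_context: str, question: str) -> str:
    key = store_context(full_context)
    terms = generate_search_query(question)
    relevant = retrieve_relevant(key, terms)
    # The rewritten prompt carries only the relevant slices.
    return f"Context:\n{relevant}\n\nQuestion: {question}"
```

The rewritten prompt that reaches the expensive model contains only the retrieved paragraphs, which is where the ~99% token reduction comes from.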
Example:
curl -X POST http://localhost:8084/v1/chat/completions \
-H "X-Vanishing-Context: true" \
-H "x-bf-vk: sk-vk-..." \
-d '{
"model": "anthropic/claude-sonnet-4-5-20250929",
"messages": [...] # 150k token document
}'
# Response:
# X-Korad-Strategy: Vanishing-Context (redis peek+fetch)
# X-Korad-Original-Tokens: 152890
# X-Korad-Optimized-Tokens: 1066
# X-Korad-Reduction-Pct: 99%
Tier 3: Recursive RLM
Trigger: X-Korad-RLM: true OR contexts > 80k tokens
Savings: ~97% (50-97% depending on complexity)
Description: Based on Recursive Language Models (arXiv:2512.24601). Uses a cheaper "Controller" model (e.g., Haiku) to recursively summarize large contexts before sending to the "Reasoning" model (e.g., Sonnet).
Use Cases:
- Complex reasoning tasks (>80k tokens)
- Multi-step document analysis
- Evolving contexts where caching fails
- Codebase analysis across files
- Contract review with dependencies
Headers:
X-Korad-RLM: true
Controller Model Mapping:
| User Model | Controller Model |
|---|---|
| claude-opus | claude-haiku |
| claude-sonnet | claude-haiku |
| gpt-4o | gpt-4o-mini |
| gemini-2.5-pro | gemini-2.0-flash |
| deepseek-reasoner | deepseek-chat |
How It Works:
1. Detect the user's model family
2. Select a cheaper "Controller" model
3. Split the context into 10k-token chunks
4. Recursively summarize each chunk
5. Synthesize the summaries into a condensed context
6. Send the optimized prompt to the user's model
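The flow above can be sketched as follows. The `CONTROLLER_FOR` table mirrors the mapping above, but `summarize` is a truncation stub standing in for the real controller-model call, and token counts are approximated by character counts; the chunk size follows the 10k-token figure noted under Key Insights.

```python
CONTROLLER_FOR = {
    "claude": "claude-haiku",
    "gpt-4o": "gpt-4o-mini",
    "gemini": "gemini-2.0-flash",
    "deepseek": "deepseek-chat",
}

def pick_controller(user_model: str) -> str:
    # Family detection by substring match on the model name (assumption).
    for family, controller in CONTROLLER_FOR.items():
        if family in user_model:
            return controller
    return "claude-haiku"  # conservative default

def summarize(text: str, budget: int) -> str:
    # Stand-in for a controller-model call; real code would prompt the
    # cheap model for a summary of at most `budget` units.
    return text[:budget]

def recursive_condense(context: str, chunk_size: int = 10_000,
                       target: int = 5_000, depth: int = 0,
                       max_depth: int = 5) -> str:
    # Recurse until the condensed context fits the target budget.
    if len(context) <= target or depth >= max_depth:
        return context
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    per_chunk = max(1, target // len(chunks))
    summaries = [summarize(c, per_chunk) for c in chunks]
    return recursive_condense("\n".join(summaries), chunk_size, target,
                              depth + 1, max_depth)
```

The `max_depth` guard matches the 3-5 level maximum noted under Key Insights; without it, a pathological summarizer could recurse forever.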
Example:
curl -X POST http://localhost:8084/v1/chat/completions \
-H "X-Korad-RLM: true" \
-H "x-bf-vk: sk-vk-..." \
-d '{
"model": "anthropic/claude-sonnet-4-5-20250929",
"messages": [...] # 200k token context
}'
# Response:
# X-Korad-Strategy: Recursive-RLM (controller: claude-haiku-4-5-20251001)
# X-Korad-Original-Tokens: 203881
# X-Korad-Optimized-Tokens: 5606
# X-Korad-Reduction-Pct: 97%
Tier 4: Family-Locked Summarization
Trigger: Contexts > 20k tokens (default fallback)
Savings: ~30%
Description: Uses same-provider cheap models to summarize conversation history, ensuring data sovereignty (DeepSeek data stays with DeepSeek, Anthropic with Anthropic).
Use Cases:
- Chat history compression
- Long-running conversations
- Session context management
- Default optimization for most use cases
Family Mapping:
| Provider Family | Summarizer Model |
|---|---|
| anthropic | claude-haiku-4-5-20251001 |
| deepseek-chat | deepseek-chat |
| deepseek-reasoner | deepseek-chat |
| openai | gpt-4o-mini |
| google-gemini | gemini-2.0-flash |
How It Works:
1. Detect the provider family from the model name
2. Split the messages, keeping the last 5 recent
3. Summarize older messages with a same-family cheap model
4. Combine the summary with the recent messages
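A minimal sketch of these steps, assuming substring-based family detection; the summary string is a placeholder for the real same-family model output.

```python
FAMILY_SUMMARIZER = {
    "anthropic": "claude-haiku-4-5-20251001",
    "deepseek": "deepseek-chat",
    "openai": "gpt-4o-mini",
    "gemini": "gemini-2.0-flash",
}

def detect_family(model: str) -> str:
    # Substring match on the model name (assumption).
    if "claude" in model or "anthropic" in model:
        return "anthropic"
    if "deepseek" in model:
        return "deepseek"
    if "gemini" in model:
        return "gemini"
    return "openai"

def summarize_history(messages: list[dict], model: str,
                      keep_recent: int = 5) -> list[dict]:
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summarizer = FAMILY_SUMMARIZER[detect_family(model)]
    # Stand-in for the same-family summarization call; data stays with
    # the provider because `summarizer` shares the model's family.
    summary_text = f"[Summary of {len(older)} earlier messages via {summarizer}]"
    summary = {"role": "system", "content": summary_text}
    return [summary] + recent
```

The compressed history is then forwarded to the user's original model, preserving the last five turns verbatim.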
Example:
curl -X POST http://localhost:8084/v1/chat/completions \
-H "x-bf-vk: sk-vk-..." \
-d '{
"model": "anthropic/claude-sonnet-4-5-20250929",
"messages": [...] # 30k token chat history
}'
# Response:
# X-Korad-Strategy: Family-Lock (claude-haiku-4-5-20251001 -> claude-sonnet-4-5-20250929)
# X-Korad-Original-Tokens: 30450
# X-Korad-Optimized-Tokens: 12450
# X-Korad-Reduction-Pct: 59%
Tier 5: Savings Slider
Trigger: X-Savings-Level header (explicit user control)
Savings: 89% (extreme) to 12% (min)
Description: Hard context caps with middle-out truncation. Preserves system prompt and user query while truncating from the middle.
Use Cases:
- Cost control for budget-constrained applications
- Testing with reduced context
- Tiered pricing based on context size
- Safety limits for very large requests
Levels:
| Level | Cap | Savings | Use Case |
|---|---|---|---|
| extreme | 16k | 89% | Maximum compression |
| max | 32k | 78% | DeepSeek stable |
| med | 64k | 56% | Balanced |
| min | 128k | 12% | High fidelity |
| off | ∞ | 0% | No optimization |
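Middle-out truncation itself can be sketched as follows; the 4-characters-per-token approximation and the alternating front/back retention order are assumptions, with the caps taken from the table above.

```python
SAVINGS_CAPS = {"extreme": 16_384, "max": 32_768, "med": 65_536, "min": 131_072}

def tokens(msg: dict) -> int:
    # Rough approximation: 1 token is about 4 characters.
    return len(msg["content"]) // 4 + 1

def middle_out_truncate(messages: list[dict], level: str) -> list[dict]:
    cap = SAVINGS_CAPS.get(level)
    if cap is None or sum(tokens(m) for m in messages) <= cap:
        return messages  # "off", unknown level, or already under the cap
    head, tail = messages[0], messages[-1]   # system prompt + user query
    budget = cap - tokens(head) - tokens(tail)
    middle = messages[1:-1]
    kept_front, kept_back = [], []
    i, j, take_front = 0, len(middle) - 1, True
    while i <= j:
        m = middle[i] if take_front else middle[j]
        if tokens(m) > budget:
            break  # everything remaining in the center is dropped
        budget -= tokens(m)
        if take_front:
            kept_front.append(m)
            i += 1
        else:
            kept_back.insert(0, m)
            j -= 1
        take_front = not take_front
    return [head] + kept_front + kept_back + [tail]
```

The system prompt and final user query always survive; the oldest and newest middle turns are kept alternately until the cap is reached, so the loss is concentrated in the center of the conversation.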
Headers:
X-Savings-Level: extreme|max|med|min|off
Example:
curl -X POST http://localhost:8084/v1/chat/completions \
-H "X-Savings-Level: extreme" \
-H "x-bf-vk: sk-vk-..." \
-d '{
"model": "anthropic/claude-sonnet-4-5-20250929",
"messages": [...] # 150k token context
}'
# Response:
# X-Korad-Strategy: Savings-Slider (extreme: 16384 tokens)
# X-Korad-Original-Tokens: 150006
# X-Korad-Optimized-Tokens: 16359
# X-Korad-Reduction-Pct: 89%
Profit Interceptor (Billing Layer)
Applied to: All tiers
Description: Adds profit margin to the billed amount based on theoretical cost (what user would have paid without optimization).
Formula:
billed_amount = theoretical_cost Γ profit_margin
profit_margin = 1.5 (default, configurable)
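The formula can be checked against the header values shown in this section. This helper only restates the documented formula; the rounding precision is an assumption.

```python
def compute_billing(theoretical_cost: float, actual_cost: float,
                    profit_margin: float = 1.5) -> dict:
    # billed_amount = theoretical_cost x profit_margin, per the formula above.
    billed = theoretical_cost * profit_margin
    return {
        "X-Korad-Theoretical-Cost": round(theoretical_cost, 6),
        "X-Korad-Actual-Cost": round(actual_cost, 6),
        "X-Korad-Savings-USD": round(theoretical_cost - actual_cost, 6),
        "X-Korad-Billed-Amount": round(billed, 6),
        "X-Korad-Profit-Margin": f"{profit_margin:.2f}x",
    }
```

Feeding in the theoretical and actual costs from the headers below reproduces the billed amount and savings figures shown there.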
Billing Headers:
X-Korad-Theoretical-Cost: $0.038223 # Cost without optimization
X-Korad-Actual-Cost: $0.000821 # What we actually paid
X-Korad-Savings-USD: $0.037402 # User's "savings"
X-Korad-Billed-Amount: $0.057334 # What we charge (theoretical Γ 1.5)
X-Korad-Profit-Margin: 1.50x # Our margin
Strategic Value:
- Show users they "saved" money relative to the theoretical cost
- Actually charge well above cost (billed = 1.5x theoretical, while the actual cost paid is a small fraction of that)
- Transparent billing with a full cost breakdown
Quick Reference
Choosing the Right Tier
| Scenario | Use Tier | Header |
|---|---|---|
| Document QA (>100 pages) | Tier 2 | X-Vanishing-Context: true |
| Complex reasoning (>80k tokens) | Tier 3 | X-Korad-RLM: true |
| Chat history compression | Tier 4 | (automatic) |
| Cost-sensitive application | Tier 5 | X-Savings-Level: extreme |
| Maximum quality | Tier 1 only | (default) |
Header Priority
Headers are checked in this order:
1. X-Savings-Level - Tier 5 (highest priority)
2. X-Vanishing-Context - Tier 2
3. X-Korad-RLM - Tier 3
4. (No header) - Tier 4 (default)
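That priority order can be expressed as a dispatch function. The returned tier names and the rule that size thresholds are checked largest-first when no header is set are assumptions layered on the documented triggers.

```python
def choose_tier(headers: dict, context_tokens: int) -> str:
    # Explicit headers win, in waterfall priority order.
    if headers.get("X-Savings-Level", "off") != "off":
        return "tier5-savings-slider"
    if headers.get("X-Vanishing-Context") == "true":
        return "tier2-vanishing-context"
    if headers.get("X-Korad-RLM") == "true":
        return "tier3-recursive-rlm"
    # Size-based fallbacks, largest threshold first (an assumption).
    if context_tokens > 80_000:
        return "tier3-recursive-rlm"
    if context_tokens > 50_000:
        return "tier2-vanishing-context"
    if context_tokens > 20_000:
        return "tier4-family-lock"
    return "passthrough"
```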
Response Headers
All optimized requests include:
X-Korad-Original-Tokens: 203881
X-Korad-Optimized-Tokens: 5606
X-Korad-Tokens-Saved: 198275
X-Korad-Reduction-Pct: 97%
X-Korad-Savings-USD: $0.050050
X-Korad-Strategy: Recursive-RLM (controller: claude-haiku-4-5-20251001)
X-Korad-Theoretical-Cost: $0.050953
X-Korad-Actual-Cost: $0.000903
X-Korad-Billed-Amount: $0.076429
X-Korad-Profit-Margin: 1.50x
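A client can recompute the accounting from these headers. This sketch assumes the exact string formats shown above: $-prefixed costs and an x-suffixed margin.

```python
def parse_korad_headers(headers: dict) -> dict:
    # Parse the X-Korad-* accounting headers into numbers.
    def dollars(v: str) -> float:
        return float(v.lstrip("$"))
    original = int(headers["X-Korad-Original-Tokens"])
    optimized = int(headers["X-Korad-Optimized-Tokens"])
    return {
        "tokens_saved": original - optimized,
        "reduction_pct": round(100 * (original - optimized) / original),
        "savings_usd": dollars(headers["X-Korad-Savings-USD"]),
        "billed": dollars(headers["X-Korad-Billed-Amount"]),
        "margin": float(headers["X-Korad-Profit-Margin"].rstrip("x")),
    }
```

Running it on the Recursive-RLM example above reproduces the 198,275 tokens saved and the 97% reduction figure.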
Research & References
Tier 3: Recursive RLM
Based on the paper "Recursive Language Models" (arXiv:2512.24601)
Abstract: We introduce Recursive Language Models (RLMs), a class of models that recursively summarize long contexts using a cheaper "controller" model before reasoning. This approach achieves 50-97% cost savings on complex reasoning tasks while maintaining output quality.
Key Insights:
- Recursive summarization preserves reasoning chains
- Controller model choice critical for quality
- Optimal chunk size: 10k tokens
- Maximum depth: 3-5 levels
The Savings Waterfall: From monolith arbitrage to microservices optimization.