The Savings Waterfall

The complete optimization pipeline: from 100% savings on duplicate requests to 97% savings on 200k-token contexts.

Overview

The Korad.AI Gen 3 platform implements a 5-tier Savings Waterfall that automatically applies the most cost-effective optimization strategy based on request characteristics. Each tier is designed for specific use cases, ensuring maximum savings without compromising quality.

The Waterfall

┌─────────────────────────┐
│    INCOMING REQUEST     │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│      TIER 1 CHECK       │
│     Semantic Cache      │  Bifrost Gateway
│      (Duplicate?)       │  100% Savings
└────────────┬────────────┘
             │ Not duplicate
             ▼
┌─────────────────────────┐
│      TIER 5 CHECK       │
│     Savings Slider      │  X-Savings-Level
│       (Hard Cap?)       │  89% (extreme)
└────────────┬────────────┘
             │ No cap set
             ▼
┌─────────────────────────┐
│      TIER 2 CHECK       │
│    Vanishing Context    │  X-Vanishing-Context
│     (>50k tokens?)      │  99% Savings
└────────────┬────────────┘
             │ Not enabled / smaller
             ▼
┌─────────────────────────┐
│      TIER 3 CHECK       │
│      Recursive RLM      │  X-Korad-RLM
│     (>80k tokens?)      │  97% Savings
└────────────┬────────────┘
             │ Smaller context
             ▼
┌─────────────────────────┐
│     TIER 4 PROCESS      │
│      Family-Locked      │  Default
│      Summarization      │  30% Savings
│     (>20k tokens?)      │
└────────────┬────────────┘
             │ Forward optimized
             ▼
┌─────────────────────────┐
│      BILLING LAYER      │
│       Profit 1.5x       │
└────────────┬────────────┘
             │
             ▼
┌─────────────────────────┐
│        RESPONSE         │
└─────────────────────────┘

Tier Details

Tier 1: Semantic Cache (Built into Bifrost)

Trigger: Duplicate or semantically similar requests

Savings: 100% (cached response)

Description: The Bifrost Gateway automatically detects duplicate or semantically similar requests using vector embeddings. Cached responses are returned instantly without any LLM calls.

Use Cases:

  • Repeated questions in customer support
  • Batch processing with similar prompts
  • A/B testing with variations

Configuration: Built into Bifrost, no additional configuration needed.
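
Conceptually, the lookup is a nearest-neighbor search over prompt embeddings rather than an exact-text match. A minimal Python sketch under that assumption; the threshold, data layout, and names here are illustrative, not Bifrost's actual internals:

import numpy as np

SIMILARITY_THRESHOLD = 0.97  # illustrative value, not Bifrost's actual setting

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cache_lookup(cache: list, query_embedding: np.ndarray):
    """Return a cached response when a stored prompt embedding is close enough.

    `cache` is a list of (embedding, response) pairs; hypothetical layout."""
    best_score, best_response = 0.0, None
    for embedding, response in cache:
        score = cosine(embedding, query_embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None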


Tier 2: Vanishing Context

Trigger: X-Vanishing-Context: true OR contexts > 50k tokens

Savings: ~99%

Description: Stores the full context in Redis with a 1-hour TTL, then uses the cheapest model (Haiku) to generate a search query and retrieve only the relevant sections. Ideal for document QA and large knowledge-base queries.

Use Cases:

  • Document QA (>100 pages)
  • Legal document analysis
  • Technical manual queries
  • Research paper analysis

Headers:

X-Vanishing-Context: true

How It Works:

  1. Store full conversation to Redis
  2. Use Haiku to generate search query
  3. Retrieve relevant sections via keyword search
  4. Rewrite prompt with only relevant context
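
The steps above amount to a park-and-fetch loop. A minimal Python sketch, assuming redis-py; the keyword helper is a placeholder, and all names are illustrative rather than the production implementation:

import json
import uuid

import redis

r = redis.Redis()  # assumes a reachable Redis instance

def cheap_model_keywords(query: str) -> list:
    """Placeholder for the Haiku call that distills the query into keywords."""
    raise NotImplementedError("wire this to the cheapest model in your gateway")

def vanishing_context(messages: list, query: str) -> list:
    # 1. Park the full conversation in Redis with a 1-hour TTL.
    key = f"ctx:{uuid.uuid4()}"
    r.setex(key, 3600, json.dumps(messages))

    # 2. Have the cheapest model turn the user query into search keywords.
    keywords = cheap_model_keywords(query)

    # 3. Keyword search over the stored messages.
    stored = json.loads(r.get(key))
    relevant = [m for m in stored
                if any(k.lower() in m.get("content", "").lower() for k in keywords)]

    # 4. Rewrite the prompt with only the relevant context.
    return relevant + [{"role": "user", "content": query}]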

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "X-Vanishing-Context: true" \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 150k token document
  }'

# Response:
# X-Korad-Strategy: Vanishing-Context (redis peek+fetch)
# X-Korad-Original-Tokens: 152890
# X-Korad-Optimized-Tokens: 1066
# X-Korad-Reduction-Pct: 99%

Tier 3: Recursive RLM 🧠

Trigger: X-Korad-RLM: true OR contexts > 80k tokens

Savings: ~97% (50-97% depending on complexity)

Description: Based on Recursive Language Models (arXiv:2512.24601). Uses a cheaper "Controller" model (e.g., Haiku) to recursively summarize large contexts before sending them to the "Reasoning" model (e.g., Sonnet).

Use Cases:

  • Complex reasoning tasks (>80k tokens)
  • Multi-step document analysis
  • Evolving contexts where caching fails
  • Codebase analysis across files
  • Contract review with dependencies

Headers:

X-Korad-RLM: true

Controller Model Mapping:

User Model          Controller Model
claude-opus         claude-haiku
claude-sonnet       claude-haiku
gpt-4o              gpt-4o-mini
gemini-2.5-pro      gemini-2.0-flash
deepseek-reasoner   deepseek-chat

How It Works:

  1. Detect user's model family
  2. Select cheaper "Controller" model
  3. Split context into 10k token chunks
  4. Recursively summarize each chunk
  5. Synthesize summaries into condensed context
  6. Send optimized prompt to user's model
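
A minimal Python sketch of this recursion, using the controller mapping above. summarize() is a placeholder for the actual controller call, and the word-based token estimate is a deliberate simplification:

CHUNK_TOKENS = 10_000  # chunk size from the pipeline description

CONTROLLERS = {  # same mapping as the table above
    "claude-opus": "claude-haiku",
    "claude-sonnet": "claude-haiku",
    "gpt-4o": "gpt-4o-mini",
    "gemini-2.5-pro": "gemini-2.0-flash",
    "deepseek-reasoner": "deepseek-chat",
}

def summarize(text: str, model: str) -> str:
    """Placeholder for the summarization call to the controller model."""
    raise NotImplementedError

def rlm_condense(context: str, user_model: str,
                 depth: int = 0, max_depth: int = 3) -> str:
    """Recursively condense a large context before the reasoning model sees it."""
    controller = CONTROLLERS.get(user_model, user_model)
    words = context.split()
    chunk_words = int(CHUNK_TOKENS * 0.75)  # crude words-per-token estimate
    if len(words) <= chunk_words or depth >= max_depth:
        return summarize(context, controller)
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    summaries = [summarize(chunk, controller) for chunk in chunks]
    # The joined summaries may still be large, so recurse on them.
    return rlm_condense(" ".join(summaries), user_model, depth + 1, max_depth)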

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "X-Korad-RLM: true" \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 200k token context
  }'

# Response:
# X-Korad-Strategy: Recursive-RLM (controller: claude-haiku-4-5-20251001)
# X-Korad-Original-Tokens: 203881
# X-Korad-Optimized-Tokens: 5606
# X-Korad-Reduction-Pct: 97%

Tier 4: Family-Locked Summarization

Trigger: Contexts > 20k tokens (default fallback)

Savings: ~30% (typical; varies with how much of the history can be summarized)

Description: Uses a cheap model from the same provider to summarize conversation history, preserving data sovereignty (DeepSeek data stays with DeepSeek, Anthropic data with Anthropic).

Use Cases:

  • Chat history compression
  • Long-running conversations
  • Session context management
  • Default optimization for most use cases

Family Mapping:

Provider Family     Summarizer Model
anthropic           claude-haiku-4-5-20251001
deepseek-chat       deepseek-chat
deepseek-reasoner   deepseek-chat
openai              gpt-4o-mini
google-gemini       gemini-2.0-flash

How It Works:

  1. Detect provider family from model name
  2. Split messages (keep last 5 recent)
  3. Summarize older messages with same-family cheap model
  4. Combine summary + recent messages
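
A minimal Python sketch of this split-and-summarize flow; the family detection and the summarize_with() helper are illustrative placeholders, not the actual plugin code:

FAMILY_SUMMARIZERS = {  # same mapping as the table above
    "anthropic": "claude-haiku-4-5-20251001",
    "deepseek-chat": "deepseek-chat",
    "deepseek-reasoner": "deepseek-chat",
    "openai": "gpt-4o-mini",
    "google-gemini": "gemini-2.0-flash",
}

KEEP_RECENT = 5  # the last 5 messages are kept verbatim

def summarize_with(model: str, messages: list) -> str:
    """Placeholder for the same-family summarization call."""
    raise NotImplementedError

def family_lock(messages: list, model_name: str) -> list:
    # 1. Detect the provider family from a "family/model" name.
    family = model_name.split("/")[0]
    summarizer = FAMILY_SUMMARIZERS.get(family)
    if summarizer is None or len(messages) <= KEEP_RECENT:
        return messages  # nothing to compress
    # 2. Split: older history vs. the most recent messages.
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    # 3. Summarize the older history with the same-family cheap model.
    summary = summarize_with(summarizer, older)
    # 4. Recombine: compact summary + recent messages.
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent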

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 30k token chat history
  }'

# Response:
# X-Korad-Strategy: Family-Lock (claude-haiku-4-5-20251001 -> claude-sonnet-4-5-20250929)
# X-Korad-Original-Tokens: 30450
# X-Korad-Optimized-Tokens: 12450
# X-Korad-Reduction-Pct: 59%

Tier 5: Savings Slider

Trigger: X-Savings-Level header (explicit user control)

Savings: 89% (extreme) to 12% (min)

Description: Applies hard context caps with middle-out truncation: the system prompt and user query are preserved while messages are dropped from the middle.

Use Cases:

  • Cost control for budget-constrained applications
  • Testing with reduced context
  • Tiered pricing based on context size
  • Safety limits for very large requests

Levels:

Level     Cap    Savings   Use Case
extreme   16k    89%       Maximum compression
max       32k    78%       DeepSeek stable
med       64k    56%       Balanced
min       128k   12%       High fidelity
off       ∞      0%        No optimization
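
A minimal Python sketch of the middle-out truncation described above, using the caps from this table. The 32k/64k/128k caps are read as powers of two (only the extreme cap of 16384 is confirmed by the example below), and the token estimate is a rough heuristic:

LEVEL_CAPS = {  # caps from the Levels table, in tokens (assumed powers of two)
    "extreme": 16_384,
    "max": 32_768,
    "med": 65_536,
    "min": 131_072,
}

def estimate_tokens(message: dict) -> int:
    return len(message.get("content", "")) // 4  # rough chars-per-token heuristic

def middle_out(messages: list, level: str) -> list:
    """Drop messages from the middle until the context fits the level's cap,
    preserving the system prompt (head) and the user query (tail)."""
    cap = LEVEL_CAPS.get(level)
    if cap is None:  # "off" or unknown level: no truncation
        return messages
    kept = list(messages)
    while sum(estimate_tokens(m) for m in kept) > cap and len(kept) > 2:
        kept.pop(len(kept) // 2)
    return kept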

Headers:

X-Savings-Level: extreme|max|med|min|off

Example:

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "X-Savings-Level: extreme" \
  -H "x-bf-vk: sk-vk-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [...]  # 150k token context
  }'

# Response:
# X-Korad-Strategy: Savings-Slider (extreme: 16384 tokens)
# X-Korad-Original-Tokens: 150006
# X-Korad-Optimized-Tokens: 16359
# X-Korad-Reduction-Pct: 89%

Profit Interceptor (Billing Layer)

Applied to: All tiers

Description: Computes the billed amount by applying a profit margin to the theoretical cost (what the user would have paid without optimization).

Formula:

billed_amount = theoretical_cost × profit_margin
profit_margin = 1.5  (default, configurable)
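
As a worked sketch of that formula in Python (the header names come from the billing examples in this doc; the function itself is illustrative, not the actual interceptor):

PROFIT_MARGIN = 1.5  # default, configurable

def billing_headers(theoretical_cost: float, actual_cost: float) -> dict:
    """Compute the billing headers from the formula above."""
    billed = theoretical_cost * PROFIT_MARGIN
    return {
        "X-Korad-Theoretical-Cost": f"${theoretical_cost:.6f}",
        "X-Korad-Actual-Cost": f"${actual_cost:.6f}",
        "X-Korad-Savings-USD": f"${theoretical_cost - actual_cost:.6f}",
        "X-Korad-Billed-Amount": f"${billed:.6f}",
        "X-Korad-Profit-Margin": f"{PROFIT_MARGIN:.2f}x",
    }

# billing_headers(0.038223, 0.000821) reproduces the example values below.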

Billing Headers:

X-Korad-Theoretical-Cost: $0.038223   # Cost without optimization
X-Korad-Actual-Cost: $0.000821        # What we actually paid
X-Korad-Savings-USD: $0.037402        # User's "savings"
X-Korad-Billed-Amount: $0.057334      # What we charge (theoretical × 1.5)
X-Korad-Profit-Margin: 1.50x          # Our margin

Strategic Value:

  • Show users how much they "saved" relative to the theoretical cost
  • Bill 1.5× the theoretical cost while paying only the much smaller optimized cost
  • Keep billing transparent with a full cost breakdown in the response headers

Quick Reference

Choosing the Right Tier

Scenario                          Use Tier      Header
Document QA (>100 pages)          Tier 2        X-Vanishing-Context: true
Complex reasoning (>80k tokens)   Tier 3        X-Korad-RLM: true
Chat history compression          Tier 4        (automatic)
Cost-sensitive application        Tier 5        X-Savings-Level: extreme
Maximum quality                   Tier 1 only   (default)

Header Priority

Headers are checked in this order:

  1. X-Savings-Level - Tier 5 (highest priority)
  2. X-Vanishing-Context - Tier 2
  3. X-Korad-RLM - Tier 3
  4. (No header) - Tier 4 (default)
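
A compact sketch of this dispatch order in Python, combining the header checks with the token-count triggers from the waterfall (the function and its signature are illustrative; the Tier 1 cache check happens earlier, inside Bifrost):

def choose_tier(headers: dict, token_count: int) -> int:
    """Pick a waterfall tier: explicit headers first, then size triggers."""
    if "X-Savings-Level" in headers:
        return 5  # hard cap requested (highest priority)
    if headers.get("X-Vanishing-Context") == "true" or token_count > 50_000:
        return 2  # vanishing context
    if headers.get("X-Korad-RLM") == "true" or token_count > 80_000:
        return 3  # recursive RLM
    return 4  # default: family-locked summarization (>20k tokens)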

Response Headers

All optimized requests include:

X-Korad-Original-Tokens: 203881
X-Korad-Optimized-Tokens: 5606
X-Korad-Tokens-Saved: 198275
X-Korad-Reduction-Pct: 97%
X-Korad-Savings-USD: $0.050050
X-Korad-Strategy: Recursive-RLM (controller: claude-haiku-4-5-20251001)
X-Korad-Theoretical-Cost: $0.050953
X-Korad-Actual-Cost: $0.000903
X-Korad-Billed-Amount: $0.076429
X-Korad-Profit-Margin: 1.50x

Research & References

Tier 3: Recursive RLM

Based on the paper "Recursive Language Models" (arXiv:2512.24601)

Abstract: We introduce Recursive Language Models (RLMs), a class of models that recursively summarize long contexts using a cheaper "controller" model before reasoning. This approach achieves 50-97% cost savings on complex reasoning tasks while maintaining output quality.

Key Insights:

  • Recursive summarization preserves reasoning chains
  • Controller model choice is critical for quality
  • Optimal chunk size: 10k tokens
  • Maximum depth: 3-5 levels

The Savings Waterfall: From monolith arbitrage to microservices optimization.