Semantic Cache

Tier 1: 100% savings on duplicate requests. Automatically detects and caches semantically similar requests using vector embeddings.

Overview

Semantic Cache is the first tier in the Savings Waterfall. It automatically detects duplicate or semantically similar requests using vector embeddings and returns cached responses instantly without any LLM calls.

How It Works

┌─────────────┐
│ New Request │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│ Generate Vector │
│    Embedding    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │
│  (Redis Stack)  │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌───────┐  ┌───────────┐
│ Found │  │ Not Found │
└───┬───┘  └─────┬─────┘
    │            │
    ▼            ▼
┌──────────┐  ┌──────────┐
│  Return  │  │ Proceed  │
│  Cached  │  │ to Tier  │
│ Response │  │    2     │
└──────────┘  └──────────┘
 100% Savings

Use Cases

Semantic cache is most effective for:

  • Customer Support - Repeated questions about products/services
  • Code Generation - Similar coding tasks with slight variations
  • Documentation QA - Repeated queries about documentation
  • A/B Testing - Testing prompt variations
  • Batch Processing - Multiple similar requests in a batch

Configuration

Semantic cache is built into the Bifrost Gateway and works out of the box with no additional configuration; the settings below are optional tuning knobs.

TTL Settings

Configure cache TTL in config.dev.json:

{
  "cache": {
    "enabled": true,
    "ttl_seconds": 3600,
    "similarity_threshold": 0.95
  }
}

Performance

  • Hit Latency: ~10ms (vs 500ms-5s for LLM)
  • Savings: 100% (no LLM cost)
  • Cache Hit Rate: 15-30% for typical workloads

Example

First Request (Cache Miss)

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "x-bf-vk: sk-bf-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [
      {"role": "user", "content": "Explain how JWT authentication works"}
    ]
  }'

# Response headers:
# X-Korad-Cache: MISS
# X-Korad-Strategy: Passthrough (no optimization needed)
# X-Korad-Billed-Amount: $0.002500

Second Similar Request (Cache Hit)

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "x-bf-vk: sk-bf-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [
      {"role": "user", "content": "How does JWT auth work?"}
    ]
  }'

# Response headers:
# X-Korad-Cache: HIT
# X-Korad-Strategy: Semantic-Cache
# X-Korad-Billed-Amount: $0.000000 ← 100% savings!

Semantic Similarity

The cache uses cosine similarity to find semantically similar requests:

  • Exact match: Similarity = 1.0
  • Very similar: Similarity > 0.95 (cache hit)
  • Somewhat similar: Similarity 0.85-0.95 (may hit depending on threshold)
  • Different: Similarity < 0.85 (cache miss)

Example Pairs

| Request 1 | Request 2 | Similarity | Cached? |
| --- | --- | --- | --- |
| "Explain JWT" | "How does JWT work?" | 0.97 | ✅ Yes |
| "What is OAuth?" | "Explain OAuth2" | 0.94 | ⚠️ Only below the default 0.95 threshold |
| "Python list sort" | "Sort list in Python" | 0.96 | ✅ Yes |
| "Create API" | "Debug API" | 0.72 | ❌ No |

Cache Invalidation

The cache respects these invalidation rules:

  1. TTL Expiry - Cached responses expire after configured TTL
  2. Model Change - Different model = cache miss
  3. System Prompt - Different system prompt = cache miss
  4. Parameters - Different temperature/max_tokens = cache miss
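One common way to enforce rules 2-4 is to hash every exact-match field into the lookup namespace, so a change to any of them lands in a different namespace and misses; rule 1 is a timestamp check. The sketch below illustrates that pattern only -- it is not the gateway's actual key scheme:

```python
import hashlib
import json
import time

def cache_namespace(model: str, system_prompt: str, params: dict) -> str:
    # Rules 2-4: model, system prompt, and sampling parameters are all
    # hashed into the key, so changing any one of them produces a
    # different namespace (and therefore a cache miss).
    blob = json.dumps(
        {"model": model, "system": system_prompt, "params": params},
        sort_keys=True,
    )
    return "cache:" + hashlib.sha256(blob.encode()).hexdigest()[:16]

def is_expired(stored_at: float, ttl_seconds: int = 3600) -> bool:
    # Rule 1: TTL expiry, measured from when the entry was stored.
    return time.time() - stored_at > ttl_seconds
```

Only requests that agree on every exact-match field are compared by semantic similarity; everything else misses by construction.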

Monitoring

Check cache performance:

curl http://localhost:8081/metrics/cache

# Response:
# {
#   "hit_rate": 0.23,
#   "total_requests": 10000,
#   "cache_hits": 2300,
#   "cache_misses": 7700
# }
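The metrics relate to each other simply: `hit_rate = cache_hits / total_requests`, and misses make up the remainder. A quick sanity check against the sample response above:

```python
def cache_stats(cache_hits: int, total_requests: int) -> dict:
    # hit_rate = cache_hits / total_requests; misses are the rest.
    return {
        "hit_rate": cache_hits / total_requests if total_requests else 0.0,
        "total_requests": total_requests,
        "cache_hits": cache_hits,
        "cache_misses": total_requests - cache_hits,
    }
```

With 2,300 hits out of 10,000 requests, the hit rate is 0.23 and 7,700 requests fell through to later tiers, matching the response shown.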

Best Practices

1. Enable for High-Traffic Endpoints

Semantic cache is most valuable for:

  • FAQ endpoints
  • Code generation helpers
  • Documentation query endpoints

2. Tune Similarity Threshold

A lower threshold produces more cache hits but raises the risk of returning a less relevant cached response:

{
  "cache": {
    "similarity_threshold": 0.90 // More permissive
  }
}

3. Set Appropriate TTL

A shorter TTL keeps cached results fresh but lowers the hit rate:

{
  "cache": {
    "ttl_seconds": 1800 // 30 minutes
  }
}

Technical Details

Vector Embedding Model

The platform uses Cohere embed-v3 for semantic similarity:

  • Dimension: 1024
  • Cost: $0.0001 per 1K tokens
  • Latency: ~50ms

Storage

Cached data is stored in Redis Stack with vector search:

# Check cache size
redis-cli DBSIZE

# View cache keys (use SCAN instead of KEYS in production; KEYS blocks the server)
redis-cli KEYS "cache:*"
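The exact key and value layout is internal to the gateway; as a hypothetical sketch of what one entry under the `cache:` prefix could look like (field names here are illustrative, not the real schema):

```python
import hashlib
import json

def cache_record(prompt: str, response: str,
                 embedding: list[float]) -> tuple[str, str]:
    # Hypothetical layout: one entry per prompt under the "cache:"
    # prefix (matching the KEYS pattern above), with the response and
    # its embedding stored together so vector search can score it.
    key = "cache:" + hashlib.sha256(prompt.encode()).hexdigest()[:16]
    value = json.dumps({"response": response, "embedding": embedding})
    return key, value
```

The same prompt always maps to the same key, which is what makes TTL expiry and overwrites straightforward on the Redis side.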

100% savings on duplicate requests - automatically.