Semantic Cache
Tier 1: 100% savings on duplicate requests
Automatically detects and caches similar requests using vector embeddings.
Overview
Semantic Cache is the first tier in the Savings Waterfall. It automatically detects duplicate or semantically similar requests using vector embeddings and returns cached responses instantly without any LLM calls.
How It Works
  ┌─────────────┐
  │ New Request │
  └──────┬──────┘
         │
         ▼
┌─────────────────┐
│ Generate Vector │
│    Embedding    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │
│  (Redis Stack)  │
└────────┬────────┘
         │
    ┌────┴─────┐
    │          │
    ▼          ▼
┌────────┐ ┌──────────┐
│ Found  │ │ Not Found│
└───┬────┘ └───┬──────┘
    │          │
    ▼          ▼
┌────────┐ ┌──────────┐
│ Return │ │ Proceed  │
│ Cached │ │ to Tier  │
│Response│ │    2     │
└────────┘ └──────────┘
100% Savings
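The flow above can be sketched in a few lines of Python. This is an illustrative sketch only, not the gateway's implementation: `toy_embed` is a made-up letter-frequency embedding standing in for a real embedding model, the in-memory list stands in for the Redis Stack vector index, `call_llm` stands in for everything from Tier 2 onward, and treating a score equal to the threshold as a hit is an assumption.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical letter-frequency embedding, unit-normalised (not a real model)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


class SemanticCacheSketch:
    """In-memory stand-in for the vector index (assumption, not the real store)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        best_score, best_response = -1.0, None
        for vector, response in self.entries:
            # Vectors are unit length, so the dot product is the cosine similarity.
            score = sum(q * v for q, v in zip(query, vector))
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None


def handle(cache: SemanticCacheSketch, prompt: str,
           call_llm: Callable[[str], str]) -> Tuple[str, str]:
    """Return (cache status, response): serve from cache or fall through."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return "HIT", cached
    response = call_llm(prompt)  # stand-in for Tier 2 and beyond
    cache.store(prompt, response)
    return "MISS", response


cache = SemanticCacheSketch(toy_embed)
fake_llm = lambda prompt: f"answer to: {prompt}"
first = handle(cache, "Explain how JWT authentication works", fake_llm)
second = handle(cache, "Explain how JWT authentication works", fake_llm)
print(first[0], second[0])
```

The first call misses and populates the cache; the repeated call is served from it without invoking `fake_llm` again.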
Use Cases
Semantic cache is most effective for:
- Customer Support - Repeated questions about products/services
- Code Generation - Similar coding tasks with slight variations
- Documentation QA - Repeated queries about documentation
- A/B Testing - Testing prompt variations
- Batch Processing - Multiple similar requests in a batch
Configuration
Semantic cache is built into the Bifrost Gateway and works without additional configuration. The settings below are optional overrides.
TTL Settings
Configure the cache TTL (and, in the same block, the similarity threshold) in config.dev.json:
{
  "cache": {
    "enabled": true,
    "ttl_seconds": 3600,
    "similarity_threshold": 0.95
  }
}
Performance
- Hit Latency: ~10ms (vs 500ms-5s for an LLM call)
- Savings: 100% (no LLM cost)
- Cache Hit Rate: 15-30% for typical workloads
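To see what these figures mean for average response time, the expected latency can be estimated as a weighted mix of hits and misses. The sketch below is a rough back-of-the-envelope model, assuming the hit and miss latencies quoted above and ignoring any lookup overhead added to a miss; the function name is hypothetical.

```python
def expected_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Average latency when a fraction `hit_rate` of requests is served from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms


# Figures from the list above: ~10 ms per hit, 500 ms for a fast LLM call.
low = expected_latency_ms(0.15, 10.0, 500.0)   # roughly 426.5 ms
high = expected_latency_ms(0.30, 10.0, 500.0)  # roughly 353 ms
print(round(low, 1), round(high, 1))
```

Under the same assumptions, the share of LLM spend avoided is simply the hit rate, since every hit costs nothing in LLM fees.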
Example
First Request (Cache Miss)
# -i prints the response headers; Content-Type marks the body as JSON
curl -i -X POST http://localhost:8084/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: sk-bf-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [
      {"role": "user", "content": "Explain how JWT authentication works"}
    ]
  }'
# Response headers:
# X-Korad-Cache: MISS
# X-Korad-Strategy: Passthrough (no optimization needed)
# X-Korad-Billed-Amount: $0.002500
Second Similar Request (Cache Hit)
# Same endpoint and model, differently worded question
curl -i -X POST http://localhost:8084/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-bf-vk: sk-bf-..." \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [
      {"role": "user", "content": "How does JWT auth work?"}
    ]
  }'
# Response headers:
# X-Korad-Cache: HIT
# X-Korad-Strategy: Semantic-Cache
# X-Korad-Billed-Amount: $0.000000  <- 100% savings!
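A client can read these headers to track cache behaviour on its side. The following is a minimal sketch that assumes the three header names and value formats shown in the examples above; `CacheOutcome` and `parse_cache_headers` are hypothetical helpers, not part of any SDK.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CacheOutcome:
    hit: bool
    strategy: str
    billed_usd: float


def parse_cache_headers(headers: Dict[str, str]) -> CacheOutcome:
    """Interpret the X-Korad-* response headers shown in the examples."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    billed = lowered.get("x-korad-billed-amount", "$0").strip().lstrip("$")
    return CacheOutcome(
        hit=lowered.get("x-korad-cache", "").strip().upper() == "HIT",
        strategy=lowered.get("x-korad-strategy", "").strip(),
        billed_usd=float(billed),
    )


miss = parse_cache_headers({
    "X-Korad-Cache": "MISS",
    "X-Korad-Strategy": "Passthrough (no optimization needed)",
    "X-Korad-Billed-Amount": "$0.002500",
})
hit = parse_cache_headers({
    "X-Korad-Cache": "HIT",
    "X-Korad-Strategy": "Semantic-Cache",
    "X-Korad-Billed-Amount": "$0.000000",
})
print(miss.hit, miss.billed_usd, hit.hit, hit.billed_usd)
```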
Semantic Similarity
The cache uses cosine similarity to find semantically similar requests:
- Exact match: Similarity = 1.0
- Very similar: Similarity > 0.95 (cache hit)
- Somewhat similar: Similarity 0.85-0.95 (may hit depending on threshold)
- Different: Similarity < 0.85 (cache miss)
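For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. The sketch below uses tiny hand-written vectors for illustration; real embeddings have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction, close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated directions, 0.0
print(round(parallel, 6), orthogonal)
```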
Example Pairs
| Request 1 | Request 2 | Similarity | Cached? |
|---|---|---|---|
| "Explain JWT" | "How does JWT work?" | 0.97 | Yes |
| "What is OAuth?" | "Explain OAuth2" | 0.94 | No at the default 0.95 threshold (yes at 0.90) |
| "Python list sort" | "Sort list in Python" | 0.96 | Yes |
| "Create API" | "Debug API" | 0.72 | No |
Cache Invalidation
The cache respects these invalidation rules:
- TTL Expiry - Cached responses expire after configured TTL
- Model Change - Different model = cache miss
- System Prompt - Different system prompt = cache miss
- Parameters - Different temperature/max_tokens = cache miss
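One way to picture the model, system prompt, and parameter rules is as a scope key: two requests can only share a cached response if their scope keys match. The sketch below is purely illustrative and makes no claim about how the gateway stores entries; `cache_scope` and its field names are hypothetical.

```python
import hashlib
import json
from typing import Optional


def cache_scope(model: str, system_prompt: Optional[str],
                temperature: float, max_tokens: int) -> str:
    """Hash the fields that must match exactly before similarity is even considered."""
    payload = json.dumps(
        {
            "model": model,
            "system": system_prompt or "",
            "temperature": temperature,
            "max_tokens": max_tokens,
        },
        sort_keys=True,  # stable field order gives a stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


MODEL = "anthropic/claude-sonnet-4-5-20250929"
base = cache_scope(MODEL, None, 0.7, 1024)
same = cache_scope(MODEL, None, 0.7, 1024)
hotter = cache_scope(MODEL, None, 1.0, 1024)            # different temperature
prompted = cache_scope(MODEL, "Be terse.", 0.7, 1024)   # different system prompt
print(base == same, base == hotter, base == prompted)
```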
Monitoring
Check cache performance:
curl http://localhost:8081/metrics/cache
# Response:
# {
# "hit_rate": 0.23,
# "total_requests": 10000,
# "cache_hits": 2300,
# "cache_misses": 7700
# }
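The fields in this response are related: hits plus misses should equal total requests, and the hit rate is hits divided by total requests (2300 / 10000 = 0.23 in the sample). A small sanity check along those lines, assuming the response shape shown above; `metrics_consistent` is a hypothetical helper.

```python
import json
import math

SAMPLE = """
{
  "hit_rate": 0.23,
  "total_requests": 10000,
  "cache_hits": 2300,
  "cache_misses": 7700
}
"""


def metrics_consistent(raw: str, tolerance: float = 1e-6) -> bool:
    """Check that the counters add up and match the reported hit rate."""
    m = json.loads(raw)
    total = m["total_requests"]
    if m["cache_hits"] + m["cache_misses"] != total:
        return False
    computed = m["cache_hits"] / total if total else 0.0
    return math.isclose(computed, m["hit_rate"], abs_tol=tolerance)


print(metrics_consistent(SAMPLE))
```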
Best Practices
1. Enable for High-Traffic Endpoints
Semantic cache is most valuable for:
- FAQ endpoints
- Code generation helpers
- Documentation query endpoints
2. Tune Similarity Threshold
A lower threshold produces more hits, but the cached responses may be less relevant to the new request. For example, a more permissive threshold of 0.90:
{
  "cache": {
    "similarity_threshold": 0.90
  }
}
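The effect of the threshold can be illustrated with the similarity scores from the Example Pairs table. This sketch assumes a score equal to the threshold counts as a hit; the scores are the illustrative values from that table, not measurements.

```python
from typing import List, Tuple

# (request 1, request 2, similarity) taken from the Example Pairs table.
PAIRS: List[Tuple[str, str, float]] = [
    ("Explain JWT", "How does JWT work?", 0.97),
    ("What is OAuth?", "Explain OAuth2", 0.94),
    ("Python list sort", "Sort list in Python", 0.96),
    ("Create API", "Debug API", 0.72),
]


def hits_at(threshold: float) -> List[Tuple[str, str]]:
    """Pairs that would be served from cache at the given threshold."""
    return [(a, b) for a, b, score in PAIRS if score >= threshold]


for threshold in (0.95, 0.90, 0.70):
    print(threshold, len(hits_at(threshold)))
# prints:
# 0.95 2
# 0.9 3
# 0.7 4
```

At 0.70 the "Create API" / "Debug API" pair also hits, even though the two requests ask for different things, which is the relevance risk of an overly permissive threshold.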
3. Set Appropriate TTL
A shorter TTL keeps results fresher but lowers the hit rate. For example, a TTL of 30 minutes:
{
  "cache": {
    "ttl_seconds": 1800
  }
}
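TTL expiry itself is simple: each entry carries an expiry time and is ignored once that time has passed. The sketch below shows the idea with an injectable clock so it can be exercised without waiting; it is an in-memory illustration, not a description of how the gateway or Redis handles expiry, and `TTLStore` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLStore:
    """Minimal key-value store whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[str, float]] = {}

    def put(self, key: str, value: str) -> None:
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict the stale entry
            return None
        return value


now = [0.0]                      # fake clock, in seconds
store = TTLStore(1800, clock=lambda: now[0])
store.put("jwt", "cached answer")
now[0] = 1799.0
before = store.get("jwt")        # still within the 30-minute TTL
now[0] = 1800.0
after = store.get("jwt")         # expired
print(before, after)
```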
Technical Details
Vector Embedding Model
The platform uses Cohere embed-v3 for semantic similarity:
- Dimension: 1024
- Cost: $0.0001 per 1K tokens
- Latency: ~50ms
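As a rough sense of scale, the embedding cost per request follows directly from the per-token price above. The sketch below assumes a 20-token prompt, which is an illustrative figure rather than a measurement, and compares the result with the $0.002500 billed for the cache miss in the example.

```python
EMBED_USD_PER_1K_TOKENS = 0.0001  # price quoted in the list above


def embedding_cost_usd(tokens: int) -> float:
    """Cost of embedding `tokens` tokens at the quoted per-1K price."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1000 * EMBED_USD_PER_1K_TOKENS


prompt_tokens = 20                          # assumed prompt length
cost = embedding_cost_usd(prompt_tokens)    # about $0.000002
share_of_miss = cost / 0.0025               # relative to the example miss
print(f"{cost:.6f} USD, {share_of_miss:.2%} of the example LLM charge")
```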
Storage
Cached data is stored in Redis Stack with vector search:
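# Check cache size (total number of keys in the current database)
redis-cli DBSIZE
# List cache keys; --scan iterates incrementally instead of blocking the server as KEYS does
redis-cli --scan --pattern "cache:*"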