Semantic Cache

Tier 1: 100% savings on duplicate requests. Automatically detects and caches semantically similar requests using vector embeddings.

Overview

Semantic Cache is the first tier in the Savings Waterfall. It automatically detects duplicate or semantically similar requests using vector embeddings and returns cached responses instantly without any LLM calls.

How It Works

┌─────────────┐
│ New Request │
└──────┬──────┘
       │
       ▼
┌─────────────────┐
│ Generate Vector │
│    Embedding    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Vector Search  │
│  (Redis Stack)  │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌───────┐  ┌───────────┐
│ Found │  │ Not Found │
└───┬───┘  └─────┬─────┘
    │            │
    ▼            ▼
┌──────────┐  ┌──────────┐
│  Return  │  │ Proceed  │
│  Cached  │  │ to Tier  │
│ Response │  │    2     │
└──────────┘  └──────────┘
 100% Savings

Use Cases

Semantic cache is most effective for:

  • Customer Support - Repeated questions about products/services
  • Code Generation - Similar coding tasks with slight variations
  • Documentation QA - Repeated queries about documentation
  • A/B Testing - Testing prompt variations
  • Batch Processing - Multiple similar requests in a batch

Configuration

Semantic cache is built into the Bifrost Gateway and works out of the box with no additional configuration; the settings below are optional tuning knobs.

TTL Settings

Configure cache TTL in config.dev.json:

{
  "cache": {
    "enabled": true,
    "ttl_seconds": 3600,
    "similarity_threshold": 0.95
  }
}

Performance

  • Hit Latency: ~10ms (vs 500ms-5s for LLM)
  • Savings: 100% (no LLM cost)
  • Cache Hit Rate: 15-30% for typical workloads

Example

First Request (Cache Miss)

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "x-bf-vk: sk-bf-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [
      {"role": "user", "content": "Explain how JWT authentication works"}
    ]
  }'

# Response headers:
# X-Korad-Cache: MISS
# X-Korad-Strategy: Passthrough (no optimization needed)
# X-Korad-Billed-Amount: $0.002500

Second Similar Request (Cache Hit)

curl -X POST http://localhost:8084/v1/chat/completions \
  -H "x-bf-vk: sk-bf-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "anthropic/claude-sonnet-4-5-20250929",
    "messages": [
      {"role": "user", "content": "How does JWT auth work?"}
    ]
  }'

# Response headers:
# X-Korad-Cache: HIT
# X-Korad-Strategy: Semantic-Cache
# X-Korad-Billed-Amount: $0.000000 ← 100% savings!

Semantic Similarity

The cache uses cosine similarity to find semantically similar requests:

  • Exact match: Similarity = 1.0
  • Very similar: Similarity > 0.95 (cache hit)
  • Somewhat similar: Similarity 0.85-0.95 (may hit depending on threshold)
  • Different: Similarity < 0.85 (cache miss)

Example Pairs

| Request 1 | Request 2 | Similarity | Cached? |
| --- | --- | --- | --- |
| "Explain JWT" | "How does JWT work?" | 0.97 | ✅ Yes |
| "What is OAuth?" | "Explain OAuth2" | 0.94 | ⚠️ Only below the default 0.95 threshold |
| "Python list sort" | "Sort list in Python" | 0.96 | ✅ Yes |
| "Create API" | "Debug API" | 0.72 | ❌ No |

Cache Invalidation

The cache respects these invalidation rules:

  1. TTL Expiry - Cached responses expire after configured TTL
  2. Model Change - Different model = cache miss
  3. System Prompt - Different system prompt = cache miss
  4. Parameters - Different temperature/max_tokens = cache miss
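One common way to enforce rules 2-4 is to hash every exact-match field into the lookup namespace, so a change to any of them lands in a different namespace and misses; rule 1 is a timestamp check. The sketch below illustrates that pattern only -- it is not the gateway's actual key scheme:

```python
import hashlib
import json
import time

def cache_namespace(model: str, system_prompt: str, params: dict) -> str:
    # Rules 2-4: model, system prompt, and sampling parameters are all
    # hashed into the key, so changing any one of them produces a
    # different namespace (and therefore a cache miss).
    blob = json.dumps(
        {"model": model, "system": system_prompt, "params": params},
        sort_keys=True,
    )
    return "cache:" + hashlib.sha256(blob.encode()).hexdigest()[:16]

def is_expired(stored_at: float, ttl_seconds: int = 3600) -> bool:
    # Rule 1: TTL expiry, measured from when the entry was stored.
    return time.time() - stored_at > ttl_seconds
```

Only requests that agree on every exact-match field are compared by semantic similarity; everything else misses by construction.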

Monitoring

Check cache performance:

curl http://localhost:8081/metrics/cache

# Response:
# {
#   "hit_rate": 0.23,
#   "total_requests": 10000,
#   "cache_hits": 2300,
#   "cache_misses": 7700
# }
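The metrics relate to each other simply: `hit_rate = cache_hits / total_requests`, and misses make up the remainder. A quick sanity check against the sample response above:

```python
def cache_stats(cache_hits: int, total_requests: int) -> dict:
    # hit_rate = cache_hits / total_requests; misses are the rest.
    return {
        "hit_rate": cache_hits / total_requests if total_requests else 0.0,
        "total_requests": total_requests,
        "cache_hits": cache_hits,
        "cache_misses": total_requests - cache_hits,
    }
```

With 2,300 hits out of 10,000 requests, the hit rate is 0.23 and 7,700 requests fell through to later tiers, matching the response shown.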

Best Practices

1. Enable for High-Traffic Endpoints

Semantic cache is most valuable for:

  • FAQ endpoints
  • Code generation helpers
  • Documentation query endpoints

2. Tune Similarity Threshold

A lower threshold produces more cache hits but raises the risk of returning a less relevant cached response:

{
  "cache": {
    "similarity_threshold": 0.90 // More permissive
  }
}

3. Set Appropriate TTL

A shorter TTL keeps cached results fresh but lowers the hit rate:

{
  "cache": {
    "ttl_seconds": 1800 // 30 minutes
  }
}

Technical Details

Vector Embedding Model

The platform uses Cohere embed-v3 for semantic similarity:

  • Dimension: 1024
  • Cost: $0.0001 per 1K tokens
  • Latency: ~50ms

Storage

Cached data is stored in Redis Stack with vector search:

# Check cache size
redis-cli DBSIZE

# View cache keys (use SCAN instead of KEYS in production; KEYS blocks the server)
redis-cli KEYS "cache:*"
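The exact key and value layout is internal to the gateway; as a hypothetical sketch of what one entry under the `cache:` prefix could look like (field names here are illustrative, not the real schema):

```python
import hashlib
import json

def cache_record(prompt: str, response: str,
                 embedding: list[float]) -> tuple[str, str]:
    # Hypothetical layout: one entry per prompt under the "cache:"
    # prefix (matching the KEYS pattern above), with the response and
    # its embedding stored together so vector search can score it.
    key = "cache:" + hashlib.sha256(prompt.encode()).hexdigest()[:16]
    value = json.dumps({"response": response, "embedding": embedding})
    return key, value
```

The same prompt always maps to the same key, which is what makes TTL expiry and overwrites straightforward on the Redis side.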

100% savings on duplicate requests - automatically.