LlamaIndex Integration
Use Korad.AI with LlamaIndex to build RAG applications with automatic cost optimization.
Overview

LlamaIndex works with Korad.AI out of the box for building RAG (Retrieval-Augmented Generation) applications: point LlamaIndex's OpenAI-compatible client at the Korad.AI gateway and requests are routed and cost-optimized automatically.
Installation
pip install llama-index
pip install llama-index-llms-openai
Basic Setup

from llama_index.llms.openai import OpenAI

# Point the LlamaIndex OpenAI client at the Korad.AI gateway
llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
)

# Use the LLM
response = llm.complete("Hello, world!")
print(response.text)
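The same client also supports LlamaIndex's chat interface; a minimal sketch using ChatMessage (imported from llama_index.core.llms):

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are a concise assistant."),
    ChatMessage(role="user", content="Hello, world!"),
]
response = llm.chat(messages)
print(response.message.content)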
With Optimization Headers

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={
        "X-Vanishing-Context": "true"  # Enable document optimization
    },
)
RAG Pipeline

Basic RAG

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# Configure the LLM
llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
)

# Load documents
documents = SimpleDirectoryReader("data").load_data()

# Create the index
index = VectorStoreIndex.from_documents(documents)

# Create a query engine that answers with our LLM
query_engine = index.as_query_engine(llm=llm)

# Query
response = query_engine.query("What is the main topic?")
print(response)
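Re-embedding documents on every run wastes time and embedding spend, so you can persist the index and reload it later; a minimal sketch using LlamaIndex's storage API (the ./storage directory is arbitrary):

from llama_index.core import StorageContext, load_index_from_storage

# Save the index after the first build
index.storage_context.persist(persist_dir="./storage")

# Later: reload instead of re-indexing
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(llm=llm)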
With Optimization

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# Configure with Vanishing Context for large documents
llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={
        "X-Vanishing-Context": "true",  # Best for document QA
        "X-Savings-Level": "med"        # Fallback savings cap
    },
)

# Large document processing
documents = SimpleDirectoryReader("large_documents").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=llm)
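Querying is unchanged; the optimization headers ride along on every LLM call the query engine makes:

response = query_engine.query("Summarize the key points of these documents.")
print(response)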
Chat Engine

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
)

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a chat engine that condenses follow-up questions
chat_engine = index.as_chat_engine(
    chat_mode="condense_question",
    llm=llm,
)

# Chat across turns; the engine keeps conversation history
response = chat_engine.chat("What is the main topic of these documents?")
print(response)

response = chat_engine.chat("What did I ask about earlier?")
print(response)
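Chat engines also support streaming and history reset through the standard chat engine interface:

# Stream a chat response token by token
streaming_response = chat_engine.stream_chat("Give me a one-paragraph summary.")
for token in streaming_response.response_gen:
    print(token, end="")

# Start a fresh conversation
chat_engine.reset()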
Streaming Responses

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
)

# Stream a completion token by token
for chunk in llm.stream_complete("Tell me a story"):
    print(chunk.delta, end="")
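For async applications, the same client exposes an async streaming variant, astream_complete; a minimal sketch reusing the llm configured above:

import asyncio

async def main():
    # astream_complete returns an async generator of response chunks
    gen = await llm.astream_complete("Tell me a story")
    async for chunk in gen:
        print(chunk.delta, end="")

asyncio.run(main())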
Advanced Features

Custom Embeddings

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Configure the LLM and the embedding model separately,
# both routed through Korad.AI
llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
)

embed_model = OpenAIEmbedding(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
)

# Register both as the global defaults used by indexes and query engines
Settings.llm = llm
Settings.embed_model = embed_model
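If your gateway serves a specific embedding model, you can pin it explicitly; the model name below is illustrative and must match one your Korad.AI deployment exposes:

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",  # illustrative; use a model your gateway serves
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
)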
Hybrid Search

# Requires: pip install llama-index-retrievers-bm25
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.llms.openai import OpenAI
from llama_index.retrievers.bm25 import BM25Retriever

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
)

# documents loaded as in the earlier examples
index = VectorStoreIndex.from_documents(documents)

# Hybrid retrieval: fuse vector and BM25 keyword results
vector_retriever = index.as_retriever()
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=5,
)
hybrid_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],
    llm=llm,
    num_queries=1,  # skip query generation, just fuse results
)

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    llm=llm,
)
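The fused engine is then queried like any other (query text illustrative):

response = query_engine.query("Which section covers authentication?")
print(response)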
Multi-Document Agents

# Requires: pip install llama-index-agent-openai
from llama_index.agent.openai import OpenAIAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={
        "X-Korad-RLM": "true"  # For complex multi-document reasoning
    },
)

# Create tools for different document collections
# (index1 and index2 are indexes built as in the earlier examples)
tool1 = QueryEngineTool.from_defaults(
    query_engine=index1.as_query_engine(),
    name="docs1",
    description="Documentation set 1",
)
tool2 = QueryEngineTool.from_defaults(
    query_engine=index2.as_query_engine(),
    name="docs2",
    description="Documentation set 2",
)

# Create the agent
agent = OpenAIAgent.from_tools(
    [tool1, tool2],
    llm=llm,
)

response = agent.chat("Compare the approaches in both document sets")
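The agent's chat response also records which tools were called; a short sketch inspecting response.sources (a list of tool outputs):

print(response.response)

# See which document collections the agent consulted
for tool_output in response.sources:
    print(f"{tool_output.tool_name}: {tool_output.content[:100]}")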
Best Practices

1. Use Vanishing Context for Large Documents

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={"X-Vanishing-Context": "true"},
)
2. Use RLM for Complex Reasoning

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={"X-Korad-RLM": "true"},
)
3. Set a Savings Cap for Budget Control

llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={"X-Savings-Level": "med"},
)
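A practical pattern is one client per workload; the sketch below combines the headers from this page (whether a given combination is honored depends on your Korad.AI configuration):

from llama_index.llms.openai import OpenAI

# Document QA: aggressive context optimization with a savings cap
doc_qa_llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={
        "X-Vanishing-Context": "true",
        "X-Savings-Level": "med",
    },
)

# Agent reasoning: RLM routing for complex multi-step tasks
agent_llm = OpenAI(
    api_base="http://localhost:8084/v1",
    api_key="sk-bf-YOUR_VIRTUAL_KEY",
    model="anthropic/claude-sonnet-4-5-20250929",
    default_headers={"X-Korad-RLM": "true"},
)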
LlamaIndex + Korad.AI = Enterprise RAG with automatic cost savings.