LLM API costs can spiral out of control quickly. Here's how we reduced our customers' average bill by 70%.
## 1. Semantic Caching

Cache responses for semantically similar queries:

```python
from langchain.cache import RedisSemanticCache
from langchain.embeddings import OpenAIEmbeddings
from langchain.globals import set_llm_cache

set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.95,  # similarity threshold for a cache hit
))
```
This alone can reduce costs by 40-50% for applications with repetitive queries.
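The lookup behind a semantic cache is simple: embed the incoming query, compare it against cached embeddings, and return the stored response on a close match. A toy in-memory sketch (`embed_fn` stands in for a real embedding model; production backends like Redis use a vector index rather than a linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """Toy semantic cache: returns a stored response when a new query's
    embedding is close enough to a cached one, else signals a miss."""

    def __init__(self, embed_fn, score_threshold=0.95):
        self.embed_fn = embed_fn              # maps text -> list[float]
        self.score_threshold = score_threshold
        self.entries = []                     # list of (embedding, response)

    def get(self, query):
        qv = self.embed_fn(query)
        for ev, response in self.entries:
            if cosine_similarity(qv, ev) >= self.score_threshold:
                return response               # cache hit: skip the LLM call
        return None                           # miss: caller hits the API

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```

The threshold is the knob to watch: too low and users get answers to the wrong question; too high and you lose most of the hit rate.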
## 2. Prompt Optimization

**Shorter prompts = lower costs.** GPT-4 charges per token, so optimize your prompts:

Before (847 tokens):

```text
You are a helpful assistant that helps users with their questions...
```

After (156 tokens):

```text
Role: Technical support agent
Task: Answer user questions concisely
Format: Markdown, max 3 paragraphs
```
**Use system messages efficiently.** System messages are sent with every request, so keep them minimal and move dynamic content into user messages.
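To see what trimming buys you, put a price on the tokens. A quick sketch (the per-1K prices below are illustrative assumptions, not current rates; check your provider's rate card):

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price=0.01, output_price=0.03):
    """Dollar cost of one request given per-1K-token prices.
    Default prices are illustrative placeholders, not live pricing."""
    return (prompt_tokens * input_price
            + completion_tokens * output_price) / 1000

# Trimming the system prompt from 847 to 156 tokens saves
# (847 - 156) * $0.01 / 1000 ≈ $0.007 per request, roughly
# $6,900 per million requests, before any output-side savings.
per_request_saving = estimate_cost(847, 0) - estimate_cost(156, 0)
```

Because the system prompt rides along on every request, even small trims compound across your whole traffic.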
## 3. Model Selection

Not every query needs GPT-4:

```python
def select_model(query: str, context: dict) -> str:
    """Route each query to the cheapest model that can handle it."""
    if context.get("complexity") == "high":
        return "gpt-4-turbo"
    return "gpt-3.5-turbo"  # default to the cheaper model
```
Route 80% of queries to GPT-3.5 and save the expensive models for complex tasks.
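The savings from an 80/20 split are easy to estimate. A sketch with placeholder prices (substitute your provider's actual per-1K-token rates):

```python
def blended_input_cost(cheap_price, expensive_price, cheap_fraction):
    """Average input cost per 1K tokens for traffic split between a
    cheap and an expensive model. Prices here are placeholders."""
    return cheap_fraction * cheap_price + (1 - cheap_fraction) * expensive_price

# 80% of queries on a $0.0005/1K model, 20% on a $0.01/1K model:
blended = blended_input_cost(0.0005, 0.01, 0.80)
```

With these placeholder numbers the blended rate is $0.0024/1K versus $0.01/1K for sending everything to the expensive model, a reduction of about 76% on input tokens.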
## 4. Streaming Responses

Streaming doesn't reduce costs directly, but it improves perceived performance: when output appears immediately, users tolerate shorter responses, so you can often cap output length without hurting satisfaction.
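The metric that matters here is time-to-first-token, not total latency. A small simulation (the `fake_stream` generator is a stand-in for an API call with `stream=True`):

```python
import time

def fake_stream(text, delay=0.0):
    """Simulated token stream; stands in for a streaming API response."""
    for word in text.split():
        time.sleep(delay)
        yield word + " "

def consume(stream):
    """Render tokens as they arrive and record time-to-first-token,
    which is the latency users actually perceive."""
    start = time.monotonic()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        parts.append(token)
    return "".join(parts).strip(), first_token_at
```

With a real streaming response, `first_token_at` stays roughly constant as responses get longer, which is why streamed answers feel fast even when generation takes seconds.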
## 5. Token Budget Management

Set hard limits per user/request:

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Summarize our refund policy."}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=500,   # hard cap on output tokens (input is not limited by this)
    temperature=0.7,
)
```
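Per-request caps aren't enough on their own; a per-user budget stops any one user from burning through your quota. A minimal in-memory sketch (a hypothetical `TokenBudget` helper, assuming your API response reports usage, as OpenAI's `usage.total_tokens` does; production code would persist this in Redis or a database):

```python
from collections import defaultdict

class TokenBudget:
    """Hypothetical per-user daily token budget: call check() before a
    request and record() with the usage reported by the API response."""

    def __init__(self, daily_limit=100_000):
        self.daily_limit = daily_limit
        self.used = defaultdict(int)   # user_id -> tokens spent today

    def check(self, user_id, max_tokens=500):
        """Allow a request only if its worst case fits in the budget."""
        return self.used[user_id] + max_tokens <= self.daily_limit

    def record(self, user_id, total_tokens):
        """Charge the actual usage after the response comes back."""
        self.used[user_id] += total_tokens
```

Checking against `max_tokens` before the call, then recording actual usage after, means the budget can never be overshot by a single runaway request.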
## Results

After implementing these strategies, our customers typically see:
| Strategy | Cost Reduction |
|----------|---------------|
| Semantic Caching | 40-50% |
| Prompt Optimization | 15-20% |
| Model Selection | 20-30% |
| Combined | 60-70% |
Start with caching: it's the quickest win.