LLM API costs can spiral out of control quickly. Here's how we reduced our customers' average bill by 70%.
## 1. Semantic Caching

Cache responses for semantically similar queries:

```python
from langchain.cache import RedisSemanticCache
from langchain.embeddings import OpenAIEmbeddings
from langchain.globals import set_llm_cache

set_llm_cache(RedisSemanticCache(
    redis_url="redis://localhost:6379",
    embedding=OpenAIEmbeddings(),
    score_threshold=0.95,  # similarity threshold for a cache hit
))
```
This alone can reduce costs by 40-50% for applications with repetitive queries.
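The lookup behind a semantic cache is simple: embed the incoming query, compare it against cached embeddings, and return the stored response on a close match. A toy in-memory sketch (`embed_fn` stands in for a real embedding model; production backends like Redis use a vector index rather than a linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SemanticCache:
    """Toy semantic cache: returns a stored response when a new query's
    embedding is close enough to a cached one, else signals a miss."""

    def __init__(self, embed_fn, score_threshold=0.95):
        self.embed_fn = embed_fn              # maps text -> list[float]
        self.score_threshold = score_threshold
        self.entries = []                     # list of (embedding, response)

    def get(self, query):
        qv = self.embed_fn(query)
        for ev, response in self.entries:
            if cosine_similarity(qv, ev) >= self.score_threshold:
                return response               # cache hit: skip the LLM call
        return None                           # miss: caller hits the API

    def put(self, query, response):
        self.entries.append((self.embed_fn(query), response))
```

The threshold is the knob to watch: too low and users get answers to the wrong question; too high and you lose most of the hit rate.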
## 2. Prompt Optimization

**Shorter prompts = lower costs.** GPT-4 charges per token, so optimize your prompts:

Before (847 tokens):

```text
You are a helpful assistant that helps users with their questions...
```

After (156 tokens):

```text
Role: Technical support agent
Task: Answer user questions concisely
Format: Markdown, max 3 paragraphs
```
**Use system messages efficiently.** System messages are sent with every request, so keep them minimal and move dynamic content into user messages.
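To see what trimming buys you, put a price on the tokens. A quick sketch (the per-1K prices below are illustrative assumptions, not current rates; check your provider's rate card):

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price=0.01, output_price=0.03):
    """Dollar cost of one request given per-1K-token prices.
    Default prices are illustrative placeholders, not live pricing."""
    return (prompt_tokens * input_price
            + completion_tokens * output_price) / 1000

# Trimming the system prompt from 847 to 156 tokens saves
# (847 - 156) * $0.01 / 1000 ≈ $0.007 per request, roughly
# $6,900 per million requests, before any output-side savings.
per_request_saving = estimate_cost(847, 0) - estimate_cost(156, 0)
```

Because the system prompt rides along on every request, even small trims compound across your whole traffic.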
## 3. Model Selection

Not every query needs GPT-4:

```python
def select_model(query: str, context: dict) -> str:
    """Route each query to the cheapest model that can handle it."""
    if context.get("complexity") == "high":
        return "gpt-4-turbo"
    return "gpt-3.5-turbo"  # default to the cheaper model
```
Route 80% of queries to GPT-3.5 and save the expensive models for complex tasks.
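The savings from an 80/20 split are easy to estimate. A sketch with placeholder prices (substitute your provider's actual per-1K-token rates):

```python
def blended_input_cost(cheap_price, expensive_price, cheap_fraction):
    """Average input cost per 1K tokens for traffic split between a
    cheap and an expensive model. Prices here are placeholders."""
    return cheap_fraction * cheap_price + (1 - cheap_fraction) * expensive_price

# 80% of queries on a $0.0005/1K model, 20% on a $0.01/1K model:
blended = blended_input_cost(0.0005, 0.01, 0.80)
```

With these placeholder numbers the blended rate is $0.0024/1K versus $0.01/1K for sending everything to the expensive model, a reduction of about 76% on input tokens.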
## 4. Streaming Responses

Streaming doesn't reduce costs directly, but it improves perceived performance: when output appears immediately, users tolerate shorter responses, so you can often cap output length without hurting satisfaction.
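The metric that matters here is time-to-first-token, not total latency. A small simulation (the `fake_stream` generator is a stand-in for an API call with `stream=True`):

```python
import time

def fake_stream(text, delay=0.0):
    """Simulated token stream; stands in for a streaming API response."""
    for word in text.split():
        time.sleep(delay)
        yield word + " "

def consume(stream):
    """Render tokens as they arrive and record time-to-first-token,
    which is the latency users actually perceive."""
    start = time.monotonic()
    first_token_at = None
    parts = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.monotonic() - start
        parts.append(token)
    return "".join(parts).strip(), first_token_at
```

With a real streaming response, `first_token_at` stays roughly constant as responses get longer, which is why streamed answers feel fast even when generation takes seconds.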
## 5. Token Budget Management

Set hard limits per user/request:

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Summarize our refund policy."}]

response = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    max_tokens=500,   # hard cap on output tokens (input is not limited by this)
    temperature=0.7,
)
```
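Per-request caps aren't enough on their own; a per-user budget stops any one user from burning through your quota. A minimal in-memory sketch (a hypothetical `TokenBudget` helper, assuming your API response reports usage, as OpenAI's `usage.total_tokens` does; production code would persist this in Redis or a database):

```python
from collections import defaultdict

class TokenBudget:
    """Hypothetical per-user daily token budget: call check() before a
    request and record() with the usage reported by the API response."""

    def __init__(self, daily_limit=100_000):
        self.daily_limit = daily_limit
        self.used = defaultdict(int)   # user_id -> tokens spent today

    def check(self, user_id, max_tokens=500):
        """Allow a request only if its worst case fits in the budget."""
        return self.used[user_id] + max_tokens <= self.daily_limit

    def record(self, user_id, total_tokens):
        """Charge the actual usage after the response comes back."""
        self.used[user_id] += total_tokens
```

Checking against `max_tokens` before the call, then recording actual usage after, means the budget can never be overshot by a single runaway request.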
## Results

After implementing these strategies, our customers typically see:
| Strategy | Cost Reduction |
|----------|---------------|
| Semantic Caching | 40-50% |
| Prompt Optimization | 15-20% |
| Model Selection | 20-30% |
| Combined | 60-70% |
Start with caching: it's the quickest win.