The AI cost optimization playbook: 7 tactics that actually work
Practical cost reduction: tiered routing, prompt caching, output constraints, prompt compression, usage alerts, batch processing, and per-feature cost tracking.
I audited an AI product's bill last month: $48,000 for 900 MAU. The team used Claude Opus for every request because 'it gives the best answers.' No tiered routing, no caching, no alerts. Two weeks of work brought the bill to $12,000 with no measurable quality regression. Here are the seven tactics that delivered those savings.
7 tactics ranked by impact
- Tiered model routing (30–40% savings): Classify every request. Simple extraction → fast tier. Chat → balanced tier. Complex reasoning → frontier tier. Sending 'What is the return policy?' to Opus costs 5x what it costs on Haiku, for the same answer.
- Prompt caching (15–25% savings): Claude and Gemini offer prompt caching. Sonnet cache reads cost $0.30/1M tokens vs $3/1M for regular input, a 10x reduction. Caching matches on the prompt prefix, so put static content (system prompt, tool definitions, reference documents) first and per-request content last to maximize cache hits.
- Output token constraints (10–15% savings): Set max_tokens to the 95th percentile of actual usage per workflow, not the model maximum. Classification needs 50 tokens, not 16,384.
- Prompt compression (8–12% savings): Audit system prompts. A 4000-token prompt that could be 800 burns money every call. Move static knowledge to retrieval, not the prompt.
- Usage alerts and budgets (preventative): Hard spending caps per org, soft alerts at 50%/80%/95%. Anomaly detection when daily spend exceeds 2.5x the 7-day average.
- Batch non-interactive work (5–10% savings): Nightly reports and bulk classification at batch pricing (50% off with 24h turnaround). Real-time for users, batch for background.
- Track cost per feature (visibility): Tag every call with a feature ID. You might discover your free search feature costs $3K/month while core chat costs $9K.
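To make the tiered routing tactic concrete, here is a minimal sketch in Python. The tier names, the keyword hints, and the word-count threshold are all assumptions for illustration. A production classifier would more likely be a small model or a set of rules tuned on real traffic.

```python
# Hypothetical routing table following the tiers named in the list above.
ROUTES = {
    "extraction": "fast",
    "chat": "balanced",
    "reasoning": "frontier",
}

# Assumed keyword hints; tune these against your own request logs.
REASONING_HINTS = ("analyze", "compare", "step by step", "prove", "design")


def classify(text: str) -> str:
    """Rough heuristic classifier (a placeholder, not a recommended method)."""
    lowered = text.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        return "reasoning"
    # Short direct questions are treated as simple lookups.
    if len(text.split()) <= 12 and lowered.rstrip().endswith("?"):
        return "extraction"
    return "chat"


def route(text: str, default_tier: str = "balanced") -> str:
    """Return the model tier for a request, falling back to a default tier."""
    return ROUTES.get(classify(text), default_tier)
```

With these assumed rules, `route("What is the return policy?")` lands on the fast tier rather than the frontier one. Logging the chosen tier next to a quality signal is one way to check that the routing is not degrading answers.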
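The prompt caching numbers are easier to judge with a small calculation. The sketch below uses the Sonnet prices quoted above ($3/1M input, $0.30/1M cache reads) as defaults. It deliberately ignores cache-write surcharges and output tokens, and the function name is hypothetical.

```python
def input_cost_usd(
    prompt_tokens: int,
    cached_tokens: int,
    input_price_per_m: float = 3.00,       # price quoted in the article
    cache_read_price_per_m: float = 0.30,  # price quoted in the article
) -> float:
    """Input cost of one call when `cached_tokens` of the prompt hit the cache.

    Assumption: cache writes and output tokens are not modeled here.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    total = (
        uncached_tokens * input_price_per_m
        + cached_tokens * cache_read_price_per_m
    )
    return total / 1_000_000


def input_savings_fraction(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of input cost saved compared with sending everything uncached."""
    baseline = input_cost_usd(prompt_tokens, 0)
    return 1 - input_cost_usd(prompt_tokens, cached_tokens) / baseline
```

For a 4,000-token prompt with a 3,500-token static prefix, the input cost per call drops from $0.012 to about $0.00255 under these assumptions. The overall bill falls by less than that, since output tokens and cache misses still cost full price.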
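For the output token constraint, one way to derive a per-workflow `max_tokens` is the nearest-rank 95th percentile of observed output lengths. This is a sketch under the assumption that you already log output token counts per workflow; the function names are illustrative.

```python
def p95_max_tokens(observed_output_tokens: list[int]) -> int:
    """Nearest-rank 95th percentile of observed output token counts."""
    if not observed_output_tokens:
        raise ValueError("need at least one observation")
    ordered = sorted(observed_output_tokens)
    # Integer ceiling of 0.95 * n (1-indexed rank), avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def caps_by_workflow(samples: dict[str, list[int]]) -> dict[str, int]:
    """Compute one max_tokens cap per workflow from logged output lengths."""
    return {name: p95_max_tokens(tokens) for name, tokens in samples.items()}
```

Responses in the top 5% get truncated at this cap, so it is worth checking whether those are legitimate long answers before applying it to user-facing workflows.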
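The alerting rules translate almost directly into code. The sketch below assumes you can query spend to date per org and daily spend for the previous seven days. The function names and the choice to require exactly seven days of history are assumptions.

```python
def soft_alerts_crossed(
    spend: float,
    budget: float,
    thresholds: tuple[float, ...] = (0.50, 0.80, 0.95),
) -> list[float]:
    """Return the soft-alert thresholds that current spend has reached."""
    return [t for t in thresholds if spend >= t * budget]


def hard_cap_reached(spend: float, budget: float) -> bool:
    """True once spend hits the hard cap and further calls should be blocked."""
    return spend >= budget


def is_spend_anomaly(
    today: float,
    last_7_days: list[float],
    multiplier: float = 2.5,
) -> bool:
    """Flag today's spend if it exceeds `multiplier` times the 7-day average."""
    if len(last_7_days) != 7:
        raise ValueError("expected exactly 7 daily spend values")
    average = sum(last_7_days) / 7
    return today > multiplier * average
```

With a 7-day average of $400 per day, the anomaly check fires above $1,000. A new org with near-zero history will trip it easily, so a minimum absolute threshold may be worth adding.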
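Per-feature cost tracking only requires that each usage record carries a feature tag. A minimal sketch follows, assuming a hypothetical `UsageEvent` record that stores token counts and the per-million prices in effect for that call.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    """One model call, tagged with the feature that triggered it (hypothetical schema)."""
    feature_id: str
    input_tokens: int
    output_tokens: int
    input_price_per_m: float   # USD per 1M input tokens for the model used
    output_price_per_m: float  # USD per 1M output tokens for the model used

    def cost_usd(self) -> float:
        total = (
            self.input_tokens * self.input_price_per_m
            + self.output_tokens * self.output_price_per_m
        )
        return total / 1_000_000


def cost_by_feature(events: list[UsageEvent]) -> dict[str, float]:
    """Sum call costs per feature ID."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event.feature_id] += event.cost_usd()
    return dict(totals)
```

Storing the price on the event rather than looking it up later keeps historical totals stable when provider prices change.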