Across 2023–24 we saw AI bills spike - and then drop by 40–60% - once teams added basic observability, routing, and context hygiene.
In early 2023, nobody knew what a normal LLM bill looked like. By 2024, many teams had at least one painful invoice that forced them to take optimisation seriously. The good news: a handful of straightforward changes usually delivered outsized savings - often 40–60% reductions without any meaningful quality regression.
The pattern we saw consistently: teams would launch an AI feature, ship it to production, and forget about cost until the next billing cycle. Then a large, surprising number would show up on the invoice, prompting a frantic optimisation sprint. The teams that avoided this cycle were the ones who treated cost as a first-class engineering metric from day one - not something finance worried about.
See the Cost First
Before optimising anything, we always started by breaking spend down per endpoint and per tenant. A typical pattern: 10–20% of requests accounted for 60–80% of tokens, usually because they used the largest model with the longest contexts. Once you can see that breakdown, trade-offs write themselves. You no longer need to discuss cost in the abstract - you can point to the specific endpoints and usage patterns that are driving the bill.
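A minimal sketch of that breakdown. The log format, endpoint names, and flat per-1k-token prices here are all illustrative assumptions - in practice the data would come from your gateway logs or your provider's usage API, priced per model.

```python
from collections import defaultdict

# Hypothetical request log: (endpoint, tenant, prompt_tokens, completion_tokens).
requests = [
    ("analytics/classify", "tenant-a", 12000, 40),
    ("chat/reply",         "tenant-b",   800, 300),
    ("analytics/classify", "tenant-c", 15000, 35),
    ("search/summarise",   "tenant-a",  2000, 500),
]

def spend_by_endpoint(requests, price_per_1k_prompt=0.01, price_per_1k_completion=0.03):
    """Aggregate estimated cost per endpoint (illustrative flat pricing)."""
    totals = defaultdict(float)
    for endpoint, _tenant, p_tok, c_tok in requests:
        totals[endpoint] += p_tok / 1000 * price_per_1k_prompt
        totals[endpoint] += c_tok / 1000 * price_per_1k_completion
    grand = sum(totals.values())
    # Sort by cost, highest first, with each endpoint's share of total spend.
    return sorted(
        ((ep, cost, cost / grand) for ep, cost in totals.items()),
        key=lambda row: row[1], reverse=True,
    )

for endpoint, cost, share in spend_by_endpoint(requests):
    print(f"{endpoint:20s} ${cost:6.3f}  {share:5.1%}")
```

Even on this toy data, one endpoint dominates - which is exactly the shape of conversation the breakdown is meant to enable.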
A real example from 2023
One client reduced their monthly LLM spend by 52% in two weeks without changing any model. The entire saving came from discovering that a single internal analytics endpoint was sending full conversation histories to a large model for a simple classification task. Moving that endpoint to a smaller model with a trimmed context cut its cost by 89%. The other endpoints were untouched.
Right-Sizing: The Single Biggest Lever
Model selection is the most powerful cost lever available. In 2023 many teams defaulted to the largest available model because it gave the best demo results. By 2024, teams that evaluated their actual production tasks found that a large proportion of them could be handled by smaller, cheaper models with no meaningful quality drop.
The right approach is empirical, not intuitive. Build a task-specific evaluation set, run it against multiple model sizes, and look at where the quality curves diverge. For many structured extraction, classification, and short-form generation tasks, models that cost 10x less produce outputs that are 95% as good. That 5% gap matters for some tasks and not at all for others.
- Simple classification or routing: small models (GPT-3.5 tier, Claude Haiku) are almost always sufficient
- Structured extraction from short documents: mid-tier models handle this well with a good prompt
- Long-document summarisation or reasoning: large models earn their cost on genuine complexity
- User-facing conversational responses where quality perception matters: test both, measure user satisfaction, not just automated metrics
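The empirical comparison above can be sketched as a small harness. Everything here is an assumption for illustration: `call_model` stands in for your provider SDK (stubbed with keyword matching so the example runs), and the eval set and model names are made up.

```python
# Hypothetical labelled eval set for a support-ticket classification task.
EVAL_SET = [
    {"input": "Cancel my subscription", "expected": "billing"},
    {"input": "The app crashes on login", "expected": "bug"},
    {"input": "How do I export my data?", "expected": "howto"},
]

def call_model(model, text):
    # Stub standing in for a real provider API call.
    keywords = {"cancel": "billing", "crash": "bug", "export": "howto"}
    for kw, label in keywords.items():
        if kw in text.lower():
            return label
    return "unknown"

def accuracy(model, eval_set):
    """Fraction of eval examples the model labels correctly."""
    hits = sum(call_model(model, ex["input"]) == ex["expected"] for ex in eval_set)
    return hits / len(eval_set)

# Run the same eval set against each tier and compare where quality diverges.
for model in ["small-model-tier", "large-model-tier"]:
    print(model, accuracy(model, EVAL_SET))
```

The point is the structure, not the stub: once accuracy per tier sits in a table next to cost per tier, the right-sizing decision stops being a matter of opinion.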
Context Hygiene: Stop Paying for Tokens You Do Not Need
One of the most common cost problems we found in 2023–24 was context bloat. Systems that were designed during rapid prototyping often sent entire conversation histories, full documents, or multiple redundant prompt sections on every call. In production at scale, those extra tokens become a major cost driver.
1. Audit your prompt templates and count tokens - most teams are shocked by how large they are
2. Trim conversation history to the last 5–10 turns rather than the full session, unless continuity genuinely requires more
3. Use summarisation for long-running conversations - periodically compress earlier context into a short summary
4. Filter retrieved documents by relevance score before including them - do not include documents just because they might help
5. Remove redundant instructions from system prompts - duplicate safety instructions and verbose formatting guidance add up
Intelligent Routing: The 2024 Pattern That Stuck
By late 2024, the most cost-efficient production systems had adopted intelligent routing: a lightweight classifier at the front of the system that triaged incoming requests and sent simple ones to cheaper models while reserving expensive models for the queries that actually needed them.
Routing can be as simple as a rule-based system (short questions go to small model, long complex queries go to large model) or as sophisticated as a trained classifier that predicts which model tier will produce acceptable quality for a given input. Both approaches showed meaningful savings. The simple rule-based version was usually good enough as a first step and could be implemented in a day.
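A minimal sketch of the rule-based version. The length threshold, the marker phrases, and the tier names are all assumptions to illustrate the shape - the real rules come from looking at your own traffic.

```python
def route(query: str) -> str:
    """Toy rule-based router: default to the cheap tier, escalate to the
    expensive tier only for long or plainly complex queries.
    Tier names and thresholds are placeholders."""
    complex_markers = ("compare", "explain why", "step by step", "trade-off")
    if len(query) > 500 or any(m in query.lower() for m in complex_markers):
        return "large-model-tier"
    return "small-model-tier"
```

A trained classifier can replace this function later without touching anything downstream, which is why starting with the one-day rule-based version is rarely wasted work.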
Caching: The Free Saving Most Teams Ignore
Semantic caching - storing responses to similar previous queries and returning cached results when a new query is semantically close enough - can dramatically reduce costs for use cases where users ask similar questions repeatedly. Internal knowledge assistants, product FAQ bots, and support tools are particularly good candidates.
- Exact match caching: free with Redis or a simple in-memory store; effective for templated queries
- Semantic caching: use vector similarity to find close-enough previous queries; GPTCache and similar tools made this accessible by 2024
- Cache invalidation strategy: define when a cached response is too old or the underlying data has changed enough to require a fresh generation
- Monitor cache hit rates: a low hit rate means your queries are too varied for caching to help significantly
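The two cache tiers from the list above can be sketched together. To keep the example self-contained, `difflib` string similarity stands in for the embedding-based similarity a real semantic cache (such as GPTCache) would use - the lookup structure is the point, not the distance metric.

```python
import difflib

class SimpleSemanticCache:
    """Sketch: exact-match lookup first, fuzzy fallback second.
    Real semantic caches compare embedding vectors; difflib is a
    stand-in so this example runs without external services."""

    def __init__(self, threshold=0.9):
        self.store = {}          # query -> cached response
        self.threshold = threshold

    def get(self, query):
        if query in self.store:  # tier 1: exact match
            return self.store[query]
        # Tier 2: close-enough previous query above the similarity cutoff.
        match = difflib.get_close_matches(query, list(self.store), n=1,
                                          cutoff=self.threshold)
        return self.store[match[0]] if match else None

    def put(self, query, response):
        self.store[query] = response
```

Note that the threshold is a product decision as much as a technical one: too loose and users get stale answers to different questions, too strict and the hit rate collapses.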
When LLMs Are Not the Right Tool at All
A less comfortable but important cost observation: some workloads that were using LLMs in 2023–24 did not actually need them. Regex, keyword matching, classical ML classifiers, and simple rule engines solved a significant fraction of the tasks that had been naively handed to expensive LLMs because LLMs were the exciting new thing.
Teams that audited their LLM usage and ruthlessly moved non-generative tasks to cheaper components - while keeping LLMs for tasks that genuinely required language understanding or generation - achieved the largest cost reductions. This required honest evaluation of what each task actually needed, rather than a default assumption that LLMs were the right tool for everything.
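A concrete (hypothetical) instance of that audit: a task phrased as "extract the order ID from this email" needs no language understanding at all if order IDs follow a known format. The `ORD-` pattern here is invented for illustration.

```python
import re

# Assumed format: order IDs look like ORD- followed by exactly six digits.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")

def extract_order_id(text):
    """Regex replacement for an extraction task once routed to an LLM."""
    match = ORDER_ID.search(text)
    return match.group(0) if match else None
```

The regex is faster, free, deterministic, and testable - four properties the LLM version lacked.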
Key takeaways
- Most AI cost was driven by a small number of endpoints and tenants
- Right-sizing models and context often cut cost without hurting quality
- Intelligent routing became a standard pattern by late 2024
- Some workloads never needed LLMs in the first place
- Cost dashboards created healthier conversations with finance