
Prompting, RAG, or Fine-tuning? How Teams Actually Decided (2022–2024)

Thinkscoop Engineering Mar 6, 2024 13 min read
In theory, you could solve most problems with any combination of prompting, RAG, or fine-tuning. In practice, the constraints of 2022–2024 pushed teams toward specific patterns.

Between 2022 and 2024, we watched teams swing between extremes: "just prompt it" one quarter, "we must fine-tune" the next. The projects that made it to stable production had a simpler decision rule: start with the simplest technique that could possibly work, and only add complexity when evaluation demanded it.

A Simple Decision Tree That Aged Well

  1. If the task uses public knowledge and has low stakes, start with prompting.
  2. If the task requires proprietary or frequently-changing knowledge, start with RAG.
  3. If the task requires a model to adopt a consistent style or policy over thousands of examples, consider fine-tuning - but only with a strong evaluation set.
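The decision tree above is simple enough to write down as code. A minimal sketch, with illustrative field and function names that are not from any real library:

```python
from dataclasses import dataclass

@dataclass
class Task:
    uses_public_knowledge: bool
    low_stakes: bool
    knowledge_changes_often: bool
    needs_consistent_style_at_scale: bool
    has_strong_eval_set: bool

def choose_technique(task: Task) -> str:
    """Encode the three-step decision rule: prompting, then RAG, then fine-tuning."""
    if task.uses_public_knowledge and task.low_stakes:
        return "prompting"
    if task.knowledge_changes_often or not task.uses_public_knowledge:
        return "rag"
    if task.needs_consistent_style_at_scale and task.has_strong_eval_set:
        return "fine-tuning"
    # Default to the simplest technique that could possibly work.
    return "prompting"
```

Note the asymmetry: prompting is the fall-through default, and fine-tuning is gated on having an evaluation set - which matches how the successful projects actually behaved.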

Almost every project that jumped straight to fine-tuning in 2023 eventually circled back to build the evaluation and retrieval layers they skipped the first time.

What Prompting Was Quietly Great At

Prompting alone carried more weight than many teams expected. For structured, low-risk tasks - drafting internal summaries, generating test ideas, transforming formats, or assisting with analysis where a human stayed in the loop - carefully designed prompts combined with lightweight evaluation delivered production value quickly without the overhead of new infrastructure.

  • Stable templates: prompts turned into versioned templates with clear input and output contracts
  • Guardrails in code: output validation and post-processing caught most edge cases
  • Fast iteration: product teams could refine behaviour weekly without touching model configuration
  • Good enough quality: in many back-office tasks, 'good enough' from prompting beat 'perfect' that required RAG or fine-tuning
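"Stable templates" and "guardrails in code" are concrete patterns. A minimal sketch of both, assuming a generic LLM call elsewhere returns a raw string (the template name and JSON contract are illustrative):

```python
import json

# Versioned template with an explicit input/output contract.
SUMMARY_PROMPT_V3 = """\
You are drafting an internal summary.
Return JSON with keys "title" (a string) and "bullets" (a list of 3-5 strings).

Source text:
{source}
"""

def render_prompt(source: str) -> str:
    return SUMMARY_PROMPT_V3.format(source=source)

def validate_output(raw: str) -> dict:
    """Guardrail in code: enforce the output contract before anything downstream runs."""
    data = json.loads(raw)
    if not isinstance(data.get("title"), str) or not data["title"].strip():
        raise ValueError("missing or empty title")
    bullets = data.get("bullets")
    if not isinstance(bullets, list) or not 3 <= len(bullets) <= 5:
        raise ValueError("expected 3-5 bullets")
    if not all(isinstance(b, str) for b in bullets):
        raise ValueError("bullets must be strings")
    return data
```

Because the contract lives in code, product teams can tighten or loosen it weekly - the "fast iteration" point above - without touching model configuration.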

When RAG Was the Right First Step

RAG shone whenever proprietary or frequently changing knowledge mattered: policies, contracts, product docs, and support knowledge bases. The critical decision point was whether users needed to see where an answer came from and whether content would change often enough that static fine-tuned knowledge would go stale.

Teams that reached for RAG too late often found themselves rewriting entire feature architectures to add retrieval, citations, and evaluation. Teams that started with a minimal RAG layer - even a simple vector index plus keyword search over key documents - were able to refine their systems incrementally without disruptive rework.
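What "a simple vector index plus keyword search" can look like in miniature: the sketch below uses bag-of-words cosine similarity as a stand-in for real embeddings (a production system would use an embedding model and something like BM25 - the blend-and-rank shape is the point, not these scoring functions):

```python
import math
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_search(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Blend a keyword-overlap score with a (stand-in) vector similarity score."""
    q = Counter(tokens(query))
    scored = []
    for doc in docs:
        d = Counter(tokens(doc))
        keyword = len(set(q) & set(d)) / max(len(set(q)), 1)  # keyword overlap
        vector = cosine(q, d)  # stand-in for embedding similarity
        scored.append((0.5 * keyword + 0.5 * vector, doc))
    return [doc for _, doc in sorted(scored, reverse=True)[:k]]
```

Even something this small gives you a seam to improve incrementally - swap in real embeddings, add citations, tune the blend - without rewriting the feature architecture around it.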

When Fine-Tuning Finally Paid Off

Fine-tuning only made sense in a narrower set of cases than the early hype suggested. It was most valuable where behaviour needed to be deeply, consistently shaped across thousands of similar examples: applying a complex policy, adopting a particular writing style at scale, or following domain-specific reasoning patterns that generic models struggled with.

The projects where it paid off followed the same sequence:

  1. Prove the task with prompting and/or RAG on a small scale first.
  2. Build a strong, labelled evaluation set that captures success and failure modes.
  3. Only then invest in fine-tuning, with a clear hypothesis about what it should improve.
  4. Re-run evaluation regularly to ensure the fine-tuned model keeps its advantages over time.
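Steps 2-4 reduce to a small evaluation harness. A minimal sketch, where `predict` is any callable wrapping your model and exact-match accuracy stands in for whatever metric fits the task (both are assumptions, not a prescribed interface):

```python
def evaluate(predict, eval_set: list[dict]) -> float:
    """Exact-match accuracy over a labelled evaluation set.

    Each example is {"input": ..., "expected": ...}.
    """
    correct = sum(1 for ex in eval_set if predict(ex["input"]) == ex["expected"])
    return correct / len(eval_set)

def worth_fine_tuning(baseline_acc: float, finetuned_acc: float,
                      min_gain: float = 0.05) -> bool:
    # The "clear hypothesis" from step 3, made explicit: only keep the
    # fine-tuned model if it beats the baseline by a meaningful margin.
    return finetuned_acc - baseline_acc >= min_gain
```

Running `evaluate` on both the baseline and the fine-tuned model - and re-running it on a schedule, per step 4 - is what keeps the decision honest over time.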


Key takeaways

  • Prompt engineering was enough for many structured, low-risk tasks
  • RAG dominated where proprietary knowledge and traceability mattered
  • Fine-tuning paid off only when behaviour needed to be deeply, consistently shaped
  • Evaluation and data availability, not fashion, should drive the choice
  • Hybrid patterns (RAG + light fine-tuning) emerged in a few specific niches