"Should we fine-tune a model or use RAG?" is one of the most common AI questions we hear. The answer is usually "both, but start with RAG." Here's the framework for picking — and for combining them when one isn't enough.

What each one actually does

RAG (retrieval-augmented generation) doesn't change the model. It changes what the model sees at inference time. When the user asks a question, your system retrieves relevant documents from a vector store and injects them into the prompt. The model uses them as context.
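
The retrieve-then-inject flow can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the bag-of-words `embed` function stands in for a real embedding model, the `DOCS` dictionary stands in for a vector store, and all names here are made up for the example.

```python
import math
import re
from collections import Counter

# Toy corpus (assumption); in a real system these would be chunks of your documents.
DOCS = {
    "refunds": "Refunds are issued within 14 days of a return request.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
    "warranty": "Hardware carries a two year limited warranty.",
}

def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, k=2):
    """Return the ids of the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(DOCS, key=lambda doc_id: cosine(q, embed(DOCS[doc_id])), reverse=True)
    return ranked[:k]

def build_prompt(question, k=2):
    """Inject the retrieved documents into the prompt as context."""
    context = "\n".join(f"[{doc_id}] {DOCS[doc_id]}" for doc_id in retrieve(question, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How long does shipping take?"))
```

The resulting string is what gets sent to the model. The model's weights are never touched; only the prompt changes.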

Fine-tuning changes the model itself. You take an open-source base model and train it further on your data, adjusting its weights so that it knows your specific domain, or sounds like it.

The first lets the model look things up. The second teaches it to know things.

Start with RAG. Almost always.

For the vast majority of business use cases, RAG is the right starting point. Here's why:

  • It's cheaper. Setting up a vector store and retrieval pipeline takes weeks, not months. Fine-tuning needs GPU time and machine learning engineers.
  • It's auditable. Every answer can cite the documents it drew from, which makes hallucinations easier to catch.
  • It handles changing data. Update a document, re-index, done. With fine-tuning, you retrain.
  • It composes. RAG works with any model — GPT, Claude, Gemini, Llama. You aren't locked into one provider.
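
The "update a document, re-index, done" point can be made concrete with a small sketch. `TinyIndex` is a hypothetical in-memory stand-in for a vector store, and `embed` is a word-count placeholder for an embedding model; neither is a real library API.

```python
import re
from collections import Counter

def embed(text):
    # Stand-in for an embedding model call (assumption): word counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

class TinyIndex:
    """Hypothetical in-memory index: doc id -> (text, vector)."""

    def __init__(self):
        self.entries = {}

    def upsert(self, doc_id, text):
        # Re-indexing one document only recomputes that document's vector.
        self.entries[doc_id] = (text, embed(text))

    def search(self, question, k=1):
        q = embed(question)

        def overlap(doc_id):
            vec = self.entries[doc_id][1]
            return sum(min(q[t], vec[t]) for t in q)

        ranked = sorted(self.entries, key=overlap, reverse=True)
        # Return ids alongside text so an answer can cite its sources.
        return [(doc_id, self.entries[doc_id][0]) for doc_id in ranked[:k]]

index = TinyIndex()
index.upsert("pricing", "The Pro plan costs 40 dollars per month.")
index.upsert("support", "Support is available on weekdays.")
before = index.search("How much does the Pro plan cost?")

index.upsert("pricing", "The Pro plan costs 45 dollars per month.")  # document changed
after = index.search("How much does the Pro plan cost?")
print(before)
print(after)
```

The second search returns the updated text with no training step in between, and the returned id is what makes the answer traceable to a source.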

If you're building a "chat with our docs" experience, a customer support assistant, an internal knowledge tool, or anything where the answer needs to come from a known corpus — RAG. Period.

When fine-tuning actually wins

Fine-tuning earns its complexity in specific cases:

  • Style and tone matter more than facts. If you need the model to write in a specific voice — your brand's, your CEO's, your domain's house style — fine-tuning teaches that better than prompting.
  • Structured output reliability. Fine-tuning on thousands of examples of perfectly formatted JSON or SQL outputs produces more reliable structured generation than prompting alone.
  • Domain-specific reasoning. Highly specialised domains (medical, legal, scientific) where the base model's knowledge is shallow benefit from training on domain corpora.
  • Latency or cost at scale. A fine-tuned small model can sometimes match a large model's quality on a narrow task — and run 10x cheaper.
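
For the structured-output case, it helps to measure how often a prompted model already produces valid output before committing to fine-tuning. The sketch below is one way to do that, under assumptions: the required keys are a hypothetical schema, and the sample outputs are made up for illustration.

```python
import json

REQUIRED_KEYS = {"customer_id", "intent"}  # hypothetical schema

def is_valid(output):
    """True if the output parses as a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)

def validity_rate(outputs):
    """Fraction of outputs that pass the validity check."""
    return sum(is_valid(o) for o in outputs) / len(outputs)

# Made-up model outputs, for illustration only.
samples = [
    '{"customer_id": 17, "intent": "refund"}',
    '{"customer_id": 18}',
    'Sure! Here is the JSON: {"customer_id": 19, "intent": "cancel"}',
    '{"customer_id": 20, "intent": "upgrade"}',
]
print(validity_rate(samples))  # 0.5
```

Running the same check on a prompted model and a fine-tuned candidate gives a like-for-like number to justify, or rule out, the extra cost.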

The hybrid: when you need both

Most sophisticated production systems eventually combine the two. A fine-tuned model handles the style, reasoning, and structured output. RAG handles the up-to-date knowledge. The result is more reliable than either alone.

For example, a legal AI assistant we worked on used:

  • A fine-tuned model that knew how to write in legal style, format citations correctly, and follow the firm's house argument structure.
  • A RAG layer that pulled in the actual case law and statutes relevant to each query.

Either alone was insufficient. RAG without fine-tuning produced answers in the wrong style. Fine-tuning without RAG hallucinated case citations. Together, they worked.
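
One way to picture the division of labour: the fine-tuned model writes the answer, the retrieval layer supplies the sources, and a simple check flags any citation that did not come from retrieval. This is an illustrative sketch, not the system described above; the `[id]` citation format, the ids, and the sample answer are all assumptions.

```python
import re

def cited_ids(answer):
    """Extract citation markers of the form [id] from a generated answer."""
    return set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))

def ungrounded_citations(answer, retrieved_ids):
    """Citations in the answer that do not match any retrieved document."""
    return cited_ids(answer) - set(retrieved_ids)

retrieved = ["case-101", "statute-7"]  # ids returned by the RAG layer (made up)
answer = "Under [statute-7] and [case-999], the claim fails."  # made-up model output
print(ungrounded_citations(answer, retrieved))  # {'case-999'}
```

An empty result means every citation traces back to a retrieved document; anything else is a candidate hallucination to review.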

The real cost comparison

For a rough sense of relative cost:

  • RAG MVP: 2-4 weeks engineering, ~$5K in infrastructure for the first year.
  • Production RAG: 6-12 weeks engineering, $20-50K/year in infrastructure depending on volume.
  • Fine-tuning MVP: 6-12 weeks engineering plus ML expertise, $10-30K in training compute, plus an evaluation budget.
  • Production fine-tuning: Ongoing ML engineering, retraining costs with every base-model update, evaluation infrastructure. $100K+/year is realistic.

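Note the gap. On the compute and infrastructure figures above, fine-tuning costs at least twice as much as RAG, and once ongoing ML engineering is counted it is roughly an order of magnitude more expensive across its lifecycle. That cost is justified for the right use cases, but it's not the default.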
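
The dollar figures in the list above can be compared directly. The arithmetic below covers only the quoted infrastructure and compute numbers and treats "$100K+" as a lower bound; engineering time is excluded, and it is likely where much of the remaining gap sits.

```python
# Figures copied from the list above, in thousands of dollars (low, high).
rag_mvp = (5, 5)
ft_mvp = (10, 30)
rag_prod = (20, 50)
ft_prod_floor = 100  # "$100K+" treated as a lower bound

def ratio_range(numerator, denominator):
    """Smallest and largest ratio between two (low, high) ranges."""
    return numerator[0] / denominator[1], numerator[1] / denominator[0]

mvp_low, mvp_high = ratio_range(ft_mvp, rag_mvp)
prod_vs_high_rag = ft_prod_floor / rag_prod[1]
prod_vs_low_rag = ft_prod_floor / rag_prod[0]

print(f"MVP: fine-tuning is {mvp_low:.0f}x to {mvp_high:.0f}x the RAG figure")
print(f"Production: at least {prod_vs_high_rag:.0f}x (high-volume RAG) "
      f"or {prod_vs_low_rag:.0f}x (low-volume RAG)")
```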

The most expensive AI mistake we see is fine-tuning when RAG would have worked. The second is the opposite. Pick deliberately.

How to actually decide

Ask these questions in order:

  1. Does the answer need to come from a specific corpus that updates over time? → RAG.
  2. Does the model need to write in a specific style or format? → Fine-tuning.
  3. Both? → Hybrid. Start with RAG, layer fine-tuning when the style gap matters.
  4. Neither — you just need a good chatbot? → Probably just prompting. Don't over-engineer.
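
The four questions above reduce to two yes/no inputs, which can be written as a small function. This is just a restatement of the list in code; the function and parameter names are invented for the example.

```python
def choose_approach(needs_corpus, needs_style):
    """Map the two questions above to a recommendation.

    needs_corpus: answers must come from a specific corpus that updates over time.
    needs_style:  the model must write in a specific style or format.
    """
    if needs_corpus and needs_style:
        return "hybrid: start with RAG, add fine-tuning when the style gap matters"
    if needs_corpus:
        return "RAG"
    if needs_style:
        return "fine-tuning"
    return "prompting"

for corpus in (True, False):
    for style in (True, False):
        print(corpus, style, "->", choose_approach(corpus, style))
```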

The mistake we see most often is teams reaching for fine-tuning because it sounds more impressive than retrieval. It is rarely the better choice, and it can cost around ten times as much. Start with the cheaper option. Add complexity when the cheaper option clearly fails.