Fine-tuning works when you do it well and produces worse outputs when you do it badly. The hardest part isn't the training — it's the dataset preparation and evaluation. Here's the workflow we follow.
Decide if fine-tuning is the right answer
Before any training: confirm fine-tuning is the right approach. Most "I need fine-tuning" requests are actually "I need better prompting" or "I need RAG." See our decision tree.
Fine-tuning makes sense for: style consistency, structured output reliability, narrow-task latency or cost optimisation, or domain-specific reasoning that prompting can't reach.
Pick the base model
Smaller models fine-tune faster and run cheaper. Bigger models start with more general capability. Our defaults:
- 7B for style-and-format tasks where the base capability is sufficient.
- 32B-70B for domain reasoning tasks.
- Avoid fine-tuning the smallest models (1-3B) unless you have a very narrow task — they often can't hold complex instructions.
Dataset preparation
This is 80% of the work. The principles:
- Quality over quantity. 1,000 hand-curated examples beats 100,000 noisy ones.
- Cover the distribution. Examples should span the breadth of what the model will see in production.
- Include edge cases. Adversarial inputs, ambiguous queries, out-of-domain questions — show the model what to do.
- Maintain quality drift. Re-review the dataset every few months as your understanding of "good" evolves.
Format the data correctly
Use the model's chat template. Don't invent your own. The model was pre-trained on a specific format — fine-tuning with a different format wastes capacity.
The training itself
- LoRA or QLoRA for most cases. Cheap, fast, and modular — you can swap LoRA adapters per task.
- Full fine-tuning only for cases where LoRA isn't enough. Much more expensive and harder to manage.
- Learning rate: Start with 1e-4 to 5e-5 for LoRA, much lower for full fine-tuning.
- Epochs: 2-4 for most tasks. More than that and you'll overfit on small datasets.
- Validation set: Hold out 10-15% of data, never used for training, always used for evaluation.
Evaluation — the part that matters most
Without a good evaluation harness, you're flying blind. Set this up before you train:
- Golden test set. A curated set of 100-500 representative examples with expert-graded reference answers.
- Rubric. A clear scoring rubric for each example. "Did the model do X correctly?"
- Automated scoring. Where possible, exact-match or structured comparison.
- LLM-as-judge. Where automated scoring isn't possible, use a strong frontier model as judge with a careful rubric. Validate the judge against human grading on a sample.
- Regression suite. Run the eval on every new training run. Track scores over time. Catch regressions before they ship.
Common failure modes
Catastrophic forgetting
The model becomes good at your task but terrible at general capability. Mitigations: include general-purpose examples in your training mix, train fewer epochs, use lower learning rates.
Memorisation
The model regurgitates training examples verbatim instead of generalising. Mitigations: more diverse training data, evaluate on held-out examples that test generalisation.
Reward hacking
If your eval is gameable, the model will game it. Continually adversarially probe your eval set. Add new examples regularly.
Distribution mismatch
The model is great on your test set and terrible in production. Cause: your test set doesn't reflect production. Mitigation: build the test set from real production queries, not synthetic ones.
Deployment
- Ship behind a feature flag. Start with internal users.
- Log every production input and output. Sample for ongoing evaluation.
- A/B test against the base model. Don't trust offline metrics alone.
- Have a rollback plan. Fine-tuned models can degrade over time as your data drifts.