Session overview

A practical session on fine-tuning open-source language models on proprietary data, including the dataset preparation, training, and evaluation workflow we use in client engagements.

What we cover

  • When fine-tuning is the right answer. Decision framework for choosing fine-tuning vs RAG vs prompting.
  • Base model selection. Llama, Qwen, Mistral, DeepSeek — practical considerations for picking a starting point.
  • Dataset preparation. Curation, formatting, edge cases, the 80% of the work that's least glamorous.
  • LoRA and QLoRA training. Hyperparameter choices, training duration, hardware requirements.
  • Evaluation harnesses. Building the test set, defining rubrics, automated and LLM-as-judge scoring.
  • Common failure modes. Catastrophic forgetting, memorisation, reward hacking, distribution mismatch.

Live demonstration

The session includes a live walkthrough of fine-tuning a 7B model on a small proprietary dataset — from raw data through to evaluation. Total elapsed wall-clock time visible to the audience.

Reference materials

The workflow demonstrated is documented in our fine-tuning guide. Background context on choosing between fine-tuning and RAG is in our decision tree blog post.

Q&A topics

  • Sizing the dataset for a target capability.
  • Continuous fine-tuning vs episodic retraining.
  • Comparing fine-tuned open-source models with frontier APIs.
  • Cost optimisation strategies.

Recording

Contact us to request the recording.