The brief
A research-heavy organisation in a highly regulated sector wanted to give their analysts access to an AI assistant that could summarise documents, draft reports, and answer questions from their internal corpus. The catch: their data could not, under any circumstances, leave their network. Cloud APIs, including those covered by a business associate agreement (BAA), were off the table.
What we evaluated
The decision tree we ran with them was:
- Cloud APIs with a BAA. Rejected — even with a BAA, sending data to a third party violated internal data residency rules.
- Managed private deployment (Bedrock, Azure OpenAI dedicated). Rejected — same residency rules applied.
- On-premises inference with an open-source model. The only remaining option.
The architecture
We deployed a Llama 3.x derivative fine-tuned on the client's internal style guide and a curated subset of their public-facing knowledge base. The stack:
- An 8x H100 GPU cluster in the client's existing data centre.
- vLLM as the inference server for high-throughput batching.
- A RAG layer indexing ~5M internal documents into a vector store running on the same network.
- A Next.js application providing a familiar chat interface, with strict SSO and per-user audit logging.
- End-to-end observability — every prompt, retrieval, and response logged and tied to a user, for audit.
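To make the audit requirement concrete, here is a minimal sketch of how one chat turn might be captured as a single log record tying the prompt, the retrieved documents, and the response to a user. The schema is an assumption for illustration: `AuditRecord`, `to_log_line`, the field names, and the JSON-lines format are hypothetical and are not the logging layer actually deployed.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditRecord:
    """One chat turn: who asked what, what was retrieved, what came back.

    Hypothetical schema; field names are illustrative assumptions.
    """
    user_id: str                  # identity from SSO
    prompt: str                   # the analyst's question as submitted
    retrieved_doc_ids: list[str]  # documents the RAG layer returned
    response: str                 # the model's answer
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: AuditRecord) -> str:
    """Serialise a record as one JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    user_id="analyst-042",
    prompt="Summarise the attached policy document.",
    retrieved_doc_ids=["doc-1", "doc-7"],
    response="(model output)",
)
line = to_log_line(record)
```

Keeping each turn in one self-contained record means an auditor can reconstruct what a user saw without joining across several logs. A real deployment would also need to decide where the log lives and who can read it, since the log itself contains sensitive text.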
The hardest parts
Three challenges took most of the project time:
- Data preparation. The "5M internal documents" turned out to be a mess of PDFs, scanned images, Word docs, and HTML of inconsistent quality. Building the OCR, cleaning, and chunking pipeline took roughly 40% of the project.
- Fine-tuning evaluation. The client had no existing benchmark for "is this assistant good enough?" We built one — a curated set of 200 representative queries with expert-graded reference answers, scored by both rubric and LLM-as-judge.
- Change management. A core group of analysts loved the tool from day one. Another group resisted it. We spent a month on training and adoption work that wasn't in the original scope.
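As an illustration of the benchmark described above, the sketch below combines a rubric score and an LLM-as-judge score per query and summarises the results. Everything specific here is an assumption: the 0.0 to 1.0 score scale, the equal weighting, the 0.7 pass threshold, the 0.3 disagreement gap, and the names `GradedAnswer`, `combined_score`, and `summarise` are hypothetical, and the sample scores are made up for the example rather than taken from the project.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GradedAnswer:
    query_id: str
    rubric_score: float  # assumed 0.0-1.0, from the expert-written rubric
    judge_score: float   # assumed 0.0-1.0, from an LLM-as-judge pass


def combined_score(answer: GradedAnswer, rubric_weight: float = 0.5) -> float:
    """Weighted average of the two graders (equal weights are an assumption)."""
    return (rubric_weight * answer.rubric_score
            + (1 - rubric_weight) * answer.judge_score)


def summarise(answers: list[GradedAnswer],
              pass_threshold: float = 0.7,
              disagreement_gap: float = 0.3) -> dict:
    """Aggregate per-query scores and flag queries where the graders diverge."""
    scores = [combined_score(a) for a in answers]
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        # Large gaps between graders are worth a human look.
        "disagreements": [
            a.query_id for a in answers
            if abs(a.rubric_score - a.judge_score) > disagreement_gap
        ],
    }


# Illustrative scores only; not real project data.
answers = [
    GradedAnswer("q1", rubric_score=0.9, judge_score=0.8),
    GradedAnswer("q2", rubric_score=0.4, judge_score=0.9),
    GradedAnswer("q3", rubric_score=0.6, judge_score=0.5),
]
report = summarise(answers)
```

Surfacing the disagreements separately is a deliberate choice in this sketch: when the rubric and the judge model diverge sharply on a query, the combined score hides the fact that one of the two graders is probably wrong.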
The outcome
- Average time spent on routine document summarisation tasks dropped by 60%, measured against a baseline study.
- Adoption across the eligible analyst population reached 80% within four months.
- Zero data left the client's network — verified by network audit.
- The internal team can now self-serve fine-tuning updates against new corpora using the pipeline we built.
Operationally, the cost of running this on-prem was actually higher than the cloud API equivalent would have been — but that was never the point. The project existed because the cloud option didn't exist for this client. That framing has held up.