The brief

A research-heavy organisation in a highly regulated sector wanted to give their analysts access to an AI assistant that could summarise documents, draft reports, and answer questions from their internal corpus. The catch: their data could not, under any circumstances, leave their network. Cloud APIs — including BAA-covered ones — were off the table.

What we evaluated

The decision tree we ran with them was:

  1. Cloud APIs with a BAA. Rejected — even with a BAA, sending data to a third party violated internal data residency rules.
  2. Managed private deployment (Bedrock, Azure OpenAI dedicated). Rejected — same residency rules applied.
  3. On-premise inference with an open-source model. The only remaining option.

The architecture

We deployed a Llama 3.x derivative fine-tuned on the client's internal style guide and a curated subset of their public-facing knowledge base. The stack:

  • An 8x H100 GPU cluster in the client's existing data centre.
  • vLLM as the inference server for high-throughput batching.
  • A RAG layer indexing ~5M internal documents into a vector store running on the same network.
  • A Next.js application providing a familiar chat interface, with strict SSO and per-user audit logging.
  • End-to-end observability — every prompt, retrieval, and response logged and tied to a user, for audit.

The hardest parts

Three challenges took most of the project time:

  • Data preparation. The "5M internal documents" was a mess of PDFs, scanned images, Word docs, and HTML in inconsistent quality. Building the OCR + cleaning + chunking pipeline was probably 40% of the project.
  • Fine-tuning evaluation. The client had no existing benchmark for "is this assistant good enough?" We built one — a curated set of 200 representative queries with expert-graded reference answers, scored by both rubric and LLM-as-judge.
  • Change management. A core group of analysts loved the tool from day one. Another group resisted it. We spent a month on training and adoption work that wasn't in the original scope.

The outcome

  • Average time spent on routine document summarisation tasks dropped 60%, measured against a baseline study.
  • Adoption across the eligible analyst population reached 80% within four months.
  • Zero data left the client's network — verified by network audit.
  • The internal team can now self-serve fine-tuning updates against new corpora using the pipeline we built.

Operationally, the cost of running this on-prem was actually higher than the cloud API equivalent would have been — but that was never the point. The project existed because the cloud option didn't exist for this client. That framing has held up.