On-Premise LLM for Regulated Industry

The brief

A research-heavy organisation in a highly regulated sector wanted to give their analysts access to an AI assistant that could summarise documents, draft reports, and answer questions from their internal corpus. The catch: their data could not, under any circumstances, leave their network. Cloud APIs — including BAA-covered ones — were off the table.

What we evaluated

The decision tree we ran with them was:

Cloud APIs with a BAA. Rejected — even with a BAA, sending data to a third party violated internal data residency rules.
Managed private deployment (Bedrock, Azure OpenAI dedicated). Rejected — same residency rules applied.
On-premise inference with an open-source model. The only remaining option.

The architecture

We deployed a Llama 3.x derivative fine-tuned on the client's internal style guide and a curated subset of their public-facing knowledge base. The stack:

An 8x H100 GPU cluster in the client's existing data centre.
vLLM as the inference server for high-throughput batching.
A RAG layer indexing ~5M internal documents into a vector store running on the same network.
A Next.js application providing a familiar chat interface, with strict SSO and per-user audit logging.
End-to-end observability — every prompt, retrieval, and response logged and tied to a user, for audit.

The hardest parts

Three challenges took most of the project time:

Data preparation. The "5M internal documents" was a mess of PDFs, scanned images, Word docs, and HTML in inconsistent quality. Building the OCR + cleaning + chunking pipeline was probably 40% of the project.
Fine-tuning evaluation. The client had no existing benchmark for "is this assistant good enough?" We built one — a curated set of 200 representative queries with expert-graded reference answers, scored by both rubric and LLM-as-judge.
Change management. A core group of analysts loved the tool from day one. Another group resisted it. We spent a month on training and adoption work that wasn't in the original scope.

The outcome

Average time spent on routine document summarisation tasks dropped 60%, measured against a baseline study.
Adoption across the eligible analyst population reached 80% within four months.
Zero data left the client's network — verified by network audit.
The internal team can now self-serve fine-tuning updates against new corpora using the pipeline we built.

Operationally, the cost of running this on-prem was actually higher than the cloud API equivalent would have been — but that was never the point. The project existed because the cloud option didn't exist for this client. That framing has held up.

Web Design

Web Development

Software Development

Featured

Free Discovery Call

Case Studies

On-Premise AI Setup

Training & Fine-Tuning

AI Agents & Automation

AI-Powered Development

Learn

Tools

Featured

Project Scoping Guide

About

Our Vision

Team

Careers