We deploy on-premise LLMs for a living. We also talk clients out of on-prem at least as often as we talk them into it. Here's the honest framework we use to figure out whether on-prem is the right choice — or whether you're about to spend six figures solving a problem you don't actually have.
The three reasons to go on-prem
There are three legitimate reasons to deploy AI on your own infrastructure:
- Regulatory or compliance constraints. If your industry forbids sending data to third-party clouds — healthcare, defence, parts of finance, sovereign data residency — on-prem isn't optional. It's the only option.
- Sensitive data that can't leave your network. Even when regulation doesn't require it, some data is too commercially sensitive to send to OpenAI or Anthropic. Trade secrets, M&A diligence, customer PII at scale.
- Predictable cost at high volume. Above a certain query volume — usually millions of tokens per day — owning the inference is cheaper than renting it. The crossover point depends on your model size and utilisation, but it's real.
If none of those apply, you probably shouldn't deploy on-prem. Use the cloud APIs.
The hidden costs nobody mentions
Sales decks present on-prem as "buy GPUs, save money on tokens." The reality is more expensive than that. Real on-prem deployments include:
- GPU procurement. A single H100 is around $30K. A production cluster is 8+ GPUs. Allocate $300K minimum for hardware, more if you want redundancy.
- Networking and storage. High-bandwidth networking between GPUs (NVLink, InfiniBand) and fast storage for model weights. Easily another $50-100K.
- Power and cooling. A loaded H100 draws ~700W. An 8-GPU node needs serious cooling. Your existing rack space might not handle it.
- Ops staff. Someone needs to manage the cluster, monitor model performance, handle failures, and ship updates. That's a real headcount cost, often two engineers.
- Model lifecycle. Open-source models update every few months. Re-evaluating, re-fine-tuning, and re-deploying is ongoing work.
Add it up and the true cost of a production on-prem LLM deployment is usually $500K-1M in year one, with $200-400K/year ongoing. That's the number that needs to compare to your cloud API spend.
The middle path: hosted private models
Many clients we work with don't need true on-prem — they need logical isolation. They want a dedicated model instance, no data shared with the provider, and ideally infrastructure in a specific region. That's not on-prem; that's a managed private deployment.
Options like AWS Bedrock with custom models, Azure OpenAI with private endpoints, or dedicated instances from Cohere, Anthropic, or self-hosted inference providers (Together AI, Modal, Replicate enterprise) often hit the requirements without the GPU procurement headache.
Before going full on-prem, exhaust the managed-private options. They cover ~70% of "we need privacy" use cases.
What we recommend evaluating
If you're seriously considering on-prem, run this checklist:
- Quantify the volume. What's your expected daily token throughput? If it's under 10M tokens/day, cloud APIs are almost certainly cheaper.
- Quantify the latency budget. Are you running batch jobs (latency doesn't matter) or interactive UI (every 100ms counts)? On-prem can be faster or slower than cloud depending on architecture.
- List the data constraints honestly. Not "we'd prefer not to send data to OpenAI" — actually list the contractual, regulatory, and customer commitments that force the decision.
- Identify the model. Llama 3.x, Mistral, Qwen, DeepSeek — which open model meets your quality bar? Run a real evaluation before buying hardware.
- Plan the ops. Who runs this in production? If the answer is "we'll figure it out," stop. That's the most expensive part.
The right question isn't "should we do on-prem AI?" It's "what specifically would we lose by using cloud APIs, and is that loss worth $500K?"
When we say yes
The on-prem deployments we've shipped that worked best had three things in common: real regulatory pressure, real query volume, and real ops investment. None of them did it for cost savings. The cost savings, when they materialised, were a happy side-effect of necessity.
If you're a healthcare network processing millions of patient queries per day, on-prem is obvious. If you're a startup that wants "privacy," cloud APIs with a BAA will serve you better and you can revisit in two years.