Abstract
This paper provides a decision framework for organisations evaluating on-premise AI deployment versus cloud API services and managed private deployments. We cover the three legitimate reasons for on-prem, the hidden costs that don't appear in vendor presentations, and a total-cost-of-ownership model for comparing options honestly.
The three reasons to go on-prem
- Regulatory or compliance constraints that forbid sending data to third-party clouds.
- Sensitive data that cannot leave your network even under contractual protections such as a Business Associate Agreement (BAA).
- Predictable cost at high volume: above roughly 10M tokens/day, dedicated infrastructure can be cheaper than per-token pricing.
If none of these applies once you examine it rigorously, on-prem is the wrong answer. Choosing it anyway is the single most common mistake we see in evaluation projects.
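The volume threshold in the third reason is a break-even point, and it moves with whatever prices you plug in. A minimal sketch of that arithmetic follows. The function name and both input figures are illustrative assumptions, not vendor quotes.

```python
def breakeven_tokens_per_day(annual_fixed_cost_usd: float,
                             api_price_per_million_usd: float) -> float:
    """Daily token volume at which per-token API spend equals a fixed annual cost."""
    if api_price_per_million_usd <= 0:
        raise ValueError("API price must be positive")
    # Tokens per year that the fixed budget would buy at the per-token price.
    tokens_per_year = annual_fixed_cost_usd / api_price_per_million_usd * 1_000_000
    return tokens_per_year / 365


# Hypothetical inputs: $500K/year of dedicated infrastructure and a blended
# API price of $100 per million tokens. Substitute your own quotes.
threshold = breakeven_tokens_per_day(500_000, 100.0)
print(f"Break-even: {threshold / 1e6:.1f}M tokens/day")
```

With these placeholder inputs the break-even lands near 14M tokens/day, the same order of magnitude as the rough 10M figure. A cheaper per-token price pushes the threshold higher.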
The hidden costs
Vendor presentations focus on hardware. Real on-premise deployments include:
- Hardware: $300K-1M+ for a production GPU cluster.
- Power and cooling: A loaded 8-GPU node draws ~5.6 kW, or about 49,000 kWh per year if run continuously. The electricity bill depends on your tariff; cooling adds 30-50% on top.
- Networking and storage: $50-150K of additional infrastructure for a serious deployment.
- Ops staff: Two engineers minimum to run a production LLM cluster. That's $300-600K/year fully loaded.
- Model lifecycle: Re-evaluation, re-fine-tuning, and re-deployment as base models update.
- Integration: Application-layer work to wire the model into your existing systems.
Total year-one cost for a production deployment: $800K-1.5M. Year two onwards: $400-700K.
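The power and cooling line can be estimated directly from the node's draw. The sketch below assumes the node runs at full load all year and uses a placeholder tariff of $0.15/kWh. Only the 5.6 kW figure and the 30-50% cooling range come from the list above.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_power_cost_usd(node_kw: float,
                          tariff_usd_per_kwh: float,
                          cooling_overhead: float,
                          nodes: int = 1) -> float:
    """Yearly electricity cost for nodes running at full load around the clock.

    cooling_overhead is a fraction of the IT load, e.g. 0.3 for 30%.
    """
    it_kwh = node_kw * HOURS_PER_YEAR * nodes
    return it_kwh * (1 + cooling_overhead) * tariff_usd_per_kwh


# 5.6 kW per node is from the text; the $0.15/kWh tariff is a placeholder.
for overhead in (0.30, 0.50):
    cost = annual_power_cost_usd(5.6, 0.15, overhead)
    print(f"cooling +{overhead:.0%}: ${cost:,.0f}/year per node")
```

Under these assumptions a single node costs roughly $9.6K to $11K per year to power and cool, which is small next to the staffing line.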
The middle path: managed private deployments
Many organisations that think they need on-prem actually need logical isolation: a dedicated model instance, no data sharing with the provider, and infrastructure in a specific region. Options include:
- Amazon Bedrock with custom models.
- Azure OpenAI with private endpoints.
- Dedicated instances from Cohere, Anthropic, or other providers.
- Hosted private deployments from Together AI, Modal, Replicate enterprise.
These cover roughly 70% of "we need privacy" use cases without the GPU procurement burden. Exhaust them before considering full on-prem.
The decision framework
Before committing to on-prem, answer these five questions honestly:
- Volume. What is your projected daily token throughput? If under 10M tokens/day, cloud APIs are almost certainly cheaper.
- Latency. What's your latency budget? On-prem can be faster or slower than cloud depending on architecture.
- Data constraints. List the specific contractual, regulatory, and customer commitments that force the decision. Vague concerns are not enough.
- Model fit. Have you run an actual evaluation of the open-source models you'd deploy? Do they meet your quality bar?
- Ops capability. Who runs this in production? If the answer is uncertain, stop. Operations is the most expensive part of the project.
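One way to keep the answers honest is to record them in a structure and let a function list what is still missing. The sketch below is one possible encoding of the five questions, not a prescribed tool. The class, field, and function names are hypothetical, and treating an unset latency budget as a blocker is our assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

VOLUME_THRESHOLD = 10_000_000  # tokens/day; the rough break-even used in the text


@dataclass
class Assessment:
    tokens_per_day: float
    latency_budget_ms: Optional[float]      # None means nobody has set one yet
    data_constraints: List[str] = field(default_factory=list)  # specific clauses or regulations
    model_meets_quality_bar: bool = False   # outcome of a real evaluation
    ops_owner_named: bool = False


def onprem_blockers(a: Assessment) -> List[str]:
    """Return the reasons on-prem is not yet justified; an empty list means proceed."""
    blockers = []
    if a.tokens_per_day < VOLUME_THRESHOLD and not a.data_constraints:
        blockers.append("volume is below break-even and no specific data constraint is documented")
    if a.latency_budget_ms is None:
        blockers.append("latency budget is undefined")
    if not a.model_meets_quality_bar:
        blockers.append("candidate open-source model has not passed evaluation")
    if not a.ops_owner_named:
        blockers.append("no named team to run the cluster in production")
    return blockers


# Example: a small pilot with a latency target but nothing else settled.
pilot = Assessment(tokens_per_day=2_000_000, latency_budget_ms=800)
for reason in onprem_blockers(pilot):
    print("-", reason)
```

The volume and data-constraint checks are combined because either one alone can justify dedicated infrastructure, while model fit and ops capability are prerequisites in every case.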
TCO comparison example
For a hypothetical workload of 50M tokens/day:
- Cloud API (GPT-4 class): ~$1M-2M/year depending on context length and caching.
- Managed private (Bedrock dedicated): $400-800K/year + setup costs.
- On-prem with open-source model: $1M year-one, $500K/year ongoing.
At this volume, all three are plausible. The decision turns on data constraints and ops maturity, not cost.
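To compare the options over more than one year, the figures above can be rolled into cumulative totals. The sketch below uses the annual numbers from the list. The three-year horizon and the zero setup cost for the managed option are placeholder assumptions, since the text leaves both unspecified.

```python
def cumulative_cost_k(year_one_k: int, ongoing_k: int, years: int) -> int:
    """Total spend in $K over `years`, with a distinct first-year figure."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return year_one_k + ongoing_k * (years - 1)


YEARS = 3    # planning horizon: an assumption, not from the text
SETUP_K = 0  # managed-private setup cost is unspecified in the text; placeholder

options = {
    "cloud API (low)":        cumulative_cost_k(1_000, 1_000, YEARS),
    "cloud API (high)":       cumulative_cost_k(2_000, 2_000, YEARS),
    "managed private (low)":  cumulative_cost_k(400 + SETUP_K, 400, YEARS),
    "managed private (high)": cumulative_cost_k(800 + SETUP_K, 800, YEARS),
    "on-prem":                cumulative_cost_k(1_000, 500, YEARS),
}
for name, total_k in options.items():
    print(f"{name:<24} ${total_k:,}K over {YEARS} years")
```

The totals shift with the horizon and with the real setup cost, so treat the output as a check on the arithmetic rather than a ranking.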
Lock-in considerations
On-prem is the lowest-lock-in option in theory. In practice:
- Hardware vendors have lock-in via service contracts.
- Inference frameworks (vLLM, TGI, TensorRT-LLM) have switching costs.
- Fine-tuned models are tied to specific base architectures.
The "no vendor lock-in" pitch overstates the case. Lock-in is real; it's just to different vendors.
Recommendations
- Default to cloud APIs. Only deviate when you have specific, documented reasons.
- Before going full on-prem, evaluate managed private offerings. They cover most "we need privacy" cases.
- If you do go on-prem, plan for ops investment from day one. Hardware is the cheap part.
- Build your eval harness before you buy GPUs. Know what "good enough" looks like.
- Have a graceful exit strategy. If the project doesn't deliver, what's the off-ramp?
Conclusion
On-premise AI is the right answer for a specific set of organisations with real regulatory pressure and real query volume. It is the wrong answer for most others. The mistake is treating it as a fashionable choice instead of a constrained one.