Abstract

This paper provides a decision framework for organisations evaluating on-premise AI deployment versus cloud API services and managed private deployments. We cover the three legitimate reasons for on-prem, the hidden costs that don't appear in vendor presentations, and a total-cost-of-ownership model for comparing options honestly.

The three reasons to go on-prem

  1. Regulatory or compliance constraints that forbid sending data to third-party clouds.
  2. Sensitive data that cannot leave your network, even under contractual protections such as a business associate agreement (BAA).
  3. Predictable cost at high volume: above roughly 10M tokens/day, dedicated infrastructure can be cheaper than per-token pricing.

If none of these applies rigorously, on-prem is the wrong answer. Choosing it anyway is the single most common mistake we see in evaluation projects.
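
The volume threshold in the third reason can be sanity-checked with a break-even calculation. The sketch below is a minimal illustration, not a pricing model: the function name and both inputs are hypothetical, and it assumes a single blended per-token price and a flat annual cost for dedicated infrastructure.

```python
def breakeven_tokens_per_day(annual_dedicated_cost_usd: float,
                             api_price_per_million_tokens_usd: float) -> float:
    """Daily token volume at which per-token API spend equals a flat annual cost.

    Both arguments are assumptions supplied by the caller, not figures
    taken from any vendor price list.
    """
    if api_price_per_million_tokens_usd <= 0:
        raise ValueError("API price must be positive")
    # Tokens per year that the flat annual cost would buy at the API price.
    tokens_per_year = (annual_dedicated_cost_usd
                       / api_price_per_million_tokens_usd) * 1_000_000
    return tokens_per_year / 365
```

Above the returned volume, the flat-cost option is cheaper under these assumptions; below it, per-token pricing wins. Use your own quoted price and your own fully loaded annual cost as inputs.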

The hidden costs

Vendor presentations focus on hardware. The full cost of a real on-premise deployment includes:

  • Hardware: $300K-1M+ for a production GPU cluster.
  • Power and cooling: A loaded 8-GPU node draws ~5.6 kW, or roughly 49 MWh per year if it runs continuously. The electricity cost depends on your tariff; cooling adds 30-50% on top.
  • Networking and storage: $50-150K of additional infrastructure for a serious deployment.
  • Ops staff: Two engineers minimum to run a production LLM cluster. That's $300-600K/year fully loaded.
  • Model lifecycle: Re-evaluation, re-fine-tuning, and re-deployment as base models update.
  • Integration: Application-layer work to wire the model into your existing systems.
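
The power and cooling item can be turned into a rough annual figure. In this minimal sketch the 5.6 kW draw and the 30-50% cooling range come from the list above; the $0.12/kWh tariff is a placeholder assumption, as is continuous full-load operation.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the node runs continuously


def annual_power_cost_usd(node_kw: float, tariff_usd_per_kwh: float,
                          cooling_overhead: float) -> float:
    """Electricity cost for one node, with cooling as a fraction added on top."""
    energy_kwh = node_kw * HOURS_PER_YEAR
    return energy_kwh * tariff_usd_per_kwh * (1 + cooling_overhead)


# 5.6 kW and the 30-50% cooling range are from the text;
# the $0.12/kWh tariff is an assumed placeholder.
low = annual_power_cost_usd(5.6, 0.12, 0.30)
high = annual_power_cost_usd(5.6, 0.12, 0.50)
print(f"${low:,.0f} to ${high:,.0f} per node per year")
```

At the assumed tariff this comes to roughly $7,650 to $8,830 per node per year. Multiply by the node count and substitute your actual tariff.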

Typical total year-one cost for a production deployment: $800K-1.5M. Year two onwards: $400-700K per year.
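
As a cross-check, the line items that carry dollar figures can be summed directly. This is a sketch under stated assumptions: the dictionary keys and function name are hypothetical, and power, model lifecycle, and integration are left out because the list gives no dollar ranges for them.

```python
# (low, high) in USD, from the line items above that carry dollar figures.
# Power, model lifecycle, and integration are not quantified and are omitted.
YEAR_ONE_ITEMS = {
    "hardware": (300_000, 1_000_000),         # "$300K-1M+": the high end is open-ended
    "networking_storage": (50_000, 150_000),
    "ops_staff": (300_000, 600_000),
}


def rollup(items: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the low ends and the high ends of each cost range."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


low, high = rollup(YEAR_ONE_ITEMS)
print(f"Quantified year-one items: ${low:,} to ${high:,}")
```

The quantified items alone span $650,000 to $1,750,000. The headline year-one range therefore reads best as a typical band rather than a strict sum of the extremes. That is an interpretation; the text does not state it.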

The middle path: managed private deployments

Many organisations that think they need on-prem actually need logical isolation: a dedicated model instance, no data sharing with the provider, and infrastructure in a specific region. Options include:

  • Amazon Bedrock with custom models.
  • Azure OpenAI with private endpoints.
  • Dedicated instances from Cohere, Anthropic, or other providers.
  • Hosted private deployments from Together AI, Modal, or Replicate's enterprise offering.

These cover roughly 70% of "we need privacy" use cases without the GPU procurement burden. Exhaust them before considering full on-prem.

The decision framework

Before committing to on-prem, answer these five questions honestly:

  1. Volume. What is your projected daily token throughput? If under 10M tokens/day, cloud APIs are almost certainly cheaper.
  2. Latency. What's your latency budget? On-prem can be faster or slower than cloud depending on architecture, so measure both against that budget rather than assuming.
  3. Data constraints. List the specific contractual, regulatory, and customer commitments that force the decision. Vague concerns are not enough.
  4. Model fit. Have you run an actual evaluation of the open-source models you'd deploy? Do they meet your quality bar?
  5. Ops capability. Who runs this in production? If the answer is uncertain, stop. Operations is the most expensive part of the project.
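
The five questions can be captured as a simple screening checklist. The sketch below is one possible encoding, not a definitive rule set: the class, field, and function names are hypothetical, and the only number taken from the text is the rough 10M tokens/day threshold.

```python
from dataclasses import dataclass
from typing import Optional

VOLUME_THRESHOLD = 10_000_000  # tokens/day, the rough threshold from the text


@dataclass
class Evaluation:
    tokens_per_day: int                  # question 1: projected daily throughput
    latency_budget_ms: Optional[int]     # question 2: None means no budget is defined
    documented_constraints: list[str]    # question 3: specific commitments only
    open_model_meets_quality_bar: bool   # question 4: result of an actual evaluation
    named_ops_owner: bool                # question 5: someone committed to run it


def onprem_blockers(e: Evaluation) -> list[str]:
    """List the reasons not to commit to on-prem yet; empty means none were found."""
    blockers = []
    if e.tokens_per_day < VOLUME_THRESHOLD and not e.documented_constraints:
        blockers.append("no documented data constraint and volume is under the threshold")
    if e.latency_budget_ms is None:
        blockers.append("no latency budget defined")
    if not e.open_model_meets_quality_bar:
        blockers.append("open-source model has not passed evaluation")
    if not e.named_ops_owner:
        blockers.append("no named ops owner")
    return blockers
```

An empty list does not mean on-prem is the right choice, only that none of the five questions rules it out. Any non-empty result is a reason to pause before procurement.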

TCO comparison example

For a hypothetical workload of 50M tokens/day:

  • Cloud API (GPT-4 class): ~$1M-2M/year depending on context length and caching.
  • Managed private (Bedrock dedicated): $400-800K/year plus setup costs.
  • On-prem with open-source model: $1M in year one, then $500K/year ongoing.

At this volume, all three are plausible. The decision turns on data constraints and ops maturity, not cost.
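
To see how these figures accumulate, the annual numbers can be projected over a planning horizon. This is a minimal sketch: the three-year horizon and flat pricing are assumptions, the names are hypothetical, and the managed-private setup cost is not given in the text, so it is left as a placeholder of zero.

```python
def cumulative_cost(year_one: float, ongoing: float, years: int) -> float:
    """Total spend over `years`, with a distinct first-year cost."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return year_one + ongoing * (years - 1)


MANAGED_SETUP_USD = 0  # not given in the text; replace with a quoted figure

# (year-one cost, ongoing annual cost) in USD, from the comparison above.
SCENARIOS = {
    "cloud API, low": (1_000_000, 1_000_000),
    "cloud API, high": (2_000_000, 2_000_000),
    "managed private, low": (400_000 + MANAGED_SETUP_USD, 400_000),
    "managed private, high": (800_000 + MANAGED_SETUP_USD, 800_000),
    "on-prem": (1_000_000, 500_000),
}

for name, (year_one, ongoing) in SCENARIOS.items():
    total = cumulative_cost(year_one, ongoing, years=3)
    print(f"{name}: ${total:,.0f} over 3 years")
```

The sketch ignores price changes, hardware refresh, and discounting, so treat its output as a starting point for your own model rather than a forecast.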

Lock-in considerations

On-prem is the lowest-lock-in option in theory. In practice:

  • Hardware vendors have lock-in via service contracts.
  • Inference frameworks (vLLM, TGI, TensorRT-LLM) have switching costs.
  • Fine-tuned models are tied to specific base architectures.

The "no vendor lock-in" pitch overstates the case. Lock-in is real; it's just to different vendors.

Recommendations

  • Default to cloud APIs. Only deviate when you have specific, documented reasons.
  • Before going full on-prem, evaluate managed private offerings. They cover most "we need privacy" cases.
  • If you do go on-prem, plan for ops investment from day one. Hardware is the cheap part.
  • Build your eval harness before you buy GPUs. Know what "good enough" looks like.
  • Have a graceful exit strategy. If the project doesn't deliver, what's the off-ramp?

Conclusion

On-premise AI is the right answer for a specific set of organisations with real regulatory pressure, real data-sensitivity constraints, or real query volume. It is the wrong answer for most others. The mistake is treating it as a fashionable choice instead of a constrained one.