This is the playbook for deploying a production-grade LLM on your own infrastructure. It assumes you've already decided that on-prem is the right call — see our buyer's guide if you haven't.

Hardware

For a serious production deployment:

  • GPUs: NVIDIA H100s are the current default. 8 GPUs per node, NVLinked, with high-bandwidth interconnect between nodes if you're scaling beyond one box.
  • CPU and RAM: Don't skimp. You need bandwidth to feed the GPUs. We use Intel Xeon Platinum or AMD EPYC, with ~1TB of RAM per node.
  • Storage: NVMe SSDs for model weights and KV cache. 4-8TB per node.
  • Network: 100 GbE between nodes. InfiniBand if you're doing distributed inference at serious scale.
  • Power and cooling: A loaded H100 draws ~700W. Plan accordingly.

A reasonable production cluster runs $300K-1M depending on size and redundancy.

Software stack

  • Model serving: vLLM is our default. High throughput, good batching, supports most modern open models. Alternatives: TGI, TensorRT-LLM, SGLang.
  • Orchestration: Kubernetes if you're at scale. Docker Compose if you're not. Don't over-engineer.
  • Load balancer: Envoy or HAProxy in front of vLLM workers. Sticky sessions for long-running streams.
  • Observability: Prometheus + Grafana for metrics. ELK or Loki for logs. OpenTelemetry for tracing.

Model choice

The current open model landscape:

  • Llama 3.x family. Strong general performance, good ecosystem.
  • Qwen family. Excellent for many tasks, particularly multilingual.
  • Mistral family. Solid mid-size options.
  • DeepSeek family. Strong reasoning capability.

Evaluate on your task, not on public benchmarks. Public benchmarks lie.

Fine-tuning vs RAG

Most on-prem deployments are RAG-first, fine-tuning-second. RAG gets you 80% of the way there with much less effort. See our RAG vs fine-tuning decision tree.

The RAG pipeline

  1. Document ingestion: chunkers, parsers (especially PDF), metadata extraction.
  2. Embedding: an open embedding model (BGE, GTE, E5) — also self-hosted.
  3. Vector store: Qdrant or Weaviate are our defaults. Postgres + pgvector for smaller corpora.
  4. Retrieval: dense + sparse hybrid (vector + BM25). Reranking step on top.
  5. Generation: pass retrieved chunks as context to the LLM.

Security

  • The cluster lives on your private network. No public ingress for the LLM itself.
  • API access goes through your existing identity provider with audit logging.
  • Every prompt and response is logged with user attribution.
  • Sensitive data is filtered at ingestion — PII, secrets, anything you don't want in retrieved context.
  • Prompt injection mitigations: input sanitisation, output validation, careful tool-use boundaries.

Operations

  • Monitoring: Per-request latency (p50, p95, p99), tokens per second, GPU utilisation, queue depth.
  • Alerting: GPU thermal events, OOM errors, error rate spikes.
  • Capacity planning: Run synthetic load tests monthly. Know your headroom.
  • Model updates: Have a versioning strategy. Run A/B comparisons before promoting a new model to production.

Cost

Honest accounting for an 8-GPU H100 cluster, first year:

  • Hardware: $400-600K
  • Power and cooling: $30-60K/year
  • Networking and storage: $50-100K
  • Ops staff: 2 engineers, $400-600K/year
  • Software and licenses: minimal if you stick to open-source

Total year-one cost: ~$1M. Year two onwards: ~$500K/year. Compare honestly to cloud API equivalents.

When this is worth it

Three conditions in combination:

  • Real regulatory or sovereignty constraints that forbid cloud.
  • High query volume (millions of tokens per day).
  • The ops team to actually run this in production.

If you don't have all three, reconsider. Managed private deployments cover most "we need privacy" use cases without the operational burden.