This is the playbook for deploying a production-grade LLM on your own infrastructure. It assumes you've already decided that on-prem is the right call — see our buyer's guide if you haven't.
Hardware
For a serious production deployment:
- GPUs: NVIDIA H100s are the current default. 8 GPUs per node, NVLinked, with high-bandwidth interconnect between nodes if you're scaling beyond one box.
- CPU and RAM: Don't skimp. You need bandwidth to feed the GPUs. We use Intel Xeon Platinum or AMD EPYC, with ~1TB of RAM per node.
- Storage: NVMe SSDs for model weights and KV cache. 4-8TB per node.
- Network: 100 GbE between nodes. InfiniBand if you're doing distributed inference at serious scale.
- Power and cooling: A loaded H100 draws ~700W. Plan accordingly.
A reasonable production cluster runs $300K-1M depending on size and redundancy.
Software stack
- Model serving: vLLM is our default. High throughput, good batching, supports most modern open models. Alternatives: TGI, TensorRT-LLM, SGLang.
- Orchestration: Kubernetes if you're at scale. Docker Compose if you're not. Don't over-engineer.
- Load balancer: Envoy or HAProxy in front of vLLM workers. Sticky sessions for long-running streams.
- Observability: Prometheus + Grafana for metrics. ELK or Loki for logs. OpenTelemetry for tracing.
Model choice
The current open model landscape:
- Llama 3.x family. Strong general performance, good ecosystem.
- Qwen family. Excellent for many tasks, particularly multilingual.
- Mistral family. Solid mid-size options.
- DeepSeek family. Strong reasoning capability.
Evaluate on your task, not on public benchmarks. Public benchmarks lie.
Fine-tuning vs RAG
Most on-prem deployments are RAG-first, fine-tuning-second. RAG gets you 80% of the way there with much less effort. See our RAG vs fine-tuning decision tree.
The RAG pipeline
- Document ingestion: chunkers, parsers (especially PDF), metadata extraction.
- Embedding: an open embedding model (BGE, GTE, E5) — also self-hosted.
- Vector store: Qdrant or Weaviate are our defaults. Postgres + pgvector for smaller corpora.
- Retrieval: dense + sparse hybrid (vector + BM25). Reranking step on top.
- Generation: pass retrieved chunks as context to the LLM.
Security
- The cluster lives on your private network. No public ingress for the LLM itself.
- API access goes through your existing identity provider with audit logging.
- Every prompt and response is logged with user attribution.
- Sensitive data is filtered at ingestion — PII, secrets, anything you don't want in retrieved context.
- Prompt injection mitigations: input sanitisation, output validation, careful tool-use boundaries.
Operations
- Monitoring: Per-request latency (p50, p95, p99), tokens per second, GPU utilisation, queue depth.
- Alerting: GPU thermal events, OOM errors, error rate spikes.
- Capacity planning: Run synthetic load tests monthly. Know your headroom.
- Model updates: Have a versioning strategy. Run A/B comparisons before promoting a new model to production.
Cost
Honest accounting for an 8-GPU H100 cluster, first year:
- Hardware: $400-600K
- Power and cooling: $30-60K/year
- Networking and storage: $50-100K
- Ops staff: 2 engineers, $400-600K/year
- Software and licenses: minimal if you stick to open-source
Total year-one cost: ~$1M. Year two onwards: ~$500K/year. Compare honestly to cloud API equivalents.
When this is worth it
Three conditions in combination:
- Real regulatory or sovereignty constraints that forbid cloud.
- High query volume (millions of tokens per day).
- The ops team to actually run this in production.
If you don't have all three, reconsider. Managed private deployments cover most "we need privacy" use cases without the operational burden.