The standard answer to “how do I run LLMs” is to call OpenAI. Drop in an API key, pay per token, ship it. That works fine if you’re building a product demo or running a side project. It doesn’t work if you care about data sovereignty, operating costs at scale, or building infrastructure free of hard dependencies on a third-party service you can’t control.
SOVEREIGN runs Ollama on bare metal. Here’s why, and how.
The Case Against Cloud APIs for Infrastructure
My use case isn’t “summarize this document.” It’s closer to: continuous inference over internal systems data, security questionnaire automation, multi-agent workflows running throughout the day, and AI-assisted code and infrastructure tooling that touches internal repositories.
Three problems with cloud APIs for that workload:
Cost at volume. API pricing is designed for low-volume experimentation. At the inference volume SOVEREIGN generates — multiple agents running continuously, RAG queries against internal knowledge bases, automated daily workflows — the monthly API bill would comfortably exceed the amortized cost of the hardware. The hardware is a fixed cost. Once amortized, inference is effectively free at the margin.
Data sovereignty. Some of what runs through these pipelines is sensitive: internal architecture decisions, client-adjacent context, security tooling outputs. Sending that to a third-party API — regardless of their data handling policies — introduces a dependency and an exposure surface that I don’t want. Nothing leaves the network.
Latency and reliability. A multi-agent system that makes dozens of LLM calls per workflow is sensitive to network latency and API availability. Running inference locally means sub-100ms round trips and zero dependency on external uptime. The agents don’t care whether OpenAI is having an incident.
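To make the local round-trip concrete, here is a minimal sketch of calling an Ollama node over its HTTP API. The Consul DNS name follows the service-discovery setup described below, and the model tag is illustrative; Ollama’s default port is 11434 and its non-streaming `/api/generate` response carries the text in a `response` field.

```python
import json
import urllib.request

# ollama.service.consul resolves via Consul DNS on the mesh; no hardcoded IPs.
OLLAMA_URL = "http://ollama.service.consul:11434/api/generate"

def build_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model: str, prompt: str) -> str:
    """Blocking, non-streaming call to a local Ollama node."""
    body = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

An agent that makes dozens of these calls per workflow pays only local network latency, regardless of what any external provider is doing.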
How SOVEREIGN Runs It
SOVEREIGN’s infrastructure is managed by a HashiStack control plane: Packer builds machine images, Terraform provisions nodes, Vault manages credentials, Consul handles service discovery, Nomad orchestrates workloads.
Ollama nodes are provisioned from a dedicated Packer image — ollama-node — built on top of the hashiclient base image (AlmaLinux 9, hardened, Consul and Nomad agents pre-installed). The Packer pipeline looks like this:
base → hashicore → hashiclient → ollama-node
Each stage inherits from the previous. The ollama-node layer installs Ollama, configures it as a systemd service, sets GPU memory allocation parameters, and registers the service definition with Consul. The resulting image is the same every time — no configuration drift, no manual SSH.
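A hedged sketch of what the ollama-node stage might look like in Packer HCL — the builder, file paths, and provisioner details are illustrative, not the actual templates:

```hcl
# Illustrative sketch of the ollama-node stage, not the real template.
# It layers onto the hashiclient image and bakes Ollama in at build time.

source "qemu" "ollama-node" {
  disk_image       = true
  iso_url          = "images/hashiclient.qcow2" # output of the previous stage
  iso_checksum     = "none"
  ssh_username     = "packer"
  shutdown_command = "sudo shutdown -P now"
}

build {
  sources = ["source.qemu.ollama-node"]

  provisioner "shell" {
    inline = [
      # Official install script drops the binary and a systemd unit
      "curl -fsSL https://ollama.com/install.sh | sh",
      "sudo systemctl enable ollama",
      # Register the service with Consul so the API is discoverable (default port 11434)
      "echo '{\"service\": {\"name\": \"ollama\", \"port\": 11434}}' | sudo tee /etc/consul.d/ollama.json",
    ]
  }
}
```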
A Terraform module provisions the node as an antlet on the Antsle hypervisor via the custom provider I wrote. Nomad schedules model-specific jobs that pull the right model weights on startup. The Consul service catalog makes the Ollama API available to any service on the mesh via ollama.service.consul — no hardcoded IPs.
The entire pipeline from terraform apply to a reachable Ollama endpoint is about six minutes, including model weight download.
Model Selection and Resource Allocation
Not every inference task needs the same model. SOVEREIGN runs a tiered model fleet:
- Llama 3.1 8B — fast, low resource, good for classification, routing decisions, structured extraction
- Llama 3.1 70B (quantized to Q4_K_M) — primary reasoning model for complex tasks
- Mistral 7B — alternate small model, faster for streaming use cases
- Nomic Embed Text — embedding model for RAG pipelines (ChromaDB)
Model selection is done at the job level in Nomad — each agent workflow specifies which model it needs. The Ollama API supports multiple models loaded simultaneously, bounded by available VRAM.
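A sketch of how a model-pinned job could look in a Nomad jobspec — the job name, command path, and model tag are illustrative, not SOVEREIGN’s actual specs. The point is that the model choice lives in the job definition, and the endpoint comes from Consul DNS rather than a hardcoded address:

```hcl
# Illustrative Nomad job: each agent workflow pins its model by name.
job "specter-inference" {
  datacenters = ["dc1"]
  type        = "batch"

  group "rag" {
    task "answer" {
      driver = "exec"

      env {
        OLLAMA_HOST = "http://ollama.service.consul:11434" # Consul DNS, no hardcoded IPs
        MODEL       = "llama3.1:70b-instruct-q4_K_M"
      }

      config {
        command = "/usr/local/bin/specter-run" # hypothetical workflow binary
      }
    }
  }
}
```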
The Q4_K_M quantization on 70B is worth understanding. A 70B model at 16-bit precision requires ~140GB of VRAM (two bytes per parameter) — out of reach for non-datacenter hardware. Q4_K_M reduces this to around 40GB with acceptable quality degradation for most tasks. On SOVEREIGN’s hardware, the 70B model runs by spilling into system RAM (CPU offload), which is slower than pure VRAM inference but still usable for non-latency-sensitive workflows.
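The memory math above is back-of-envelope. The ~4.5 bits/parameter figure for Q4_K_M is an approximation (the format mixes 4- and 6-bit blocks), and real deployments need extra headroom for the KV cache and activations:

```python
# Back-of-envelope memory requirements for a 70B-parameter model.
PARAMS = 70e9

fp16_gb = PARAMS * 2 / 1e9      # 16-bit: 2 bytes per parameter -> 140 GB
q4_gb = PARAMS * 4.5 / 8 / 1e9  # Q4_K_M: ~4.5 bits per parameter -> ~39 GB
```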
What It Actually Costs
The hardware cost is the honest comparison point. A setup capable of running meaningful local inference — enough RAM for quantized 70B, CPU that won’t bottleneck the context window, NVMe fast enough for model loading — runs somewhere between $2,000 and $5,000 depending on configuration. Call it $3,500 amortized over three years: roughly $100/month.
OpenAI’s GPT-4o at current pricing runs $5 per million input tokens and $15 per million output tokens. A day of moderate multi-agent usage — let’s say 500K input tokens and 200K output tokens — costs $5.50/day, or about $165/month. That’s already past the hardware amortization cost, and that’s a conservative usage estimate for a system running continuous workflows.
The crossover point depends on your actual inference volume. For light hobbyist use, cloud APIs are cheaper. For anything running continuously — background agents, daily pipelines, RAG systems fielding regular queries — local inference pays for itself.
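The crossover arithmetic from the last two paragraphs can be sketched directly. All figures are the assumptions stated above (a $3,500 build over 36 months, and the quoted per-million-token prices):

```python
# Back-of-envelope crossover between API pricing and amortized hardware.
HW_COST = 3500            # USD, amortized over 3 years
HW_MONTHLY = HW_COST / 36 # ~97 USD/month

IN_PRICE, OUT_PRICE = 5.0, 15.0  # USD per million tokens, as quoted

def api_monthly(in_tokens_day: int, out_tokens_day: int, days: int = 30) -> float:
    """Monthly API spend at a steady daily token volume."""
    daily = in_tokens_day / 1e6 * IN_PRICE + out_tokens_day / 1e6 * OUT_PRICE
    return days * daily

moderate = api_monthly(500_000, 200_000)  # the "moderate multi-agent" day
```

At the stated moderate volume this lands around $165/month against roughly $97/month of amortized hardware; lighter usage flips the comparison.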
SPECTER: The Practical Proof
SOVEREIGN’s inference infrastructure is what makes SPECTER possible. SPECTER is a RAG-based system that automates enterprise security questionnaires — the multi-hundred-question documents that vendors send during security evaluations.
The workflow: SPECTER embeds a knowledge base of past security responses into ChromaDB using Nomic Embed Text, then uses the 70B model to generate new answers grounded in that context. It processes questionnaires automatically, confidence-tiers every response, and routes low-confidence answers to human review.
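The confidence-tiering step can be sketched as a simple routing function. The thresholds and tier names here are hypothetical — the text only establishes that low-confidence answers go to human review:

```python
# Hypothetical SPECTER-style routing: thresholds and tier names are
# illustrative, not the actual system's values.
def tier(confidence: float,
         auto_threshold: float = 0.85,
         review_threshold: float = 0.60) -> str:
    """Map a per-answer confidence score to a routing decision."""
    if confidence >= auto_threshold:
        return "auto-approve"   # grounded answer ships as-is
    if confidence >= review_threshold:
        return "flag"           # ships, but marked for spot-checking
    return "human-review"       # routed to a person before sending
```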
Without local inference, running SPECTER against even one questionnaire would send sensitive client security posture information to an external API. That’s a non-starter in a compliance-oriented environment.
With SOVEREIGN’s inference stack, the entire pipeline runs on-premises. The data never leaves. The inference cost is zero at the margin.
The Operational Reality
Running local inference isn’t zero-overhead. There’s real work involved in:
- Keeping Ollama updated and managing model weights (they’re large, they accumulate)
- Tuning context window size and temperature for different use cases
- Monitoring inference node resource utilization
- Managing Packer image rebuilds when the base OS receives updates
None of this is hard, but it’s not free. If your goal is to run a single chatbot and you don’t care about data sovereignty, pay for the API. The overhead isn’t worth it at small scale.
If your goal is to build durable AI infrastructure that you control, that runs continuously, and that handles sensitive data — the investment in the infrastructure pays off quickly and compounds over time.
I build and run this infrastructure day to day. If that’s the kind of depth your team needs — let’s talk.
SOVEREIGN’s Ollama infrastructure powers SPECTER, FULCRUM, and several internal automation workflows. The Terraform provider that provisions the inference nodes is available on request.