Most people who work with Terraform spend their careers consuming providers — writing HCL, running terraform apply, and trusting that the AWS or Kubernetes provider handles the translation. That’s a perfectly valid way to use the tool. But it keeps you one layer above how Terraform actually works.
Writing a provider from scratch removes that abstraction entirely. You’re no longer a user of the framework — you’re implementing it.
Why the Antsle Provider Existed
SOVEREIGN is my self-hosted infrastructure platform running on Antsle hardware. Antsle makes a hypervisor appliance — physical hardware with a REST API for managing virtual machines and containers. It’s a solid piece of kit for someone who wants real infrastructure at home without cloud vendor dependency.
The problem: no Terraform provider existed. Every VM, every container, every network configuration had to be managed through the web UI or direct API calls. In a world where I’m running Vault, Consul, Nomad, Traefik, and a fleet of Ollama nodes, clicking through a UI is not infrastructure management — it’s theater.
So I wrote the provider.
What “Full CRUD Lifecycle” Actually Means
The Terraform Plugin SDK structures resources around four operations: Create, Read, Update, Destroy. Sounds simple. In practice, each one surfaces a different category of problem.
Create is where you learn about API idempotency. Most APIs aren’t naturally idempotent — if you call create twice, you get two resources. Terraform needs to know about the first one before you can ever call update or destroy. This means your Create function must record the API-assigned ID in state the moment the create call returns, then call Read to populate the remaining attributes. If it doesn’t, you’ve leaked a resource that Terraform doesn’t know about; recovering it requires a manual import, if it’s possible at all.
Read is the most important function and the one most tutorials underweight. Terraform’s refresh cycle calls Read constantly — on plan, on apply, on import. Read must be authoritative: whatever the API says is the source of truth, even if it contradicts local state. Getting this wrong means terraform plan always shows a diff even when nothing has changed, or worse, shows no diff when state has drifted.
Update taught me the difference between mutable and immutable attributes. Some things on a VM can be changed in place — memory allocation, CPU count, description. Others require destroy-and-recreate — the base OS image, the storage backend. Setting ForceNew: true on the right schema attributes is what tells Terraform to plan a replacement rather than an in-place update. Getting this wrong either causes silent failures or unnecessary downtime.
Destroy is straightforward until the resource doesn’t exist. Proper Destroy implementations check for 404 and return cleanly — resources can be deleted out-of-band, and Terraform needs to handle that gracefully instead of erroring on an operation that already succeeded.
Nine Resource Types
The provider ships with nine writable resource types covering the main Antsle resource model: antlets (their term for VMs/containers), templates, networks, storage, and user management. Each required mapping the Antsle API’s data model into Terraform schema — deciding which fields are required vs optional, which are computed (set by the API, not the user), and which changes require replacement.
The fifteen read-only data sources exist so you can reference existing infrastructure in new resources without importing it. A new antlet can reference an existing network by name rather than by ID — which means the HCL reads like infrastructure intent rather than a list of opaque identifiers.
The Schema Is Your Contract
Everything downstream depends on getting the schema right. The schema defines what users can write in HCL, what Terraform stores in state, and what the provider will try to reconcile on every apply. A schema mistake isn’t just a bug — it’s a breaking change that requires state migration or forces users to destroy and recreate resources.
The most important schema decisions I made:
- Sensitive fields: API keys and passwords marked `Sensitive: true` so they never appear in plan output or logs
- Computed + Optional: Fields like `ip_address` where the user can specify one or let the API assign — requires `Computed: true` alongside `Optional: true` so Terraform doesn’t plan a diff when the API fills in what the user left blank
- ValidateFunc: Input validation before any API call — fail fast at plan time with a clear error rather than a cryptic API response at apply time
What It Replaced
Before the provider, standing up a new node in SOVEREIGN meant: log into the Antsle web UI, click through the antlet creation wizard, wait for provisioning, SSH in, run the Packer-built image bootstrap, manually register with Consul, and hope I’d remembered every configuration detail from the last time.
After: terraform apply with a module call. The same Packer-hardened AlmaLinux 9 image, the same Vault-managed credentials, the same Consul registration — fully automated from a three-line HCL block.
The provider is the reason SOVEREIGN can be described as “git push to running production service with zero manual SSH steps.” Without it, that claim would be aspirational at best.
Why This Matters More Than a Cert
There’s a specific class of Terraform knowledge that no certification tests for: what happens inside the provider. The Terraform Associate exam tests whether you can write HCL and understand the plan/apply cycle. It doesn’t test whether you understand why Computed: true is necessary, how the state lock works, or what the plugin protocol actually looks like.
Writing a provider requires that knowledge. You can’t fake your way through implementing the CRUD lifecycle — either the state is consistent or your resources leak. Either the schema is correct or users get plan noise forever.
I wrote this provider because I needed it. The fact that it’s the deepest possible proof of Terraform competency is a side effect, not the goal — but it’s a real one.
The provider powers the SOVEREIGN infrastructure platform. Source available on request.