Terraform at Scale — Laniakea Consult

Terraform is powerful. Terraform is also dangerous. After deploying IaC to 50+ environments, we've watched teams get burned by the same structural mistakes. Here's what we've learned.

50+

IaC deployments managed

99.9%

Plan accuracy after process

4 hrs

Average time to rollback

Core Principles

Remote state is non-negotiable

S3 + DynamoDB for locking. Not on your laptop. Not in Git. Ever.

Environments are separate state files

One dev, one staging, one prod state. Never combine them.

Plan output is the artifact

The plan file is your source of truth for changes. Review it. Always.

Blast radius management via decomposition

Break workloads into small state files. Don't deploy everything at once.
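As a rough sketch of how these principles look in configuration, the backend block below keeps state in S3 with DynamoDB locking and gives one environment and one component its own state key. The bucket name, table name, region, and key layout are all placeholders, not values from any real setup.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tfstate"                # hypothetical bucket name
    key            = "prod/network/terraform.tfstate" # one key per environment and component
    region         = "us-east-1"                      # assumed region
    dynamodb_table = "terraform-locks"                # hypothetical lock table
    encrypt        = true                             # encrypt state at rest
  }
}
```

A staging copy of the same component would differ only in its `key` (for example `staging/network/terraform.tfstate`), so the two environments never share a state file.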

State File Problems

Drift from Manual Changes

A developer changes a security group rule from the CLI. Or they click a button in the console. Your state file and reality diverge. This happens faster than you think.

Solution: AWS Config running continuously. A Lambda that reconciles drift back to Terraform state. And a strict policy: no manual changes. Ever. If you need to change something, do it in Terraform and apply it.

The Import Trap

You'll eventually need to import existing infrastructure into Terraform. Everyone does. When you import, always run terraform plan immediately after and review the output carefully.

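The mistake: importing a resource, then immediately running apply without reviewing the plan. Import only records the resource in state. If your configuration doesn't match what actually exists, Terraform will plan to change the resource, and for arguments that can't be updated in place it will destroy and recreate it. For databases and load balancers, this is a disaster.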

Always import. Always plan. Always review. Never apply without that middle step.
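One way to force the plan step is a config-driven import, available in Terraform 1.5 and later. The sketch below assumes the existing database has the identifier `prod-db` and that a matching `aws_db_instance.prod` resource block already exists in the configuration.

```hcl
# Declares the import in code, so it goes through the normal plan/review flow.
import {
  to = aws_db_instance.prod # resource address in this configuration
  id = "prod-db"            # assumed identifier of the existing RDS instance
}
```

With this in place, `terraform plan` lists the resource as "to import" next to any changes it would make, and nothing is written to state until the reviewed plan is applied.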

Lifecycle Blocks Are Your Friend

Use them aggressively. For databases, use `prevent_destroy`. For resources that change frequently, use `ignore_changes`. Here's the pattern for a database:

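resource "aws_db_instance" "prod" {
  identifier        = "prod-db"
  engine            = "postgres"
  instance_class    = "db.r5.xlarge"
  allocated_storage = 100
  username          = var.db_username # assumed variable
  password          = var.db_password # assumed variable; declare it as sensitive

  # Take a final snapshot if the instance is ever deleted on purpose.
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final" # placeholder name

  lifecycle {
    prevent_destroy = true       # any plan that would destroy this errors out
    ignore_changes  = [password] # password is rotated outside Terraform
  }
}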

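The `prevent_destroy` setting makes Terraform reject any plan that would delete the instance, including a replacement. It only works while the resource block is still in your configuration: remove the block and the protection goes with it. The `ignore_changes` on password means Terraform won't try to reset your database password every time you rotate it manually.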
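For the "changes frequently" case, a common example is an Auto Scaling group whose desired capacity is adjusted by scaling policies. This is a minimal sketch; the group name, sizes, and both variables are illustrative assumptions.

```hcl
variable "private_subnet_ids" {
  type = list(string) # assumed: subnets the group launches into
}

variable "launch_template_id" {
  type = string # assumed: an existing launch template
}

resource "aws_autoscaling_group" "web" {
  name                = "web-asg" # placeholder name
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies move desired_capacity at runtime;
    # without this, every apply would reset it to 2.
    ignore_changes = [desired_capacity]
  }
}
```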

Module Architecture

The Rule of Three

If you're copy-pasting Terraform code, you should write a module. By the third copy, you've already lost. Use modules for any pattern you repeat more than twice.

Stable Interfaces

A module's input variables are its contract with the world. Once you publish a module, don't remove variables. Deprecate them, but don't remove. Changing a module's interface will break every consumer of that module.
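One possible way to deprecate without removing is to keep the old variable as an optional alias for the new one. In this sketch `cidr_block` is a hypothetical old name that was later renamed to `vpc_cidr`.

```hcl
variable "vpc_cidr" {
  type        = string
  default     = null
  description = "CIDR block for the VPC."
}

variable "cidr_block" {
  type        = string
  default     = null
  description = "DEPRECATED: use vpc_cidr. Kept so existing callers keep working."
}

locals {
  # Prefer the new name and fall back to the old one.
  # coalesce raises an error if both are null, so one of them must be set.
  vpc_cidr = coalesce(var.vpc_cidr, var.cidr_block)
}
```

The rest of the module reads `local.vpc_cidr`, so callers on either name get the same behavior.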

Versioning

Source your modules with explicit versions. Not latest. Not a branch. A tag.

module "network" {
  source = "git::https://github.com/yourorg/terraform-modules.git//network?ref=v2.4.1"

  vpc_cidr    = "10.0.0.0/16"
  environment = "prod"
}

When you need to update, you change the version deliberately. You test it in a non-prod environment first. Then you update production. No surprises.
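In practice the staged update can look like two root configurations pinned to different tags for a while. The file paths, the `v2.5.0` tag, and the staging CIDR below are hypothetical.

```hcl
# envs/staging/main.tf (hypothetical path): try the new tag here first
module "network" {
  source = "git::https://github.com/yourorg/terraform-modules.git//network?ref=v2.5.0"

  vpc_cidr    = "10.1.0.0/16"
  environment = "staging"
}

# envs/prod/main.tf (hypothetical path): stays on the known-good tag
# until the staging plan and apply have been verified
module "network" {
  source = "git::https://github.com/yourorg/terraform-modules.git//network?ref=v2.4.1"

  vpc_cidr    = "10.0.0.0/16"
  environment = "prod"
}
```

The production bump then becomes a one-line diff in its own PR, with its own plan to review.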

CI/CD Pipeline

Your Terraform workflow should have four distinct stages:

Validate/Lint: On every commit, run `terraform validate` and a linter like TFLint. Catch syntax errors before code review.
Plan on PR: When a PR is opened, run `terraform plan` in non-prod and post the plan to the PR. Anyone can review the changes without running Terraform locally.
Apply on merge: When the PR merges, apply. No separate manual approval step for routine production changes. The review happened. The approval was the merge. The one exception is a plan that replaces or destroys resources.
Post-apply validation: After apply completes, run tests. Check that resources were created with the right configurations. Use Terratest or similar.
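Terratest covers the heavier post-apply testing. For simple configuration assertions, a lighter alternative (not a replacement for Terratest) is a `check` block, available in Terraform 1.5 and later. The sketch below assumes the `aws_db_instance.prod` resource from earlier in this article is part of the same configuration.

```hcl
check "prod_db_is_private" {
  assert {
    # Evaluated at the end of every plan and apply.
    condition     = aws_db_instance.prod.publicly_accessible == false
    error_message = "prod-db must not be publicly accessible."
  }
}
```

A failing check is reported as a warning and does not block the apply, so the pipeline still needs to surface that warning to someone.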

Key principle: The plan is the contract. If your plan shows a resource replacement (marked `-/+` or `+/-` in the plan output; a bare `~` is only an in-place update), require human approval before apply. Replacements are dangerous. Everything else goes through automatically once approved in code review.

The Permission Conversation Nobody Has

This one is critical. When you automate Terraform applies, who gets to do what?

Automated applies for low-risk changes: Adding tags, changing non-critical configuration. Let the automation run it.
Require human approval for replacements: If Terraform is destroying and recreating a database, a load balancer, or a volume, require explicit approval. Not an email approval. A real human in a Slack thread saying "yes, do it."
Never allow automated production applies to destroy resources. Ever. If you need to delete something in production, someone needs to run `terraform destroy` manually, scoped to that resource with `-target`, and pause before typing `yes` at the confirmation prompt.
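One way to back this policy with permissions rather than convention is an explicit deny on the role the pipeline assumes. This is a sketch: the role name variable is an assumption, and the action list is illustrative rather than exhaustive.

```hcl
variable "ci_role_name" {
  type = string # assumed: name of the IAM role used by the CI pipeline
}

data "aws_iam_policy_document" "ci_no_destroy" {
  statement {
    sid    = "DenyDestructiveActions"
    effect = "Deny"
    actions = [
      "rds:DeleteDBInstance",
      "rds:DeleteDBCluster",
      "elasticloadbalancing:DeleteLoadBalancer",
      "ec2:DeleteVolume",
      "s3:DeleteBucket",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "ci_no_destroy" {
  name   = "deny-destructive-actions"
  role   = var.ci_role_name
  policy = data.aws_iam_policy_document.ci_no_destroy.json
}
```

Because a replacement also needs the delete call, an automated apply that tries to replace one of these resources fails at the API. That pushes the change to a human with a more privileged role.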

State Locking

Without locking, two engineers can apply Terraform at the same time. Both will read the same state file. Both will think they're safe. Both will apply changes. The second one wins, and the first one's changes are lost.

DynamoDB state locking prevents this. When a plan or apply starts, Terraform creates a lock entry in DynamoDB. When the operation finishes, the lock is released. If another run tries to start while the lock exists, it fails with a clear error.

Configure it once, then never think about it again. It's that important.
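The lock table itself is small. A minimal sketch, assuming the table name `terraform-locks` (a placeholder that must match the `dynamodb_table` value in your S3 backend configuration):

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # placeholder; must match the backend config
  billing_mode = "PAY_PER_REQUEST" # lock traffic is tiny, so on-demand is enough
  hash_key     = "LockID"          # the S3 backend expects exactly this key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

This table is usually created in a small bootstrap configuration, separate from the stacks that rely on it for locking.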

Remember: Terraform is a 10-year investment. You're not just deploying infrastructure today — you're committing to maintaining this code for a decade. Build it to last.

The Lessons We've Learned

Never run apply immediately after bulk import. Run plan first and review.
Use `prevent_destroy` on anything mission-critical.
Break workloads into small state files. A single state file with 500 resources is a nightmare waiting to happen.
Test your disaster recovery plan. If your state file gets corrupted, can you recover? You should have an answer.
Document your module interfaces. Future-you will thank you.
DRY principle applies to Terraform as much as code. Three copies means write a module.
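For the state recovery question, one common answer is object versioning on the state bucket, so a corrupted state file can be rolled back to a previous version. A sketch, with a placeholder bucket name:

```hcl
resource "aws_s3_bucket" "tfstate" {
  bucket = "example-tfstate" # hypothetical bucket name

  lifecycle {
    prevent_destroy = true # the state bucket is as critical as anything it tracks
  }
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled" # keeps every prior version of each state object
  }
}
```

Versioning only helps if someone has practiced the restore, so the recovery drill is still worth running.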

