Terraform is powerful. Terraform is also dangerous. After deploying IaC to 50+ environments, we've watched teams get burned by the same structural mistakes. Here's what we've learned.

50+ IaC deployments managed
99.9% plan accuracy after process
4 hrs average time to rollback

Core Principles

Remote state is non-negotiable
State lives in S3, with DynamoDB for locking. Not on your laptop. Not in Git. Ever.
Environments are separate state files
One dev, one staging, one prod state. Never combine them.
Plan output is the artifact
The plan file is your source of truth for changes. Review it. Always.
Blast radius management via decomposition
Break workloads into small state files. Don't deploy everything at once.
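
As a minimal sketch of the first two principles, a backend block along these lines keeps state in S3 with a DynamoDB lock table and gives each environment its own state key. The bucket, table, key, and region values are placeholders, not anything prescribed above.

```hcl
# backend.tf for the prod root module (all names below are hypothetical)
terraform {
  backend "s3" {
    bucket         = "example-org-terraform-state"    # assumed bucket name
    key            = "prod/network/terraform.tfstate" # one key per environment and component
    region         = "us-east-1"                      # assumed region
    dynamodb_table = "terraform-locks"                # assumed lock table name
    encrypt        = true
  }
}
```

A dev or staging root module would carry the same block with a different `key` (for example `dev/network/terraform.tfstate`), so no two environments share a state file. Backend blocks can't reference variables, so each root either hard-codes these values or receives them through `terraform init -backend-config`.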

State File Problems

Drift from Manual Changes

A developer changes a security group rule from the CLI during an incident. Or they click a button in the console. Your state file and reality diverge. This happens faster than you think.

Solution: AWS Config running continuously. A Lambda that reconciles drift back to Terraform state. And a strict policy: no manual changes. Ever. If you need to change something, do it in Terraform and apply it.

The Import Trap

You'll eventually need to import existing infrastructure into Terraform. Everyone does. When you import, always run `terraform plan` immediately after and review the output carefully.

The mistake: importing a resource, then immediately running apply without reviewing the plan. Import only records the resource in state. If your configuration doesn't match what actually exists, Terraform will plan to change the resource to fit the configuration, and when an attribute that can't be updated in place differs, it will destroy and recreate it. For databases and load balancers, this is a disaster.

Always import. Always plan. Always review. Never apply without that middle step.
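
On Terraform 1.5 or later, one way to enforce that order is a declarative `import` block, which makes the import itself appear in the plan instead of happening as a separate CLI step. This is a sketch, assuming the existing database's identifier is `prod-db` and that a matching `aws_db_instance.prod` resource block already exists in the configuration:

```hcl
# imports.tf (assumes Terraform >= 1.5)
import {
  to = aws_db_instance.prod # resource address in your configuration
  id = "prod-db"            # assumed: identifier of the existing RDS instance
}
```

With this in place, `terraform plan` reports the pending import together with any attribute changes Terraform would make, so the import is reviewed before anything is written to state. Any change listed beyond the import itself is a sign the configuration doesn't yet match reality.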

Lifecycle Blocks Are Your Friend

Use them aggressively. For databases, use `prevent_destroy`. For resources that change frequently, use `ignore_changes`. Here's the pattern for a database:

    resource "aws_db_instance" "prod" {
      allocated_storage   = 100
      identifier          = "prod-db"
      engine              = "postgres"
      instance_class      = "db.r5.xlarge"
      skip_final_snapshot = false

      lifecycle {
        prevent_destroy = true
        ignore_changes  = [password]
      }
    }

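The `prevent_destroy` setting makes Terraform reject any plan that would destroy the database, including a replacement triggered by a change you didn't mean to make. It only protects you while the resource block stays in the configuration; delete the block and the safeguard goes with it. The `ignore_changes` on password means Terraform won't try to reset your database password every time you rotate it manually.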
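
For the "resources that change frequently" half of that advice, a common case is an attribute that something other than Terraform adjusts at runtime. A sketch, assuming an Auto Scaling group whose desired capacity is driven by scaling policies; the resource name, sizes, and both input variables are illustrative:

```hcl
variable "private_subnet_ids" {
  type        = list(string)
  description = "Subnets for the group (hypothetical input)"
}

variable "launch_template_id" {
  type        = string
  description = "ID of an existing launch template (hypothetical input)"
}

resource "aws_autoscaling_group" "web" {
  name                = "web-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2 # initial value only; later changes are ignored
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change this at runtime; don't let Terraform reset it.
    ignore_changes = [desired_capacity]
  }
}
```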

Module Architecture

The Rule of Three

If you're copy-pasting Terraform code, you should write a module. By the third copy, you've already lost. Use modules for any pattern you repeat more than twice.

Stable Interfaces

A module's input variables are its contract with the world. Once you publish a module, don't remove variables. Deprecate them, but don't remove. Changing a module's interface will break every consumer of that module.
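
One way to deprecate without breaking callers is to keep the old variable alongside its replacement and resolve the two internally. A minimal sketch, assuming a network module that is renaming a hypothetical `cidr_block` input to `vpc_cidr`:

```hcl
# variables.tf inside a hypothetical network module

variable "vpc_cidr" {
  type        = string
  default     = null
  description = "CIDR block for the VPC."
}

variable "cidr_block" {
  type        = string
  default     = null
  description = "DEPRECATED: use vpc_cidr. Kept so existing callers keep working."
}

locals {
  # Prefer the new name and fall back to the old one.
  # coalesce() raises an error if both are null, so one of them must be set.
  vpc_cidr = coalesce(var.vpc_cidr, var.cidr_block)
}
```

The rest of the module reads `local.vpc_cidr`, so consumers can migrate on their own schedule.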

Versioning

Source your modules with explicit versions. Not latest. Not a branch. A tag.

    module "network" {
      source = "git::https://github.com/yourorg/terraform-modules.git//network?ref=v2.4.1"

      vpc_cidr    = "10.0.0.0/16"
      environment = "prod"
    }

When you need to update, you change the version deliberately. You test it in a non-prod environment first. Then you update production. No surprises.

CI/CD Pipeline

Your Terraform workflow should have four distinct stages:

  1. Validate/Lint: On every commit, run `terraform validate` and a linter like TFLint. Catch syntax errors before code review.
  2. Plan on PR: When a PR is opened, run `terraform plan` in non-prod and post the plan to the PR. Anyone can review the changes without running Terraform locally.
  3. Apply on merge: When the PR merges, apply. No separate manual approval step in production for routine changes. The review happened. The approval was the merge.
  4. Post-apply validation: After apply completes, run tests. Check that resources were created with the right configurations. Use Terratest or similar.

Key principle: The plan is the contract. If your plan shows a resource replacement (marked `-/+` or `+/-`, as opposed to the `~` of an in-place update), require human approval before apply. Replacements are dangerous. Everything else goes through automatically once approved in code review.
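
For the post-apply stage, a lightweight complement to an external test harness is Terraform's own `check` block. This is not Terratest; it is a swapped-in illustration that assumes Terraform 1.5 or later and the `aws_db_instance.prod` resource shown earlier living in the same configuration:

```hcl
# checks.tf (assumes Terraform >= 1.5)
check "prod_db_is_private" {
  assert {
    # Evaluated at the end of every plan and apply.
    condition     = aws_db_instance.prod.publicly_accessible == false
    error_message = "prod-db must not be publicly accessible."
  }
}
```

A failed check is reported as a warning and does not stop the apply, so the pipeline still needs to inspect the output and fail the job if that is the behavior you want.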

The Permission Conversation Nobody Has

This one is critical. When you automate Terraform applies, who gets to do what?

State Locking

Without locking, two engineers can apply Terraform at the same time. Both will read the same state file. Both will think they're safe. Both will apply changes. Whichever run writes state last overwrites the other, so the state file no longer records what the first apply actually changed.

DynamoDB state locking prevents this. When an apply starts, Terraform creates a lock entry in DynamoDB. When the apply finishes, the lock is released. If another apply tries to run while the lock exists, it fails with a clear error.

Configure it once, then never think about it again. It's that important.
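
The lock table itself is small. A sketch, assuming a table named `terraform-locks`; the name is a placeholder and must match the `dynamodb_table` value in your S3 backend configuration:

```hcl
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks" # assumed name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Because the backend needs this table before `terraform init` can succeed, it is usually created from a small bootstrap configuration rather than from the stack it protects. Newer Terraform releases also offer locking built into the S3 backend, so check the backend documentation for the version you run.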

Remember: Terraform is a 10-year investment. You're not just deploying infrastructure today; you're committing to maintaining this code for a decade. Build it to last.
