Terraform is powerful. Terraform is also dangerous. After deploying IaC to 50+ environments, we've watched teams get burned by the same structural mistakes. Here's what we've learned.
Core Principles
State File Problems
Drift from Manual Changes
Someone changes a security group rule with the AWS CLI. Or they click a button in the console. Your state file and reality diverge. This happens faster than you think.
Solution: AWS Config running continuously. A Lambda that reconciles drift back to Terraform state. And a strict policy: no manual changes. Ever. If you need to change something, do it in Terraform and apply it.
The Import Trap
You'll eventually need to import existing infrastructure into Terraform. Everyone does. When you import, always run terraform plan immediately after and review the output carefully.
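The mistake: importing a resource, then immediately running apply without reviewing the plan. If the configuration you wrote doesn't match the real resource, Terraform will plan to change the resource to match it, and for attributes that can't be updated in place, that means destroy and recreate. For databases and load balancers, this is a disaster.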
Always import. Always plan. Always review. Never apply without that middle step.
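One way to make the plan step hard to skip is a declarative `import` block, sketched here assuming Terraform 1.5 or later and the AWS provider. The resource address `aws_db_instance.main` and the identifier `prod-db` are hypothetical, and a matching `resource` block is assumed to already exist in the configuration.

```hcl
# Hypothetical example: bring an existing RDS instance under management.
# The import is part of the plan, so it can be reviewed before state changes.
import {
  to = aws_db_instance.main # address of the resource block to import into
  id = "prod-db"            # assumed DB instance identifier in AWS
}
```

With this in place, `terraform plan` reports the pending import alongside any changes Terraform would make to the resource, so the review covers both.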
Lifecycle Blocks Are Your Friend
Use them aggressively. For databases, use `prevent_destroy`. For resources that change frequently, use `ignore_changes`. Here's the pattern for a database:
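A minimal sketch of that pattern, assuming an AWS RDS instance managed with the AWS provider. The resource name, identifier, sizing, and the `db_password` variable are illustrative assumptions, not values from a real environment.

```hcl
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "main" {
  identifier        = "prod-db" # hypothetical name
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  username          = "app"
  password          = var.db_password

  lifecycle {
    # Any plan that would destroy this resource fails with an error.
    prevent_destroy = true

    # Password changes made outside Terraform are not reverted.
    ignore_changes = [password]
  }
}
```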
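The `prevent_destroy` makes Terraform reject any plan that would destroy the resource, which stops accidental deletion during an apply gone wrong. It only works while the resource block stays in your configuration; delete the block and the protection goes with it. The `ignore_changes` on password means Terraform won't try to reset your database password every time you rotate it manually.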
Module Architecture
The Rule of Three
If you're copy-pasting Terraform code, you should write a module. By the third copy, you've already lost. Use modules for any pattern you repeat more than twice.
Stable Interfaces
A module's input variables are its contract with the world. Once you publish a module, don't remove variables. Deprecate them, but don't remove. Removing or renaming an input breaks every consumer that sets it.
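One possible way to deprecate an input without breaking callers, sketched with hypothetical variable names (`instance_size` as the old input, `instance_type` as its replacement). This is a convention, not a built-in Terraform deprecation mechanism.

```hcl
# Old input, kept so existing callers keep working.
variable "instance_size" {
  description = "DEPRECATED: use instance_type instead."
  type        = string
  default     = null
}

# New input that replaces it.
variable "instance_type" {
  description = "EC2 instance type for the service."
  type        = string
  default     = null
}

locals {
  # Prefer the new input, fall back to the deprecated one, then to a default.
  instance_type = coalesce(var.instance_type, var.instance_size, "t3.micro")
}
```

Resources inside the module then read `local.instance_type`, so callers on either input get the same behavior until the old one is retired in a major version.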
Versioning
Source your modules with explicit versions. Not latest. Not a branch. A tag.
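A sketch of what pinning can look like. The repository URL, registry namespace, module names, and version numbers below are all hypothetical.

```hcl
# Git source pinned to a tag via ?ref=
module "network" {
  source = "git::https://example.com/acme/terraform-modules.git//network?ref=v1.4.2"
}

# Registry source pinned to an exact version
module "database" {
  source  = "acme/database/aws" # hypothetical registry module
  version = "2.3.0"
}
```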
When you need to update, you change the version deliberately. You test it in a non-prod environment first. Then you update production. No surprises.
CI/CD Pipeline
Your Terraform workflow should have four distinct stages:
- Validate/Lint: On every commit, run `terraform validate` and a linter like TFLint. Catch syntax errors before code review.
- Plan on PR: When a PR is opened, run `terraform plan` in non-prod and post the plan to the PR. Anyone can review the changes without running Terraform locally.
- Apply on merge: When the PR merges, apply. No separate manual approval step for routine production changes. The review happened. The approval was the merge. Destructive changes are the exception, covered below.
- Post-apply validation: After apply completes, run tests. Check that resources were created with the right configurations. Use Terratest or similar.
The Permission Conversation Nobody Has
This one is critical. When you automate Terraform applies, who gets to do what?
- Automated applies for low-risk changes: Adding tags, changing non-critical configuration. Let the automation run it.
- Require human approval for replacements: If Terraform is destroying and recreating a database, a load balancer, or a volume, require explicit approval. Not an email approval. A real human in a Slack thread saying "yes, do it."
- Never allow automated production applies to destroy resources. Ever. If you need to delete something in production, someone needs to run the destroy manually, scoped to that resource (a plain `terraform destroy` tears down everything in the configuration). With a 10-second confirmation prompt.
State Locking
Without locking, two engineers can apply Terraform at the same time. Both will read the same state file. Both will think they're safe. Both will apply changes. The second state write wins, and whatever the first apply recorded is lost, leaving real resources that Terraform no longer tracks.
With the S3 backend, DynamoDB state locking prevents this. When an apply starts, Terraform creates a lock entry in DynamoDB. When the apply finishes, the lock is released. If another apply tries to run while the lock exists, it fails with a clear error.
Configure it once, then never think about it again. It's that important.
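A minimal sketch of that one-time configuration, assuming the S3 backend. The bucket name, state key, region, and table name are hypothetical. The lock table is usually created from a separate bootstrap configuration, since the backend needs it to exist before the first `terraform init`.

```hcl
# In the workload configuration: point the backend at the lock table.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # hypothetical bucket
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# In a bootstrap configuration: the lock table itself.
# The S3 backend expects a string hash key named "LockID".
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```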
The Lessons We've Learned
- Never run apply immediately after bulk import. Run plan first and review.
- Use `prevent_destroy` on anything mission-critical.
- Break workloads into small state files. A single state file with 500 resources is a nightmare waiting to happen.
- Test your disaster recovery plan. If your state file gets corrupted, can you recover? You should have an answer.
- Document your module interfaces. Future-you will thank you.
- The DRY principle applies to Terraform as much as to application code. Three copies means write a module.
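One concrete piece of a state recovery plan, sketched under the assumption that state lives in an S3 bucket (the bucket name is hypothetical): with versioning enabled, a corrupted state file can be rolled back to an earlier object version.

```hcl
# Keep every prior version of the state object so a bad write can be undone.
resource "aws_s3_bucket_versioning" "state" {
  bucket = "example-terraform-state" # hypothetical state bucket

  versioning_configuration {
    status = "Enabled"
  }
}
```

Versioning only helps if someone has practiced the restore, so it is worth rehearsing the rollback in a non-prod environment.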