Terraform at Scale: Lessons from 50+ IaC Deployments
Terraform is powerful and Terraform is dangerous. After watching teams get burned by the same structural mistakes across dozens of engagements, we've built a set of principles we enforce on every IaC project we touch. Here's what we've learned.
Core Principles
Remote state is non-negotiable
S3 for state storage with DynamoDB for locking. Never commit state to git. Lock the state file against concurrent applies.
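A minimal sketch of the backend described above. The bucket, key, region, and table names are placeholders, not values from a real project, and the DynamoDB table is assumed to already exist with a string partition key named LockID.

```hcl
terraform {
  backend "s3" {
    bucket  = "example-terraform-state" # hypothetical bucket name
    key     = "networking/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true

    # Hypothetical lock table; its partition key must be "LockID" (type string).
    dynamodb_table = "example-terraform-locks"
  }
}
```

Recent Terraform releases also support lock files stored in the S3 bucket itself, so check the S3 backend documentation for the version you run before settling on a locking mechanism.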
Environments are separate state files
Dev, staging, prod each have their own backend. One bad apply in dev cannot touch prod.
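One way to keep the backends apart, sketched under assumptions, is partial backend configuration: the backend block in code stays empty, and each environment supplies its own settings at init time. The file name and all values below are hypothetical.

```hcl
# main.tf: no environment-specific values live in code.
terraform {
  backend "s3" {}
}
```

```hcl
# prod.s3.tfbackend (hypothetical file name), passed at init time:
#   terraform init -backend-config=prod.s3.tfbackend
bucket         = "example-prod-terraform-state"
key            = "app/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "example-prod-terraform-locks"
```

Putting each environment's bucket in a separate AWS account goes a step further, because dev credentials then cannot read or write prod state at all.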
Plan output is the artifact
Capture the plan, review it, then apply it. Never apply a plan that has not been reviewed in a pull request.
Blast radius management
Decompose state by blast radius. Database infrastructure is separate from application infrastructure is separate from networking.
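Separate states still need to share a few values, such as subnet IDs. A rough sketch of one option is the terraform_remote_state data source, which reads the outputs of another state without managing its resources. The bucket, key, AMI ID, and the output name private_subnet_id are all assumptions made for illustration.

```hcl
# In the application configuration: read outputs published by the networking state.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state" # hypothetical bucket name
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"

  # Assumes the networking configuration declares an output named "private_subnet_id".
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_id
}
```

The application state can be planned and applied on its own here, and a mistake in it cannot modify the networking resources it only reads from.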
State File Problems
State Drift from Manual Changes
Someone opens the AWS console and modifies a security group rule manually. Your state file doesn't know about it. Now your Terraform plan shows a change that would revert that manual change. People get scared and don't apply.
The solution: regular state audits. Run terraform plan -detailed-exitcode once a week in an automated job (exit code 2 means the plan contains changes) and alert if there's drift. More importantly, build a culture where people understand that manual changes break Terraform and that Terraform applies revert manual changes. If you have to change something manually, update the Terraform code afterward.
The Import Trap
You have existing infrastructure not managed by Terraform. You try to import it. Import works. Everyone's happy. Three months later, someone makes a manual change to the imported resource. Your state file still reflects the old configuration. Drift happens.
Import is useful for small migrations, but for large-scale adoption, build the new infrastructure with Terraform, run it alongside the old infrastructure, and do a clean cutover. Don't try to retrofit Terraform management onto existing resources at scale.
The Lifecycle Block
Some resources are critical and should never be deleted. Use lifecycle blocks, but use them sparingly and with clear intent:
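
resource "aws_db_instance" "production" {
  # Credentials and other required settings omitted for brevity.
  allocated_storage = 100
  engine            = "mysql"
  instance_class    = "db.t3.micro"

  # Terraform rejects any plan that would destroy this resource.
  lifecycle {
    prevent_destroy = true
  }
}

Terraform now refuses any plan that would destroy this resource, even an accidental one. That is good for production databases. The protection only holds while the resource block stays in the configuration: delete the whole block and the safeguard goes with it. It is also a source of friction if misused, since a team member trying to clean up old infrastructure will hit the error and get frustrated.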
Module Architecture at Scale
When to Extract a Module
The rule of three: when you've written the same pattern three times, extract it to a module. The first time, it's a one-off you don't fully understand yet. The second time, it might be a pattern. The third time, extract it.
Stable Opinionated Interfaces
A module should have a clear, stable interface. Inputs should be self-explanatory. Outputs should be documented. Don't create modules with 50 optional variables trying to cover every possible use case. Better to have multiple focused modules than one sprawling module trying to do everything.
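As a sketch of what a small, documented interface can look like, here is a hypothetical VPC module with two inputs and one output. The variable and output names are illustrative and not taken from any published module.

```hcl
# variables.tf of a hypothetical "vpc" module
variable "cidr_block" {
  description = "IPv4 CIDR range for the VPC, for example 10.0.0.0/16."
  type        = string
}

variable "enable_dns_hostnames" {
  description = "Whether instances in the VPC receive DNS hostnames."
  type        = bool
  default     = true
}

# main.tf
resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = var.enable_dns_hostnames
}

# outputs.tf
output "vpc_id" {
  description = "ID of the created VPC."
  value       = aws_vpc.this.id
}
```

Every input carries a type and a description, and the one optional setting has a sensible default, so a caller can use the module without reading its source.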
Module Versioning in Registry
Store modules in a Terraform registry (private or public). Version them semantically. In your configurations, always specify a version:

module "networking" {
  source  = "registry.terraform.io/laniakea/vpc/aws"
  version = "2.4.1"

  cidr_block = "10.0.0.0/16"
}

When the module publishes 2.4.2, or 3.0.0, nothing changes in your configuration until you bump the version. You choose when to upgrade, and you don't pick up breaking changes by accident.
The CI/CD Pipeline That Actually Works
Four stages, in order:
- Validate/Lint: terraform fmt -check and terraform validate run on every commit. Catch formatting and basic syntax errors immediately.
- Plan on PR: When a PR opens, run terraform plan -out=tfplan, keep the plan file, and post the output. Everyone reviewing the PR can see exactly what will change.
- Apply on merge: When the PR merges to main, run terraform apply with the saved plan file. Terraform rejects a saved plan if the state has changed since it was created, so re-plan in that case. No surprises, no manual applies.
- Post-apply validation: Run smoke tests after apply completes. Ensure resources are reachable, load balancers are healthy, databases are responding.
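Part of the post-apply stage can live in the configuration itself. The sketch below uses a check block (available from Terraform 1.5) with the hashicorp/http provider to probe a health endpoint. The URL is a hypothetical placeholder, and this is one possible approach, not a replacement for a proper smoke test suite.

```hcl
terraform {
  required_providers {
    http = {
      source = "hashicorp/http"
    }
  }
}

check "app_health" {
  # Scoped data source: evaluated as part of plan and apply.
  data "http" "health" {
    url = "https://app.example.com/healthz" # hypothetical endpoint
  }

  assert {
    condition     = data.http.health.status_code == 200
    error_message = "Health endpoint returned ${data.http.health.status_code}, expected 200."
  }
}
```

A failed check is reported as a warning and does not fail the apply, so the pipeline still needs its own smoke tests if an unhealthy deployment should block or roll back.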
The Conversation Nobody Has
Who has permission to apply in production?
Should anyone with terraform apply credentials be able to deploy? Or only certain users? Our recommendation: restricted to a CI/CD pipeline. Nobody applies manually in prod. All applies come from merged PRs. This creates an audit trail and enforces code review.
Automated vs Human Approval
Should a plan change be auto-approved or require human sign-off? For dev/staging, auto-approve. For prod, require an approval from someone other than the author. This catches mistakes early while still moving fast.
Common Mistakes
Bulk Import Without Testing
Never run a bulk import and then immediately apply. Import the resources, run terraform plan, and review the output carefully. The plan step catches mismatches between what was imported and what your configuration declares. Adjust the configuration until the plan shows only changes you intend, ideally none, and apply only once you're certain the results are correct.
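Configuration-driven import (Terraform 1.5 and later) fits this workflow, because the import itself appears in terraform plan and can be reviewed before anything touches state. A minimal sketch, with placeholder IDs and a hypothetical resource name:

```hcl
# Requires Terraform 1.5 or later.
import {
  to = aws_security_group.legacy_web
  id = "sg-0123456789abcdef0" # placeholder security group ID
}

# The resource block must describe the real object closely enough
# that the plan shows the import and no unintended changes.
resource "aws_security_group" "legacy_web" {
  name        = "legacy-web"
  description = "Imported security group for the legacy web tier"
  vpc_id      = "vpc-0123456789abcdef0" # placeholder VPC ID
}
```

If the plan reports the import plus attribute changes you did not expect, the configuration does not yet match the real resource and should be corrected before applying.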
Monolithic State Files
Everything-in-one-state-file sounds simple until you need to change a small part of your infrastructure and put the whole thing at risk in the process. Break infrastructure into logical blast radius zones: networking, compute, databases, and application deployments each get a separate state file.
Missing Variable Validation
A variable accepts a string when it should accept only specific values. Use variable validation:

variable "environment" {
  type = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

Catch mistakes at the input layer, not the apply layer.
Terraform is a 10-year investment. You're not just building infrastructure code, you're building a system for managing infrastructure over a decade. Make decisions that will still make sense when you have 500 modules and 100 engineers contributing code.
Building out your IaC practice?
We've designed and built Terraform systems for dozens of organizations. Let's talk about what a production-grade IaC architecture looks like for your infrastructure.