Infrastructure as Code (IaC) treats infrastructure configuration as software, enabling version control, testing, and automation of infrastructure deployments.
## Core Principles

### 1. Declarative Over Imperative

**Declarative (preferred):** Define the desired state; the tool figures out how to achieve it.

```hcl
# Terraform - Declarative
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"
  count         = 3
}
```

**Imperative:** Specify the exact steps to reach that state.

```python
# Imperative approach
for i in range(3):
    create_instance(ami="ami-12345678", type="t3.micro")
```

**Why Declarative Wins**
- Idempotent by design
- Self-documenting current state
- Easier to reason about
- Better for drift detection
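At its core, the declarative model is a reconcile loop: compare desired state against actual state and apply only the difference. A minimal Python sketch of that idea (the instance names and dict-based state are purely illustrative, not any real tool's internals):

```python
# Minimal reconcile loop: compute the diff between the desired state
# and the actual state, then apply only that diff.
def reconcile(desired: dict, actual: dict) -> dict:
    to_create = {k: v for k, v in desired.items() if k not in actual}
    to_delete = {k: v for k, v in actual.items() if k not in desired}
    to_update = {k: v for k, v in desired.items()
                 if k in actual and actual[k] != v}
    return {"create": to_create, "update": to_update, "delete": to_delete}

desired = {"web-1": "t3.micro", "web-2": "t3.micro"}
actual = {"web-1": "t3.small", "db-1": "t3.large"}
plan = reconcile(desired, actual)
# web-2 is created, web-1 is resized, db-1 is deleted
```

Running the loop again after applying the plan yields an empty diff, which is exactly why declarative tooling is idempotent by design.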
### 2. Version Everything
Store all infrastructure code in version control:
- Configuration files
- Scripts and automation
- Documentation
- Policies and compliance rules
Never store:
- Secrets or credentials
- State files (use remote backends)
- Generated artifacts
- Binary files (unless necessary)
### 3. Immutable Infrastructure

Replace infrastructure rather than updating it in place.

**Mutable (anti-pattern):**

```shell
# SSH into the server and update it in place
ssh server01
apt-get update && apt-get upgrade
systemctl restart nginx
```

**Immutable (preferred):**

```hcl
# Build a new AMI with the updates, deploy new instances,
# then terminate the old instances.
resource "aws_instance" "web" {
  ami           = data.aws_ami.latest.id # New AMI
  instance_type = "t3.micro"

  lifecycle {
    create_before_destroy = true
  }
}
```
**Benefits:**
- Predictable deployments
- Easy rollbacks
- Reduced configuration drift
- Simplified testing
### 4. Idempotency

Running the same code multiple times produces the same result.

**Idempotent:**

```hcl
resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
  # Running apply again doesn't create a duplicate bucket
}
```

**Not Idempotent:**

```shell
# Creates a new bucket each time it runs
aws s3 mb s3://my-data-bucket-$(date +%s)
```
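The same contrast can be sketched outside of Terraform. In plain Python, an idempotent "ensure" function converges on the desired state no matter how many times it runs (the `buckets` dict here is a stand-in for a real cloud API):

```python
buckets = {}  # stand-in for the cloud provider's bucket store

def ensure_bucket(name: str) -> None:
    """Idempotent: creates the bucket only if it doesn't already exist."""
    buckets.setdefault(name, {"name": name})

def create_unique_bucket(name: str) -> str:
    """Not idempotent: every call creates a new, differently-named bucket."""
    unique = f"{name}-{len(buckets)}"
    buckets[unique] = {"name": unique}
    return unique

ensure_bucket("my-data-bucket")
ensure_bucket("my-data-bucket")  # second run is a no-op
```

After two `ensure_bucket` calls there is still exactly one bucket; two `create_unique_bucket` calls would leave two.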
### 5. Single Source of Truth

Infrastructure state should have one authoritative source.

**Good: Remote state backend**

```hcl
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "prod/infrastructure.tfstate"
    region = "us-east-1"
  }
}
```

**Bad:** Multiple local state files and manual tracking.
## Organization Best Practices

### 1. Directory Structure

**Small Projects**

```text
infra/
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars
└── README.md
```

**Medium Projects**

```text
infra/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   ├── compute/
│   └── database/
└── global/
    └── iam/
```

**Large Projects**

```text
infra/
├── live/
│   ├── prod/
│   │   ├── us-east-1/
│   │   │   ├── vpc/
│   │   │   ├── eks/
│   │   │   └── rds/
│   │   └── eu-west-1/
│   └── dev/
├── modules/
└── policies/
```
### 2. Naming Conventions

**Resources**

```text
{environment}-{application}-{resource-type}-{descriptor}
```

Examples:

- prod-api-ec2-web
- staging-db-rds-primary
- dev-cache-elasticache-redis
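A convention like this is easy to enforce in tooling rather than by review alone. A small Python helper as a sketch (the function name and validation rules are my own, not any standard library):

```python
import re

def resource_name(environment: str, application: str,
                  resource_type: str, descriptor: str) -> str:
    """Build a {environment}-{application}-{resource-type}-{descriptor} name."""
    parts = [environment, application, resource_type, descriptor]
    for part in parts:
        # Allow only lowercase alphanumerics and hyphens in each segment
        if not re.fullmatch(r"[a-z0-9-]+", part):
            raise ValueError(f"invalid name segment: {part!r}")
    return "-".join(parts)

print(resource_name("prod", "api", "ec2", "web"))  # prod-api-ec2-web
```

A helper like this could back a CI check or a Terraform `validation` block, keeping names consistent across teams.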
**Variables**

```hcl
# Use descriptive names
variable "database_instance_type" {} # Good
variable "db_type" {}                # Too vague

# Use plurals for lists
variable "availability_zones" {} # Good
variable "az" {}                 # Unclear
```

**Modules**

```text
modules/
├── vpc-standard/
├── eks-cluster/
└── rds-postgres/
```
### 3. Environment Management

**Option 1: Separate Directories**

```text
environments/
├── dev/
│   └── main.tf
├── staging/
│   └── main.tf
└── prod/
    └── main.tf
```

**Option 2: Workspaces**

```shell
terraform workspace new dev
terraform workspace new staging
terraform workspace new prod
```

**Option 3: Separate Repositories**

```text
infra-dev/
infra-staging/
infra-prod/
```

**Recommendation:** Separate directories with shared modules fit most use cases.
## Module Design

### 1. Module Composition

**A Well-Structured Module**

```text
modules/application-stack/
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf
├── README.md
└── examples/
    └── basic/
```

**Key Principles:**
- Single responsibility
- Reusable across environments
- Well-documented inputs/outputs
- Versioned releases
- Include examples
### 2. Input Validation

```hcl
variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_count" {
  description = "Number of instances"
  type        = number

  validation {
    condition     = var.instance_count > 0 && var.instance_count <= 10
    error_message = "Instance count must be between 1 and 10."
  }
}
```
### 3. Output Documentation

```hcl
output "vpc_id" {
  description = "ID of the VPC created for this environment"
  value       = aws_vpc.main.id
}

output "database_endpoint" {
  description = "RDS instance endpoint for application connections"
  value       = aws_db_instance.main.endpoint
  sensitive   = false
}

output "database_password" {
  description = "Master password for the database (sensitive)"
  value       = aws_db_instance.main.password
  sensitive   = true
}
```
## Security Best Practices

### 1. Secret Management

**Never commit secrets.**

```hcl
# BAD - Hardcoded password
resource "aws_db_instance" "main" {
  password = "SuperSecret123!" # NEVER DO THIS
}

# GOOD - Reference the password from a secrets manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

**Use Environment Variables**

```shell
export TF_VAR_database_password="secret"
terraform apply
```

**Mark Variables as Sensitive**

```hcl
variable "api_key" {
  description = "API key for external service"
  type        = string
  sensitive   = true
}
```
### 2. Least Privilege

Grant Terraform only the permissions it actually needs. A minimal IAM policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ec2:Describe*",
      "ec2:CreateTags",
      "ec2:RunInstances",
      "ec2:TerminateInstances"
    ],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "aws:RequestedRegion": "us-east-1"
      }
    }
  }]
}
```
### 3. State File Security

```hcl
# Encrypted state backend
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true               # Encrypt at rest
    kms_key_id     = "arn:aws:kms:..."  # Customer-managed key
    dynamodb_table = "terraform-locks"  # State locking
  }
}
```
## Testing Strategies

### 1. Static Analysis

**Terraform Validate**

```shell
terraform validate
```

**Linting**

```shell
tflint --config=.tflint.hcl
```

**Security Scanning**

```shell
# tfsec
tfsec .

# Checkov
checkov -d .

# terrascan
terrascan scan
```

### 2. Plan Review

```shell
# Generate the plan
terraform plan -out=tfplan

# Review the plan
terraform show tfplan

# Apply only after review
terraform apply tfplan
```
### 3. Automated Testing

**Unit Tests (Terratest)**

```go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "cidr_block": "10.0.0.0/16",
        },
    }
    defer terraform.Destroy(t, terraformOptions)

    terraform.InitAndApply(t, terraformOptions)

    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID)
}
```
**Integration Tests**

```python
# Using pytest and boto3
import boto3

def test_ec2_instance_running():
    ec2 = boto3.client('ec2')
    instances = ec2.describe_instances(
        Filters=[{'Name': 'tag:Environment', 'Values': ['test']}]
    )
    assert len(instances['Reservations']) > 0
    assert instances['Reservations'][0]['Instances'][0]['State']['Name'] == 'running'
```
### 4. Compliance Testing

**Open Policy Agent (OPA)**

```rego
# policy.rego
package terraform.analysis

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not resource.change.after.server_side_encryption_configuration
    msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.address])
}
```
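If OPA isn't part of your toolchain, the same check can be run against `terraform show -json` output. A rough Python equivalent of the policy above (the plan dict mimics Terraform's JSON plan format; the helper name is mine):

```python
def unencrypted_s3_buckets(plan: dict) -> list:
    """Return addresses of S3 buckets without server-side encryption."""
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_s3_bucket":
            continue
        after = rc.get("change", {}).get("after") or {}
        if not after.get("server_side_encryption_configuration"):
            violations.append(rc["address"])
    return violations

plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"after": {}}},
        {"address": "aws_s3_bucket.data", "type": "aws_s3_bucket",
         "change": {"after": {"server_side_encryption_configuration": [{}]}}},
    ]
}
print(unencrypted_s3_buckets(plan))  # ['aws_s3_bucket.logs']
```

A non-empty result can fail the pipeline, mirroring the `deny` rule's behavior.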
**Sentinel (Terraform Cloud)**

```sentinel
import "tfplan/v2" as tfplan

main = rule {
    all tfplan.resource_changes as _, rc {
        rc.type == "aws_instance" implies
            rc.change.after.instance_type in ["t3.micro", "t3.small"]
    }
}
```
## CI/CD Integration

### 1. Pipeline Stages

```yaml
# GitLab CI example
stages:
  - validate
  - plan
  - apply

validate:
  stage: validate
  script:
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check
    - tfsec .

plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan

apply:
  stage: apply
  script:
    - terraform apply tfplan
  when: manual
  only:
    - main
```
### 2. Approval Gates

```yaml
# GitHub Actions with manual approval
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Terraform Plan
        run: terraform plan -out=tfplan

  approve:
    needs: plan
    runs-on: ubuntu-latest
    environment: production # Requires manual approval
    steps:
      - run: echo "Approved"

  apply:
    needs: approve
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Apply
        run: terraform apply tfplan
```
### 3. Automated Rollback

```shell
# Store the previous successful state
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json

# If the deployment fails, restore it
terraform state push state-backup-<timestamp>.json
```
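Picking the right backup to restore is easy to get wrong by hand. A small Python sketch that selects the newest backup by the timestamp embedded in the filename (the naming scheme matches the shell commands above; the helper itself is illustrative):

```python
import re

def latest_backup(filenames: list) -> str:
    """Return the backup file with the newest embedded timestamp, or None."""
    pattern = re.compile(r"^state-backup-(\d{8}-\d{6})\.json$")
    stamped = [(m.group(1), name)
               for name in filenames
               if (m := pattern.match(name))]
    # YYYYMMDD-HHMMSS timestamps sort lexicographically in time order
    return max(stamped)[1] if stamped else None

backups = [
    "state-backup-20250110-093000.json",
    "state-backup-20250112-170500.json",
    "state-backup-20250111-120000.json",
]
print(latest_backup(backups))  # state-backup-20250112-170500.json
```

A rollback script could feed the result straight into `terraform state push`.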
## Documentation Standards

### 1. Module Documentation

````markdown
# VPC Module

Creates a VPC with public and private subnets across multiple AZs.

## Usage

```hcl
module "vpc" {
  source = "./modules/vpc"

  name               = "production"
  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]
}
```

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| name | VPC name | string | - | yes |
| cidr_block | VPC CIDR | string | - | yes |

## Outputs

| Name | Description |
|------|-------------|
| vpc_id | VPC identifier |
````

### 2. Architecture Diagrams

Include them in the README.md:

```markdown

```

### 3. Change Documentation

```markdown
# CHANGELOG.md

## [2.0.0] - 2025-01-15

### Breaking Changes
- Removed deprecated `subnet_type` variable
- Renamed `enable_nat` to `enable_nat_gateway`

### Added
- Support for IPv6
- VPC flow logs

### Fixed
- Issue with subnet CIDR calculation
```
## State Management

### 1. State Locking

```hcl
# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```
### 2. State Manipulation

```shell
# List resources
terraform state list

# Show resource details
terraform state show aws_instance.web

# Move a resource
terraform state mv aws_instance.old aws_instance.new

# Remove from state (doesn't delete the resource)
terraform state rm aws_instance.temp

# Import an existing resource
terraform import aws_instance.web i-1234567890abcdef0
```
### 3. State Backup

```shell
# Pull the current state
terraform state pull > terraform.tfstate.backup

# Push state (use with caution!)
terraform state push terraform.tfstate.backup
```
## Drift Detection

### 1. Regular Checks

```shell
# Detect drift
terraform plan -detailed-exitcode

# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected
```
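In scripts, those exit codes are worth mapping to explicit outcomes rather than treating every non-zero code as a failure. A hedged Python sketch (the function name is mine; the subprocess call is shown only in comments so the mapping stands alone):

```python
def classify_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` exit codes to outcomes."""
    outcomes = {0: "no-changes", 1: "error", 2: "drift-detected"}
    return outcomes.get(code, "unknown")

# Typical usage with subprocess:
# import subprocess
# result = subprocess.run(["terraform", "plan", "-detailed-exitcode"])
# print(classify_plan_exit(result.returncode))

print(classify_plan_exit(0))  # no-changes
print(classify_plan_exit(2))  # drift-detected
```

Distinguishing "error" from "drift-detected" lets a scheduled job alert on drift without masking genuine failures.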
### 2. Automated Drift Detection

```yaml
# GitHub Actions
name: Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *' # Every 6 hours

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Terraform Plan
        id: plan
        run: |
          terraform init
          terraform plan -detailed-exitcode
        continue-on-error: true
      - name: Notify on Drift
        # With continue-on-error, check the step outcome rather than failure()
        if: steps.plan.outcome == 'failure'
        run: |
          echo "Drift detected" # Send alert via Slack, email, etc.
```
### 3. Drift Reconciliation

```shell
# Option 1: Update the infrastructure to match the code
terraform apply

# Option 2: Update the code to match the infrastructure
terraform refresh
terraform plan
# Review the changes and update the code accordingly
```
## Performance Optimization

### 1. Reduce Plan Time

```shell
# Use targeted applies when appropriate
terraform apply -target=module.vpc

# Increase parallelism
terraform apply -parallelism=20
```

### 2. Module Efficiency

```hcl
# Cache data sources in locals
locals {
  availability_zones = data.aws_availability_zones.available.names
}

# Use count/for_each efficiently
resource "aws_subnet" "private" {
  for_each = toset(local.availability_zones)

  vpc_id            = aws_vpc.main.id
  availability_zone = each.key
}
```

### 3. State File Optimization

Separate large deployments into smaller state files. Instead of one large state:

```text
terraform/
├── networking/ # Own state
├── compute/    # Own state
└── databases/  # Own state
```
## Common Anti-Patterns

### 1. Manual Changes

**Problem:** Making changes outside of IaC.
**Solution:** Always make changes through infrastructure code.

### 2. Shared State Without Locking

**Problem:** Concurrent modifications corrupt state.
**Solution:** Always use state locking.

### 3. Hardcoded Values

**Problem:** Not reusable and error-prone.
**Solution:** Use variables and locals.

### 4. No Module Versioning

**Problem:** Breaking changes affect all users.
**Solution:** Version modules and pin versions.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0" # Pin the major version
}
```

### 5. Monolithic Configurations

**Problem:** Difficult to manage, slow deployments.
**Solution:** Break configurations into logical modules.

### 6. Inadequate Testing

**Problem:** Issues surface in production.
**Solution:** Test in lower environments first.

### 7. No Disaster Recovery Plan

**Problem:** Can't recover from state corruption.
**Solution:** Take regular state backups and document the recovery procedure.
## Tool-Specific Best Practices

### Terraform

- Use workspaces for environments sparingly
- Leverage modules for reusability
- Always use remote state
- Pin provider versions

### Pulumi

- Use stack references for cross-stack dependencies
- Leverage built-in secrets management
- Use component resources
- Implement unit tests

### CloudFormation

- Use nested stacks for large deployments
- Use change sets for review
- Use stack policies for protection
- Leverage StackSets for multi-account deployments

### Ansible

- Keep playbooks idempotent
- Use roles for organization
- Test roles with Molecule
- Use Ansible Vault for secrets
## Monitoring and Observability

### 1. Track Deployments

```hcl
# Tag resources with deployment metadata
resource "aws_instance" "web" {
  tags = {
    DeployedBy  = "Terraform"
    DeployedAt  = timestamp()
    GitCommit   = var.git_commit
    GitBranch   = var.git_branch
    Environment = var.environment
  }
}
```

### 2. Audit Logging

```hcl
# CloudTrail for AWS API calls
resource "aws_cloudtrail" "main" {
  name                          = "terraform-api-audit"
  s3_bucket_name                = aws_s3_bucket.logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
}
```
### 3. Metrics Collection

```python
# Track IaC metrics from a deployment wrapper script
import time

start_time = time.time()
# ... run the deployment here, collecting the resource change lists ...

metrics = {
    "deployment_duration": time.time() - start_time,
    "resources_created": len(created_resources),
    "resources_modified": len(modified_resources),
    "resources_deleted": len(deleted_resources),
}
```
## Conclusion

Effective Infrastructure as Code requires:

- Treating infrastructure as software
- Version-controlling everything
- Automating testing and deployment
- Implementing security best practices
- Documenting thoroughly
- Monitoring and auditing changes
- Planning for failure and recovery

The goal is reliable, repeatable, auditable infrastructure deployments that let teams move fast without breaking things.
Looking for infrastructure automation expertise? I use Terraform and IaC principles to manage cloud infrastructure—this entire site is deployed with infrastructure as code. See what I can automate for you.