Infrastructure as Code Best Practices

November 10, 2025

Essential principles and practices for managing infrastructure as code across any platform

Infrastructure as Code (IaC) treats infrastructure configuration as software, enabling version control, testing, and automation of infrastructure deployments.

Core Principles

1. Declarative Over Imperative

Declarative (preferred): define the desired state and let the tool determine how to achieve it

# Terraform - Declarative
resource "aws_instance" "web" {
  ami           = "ami-12345678"
  instance_type = "t3.micro"
  count         = 3
}

Imperative: specify the exact steps required to reach that state

# Imperative approach (pseudocode; create_instance stands in for an SDK call)
for i in range(3):
    create_instance(ami="ami-12345678", type="t3.micro")

Why Declarative Wins

  • Idempotent by design
  • Self-documenting current state
  • Easier to reason about
  • Better for drift detection

2. Version Everything

Store all infrastructure code in version control:

  • Configuration files
  • Scripts and automation
  • Documentation
  • Policies and compliance rules

Never store:

  • Secrets or credentials
  • State files (use remote backends)
  • Generated artifacts
  • Binary files (unless necessary)
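A minimal .gitignore for a Terraform repository can enforce the "never store" list above (note that some teams do commit non-sensitive .tfvars files; adjust to taste):

```
# Local state and backups (state belongs in a remote backend)
*.tfstate
*.tfstate.*

# Provider plugins and module cache
.terraform/

# Variable files that may contain secrets
*.tfvars
*.tfvars.json

# Crash logs
crash.log
```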

3. Immutable Infrastructure

Replace rather than update infrastructure.

Mutable (anti-pattern):

# SSH into server and update
ssh server01
apt-get update && apt-get upgrade
systemctl restart nginx

Immutable (preferred):

# Build new AMI with updates
# Deploy new instances
# Terminate old instances
resource "aws_instance" "web" {
  ami           = data.aws_ami.latest.id  # New AMI
  instance_type = "t3.micro"

  lifecycle {
    create_before_destroy = true
  }
}

Benefits:

  • Predictable deployments
  • Easy rollbacks
  • Reduced configuration drift
  • Simplified testing

4. Idempotency

Running code multiple times produces the same result.

Idempotent

resource "aws_s3_bucket" "data" {
  bucket = "my-data-bucket"
  # Running again doesn't create duplicate
}

Not Idempotent

# Creates new bucket each time
aws s3 mb s3://my-data-bucket-$(date +%s)

5. Single Source of Truth

Infrastructure state should have one authoritative source.

Good: Remote state backend

terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "prod/infrastructure.tfstate"
    region = "us-east-1"
  }
}

Bad: Multiple local state files, manual tracking

Organization Best Practices

1. Directory Structure

Small Projects

infra/
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.tfvars
└── README.md

Medium Projects

infra/
├── environments/
│   ├── dev/
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   ├── compute/
│   └── database/
└── global/
    └── iam/

Large Projects

infra/
├── live/
│   ├── prod/
│   │   ├── us-east-1/
│   │   │   ├── vpc/
│   │   │   ├── eks/
│   │   │   └── rds/
│   │   └── eu-west-1/
│   └── dev/
├── modules/
└── policies/

2. Naming Conventions

Resources

{environment}-{application}-{resource-type}-{descriptor}

Examples:
- prod-api-ec2-web
- staging-db-rds-primary
- dev-cache-elasticache-redis
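The convention above can be centralized in a locals block so every resource derives its name from the same prefix (the variable names here are illustrative):

```hcl
locals {
  # {environment}-{application}-{resource-type}-{descriptor}
  name_prefix = "${var.environment}-${var.application}"
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  tags = {
    Name = "${local.name_prefix}-ec2-web" # e.g. prod-api-ec2-web
  }
}
```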

Variables

# Use descriptive names
variable "database_instance_type" {}   # Good
variable "db_type" {}                  # Too vague

# Use plurals for lists
variable "availability_zones" {}        # Good
variable "az" {}                       # Unclear

Modules

modules/
├── vpc-standard/
├── eks-cluster/
└── rds-postgres/

3. Environment Management

Option 1: Separate Directories

environments/
├── dev/
│   └── main.tf
├── staging/
│   └── main.tf
└── prod/
    └── main.tf

Option 2: Workspaces

terraform workspace new dev
terraform workspace new staging
terraform workspace new prod

Option 3: Separate Repositories

infra-dev/
infra-staging/
infra-prod/

Recommendation: Separate directories with shared modules for most use cases.
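With separate directories, each environment becomes a thin wrapper that calls the shared modules with environment-specific values (paths and variable names here are illustrative):

```hcl
# environments/dev/main.tf
module "networking" {
  source = "../../modules/networking"

  environment    = "dev"
  cidr_block     = "10.10.0.0/16"
  instance_count = 1 # smaller footprint than prod
}
```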

Module Design

1. Module Composition

Good Module

modules/application-stack/
├── main.tf
├── variables.tf
├── outputs.tf
├── versions.tf
├── README.md
└── examples/
    └── basic/

Key Principles:

  • Single responsibility
  • Reusable across environments
  • Well-documented inputs/outputs
  • Versioned releases
  • Include examples

2. Input Validation

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "instance_count" {
  description = "Number of instances"
  type        = number

  validation {
    condition     = var.instance_count > 0 && var.instance_count <= 10
    error_message = "Instance count must be between 1 and 10."
  }
}

3. Output Documentation

output "vpc_id" {
  description = "ID of the VPC created for this environment"
  value       = aws_vpc.main.id
}

output "database_endpoint" {
  description = "RDS instance endpoint for application connection"
  value       = aws_db_instance.main.endpoint
  sensitive   = false
}

output "database_password" {
  description = "Master password for database (sensitive)"
  value       = aws_db_instance.main.password
  sensitive   = true
}

Security Best Practices

1. Secret Management

Never commit secrets

# BAD - Hardcoded password
resource "aws_db_instance" "main" {
  password = "SuperSecret123!"  # NEVER DO THIS
}

# GOOD - Reference from secrets manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

Use Environment Variables

export TF_VAR_database_password="secret"
terraform apply

Sensitive Variable Marking

variable "api_key" {
  description = "API key for external service"
  type        = string
  sensitive   = true
}

2. Least Privilege

# Minimal IAM policy for Terraform
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "ec2:Describe*",
      "ec2:CreateTags",
      "ec2:RunInstances",
      "ec2:TerminateInstances"
    ],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "aws:RequestedRegion": "us-east-1"
      }
    }
  }]
}

3. State File Security

# Encrypted state backend
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                    # Encrypt at rest
    kms_key_id     = "arn:aws:kms:..."      # Customer managed key
    dynamodb_table = "terraform-locks"       # State locking
  }
}

Testing Strategies

1. Static Analysis

Terraform Validate

terraform validate

Linting

tflint --config=.tflint.hcl

Security Scanning

# tfsec
tfsec .

# Checkov
checkov -d .

# terrascan
terrascan scan

2. Plan Review

# Generate plan
terraform plan -out=tfplan

# Review plan
terraform show tfplan

# Apply only after review
terraform apply tfplan

3. Automated Testing

Unit Tests (Terratest)

func TestVPCCreation(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "cidr_block": "10.0.0.0/16",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID)
}

Integration Tests

# Using pytest and boto3
def test_ec2_instance_running():
    ec2 = boto3.client('ec2')
    instances = ec2.describe_instances(
        Filters=[{'Name': 'tag:Environment', 'Values': ['test']}]
    )
    assert len(instances['Reservations']) > 0
    assert instances['Reservations'][0]['Instances'][0]['State']['Name'] == 'running'

4. Compliance Testing

Open Policy Agent (OPA)

# policy.rego
package terraform.analysis

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_s3_bucket"
    not resource.change.after.server_side_encryption_configuration
    msg := sprintf("S3 bucket '%s' must have encryption enabled", [resource.address])
}

Sentinel (Terraform Cloud)

import "tfplan/v2" as tfplan

main = rule {
    all tfplan.resource_changes as _, rc {
        rc.type == "aws_instance" implies
        rc.change.after.instance_type in ["t3.micro", "t3.small"]
    }
}

CI/CD Integration

1. Pipeline Stages

# GitLab CI example
stages:
  - validate
  - plan
  - apply

validate:
  stage: validate
  script:
    - terraform init -backend=false
    - terraform validate
    - terraform fmt -check
    - tfsec .

plan:
  stage: plan
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - tfplan

apply:
  stage: apply
  script:
    - terraform apply tfplan
  when: manual
  only:
    - main

2. Approval Gates

# GitHub Actions with manual approval
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Terraform Plan
        run: terraform plan -out=tfplan

  approve:
    needs: plan
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    steps:
      - run: echo "Approved"

  apply:
    needs: approve
    runs-on: ubuntu-latest
    steps:
      - name: Terraform Apply
        run: terraform apply tfplan

3. Automated Rollback

# Store previous successful state
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json

# If deployment fails, restore
terraform state push state-backup-<timestamp>.json

Documentation Standards

1. Module Documentation

# VPC Module

Creates a VPC with public and private subnets across multiple AZs.

## Usage

module "vpc" {
  source = "./modules/vpc"

  name               = "production"
  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-east-1a", "us-east-1b"]
}

Inputs

| Name | Description | Type | Default | Required |
|------------|-------------|--------|---------|----------|
| name | VPC name | string | - | yes |
| cidr_block | VPC CIDR | string | - | yes |

Outputs

| Name | Description |
|--------|----------------|
| vpc_id | VPC identifier |

2. Architecture Diagrams

Include an architecture diagram in the module README.md so reviewers can see the topology at a glance.

3. Change Documentation
# CHANGELOG.md

## [2.0.0] - 2025-01-15
### Breaking Changes
- Removed deprecated `subnet_type` variable
- Renamed `enable_nat` to `enable_nat_gateway`

### Added
- Support for IPv6
- VPC flow logs

### Fixed
- Issue with subnet CIDR calculation

State Management

1. State Locking

# DynamoDB for locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

2. State Manipulation

# List resources
terraform state list

# Show resource details
terraform state show aws_instance.web

# Move resource
terraform state mv aws_instance.old aws_instance.new

# Remove from state (doesn't delete resource)
terraform state rm aws_instance.temp

# Import existing resource
terraform import aws_instance.web i-1234567890abcdef0

3. State Backup

# Pull current state
terraform state pull > terraform.tfstate.backup

# Push state (use with caution!)
terraform state push terraform.tfstate.backup

Drift Detection

1. Regular Checks

# Detect drift
terraform plan -detailed-exitcode

# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected
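The exit-code contract above can be wrapped in a small script; this sketch (assuming terraform is on PATH) separates the classification logic so it can be reused and tested on its own:

```python
import subprocess


def classify_exit(code: int) -> str:
    """Map terraform plan -detailed-exitcode results to a status string."""
    return {0: "no-drift", 1: "error", 2: "drift"}.get(code, "error")


def detect_drift(workdir: str = ".") -> str:
    """Run terraform plan in workdir and report whether drift was detected."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit(result.returncode)
```

A CI job can call `detect_drift()` on a schedule and alert only when the result is "drift".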

2. Automated Drift Detection

# GitHub Actions
name: Drift Detection
on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Terraform Plan
        run: |
          terraform init
          terraform plan -detailed-exitcode
        continue-on-error: true
      - name: Notify on Drift
        if: failure()
        run: |
          # Send alert via Slack, email, etc.

3. Drift Reconciliation

# Option 1: Update infrastructure to match code
terraform apply

# Option 2: Update code to match infrastructure
terraform refresh   # deprecated; newer versions use: terraform apply -refresh-only
terraform plan
# Review changes and update code

Performance Optimization

1. Reduce Plan Time

# Use targeted applies when appropriate
terraform apply -target=module.vpc

# Increase parallelism
terraform apply -parallelism=20

2. Module Efficiency

# Cache data sources
locals {
  availability_zones = data.aws_availability_zones.available.names
}

# Use count/for_each efficiently
resource "aws_subnet" "private" {
  for_each = toset(local.availability_zones)

  vpc_id            = aws_vpc.main.id
  availability_zone = each.key
}

3. State File Optimization

# Separate large deployments into smaller state files
# Instead of one large state:
terraform/
├── networking/     # Own state
├── compute/        # Own state
└── databases/      # Own state

Common Anti-Patterns

1. Manual Changes

Problem: Making changes outside IaC
Solution: Always use infrastructure code

2. Shared State Without Locking

Problem: Concurrent modifications corrupt state
Solution: Always use state locking

3. Hardcoded Values

Problem: Not reusable, error-prone
Solution: Use variables and locals

4. No Module Versioning

Problem: Breaking changes affect all users
Solution: Version modules, pin versions

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"  # Pin major version
}

5. Monolithic Configurations

Problem: Difficult to manage, slow deployments
Solution: Break into logical modules

6. Inadequate Testing

Problem: Defects surface for the first time in production
Solution: Test in lower environments first

7. No Disaster Recovery Plan

Problem: Can't recover from state corruption
Solution: Regular state backups, documented recovery

Tool-Specific Best Practices

Terraform

  • Use workspaces for environments sparingly
  • Leverage modules for reusability
  • Always use remote state
  • Pin provider versions
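Provider pinning is typically done in a versions.tf; a minimal example:

```hcl
terraform {
  required_version = ">= 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # allow minor/patch updates, block breaking majors
    }
  }
}
```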

Pulumi

  • Use stack references for cross-stack dependencies
  • Leverage secrets management
  • Use component resources
  • Implement unit tests

CloudFormation

  • Use nested stacks for large deployments
  • Implement change sets for review
  • Use stack policies for protection
  • Leverage StackSets for multi-account

Ansible

  • Keep playbooks idempotent
  • Use roles for organization
  • Implement molecule testing
  • Use vaults for secrets
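As a sketch of the idempotency point: declaring `state: present` describes a goal, so re-running the play changes nothing once the package is already installed.

```yaml
# tasks/nginx.yml -- idempotent: safe to run repeatedly
- name: Ensure nginx is installed
  ansible.builtin.apt:
    name: nginx
    state: present

- name: Ensure nginx is running and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
```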

Monitoring and Observability

1. Track Deployments

# Tag resources with deployment metadata
resource "aws_instance" "web" {
  tags = {
    DeployedBy    = "Terraform"
    DeployedAt    = timestamp()
    GitCommit     = var.git_commit
    GitBranch     = var.git_branch
    Environment   = var.environment
  }
}

2. Audit Logging

# CloudTrail for AWS API calls
resource "aws_cloudtrail" "main" {
  name                          = "terraform-api-audit"
  s3_bucket_name                = aws_s3_bucket.logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
}

3. Metrics Collection

# Track IaC metrics (sketch: start_time and the resource lists
# would be populated by your deployment wrapper)
import time

start_time = time.time()
created_resources, modified_resources, deleted_resources = [], [], []

metrics = {
    "deployment_duration": time.time() - start_time,
    "resources_created": len(created_resources),
    "resources_modified": len(modified_resources),
    "resources_deleted": len(deleted_resources),
}

Conclusion

Effective Infrastructure as Code requires:

  1. Treating infrastructure as software
  2. Version control everything
  3. Automate testing and deployment
  4. Implement security best practices
  5. Document thoroughly
  6. Monitor and audit changes
  7. Plan for failure and recovery

The goal is reliable, repeatable, auditable infrastructure deployments that enable teams to move fast without breaking things.


Looking for infrastructure automation expertise? I use Terraform and IaC principles to manage cloud infrastructure—this entire site is deployed with infrastructure as code. See what I can automate for you.