Platform engineering is having a moment. After years of everyone saying “just use DevOps,” organizations are realizing that throwing developers into AWS consoles and expecting them to build production-ready infrastructure is a recipe for disaster. The solution? Build internal platforms that treat infrastructure, policies, configuration - basically everything - as code.
I’ve been deep in this world lately, and what’s fascinating is how the “as code” philosophy has expanded beyond just infrastructure. In 2025, we’re seeing Infrastructure, Policy, Configuration, Application architecture, and even Data pipelines all converging into a single programmable stack. Let me walk you through what this actually means in practice.
What You’ll Learn:
- Infrastructure as Code fundamentals with Terraform
- Policy as Code with AWS SCPs and Azure Policy
- Configuration management with Parameter Store
- Application as Code with AWS CDK and Pulumi
- Platform engineering patterns and anti-patterns
- Multi-cloud abstractions (AWS + Azure)
The Problem: DevOps Didn’t Scale
Here’s the thing about traditional DevOps: it worked great when you had 10 developers. Everyone could learn AWS, everyone understood the architecture, everyone knew where the production secrets lived. But scale that to 100 developers across 10 teams? Chaos.
The pattern plays out the same way everywhere:
- Each team builds their own infrastructure (slightly differently)
- Nobody remembers to enable encryption on that one S3 bucket
- Production and staging drift apart over time
- “It works in dev” becomes a meme because dev actually uses different infrastructure
- Security team has a meltdown when they discover someone deployed a public RDS instance
The DevOps answer was “shift left” - make developers responsible for operations. The platform engineering answer is “build golden paths” - give developers self-service tools that make it harder to do the wrong thing than the right thing.
The “As Code” Stack
What started as Infrastructure as Code (Terraform, CloudFormation, etc.) has evolved into an entire philosophy. If it can be automated, it should be codified, version-controlled, and programmatically enforced.
Here’s how the modern stack breaks down:
1. Infrastructure as Code (The Foundation)
This is table stakes in 2025. If you’re still clicking around in the AWS console to create production infrastructure, we need to talk.
I primarily use Terraform because it works across AWS and Azure (more on why Azure matters to me in a minute). Here’s what a production-ready VPC module actually looks like:
```hcl
module "production_network" {
  source      = "./modules/aws-network"
  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"

  # Multi-AZ for high availability
  enable_nat_gateway = true
  nat_gateway_count  = 3

  # These get applied automatically
  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
The beauty here is that a developer doesn’t need to understand VPC peering, NAT gateways, route tables, or subnet sizing. The platform team builds the module once, enforces best practices, and developers just consume it.
AWS vs Azure networking is an interesting comparison. AWS uses VPCs with explicit subnets and route tables. Azure uses VNets with a hub-spoke model that feels more natural for enterprise architectures. The Terraform syntax differs, but the philosophy is identical: define it once, deploy it everywhere.
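To make the comparison concrete, here's a minimal sketch of the Azure side of that same idea: a hub VNet with a workload subnet. All resource names and address ranges here are illustrative, not part of any real module:

```hcl
# Hypothetical hub VNet for a hub-spoke topology (names and CIDRs are illustrative)
resource "azurerm_resource_group" "network" {
  name     = "rg-network-prod"
  location = "westeurope"
}

resource "azurerm_virtual_network" "hub" {
  name                = "vnet-hub-prod"
  resource_group_name = azurerm_resource_group.network.name
  location            = azurerm_resource_group.network.location
  address_space       = ["10.1.0.0/16"]
}

resource "azurerm_subnet" "workloads" {
  name                 = "snet-workloads"
  resource_group_name  = azurerm_resource_group.network.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.1.1.0/24"]
}
```

Spoke VNets would peer back to this hub; the point is that the Azure mental model (resource group, VNet, subnet) maps cleanly onto the same module-and-variables pattern as the AWS example above.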
2. Policy as Code (The Guardrails)
Here’s where it gets interesting. Infrastructure as Code gives you automation, but Policy as Code gives you governance.
The classic scenario: a developer accidentally deploys a publicly accessible S3 bucket in production. It’s not malicious - they just didn’t know better. The solution isn’t more training; it’s automated enforcement.
AWS Service Control Policies (SCPs) let you prevent actions at the organization level:
```hcl
resource "aws_organizations_policy" "require_s3_encryption" {
  name = "RequireS3Encryption"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyUnencryptedS3"
      Effect   = "Deny"
      Action   = ["s3:PutObject"]
      Resource = "*"
      Condition = {
        StringNotEquals = {
          "s3:x-amz-server-side-encryption" = ["AES256", "aws:kms"]
        }
      }
    }]
  })
}
```
Now it's effectively impossible to upload an unencrypted object to S3: the AWS API rejects the request. No amount of clicking in the console or running Terraform will get around it. The policy is code: version-controlled, reviewed, and enforced automatically.
Azure has Azure Policy, which works similarly but with a different philosophy. Azure is more prescriptive about resource groups and subscriptions, so policies often target specific scopes. I'm learning that Azure's approach to governance is actually more structured than AWS's - which makes sense given its enterprise heritage.
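As a rough Terraform sketch of the Azure equivalent, here's a custom policy definition that denies storage accounts allowing plain-HTTP traffic. The name and rule are illustrative, and in practice you'd also create a policy assignment to bind it to a management group or subscription scope:

```hcl
# Deny storage accounts that allow plain-HTTP traffic (illustrative name and rule)
resource "azurerm_policy_definition" "require_https_storage" {
  name         = "require-https-storage"
  policy_type  = "Custom"
  mode         = "All"
  display_name = "Storage accounts must enforce HTTPS"

  policy_rule = jsonencode({
    if = {
      allOf = [
        { field = "type", equals = "Microsoft.Storage/storageAccounts" },
        { field = "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", equals = "false" }
      ]
    }
    then = { effect = "deny" }
  })
}
```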
3. Configuration as Code
Configuration drift is the silent killer. You deploy perfectly identical environments, then someone manually changes a setting in staging to test something, forgets to update production, and three months later you’re debugging why staging works but production doesn’t.
The solution: treat configuration the same as infrastructure.
AWS Systems Manager Parameter Store as code:
```hcl
resource "aws_ssm_parameter" "app_config" {
  for_each = {
    "database_host" = "prod-db.example.com"
    "log_level"     = "info"
    "api_timeout"   = "30"
  }

  name  = "/myapp/production/${each.key}"
  type  = "String"
  value = each.value
}
```
Now your application reads configuration from Parameter Store, and Terraform ensures production and staging have the same structure with only intentional differences (like database endpoints). Version control shows exactly when log levels changed and who approved it.
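On the consuming side, other Terraform stacks (or a deployment pipeline) can read those same parameters back instead of hard-coding values. A minimal sketch, using the path from the example above:

```hcl
# Read a parameter written by the platform stack (path matches the example above)
data "aws_ssm_parameter" "log_level" {
  name = "/myapp/production/log_level"
}

output "log_level" {
  value     = data.aws_ssm_parameter.log_level.value
  # the aws_ssm_parameter data source marks values as sensitive,
  # so any output referencing it must be sensitive too
  sensitive = true
}
```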
4. Application as Code
This is where things get meta. Instead of developers writing Terraform to describe infrastructure separately from their application, the application itself declares its infrastructure requirements.
AWS CDK lets you do this in TypeScript:
```typescript
const database = new rds.DatabaseInstance(this, 'Database', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_15_3,
  }),
  instanceType: environment === 'production'
    ? ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE)
    : ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.MEDIUM),
  multiAz: environment === 'production',
});
```
The application code says “I need a PostgreSQL database, and in production it should be multi-AZ with this instance size.” The platform provisions it automatically with appropriate sizing for the environment.
I’m particularly interested in Pulumi for this because it supports real programming languages and works across AWS and Azure. Writing infrastructure in TypeScript or Python feels more natural than learning yet another DSL.
5. Data as Code
Data pipelines have historically been the wild west. Someone builds an ETL job, it runs on a cron somewhere, and when it breaks six months later nobody remembers how it works.
Treating data pipelines as code means your ETL logic is version-controlled, tested, and deployed through CI/CD just like application code.
AWS Glue jobs defined in Terraform:
```hcl
resource "aws_glue_job" "data_transformation" {
  name     = "data-transformation"
  role_arn = aws_iam_role.glue_job.arn

  command {
    script_location = "s3://${aws_s3_bucket.scripts.id}/transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--SOURCE_BUCKET" = aws_s3_bucket.data_lake.id
    "--TARGET_BUCKET" = aws_s3_bucket.data_lake.id
  }
}
```
The transformation script is in Git, the job definition is in Terraform, and changes go through the same review process as application code. Azure Data Factory has similar patterns with linked services and pipelines defined as code.
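For the Azure side, a rough Terraform sketch of a Data Factory plus a pipeline looks like this. All names are illustrative, and the pipeline activity is a placeholder (a real Copy activity needs linked services and datasets):

```hcl
# Illustrative names; a real pipeline needs linked services and datasets
resource "azurerm_resource_group" "data" {
  name     = "rg-data-prod"
  location = "westeurope"
}

resource "azurerm_data_factory" "etl" {
  name                = "df-etl-prod"
  location            = azurerm_resource_group.data.location
  resource_group_name = azurerm_resource_group.data.name
}

resource "azurerm_data_factory_pipeline" "transform" {
  name            = "transform-pipeline"
  data_factory_id = azurerm_data_factory.etl.id

  # Activities are defined as JSON and live in Git like everything else
  activities_json = jsonencode([{
    name           = "PlaceholderWait"
    type           = "Wait"
    typeProperties = { waitTimeInSeconds = 1 }
  }])
}
```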
6. Architecture as Code
The most ambitious level is treating entire system architectures as code. Instead of architecture documents that drift out of date, the architecture IS the code.
I’ve been experimenting with AWS CDK to define complete application stacks - VPC, load balancer, auto-scaling, databases, caching, monitoring - all declared in code with explicit architectural decisions:
```typescript
// Multi-AZ, auto-scaling, encrypted, backed up
const webApp = new WellArchitectedWebApp(this, 'ProductionApp', {
  multiAz: true,
  autoScaling: { min: 3, max: 10 },
  encryption: 'required',
  backupRetention: 30,
});
```
The code enforces the Well-Architected Framework. It’s impossible to deploy without encryption, or without backups, or without multi-AZ. The architecture is the code, and the code is the architecture.
The Platform Engineering Pattern
Here’s how this comes together in practice. As a platform team, you build reusable modules that encode best practices. Developers consume those modules through self-service.
What the platform team builds:
```hcl
# modules/web-application/main.tf
variable "app_name" {}
variable "environment" {}
variable "enable_database" { default = false }

module "vpc" {
  source = "../networking/vpc-standard"
  # Handles all the complex VPC stuff
}

module "application" {
  source = "../compute/ecs-fargate"
  # Handles containers, load balancers, auto-scaling
}

module "database" {
  count  = var.enable_database ? 1 : 0
  source = "../data/rds-postgres"
  # Handles RDS with backups, encryption, multi-AZ
}
```
What developers use:
```hcl
module "my_app" {
  source          = "git::https://github.com/platform/modules//web-application"
  app_name        = "customer-api"
  environment     = "production"
  enable_database = true
}
```
That’s it. The developer gets a production-ready, compliant, multi-AZ application stack without needing to understand any of the underlying complexity. The platform team maintains the modules, enforces policies, and ensures consistency.
Why Azure Matters Too
You might notice I keep mentioning Azure alongside AWS. Here’s why: in 2025, “cloud” means “multi-cloud” whether you like it or not. Acquisitions bring Azure subscriptions, partners run on GCP, regulatory requirements force specific clouds.
More importantly, Azure’s approach to platform engineering is actually more mature in some ways than AWS. The hub-spoke VNet model, Azure Policy’s scope hierarchy, resource groups as first-class organizational boundaries - Microsoft built this for enterprises from day one.
The core platform engineering patterns - reusable modules, policy enforcement, self-service infrastructure - work the same way across clouds. Learning both AWS and Azure means understanding which abstractions are cloud-specific and which are universal principles.
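One way to sketch that separation in Terraform is a thin wrapper module that exposes a single interface and switches implementations per cloud. The module paths here are hypothetical:

```hcl
variable "cloud" {
  type = string

  # Guardrail: only clouds we actually have implementations for
  validation {
    condition     = contains(["aws", "azure"], var.cloud)
    error_message = "cloud must be \"aws\" or \"azure\"."
  }
}

# Hypothetical per-cloud implementations behind one shared interface
module "network_aws" {
  count  = var.cloud == "aws" ? 1 : 0
  source = "./aws-network"
}

module "network_azure" {
  count  = var.cloud == "azure" ? 1 : 0
  source = "./azure-network"
}
```

The cloud-specific knowledge lives inside each implementation module; consumers only see the universal interface.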
Why This Actually Matters
The real benefit of platform engineering isn’t just “faster deployments” - it’s predictability and consistency.
When you manually provision infrastructure, you spend hours clicking through consoles, inevitably miss something (was encryption enabled on that S3 bucket?), and three months later production and staging have mysteriously diverged. Nobody knows why staging works but production doesn’t.
With everything as code:
- Deployments become boring: The same Terraform that works in dev works in staging works in prod. No surprises.
- Rollbacks are trivial: `git revert` and re-deploy. You're back to the last known-good state in minutes.
- Drift is impossible: Infrastructure matches the code. If they diverge, Terraform tells you.
- Policy violations can’t happen: SCPs and policy-as-code make it literally impossible to deploy non-compliant infrastructure.
The time savings are real - automated deployments beat manual provisioning by hours - but the bigger win is sleeping at night knowing production is actually compliant and consistent.
The Challenges Nobody Talks About
Platform engineering sounds great until you hit the real-world challenges:
Developer adoption: Developers don’t automatically use your platform just because you built it. You need product management, documentation, training, and evangelism. Treat your platform as a product with your developers as customers.
Over-engineering: The temptation is to build the perfect, enterprise-grade platform from day one. Don’t. Start with the highest-impact, highest-frequency use cases and iterate. Teams routinely spend six months building platforms nobody uses because they built the wrong thing.
Tool sprawl: It’s easy to end up with Terraform for IaC, Sentinel for policies, OPA for Kubernetes, AWS Config for compliance, and five other tools. Pick a stack and standardize. The cognitive load of context-switching between tools is real.
Opportunity cost: Converting legacy infrastructure to code takes time. Not everything needs to be codified. Focus on new projects and high-churn infrastructure first. That stable EC2 instance that hasn’t changed in two years? Leave it alone.
The AI Factor
Here's something wild: Google reports that more than 25% of its new code is now AI-generated. Platform engineering is getting the same boost through tools like GitHub Copilot.
Instead of remembering Terraform syntax for every AWS resource, I write:
```hcl
# Create an S3 bucket with versioning, encryption, and lifecycle policy
```
And Copilot generates the complete, correct Terraform code. For standard patterns, it’s like having a junior engineer pair-programming with you who never forgets syntax.
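For that prompt, the generated code typically looks something like the sketch below (the bucket name is illustrative). Note that since v4 of the AWS provider, versioning, encryption, and lifecycle rules live in separate resources rather than inline on the bucket:

```hcl
resource "aws_s3_bucket" "data" {
  bucket = "example-data-bucket" # illustrative name
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    id     = "expire-old-objects"
    status = "Enabled"
    filter {} # apply to all objects
    expiration {
      days = 90
    }
  }
}
```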
But - and this is critical - AI generates code, it doesn’t understand your architecture or requirements. You still need the expertise to validate what it produces. AI-assisted is great. AI-generated alone is dangerous.
Where to Start
If you’re building a platform team in 2025:
Start with IaC: Get your core infrastructure into Terraform. VPCs, databases, load balancers. Version control everything.
Layer in policy enforcement: Add Sentinel or OPA to prevent common mistakes. Require tags, enforce encryption, restrict instance types.
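Sentinel and OPA each have their own policy language, but the same guardrail idea can be sketched natively in Terraform with variable validation, which fails at plan time before anything is deployed. The approved instance types and required tags here are illustrative:

```hcl
variable "instance_type" {
  type    = string
  default = "t3.medium"

  # Guardrail: reject anything outside the approved list (illustrative list)
  validation {
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type must be one of the approved burstable types."
  }
}

variable "tags" {
  type = map(string)

  # Guardrail: required tags must always be present
  validation {
    condition     = alltrue([for t in ["Environment", "Owner"] : contains(keys(var.tags), t)])
    error_message = "Tags must include Environment and Owner."
  }
}
```

Native validation covers the simple cases; Sentinel or OPA become worth the extra tooling when policies need to span modules or be enforced centrally.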
Build reusable modules: Create 3-5 golden path modules for your most common patterns (web app, API service, background worker, etc.).
Self-service portal: Use Backstage or similar to let developers provision infrastructure without opening tickets.
Measure adoption: Track how many teams use your platform vs. manual deployments. If adoption is low, talk to developers and fix the friction points.
Iterate based on feedback: Your platform is a product. Gather feedback, prioritize features, ship improvements.
The Future
The trend is clear: everything that can be code, will be code. We’re moving toward a world where:
- Developers declare what they need, platforms provision it automatically
- Policies enforce compliance and security at every layer
- Architecture decisions are codified and enforced
- Multi-cloud is the default, not the exception
- AI assists with implementation but humans validate correctness
Platform engineering isn’t about replacing DevOps or SRE. It’s about taking the lessons from DevOps - automation, infrastructure as code, CI/CD - and building them into reusable platforms that scale beyond 10 people.
If you’re building platforms in 2025, the “everything as code” approach isn’t optional. It’s how you keep control as complexity grows, how you enable developers without slowing them down, and how you sleep at night knowing production is actually compliant.
Key Takeaways
If you only remember 5 things from this guide:
- Platform engineering ≠ DevOps - Build golden paths, not free-for-all infrastructure access
- Everything that can be code, should be code - IaC, Policy, Config, Application, Data, Architecture
- Start with IaC, layer in policy enforcement - Automate first, add guardrails second
- Multi-cloud is inevitable - Learn patterns that work across AWS and Azure
- Your platform is a product - Treat developers as customers, measure adoption, iterate
Expected time investment:
- Basic IaC implementation: 2-4 weeks
- Policy enforcement layer: 1-2 weeks
- Reusable module library: 4-6 weeks
- Self-service portal: 6-8 weeks
ROI markers:
- 70%+ reduction in infrastructure provisioning time
- 90%+ reduction in compliance violations
- 50%+ reduction in environment drift incidents
Want to Go Deeper?
If you’re interested in hands-on platform engineering projects:
- Azure Landing Zone: Implement the Cloud Adoption Framework with management groups, policies, and hub-spoke networking
- Kubernetes Platform: Build an AKS or EKS cluster with GitOps, service mesh, and policy enforcement
- Multi-Cloud Abstraction: Create Terraform modules that work across AWS and Azure
- Infrastructure Testing: Use Terratest or similar to add automated testing to your IaC
- Cost Optimization: Build policies that automatically right-size resources based on usage
The tools are mature, the patterns are proven, and the demand for platform engineering skills is only growing. The question isn’t whether to adopt “everything as code” - it’s how fast you can implement it before your competitors do.
What platform engineering challenges are you facing? I’m always interested in hearing about real-world implementations and the friction points teams hit. The theory is easy; the practice is where it gets interesting.