Platform engineering is having a moment. After years of everyone saying “just use DevOps,” organizations are realizing that throwing developers into AWS consoles and expecting them to build production-ready infrastructure is a recipe for disaster. The solution? Build internal platforms that treat infrastructure, policies, configuration - basically everything - as code.
I’ve been deep in this world lately, and what’s fascinating is how the “as code” philosophy has expanded beyond just infrastructure. In 2025, we’re seeing Infrastructure, Policy, Configuration, Application architecture, and even Data pipelines all converging into a single programmable stack. Let me walk you through what this actually means in practice.
What You’ll Learn:
- Infrastructure as Code fundamentals with Terraform
- Policy as Code with AWS SCPs and Azure Policy
- Configuration management with Parameter Store
- Application as Code with AWS CDK and Pulumi
- Platform engineering patterns and anti-patterns
- Multi-cloud abstractions (AWS + Azure)
The Problem: DevOps Didn’t Scale
Here’s the thing about traditional DevOps: it worked great when you had 10 developers. Everyone could learn AWS, everyone understood the architecture, everyone knew where the production secrets lived. But scale that to 100 developers across 10 teams? Chaos.
The pattern plays out the same way everywhere:
- Each team builds their own infrastructure (slightly differently)
- Nobody remembers to enable encryption on that one S3 bucket
- Production and staging drift apart over time
- “It works in dev” becomes a meme because dev actually uses different infrastructure
- Security team has a meltdown when they discover someone deployed a public RDS instance
The DevOps answer was “shift left” - make developers responsible for operations. The platform engineering answer is “build golden paths” - give developers self-service tools that make it harder to do the wrong thing than the right thing.
The “As Code” Stack
What started as Infrastructure as Code (Terraform, CloudFormation, etc.) has evolved into an entire philosophy. If it can be automated, it should be codified, version-controlled, and programmatically enforced.
Here’s how the modern stack breaks down:
1. Infrastructure as Code (The Foundation)
This is table stakes in 2025. If you’re still clicking around in the AWS console to create production infrastructure, we need to talk.
I primarily use Terraform because it works across AWS and Azure (more on why Azure matters to me in a minute). Here’s what a production-ready VPC module actually looks like:
```hcl
module "production_network" {
  source      = "./modules/aws-network"
  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"

  # Multi-AZ for high availability
  enable_nat_gateway = true
  nat_gateway_count  = 3

  # These get applied automatically
  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```
The beauty here is that a developer doesn’t need to understand VPC peering, NAT gateways, route tables, or subnet sizing. The platform team builds the module once, enforces best practices, and developers just consume it.
AWS vs Azure networking is an interesting comparison. AWS uses VPCs with explicit subnets and route tables. Azure uses VNets with a hub-spoke model that feels more natural for enterprise architectures. The Terraform syntax differs, but the philosophy is identical: define it once, deploy it everywhere.
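To make the comparison concrete, here's a minimal sketch of the Azure side of that same idea: a hub VNet with a workload subnet. All resource names and address ranges here are illustrative, not part of any real module:

```hcl
# Hypothetical hub VNet for a hub-spoke topology (names and CIDRs are illustrative)
resource "azurerm_resource_group" "network" {
  name     = "rg-network-prod"
  location = "westeurope"
}

resource "azurerm_virtual_network" "hub" {
  name                = "vnet-hub-prod"
  resource_group_name = azurerm_resource_group.network.name
  location            = azurerm_resource_group.network.location
  address_space       = ["10.1.0.0/16"]
}

resource "azurerm_subnet" "workloads" {
  name                 = "snet-workloads"
  resource_group_name  = azurerm_resource_group.network.name
  virtual_network_name = azurerm_virtual_network.hub.name
  address_prefixes     = ["10.1.1.0/24"]
}
```

Spoke VNets would peer back to this hub; the point is that the Azure mental model (resource group, VNet, subnet) maps cleanly onto the same module-and-variables pattern as the AWS example above.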
2. Policy as Code (The Guardrails)
Here’s where it gets interesting. Infrastructure as Code gives you automation, but Policy as Code gives you governance.
The classic scenario: a developer accidentally deploys a publicly accessible S3 bucket in production. It’s not malicious - they just didn’t know better. The solution isn’t more training; it’s automated enforcement.
AWS Service Control Policies (SCPs) let you prevent actions at the organization level:
```hcl
resource "aws_organizations_policy" "require_s3_encryption" {
  name = "RequireS3Encryption"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyUnencryptedS3"
      Effect   = "Deny"
      Action   = ["s3:PutObject"]
      Resource = "*"
      Condition = {
        StringNotEquals = {
          "s3:x-amz-server-side-encryption" = ["AES256", "aws:kms"]
        }
      }
    }]
  })
}
```
Now it's effectively impossible to upload an unencrypted object to S3: the AWS API rejects the request. No amount of clicking in the console or running Terraform will get around it. The policy is code: version-controlled, reviewed, and enforced automatically.
Azure has Azure Policy, which works similarly but with a different philosophy. Azure is more prescriptive about resource groups and subscriptions, so policies often target specific scopes. I'm learning that Azure's approach to governance is actually more structured than AWS's - which makes sense given its enterprise heritage.
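As a rough Terraform sketch of the Azure equivalent, here's a custom policy definition that denies storage accounts allowing plain-HTTP traffic. The name and rule are illustrative, and in practice you'd also create a policy assignment to bind it to a management group or subscription scope:

```hcl
# Deny storage accounts that allow plain-HTTP traffic (illustrative name and rule)
resource "azurerm_policy_definition" "require_https_storage" {
  name         = "require-https-storage"
  policy_type  = "Custom"
  mode         = "All"
  display_name = "Storage accounts must enforce HTTPS"

  policy_rule = jsonencode({
    if = {
      allOf = [
        { field = "type", equals = "Microsoft.Storage/storageAccounts" },
        { field = "Microsoft.Storage/storageAccounts/supportsHttpsTrafficOnly", equals = "false" }
      ]
    }
    then = { effect = "deny" }
  })
}
```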
3. Configuration as Code
Configuration drift is the silent killer. You deploy perfectly identical environments, then someone manually changes a setting in staging to test something, forgets to update production, and three months later you’re debugging why staging works but production doesn’t.
The solution: treat configuration the same as infrastructure.
AWS Systems Manager Parameter Store as code:
```hcl
resource "aws_ssm_parameter" "app_config" {
  for_each = {
    "database_host" = "prod-db.example.com"
    "log_level"     = "info"
    "api_timeout"   = "30"
  }

  name  = "/myapp/production/${each.key}"
  type  = "String"
  value = each.value
}
```
Now your application reads configuration from Parameter Store, and Terraform ensures production and staging have the same structure with only intentional differences (like database endpoints). Version control shows exactly when log levels changed and who approved it.
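On the consuming side, other Terraform stacks (or a deployment pipeline) can read those same parameters back instead of hard-coding values. A minimal sketch, using the path from the example above:

```hcl
# Read a parameter written by the platform stack (path matches the example above)
data "aws_ssm_parameter" "log_level" {
  name = "/myapp/production/log_level"
}

output "log_level" {
  value     = data.aws_ssm_parameter.log_level.value
  # the aws_ssm_parameter data source marks values as sensitive,
  # so any output referencing it must be sensitive too
  sensitive = true
}
```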
4. Application as Code
This is where things get meta. Instead of developers writing Terraform to describe infrastructure separately from their application, the application itself declares its infrastructure requirements.
AWS CDK lets you do this in TypeScript:
```typescript
const database = new rds.DatabaseInstance(this, 'Database', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_15_3,
  }),
  instanceType: environment === 'production'
    ? ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE)
    : ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.MEDIUM),
  multiAz: environment === 'production',
});
```
The application code says “I need a PostgreSQL database, and in production it should be multi-AZ with this instance size.” The platform provisions it automatically with appropriate sizing for the environment.
I’m particularly interested in Pulumi for this because it supports real programming languages and works across AWS and Azure. Writing infrastructure in TypeScript or Python feels more natural than learning yet another DSL.
5. Data as Code
Data pipelines have historically been the wild west. Someone builds an ETL job, it runs on a cron somewhere, and when it breaks six months later nobody remembers how it works.
Treating data pipelines as code means your ETL logic is version-controlled, tested, and deployed through CI/CD just like application code.
AWS Glue jobs defined in Terraform:
```hcl
resource "aws_glue_job" "data_transformation" {
  name     = "data-transformation"
  role_arn = aws_iam_role.glue_job.arn

  command {
    script_location = "s3://${aws_s3_bucket.scripts.id}/transform.py"
    python_version  = "3"
  }

  default_arguments = {
    "--SOURCE_BUCKET" = aws_s3_bucket.data_lake.id
    "--TARGET_BUCKET" = aws_s3_bucket.data_lake.id
  }
}
```
The transformation script is in Git, the job definition is in Terraform, and changes go through the same review process as application code. Azure Data Factory has similar patterns with linked services and pipelines defined as code.
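For the Azure side, a rough Terraform sketch of a Data Factory plus a pipeline looks like this. All names are illustrative, and the pipeline activity is a placeholder (a real Copy activity needs linked services and datasets):

```hcl
# Illustrative names; a real pipeline needs linked services and datasets
resource "azurerm_resource_group" "data" {
  name     = "rg-data-prod"
  location = "westeurope"
}

resource "azurerm_data_factory" "etl" {
  name                = "df-etl-prod"
  location            = azurerm_resource_group.data.location
  resource_group_name = azurerm_resource_group.data.name
}

resource "azurerm_data_factory_pipeline" "transform" {
  name            = "transform-pipeline"
  data_factory_id = azurerm_data_factory.etl.id

  # Activities are defined as JSON and live in Git like everything else
  activities_json = jsonencode([{
    name           = "PlaceholderWait"
    type           = "Wait"
    typeProperties = { waitTimeInSeconds = 1 }
  }])
}
```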
6. Architecture as Code
The most ambitious level is treating entire system architectures as code. Instead of architecture documents that drift out of date, the architecture IS the code.
I’ve been experimenting with AWS CDK to define complete application stacks - VPC, load balancer, auto-scaling, databases, caching, monitoring - all declared in code with explicit architectural decisions:
```typescript
// Multi-AZ, auto-scaling, encrypted, backed up
const webApp = new WellArchitectedWebApp(this, 'ProductionApp', {
  multiAz: true,
  autoScaling: { min: 3, max: 10 },
  encryption: 'required',
  backupRetention: 30,
});
```
The code enforces the Well-Architected Framework. It’s impossible to deploy without encryption, or without backups, or without multi-AZ. The architecture is the code, and the code is the architecture.
The Platform Engineering Pattern
Here’s how this comes together in practice. As a platform team, you build reusable modules that encode best practices. Developers consume those modules through self-service.
What the platform team builds:
```hcl
# modules/web-application/main.tf
variable "app_name" {}
variable "environment" {}
variable "enable_database" { default = false }

module "vpc" {
  source = "../networking/vpc-standard"
  # Handles all the complex VPC stuff
}

module "application" {
  source = "../compute/ecs-fargate"
  # Handles containers, load balancers, auto-scaling
}

module "database" {
  count  = var.enable_database ? 1 : 0
  source = "../data/rds-postgres"
  # Handles RDS with backups, encryption, multi-AZ
}
```
What developers use:
```hcl
module "my_app" {
  source          = "git::https://github.com/platform/modules//web-application"
  app_name        = "customer-api"
  environment     = "production"
  enable_database = true
}
```
That’s it. The developer gets a production-ready, compliant, multi-AZ application stack without needing to understand any of the underlying complexity. The platform team maintains the modules, enforces policies, and ensures consistency.
Why Azure Matters Too
You might notice I keep mentioning Azure alongside AWS. Here’s why: in 2025, “cloud” means “multi-cloud” whether you like it or not. Acquisitions bring Azure subscriptions, partners run on GCP, regulatory requirements force specific clouds.
More importantly, Azure’s approach to platform engineering is actually more mature in some ways than AWS. The hub-spoke VNet model, Azure Policy’s scope hierarchy, resource groups as first-class organizational boundaries - Microsoft built this for enterprises from day one.
The core platform engineering patterns - reusable modules, policy enforcement, self-service infrastructure - work the same way across clouds. Learning both AWS and Azure means understanding which abstractions are cloud-specific and which are universal principles.
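One way to sketch that separation in Terraform is a thin wrapper module that exposes a single interface and switches implementations per cloud. The module paths here are hypothetical:

```hcl
variable "cloud" {
  type = string

  # Guardrail: only clouds we actually have implementations for
  validation {
    condition     = contains(["aws", "azure"], var.cloud)
    error_message = "cloud must be \"aws\" or \"azure\"."
  }
}

# Hypothetical per-cloud implementations behind one shared interface
module "network_aws" {
  count  = var.cloud == "aws" ? 1 : 0
  source = "./aws-network"
}

module "network_azure" {
  count  = var.cloud == "azure" ? 1 : 0
  source = "./azure-network"
}
```

The cloud-specific knowledge lives inside each implementation module; consumers only see the universal interface.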
Why This Actually Matters
The real benefit of platform engineering isn’t just “faster deployments” - it’s predictability and consistency.
When you manually provision infrastructure, you spend hours clicking through consoles, inevitably miss something (was encryption enabled on that S3 bucket?), and three months later production and staging have mysteriously diverged. Nobody knows why staging works but production doesn’t.
With everything as code:
- Deployments become boring: The same Terraform that works in dev works in staging works in prod. No surprises.
- Rollbacks are trivial: `git revert` and re-deploy. You're back to the last known-good state in minutes.
- Drift is impossible: Infrastructure matches the code. If they diverge, Terraform tells you.
- Policy violations can’t happen: SCPs and policy-as-code make it literally impossible to deploy non-compliant infrastructure.
The time savings are real - automated deployments beat manual provisioning by hours - but the bigger win is sleeping at night knowing production is actually compliant and consistent.
The Challenges Nobody Talks About
Platform engineering sounds great until you hit the real-world challenges:
Developer adoption: Developers don’t automatically use your platform just because you built it. You need product management, documentation, training, and evangelism. Treat your platform as a product with your developers as customers.
Over-engineering: The temptation is to build the perfect, enterprise-grade platform from day one. Don’t. Start with the highest-impact, highest-frequency use cases and iterate. Teams routinely spend six months building platforms nobody uses because they built the wrong thing.
Tool sprawl: It’s easy to end up with Terraform for IaC, Sentinel for policies, OPA for Kubernetes, AWS Config for compliance, and five other tools. Pick a stack and standardize. The cognitive load of context-switching between tools is real.
Opportunity cost: Converting legacy infrastructure to code takes time. Not everything needs to be codified. Focus on new projects and high-churn infrastructure first. That stable EC2 instance that hasn’t changed in two years? Leave it alone.
The AI Factor
Here's something wild: Google reports that more than 25% of its new code is now AI-generated. Platform engineering is getting the same boost through tools like GitHub Copilot.
Instead of remembering Terraform syntax for every AWS resource, I write:
```hcl
# Create an S3 bucket with versioning, encryption, and lifecycle policy
```
And Copilot generates the complete, correct Terraform code. For standard patterns, it’s like having a junior engineer pair-programming with you who never forgets syntax.
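For that prompt, the generated code typically looks something like the sketch below (the bucket name is illustrative). Note that since v4 of the AWS provider, versioning, encryption, and lifecycle rules live in separate resources rather than inline on the bucket:

```hcl
resource "aws_s3_bucket" "data" {
  bucket = "example-data-bucket" # illustrative name
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "data" {
  bucket = aws_s3_bucket.data.id
  rule {
    id     = "expire-old-objects"
    status = "Enabled"
    filter {} # apply to all objects
    expiration {
      days = 90
    }
  }
}
```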
But - and this is critical - AI generates code, it doesn’t understand your architecture or requirements. You still need the expertise to validate what it produces. AI-assisted is great. AI-generated alone is dangerous.
Where to Start
If you’re building a platform team in 2025:
Start with IaC: Get your core infrastructure into Terraform. VPCs, databases, load balancers. Version control everything.
Layer in policy enforcement: Add Sentinel or OPA to prevent common mistakes. Require tags, enforce encryption, restrict instance types.
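Sentinel and OPA each have their own policy language, but the same guardrail idea can be sketched natively in Terraform with variable validation, which fails at plan time before anything is deployed. The approved instance types and required tags here are illustrative:

```hcl
variable "instance_type" {
  type    = string
  default = "t3.medium"

  # Guardrail: reject anything outside the approved list (illustrative list)
  validation {
    condition     = contains(["t3.micro", "t3.small", "t3.medium"], var.instance_type)
    error_message = "Instance type must be one of the approved burstable types."
  }
}

variable "tags" {
  type = map(string)

  # Guardrail: required tags must always be present
  validation {
    condition     = alltrue([for t in ["Environment", "Owner"] : contains(keys(var.tags), t)])
    error_message = "Tags must include Environment and Owner."
  }
}
```

Native validation covers the simple cases; Sentinel or OPA become worth the extra tooling when policies need to span modules or be enforced centrally.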
Build reusable modules: Create 3-5 golden path modules for your most common patterns (web app, API service, background worker, etc.).
Self-service portal: Use Backstage or similar to let developers provision infrastructure without opening tickets.
Measure adoption: Track how many teams use your platform vs. manual deployments. If adoption is low, talk to developers and fix the friction points.
Iterate based on feedback: Your platform is a product. Gather feedback, prioritize features, ship improvements.
The Future
The trend is clear: everything that can be code, will be code. We’re moving toward a world where:
- Developers declare what they need, platforms provision it automatically
- Policies enforce compliance and security at every layer
- Architecture decisions are codified and enforced
- Multi-cloud is the default, not the exception
- AI assists with implementation but humans validate correctness
Platform engineering isn’t about replacing DevOps or SRE. It’s about taking the lessons from DevOps - automation, infrastructure as code, CI/CD - and building them into reusable platforms that scale beyond 10 people.
If you’re building platforms in 2025, the “everything as code” approach isn’t optional. It’s how you keep control as complexity grows, how you enable developers without slowing them down, and how you sleep at night knowing production is actually compliant.
Key Takeaways
If you only remember 5 things from this guide:
- Platform engineering ≠ DevOps - Build golden paths, not free-for-all infrastructure access
- Everything that can be code, should be code - IaC, Policy, Config, Application, Data, Architecture
- Start with IaC, layer in policy enforcement - Automate first, add guardrails second
- Multi-cloud is inevitable - Learn patterns that work across AWS and Azure
- Your platform is a product - Treat developers as customers, measure adoption, iterate
Expected time investment:
- Basic IaC implementation: 2-4 weeks
- Policy enforcement layer: 1-2 weeks
- Reusable module library: 4-6 weeks
- Self-service portal: 6-8 weeks
ROI markers:
- 70%+ reduction in infrastructure provisioning time
- 90%+ reduction in compliance violations
- 50%+ reduction in environment drift incidents
Want to Go Deeper?
If you’re interested in hands-on platform engineering projects:
- Azure Landing Zone: Implement the Cloud Adoption Framework with management groups, policies, and hub-spoke networking
- Kubernetes Platform: Build an AKS or EKS cluster with GitOps, service mesh, and policy enforcement
- Multi-Cloud Abstraction: Create Terraform modules that work across AWS and Azure
- Infrastructure Testing: Use Terratest or similar to add automated testing to your IaC
- Cost Optimization: Build policies that automatically right-size resources based on usage
The tools are mature, the patterns are proven, and the demand for platform engineering skills is only growing. The question isn’t whether to adopt “everything as code” - it’s how fast you can implement it before your competitors do.
What platform engineering challenges are you facing? I’m always interested in hearing about real-world implementations and the friction points teams hit. The theory is easy; the practice is where it gets interesting.