Case Study: IAM Misuse in Production

A platform teardown of IAM misuse, broken trust, and a late-night prod scare.

We had alerts. We had CI/CD. We had everything, except a clue about who triggered what.

It started with a broken deployment.
The app failed, but no one knew why.
No commits. No builds. No audit trail.

And then we realized:
Someone—or something—had run terraform apply in prod.
With outdated state.
From a machine we didn’t control.

The setup looked fine on paper:

  • AWS + GitHub Actions

  • Fine-grained IAM roles

  • Terraform Cloud backend

  • A few human users with AdministratorAccess

  • VPN-only access to *.infra.company.net

But in practice?

sts:GetCallerIdentity returned 3 different roles for the same user in the same minute.

The Terraform Cloud workspace token had never been rotated.

A contractor’s laptop still had a live profile in .aws/config.
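
One guard we wish we’d had from day one: make Terraform refuse to run unless it is executing as the role you expect. A minimal sketch, assuming Terraform 1.4+ and the AWS provider; the ci-deployer role name is a placeholder:

    data "aws_caller_identity" "current" {}

    # Fail plan/apply unless Terraform is running as the expected CI role.
    # "ci-deployer" is a placeholder for your real deploy role name.
    resource "terraform_data" "caller_guard" {
      lifecycle {
        precondition {
          condition     = can(regex("assumed-role/ci-deployer", data.aws_caller_identity.current.arn))
          error_message = "Refusing to run: this is not the CI deploy role."
        }
      }
    }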

The blast radius:

  • IAM role trust policies didn’t enforce aws:SourceIp or aws:SourceVpc (a sketch of the tightened policy follows this list)

  • A Lambda for sandbox cleanup was invoked with a prod IAM role

  • An old SSH key on a shared bastion was still valid

  • One engineer’s local kubeconfig pointed to production
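
The core of the fix was a Condition block the trust policies never had. A sketch in Terraform; the account ID, CIDR, and external ID are placeholders, and aws:SourceVpc only evaluates when STS traffic goes through a VPC endpoint, so it stays a comment here:

    resource "aws_iam_role" "prod_deploy" {
      name = "prod-deploy"

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRole"
          Principal = { AWS = "arn:aws:iam::123456789012:root" }
          Condition = {
            # Pin the caller's network location.
            IpAddress = { "aws:SourceIp" = ["10.20.0.0/16"] }
            # Require a per-system external ID.
            StringEquals = { "sts:ExternalId" = "deploy-system-a" }
            # aws:UserAgent is spoofable; treat it as a tripwire, not a boundary.
            StringLike = { "aws:UserAgent" = ["*terraform*"] }
            # Add "aws:SourceVpc" here once STS traffic uses a VPC endpoint.
          }
        }]
      })
    }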

How we fixed it:

  • IAM roles now require an ExternalId tied to each calling system

  • AssumeRole restricted by condition: aws:SourceIp + aws:UserAgent

  • Added a terraform plan signer driven by OIDC claims from GitHub Actions (sketched below)

  • All human access gated via AWS IAM Identity Center with a session TTL (sketched below)

  • Created an internal whoami tool for real-time credential tracing
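
The GitHub Actions piece is worth spelling out, because OIDC federation is what let us delete the long-lived tokens. A sketch using the terraform-provider-aws resources below; the repo path is a placeholder, and the thumbprint is the one GitHub published at the time, so verify it against current docs:

    # Trust GitHub's OIDC issuer once per account.
    resource "aws_iam_openid_connect_provider" "github" {
      url             = "https://token.actions.githubusercontent.com"
      client_id_list  = ["sts.amazonaws.com"]
      thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
    }

    # A CI role only assumable by workflows on main in one repo.
    # "company/infra" is a placeholder for your org/repo.
    resource "aws_iam_role" "ci_deployer" {
      name = "ci-deployer"

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRoleWithWebIdentity"
          Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
          Condition = {
            StringEquals = {
              "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
            }
            StringLike = {
              "token.actions.githubusercontent.com:sub" = "repo:company/infra:ref:refs/heads/main"
            }
          }
        }]
      })
    }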
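
The human side is smaller than it sounds: in Identity Center, the session TTL is just a field on the permission set. A sketch, assuming an existing Identity Center instance; the four-hour duration is our choice, not a default:

    data "aws_ssoadmin_instances" "this" {}

    # Permission set whose sessions expire after four hours (ISO 8601 duration).
    resource "aws_ssoadmin_permission_set" "engineer" {
      name             = "engineer"
      instance_arn     = tolist(data.aws_ssoadmin_instances.this.arns)[0]
      session_duration = "PT4H"
    }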

Lessons burned in:

  • Never trust a "secure by default" setup — audit your defaults

  • Rotate everything: keys, tokens, sessions

  • IAM is not least privilege until it fails in production

  • Contextless access is access you can’t attribute

Want the Terraform policy.tf diff?
It’s not pretty. But it’s in version control now.
Because next time, it won’t just be IAM.

Auth fails loud. Access fails silent.
Better make the silent parts traceable.
