Case Study: IAM Misuse in Production

A platform teardown of IAM misuse, broken trust, and a late-night prod scare.

We had alerts. We had CI/CD. We had everything, except a clue about who triggered what.

It started with a broken deployment.
The app failed, but no one knew why.
No commits. No builds. No audit trail.

And then we realized:
Someone—or something—had run terraform apply in prod.
With outdated state.
From a machine we didn’t control.

The setup looked fine on paper:

  • AWS + GitHub Actions

  • Fine-grained IAM roles

  • Terraform Cloud backend

  • A few human users with AdministratorAccess

  • VPN-only access to *.infra.company.net

But in practice?

sts:GetCallerIdentity returned 3 different roles for the same user in the same minute.

The Terraform Cloud workspace token had never been rotated.

A contractor’s laptop still had a live profile in .aws/config.
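
One guard we wish we’d had from day one: make Terraform refuse to run unless it is executing as the role you expect. A minimal sketch, assuming Terraform 1.4+ and the AWS provider; the ci-deployer role name is a placeholder:

    data "aws_caller_identity" "current" {}

    # Fail plan/apply unless Terraform is running as the expected CI role.
    # "ci-deployer" is a placeholder for your real deploy role name.
    resource "terraform_data" "caller_guard" {
      lifecycle {
        precondition {
          condition     = can(regex("assumed-role/ci-deployer", data.aws_caller_identity.current.arn))
          error_message = "Refusing to run: this is not the CI deploy role."
        }
      }
    }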

The blast radius:

  • IAM role trust policies didn’t enforce aws:SourceIp or aws:SourceVpc (a sketch of the tightened policy follows this list)

  • A Lambda for sandbox cleanup was invoked with a prod IAM role

  • An old SSH key on a shared bastion was still valid

  • One engineer’s local kubeconfig pointed to production
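
The core of the fix was a Condition block the trust policies never had. A sketch in Terraform; the account ID, CIDR, and external ID are placeholders, and aws:SourceVpc only evaluates when STS traffic goes through a VPC endpoint, so it stays a comment here:

    resource "aws_iam_role" "prod_deploy" {
      name = "prod-deploy"

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRole"
          Principal = { AWS = "arn:aws:iam::123456789012:root" }
          Condition = {
            # Pin the caller's network location.
            IpAddress = { "aws:SourceIp" = ["10.20.0.0/16"] }
            # Require a per-system external ID.
            StringEquals = { "sts:ExternalId" = "deploy-system-a" }
            # aws:UserAgent is spoofable; treat it as a tripwire, not a boundary.
            StringLike = { "aws:UserAgent" = ["*terraform*"] }
            # Add "aws:SourceVpc" here once STS traffic uses a VPC endpoint.
          }
        }]
      })
    }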

How we fixed it:

  • IAM roles now require an ExternalId tied to each calling system

  • AssumeRole restricted by condition: aws:SourceIp + aws:UserAgent

  • Added a terraform plan signer driven by OIDC claims from GitHub Actions (sketched below)

  • All human access gated via AWS IAM Identity Center with a session TTL (sketched below)

  • Created an internal whoami tool for real-time credential tracing
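
The GitHub Actions piece is worth spelling out, because OIDC federation is what let us delete the long-lived tokens. A sketch using the terraform-provider-aws resources below; the repo path is a placeholder, and the thumbprint is the one GitHub published at the time, so verify it against current docs:

    # Trust GitHub's OIDC issuer once per account.
    resource "aws_iam_openid_connect_provider" "github" {
      url             = "https://token.actions.githubusercontent.com"
      client_id_list  = ["sts.amazonaws.com"]
      thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"]
    }

    # A CI role only assumable by workflows on main in one repo.
    # "company/infra" is a placeholder for your org/repo.
    resource "aws_iam_role" "ci_deployer" {
      name = "ci-deployer"

      assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [{
          Effect    = "Allow"
          Action    = "sts:AssumeRoleWithWebIdentity"
          Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
          Condition = {
            StringEquals = {
              "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
            }
            StringLike = {
              "token.actions.githubusercontent.com:sub" = "repo:company/infra:ref:refs/heads/main"
            }
          }
        }]
      })
    }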
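
The human side is smaller than it sounds: in Identity Center, the session TTL is just a field on the permission set. A sketch, assuming an existing Identity Center instance; the four-hour duration is our choice, not a default:

    data "aws_ssoadmin_instances" "this" {}

    # Permission set whose sessions expire after four hours (ISO 8601 duration).
    resource "aws_ssoadmin_permission_set" "engineer" {
      name             = "engineer"
      instance_arn     = tolist(data.aws_ssoadmin_instances.this.arns)[0]
      session_duration = "PT4H"
    }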

Lessons burned in:

  • Never trust a "secure by default" setup — audit your defaults

  • Rotate everything: keys, tokens, sessions

  • IAM is not least privilege until it fails in production

  • Contextless access is access you can’t attribute

Want the Terraform policy.tf diff?
It’s not pretty. But it’s in version control now.
Because next time, it won’t just be IAM.

Auth fails loud. Access fails silent.
Better make the silent parts traceable.
