Case Study: IAM Misuse in Production
A platform teardown of IAM misuse, broken trust, and a late-night prod scare.
We had alerts. We had CI/CD. We had everything—except a clue who triggered what.
It started with a broken deployment.
The app failed, but no one knew why.
No commits. No builds. No audit trail.
And then we realized:
Someone—or something—had run terraform apply in prod.
With outdated state.
From a machine we didn’t control.
The setup looked fine on paper:
AWS + GitHub Actions
Fine-grained IAM roles
Terraform Cloud backend
A few human users with AdministratorAccess
VPN-only access to *.infra.company.net
But in practice?
sts:GetCallerIdentity returned 3 different roles for the same user in the same minute.
The Terraform Cloud workspace token had never rotated.
A contractor's laptop was still in .aws/config.
The blast radius:
IAM role trust policies didn't enforce aws:SourceIp or aws:SourceVpc conditions (see the sketch after this list)
A Lambda for sandbox cleanup was invoked with prod IAM credentials
An old SSH key on a shared bastion was still valid
One engineer’s local kubeconfig pointed to production
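What that missing condition looks like: a minimal Terraform sketch below, with a hypothetical account ID, role names, and VPN CIDR. aws:SourceVpc behaves the same way for callers that reach STS through a VPC endpoint.

```hcl
# Hypothetical names and values, for illustration only.
data "aws_iam_policy_document" "deploy_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111111111111:role/ci-runner"]
    }

    # The condition our trust policies were missing: pin where the
    # AssumeRole call is allowed to originate.
    condition {
      test     = "IpAddress"
      variable = "aws:SourceIp"
      values   = ["10.20.0.0/16"] # hypothetical VPN range
    }
  }
}

resource "aws_iam_role" "deploy_prod" {
  name               = "deploy-prod"
  assume_role_policy = data.aws_iam_policy_document.deploy_trust.json
}
```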
How we fixed it:
IAM roles now require an ExternalId tied to each calling system (sketched after this list)
AssumeRole restricted by conditions on aws:SourceIp and aws:UserAgent
Added a terraform plan signer backed by OIDC claims from GitHub Actions (also sketched below)
All human access gated via AWS IAM Identity Center with a session TTL
Created an internal whoami tool for real-time credential tracing
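Here is a minimal sketch of the ExternalId and UserAgent conditions, assuming a hypothetical caller role and ExternalId value; the aws:SourceIp condition from the earlier sketch sits alongside them in the same statement. aws:UserAgent can be set to anything by the caller, so it's attribution metadata, not a security boundary.

```hcl
# Hypothetical names and values, for illustration only.
data "aws_iam_policy_document" "system_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::111111111111:role/ci-runner"]
    }

    # One ExternalId per calling system, so a credential leaked from one
    # pipeline can't assume a role that was meant for another.
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = ["prod-deploy-pipeline"] # unique per system
    }

    # Attribution only; the caller controls its own user agent string.
    condition {
      test     = "StringLike"
      variable = "aws:UserAgent"
      values   = ["*Terraform*"] # hypothetical pattern
    }
  }
}
```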
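And the GitHub Actions OIDC side, sketched with a hypothetical org/repo and role name. The trust policy only accepts web-identity tokens whose sub claim matches a specific repository and branch, which is what makes a plan attributable to a workflow run instead of a long-lived key.

```hcl
# Hypothetical org/repo and role name, for illustration only.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # verify current value
}

data "aws_iam_policy_document" "plan_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only workflows from this repo, on this branch, can plan against prod.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:company/infra:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "terraform_plan" {
  name               = "terraform-plan-prod"
  assume_role_policy = data.aws_iam_policy_document.plan_trust.json
}
```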
Lessons burned in:
Never trust a "secure by default" setup — audit your defaults
Rotate everything: keys, tokens, sessions
IAM is not least privilege until it fails in production
Contextless access is access you can’t attribute
Want the Terraform policy.tf diff?
It’s not pretty. But it’s in version control now.
Because next time, it won’t just be IAM.
Auth fails loud. Access fails silent.
Better make the silent parts traceable.