Case Study: AUTOVACUUM ON FIRE

Problem:
Query latency spiked 10× under production load. PostgreSQL was fully I/O bound.
No blocked queries. No locking. Everything “looked fine”. Except it wasn’t.

Symptoms:
• CPU mostly idle, disk 100% busy
• pg_stat_activity: no slow queries
• pg_stat_user_tables: extreme bloat (queries after this list)
• 2M+ dead tuples on orders
• Autovacuum had been running 5.7 hours on a single table
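A minimal sketch of the checks behind those numbers, assuming PostgreSQL 9.6+ so pg_stat_progress_vacuum is available. It surfaces both the bloat and the crawling autovacuum worker:

  -- Dead-tuple bloat per table, worst offenders first
  SELECT relname, n_live_tup, n_dead_tup,
         round(n_dead_tup * 100.0 / greatest(n_live_tup, 1), 1) AS dead_pct,
         last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;

  -- Is an autovacuum worker grinding on a single table?
  SELECT pid, relid::regclass AS table_name, phase,
         heap_blks_scanned, heap_blks_total
  FROM pg_stat_progress_vacuum;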

Root Cause:
A legacy GIN index on a JSONB field (metadata -> 'tags') that nobody used anymore.
Autovacuum wasn’t blocked — it was crawling. Default cost limits, no tuning, no backstops.
It slowly churned through massive index pages and never finished.
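Before dropping anything, confirm the index really is dead weight. A sketch of the check and the fix, assuming statistics have not been reset recently; the index name orders_metadata_tags_gin is hypothetical:

  -- Indexes never scanned since the last stats reset, largest first
  SELECT indexrelname,
         pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
         idx_scan
  FROM pg_stat_user_indexes
  WHERE idx_scan = 0
  ORDER BY pg_relation_size(indexrelid) DESC;

  -- Drop without taking an exclusive lock on orders (hypothetical name)
  DROP INDEX CONCURRENTLY orders_metadata_tags_gin;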

Fixes Applied:
• Dropped the unused GIN index
• Tuned autovacuum settings (sketched after this list):
  – autovacuum_vacuum_cost_limit = 4000
  – autovacuum_vacuum_scale_factor = 0.01
• Added scheduled fallback vacuums via pg_cron
• Increased maintenance_work_mem
• Monitored n_dead_tup for early warning
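Roughly what those fixes look like in SQL. The cost limit and scale factor match the values above; the 1GB memory setting and the 03:00 schedule are illustrative assumptions, not the exact values used:

  -- Per-table autovacuum tuning on the bloated table
  ALTER TABLE orders SET (
      autovacuum_vacuum_cost_limit   = 4000,
      autovacuum_vacuum_scale_factor = 0.01
  );

  -- More memory for vacuum's dead-tuple pass (value is an assumption)
  ALTER SYSTEM SET maintenance_work_mem = '1GB';
  SELECT pg_reload_conf();

  -- Scheduled fallback vacuum via pg_cron (time is an assumption)
  SELECT cron.schedule('0 3 * * *', 'VACUUM (ANALYZE) orders');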

Lessons Learned:

• GIN indexes must be justified and monitored
• Autovacuum can silently kill performance
• Don’t rely on defaults in production
• Visibility into dead tuples is critical

Infra Components:
PostgreSQL, autovacuum, GIN, JSONB, pg_stat_*, pg_cron, disk I/O, maintenance_work_mem


Case Study: IAM Misuse in Production