Case Study: AUTOVACUUM ON FIRE
Problem:
Query latency spiked 10× under production load. PostgreSQL was fully I/O bound.
No blocked queries. No lock contention. Everything “looked fine”. Except it wasn’t.
Symptoms:
• CPU mostly idle, disk 100% busy
• pg_stat_activity: no slow queries
• pg_stat_user_tables: extreme bloat
• 2M+ dead tuples on the orders table
• Autovacuum had been running for 5.7 hours on a single table (see the diagnostic queries after this list)
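
The numbers above came straight from the standard statistics views. A minimal sketch of the diagnostics, assuming PostgreSQL 9.6+ (for pg_stat_progress_vacuum) and that the bloated table is orders:

  -- Tables with the most dead tuples, plus when autovacuum last finished on them
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;

  -- Which table the running autovacuum worker is on, and how long it has been at it
  SELECT pid, a.query_start, p.relid::regclass AS table_name,
         p.phase, p.heap_blks_scanned, p.heap_blks_total
  FROM pg_stat_progress_vacuum p
  JOIN pg_stat_activity a USING (pid);

The second query is how a worker like the 5.7-hour one shows up even when pg_stat_activity reports no slow client queries.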
Root Cause:
A legacy GIN index on a JSONB column (metadata -> 'tags') that nobody queried anymore.
Autovacuum wasn’t blocked; it was crawling. With default cost limits, no tuning, and no backstops,
it slowly churned through the index’s massive page set and never finished.
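
Spotting a dead index like this doesn’t require guesswork; the statistics views record how often each index is scanned. A sketch of the kind of query that surfaces never-used indexes (the filter on indisunique skips constraint-backing indexes, which usually can’t be dropped):

  -- Indexes that have never been scanned since statistics were last reset
  SELECT s.indexrelid::regclass AS index_name,
         s.relid::regclass      AS table_name,
         pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
         s.idx_scan
  FROM pg_stat_user_indexes s
  JOIN pg_index i ON i.indexrelid = s.indexrelid
  WHERE s.idx_scan = 0
    AND NOT i.indisunique
  ORDER BY pg_relation_size(s.indexrelid) DESC;

One caveat: idx_scan counters reset with pg_stat_reset(), so confirm the stats cover a representative window before dropping anything.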
Fixes Applied:
• Dropped the unused GIN index
• Tuned autovacuum settings (first sketch after this list):
– autovacuum_vacuum_cost_limit = 4000
– autovacuum_vacuum_scale_factor = 0.01
• Added scheduled fallback vacuums via pg_cron (second sketch after this list)
• Increased maintenance_work_mem
• Monitored n_dead_tup for early warning
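
A minimal sketch of the remediation SQL. The index name and the maintenance_work_mem value are assumptions; the two autovacuum settings match the values listed above, with the scale factor applied per-table here (it can also be set globally):

  -- Drop the unused GIN index without taking a write lock
  -- (index name is hypothetical)
  DROP INDEX CONCURRENTLY IF EXISTS idx_orders_metadata_tags;

  -- Let autovacuum do more work per cost cycle, and give index cleanup more memory
  ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 4000;
  ALTER SYSTEM SET maintenance_work_mem = '1GB';  -- value is an assumption
  SELECT pg_reload_conf();

  -- Trigger vacuum on orders at ~1% dead tuples instead of the 20% default
  ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.01);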
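
The fallback vacuum and the dead-tuple check, again as a sketch. The job name, schedule, and alert threshold are assumptions, and the named-job form of cron.schedule assumes pg_cron 1.3+:

  -- Nightly fallback vacuum via pg_cron, in case autovacuum falls behind again
  SELECT cron.schedule(
    'nightly-vacuum-orders',
    '0 3 * * *',
    'VACUUM (ANALYZE) orders'
  );

  -- Early-warning query for dead-tuple buildup; feed the result into alerting
  SELECT relname, n_dead_tup
  FROM pg_stat_user_tables
  WHERE n_dead_tup > 500000  -- threshold is an assumption
  ORDER BY n_dead_tup DESC;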
Lessons Learned:
• GIN indexes must be justified and monitored
• Autovacuum can silently kill performance
• Don’t rely on defaults in production
• Visibility into dead tuples is critical
Infra Components: PostgreSQL, autovacuum, GIN, JSONB, pg_stat_*, pg_cron, disk I/O, maintenance_work_mem