Case Study: AUTOVACUUM ON FIRE
Problem:
Query latency spiked 10× under production load. PostgreSQL was fully I/O bound.
No blocked queries. No locking. Everything “looked fine”. Except it wasn’t.
Symptoms:
• CPU mostly idle, disk 100% busy
• pg_stat_activity: no slow queries
• pg_stat_user_tables: extreme bloat
• 2M+ dead tuples on orders
• Autovacuum had been running 5.7 hours on a single table
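This is what that state looks like from the standard statistics views, as a rough sketch: only the orders table name comes from this incident, everything else is stock pg_stat_* machinery.

    -- Dead-tuple bloat per table (orders showed 2M+ dead tuples here)
    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;

    -- Long-running autovacuum workers and how far they have gotten
    SELECT a.pid,
           now() - a.xact_start AS running_for,
           p.relid::regclass    AS vacuuming,
           p.phase,
           p.heap_blks_scanned, p.heap_blks_total
    FROM pg_stat_progress_vacuum p
    JOIN pg_stat_activity a USING (pid);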
Root Cause:
A legacy GIN index on a JSONB field (metadata -> 'tags') that nobody used anymore.
Autovacuum wasn’t blocked — it was crawling. Default cost limits, no tuning, no backstops.
It slowly churned through massive index pages and never finished.
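Confirming that an index is dead weight is one catalog query away. A sketch using the standard pg_stat_user_indexes view; keep in mind idx_scan counts reset with the statistics, so check over a representative window before dropping anything.

    -- Indexes never scanned since the last stats reset, largest first;
    -- a huge, unused GIN index is a prime drop candidate
    SELECT schemaname, relname, indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
           idx_scan
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;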
Fixes Applied:
• Dropped the unused GIN index
• Tuned autovacuum settings:
– autovacuum_vacuum_cost_limit = 4000
– autovacuum_vacuum_scale_factor = 0.01
• Added scheduled fallback vacuums via pg_cron
• Increased maintenance_work_mem
• Monitored n_dead_tup for early warning
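Roughly what those fixes look like in SQL. The index name, memory value, cron schedule, and alert threshold are placeholders rather than values from the incident, and the two autovacuum settings are shown per-table here although they can equally be set globally in postgresql.conf.

    -- Drop the unused GIN index without blocking writes
    DROP INDEX CONCURRENTLY IF EXISTS idx_orders_metadata_tags;

    -- Trigger autovacuum earlier and let it do more work per cycle on the hot table
    ALTER TABLE orders SET (
      autovacuum_vacuum_scale_factor = 0.01,
      autovacuum_vacuum_cost_limit   = 4000
    );

    -- More memory for vacuum's dead-tuple bookkeeping
    ALTER SYSTEM SET maintenance_work_mem = '1GB';
    SELECT pg_reload_conf();

    -- Nightly fallback vacuum via pg_cron
    SELECT cron.schedule('0 3 * * *', 'VACUUM (ANALYZE) public.orders');

    -- Early-warning check on dead tuples
    SELECT relname, n_dead_tup
    FROM pg_stat_user_tables
    WHERE n_dead_tup > 500000;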
Lessons Learned:
• GIN indexes must be justified and monitored
• Autovacuum can silently kill performance
• Don’t rely on defaults in production
• Visibility into dead tuples is critical
Infra Components: PostgreSQL, autovacuum, GIN, JSONB, pg_stat_*, pg_cron, disk I/O, maintenance_work_mem