Lessons from production.
Field notes from real systems — debugging stories, architecture trade-offs, anti-patterns we've paid for in outages. No tutorials, no hot takes. Just what actually held at 3 AM.
Symptom ≠ Root Cause: How the auto-healer became the real problem
A PostgreSQL primary at 91 % CPU. The auto-healer kills the noisiest query. An hour later: 91 % again. The lesson: quick fixes can lock themselves into an infinite loop if nobody asks which pattern is actually repeating.
Cache without a lock is a thundering herd: 14 endpoints, 8 workers, one dead database
Cache expiry is the one moment when parallel workers all get expensive at the same time. Without a per-key lock, every refresh dumps your full load down the slowest path.
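A minimal sketch of the per-key lock idea, assuming an in-process cache shared by threads. `LockedCache`, `slow_loader` and the key `"report"` are hypothetical names for illustration, not code from the article. Across separate worker processes the same pattern would need a shared lock (for example in the cache server or the database), which this sketch does not cover.

```python
import threading
import time
from collections import defaultdict


class LockedCache:
    """Tiny TTL cache that lets only one thread refresh a given key."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._values = {}                          # key -> (expires_at, value)
        self._key_locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()             # protects _key_locks itself

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]                        # fast path: no locking
        with self._guard:
            lock = self._key_locks[key]
        with lock:
            # Re-check: another thread may have refreshed while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader(key)                    # the expensive path, run once
            self._values[key] = (time.monotonic() + self.ttl, value)
            return value


calls = []


def slow_loader(key):
    """Hypothetical stand-in for an expensive database query."""
    calls.append(key)
    time.sleep(0.05)
    return key.upper()


cache = LockedCache(ttl_seconds=60)
threads = [
    threading.Thread(target=cache.get, args=("report", slow_loader))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls))  # 1: eight concurrent misses, one trip down the slow path
```

Without the second freshness check inside the lock, every waiting thread would still run the loader once it acquired the lock, just one after another instead of in parallel.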
Expression index ignored: Why COALESCE in the index didn't match the ORDER BY — 29 500× speedup
A functional index on COALESCE(column, 0) had zero effect. The planner ignored it because the ORDER BY used a subtly different expression. Lesson: expression identity is not a suggestion, it's a precondition.
UPDATE with subquery and LIMIT: When the daemon spins in place
A simple UPDATE pattern that looks correct on small datasets and silently stagnates in production. The cause: a filter in the wrong place kills forward progress.
Commit before async I/O: how a single enricher idled the entire PgBouncer pool
A transaction waiting on an HTTP reply is invisible in the connection pool — but it holds the slot. With twelve parallel daemons that's enough to push a whole backend to 502.
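The shape of the fix can be sketched as two short transactions with the slow call in between. This is an illustration under assumptions, not the article's code: it uses Python's built-in `sqlite3` as a stand-in for a pooled PostgreSQL connection, and the `items` table, `fake_http_enrich` and `enrich_one` are hypothetical names.

```python
import sqlite3
import time


def fake_http_enrich(payload):
    """Hypothetical stand-in for a slow HTTP call to an external service."""
    time.sleep(0.01)
    return payload.upper()


def enrich_one(conn):
    # Transaction 1: claim one row, then commit immediately.
    row = conn.execute(
        "SELECT id, payload FROM items WHERE status = 'new' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    item_id, payload = row
    conn.execute("UPDATE items SET status = 'claimed' WHERE id = ?", (item_id,))
    conn.commit()

    # No transaction is open while we wait on the network.
    open_during_io = conn.in_transaction
    result = fake_http_enrich(payload)

    # Transaction 2: write the result back and commit again.
    conn.execute(
        "UPDATE items SET status = 'done', enriched = ? WHERE id = ?",
        (result, item_id),
    )
    conn.commit()
    return item_id, open_during_io


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items ("
    "id INTEGER PRIMARY KEY, payload TEXT, status TEXT, enriched TEXT)"
)
conn.executemany(
    "INSERT INTO items (payload, status) VALUES (?, 'new')", [("a",), ("b",)]
)
conn.commit()

first = enrich_one(conn)
print(first)  # (1, False): row 1 processed, no open transaction during the call
```

With several daemons on PostgreSQL, the claim step would also have to stop two workers from picking the same row, for instance with `SELECT ... FOR UPDATE SKIP LOCKED`. The cost of this split is that a crash between the two transactions leaves a row in `claimed`, so some form of timeout or reaper for stale claims is usually needed.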
Batch finalisation per container: why the monitor showed nothing for 83 minutes
A pipeline with N parallel sub-jobs finalises its status at the batch level — every worker is running, but the monitor reports standstill until the last container finishes. The fix: finalise per container, not per batch.
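One way to picture per-container finalisation is a status object that each sub-job updates the moment it finishes, so the monitor can read partial progress. The sketch below is an assumption about the shape of such a tracker. `BatchProgress` and the container ids are hypothetical, and a real pipeline would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class BatchProgress:
    """Tracks completion per container instead of once per batch."""

    container_ids: list
    finished: set = field(default_factory=set)

    def finalise_container(self, container_id):
        # Called by each sub-job as soon as its own container is done.
        if container_id not in self.container_ids:
            raise KeyError(container_id)
        self.finished.add(container_id)

    def status(self):
        # What the monitor reads: progress is visible before the batch ends.
        done, total = len(self.finished), len(self.container_ids)
        state = "done" if done == total else "running"
        return {"state": state, "done": done, "total": total}


progress = BatchProgress(["c1", "c2", "c3"])
progress.finalise_container("c1")
print(progress.status())  # {'state': 'running', 'done': 1, 'total': 3}
```

A monitor reading `done` and `total` sees movement after the first container, where a batch-level flag would stay unchanged until the last one finishes.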