Lessons from production.
Field notes from real systems — debugging stories, architecture trade-offs, anti-patterns we've paid for in outages. No tutorials, no hot takes. Just what actually held at 3 AM.
Symptom ≠ Root Cause: How the auto-healer became the real problem
A PostgreSQL primary at 91% CPU. The auto-healer kills the noisiest query. An hour later: 91% again. The lesson: a quick fix that only treats the symptom can trap itself in an endless loop if nobody asks which pattern keeps repeating.
A cache without a lock is a thundering herd: 14 endpoints, 8 workers, one dead database
Cache expiry is the one moment when parallel workers all get expensive at the same time. Without a per-key lock, every refresh dumps your full load down the slowest path.
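To make the per-key lock concrete, here is a minimal sketch in Python. It assumes all workers are threads in one process; the class name `SingleFlightCache`, the `loader` callback, and the `demo` helper are hypothetical and not taken from the post. Workers running as separate processes would need a shared lock instead (for example one held in the database or in the cache server), which this sketch does not cover.

```python
import threading
import time
from collections import defaultdict


class SingleFlightCache:
    """In-process TTL cache: only one thread per key runs the loader."""

    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._values = {}                        # key -> (expires_at, value)
        self._key_locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()           # protects _key_locks

    def _fresh(self, key):
        entry = self._values.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._key_locks[key]
        with lock:
            # Re-check: another thread may have refreshed while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = loader()                     # the expensive path, once
            self._values[key] = (time.monotonic() + self._ttl, value)
            return value


def demo(workers=8):
    """Start `workers` threads against a cold key; return loader call count."""
    cache = SingleFlightCache(ttl_seconds=60)
    calls = []

    def slow_query():
        calls.append(1)
        time.sleep(0.05)                         # stand-in for a slow query
        return "report"

    threads = [
        threading.Thread(target=cache.get, args=("report", slow_query))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)
```

With the lock in place, `demo()` reports a single loader call even though eight threads asked for the same cold key at once. Without the second freshness check inside the lock, each waiting thread would run the loader again after acquiring it.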
Expression index ignored: Why COALESCE in the index didn't match the ORDER BY — 29 500× speedup
A functional index on COALESCE(column, 0) had zero effect. The planner ignored it because the ORDER BY used a subtly different expression. Lesson: expression identity is not a suggestion, it's a precondition.
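The effect can be reproduced in miniature. The sketch below swaps in SQLite (via Python's standard `sqlite3` module) for PostgreSQL, so the plan output looks different from what the post would show, but the matching rule being illustrated is the same: the planner only considers an expression index when the query expression is the same expression. The table `items`, the index `idx_score`, and the mismatching `COALESCE(score, -1)` variant are hypothetical; the post's actual expressions are not shown here.

```python
import sqlite3


def order_by_plan(order_expr):
    """Return the query plan text for ORDER BY <order_expr>."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, score INTEGER)")
    # Expression index on exactly COALESCE(score, 0).
    conn.execute("CREATE INDEX idx_score ON items (COALESCE(score, 0))")
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT id FROM items ORDER BY {order_expr}"
    ).fetchall()
    conn.close()
    # The human-readable plan detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


if __name__ == "__main__":
    print("same expression:     ", order_by_plan("COALESCE(score, 0)"))
    print("different expression:", order_by_plan("COALESCE(score, -1)"))
```

The first plan should mention `idx_score`; the second cannot, because `COALESCE(score, -1)` is a different expression and the index does not apply, so the database has to sort on its own.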
UPDATE with subquery and LIMIT: When the daemon spins in place
A simple UPDATE pattern that looks correct on small datasets and silently stagnates in production. The cause: a filter in the wrong place kills forward progress.
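One plausible shape of this bug, sketched with SQLite through Python's `sqlite3` module. This is an assumption about the pattern, not the post's actual query: the `jobs` table, the `status` column, and both statements are hypothetical. In the stalled variant the `status` filter sits outside the subquery, so the `LIMIT` keeps selecting the same first rows even after they have been processed.

```python
import sqlite3

# Filter outside the subquery: LIMIT always picks the same lowest ids.
STALLED = """
UPDATE jobs SET status = 'done'
WHERE status = 'pending'
  AND id IN (SELECT id FROM jobs ORDER BY id LIMIT 2)
"""

# Filter inside the subquery: LIMIT picks the next unprocessed rows.
PROGRESSING = """
UPDATE jobs SET status = 'done'
WHERE id IN (
    SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 2
)
"""


def run_batches(update_sql, batches=5):
    """Run the UPDATE repeatedly; return rows touched per batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO jobs (status) VALUES (?)", [("pending",)] * 6
    )
    touched = [conn.execute(update_sql).rowcount for _ in range(batches)]
    conn.close()
    return touched


if __name__ == "__main__":
    print(run_batches(STALLED))      # [2, 0, 0, 0, 0]
    print(run_batches(PROGRESSING))  # [2, 2, 2, 0, 0]
```

The stalled variant finishes cleanly whenever the table has no more rows than the batch size, which is why it can pass on small test data and only spin once the backlog is larger than one batch.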