Lessons from production.

Field notes from real systems — debugging stories, architecture trade-offs, anti-patterns we've paid for in outages. No tutorials, no hot takes. Just what actually held at 3 AM.

Debugging2026-04-18 · 5 min

Symptom ≠ Root Cause: How the auto-healer became the real problem

A PostgreSQL primary at 91 % CPU. The auto-healer kills the noisiest query. An hour later: 91 % again. The lesson: quick fixes can lock themselves into an infinite loop if nobody asks which pattern is actually repeating.

Read insight→

Infrastructure2026-04-18 · 6 min

Cache without a lock is thundering herd: 14 endpoints, 8 workers, one dead database

Cache expiry is the one moment when parallel workers all get expensive at the same time. Without a per-key lock, every refresh dumps your full load down the slowest path.

Read insight→

Data Engineering2026-04-18 · 5 min

Expression index ignored: Why COALESCE in the index didn't match the ORDER BY — 29 500× speedup

A functional index on COALESCE(column, 0) had zero effect. The planner ignored it because the ORDER BY used a subtly different expression. Lesson: expression identity is not a suggestion, it's a precondition.

Read insight→

Data Engineering2026-04-18 · 4 min

UPDATE with subquery and LIMIT: When the daemon spins in place

A simple UPDATE pattern that looks correct on small datasets and silently stagnates in production. The cause: a filter in the wrong place kills forward progress.

Read insight→

Data Engineering2026-04-18 · 4 min

Commit before async I/O: how a single enricher idled the entire PgBouncer pool

A transaction waiting on an HTTP reply is invisible in the connection pool — but it holds the slot. With twelve parallel daemons that's enough to push a whole backend to 502.

Read insight→

Distributed Systems2026-04-18 · 4 min

Batch finalisation per container: why the monitor showed nothing for 83 minutes

A pipeline with N parallel sub-jobs finalises its status at the batch level — every worker is running, but the monitor reports standstill until the last container finishes. The fix: finalise per container, not per batch.

Read insight→