ZERONE
Debugging · 2026-04-18 · 5 min read

Symptom ≠ Root Cause: How the auto-healer became the real problem

A PostgreSQL primary at 91% CPU. The auto-healer kills the noisiest query. An hour later: 91% again. The lesson: a quick fix can lock itself into an infinite loop if nobody asks which pattern is actually repeating.

A PostgreSQL primary at 91% CPU. The auto-healer runs every 5 minutes, finds the noisiest query, sends pg_cancel_backend(), and load drops to 35%. An hour later: 91% again. The healer kills the next query. Back to 35%. Another hour. Back to 91%.

You could declare that "the system is working". It isn't. It's burning CPU cycles fighting a symptom it never understands.

The moment of honesty

After three cycles we stopped. Instead of tuning the healer more aggressively, we asked the question we should have asked from the start: What's bringing the load back?

Answer, after a 20-minute audit: 14 different backend endpoints call the same 1.4-million-row COUNT(*) FILTER (...) aggregation. Without a cache lock. On every cache miss, eight parallel uvicorn workers fire the same expensive query simultaneously. Thundering herd.

The healer was right that queries were running that shouldn't have been. It was wrong that killing them solved anything. The real problem was 14 clients all milking the same cow at once.
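The herd is easy to reproduce with a lockless cache. This is a minimal sketch with illustrative names (expensive_count stands in for the 1.4-million-row COUNT(*) FILTER aggregation; the real endpoints aren't shown here): every concurrent cache miss runs the expensive query itself.

```python
import asyncio

_cache: dict = {}
CALLS = 0  # how many times the "expensive query" actually ran

async def expensive_count(key):
    """Stand-in for the 1.4M-row aggregation."""
    global CALLS
    CALLS += 1
    await asyncio.sleep(0.05)  # simulated query latency
    return 42

async def naive_cached_count(key):
    if key in _cache:
        return _cache[key]
    # No lock: every coroutine that misses runs the query itself.
    _cache[key] = await expensive_count(key)
    return _cache[key]

async def herd(n=8):
    # Eight "uvicorn workers" hit the same cold key at once.
    results = await asyncio.gather(
        *(naive_cached_count("stats") for _ in range(n))
    )
    return results, CALLS
```

Running asyncio.run(herd()) executes the expensive query eight times for a single cache key: exactly the load pattern the healer kept seeing.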

What was actually needed

  • Exactly one lock per cache key. We used Redis SET NX EX as a distributed lock plus a per-worker asyncio.Lock as a local short-circuit.
  • Found 14 endpoints, fixed three critical ones immediately, documented the remaining eleven in a ticket with a timeline.
  • Did not tune the healer. The healer was never the problem.
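The first bullet can be sketched as follows. This shows only the local half of the fix, the per-worker asyncio.Lock short-circuit; the cross-worker Redis SET NX EX lock is indicated in a comment, and names like cached_count are illustrative, not from the real codebase:

```python
import asyncio

_cache: dict = {}
_locks: dict = {}  # exactly one asyncio.Lock per cache key
CALLS = 0          # how many times the expensive query actually runs

async def expensive_count(key):
    """Stand-in for the 1.4M-row aggregation."""
    global CALLS
    CALLS += 1
    await asyncio.sleep(0.05)  # simulated query latency
    return 42

async def cached_count(key):
    if key in _cache:                    # fast path: cache hit
        return _cache[key]
    lock = _locks.setdefault(key, asyncio.Lock())
    async with lock:                     # one coroutine per key proceeds
        if key in _cache:                # re-check: a peer may have filled it
            return _cache[key]
        # In production, also take the cross-worker lock here, e.g. with
        # redis-py: r.set(f"lock:{key}", token, nx=True, ex=30)
        _cache[key] = await expensive_count(key)
        return _cache[key]

async def demo(n=8):
    results = await asyncio.gather(
        *(cached_count("stats") for _ in range(n))
    )
    return results, CALLS
```

With the same eight concurrent callers, the aggregation now runs once; everyone else waits on the lock and reads the warm cache.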

After the fix: CPU load holds at 53%. No more escalations. Zero healer cycles needed.

The meta-lesson

Every system has a point where the symptom looks more interesting than the cause. The symptom is loud, visible, and "solvable" with a quick script. The cause sits three layers of abstraction deeper, has no name, and requires that someone actually understand the subsystem.

The rule we now impose on every recurring incident:

"If we ship this fix now — what prevents us from standing here again in one hour?"

If the answer is unclear, the planned fix is not a fix. It's a symptom-suppressor, and the system will find another, louder way to signal the same root cause.

Applies far beyond databases

Every senior engineer knows this pattern:

  • Flaky test? Blind retries in CI are symptom suppression. The test is revealing a real race.
  • Memory leak? Restarting the pod every 6 h is symptom suppression. In between, the leak is still eating customer data.
  • High bounce rate on a button? An A/B test with a new color is symptom suppression. The feature doesn't solve the user's problem.

For every recurring issue: identify the top five consumers of the affected subsystem and check for the pattern there, not just at the spot where the alert was loudest.

A similar fire?

We've likely seen something close. Get in touch.
