Commit before async I/O: how a single enricher idled the entire PgBouncer pool
A transaction waiting on an HTTP reply is invisible in the connection pool — but it holds the slot. With twelve parallel daemon partitions, that's enough to push a whole backend to 502.
The picture looked innocent: an email enricher daemon reads a batch of candidates from the database, checks the email domains against an external validation API, writes the result back. On dev: fine. Single test runs: unremarkable. In the production cluster with twelve parallel partitions: backend 502s every few minutes.
The stack trace didn't point at the enricher — it pointed at unrelated endpoints waiting on a connection that never came. PgBouncer pool: exhausted. Yet the enricher showed only ~4 active queries on the monitor. How?
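The answer is visible the moment you group connections by state instead of counting "active" ones. The query below is standard PostgreSQL (`pg_stat_activity`); the small summarizing helper and the sample numbers are mine, illustrating what the incident would have shown:

```python
from collections import Counter

# Standard PostgreSQL view: one row per server connection, with its state.
STATE_QUERY = """
SELECT state, count(*) AS n
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
"""

def summarize_states(rows):
    """Collapse (state, count) rows into a dict; NULL state means a background slot."""
    counts = Counter()
    for state, n in rows:
        counts[state or "background"] += n
    return dict(counts)

# Roughly what the incident looked like: 4 "active", 12 pinned slots.
rows = [("active", 4), ("idle in transaction", 12), ("idle", 3)]
print(summarize_states(rows))
# {'active': 4, 'idle in transaction': 12, 'idle': 3}
```

The "idle in transaction" bucket is the one most dashboards never chart.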
The silent killer
The code looked like this:
async def enrich_batch():
    async with db.begin() as conn:
        rows = await conn.execute(
            "SELECT id, email FROM job_advertisement "
            "WHERE email_verified IS NULL LIMIT 200"
        )
        for row in rows:
            # Async HTTP — 200-1200 ms per call
            result = await validate_email(row.email)
            await conn.execute(
                "UPDATE job_advertisement SET email_verified=:v WHERE id=:id",
                {"v": result, "id": row.id},
            )
Problem: async with db.begin() opens a transaction and holds the connection until the block exits. Inside the block the loop calls an external API 200 times. 200 × ~500 ms = 100 seconds per batch. All that time: the connection is idle in transaction.
PgBouncer cannot hand such a connection to anyone else: the open transaction pins the server slot for its entire lifetime (by default, idle_transaction_timeout is disabled). Result: backend queries queue for a free connection, hit their timeout, 502. Meanwhile the monitor shows only ~4 active queries, because the twelve idle-in-transaction enricher connections are not counted as "active" by many monitoring setups.
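The arithmetic is easy to reproduce. With a pool sized like the fragment below (illustrative values, not our production config), twelve pinned slots leave almost nothing for the rest of the backend:

```ini
; pgbouncer.ini — illustrative sketch, values are assumptions
[databases]
app = host=127.0.0.1 port=5432 dbname=app

[pgbouncer]
pool_mode = transaction
default_pool_size = 15      ; 12 idle-in-tx + ~3 active = exhausted
query_wait_timeout = 30     ; waiting clients give up after 30 s -> 502
```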
The fix
The rule: commit immediately after SELECT, before any async I/O.
async def enrich_batch():
    # Phase 1: SELECT, then commit immediately
    async with db.begin() as conn:
        rows = await conn.execute(
            "SELECT id, email FROM job_advertisement "
            "WHERE email_verified IS NULL LIMIT 200"
        )
        rows = rows.fetchall()
    # Transaction closed, connection back in the pool

    # Phase 2: HTTP calls without an open DB connection
    results = []
    for row in rows:
        r = await validate_email(row.email)
        results.append((row.id, r))

    # Phase 3: UPDATE batch in a fresh short transaction
    async with db.begin() as conn:
        for rid, r in results:
            await conn.execute(
                "UPDATE job_advertisement SET email_verified=:v WHERE id=:id",
                {"v": r, "id": rid},
            )
The connection is now only open for pure DB operations — milliseconds, not minutes.
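A side benefit: once phase 2 holds no connection, the HTTP calls no longer need to run sequentially. A minimal sketch with asyncio.gather and a semaphore cap — validate_email is stubbed here, and the in-flight limit of 10 is an assumption, not a measured value:

```python
import asyncio

async def validate_email(email):
    # Stub standing in for the real validation API call (200-1200 ms in production).
    await asyncio.sleep(0)
    return email.endswith("@example.com")

async def validate_all(rows, limit=10):
    """Validate (id, email) pairs concurrently, at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(rid, email):
        async with sem:
            return rid, await validate_email(email)

    # gather preserves input order, so results line up with rows.
    return await asyncio.gather(*(one(rid, email) for rid, email in rows))

rows = [(1, "a@example.com"), (2, "b@nowhere.org")]
print(asyncio.run(validate_all(rows)))
# [(1, True), (2, False)]
```

With real latencies, 200 calls at 10 in flight take ~10 seconds of wall time instead of ~100 — and still zero seconds of connection time.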
Operational consequence
The rule as a review pattern:
"Every async function that does DB + external I/O has at least two transactions."
In addition we set idle_in_transaction_session_timeout=5s — a PostgreSQL server parameter, not a PgBouncer option — so a forgotten transaction blocks its slot for at most five seconds before PostgreSQL terminates it. Harsh, but calculated: a single aborted daemon batch beats a full backend outage.
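For reference, the parameter is set on the PostgreSQL server — either globally or, less bluntly, per role so only the daemon user is affected. The role name below is illustrative:

```sql
-- Server-wide (picked up on config reload):
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5s';
SELECT pg_reload_conf();

-- Or scoped to the daemon's role only (role name is a placeholder):
ALTER ROLE enricher_daemon SET idle_in_transaction_session_timeout = '5s';
```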
Why tests don't catch this
Locally, with one worker and a local DB: no pool exhaustion. Locally, with a mocked API: no I/O delay. The anti-pattern is indistinguishable from correct code — it only shows under real load. That's why a review gate belongs in the pipeline: PRs that combine db.begin() with HTTP client imports get an automatic label and require explicit human review.
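The gate itself can be a dozen lines of lint. A sketch of the detection step — the regex patterns and the function name are assumptions; the real gate would wire this into the CI's PR labeling API:

```python
import re

# A transaction block plus an HTTP client import in the same file triggers review.
DB_TXN = re.compile(r"\bdb\.begin\(")
HTTP_CLIENT = re.compile(r"^\s*(import|from)\s+(httpx|aiohttp)\b", re.M)

def needs_review(source: str) -> bool:
    """True if a file mixes DB transaction blocks with HTTP client imports."""
    return bool(DB_TXN.search(source) and HTTP_CLIENT.search(source))

risky = "import httpx\nasync def f():\n    async with db.begin() as conn: ..."
safe = "async def f():\n    async with db.begin() as conn: ..."
print(needs_review(risky), needs_review(safe))
# True False
```

A crude text match over-flags, of course — but for this anti-pattern, a false positive costs one reviewer glance and a false negative costs an outage.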