Commit before async I/O: how a single enricher idled the entire PgBouncer pool
A transaction waiting on an HTTP reply is invisible in the connection pool — but it holds the slot. With twelve parallel daemons that's enough to push a whole backend to 502.
The picture looked innocent: an email enricher daemon reads a batch of candidates from the database, checks the email domains against an external validation API, writes the result back. On dev: fine. Single test runs: unremarkable. In the production cluster with twelve parallel partitions: backend 502s every few minutes.
The stack trace didn't point at the enricher — it pointed at unrelated endpoints waiting on a connection that never came. PgBouncer pool: exhausted. Yet the enricher showed only ~4 active queries on the monitor. How?
The silent killer
The code looked like this:
async def enrich_batch():
async with db.begin() as conn:
rows = await conn.execute(
"SELECT id, email FROM job_advertisement "
"WHERE email_verified IS NULL LIMIT 200"
)
for row in rows:
# Async HTTP — 200-1200 ms per call
result = await validate_email(row.email)
await conn.execute(
"UPDATE job_advertisement SET email_verified=:v WHERE id=:id",
{"v": result, "id": row.id},
)
Problem: async with db.begin() opens a transaction and holds the connection until the block exits. Inside the block the loop calls an external API 200 times. 200 × ~500 ms = 100 seconds per batch. All that time: the connection is idle in transaction.
PgBouncer typically waits 30–60 seconds and then kicks idle-in-transaction connections. Result: backend queries can't get a connection, timeout, 502. The monitor shows just three active SELECTs — the twelve idle-in-tx enricher connections are not counted as "active" by many monitoring setups.
The fix
The rule: commit immediately after SELECT, before any async I/O.
async def enrich_batch():
# Phase 1: SELECT, then commit immediately
async with db.begin() as conn:
rows = await conn.execute(
"SELECT id, email FROM job_advertisement "
"WHERE email_verified IS NULL LIMIT 200"
)
rows = rows.fetchall()
# Transaction closed, connection back in the pool
# Phase 2: HTTP calls without an open DB connection
results = []
for row in rows:
r = await validate_email(row.email)
results.append((row.id, r))
# Phase 3: UPDATE batch in a fresh short transaction
async with db.begin() as conn:
for rid, r in results:
await conn.execute(
"UPDATE job_advertisement SET email_verified=:v WHERE id=:id",
{"v": r, "id": rid},
)
The connection is now only open for pure DB operations — milliseconds, not minutes.
Operational consequence
The rule as a review pattern:
"Every async function that does DB + external I/O has at least two transactions."
In addition we set idle_in_transaction_session_timeout=5s in PgBouncer — a forgotten transaction blocks the pool for at most five seconds, after which PostgreSQL terminates it. Harsh, but calculated: a single daemon error beats a full backend outage.
Why tests don't catch this
Locally, with one worker and a local DB: no pool exhaustion. Locally, with a mocked API: no I/O delay. The anti-pattern is indistinguishable from correct code — it only shows under real load. That's why a review gate belongs in the pipeline: PRs containing await db.begin() and HTTP client imports get an automatic label and require explicit human review.
Stesso fuoco anche da te?
Probabilmente abbiamo già visto qualcosa di simile. Parliamone.
Iniziamo a parlarne→