ZERONE
Data Engineering · 2026-04-18 · 4 min read

Commit before async I/O: how a single enricher idled the entire PgBouncer pool

A transaction waiting on an HTTP reply is invisible in the connection pool — but it holds the slot. With twelve parallel daemons that's enough to push a whole backend to 502.

The picture looked innocent: an email enricher daemon reads a batch of candidates from the database, checks the email domains against an external validation API, writes the result back. On dev: fine. Single test runs: unremarkable. In the production cluster with twelve parallel partitions: backend 502s every few minutes.

The stack trace didn't point at the enricher — it pointed at unrelated endpoints waiting on a connection that never came. PgBouncer pool: exhausted. Yet the enricher showed only a handful of active queries on the monitor. How?

The silent killer

The code looked like this:

from sqlalchemy import text  # assuming db is an SQLAlchemy async engine

async def enrich_batch():
    async with db.begin() as conn:
        rows = await conn.execute(text(
            "SELECT id, email FROM job_advertisement "
            "WHERE email_verified IS NULL LIMIT 200"
        ))
        for row in rows:
            # Async HTTP — 200-1200 ms per call, inside the open transaction
            result = await validate_email(row.email)
            await conn.execute(
                text("UPDATE job_advertisement SET email_verified=:v WHERE id=:id"),
                {"v": result, "id": row.id},
            )

Problem: async with db.begin() opens a transaction and holds the connection until the block exits. Inside the block the loop calls an external API 200 times; at ~500 ms per call that is roughly 100 seconds per batch. All that time the connection sits idle in transaction.
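A quick sketch of that arithmetic (batch size and latency are the figures from above; the DB time itself is negligible next to the HTTP calls):

```python
def transaction_hold_time(batch_size: int, api_latency_s: float) -> float:
    """Seconds the connection stays checked out when every HTTP call
    runs inside the open transaction."""
    return batch_size * api_latency_s

print(transaction_hold_time(200, 0.5))  # 100.0 seconds per batch, per daemon
```

With twelve daemons cycling through batches, that is twelve pool slots held for minutes at a time.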

PgBouncer cannot reuse a connection while its transaction is open, and unless idle_transaction_timeout is configured it never reclaims the slot on its own. Result: backend queries can't get a connection, time out, 502. The monitor shows only a few active SELECTs; the twelve idle-in-transaction enricher connections are not counted as "active" by many monitoring setups.
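The blind spot is easy to reproduce: count pg_stat_activity states the way the pool does, not the way a naive dashboard does. A minimal sketch (the state strings are PostgreSQL's own; the counts mirror the incident above):

```python
from collections import Counter

def pool_view(states):
    """Summarize backend states two ways: what a dashboard shows as
    'active', versus how many slots are actually held. An 'idle in
    transaction' session holds its slot even though it runs no query;
    only plain 'idle' sessions are safe to reuse."""
    c = Counter(states)
    return {
        "shown_as_active": c["active"],
        "slots_actually_held": sum(n for s, n in c.items() if s != "idle"),
    }

# 3 real queries, 12 enricher daemons stuck mid-HTTP-call, 5 free:
states = ["active"] * 3 + ["idle in transaction"] * 12 + ["idle"] * 5
print(pool_view(states))  # {'shown_as_active': 3, 'slots_actually_held': 15}
```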

The fix

The rule: commit immediately after SELECT, before any async I/O.

from sqlalchemy import text  # as above: db is an SQLAlchemy async engine

async def enrich_batch():
    # Phase 1: SELECT, then commit immediately
    async with db.begin() as conn:
        result = await conn.execute(text(
            "SELECT id, email FROM job_advertisement "
            "WHERE email_verified IS NULL LIMIT 200"
        ))
        rows = result.fetchall()  # materialize before the transaction closes
    # Transaction closed, connection back in the pool

    # Phase 2: HTTP calls without an open DB connection
    results = []
    for row in rows:
        r = await validate_email(row.email)
        results.append((row.id, r))

    # Phase 3: UPDATE batch in a fresh short transaction
    async with db.begin() as conn:
        for rid, r in results:
            await conn.execute(
                text("UPDATE job_advertisement SET email_verified=:v WHERE id=:id"),
                {"v": r, "id": rid},
            )

The connection is now only open for pure DB operations — milliseconds, not minutes.
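A nice side effect: since phase 2 holds no connection, the HTTP calls no longer have to run one by one. A hedged sketch of a concurrent phase 2 using asyncio.gather with a semaphore (validate_all and max_concurrent are illustrative names; the limit should match whatever the validation API tolerates):

```python
import asyncio

async def validate_all(rows, validate_email, max_concurrent=20):
    """Phase 2 with bounded concurrency: no DB connection is held,
    so the only limit is the validation API's rate tolerance."""
    sem = asyncio.Semaphore(max_concurrent)

    async def one(row):
        async with sem:
            return (row.id, await validate_email(row.email))

    # gather preserves input order, so results line up with rows
    return await asyncio.gather(*(one(r) for r in rows))
```

With 20-way concurrency, the ~100 second batch shrinks to a few seconds of wall time, still without touching the pool.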

Operational consequence

The rule as a review pattern:

"Every async function that does DB + external I/O has at least two transactions."

In addition we set idle_in_transaction_session_timeout=5s on the PostgreSQL server (it is a server setting, not a PgBouncer one): a forgotten transaction blocks the pool for at most five seconds, after which PostgreSQL terminates it. Harsh, but calculated: a single failed daemon run beats a full backend outage.
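For reference, the setting lives in the PostgreSQL configuration, not in pgbouncer.ini:

```ini
# postgresql.conf (can also be applied via ALTER SYSTEM, or per role)
idle_in_transaction_session_timeout = '5s'
```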

Why tests don't catch this

Locally, with one worker and a local DB: no pool exhaustion. Locally, with a mocked API: no I/O delay. The anti-pattern is indistinguishable from correct code — it only shows under real load. That's why a review gate belongs in the pipeline: PRs that combine async with db.begin() and an HTTP client import get an automatic label and require explicit human review.
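The gate itself can be tiny. A minimal sketch, assuming aiohttp/httpx as the HTTP clients in use; needs_human_review is a hypothetical helper wired into CI:

```python
import re

# A changeset that both opens DB transactions and imports an HTTP
# client is the risk pattern; either alone is fine.
DB_PATTERN = re.compile(r"\bdb\.begin\(")
HTTP_PATTERN = re.compile(r"\b(?:from|import)\s+(?:aiohttp|httpx)\b")

def needs_human_review(diff_text: str) -> bool:
    """Flag changesets that mix DB transactions with HTTP client usage."""
    return bool(DB_PATTERN.search(diff_text) and HTTP_PATTERN.search(diff_text))
```

CI applies the label when this returns True; a reviewer then checks that no transaction spans the HTTP calls.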

A similar fire?

We've likely seen something close. Get in touch.
