UPDATE with subquery and LIMIT: When the daemon spins in place
A simple UPDATE pattern that looks correct on small datasets but silently stagnates in production. The cause: a filter in the wrong place kills forward progress.
The scenario: a background daemon computes a derived score, trust_score, per row and writes it back. The table has around 2 million rows; the daemon runs every 60 seconds, processing a batch of 10 000 per iteration. After a week of observation, only 240 000 rows had their score set. The daemon had essentially stopped making progress after its first run.
The query looked like:
UPDATE job_advertisement ja
SET trust_score = sub.ts
FROM (
SELECT id, compute_trust(id) AS ts
FROM job_advertisement
LIMIT 10000
) sub
WHERE ja.id = sub.id
AND (ja.trust_score IS NULL OR ja.trust_score <> sub.ts);
At first glance: clean, batch-bounded, idempotent, writes only on change. In reality: broken.
What's wrong
The subquery picks the same 10 000 rows every time. Without ORDER BY the result order is formally unspecified, but in practice PostgreSQL's sequential scan keeps returning the first 10 000 rows in heap order. On the first pass the daemon updates exactly those 10 000. On the second pass the subquery returns the same IDs again, but the WHERE predicate trust_score IS NULL OR trust_score <> sub.ts is now false for all of them. Zero updates.
The daemon keeps running, recomputes compute_trust for the same 10 000 rows every iteration, and updates nothing. The filter in the UPDATE's WHERE has no effect on which rows the subquery produces; it just discards them after the fact.
The subquery produces the same candidate set forever, and the WHERE discards it forever.
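The stall can be reproduced in miniature. Here is a sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (the candidate-set semantics are the same), with a hypothetical compute_trust that just returns a constant:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_advertisement (id INTEGER PRIMARY KEY, trust_score INTEGER)")
conn.executemany("INSERT INTO job_advertisement (id) VALUES (?)",
                 [(i,) for i in range(1, 31)])  # 30 rows, all pending

def compute_trust(row_id):
    # Hypothetical stand-in for the real scoring function.
    return 7

def broken_batch(limit=10):
    # Candidate set ignores progress: in practice the same first `limit` rows every time.
    rows = conn.execute("SELECT id FROM job_advertisement LIMIT ?", (limit,)).fetchall()
    updated = 0
    for (row_id,) in rows:
        ts = compute_trust(row_id)
        cur = conn.execute(
            "UPDATE job_advertisement SET trust_score = ? "
            "WHERE id = ? AND (trust_score IS NULL OR trust_score <> ?)",
            (ts, row_id, ts),
        )
        updated += cur.rowcount
    return updated

print(broken_batch())  # first pass: 10
print(broken_batch())  # second pass: 0 -- the daemon spins in place
```

Every call after the first does the same scoring work and writes nothing, exactly the production symptom.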
The fix
Move the filter into the subquery, not after it:
UPDATE job_advertisement ja
SET trust_score = sub.ts
FROM (
SELECT id, compute_trust(id) AS ts
FROM job_advertisement
WHERE trust_score IS NULL
ORDER BY id
LIMIT 10000
) sub
WHERE ja.id = sub.id;
Now the subquery picks fresh candidates every iteration (rows that do not yet have a trust_score), and the outer WHERE accepts all of them. The daemon makes steady progress: 10 000 rows per minute, and the backlog shrinks measurably.
Two related patterns
- ORDER BY RANDOM() LIMIT N as "progress insurance": works, but expensive on large tables because PostgreSQL has to sort the whole table for every batch.
- WHERE id > last_seen_id with a persistent cursor: the correct solution for strictly ordered batch processing. Pro: deterministic, repeatable, monitorable. Con: requires a state file or a small cursor table.
We usually use the filter-in-subquery variant as long as a simple IS NULL predicate is enough. As soon as the progress condition becomes more complex (several flags, foreign-key dependent), we switch to a cursor.
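The cursor variant can be sketched like this (Python/sqlite3; the batch_cursor state table, the daemon name, and the constant score are illustrative assumptions, not from the original system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job_advertisement (id INTEGER PRIMARY KEY, trust_score INTEGER);
-- Hypothetical cursor table: one row of persistent progress state per daemon.
CREATE TABLE batch_cursor (daemon TEXT PRIMARY KEY, last_seen_id INTEGER NOT NULL);
INSERT INTO batch_cursor VALUES ('trust_scorer', 0);
""")
conn.executemany("INSERT INTO job_advertisement (id) VALUES (?)",
                 [(i,) for i in range(1, 26)])  # 25 rows

def cursor_batch(limit=10):
    (last_seen,) = conn.execute(
        "SELECT last_seen_id FROM batch_cursor WHERE daemon = 'trust_scorer'").fetchone()
    rows = conn.execute(
        "SELECT id FROM job_advertisement WHERE id > ? ORDER BY id LIMIT ?",
        (last_seen, limit)).fetchall()
    if not rows:
        return 0
    for (row_id,) in rows:
        conn.execute("UPDATE job_advertisement SET trust_score = 7 WHERE id = ?", (row_id,))
    # Persist progress so the next run resumes after the last processed id.
    conn.execute("UPDATE batch_cursor SET last_seen_id = ? WHERE daemon = 'trust_scorer'",
                 (rows[-1][0],))
    conn.commit()
    return len(rows)

print([cursor_batch() for _ in range(4)])  # [10, 10, 5, 0]
```

The cursor makes progress observable: last_seen_id must grow monotonically, which is trivial to alert on.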
Why it isn't caught in development
On a dev database with 1 000 rows and trust_score initially NULL for all of them: the daemon updates all 1 000 on the first run, reports "nothing to do" on the second. That is the desired behaviour. In production with 2 M rows, only 10 000 are touched per batch — and the same 10 000 every time.
The anti-pattern is invisible at dev scale. It only surfaces in production when someone notices that the "rows with trust_score IS NOT NULL" monitor has been flat for two weeks.
Operational rule
For every batched UPDATE/DELETE/INSERT with LIMIT: make sure the candidate selection itself draws from the progress set (rows still pending); don't rely on the outer WHERE predicate to force progress after the fact.
And a monitorable progress metric per daemon, emitted every minute:
- Rows pending (WHERE progress_condition): X
- Rows done: Y
- Rate/min: Z
If Z is 0 for two iterations in a row, either everything is finished (X == 0) or your daemon is spinning in place.
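The stall check itself fits in a few lines. A minimal sketch with hypothetical names (in a real daemon, pending and done would come from counting queries against the progress condition):

```python
def progress_metric(pending, done, prev_done):
    """One metric sample per iteration; rate is rows completed since the last sample."""
    return {"pending": pending, "done": done, "rate_per_min": done - prev_done}

def is_spinning(samples):
    # Two consecutive zero-rate samples while work remains: the daemon spins in place.
    last_two = samples[-2:]
    return (len(last_two) == 2
            and all(s["rate_per_min"] == 0 for s in last_two)
            and last_two[-1]["pending"] > 0)

samples = [
    progress_metric(1_990_000, 10_000, 0),       # first run: 10 000 done
    progress_metric(1_990_000, 10_000, 10_000),  # rate 0, backlog unchanged
    progress_metric(1_990_000, 10_000, 10_000),  # rate 0 again
]
print(is_spinning(samples))  # True: rate 0 twice while 1.99 M rows remain pending
```

With this in place, the broken query from the top of the post would have paged someone on its third iteration instead of idling for a week.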