Batch finalisation per container: why the monitor showed nothing for 83 minutes
A pipeline with N parallel sub-jobs finalises its status at the batch level — every worker is running, but the monitor reports standstill until the last container finishes. The fix: finalise per container, not per batch.
The bug wasn't that something was broken. The bug was that the monitor claimed nothing was running — while everything was.
The setup: a discovery daemon dispatches 3 000 queries every 30 minutes to 8 parallel Docker containers. Each container processes ~375 queries at 9/min, roughly 41 minutes per container. Because queue lengths differ, finish times spread by 8 to 17 minutes between containers, and the slowest one can take up to 83 minutes.
The monitor polled every 5 minutes: "how many jobs has the batch completed in the last hour?". As long as the batch wasn't fully done, it reported 0 done/h. Operator dashboards said "Discovery pipeline inactive". In reality, eight containers were running at full tilt.
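The effect is easy to reproduce with a toy version of the monitor's windowed count. The query shape (sum of finalize events inside a sliding one-hour window) is an assumption based on the description above, not the actual monitor code:

```python
from datetime import datetime, timedelta

def completed_in_window(finalize_events, now, window=timedelta(hours=1)):
    # Hypothetical monitor query: sum all finalize events inside the window
    return sum(count for ts, count in finalize_events if now - window <= ts <= now)

t0 = datetime(2024, 1, 1, 12, 0)
# Batch-level finalisation: one single event, when the whole batch is done
batch_events = [(t0 + timedelta(minutes=83), 3000)]

# 40 minutes in, all eight containers are busy, but the monitor sees nothing:
print(completed_in_window(batch_events, t0 + timedelta(minutes=40)))  # → 0
```

Until the single finalize event lands, every poll returns 0 — exactly the "Discovery pipeline inactive" symptom.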
The architectural sin
The original code had one finalize() call at the end of the batch wrapper:
```python
def run_batch(queries):
    assign_to_containers(queries)
    wait_for_all_containers()
    finalize(batch_id)  # ← only now does the status propagate
```
Meaning: until the last container finishes its last query, the success signal doesn't exist in any table the monitor reads.
The fix
Each container reports its own completion:
```python
def run_container(container_id, queries):
    for q in queries:
        process(q)
        write_result(q, container_id)
    # Every finishing event is propagated immediately
    finalize_container(container_id, batch_id, count=len(queries))
```
In addition, the batch wrapper now makes only a final finalize_batch(batch_id) call for batch-level stats (total duration, etc.), not for row-level progress.
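Putting the pieces together, here is a minimal runnable sketch of the reworked wrapper. The per-query work is stubbed out, and the thread-per-container dispatch and the in-memory result store are assumptions for illustration, not the production code:

```python
import threading

results = []  # stand-in for the table the monitor reads

def finalize_container(container_id, batch_id, count):
    # Each container's completion becomes visible immediately
    results.append((container_id, count))

def run_container(container_id, batch_id, queries):
    for q in queries:
        pass  # process(q) and write_result(q, container_id) in the real pipeline
    finalize_container(container_id, batch_id, count=len(queries))

def run_batch(batch_id, per_container):
    threads = [threading.Thread(target=run_container, args=(i, batch_id, qs))
               for i, qs in enumerate(per_container)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # finalize_batch(batch_id, ...) would go here: batch-level stats only
    return sum(n for _, n in results)

total = run_batch("b1", [[1, 2, 3], [4, 5], [6]])
print(total)  # → 6
```

The key property: results grows while run_batch is still blocked in join(), so a concurrent monitor sees progress long before the batch-level call fires.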
The monitor now sees new numbers at every container finish. "0 done/h" becomes "37, 284, 531, …" within the first 20 minutes.
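What the monitor sees is now a running total that grows with every container finish. The event times and per-container counts below are assumptions, chosen only to reproduce the running totals quoted above:

```python
from datetime import datetime, timedelta

def running_totals(events):
    # Cumulative completed count after each container-finalize event
    total, out = 0, []
    for _ts, count in sorted(events):
        total += count
        out.append(total)
    return out

t0 = datetime(2024, 1, 1, 12, 0)
events = [(t0 + timedelta(minutes=8), 37),
          (t0 + timedelta(minutes=14), 247),
          (t0 + timedelta(minutes=19), 247)]

print(running_totals(events))  # → [37, 284, 531]
```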
The calibration rule
We derived a numeric rule from the incident that we now apply to every batch pipeline:
Batch size = Workers × Throughput/min × Target minutes
For a target finish time of 15 minutes at 9 queries/min with 4 workers: 4 × 9 × 15 = 540 queries/batch (round to 600). With 8 workers: 8 × 9 × 15 = 1 080 (round to 1 200).
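The rule is simple enough to encode directly. The rounding step is an assumption — rounding up to the next multiple of 200 reproduces the 600 and 1 200 above:

```python
import math

def batch_size(workers, throughput_per_min, target_minutes, round_to=200):
    # Batch size = Workers × Throughput/min × Target minutes, rounded up
    raw = workers * throughput_per_min * target_minutes
    return math.ceil(raw / round_to) * round_to

print(batch_size(4, 9, 15))  # → 600
print(batch_size(8, 9, 15))  # → 1200
```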
The previous 3 000 batch size was a rule-of-thumb with no regard for monitor granularity. With 600-item batches, each iteration now runs under 20 minutes — the monitor sees new finish events every 8 minutes.
Transferable pattern
The anti-pattern is not confined to web crawlers. We've found it in three other setups:
- ETL pipelines that fill staging tables per batch and only push to production via `INSERT ... SELECT` at the end.
- ML training that writes checkpoints only at the end of each epoch — monitoring shows "stale" for 40+ minutes on large epochs.
- Backup jobs that set status to ✅ only after all chunks are done — 6 h of status blindness while the backup runs.
The operational antidote is always the same: finalise as granularly as possible. Per-container, per-shard, per-epoch, per-chunk. Anything that makes monitoring granularity substantially shorter than total runtime is the right call.
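The antidote generalises to a tiny pattern: thread a progress callback through the worker and call it after every unit of work, not once at the end. A minimal sketch (the helper and its names are hypothetical):

```python
def process_chunks(chunks, work, report):
    # Finalise per chunk: call report() after every chunk, not once at the end
    done = 0
    for chunk in chunks:
        for item in chunk:
            work(item)
        done += len(chunk)
        report(done)  # the monitor sees progress after every chunk
    return done

seen = []
process_chunks([[1, 2], [3], [4, 5, 6]], work=lambda x: None, report=seen.append)
print(seen)  # → [2, 3, 6]
```

Whether report() writes a database row, bumps a metric, or touches a checkpoint file is an implementation detail; what matters is that it fires per unit, so monitoring granularity stays well below total runtime.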