Async Batch Processing

Async batch processing is the concurrency engine that sits between a broker full of buffered instrument payloads and the transactional writes that land results in the LIMS. Its job is narrow and unforgiving: drain queued work at a rate the downstream systems can absorb, preserve ordering only where clinical semantics demand it, and guarantee that every payload is committed exactly once or explicitly quarantined — never silently lost, never duplicated. This page specifies the stage contract, the asyncio patterns that make it safe under instrument surge, the idempotency model that survives partition recovery, and the tests that prove the invariants hold at the boundaries where a lost result becomes a patient-safety event.

Context and Pipeline Position

Within the Instrument Data Ingestion & HL7/CSV Pipelines tier, async batch processing is the execution layer, not an acquisition or parsing layer. Upstream, the Serial & FTP Polling Architectures capture raw HL7 v2.x frames, ASTM E1394/E1381 records, and CSV exports and land them, with immutable tracking identifiers, onto a durable broker (RabbitMQ, Kafka, or Redis Streams). This stage consumes from that broker, fans work out to a bounded pool of coroutines, and drives each payload through normalization and the schema validation and error-handling layer before committing. Payloads that arrive as delimited exports are reshaped by the CSV to HL7 Transformation stage first; this layer treats a transformed HL7 message and a natively HL7 message identically. Downstream, a committed result becomes eligible for the Clinical Result Validation & Rule Engine Architecture, which owns range, delta, and critical-value logic. Async batch processing owns none of those clinical decisions — it owns throughput, ordering, retry, and commit atomicity.

The separation matters for scale. Because this stage never mutates clinical meaning, it can run many workers in parallel and be restarted mid-flight without corrupting a verdict — the correctness of a result does not depend on which worker processed it or in what order, only that it was committed once and audited.

Stage Boundaries

Async batch processing exposes one ingress contract and two egress contracts, with explicit failure semantics at each edge. Treat these as the only supported ways a payload enters or leaves the stage.

Ingress — a buffered work item. The stage accepts a broker message carrying an opaque raw payload, a tracking_id assigned at acquisition, an instrument_id, a received_at timestamp, and a priority band (STAT, ROUTINE, RERUN). The stage does not trust that the payload is well-formed; structural validity is established downstream, not assumed at ingress. A message that cannot even be deserialized into this envelope is nacked to the dead-letter queue immediately — it is not retried, because a malformed envelope will never become well-formed on replay.
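
The ingress rule can be sketched as a small decoder that either yields an envelope or routes the message to the dead-letter queue without retrying. This is a minimal illustration, not the pipeline's actual codec: the JSON wire format, the `raw_b64` field name, and the `route` helper are assumptions, and the field set mirrors the `WorkItem` model defined later on this page (which also carries a `specimen_id`).

```python
import base64
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

PRIORITY_BANDS = {"STAT", "ROUTINE", "RERUN"}


class MalformedEnvelope(Exception):
    """Non-retryable: the message cannot be read as a work item at all."""


@dataclass(frozen=True)
class Envelope:
    tracking_id: str
    instrument_id: str
    specimen_id: str
    priority: str
    received_at: datetime
    raw: bytes  # opaque payload; its structure is NOT checked at ingress


def parse_envelope(body: bytes) -> Envelope:
    """Decode one broker message body or raise MalformedEnvelope."""
    try:
        doc = json.loads(body)
        for field in ("tracking_id", "instrument_id", "specimen_id"):
            if not isinstance(doc[field], str):
                raise TypeError(f"{field} must be a string")
        if doc["priority"] not in PRIORITY_BANDS:
            raise ValueError(f"unknown priority band: {doc['priority']!r}")
        return Envelope(
            tracking_id=doc["tracking_id"],
            instrument_id=doc["instrument_id"],
            specimen_id=doc["specimen_id"],
            priority=doc["priority"],
            received_at=datetime.fromisoformat(doc["received_at"]),
            raw=base64.b64decode(doc["raw_b64"], validate=True),
        )
    except (KeyError, TypeError, ValueError) as exc:
        # JSON, base64, and timestamp errors are all ValueError subclasses.
        raise MalformedEnvelope(f"{type(exc).__name__}: {exc}") from exc


def route(body: bytes, dead_letters: list) -> Optional[Envelope]:
    """Return an envelope, or dead-letter the body once and return None."""
    try:
        return parse_envelope(body)
    except MalformedEnvelope as exc:
        dead_letters.append((body, str(exc)))  # no retry: replay cannot fix it
        return None
```

Only the envelope is checked here; the bytes in `raw` pass through untouched, so a well-formed envelope around a truncated HL7 message is still accepted and fails later, in validation.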

Egress A — a committed batch acknowledgment. On success the stage emits, per payload, a durable commit receipt: the tracking_id, the computed idempotency_key, the LIMS transaction identifier, the payload SHA-256 hash, and the commit timestamp. This receipt is the object reconciliation later correlates against instrument transmission logs. The commit is additive and idempotent — replaying the same payload yields the same receipt and no new LIMS row.

Egress B — a quarantine record. When a payload fails validation, exceeds its retry budget, or trips the circuit breaker, it is serialized to a dead-letter queue with full context: the raw payload, the failing stage, the error class, the attempt count, and the last exception chain. Quarantine is a terminal state for the worker; a human data engineer or an automated replay job owns re-driving it.

Failure semantics. Transport and downstream-unavailability failures are retryable with backoff and never collapse into a false success. Schema and semantic failures are not retryable and route straight to quarantine. The invariant across every edge: a payload is only acknowledged to the broker after its LIMS commit is durable, so a worker crash between consume and commit redelivers the message rather than dropping the result. At-least-once delivery plus idempotent commit yields effectively exactly-once.
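
The ack-after-commit invariant can be illustrated with an in-memory stand-in for the broker. Everything here is hypothetical scaffolding (`FakeBroker`, `run_worker`, and the simulated crash); the point is only the ordering of commit and ack, with a set standing in for an idempotent sink.

```python
import asyncio


class FakeBroker:
    """Hypothetical broker: anything not yet acked is redelivered."""

    def __init__(self, messages):
        self.unacked = list(messages)

    def ack(self, msg):
        self.unacked.remove(msg)


async def run_worker(broker, commit, crash_before_ack_of=None):
    for msg in list(broker.unacked):   # snapshot of delivered messages
        await commit(msg)              # durable, idempotent write first
        if msg == crash_before_ack_of:
            return                     # simulated crash: commit done, no ack
        broker.ack(msg)                # only now may the broker forget it


async def demo():
    rows, attempts = set(), []

    async def commit(msg):
        attempts.append(msg)
        rows.add(msg)                  # set.add models an idempotent upsert

    broker = FakeBroker(["r1", "r2", "r3"])
    await run_worker(broker, commit, crash_before_ack_of="r2")
    await run_worker(broker, commit)   # restart: r2 and r3 are redelivered
    return rows, attempts, broker.unacked


rows, attempts, unacked = asyncio.run(demo())
```

In this run `r2` is committed twice but stored once, and nothing remains unacked: the redelivery is absorbed rather than lost or duplicated.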

Batch Framing and Priority Specification

The stage does not process one payload at a time; it assembles bounded batches so that connection acquisition, transaction setup, and audit flushing amortize across many results. Batch framing is governed by three limits, whichever trips first, and by strict priority preemption so a STAT cardiac troponin never waits behind a reference-lab batch reload.

Parameter	Typical range	Rationale
`max_batch_size`	500–5,000 records	Caps memory footprint; higher for flat CSV, lower for OBX-dense HL7.
`max_batch_bytes`	8–32 MiB	Hard ceiling independent of record count for wide payloads.
`linger_ms`	50–500 ms	Max time a partial batch waits before flushing, bounding STAT latency.
`max_concurrency`	8–64 workers	Semaphore-gated ceiling on simultaneous LIMS connections.
`priority_bands`	STAT / ROUTINE / RERUN	STAT preempts; RERUN yields; ROUTINE is the default drain.
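
A framer that honors the first three limits might look like the sketch below. It assumes payloads arrive as `bytes` on an `asyncio.Queue`; the `BatchFramer` class and its carry-over slot are illustrative choices, not part of the documented stage.

```python
import asyncio


class BatchFramer:
    """Flush on whichever trips first: record count, byte budget, or linger."""

    def __init__(self, queue, *, max_batch_size, max_batch_bytes, linger_ms):
        self.queue = queue
        self.max_batch_size = max_batch_size
        self.max_batch_bytes = max_batch_bytes
        self.linger_s = linger_ms / 1000
        self._carry = None  # item that would have overflowed the last batch

    async def next_batch(self) -> list:
        loop = asyncio.get_running_loop()
        if self._carry is not None:
            first, self._carry = self._carry, None
        else:
            first = await self.queue.get()  # wait for work; linger starts after
        # Note: a single item larger than max_batch_bytes still ships alone.
        batch, size = [first], len(first)
        deadline = loop.time() + self.linger_s
        while len(batch) < self.max_batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # linger expired
            try:
                item = await asyncio.wait_for(self.queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break  # linger expired while waiting for the next item
            if size + len(item) > self.max_batch_bytes:
                self._carry = item  # keep it for the next batch
                break
            batch.append(item)
            size += len(item)
        return batch
```

Holding back the overflowing item, instead of appending it and flushing, is what keeps `max_batch_bytes` a ceiling rather than a soft target for multi-item batches.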

Priority is enforced at consumption, not after batching: the consumer drains the STAT stream to empty before touching ROUTINE, and reruns are throttled to a fraction of total concurrency so a bulk reprocessing job cannot starve live instrument traffic. Ordering is preserved only within a single specimen_id — results for the same specimen commit in acquisition order — while cross-specimen ordering is deliberately unconstrained to unlock parallelism.
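
One way to express consumption-time priority is a selector that always empties STAT first and caps in-flight reruns. The text does not say how RERUN interleaves with ROUTINE, so this sketch assumes reruns run only when both higher bands are empty; the `PriorityConsumer` name and the 25% `rerun_fraction` default are likewise assumptions.

```python
from collections import deque


class PriorityConsumer:
    """Pick the next work item: STAT, then ROUTINE, then throttled RERUN."""

    def __init__(self, max_concurrency: int, rerun_fraction: float = 0.25):
        self.queues = {"STAT": deque(), "ROUTINE": deque(), "RERUN": deque()}
        # Reruns may hold at most this many worker slots at once.
        self.rerun_cap = max(1, int(max_concurrency * rerun_fraction))
        self.reruns_in_flight = 0

    def put(self, band: str, item) -> None:
        self.queues[band].append(item)

    def next_item(self):
        """Return (band, item), or None when nothing is eligible right now."""
        for band in ("STAT", "ROUTINE"):
            if self.queues[band]:
                return band, self.queues[band].popleft()
        if self.queues["RERUN"] and self.reruns_in_flight < self.rerun_cap:
            self.reruns_in_flight += 1
            return "RERUN", self.queues["RERUN"].popleft()
        return None

    def done(self, band: str) -> None:
        """Call when a worker finishes an item so rerun slots are released."""
        if band == "RERUN":
            self.reruns_in_flight -= 1
```

With `max_concurrency=4` the cap is one rerun at a time, so a backlog of thousands of reruns can occupy at most one of the four worker slots.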

Implementation Patterns

The core loop is a bounded producer/consumer: a consumer coroutine pulls from the broker and frames batches, a pool of worker coroutines drains them under an asyncio.Semaphore, and structured concurrency via asyncio.TaskGroup guarantees that a failure in one worker cancels its siblings deterministically rather than leaking orphaned tasks. Model the work item and receipt with Pydantic v2 so the envelope is enforced by construction and the receipt is serializable straight into the audit sink. The concurrency primitives used here are covered in the official Python asyncio documentation; asyncio.TaskGroup and asyncio.timeout both require Python 3.11 or later.

python

from __future__ import annotations

import asyncio
import hashlib
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel, ConfigDict, Field


class Priority(str, Enum):
    STAT = "STAT"
    ROUTINE = "ROUTINE"
    RERUN = "RERUN"


class WorkItem(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid")

    tracking_id: str
    instrument_id: str
    specimen_id: str
    priority: Priority = Priority.ROUTINE
    received_at: datetime
    raw: bytes

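    def idempotency_key(self) -> str:
        # Length-prefix each field so ("AB", "C") and ("A", "BC") cannot
        # concatenate to the same byte stream and collide on one key.
        digest = hashlib.sha256()
        for part in (self.instrument_id.encode(), self.specimen_id.encode(), self.raw):
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
        return digest.hexdigest()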


class CommitReceipt(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid")

    tracking_id: str
    idempotency_key: str
    lims_txn_id: str
    payload_sha256: str
    committed_at: datetime

The idempotency key is derived from stable identity — instrument, specimen, and the exact bytes received — not from a wall-clock timestamp, so a redelivery after a broker partition maps to the same key and the commit is a no-op rather than a duplicate result. The worker pool caps concurrent LIMS connections with a semaphore and uses TaskGroup so any unhandled failure propagates as structured cancellation:

python

async def drain_batch(
    batch: list[WorkItem],
    sink: "LimsSink",
    max_concurrency: int = 32,
) -> list[CommitReceipt]:
    gate = asyncio.Semaphore(max_concurrency)
    receipts: list[CommitReceipt] = []

    async def handle(item: WorkItem) -> None:
        async with gate:
            receipt = await commit_one(item, sink)
            receipts.append(receipt)

    async with asyncio.TaskGroup() as tg:
        for item in batch:
            tg.create_task(handle(item))

    return receipts
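
Because drain_batch schedules every item at once, it does not by itself keep two results for the same specimen in acquisition order. One possible arrangement, sketched here under assumptions, is to group the batch into per-specimen lanes and run each lane sequentially while lanes run concurrently. The `drain_ordered` name, the `(specimen_id, payload)` tuples, and the `commit` callable are illustrative; `asyncio.gather` is used only so the sketch also runs on Python versions without TaskGroup.

```python
import asyncio
from collections import defaultdict


async def drain_ordered(batch, commit, max_concurrency=32):
    """Commit concurrently across specimens, sequentially within one."""
    gate = asyncio.Semaphore(max_concurrency)
    lanes = defaultdict(list)
    for specimen_id, payload in batch:      # batch is in acquisition order
        lanes[specimen_id].append(payload)

    async def run_lane(specimen_id, payloads):
        for payload in payloads:            # one at a time per specimen
            async with gate:                # still bounded across lanes
                await commit(specimen_id, payload)

    await asyncio.gather(*(run_lane(s, p) for s, p in lanes.items()))
```

The semaphore is acquired per commit rather than per lane, so a specimen with many results does not hold a connection slot while it waits its turn.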

Every downstream call is wrapped in a timeout guard so a stalled LIMS socket cannot pin a worker indefinitely, and retryable failures use exponential backoff with full jitter to avoid a thundering-herd reconnect when the LIMS recovers. A connection pool (asyncpg for PostgreSQL-backed LIMS, aioodbc for ODBC middleware) keeps connection setup off the hot path:

python

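import random


class RetryableCommitError(Exception):
    """Transport-tier failure (LIMS 5xx, connection reset, pool timeout)."""


async def commit_one(item: WorkItem, sink: "LimsSink", max_attempts: int = 5) -> CommitReceipt:
    key = item.idempotency_key()
    for attempt in range(1, max_attempts + 1):
        try:
            async with asyncio.timeout(10):  # Python 3.11+
                return await sink.upsert(item, idempotency_key=key)
        except (RetryableCommitError, TimeoutError):
            # A timed-out call is a transport failure too; without catching
            # TimeoutError here it would escape the retry loop entirely.
            if attempt == max_attempts:
                raise
            backoff = min(30.0, 2 ** attempt)  # 2, 4, 8, 16 s; capped at 30 s
            await asyncio.sleep(random.uniform(0, backoff))  # full jitter
    raise AssertionError("unreachable")  # pragma: no cover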

The upsert itself must be idempotent at the database layer: an INSERT ... ON CONFLICT (idempotency_key) DO NOTHING RETURNING txn_id, or the LIMS vendor’s dedup endpoint keyed on the same digest. In PostgreSQL, DO NOTHING returns no row when the conflict fires, so a replay must follow up with a SELECT on idempotency_key to recover the original txn_id and rebuild the same receipt. The application-level key and the database-level constraint are belt and braces: either alone closes the duplicate window, but clinical pipelines run both because a single dedup layer is a single point of silent duplication. Acknowledgment is per message and happens only after that message's receipt is durable; when a batch partially fails, the committed items are acked and the rest are re-driven, which is safe precisely because commits are idempotent.
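
The database-layer half of this can be tried end to end with SQLite from the standard library, used here purely so the sketch runs anywhere (it needs SQLite 3.24 or later for ON CONFLICT). The `results` table, its columns, and the `upsert_result` helper are hypothetical. The insert is followed by a SELECT so that a replay always gets back the txn_id of the row that was written first.

```python
import sqlite3
import uuid


def upsert_result(conn, idempotency_key: str, payload: bytes) -> str:
    """Insert once per key; always return the txn_id of the surviving row."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO results (idempotency_key, txn_id, payload) "
            "VALUES (?, ?, ?) ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, uuid.uuid4().hex, payload),
        )
        row = conn.execute(
            "SELECT txn_id FROM results WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results ("
    " idempotency_key TEXT PRIMARY KEY,"  # the unique constraint is the guard
    " txn_id TEXT NOT NULL,"
    " payload BLOB NOT NULL)"
)

first = upsert_result(conn, "k1", b"OBX|1|NM|GLU||5.4")
replay = upsert_result(conn, "k1", b"OBX|1|NM|GLU||5.4")  # redelivery
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

The replay generates a fresh candidate txn_id, but the conflict discards it, so both calls report the identifier minted by the first write.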

Error Classification and Handling

Async batch processing sorts every failure into a three-tier taxonomy, and the tier — not the individual exception — decides the workflow. Collapsing these tiers is the classic defect that turns a transient outage into lost results or a bad record into a wedged queue.

Tier	Example signatures	Disposition
Transport / infrastructure	LIMS 5xx, connection reset, pool timeout, broker redelivery	Retry with backoff + jitter; circuit-breaker after threshold.
Schema / contract	Missing OBR-4 or OBX-5, truncated ASTM frame, undecodable bytes	Non-retryable; quarantine to DLQ with full context.
Semantic / clinical	Unresolvable specimen ID, mismatched accession, unknown test code	Non-retryable; quarantine and flag for human adjudication.

Transport failures are the only retryable tier. A circuit breaker halts a worker after consecutive failures exceed a threshold, so when the LIMS is down the pool stops hammering it and sheds load back to the durable broker rather than spinning. Schema and semantic failures are terminal for the automated path — retrying a truncated ASTM frame produces the same truncated frame — so they route straight to the dead-letter queue with the raw payload, the failing tier, the attempt count, and the full raise ... from ... exception chain preserved for forensic triage. A payload that trips the retry budget on a transport failure is also quarantined rather than dropped, which keeps the at-least-once guarantee intact even under prolonged downstream outages. Every quarantine event and every commit is written to structured JSON logs carrying the tracking_id as a correlation ID, so a single result can be traced across acquisition, this stage, and commit. ACK-timeout handling at the protocol edge is specified separately in handling HL7 ACK timeouts in clinical data pipelines.
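
A consecutive-failure circuit breaker of the kind described can be sketched in a few lines. The text only specifies that the breaker halts a worker past a failure threshold; the timed reset that lets a probe call through (`reset_after_s`), the default values, and the injectable `clock` are assumptions added for illustration and testability.

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cool-down."""

    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        """True if a call to the LIMS may be attempted right now."""
        if self.opened_at is None:
            return True
        # After the cool-down, let calls through as probes (half-open).
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None  # any success closes the breaker

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open and restart cool-down
```

A worker would check `allow()` before each commit attempt and leave the message unacked when it returns False, which is what sheds load back to the broker instead of spinning against a dead LIMS.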

Regulatory Touchpoints

Because this stage is the custody boundary where a buffered payload becomes a committed clinical record, several regulatory clauses bind directly to its behavior. Under CLIA §493.1291(a), the laboratory must have adequate manual or electronic systems to ensure test results are accurately and reliably transmitted from acquisition to the final report; the exactly-once commit model and the reconciliation receipts are the evidence that no result is dropped or duplicated in transit. CAP All Common (COM.30000 / COM.40000) requires documented verification that transmitted data match instrument output and that interface changes are validated before use, which is satisfied here by the payload SHA-256 hash recorded on every receipt and by the versioned worker configuration. 21 CFR Part 11 §11.10(e) requires secure, computer-generated, time-stamped audit trails that record operator actions and cannot obscure prior entries; every commit and every quarantine writes an append-only audit event capturing the operator attribution (system worker or human replay), payload hash, timestamp, and processing stage. Where the payload carries protected health information, the HIPAA Security Rule §164.312(c)(1) and (e)(2)(i) integrity and transmission-security controls apply to every broker hop and LIMS write; the enforcement mechanism for encryption in transit and least-privilege credentials is specified in Security & Access Controls, and the tenancy rules that determine which results a worker may commit are governed by the CLIA/CAP Data Boundaries. Field-level attribution of committed values back to their source segments follows the HL7 v2 Segment Mapping contract.
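
The audit event described above could be shaped roughly as follows. The field names, the `build_audit_event` and `append_event` helpers, and the JSON Lines layout are assumptions for illustration; writing in append order shows the intent only, since real tamper resistance has to come from the storage layer (for example write-once storage or insert-only database permissions).

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_event(*, tracking_id, stage, action, actor, payload: bytes) -> dict:
    """Assemble one audit record for a commit or quarantine."""
    return {
        "tracking_id": tracking_id,   # correlation ID across stages
        "stage": stage,               # processing stage that wrote the event
        "action": action,             # "commit" or "quarantine"
        "actor": actor,               # system worker or human replay operator
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def append_event(stream, event: dict) -> None:
    """Write one JSON line; earlier lines are never rewritten."""
    stream.write(json.dumps(event, sort_keys=True) + "\n")
```

Recording the hash rather than the payload keeps protected health information out of the audit stream while still letting reconciliation prove which bytes were committed.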

Testing and Validation

The invariants of this stage (no loss, no duplication, order within a specimen, and safe degradation to quarantine) are exactly the properties that unit examples miss and property-based tests catch. Use hypothesis to generate adversarial batches and assert the invariants hold across redelivery, partial failure, and concurrent execution.

python

from hypothesis import given, strategies as st


work_items = st.builds(
    WorkItem,
    tracking_id=st.uuids().map(str),
    instrument_id=st.sampled_from(["CHEM-1", "HEME-2", "COAG-3"]),
    specimen_id=st.text(min_size=4, max_size=12),
    received_at=st.datetimes(timezones=st.just(timezone.utc)),
    raw=st.binary(min_size=1, max_size=512),
)


@given(batch=st.lists(work_items, min_size=1, max_size=200))
def test_commit_is_idempotent_under_replay(batch: list[WorkItem]) -> None:
    sink = FakeLimsSink()
    first = asyncio.run(drain_batch(batch, sink))
    # Redeliver the entire batch — simulates broker partition recovery.
    second = asyncio.run(drain_batch(batch, sink))

    keys = {i.idempotency_key() for i in batch}
    assert sink.row_count() == len(keys)  # no duplicate rows on replay
    assert {r.idempotency_key for r in first} == {r.idempotency_key for r in second}

Three additional checks round out coverage. A contract test asserts the ingress/egress envelopes stay compatible with the upstream acquisition stage and the downstream LIMS adapter: a Pydantic round-trip (WorkItem.model_validate(item.model_dump())) fails the build the moment a field is renamed. A golden-file fixture pins byte-exact HL7 and ASTM payloads with their expected commit receipts, so a change to normalization or hashing that shifts an idempotency key is caught before it can silently split one specimen into two LIMS rows. Fault injection (a FakeLimsSink that raises RetryableCommitError on the first N calls, with N below the retry budget, then succeeds) proves the backoff path commits exactly once and does not exhaust its budget on a recoverable outage.
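
A fault-injecting sink for that last check might look like this. It is a simplified stand-in, not the suite's real double: `FakeReceipt` replaces the Pydantic `CommitReceipt`, the exception class is redefined locally so the sketch is self-contained, and `commit_with_retry` mirrors the retry loop with the backoff sleep elided to keep the test fast.

```python
import asyncio
from dataclasses import dataclass


class RetryableCommitError(Exception):
    """Stand-in for the transport-tier error the retry loop catches."""


@dataclass(frozen=True)
class FakeReceipt:
    idempotency_key: str
    lims_txn_id: str


class FakeLimsSink:
    """Fails the first `fail_first` calls, then behaves idempotently."""

    def __init__(self, fail_first: int = 0):
        self.fail_first = fail_first
        self.calls = 0
        self.rows = {}

    async def upsert(self, item, *, idempotency_key: str) -> FakeReceipt:
        self.calls += 1
        if self.calls <= self.fail_first:
            raise RetryableCommitError(f"injected failure #{self.calls}")
        # setdefault keeps the first write, like ON CONFLICT DO NOTHING.
        txn_id = self.rows.setdefault(idempotency_key, f"txn-{len(self.rows) + 1}")
        return FakeReceipt(idempotency_key, txn_id)

    def row_count(self) -> int:
        return len(self.rows)


async def commit_with_retry(sink, item, key, max_attempts=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return await sink.upsert(item, idempotency_key=key)
        except RetryableCommitError:
            if attempt == max_attempts:
                raise
            await asyncio.sleep(0)  # real code backs off with jitter here
```

Setting `fail_first` equal to `max_attempts` exercises the opposite path, where the budget is exhausted and the error must surface so the payload can be quarantined.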

Related

Serial & FTP Polling Architectures — the acquisition layer that buffers raw instrument payloads onto the broker this stage drains.
Schema Validation & Error Handling — the validation contracts each worker enforces before commit.
CSV to HL7 Transformation — the stage that reshapes delimited exports into the HL7 messages this layer commits identically to native ones.
Handling HL7 ACK timeouts in clinical data pipelines — the protocol-edge retry logic that complements this stage’s commit-side backoff.
Clinical Result Validation & Rule Engine Architecture — the downstream tier that applies range, delta, and critical-value logic to committed results.

Part of: Instrument Data Ingestion & HL7/CSV Pipelines.

Frequently Asked Questions

How does at-least-once broker delivery avoid producing duplicate LIMS results?

The broker may redeliver a message after a worker crash or partition, but the commit is keyed on a stable idempotency_key derived from the instrument, specimen, and exact payload bytes. Both the application and a database unique constraint reject a second write for the same key, so at-least-once delivery plus idempotent commit is effectively exactly-once.

Why is the idempotency key not derived from a timestamp?

A wall-clock timestamp changes on redelivery, so two copies of the same result would hash to different keys and both would commit. Keying on stable identity — instrument, specimen, and the raw bytes — guarantees a redelivered payload maps to the existing row and becomes a no-op.

What keeps a STAT result from waiting behind a bulk rerun batch?

Priority is enforced at consumption: the consumer drains the STAT stream to empty before touching ROUTINE, and RERUN traffic is throttled to a fraction of total concurrency. A bulk reprocessing job can never acquire enough workers to starve live instrument traffic.

When does a failed payload get retried versus quarantined?

Only transport and infrastructure failures are retried, with exponential backoff and jitter under a circuit breaker. Schema and semantic failures are terminal — retrying a truncated ASTM frame reproduces the same error — so they route straight to the dead-letter queue with full context for human adjudication. A payload that exhausts its retry budget is also quarantined, never dropped.

Why acknowledge the broker only after the LIMS commit is durable?

Acking before the commit is durable would drop the result if the worker crashed in the window between the two. Deferring the ack until the commit lands means a crash triggers redelivery instead of loss, and idempotency absorbs the redelivery without creating a duplicate.