Integration concepts · 6 min read

MuleSoft batch processing: phases, aggregator, and watermarking

By the MulePrep team · Updated June 2026

Load and Dispatch

Split payload into records, persist to a queue, build blocks

Process

Run each record through the steps, block by block, in parallel

On Complete

Optional summary of successful and failed records

MuleSoft batch processing is the component you reach for when a flow has to move more records than comfortably fit in memory at once - a nightly sync of 200,000 accounts, a CSV with a million rows, a backfill against a rate-limited API. Instead of looping an in-memory collection, a Mule batch job splits the input into individual records, persists them to a queue on disk, and then streams them through your processing steps in parallel. That single design choice is what makes it reliable: because the records live in a persistent queue rather than in heap, the job can process inputs far larger than available memory and can pick up where it left off after a restart.

This post explains how the machinery actually works - the three MuleSoft batch job phases, the Batch Aggregator for efficient bulk calls, and watermarking with polling for incremental syncs - so you can reason about batch the way the exams test it, not just drag the component onto a canvas.

Load and Dispatch

Split payload into records, persist to a queue, build blocks

Process

Run each record through the steps, block by block, in parallel

On Complete

Optional summary of successful and failed records

The three batch job phases

Every batch job moves through three phases in a fixed order. Understanding what each one does - and what runs where - is the single most testable idea in this topic.

Load and Dispatch is implicit; you do not author it. When a payload reaches the batch job, Mule splits it into individual records, persists each record to a queue, and assembles records into blocks (256 records per block by default). This is the phase that lets batch handle datasets that would never fit in memory: only a block at a time is resident, the rest sit on disk.

Process is where your work happens. The job moves block by block, and within a block the records are processed asynchronously, in parallel across a thread pool. A batch job contains one or more batch steps, and every record flows through the steps in sequence. A step can carry an accept expression or accept policy so it only handles a subset of records - for example, only the ones an earlier step flagged.

On Complete is optional and runs exactly once, after every record has been attempted. It does not receive the records; it receives a result summary - how many records loaded, how many succeeded, how many failed. Use it to log outcomes, send a completion notification, or raise an alert when the failure count crosses a threshold. Because it runs once at the end, On Complete is the wrong place to do per-record work and the right place to report on the run as a whole.

PhaseWho writes itWhat it doesPer-record?
Load and DispatchMule (implicit)Splits payload into records, persists to a queue, builds blocksn/a
ProcessYouRuns each record through the batch steps, in parallel, block by blockYes
On CompleteYou (optional)Reports a summary of succeeded/failed records, onceNo

How records and failures flow through steps

A common misconception is that one bad record kills the job. It does not, by default. Each batch step has a max-failed-records setting. With the default of 0, a record that throws an error in a step is marked failed and filtered out of subsequent steps - but the job keeps processing every other record. Set max-failed-records to -1 and the job continues no matter how many records fail; set it to a positive number and the job stops once that many have failed.

This per-record fault isolation is the whole point of batch. A for-each or parallel-for-each iterates an in-memory collection inside a single flow: throw once and, depending on your error handler, you can lose the entire iteration and your place in it. Batch persists each record's state, so a failure is a property of that record, not of the run. By default, later steps only receive records that have succeeded so far; if a step genuinely needs to see the failures (to route them to a dead-letter destination, say), you configure it to accept failed records explicitly.

That filtering behaviour is also why ordering your steps matters: validate and enrich early, write to the system of record last, so a record that fails validation never reaches the expensive downstream call.

The Batch Aggregator: bulk calls instead of one-per-record

Processing records one at a time is correct but often slow and expensive - a thousand records means a thousand HTTP calls or a thousand single-row inserts. The Batch Aggregator (batch:aggregator) fixes this by collecting processed records into a group and releasing them together, so a downstream step operates on many records in one call.

You use it one of two ways:

  • Fixed-size aggregation. Set aggregatorSize to a number - say 100 - and the aggregator releases records in groups of exactly that size. Inside the aggregator, the payload is an array of those 100 records, which you hand to a bulk insert or a bulk API endpoint. This is the standard pattern for Salesforce Composite/Bulk operations and for batched database writes.
  • Streaming aggregation. Set streaming="true" and the aggregator streams the entire step's records as one iterable, letting you write all records in a single pass without holding them in memory. Use this when the whole step's output should go out together and the count is large or unknown.

Sizing the aggregator is a real design decision: too small and you lose the efficiency you came for; too large and you risk hitting payload limits or losing more work if that one bulk call fails. Match aggregatorSize to the downstream system's optimal batch size (for example, the API's documented max per request) rather than guessing. Because aggregation changes the failure unit from one record to a group, decide deliberately how a partial bulk failure should be reported back onto individual records.

Watermarking and polling for incremental syncs

The phases and the aggregator handle how records move through one run. Watermarking handles which records each run should pick up - the difference between re-reading an entire table every poll and reading only what changed.

A watermark is a persisted marker of the last value you successfully processed: a lastModified timestamp, an incrementing id, a sequence number. A polling source - the Scheduler triggering a query, or a connector's built-in watermark feature - reads the saved watermark from the Object Store, queries the source for records strictly greater than it, processes them, and then advances the watermark to the highest value it handled. On the next poll, it asks only for newer records. That is watermarking polling: a repeating schedule turned into an incremental, stateful sync.

Two sharp edges decide whether it is correct:

  • Advance the watermark only after success. If you move the marker before the records are safely processed and the run fails, those records are skipped forever. Update it at the end, against the highest processed value.
  • Boundaries can duplicate or drop. A > comparison can re-read a row that shares the boundary timestamp; a >= can skip one. Because a poll can overlap a write, you will sometimes reprocess a record - which is exactly why the downstream effect should be idempotent. The companion piece on idempotent consumers and at-least-once delivery covers how to make reprocessing harmless.

Watermarking is also why batch pairs naturally with asynchronous design: a scheduled poll feeding a batch job is fire-and-forget by nature, decoupled from any synchronous caller. If that distinction is fuzzy, the sync vs async integration guide draws the line, and the error handling guide explains how on-error-continue versus on-error-propagate changes whether a record-level failure is swallowed or surfaced.

Where this shows up on the MuleSoft exams

Batch is not a Developer I topic - it lives in the implementation and integration-design objectives of the more advanced tracks. If you are studying the MuleSoft Developer II certification, expect scenario questions on the three phases, max-failed-records behaviour, and aggregator sizing. On the architecture side, batch and watermarking show up as reliability and integration-pattern decisions in the MuleSoft Integration Architect certification and as platform-design tradeoffs in the MuleSoft Platform Architect certification. The questions reward understanding why batch persists records and when an incremental watermark beats a full reload - not memorising default block sizes.

If you want to drill these as exam-style questions, the free demo is a low-stakes way to find your weak spots before you commit study time.

Frequently asked questions

What are the three phases of a MuleSoft batch job?
Load and Dispatch, Process, and On Complete. Load and Dispatch is implicit: Mule splits the incoming payload into individual records, persists them to a queue, and groups them into blocks. Process runs each record asynchronously through the batch steps, block by block. On Complete is optional and runs once at the end with a summary of how many records succeeded and failed.
What does the Batch Aggregator do in MuleSoft?
The Batch Aggregator collects processed records into groups and releases them together, so you can send many records in one downstream call instead of one call per record. You set a fixed aggregatorSize (for example 100) for predictable bulk calls, or use streaming to aggregate the entire step. It is the standard way to turn a record-by-record batch into efficient bulk inserts or bulk API requests.
What is watermarking in MuleSoft and how does it relate to polling?
A watermark is a saved marker of the last value you successfully processed, such as a timestamp or an incrementing id. A polling source like the Scheduler or a connector's watermark feature stores it in the Object Store and, on the next poll, queries only for records newer than the watermark. This turns a repeating poll into an incremental sync that does not reprocess rows it has already handled.
When should I use batch processing instead of a for-each or parallel-for-each?
Use batch processing when the dataset is large enough that it should not sit in memory at once, when records should be processed independently with per-record error handling, or when the job must survive a restart. For-each and parallel-for-each iterate an in-memory collection inside a single flow and lose their place if the app stops. Batch persists its records to a queue, so it streams large inputs and resumes work after a crash.
Does a failed record stop the whole batch job?
No, not by default. Each batch step has a max-failed-records setting; with the default of zero, a record that throws is filtered out and the job keeps processing the rest. Subsequent steps only receive records that succeeded so far, unless a step is configured to accept failed records. Set max-failed-records to a negative value to let the job continue regardless of failures, and inspect the counts in the On Complete phase.

Independent study resource - not affiliated with, endorsed by, or connected to MuleSoft or Salesforce; their trademarks belong to their owners. All practice questions are original.