Batch Mode
Overview
Inspect supports calling the batch processing APIs for OpenAI, Anthropic, Google, and Together AI models. Batch processing has lower token costs (typically 50% of normal costs) and higher rate limits, but also substantially longer processing times—batched generations typically complete within an hour but can take much longer (up to 24 hours).
When batch processing is enabled, individual model requests are automatically collected and sent as batches to the provider’s batch API rather than making individual API calls.
When considering whether to use batch processing for an evaluation, you should assess whether your usage pattern is a good fit for batch APIs. Generally evaluations that have a small number of sequential generations (e.g. a QA eval with a model scorer) are a good fit, as these will often complete in a small number of batches without taking many hours.
On the other hand, evaluations with a large and/or variable number of generations (e.g. agentic tasks) can often take many hours or days due to both the large number of batches that must be waited on and the path dependency created between requests in a batch.
Enabling Batch Mode
Pass the `--batch` CLI option or `batch=True` to `eval()` in order to enable batch processing for providers that support it. The `--batch` option supports several formats:
```bash
# Enable batching with default configuration
inspect eval arc.py --model openai/gpt-4o --batch

# Specify a batch size (e.g. 1000 requests per batch)
inspect eval arc.py --model openai/gpt-4o --batch 1000

# Pass a YAML or JSON config file with batch configuration
inspect eval arc.py --model openai/gpt-4o --batch batch.yml
```
Or from Python:
eval("arc.py", model="openai/gpt-4o", batch=True)
eval("arc.py", model="openai/gpt-4o", batch=1000)
If a provider does not support batch processing, the `batch` option is ignored for that provider.
Batch Configuration
For more advanced batch processing configuration, you can specify a `BatchConfig` object in Python or pass a YAML/JSON config file via the `--batch` option. For example:
```python
from inspect_ai.model import BatchConfig

eval(
    "arc.py", model="openai/gpt-4o",
    batch=BatchConfig(size=200, send_delay=60)
)
```
Available `BatchConfig` options include:

| Option | Description |
|---|---|
| `size` | Target number of requests to include in each batch. If not specified, uses provider-specific defaults (OpenAI: 100, Anthropic: 100). Batches may be smaller if the timeout is reached or if requests don’t fit within size limits. |
| `send_delay` | Maximum time (in seconds) to wait before sending a partially filled batch. If not specified, uses a default of 15 seconds. This prevents indefinite waiting when request volume is low. |
| `tick` | Time interval (in seconds) between checking for new batch requests and batch completion status. If not specified, uses a default of 15 seconds. |
| `max_batches` | Maximum number of batches to have in flight at once for a provider (defaults to 100). |
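These options can be combined in a single `BatchConfig`. The sketch below shows all of the documented fields together; the specific values are purely illustrative and not recommended defaults:

```python
from inspect_ai import eval
from inspect_ai.model import BatchConfig

eval(
    "arc.py",
    model="openai/gpt-4o",
    batch=BatchConfig(
        size=500,        # target number of requests per batch
        send_delay=120,  # send a partially filled batch after 2 minutes
        tick=30,         # check for new requests/completions every 30 seconds
        max_batches=50,  # at most 50 batches in flight at once
    ),
)
```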
Batch Processing Flow
When batch processing is enabled, the following steps are taken when handling generation requests:
1. Request Queuing: Individual model requests are queued rather than sent immediately.
2. Batch Formation: Requests are grouped into batches based on size limits and timeouts.
3. Batch Submission: Complete batches are submitted to the provider’s batch API.
4. Status Monitoring: Inspect periodically checks batch completion status.
5. Result Distribution: When batches complete, results are distributed back to the original requests.
These steps are transparent to the caller; however, they do have implications for total evaluation time, as discussed above.
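The flow can be illustrated with a small, self-contained sketch. This is not Inspect’s actual implementation: it uses a hypothetical `fake_batch_api` stand-in for a provider batch endpoint and toy values for the size, send delay, and tick settings, but it exercises the same queue, form, submit, poll, and distribute cycle described above:

```python
import asyncio
import random

BATCH_SIZE = 3    # analogous to BatchConfig.size
SEND_DELAY = 0.5  # analogous to BatchConfig.send_delay (seconds)
TICK = 0.1        # analogous to BatchConfig.tick (seconds)


async def fake_batch_api(prompts: list[str]) -> list[str]:
    """Stand-in for a provider batch endpoint that processes many requests at once."""
    await asyncio.sleep(random.uniform(0.2, 0.6))  # simulated batch processing time
    return [f"completion for: {p}" for p in prompts]


async def batch_worker(queue: asyncio.Queue) -> None:
    """Collect queued requests into batches, submit them, and distribute results."""
    loop = asyncio.get_running_loop()
    while True:
        batch: list[tuple[str, asyncio.Future]] = []
        deadline = loop.time() + SEND_DELAY
        # (1) request queuing / (2) batch formation: fill until size or send delay
        while len(batch) < BATCH_SIZE and loop.time() < deadline:
            try:
                batch.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                await asyncio.sleep(TICK)  # (4) periodic checking
        if not batch:
            continue
        # (3) batch submission
        results = await fake_batch_api([prompt for prompt, _ in batch])
        # (5) result distribution back to the original callers
        for (_, future), result in zip(batch, results):
            future.set_result(result)


async def generate(queue: asyncio.Queue, prompt: str) -> str:
    """What an individual request sees: enqueue, then await the batched result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future


async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    answers = await asyncio.gather(*(generate(queue, f"question {i}") for i in range(7)))
    print("\n".join(answers))
    worker.cancel()


asyncio.run(main())
```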
Details and Limitations
See the provider-specific documentation for OpenAI, Anthropic, Google, and Together AI for additional details on batch processing, including token costs, rate limits, and limitations.
In general, you should keep the following limitations in mind when using batch processing:
- Batches may take up to 24 hours to complete.
- Evaluations with many turns will wait on many batches (each potentially taking many hours), and individual samples will generally take longer because each request must also wait for the other requests in its batch before proceeding to the next turn. For example, a sample with 20 sequential turns where each batch takes an hour to complete will take roughly 20 hours.
- If you are using sandboxes, your machine’s resources may place an upper limit on the number of concurrent samples you can run (correlated with the number of CPU cores), which will in turn reduce batch sizes.
Footnotes
Web search and thinking are not currently supported by Google’s batch mode.