Model Concurrency
Overview
Connections to model APIs are the most fundamental unit of concurrency to manage. The main constraint on model API concurrency is not local compute or network availability but the rate limits imposed by model API providers. Inspect manages this with per-model concurrency limits (a cap on in-flight requests to a given provider) plus automatic retries on rate limits and transient errors.
Two modes are available for managing connections:
Adaptive. Use --adaptive-connections to let Inspect tune the number, scaling up while the provider keeps up and backing off on rate-limit retries.
Static. Set a fixed --max-connections value. You need to know the right number for your tier and workload.
By default, adaptive concurrency is used with a maximum of 100 concurrent connections per model. This page covers using and customizing both modes, plus retry tuning and debugging.
For other forms of parallelism (multiple tasks, sandbox containers, custom code), see Parallelism.
Adaptive Connections
Use the --adaptive-connections option to automatically scale model concurrency to your available capacity. Adaptive connections starts at 20 in-flight requests per model, grows up to the maximum while the provider keeps up, and backs off on rate-limit retries. Adaptive connections is on by default (with a maximum of 100), so the following commands are equivalent:
inspect eval --model openai/gpt-5
inspect eval --model openai/gpt-5 --adaptive-connections=100
Adaptive connections are a new feature introduced in Inspect v0.3.217. If you previously used --max-connections, we recommend migrating to --adaptive-connections, as you will ramp up to the same maximum concurrency with less exposure to exponential backoff from rate limits.
When adaptive connections is in effect, max_samples automatically tracks the controller’s current limit. Set an explicit max_samples to override this behavior.
Bounds Tuning
Tune the bounds with min, start, and max: min is the floor, start is where the controller begins (it doubles aggressively during slow-start until the first rate-limit episode), and max is the ceiling.
Set max higher than where you expect the controller to settle, since it’s a ceiling for the search, not a target. If you’re seeing the controller pin at max without ever scaling down, you likely have headroom: raise max until you observe occasional rate-limit cuts, which is the controller’s signal that it’s operating at the edge of your tier.
The simplest form of bounds tuning is a single integer setting just the maximum:
inspect eval --model openai/gpt-5 --adaptive-connections 50
inspect eval --model openai/gpt-5 --adaptive-connections 200
min-max constrains the range (start defaults to 20, clamped into the range):
inspect eval --model openai/gpt-5 --adaptive-connections 5-50
inspect eval --model openai/gpt-5 --adaptive-connections 10-200
min-start-max also sets the starting value:
inspect eval --model openai/gpt-5 --adaptive-connections 5-10-50
inspect eval --model openai/gpt-5 --adaptive-connections 10-20-200
In Python, pass True for defaults, False to disable adaptive (uses static max_connections instead), int to set the maximum, or an AdaptiveConcurrency to customize:
from inspect_ai import eval
from inspect_ai.util import AdaptiveConcurrency

eval(
    "task.py",
    model="openai/gpt-5",
    adaptive_connections=AdaptiveConcurrency(min=4, max=80),
)
Retry Types
The controller distinguishes two kinds of retries.
Rate-limit retries (HTTP 429). These multiply the limit by decrease_factor (default 0.8) per episode, with a debounce so a single rate-limit burst produces only one cut.
Transient retries (5xx, timeouts, and network errors). These pause scale-up (the eventual success won’t count toward growth) but do not shrink the limit. Provider 5xx errors and network blips are usually infrastructure noise unrelated to your concurrency, and lowering concurrency doesn’t help during an upstream outage.
After a rate-limit cut, the controller waits at least cooldown_seconds (default 15s) before allowing another cut. If the response carries a Retry-After header, the cooldown extends to honor it. Cache hits and successful-after-retry calls are neutral: they neither grow nor shrink the limit.
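The behaviour described above can be sketched as a small toy controller. This is only an illustration under stated assumptions, not Inspect's implementation: the class and method names (ToyController, on_clean_success, on_rate_limit, on_transient_error) are hypothetical, the floor of 1 is an assumed default, and the steady-state growth step (a fraction of the current limit, at least 1) is a guess at how a percentage-based increase might work.

```python
class ToyController:
    """Toy model of an adaptive connection limit. Not Inspect's implementation."""

    def __init__(self, min_limit=1, start=20, max_limit=100,
                 decrease_factor=0.8, cooldown_seconds=15.0,
                 scale_up_percent=0.05):
        self.min_limit = min_limit  # assumed floor (hypothetical default)
        self.max_limit = max_limit
        self.decrease_factor = decrease_factor
        self.cooldown_seconds = cooldown_seconds
        self.scale_up_percent = scale_up_percent
        # Clamp the starting value into [min_limit, max_limit].
        self.limit = max(min_limit, min(start, max_limit))
        self.slow_start = True  # doubling phase until the first rate limit
        self.next_cut_allowed = float("-inf")

    def on_clean_success(self):
        """A round completed with no retries: grow toward the ceiling."""
        if self.slow_start:
            grown = self.limit * 2
        else:
            # Assumption: the steady-state step is a fraction of the current limit.
            grown = self.limit + max(1, round(self.limit * self.scale_up_percent))
        self.limit = min(self.max_limit, grown)

    def on_rate_limit(self, now, retry_after=0.0):
        """HTTP 429: cut the limit, at most once per cooldown window."""
        if now < self.next_cut_allowed:
            return  # debounce: this burst has already produced a cut
        self.slow_start = False
        self.limit = max(self.min_limit, round(self.limit * self.decrease_factor))
        # A Retry-After value longer than the cooldown extends the window.
        self.next_cut_allowed = now + max(self.cooldown_seconds, retry_after)

    def on_transient_error(self):
        """5xx, timeout, or network error: the limit is left unchanged."""
        return
```

Driving the sketch with a few clean rounds followed by a burst of 429s shows the intended shape: the limit doubles during slow start, a burst inside the cooldown window produces a single cut, and transient errors leave the limit where it was.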
Advanced Tuning
The response curve is also tunable. These fields are Python-only (CLI shorthand stays at min-max / min-start-max):
cooldown_seconds (default 15): minimum debounce between scale-down cuts. Larger for long-running agent loops where each rate-limit episode takes longer to clear; smaller for short request workloads.
decrease_factor (default 0.8): multiplicative cut on each rate-limit episode. More aggressive (e.g. 0.5) for volatile tiers where overshoots are common; gentler when tiers are stable.
scale_up_percent (default 0.05): additive growth per clean round in steady state. Increase for short evals where slow ramp-up doesn’t have time to converge.
from inspect_ai import eval
from inspect_ai.util import AdaptiveConcurrency

eval(
    "task.py",
    model="openai/gpt-5",
    adaptive_connections=AdaptiveConcurrency(
        min=4,
        max=80,
        cooldown_seconds=30,
        decrease_factor=0.5,
        scale_up_percent=0.1,
    ),
)
Limit History
The full history of scale changes is captured in the eval log under stats.connection_limit_history. Each entry records the timestamp, model, old and new limits, and a reason of slow_start, steady_state_up, or rate_limit. Only rate_limit reflects an actual scale-down (transient infra noise no longer appears here). You can stream the same events live in the trace log:
inspect trace dump --filter "[connections]"
Limiting Retries
By default, Inspect retries model API calls indefinitely (with exponential backoff) when a recoverable HTTP error occurs. The initial backoff is 3 seconds, and exponential growth results in a wait of roughly 25 minutes before the 10th retry (then 30 minutes for the 11th and subsequent retries). You can limit Inspect’s retries using the --max-retries option:
inspect eval --model openai/gpt-4 --max-retries 10
Note that model interfaces themselves may have internal retry behavior (for example, the openai and anthropic packages both retry twice by default).
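To make the backoff figures above concrete, here is a small worked calculation. It assumes the wait doubles on each retry starting from 3 seconds and is capped at 30 minutes. Both assumptions are inferred from the numbers quoted in this section rather than taken from Inspect's source, and any jitter is ignored. The function name backoff_seconds is hypothetical.

```python
def backoff_seconds(retry: int, initial: float = 3.0, cap: float = 30 * 60) -> float:
    """Wait before the given retry (1-based), doubling each time up to a cap."""
    return min(initial * 2 ** (retry - 1), cap)


# Waits before retries 1 through 11, in seconds.
schedule = [backoff_seconds(n) for n in range(1, 12)]

# Total time spent waiting across the first 10 retries, in seconds.
total_first_ten = sum(schedule[:10])
```

Under these assumptions the wait before the 10th retry is 1536 seconds (about 25.6 minutes) and the 11th hits the 1800-second cap. The first ten waits add up to 3069 seconds, roughly 51 minutes, which is one reason to set --max-retries for unattended runs.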
You can put a limit on the total time for retries using the --timeout option:
inspect eval --model openai/gpt-4 --timeout 600
Debugging Retries
If you want more insight into Model API connections and retries, specify log_level=http. For example:
inspect eval --model openai/gpt-4 --log-level=http
You can also view all of the HTTP requests for the current (or most recent) evaluation run using the inspect trace http command. For example:
inspect trace http # show all http requests
inspect trace http --failed # show only failed requests
Static Connections
If you prefer a static limit for connections, use --max-connections rather than --adaptive-connections. For example:
inspect eval --model openai/gpt-4 --max-connections 20
When both --max-connections and --adaptive-connections are set, the explicit max_connections value takes precedence and adaptive is disabled. To opt out of adaptive without picking a specific cap (the provider’s default applies), pass --adaptive-connections false:
inspect eval --model openai/gpt-4 --adaptive-connections false
Batch mode likewise uses static concurrency regardless of --adaptive-connections.
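The precedence rules above can be summarized as a small decision function. This is a sketch of the rules as stated in this section, not code from Inspect: effective_mode is a hypothetical helper, and it returns the mode together with the cap that applies, with None standing for "not determined by these options" (for example, the provider's default).

```python
from typing import Optional, Tuple, Union


def effective_mode(
    max_connections: Optional[int] = None,
    adaptive_connections: Union[bool, int, None] = None,
    batch: bool = False,
) -> Tuple[str, Optional[int]]:
    """Return (mode, cap) following the precedence described in the text."""
    if batch:
        # Batch mode uses static concurrency regardless of adaptive settings.
        return ("static", max_connections)
    if max_connections is not None:
        # An explicit max_connections wins and disables adaptive.
        return ("static", max_connections)
    if adaptive_connections is False:
        # Opt out without choosing a cap: the provider's default applies.
        return ("static", None)
    if adaptive_connections is None or adaptive_connections is True:
        # Default: adaptive with a maximum of 100.
        return ("adaptive", 100)
    # An integer sets the adaptive maximum.
    return ("adaptive", adaptive_connections)
```

For instance, effective_mode(max_connections=20, adaptive_connections=50) resolves to static with a cap of 20, matching the rule that the explicit max_connections value takes precedence.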
Increasing the max connections might yield better performance due to higher parallelism; however, it might also result in worse performance if it causes you to hit rate limits frequently (rate-limited requests are retried with exponential backoff). The “correct” max connections for your evaluations will vary based on your actual rate limit and the size and complexity of your evaluations.
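If you do need a starting point for a static value, Little's law gives a rough estimate: the number of requests in flight is approximately the request rate multiplied by the average time each request takes. The sketch below applies that to a requests-per-minute limit. The function name, the headroom parameter, and the example numbers are all illustrative assumptions, and the estimate ignores token-based limits, which are often the binding constraint.

```python
import math


def estimate_max_connections(
    requests_per_minute: float,
    avg_latency_seconds: float,
    headroom: float = 0.8,
) -> int:
    """Estimate a static connection cap from a request-rate limit.

    Little's law: in-flight requests = arrival rate * time in system.
    The headroom factor keeps the estimate below the nominal limit.
    """
    in_flight = (requests_per_minute / 60.0) * avg_latency_seconds
    return max(1, math.floor(in_flight * headroom))


# Example: a 600 requests/minute tier with 10-second average responses.
suggested = estimate_max_connections(600, 10)
```

With these example numbers the estimate is 80 connections. Treat it as a first guess to refine by observation, since real latency varies with prompt size and provider load.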
Since it can be difficult to tune this value (especially across different times of day), you are generally much better off using Adaptive Connections, which dynamically finds the maximum throughput that can be supported.
Learning More
Parallelism: running multiple tasks or models in parallel, sandbox container concurrency, and writing parallel custom code.
Batch Mode: provider-side batch APIs (separate quota, longer turnaround, lower per-token cost).