Setting Limits
Overview
In open-ended model conversations (for example, an agent evaluation with tool usage) it’s possible that a model will get “stuck” attempting to perform a task with no realistic prospect of completing it. Further, models will sometimes call commands in a sandbox that take an extremely long time (or, in the worst case, hang indefinitely).
For this type of evaluation it’s normally a good idea to set limits on some combination of total time, total messages, tokens used, and/or cost. This article covers:
- Sample Limits — limits applied to individual samples within a task.
- Scoped Limits — limits applied to arbitrary blocks of code.
- Agent Limits — limits applied to agent execution.
Sample Limits
Sample limits don’t result in errors, but rather an early exit from execution (samples that encounter limits are still scored, albeit nearly always as “incorrect”).
Time Limit
Here we set a time_limit of 15 minutes (15 x 60 seconds) for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=3 * 60)]),
            generate(),
        ],
        time_limit=15 * 60,
        scorer=includes(),
        sandbox="docker",
    )
```

Note that we also set a timeout of 3 minutes for the bash() command. This isn’t required, but is often a good idea so that a single wayward bash command doesn’t consume the entire time_limit.
We can also specify a time limit at the CLI or when calling eval():
```bash
inspect eval ctf.py --time-limit 900
```

Appropriate timeouts will vary depending on the nature of your task, so please treat the above as examples only rather than recommended values.
Working Limit
The working_limit differs from the time_limit in that it measures only the time spent working (as opposed to retrying in response to rate limits or waiting on other shared resources). Working time is computed based on total clock time minus time spent on (a) unsuccessful model generations (e.g. rate limited requests); and (b) waiting on shared resources (e.g. Docker containers or subprocess execution).
In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (azureai) and in these cases the working_time will include any internal retries that the model client performs.
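To make the accounting concrete, here is a small illustrative calculation (all numbers hypothetical):

```python
# Hypothetical accounting for a single sample (all values in seconds)
total_clock_time = 12 * 60   # 12 minutes of wall clock time
retry_wait = 3 * 60          # waiting on rate-limited, retried generations
resource_wait = 2 * 60       # waiting on Docker containers / subprocesses

# Working time is total clock time minus retry and resource waits
working_time = total_clock_time - retry_wait - resource_wait
print(working_time)  # 420 -- only these 7 minutes count against working_limit
```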
Here we set a working_limit of 10 minutes (10 x 60 seconds) for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=3 * 60)]),
            generate(),
        ],
        working_limit=10 * 60,
        scorer=includes(),
        sandbox="docker",
    )
```

Message Limit
Message limits enforce a limit on the number of messages in any conversation (e.g. a TaskState, AgentState, or any input to generate()).
Message limits are checked:
- Whenever you call generate() on any model. A LimitExceededError will be raised if the number of messages passed in the input parameter to generate() is equal to or exceeds the limit. This avoids proceeding to another (wasteful) generate call if we’re already at the limit.
- Whenever TaskState.messages or AgentState.messages is mutated, although in this case a LimitExceededError is only raised if the count exceeds the limit.
Here we set a message_limit of 30 for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        message_limit=30,
        scorer=includes(),
        sandbox="docker",
    )
```

This sets a limit of 30 total messages in a conversation before the model is forced to give up. At that point, whatever output happens to be in the TaskState will be scored (presumably leading to a score of incorrect).
Token Limit
Token usage (using total_tokens of ModelUsage) is automatically recorded for all models. Token limits are checked whenever generate() is called.
Here we set a token_limit of 500K for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        token_limit=(1024 * 500),
        scorer=includes(),
        sandbox="docker",
    )
```

It’s important to note that the token_limit is for all tokens used within the execution of a sample. If you want to limit the number of tokens that can be yielded from a single call to the model you should use the max_tokens generation option.
Cost Limit
Cost is computed from token usage and model cost data (see Model Cost). Cost limits are checked whenever generate() is called.
Here we set a cost_limit of $2.00 for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        cost_limit=2.00,
        scorer=includes(),
        sandbox="docker",
    )
```

The cost_limit requires model cost data to be configured via set_model_cost() or --model-cost-config. An error will be raised if a cost limit is set without cost data for all models used in the evaluation.
Model Cost
Cost tracking requires cost data for each model present in the eval or eval set. There are two ways to set cost data:
Python API:

```python
from inspect_ai.model import set_model_cost, ModelCost

set_model_cost("openai/gpt-4o", ModelCost(
    input=2.50, output=10.00,
    input_cache_write=0, input_cache_read=1.25,
))
```

CLI (YAML or JSON file):
Each model needs a price set for input, output, input_cache_write, and input_cache_read. Prices should be given in dollars per million tokens. Set unused fields to 0.
Below is an example cost config file given in YAML:
```yaml
openai/gpt-4o:
  input: 2.50
  output: 10.00
  input_cache_write: 0
  input_cache_read: 1.25
anthropic/claude-sonnet-4-5-20250514:
  input: 3.00
  output: 15.00
  input_cache_write: 3.75
  input_cache_read: 0.30
```

(As of Feb 9 2026, all major model providers count reasoning tokens as output tokens, so no separate price needs to be provided for reasoning tokens. If your use case requires separate calculation of reasoning token prices, contact us.)
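As a sanity check on how per-million-token prices translate into dollars, here is a small illustrative calculation using the openai/gpt-4o figures above (the token counts are hypothetical):

```python
# Prices from the openai/gpt-4o entry above, in $ per 1M tokens
input_price = 2.50
output_price = 10.00
input_cache_read_price = 1.25

# Hypothetical token usage for a single sample
input_tokens = 200_000
output_tokens = 50_000
cache_read_tokens = 100_000

# Cost is each token count (in millions) times its per-million price
cost = (
    input_tokens / 1_000_000 * input_price
    + output_tokens / 1_000_000 * output_price
    + cache_read_tokens / 1_000_000 * input_cache_read_price
)
print(f"${cost:.3f}")  # $1.125
```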
When model cost data is configured, costs will be tracked for the sample as a whole, as well as any events within the sample that have a ModelUsage field.
Additionally, configuring model cost data allows setting sample cost limits:
```bash
inspect eval ctf.py --model-cost-config pricing.yaml --cost-limit 2.00
```

Custom Limit
When limits are exceeded, a LimitExceededError is raised and caught by the main Inspect sample execution logic. If you want to create custom limit types, you can enforce them by raising a LimitExceededError as follows:
```python
from inspect_ai.util import LimitExceededError

raise LimitExceededError(
    "custom",
    value=value,
    limit=limit,
    message=f"A custom limit was exceeded: {value}"
)
```

Query Usage
We can determine how much of a sample limit has been used, what the limit is, and how much of the resource is remaining:
```python
sample_time_limit = sample_limits().time
print(f"{sample_time_limit.remaining:.0f} seconds remaining")
```

Note that sample_limits() only retrieves the sample-level limits, not scoped limits or agent limits.
Scoped Limits
You can also apply limits at arbitrary scopes, independent of sample-level or agent-level limits, for instance within a specific block of code:
```python
with token_limit(1024 * 500):
    ...
```

A LimitExceededError will be raised if the limit is exceeded. The source field on LimitExceededError will be set to the Limit instance that was exceeded.
When catching LimitExceededError, ensure that your try block encompasses the usage of the limit context manager as some LimitExceededError exceptions are raised at the scope of closing the context manager:
```python
try:
    with token_limit(1024 * 500):
        ...
except LimitExceededError:
    ...
```

The apply_limits() function accepts a list of Limit instances. If any of the limits passed in are exceeded, the limit_error property on the LimitScope yielded when opening the context manager will be set to the exception. By default, all LimitExceededError exceptions are propagated. However, if catch_errors is true, errors which are a direct result of exceeding one of the limits passed to it will be caught. It will always allow LimitExceededError exceptions triggered by other limits (e.g. sample-scoped limits) to propagate up the call stack.
```python
with apply_limits(
    [token_limit(1000), message_limit(10)], catch_errors=True
) as limit_scope:
    ...
if limit_scope.limit_error:
    print(f"One of our limits was hit: {limit_scope.limit_error}")
```

Checking Usage
You can query how much of a limited resource has been used so far via the usage property of a scoped limit. For example:
```python
with token_limit(10_000) as limit:
    await generate()
    print(f"Used {limit.usage:,} of 10,000 tokens")
```

If you’re passing the limit instance to apply_limits() or an agent and want to query the usage, you should keep a reference to it:
```python
limit = token_limit(10_000)
with apply_limits([limit]):
    await generate()
print(f"Used {limit.usage:,} of 10,000 tokens")
```

Time Limit
To limit the wall clock time to 15 minutes within a block of code:
```python
with time_limit(15 * 60):
    ...
```

Internally, this uses anyio’s cancellation scopes. The block will be cancelled at the first yield point (e.g. an await statement).
Working Limit
The working_limit differs from the time_limit in that it measures only the time spent working (as opposed to retrying in response to rate limits or waiting on other shared resources). Working time is computed based on total clock time minus time spent on (a) unsuccessful model generations (e.g. rate limited requests); and (b) waiting on shared resources (e.g. Docker containers or subprocess execution).
In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (azureai) and in these cases the working_time will include any internal retries that the model client performs.
To limit the working time to 10 minutes:
```python
with working_limit(10 * 60):
    ...
```

Unlike time limits, this is not driven by anyio. It is checked periodically, such as from generate() and after each Solver runs.
Message Limit
Message limits enforce a limit on the number of messages in any conversation (e.g. a TaskState, AgentState, or any input to generate()).
Message limits are checked:
- Whenever you call generate() on any model. A LimitExceededError will be raised if the number of messages passed in the input parameter to generate() is equal to or exceeds the limit. This avoids proceeding to another (wasteful) generate call if we’re already at the limit.
- Whenever TaskState.messages or AgentState.messages is mutated, although in this case a LimitExceededError is only raised if the count exceeds the limit.
Scoped message limits behave differently to scoped token limits in that only the innermost active message_limit() is checked.
To limit the conversation length within a block of code:
```python
@agent
def myagent() -> Agent:
    async def execute(state: AgentState):
        with message_limit(50):
            # A LimitExceededError will be raised when the limit is exceeded
            ...
            with message_limit(None):
                # The limit of 50 is temporarily removed in this block of code
                ...
```

It’s important to note that message_limit() limits the total number of messages in the conversation, not just “new” messages appended by an agent.
Token Limit
Token usage (using total_tokens of ModelUsage) is automatically recorded for all models. Token limits are checked whenever generate() is called.
To limit the total number of tokens which can be used in a block of code:
```python
@agent
def myagent(tokens: int = (1024 * 500)) -> Agent:
    async def execute(state: AgentState):
        with token_limit(tokens):
            # a LimitExceededError will be raised if the limit is exceeded
            ...
```

The limits can be stacked. Tokens used while a context manager is open count towards all open token limits.
```python
@agent
def myagent() -> Agent:
    async def execute(state: AgentState):
        with token_limit(1024 * 500):
            ...
            with token_limit(1024 * 200):
                # Tokens used here count towards both active limits
                ...
```

It’s important to note that token_limit() is for all tokens used while the context manager is open. If you want to limit the number of tokens that can be yielded from a single call to the model you should use the max_tokens generation option.
Cost Limit
Cost is computed from token usage and model cost data (see Model Cost). Cost limits are checked whenever generate() is called.
To limit the total cost within a block of code:
```python
@agent
def myagent(budget: float = 2.00) -> Agent:
    async def execute(state: AgentState):
        with cost_limit(budget):
            # a LimitExceededError will be raised if the limit is exceeded
            ...
```

Cost limits work similarly to token limits, with stacking and tracking of costs used while the context manager is open.
Using cost_limit() requires model cost data to be configured via set_model_cost() or --model-cost-config. See Model Cost for details.
Agent Limits
To run an agent with one or more limits, pass the limit object in the limits argument to a function like handoff(), as_tool(), as_solver() or run() (see Using Agents for details on the various ways to run agents).
Here we limit an agent we are including as a solver to 500K tokens:
```python
eval(
    task="research_bench",
    solver=as_solver(web_surfer(), limits=[token_limit(1024 * 500)])
)
```

Here we limit an agent handoff() to 500K tokens:
```python
eval(
    task="research_bench",
    solver=[
        use_tools(
            addition(),
            handoff(web_surfer(), limits=[token_limit(1024 * 500)]),
        ),
        generate()
    ]
)
```

Limit Exceeded
Note that when limits are exceeded during an agent’s execution, the way this is handled differs depending on how the agent was executed:
For agents used via as_solver(), if a limit is exceeded then the sample will terminate (this is exactly how sample-level limits work).
For agents that are run() directly with limits, their limit exceptions will be caught and returned in a tuple. Limits other than the ones passed to run() will propagate up the stack.
```python
from inspect_ai.agent import run

state, limit_error = await run(
    agent=web_surfer(),
    input="What were the 3 most popular movies of 2020?",
    limits=[token_limit(1024 * 500)]
)
if limit_error:
    ...
```

For tool-based agents (handoff() and as_tool()), if a limit is exceeded then a message to that effect is returned to the model, but the sample continues running.