Setting Limits
Overview
In open-ended model conversations (for example, an agent evaluation with tool usage) it’s possible that a model will get “stuck” attempting to perform a task with no realistic prospect of completing it. Further, models will sometimes call commands in a sandbox that take an extremely long time (or, in the worst case, hang indefinitely).
For this type of evaluation it’s normally a good idea to set limits on some combination of total time, total messages, tokens used, and/or cost. This article covers:
- Sample Limits — limits applied to individual samples within a task.
- Scoped Limits — limits applied to arbitrary blocks of code.
- Agent Limits — limits applied to agent execution.
Sample Limits
Sample limits don’t result in errors, but rather an early exit from execution (samples that encounter limits are still scored, albeit nearly always as “incorrect”).
Time Limit
Here we set a time_limit of 15 minutes (15 x 60 seconds) for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=3 * 60)]),
            generate(),
        ],
        time_limit=15 * 60,
        scorer=includes(),
        sandbox="docker",
    )
```

Note that we also set a timeout of 3 minutes for the bash() command. This isn’t required, but is often a good idea so that a single wayward bash command doesn’t consume the entire time_limit.
We can also specify a time limit at the CLI or when calling eval():
```bash
inspect eval ctf.py --time-limit 900
```

Appropriate timeouts will vary depending on the nature of your task, so please treat the above as examples only rather than recommended values.
Working Limit
The working_limit differs from the time_limit in that it measures only the time spent working (as opposed to retrying in response to rate limits or waiting on other shared resources). Working time is computed based on total clock time minus time spent on (a) unsuccessful model generations (e.g. rate limited requests); and (b) waiting on shared resources (e.g. Docker containers or subprocess execution).
In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (azureai) and in these cases the working_time will include any internal retries that the model client performs.
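To make the accounting concrete, here is a small illustrative calculation (all numbers hypothetical):

```python
# Hypothetical accounting for a single sample (all values in seconds)
total_clock_time = 12 * 60   # 12 minutes of wall clock time
retry_wait = 3 * 60          # waiting on rate-limited, retried generations
resource_wait = 2 * 60       # waiting on Docker containers / subprocesses

# Working time is total clock time minus retry and resource waits
working_time = total_clock_time - retry_wait - resource_wait
print(working_time)  # 420 -- only these 7 minutes count against working_limit
```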
Here we set a working_limit of 10 minutes (10 x 60 seconds) for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=3 * 60)]),
            generate(),
        ],
        working_limit=10 * 60,
        scorer=includes(),
        sandbox="docker",
    )
```

Message Limit
Message limits enforce a limit on the number of messages in any conversation (e.g. a TaskState, AgentState, or any input to generate()).
Message limits are checked:
- Whenever you call generate() on any model. A LimitExceededError will be raised if the number of messages passed in the input parameter to generate() is equal to or exceeds the limit. This avoids proceeding to another (wasteful) generate call if we’re already at the limit.
- Whenever TaskState.messages or AgentState.messages is mutated, although in this case a LimitExceededError is only raised if the count exceeds the limit.
Here we set a message_limit of 30 for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        message_limit=30,
        scorer=includes(),
        sandbox="docker",
    )
```

This sets a limit of 30 total messages in a conversation before the model is forced to give up. At that point, whatever output happens to be in the TaskState will be scored (presumably leading to a score of incorrect).
Token Limit
Token usage (using total_tokens of ModelUsage) is automatically recorded for all models. Token limits are checked whenever generate() is called.
Here we set a token_limit of 500K for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        token_limit=(1024 * 500),
        scorer=includes(),
        sandbox="docker",
    )
```

It’s important to note that the token_limit is for all tokens used within the execution of a sample. If you want to limit the number of tokens that can be yielded from a single call to the model you should use the max_tokens generation option.
Cost Limit
Cost is computed from token usage and model cost data (see Model Cost). Cost limits are checked whenever generate() is called.
Here we set a cost_limit of $2.00 for each sample within a task:
```python
@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        cost_limit=2.00,
        scorer=includes(),
        sandbox="docker",
    )
```

The cost_limit requires model cost data to be configured via set_model_cost() or --model-cost-config. An error will be raised if a cost limit is set without cost data for all models used in the evaluation.
Model Cost
Cost tracking requires cost data for each model present in the eval or eval set. There are two ways to set cost data:
Python API:

```python
from inspect_ai.model import set_model_cost, ModelCost

set_model_cost("openai/gpt-4o", ModelCost(
    input=2.50, output=10.00,
    input_cache_write=0, input_cache_read=1.25,
))
```

CLI (YAML or JSON file):
Each model needs a price set for input, output, input_cache_write, and input_cache_read. Prices should be given in dollars per million tokens. Set unused fields to 0.
Below is an example cost config file given in YAML:
```yaml
openai/gpt-4o:
  input: 2.50
  output: 10.00
  input_cache_write: 0
  input_cache_read: 1.25
anthropic/claude-sonnet-4-5-20250514:
  input: 3.00
  output: 15.00
  input_cache_write: 3.75
  input_cache_read: 0.30
```

(As of Feb 9 2026, all major model providers count reasoning tokens as output tokens, so no separate price needs to be provided for reasoning tokens. If your use case requires separate calculation of reasoning token prices, contact us.)
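As a sanity check on how per-million-token prices translate into dollars, here is a small illustrative calculation using the openai/gpt-4o figures above (the token counts are hypothetical):

```python
# Prices from the openai/gpt-4o entry above, in $ per 1M tokens
input_price = 2.50
output_price = 10.00
input_cache_read_price = 1.25

# Hypothetical token usage for a single sample
input_tokens = 200_000
output_tokens = 50_000
cache_read_tokens = 100_000

# Cost is each token count (in millions) times its per-million price
cost = (
    input_tokens / 1_000_000 * input_price
    + output_tokens / 1_000_000 * output_price
    + cache_read_tokens / 1_000_000 * input_cache_read_price
)
print(f"${cost:.3f}")  # $1.125
```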
When model cost data is configured, costs will be tracked for the sample as a whole, as well as any events within the sample that have a ModelUsage field.
Additionally, configuring model cost data allows setting sample cost limits:
```bash
inspect eval ctf.py --model-cost-config pricing.yaml --cost-limit 2.00
```

Custom Limit
When limits are exceeded, a LimitExceededError is raised and caught by the main Inspect sample execution logic. If you want to create custom limit types, you can enforce them by raising a LimitExceededError as follows:
```python
from inspect_ai.util import LimitExceededError

raise LimitExceededError(
    "custom",
    value=value,
    limit=limit,
    message=f"A custom limit was exceeded: {value}"
)
```

Query Usage
We can determine how much of a sample limit has been used, what the limit is, and how much of the resource is remaining:
```python
sample_time_limit = sample_limits().time
print(f"{sample_time_limit.remaining:.0f} seconds remaining")
```

Note that sample_limits() only retrieves the sample-level limits, not scoped limits or agent limits.
Scoped Limits
You can also apply limits at arbitrary scopes, independent of sample-level or agent-level limits, for instance within a specific block of code:
```python
with token_limit(1024 * 500):
    ...
```

A LimitExceededError will be raised if the limit is exceeded. The source field on LimitExceededError will be set to the Limit instance that was exceeded.
When catching LimitExceededError, ensure that your try block encompasses the usage of the limit context manager as some LimitExceededError exceptions are raised at the scope of closing the context manager:
```python
try:
    with token_limit(1024 * 500):
        ...
except LimitExceededError:
    ...
```

The apply_limits() function accepts a list of Limit instances. If any of the limits passed in are exceeded, the limit_error property on the LimitScope yielded when opening the context manager will be set to the exception. By default, all LimitExceededError exceptions are propagated. However, if catch_errors is true, errors which are a direct result of exceeding one of the limits passed to it will be caught. It will always allow LimitExceededError exceptions triggered by other limits (e.g. sample-scoped limits) to propagate up the call stack.
```python
with apply_limits(
    [token_limit(1000), message_limit(10)], catch_errors=True
) as limit_scope:
    ...
if limit_scope.limit_error:
    print(f"One of our limits was hit: {limit_scope.limit_error}")
```

Checking Usage
You can query how much of a limited resource has been used so far via the usage property of a scoped limit. For example:
```python
with token_limit(10_000) as limit:
    await generate()
    print(f"Used {limit.usage:,} of 10,000 tokens")
```

If you’re passing the limit instance to apply_limits() or an agent and want to query the usage, you should keep a reference to it:
```python
limit = token_limit(10_000)
with apply_limits([limit]):
    await generate()
print(f"Used {limit.usage:,} of 10,000 tokens")
```

Time Limit
To limit the wall clock time to 15 minutes within a block of code:
```python
with time_limit(15 * 60):
    ...
```

Internally, this uses anyio’s cancellation scopes. The block will be cancelled at the first yield point (e.g. an await statement).
Working Limit
The working_limit differs from the time_limit in that it measures only the time spent working (as opposed to retrying in response to rate limits or waiting on other shared resources). Working time is computed based on total clock time minus time spent on (a) unsuccessful model generations (e.g. rate limited requests); and (b) waiting on shared resources (e.g. Docker containers or subprocess execution).
In order to distinguish successful generate requests from rate limited and retried requests, Inspect installs hooks into the HTTP client of various model packages. This is not possible for some models (azureai) and in these cases the working_time will include any internal retries that the model client performs.
To limit the working time to 10 minutes:
```python
with working_limit(10 * 60):
    ...
```

Unlike time limits, this is not driven by anyio. It is checked periodically, such as from generate() and after each Solver runs.
Message Limit
Message limits enforce a limit on the number of messages in any conversation (e.g. a TaskState, AgentState, or any input to generate()).
Message limits are checked:
- Whenever you call generate() on any model. A LimitExceededError will be raised if the number of messages passed in the input parameter to generate() is equal to or exceeds the limit. This avoids proceeding to another (wasteful) generate call if we’re already at the limit.
- Whenever TaskState.messages or AgentState.messages is mutated, although in this case a LimitExceededError is only raised if the count exceeds the limit.
Scoped message limits behave differently to scoped token limits in that only the innermost active message_limit() is checked.
To limit the conversation length within a block of code:
```python
@agent
def myagent() -> Agent:
    async def execute(state: AgentState):
        with message_limit(50):
            # A LimitExceededError will be raised when the limit is exceeded
            ...
            with message_limit(None):
                # The limit of 50 is temporarily removed in this block of code
                ...
```

It’s important to note that message_limit() limits the total number of messages in the conversation, not just “new” messages appended by an agent.
Token Limit
Token usage (using total_tokens of ModelUsage) is automatically recorded for all models. Token limits are checked whenever generate() is called.
To limit the total number of tokens which can be used in a block of code:
```python
@agent
def myagent(tokens: int = (1024 * 500)) -> Agent:
    async def execute(state: AgentState):
        with token_limit(tokens):
            # a LimitExceededError will be raised if the limit is exceeded
            ...
```

The limits can be stacked. Tokens used while a context manager is open count towards all open token limits.
```python
@agent
def myagent() -> Agent:
    async def execute(state: AgentState):
        with token_limit(1024 * 500):
            ...
            with token_limit(1024 * 200):
                # Tokens used here count towards both active limits
                ...
```

It’s important to note that token_limit() is for all tokens used while the context manager is open. If you want to limit the number of tokens that can be yielded from a single call to the model you should use the max_tokens generation option.
Cost Limit
Cost is computed from token usage and model cost data (see Model Cost). Cost limits are checked whenever generate() is called.
To limit the total cost within a block of code:
```python
@agent
def myagent(budget: float = 2.00) -> Agent:
    async def execute(state: AgentState):
        with cost_limit(budget):
            # a LimitExceededError will be raised if the limit is exceeded
            ...
```

Cost limits work similarly to token limits, with stacking and tracking of costs used while the context manager is open.
Using cost_limit() requires model cost data to be configured via set_model_cost() or --model-cost-config. See Model Cost for details.
Agent Limits
To run an agent with one or more limits, pass the limit object in the limits argument to a function like handoff(), as_tool(), as_solver() or run() (see Using Agents for details on the various ways to run agents).
Here we limit an agent we are including as a solver to 500K tokens:
```python
eval(
    task="research_bench",
    solver=as_solver(web_surfer(), limits=[token_limit(1024 * 500)])
)
```

Here we limit an agent handoff() to 500K tokens:
```python
eval(
    task="research_bench",
    solver=[
        use_tools(
            addition(),
            handoff(web_surfer(), limits=[token_limit(1024 * 500)]),
        ),
        generate()
    ]
)
```

Limit Exceeded
Note that when limits are exceeded during an agent’s execution, the way this is handled differs depending on how the agent was executed:
For agents used via as_solver(), if a limit is exceeded then the sample will terminate (this is exactly how sample-level limits work).
For agents that are run() directly with limits, their limit exceptions will be caught and returned in a tuple. Limits other than the ones passed to run() will propagate up the stack.
```python
from inspect_ai.agent import run

state, limit_error = await run(
    agent=web_surfer(),
    input="What were the 3 most popular movies of 2020?",
    limits=[token_limit(1024 * 500)]
)
if limit_error:
    ...
```

For tool-based agents (handoff() and as_tool()), if a limit is exceeded then a message to that effect is returned to the model, but the sample continues running.