Custom Agents

Overview

Inspect agents bear some similarity to solvers in that they are functions that accept and return a state. However, agent state is intentionally much more narrow—it consists of only conversation history (messages) and the last model generation (output). This in turn enables agents to be used more flexibly: they can be employed as solvers, tools, participants in a workflow, or delegates in multi-agent systems.
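
For example, a single agent can be adapted to each of these roles using the conversion functions covered in Using Agents (a sketch; my_agent() is a hypothetical agent):

from inspect_ai.agent import as_solver, as_tool, handoff, run

solver = as_solver(my_agent())    # use as a solver
tool = as_tool(my_agent())        # expose as a standard tool
delegate = handoff(my_agent())    # hand the conversation off to the agent
# within an async context: state = await run(my_agent(), "input")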

Below we’ll cover the core Agent protocol, implementing a simple tool use loop, and related APIs for agent memory and observability.

Protocol

An Agent is a function that takes and returns an AgentState. Agent state includes two fields:

Field      Type                Description
--------   -----------------   ---------------------
messages   list[ChatMessage]   Conversation history.
output     ModelOutput         Last model output.
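
For example, here is a minimal agent that simply returns the state it is passed (a sketch illustrating the protocol; the @agent decorator registers the function with Inspect):

from inspect_ai.agent import Agent, AgentState, agent

@agent
def passthrough() -> Agent:
    async def execute(state: AgentState) -> AgentState:
        # read or modify state.messages and state.output here
        return state

    return execute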

Example

Here’s a simple example that implements a web_surfer() agent that uses the web_browser() tool to do open-ended web research:

from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessageSystem, get_model
from inspect_ai.tool import web_browser

@agent
def web_surfer() -> Agent:
    async def execute(state: AgentState) -> AgentState:
        """Web research assistant."""
      
        # some general guidance for the agent
        state.messages.append(
            ChatMessageSystem(
                content="You are a tenacious web researcher that is "
                + "expert at using a web browser to answer questions."
            )
        )

        # run a tool loop w/ the web_browser then update & return state
        messages, state.output = await get_model().generate_loop(
            state.messages, tools=web_browser()
        )
        state.messages.extend(messages)
        return state

    return execute

The agent calls the generate_loop() function, which runs the model in a loop until it stops calling tools. In this case the model may make several calls to the web_browser() tool to fulfil the request.

While this example illustrates the basic mechanic of agents, you generally wouldn’t write an agent that does only this (a system prompt with a tool use loop) as the react() agent provides a more sophisticated and flexible version of this pattern.
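
For example, the web surfer above could be expressed more succinctly with react() (a sketch; see the react() documentation for its full set of options):

from inspect_ai.agent import react
from inspect_ai.tool import web_browser

web_surfer = react(
    name="web_surfer",
    description="Web research assistant",
    prompt="You are a tenacious web researcher that is "
    + "expert at using a web browser to answer questions.",
    tools=web_browser()
)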

Tool Loop

Agents often run a tool use loop, and one of the more common reasons for creating a custom agent is to tailor the behaviour of the loop. Here is an agent loop that has a core similar to the built-in react() agent:

from typing import Sequence
from inspect_ai.agent import AgentState, agent
from inspect_ai.model import execute_tools, get_model
from inspect_ai.tool import (
    Tool, ToolDef, ToolSource, mcp_connection
)

@agent
def my_agent(tools: Sequence[Tool | ToolDef | ToolSource]):
    async def execute(state: AgentState):

        # establish MCP server connections required by tools
        async with mcp_connection(tools):

            while True:
                # call model and append to messages
                state.output = await get_model().generate(
                    input=state.messages,
                    tools=tools,
                )
                state.messages.append(state.output.message)

                # make tool calls or terminate if there are none
                if state.output.message.tool_calls:
                    messages, state.output = await execute_tools(
                        state.messages, tools
                    )
                    state.messages.extend(messages)
                else:
                    break

            return state

    return execute

Some notes on this implementation:

  1. Enable passing tools to the agent using a variety of types (including ToolSource, which enables use of tools from Model Context Protocol (MCP) servers).

  2. Establish any required connections to MCP servers (this isn’t required, but will improve performance by re-using connections across tool calls).

  3. Standard LLM inference step, yielding an assistant message which we append to our message history.

  4. Execute tool calls. Note that this may update output and/or result in multiple additional messages being appended in the case that one of the tools is a handoff() to a sub-agent.

The above represents a minimal tool use loop; your custom agents may diverge from it in various ways. For example, you might want to:

  1. Add another termination condition for the output satisfying some criteria.
  2. Add a critique / reflection step between tool calling and generate.
  3. Urge the model to keep going after it decides to stop calling tools.
  4. Handle context window overflow (stop_reason=="model_length") by truncating or summarising the messages.
  5. Examine and possibly filter the tool calls before invoking execute_tools().

For example, you might respond to context window overflow (the fourth item above) by truncating the message history with trim_messages():

# check for context window overflow
if state.output.stop_reason == "model_length":
    state.messages = trim_messages(state.messages)
    continue
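
Similarly, you might screen tool calls before executing them (the last item above). Here is a sketch, where ALLOWED_TOOLS is a hypothetical allow list of tool names:

# filter out any tool calls not in our allow list (ALLOWED_TOOLS is hypothetical)
if state.output.message.tool_calls:
    state.output.message.tool_calls = [
        call for call in state.output.message.tool_calls
        if call.function in ALLOWED_TOOLS
    ]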

Note that the standard react() agent provides some of these agent loop enhancements (urging the model to continue and handling context window overflow).

Sample Store

In some cases agents will want to retain state across multiple invocations, or even share state with other agents or tools. This can be accomplished in Inspect using the Store, which provides a sample-scoped scratchpad for arbitrary values.

Typed Store

When developing agents, you should use the typed interface to the per-sample store, which provides both type-checking and namespacing for store access.

For example, here we define a typed accessor to the store by deriving from the StoreModel class (which in turn derives from Pydantic BaseModel):

from pydantic import Field
from inspect_ai.util import StoreModel

class Activity(StoreModel):
    active: bool = Field(default=False)
    tries: int = Field(default=0)
    actions: list[str] = Field(default_factory=list)

We can then get access to a sample scoped instance of the store for use in agents using the store_as() function:

from inspect_ai.util import store_as

activity = store_as(Activity)
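
Fields on the returned accessor can then be read and written like regular attributes, with changes recorded in the sample’s store. For example:

activity.active = True
activity.tries += 1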

Agent Instances

If you want an agent to have a store-per-instance by default, add an instance parameter to your @agent function and default it to uuid(). Then, forward the instance on to store_as() as well as any tools you call that are also stateful (e.g. web_browser()). For example:

from pydantic import Field
from shortuuid import uuid

from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessage, get_model
from inspect_ai.tool import web_browser
from inspect_ai.util import StoreModel, store_as

class WebSurferState(StoreModel):
    messages: list[ChatMessage] = Field(default_factory=list)

@agent
def web_surfer(instance: str | None = uuid()) -> Agent:
    
    async def execute(state: AgentState) -> AgentState:

        # get state for this instance
        surfer_state = store_as(WebSurferState, instance=instance)

        ...

        # pass the instance on to web_browser
        messages, state.output = await get_model().generate_loop(
            state.messages, tools=web_browser(instance=instance)
        )
        state.messages.extend(messages)
        return state

    return execute

This enables you to have multiple instances of the web_surfer() agent, each with their own state and web browser.

Named Instances

It’s also possible that you’ll want to create various named store instances that are shared across agents (e.g. each participant in a game might need their own store). Use the instance parameter of store_as() to explicitly create scoped store accessors:

red_team_activity = store_as(Activity, instance="red_team")
blue_team_activity = store_as(Activity, instance="blue_team")
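
Each accessor then reads and writes its own namespaced instance of Activity. For example:

red_team_activity.tries += 1
blue_team_activity.active = True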

Agent Limits

The Inspect limits system enables you to set a variety of limits on execution including tokens consumed, messages used in conversations, clock time, and working time (clock time minus time taken retrying in response to rate limits or waiting on other shared resources).

Limits are often applied at the sample level or using a context manager. It is also possible to specify limits when executing an agent using any of the techniques described above.
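
For example, a token limit can be applied to a block of code using the with statement (a sketch based on the scoped limits API):

from inspect_ai.util import token_limit

with token_limit(1024*500):
    ...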

To run an agent with one or more limits, pass the limit object in the limits argument to a function like handoff(), as_tool(), as_solver() or run() (see Using Agents for details on the various ways to run agents).

Here we limit an agent we are including as a solver to 500K tokens:

eval(
    task="research_bench", 
    solver=as_solver(web_surfer(), limits=[token_limit(1024*500)])
)

Here we limit an agent handoff() to 500K tokens:

eval(
    task="research_bench", 
    solver=[
        use_tools(
            addition(),
            handoff(web_surfer(), limits=[token_limit(1024*500)]),
        ),
        generate()
    ]
)

Limit Exceeded

Note that when limits are exceeded during an agent’s execution, the way this is handled differs depending on how the agent was executed:

  • For agents used via as_solver(), if a limit is exceeded then the sample will terminate (this is exactly how sample-level limits work).

  • For agents that are run() directly with limits, their limit exceptions will be caught and returned in a tuple. Limits other than the ones passed to run() will propagate up the stack.

    from inspect_ai.agent import run
    from inspect_ai.util import token_limit
    
    state, limit_error = await run(
        agent=web_surfer(),
        input="What were the 3 most popular movies of 2020?",
        limits=[token_limit(1024*500)]
    )
    if limit_error:
        ...
  • For tool based agents (handoff() and as_tool()), if a limit is exceeded then a message to that effect is returned to the model but the sample continues running.

Parameters

The web_surfer() agent used in the example above doesn’t take any parameters; however, like tools, agents can accept arbitrary parameters.

For example, here is a critic agent that asks a model to contribute to a conversation by critiquing its previous output. There are two types of parameters demonstrated:

  1. Parameters that configure the agent globally (here, the critic model).

  2. Parameters passed by the supervisor agent (in this case the count of critiques to provide):

from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessageSystem, Model, get_model

@agent
def critic(model: str | Model | None = None) -> Agent:
    
    async def execute(state: AgentState, count: int = 3) -> AgentState:
        """Provide critiques of previous messages in a conversation.
        
        Args:
           state: Agent state
           count: Number of critiques to provide (defaults to 3)
        """
        state.messages.append(
            ChatMessageSystem(
                content=f"Provide {count} critiques of the conversation."
            )
        )
        state.output = await get_model(model).generate(state.messages)
        state.messages.append(state.output.message)
        return state
        
    return execute

You might use this in a multi-agent system as follows:

supervisor = react(
    ...,
    tools=[
        addition(), 
        handoff(web_surfer()), 
        handoff(critic(model="openai/gpt-4o-mini"))
    ]
)

When the supervisor agent decides to hand off to the critic() it will decide how many critiques to request and pass that in the count parameter (or alternatively just accept the default count of 3).

Currying

Note that when you use an agent as a solver there isn’t a mechanism for specifying parameters dynamically during the solver chain. In this case the default value for count will be used:

solver = [
    system_message(...),
    generate(),
    critic(),
    generate()
]

If you need to pass parameters explicitly to the agent execute function, you can curry them using the as_solver() function:

solver = [
    system_message(...),
    generate(),
    as_solver(critic(), count=5),
    generate()
]

Transcripts

Transcripts provide a rich per-sample sequential view of everything that occurs during plan execution and scoring, including:

  • Model interactions (including the raw API call made to the provider).
  • Tool calls (including a sub-transcript of activity within the tool).
  • Changes (in JSON Patch format) to the TaskState for the Sample.
  • Scoring (including a sub-transcript of interactions within the scorer).
  • Custom info() messages inserted explicitly into the transcript.
  • Python logger calls (info level or designated custom log-level).

This information is provided within the Inspect log viewer in the Transcript tab (which sits alongside the Messages, Scoring, and Metadata tabs in the per-sample display).

Custom Info

You can insert custom entries into the transcript via the Transcript info() method (which creates an InfoEvent). Access the transcript for the current sample using the transcript() function, for example:

from inspect_ai.log import transcript

transcript().info("here is some custom info")

Strings passed to info() will be rendered as markdown. In addition to strings you can also pass arbitrary JSON serialisable objects to info().
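
For example, passing a dict records structured data in the transcript:

transcript().info({"step": "planning", "pages_visited": 3})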

Grouping with Spans

You can create arbitrary groupings of transcript activity using the span() context manager. For example:

from inspect_ai.util import span

async with span("planning"):
    ...

There are two reasons that you might want to create spans:

  1. Any changes to the store which occur during a span will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred.
  2. The Inspect log viewer will create a visual delineation for the span, which will make it easier to see the flow of activity within the transcript.

Spans are automatically created for sample initialisation, solvers, scorers, subtasks, tool calls, and agent execution.

Parallelism

You can execute subtasks in parallel using the collect() function. For example, to run 3 web_search() coroutines in parallel:

from inspect_ai.util import collect

results = await collect(
  web_search(keywords="solar power"),
  web_search(keywords="wind power"),
  web_search(keywords="hydro power"),
)

Note that collect() is similar to asyncio.gather(), but also works when Trio is the Inspect async backend.

The Inspect collect() function also automatically includes each task in a span(), which ensures that its events are grouped together in the transcript.

Using collect() in preference to asyncio.gather() is highly recommended for both Trio compatibility and more legible transcript output.

Background Work

Note

The background() function is available only in the development version of Inspect. To install the development version from GitHub:

pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

The background() function enables you to execute an async task in the background of the current sample. The task terminates when the sample terminates. For example:

import anyio
from inspect_ai.util import background

async def worker():
    try:
        while True:
            # background work
            await anyio.sleep(1.0)
    finally:
        # cleanup
        ...

background(worker)

The above code demonstrates a couple of important characteristics of a sample background worker:

  1. Background workers typically operate in a loop, often polling a sandbox or other endpoint for activity (see the sketch below). In a loop like this it’s important to sleep at regular intervals so your background work doesn’t monopolise CPU resources.

  2. When the sample ends, background workers are cancelled (which results in a cancelled error being raised in the worker). Therefore, if you need to do cleanup in your worker it should occur in a finally block.
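
For example, here is a sketch of a worker that polls a file in the sandbox (the log path and processing step are hypothetical):

import anyio
from inspect_ai.util import background, sandbox

async def log_poller():
    try:
        while True:
            # read the (hypothetical) agent log from the sandbox
            log = await sandbox().read_file("/var/log/agent.log")
            ...  # process any new log content
            await anyio.sleep(5.0)
    finally:
        ...  # cleanup

background(log_poller)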

Sandbox Service

Note

The sandbox_service() function is available only in the development version of Inspect. To install the development version from GitHub:

pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

Sandbox services make available a set of methods to a sandbox for calling back into the main Inspect process. For example, the Human Agent uses a sandbox service to enable the human agent to start, stop, score, and submit tasks.

Sandbox services are often run using the background() function to make them available for the lifetime of a sample.

For example, here’s a simple calculator service that provides add and subtract methods to Python code within a sandbox:

from inspect_ai.util import background, sandbox, sandbox_service

async def calculator_service():
    async def add(x: int, y: int) -> int:
        return x + y

    async def subtract(x: int, y: int) -> int:
        return x - y

    await sandbox_service(
        name="calculator",
        methods=[add, subtract],
        until=lambda: True,
        sandbox=sandbox()
    )

background(calculator_service)

To use the service from within a sandbox, either add it to sys.path or use importlib. For example, if the service is named ‘calculator’:

import sys
sys.path.append("/var/tmp/sandbox-services/calculator")
import calculator

Or:

import importlib.util
spec = importlib.util.spec_from_file_location(
    "calculator", 
    "/var/tmp/sandbox-services/calculator/calculator.py"
)
calculator = importlib.util.module_from_spec(spec)
spec.loader.exec_module(calculator)