Custom Agents
Overview
Inspect agents bear some similarity to solvers in that they are functions that accept and return a state. However, agent state is intentionally much narrower: it consists of only the conversation history (messages) and the last model generation (output). This in turn enables agents to be used more flexibly: they can be employed as solvers, tools, participants in a workflow, or delegates in multi-agent systems.
Below we'll cover the core Agent protocol, implementing a simple tool use loop, and related APIs for agent memory and observability.
Protocol
An Agent is a function that takes and returns an AgentState. Agent state includes two fields:
| Field | Type | Description |
|---|---|---|
| messages | list of ChatMessage | Conversation history. |
| output | ModelOutput | Last model output. |
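As a minimal sketch of the protocol, an agent is simply an async function that reads and updates these two fields (the generate_once() name here is illustrative, not part of the Inspect API):

from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import get_model

@agent
def generate_once() -> Agent:
    async def execute(state: AgentState) -> AgentState:
        # call the model with the current conversation history
        state.output = await get_model().generate(state.messages)
        # record the assistant reply in the conversation history
        state.messages.append(state.output.message)
        return state

    return execute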
Example
Here's a simple example that implements a web_surfer() agent that uses the web_browser() tool to do open-ended web research:
from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessageSystem, get_model
from inspect_ai.tool import web_browser

@agent
def web_surfer() -> Agent:
    async def execute(state: AgentState) -> AgentState:
        """Web research assistant."""

        # some general guidance for the agent
        state.messages.append(
            ChatMessageSystem(
                content="You are a tenacious web researcher that is "
                + "expert at using a web browser to answer questions."
            )
        )

        # run a tool loop w/ the web_browser then update & return state
        messages, state.output = await get_model().generate_loop(
            state.messages, tools=web_browser()
        )
        state.messages.extend(messages)
        return state

    return execute
The agent calls the generate_loop() function, which runs the model in a loop until it stops calling tools. In this case the model may make several calls to the web_browser() tool to fulfil the request.
While this example illustrates the basic mechanic of agents, you generally wouldn’t write an agent that does only this (a system prompt with a tool use loop) as the react() agent provides a more sophisticated and flexible version of this pattern.
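For instance, a rough equivalent of the agent above expressed with react() might look like this (a sketch only; consult the react() reference for the full set of options):

from inspect_ai.agent import react
from inspect_ai.tool import web_browser

web_surfer = react(
    name="web_surfer",
    description="Web research assistant",
    prompt="You are a tenacious web researcher that is "
    + "expert at using a web browser to answer questions.",
    tools=[web_browser()],
)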
Tool Loop
Agents often run a tool use loop, and one of the more common reasons for creating a custom agent is to tailor the behaviour of the loop. Here is an agent loop that has a core similar to the built-in react() agent:
from typing import Sequence

from inspect_ai.agent import AgentState, agent
from inspect_ai.model import execute_tools, get_model
from inspect_ai.tool import (
    Tool, ToolDef, ToolSource, mcp_connection
)

@agent
def my_agent(tools: Sequence[Tool | ToolDef | ToolSource]):
    async def execute(state: AgentState):

        # establish MCP server connections required by tools
        async with mcp_connection(tools):

            while True:
                # call model and append to messages
                state.output = await get_model().generate(
                    input=state.messages,
                    tools=tools,
                )
                state.messages.append(state.output.message)

                # make tool calls or terminate if there are none
                if state.output.message.tool_calls:
                    messages, state.output = await execute_tools(
                        state.messages, tools
                    )
                    state.messages.extend(messages)
                else:
                    break

            return state

    return execute
1. Enable passing tools to the agent using a variety of types (including ToolSource, which enables use of tools from Model Context Protocol (MCP) servers).
2. Establish any required connections to MCP servers (this isn't required, but will improve performance by re-using connections across tool calls).
3. Standard LLM inference step, yielding an assistant message which we append to our message history.
4. Execute tool calls. Note that this may update output and/or result in multiple additional messages being appended in the case that one of the tools is a handoff() to a sub-agent.
The above represents a minimal tool use loop; your custom agents may diverge from it in various ways. For example, you might want to:

- Add another termination condition for the output satisfying some criteria.
- Add a critique / reflection step between tool calling and generate.
- Urge the model to keep going after it decides to stop calling tools (see the sketch following the next example).
- Handle context window overflow (stop_reason == "model_length") by truncating or summarising the messages.
- Examine and possibly filter the tool calls before invoking execute_tools().
For example, you might implement automatic context window truncation in response to context window overflow:
# check for context window overflow ('overflow' here is assumed to be
# an option on your agent that enables truncation)
if state.output.stop_reason == "model_length":
    if overflow is not None:
        state.messages = trim_messages(state.messages)
        continue
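Similarly, here is a sketch of urging the model to keep going when it stops calling tools, expressed as a variation on the loop above (the continuation prompt is illustrative; you would pair this with a termination condition such as a submit tool or a message limit so the loop can still end):

from inspect_ai.model import ChatMessageUser

# make tool calls, or nudge the model to continue if there are none
if state.output.message.tool_calls:
    messages, state.output = await execute_tools(state.messages, tools)
    state.messages.extend(messages)
else:
    state.messages.append(
        ChatMessageUser(content="Please continue if the task is not yet complete.")
    )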
Note that the standard react() agent provides some of these agent loop enhancements (urging the model to continue and handling context window overflow).
Sample Store
In some cases agents will want to retain state across multiple invocations, or even share state with other agents or tools. This can be accomplished in Inspect using the Store, which provides a sample-scoped scratchpad for arbitrary values.
Typed Store
When developing agents, you should use the typed interface to the per-sample store, which provides both type-checking and namespacing for store access.
For example, here we define a typed accessor to the store by deriving from the StoreModel class (which in turn derives from Pydantic BaseModel):
from pydantic import Field

from inspect_ai.util import StoreModel

class Activity(StoreModel):
    active: bool = Field(default=False)
    tries: int = Field(default=0)
    actions: list[str] = Field(default_factory=list)
We can then get access to a sample scoped instance of the store for use in agents using the store_as() function:
from inspect_ai.util import store_as

activity = store_as(Activity)
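Reads and writes against this accessor use the fields defined on the Activity model above and are stored under its namespace (the values here are illustrative):

# update typed fields on the sample-scoped store
activity.active = True
activity.tries += 1
activity.actions.append("searched for existing research")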
Agent Instances
If you want an agent to have a store-per-instance by default, add an instance parameter to your @agent function and default it to uuid(). Then, forward the instance on to store_as() as well as any tools you call that are also stateful (e.g. web_browser()). For example:
from pydantic import Field
from shortuuid import uuid

from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessage, get_model
from inspect_ai.tool import web_browser
from inspect_ai.util import StoreModel, store_as

class WebSurferState(StoreModel):
    messages: list[ChatMessage] = Field(default_factory=list)

@agent
def web_surfer(instance: str | None = uuid()) -> Agent:
    async def execute(state: AgentState) -> AgentState:

        # get state for this instance
        surfer_state = store_as(WebSurferState, instance=instance)

        ...

        # pass the instance on to web_browser
        messages, state.output = await get_model().generate_loop(
            state.messages, tools=web_browser(instance=instance)
        )
        state.messages.extend(messages)
        return state

    return execute
This enables you to have multiple instances of the web_surfer() agent, each with their own state and web browser.
Named Instances
It's also possible that you'll want to create various named store instances that are shared across agents (e.g. each participant in a game might need their own store). Use the instance parameter of store_as() to explicitly create scoped store accessors:
red_team_activity = store_as(Activity, instance="red_team")
blue_team_activity = store_as(Activity, instance="blue_team")
Agent Limits
The Inspect limits system enables you to set a variety of limits on execution, including tokens consumed, messages used in conversations, clock time, and working time (clock time minus time taken retrying in response to rate limits or waiting on other shared resources).
Limits are often applied at the sample level or using a context manager. It is also possible to specify limits when executing an agent using any of the techniques described above.
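For example, limits of several of these kinds can be constructed with the corresponding limit functions (a sketch; the values are illustrative):

from inspect_ai.util import message_limit, time_limit, token_limit

limits = [
    token_limit(1024 * 500),  # total tokens consumed
    message_limit(100),       # messages in the conversation
    time_limit(15 * 60),      # clock time (seconds)
]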
To run an agent with one or more limits, pass the limit object in the limits argument to a function like handoff(), as_tool(), as_solver() or run() (see Using Agents for details on the various ways to run agents).
Here we limit an agent we are including as a solver to 500K tokens:
eval(
    task="research_bench",
    solver=as_solver(web_surfer(), limits=[token_limit(1024*500)])
)
Here we limit an agent handoff() to 500K tokens:
eval(
    task="research_bench",
    solver=[
        use_tools(
            addition(),
            handoff(web_surfer(), limits=[token_limit(1024*500)]),
        ),
        generate()
    ]
)
Limit Exceeded
Note that when limits are exceeded during an agent’s execution, the way this is handled differs depending on how the agent was executed:
For agents used via as_solver(), if a limit is exceeded then the sample will terminate (this is exactly how sample-level limits work).
For agents that are run() directly with limits, their limit exceptions will be caught and returned in a tuple. Limits other than the ones passed to run() will propagate up the stack.
from inspect_ai.agent import run

state, limit_error = await run(
    agent=web_surfer(),
    input="What were the 3 most popular movies of 2020?",
    limits=[token_limit(1024*500)]
)
if limit_error:
    ...
For tool based agents (handoff() and as_tool()), if a limit is exceeded then a message to that effect is returned to the model but the sample continues running.
Parameters
The web_surfer agent used in the example above doesn't take any parameters; however, like tools, agents can accept arbitrary parameters.
For example, here is a critic agent that asks a model to contribute to a conversation by critiquing its previous output. There are two types of parameters demonstrated:

1. Parameters that configure the agent globally (here, the critic model).
2. Parameters passed by the supervisor agent (in this case the count of critiques to provide):
from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessageSystem, Model, get_model

@agent
def critic(model: str | Model | None = None) -> Agent:
    async def execute(state: AgentState, count: int = 3) -> AgentState:
        """Provide critiques of previous messages in a conversation.

        Args:
            state: Agent state
            count: Number of critiques to provide (defaults to 3)
        """
        state.messages.append(
            ChatMessageSystem(
                content=f"Provide {count} critiques of the conversation."
            )
        )
        state.output = await get_model(model).generate(state.messages)
        state.messages.append(state.output.message)
        return state

    return execute
You might use this in a multi-agent system as follows:
supervisor = react(
    ...,
    tools=[
        addition(),
        handoff(web_surfer()),
        handoff(critic(model="openai/gpt-4o-mini"))
    ]
)
When the supervisor agent decides to hand off to the critic() it will decide how many critiques to request and pass that in the count parameter (or alternatively just accept the default count of 3).
Currying
Note that when you use an agent as a solver there isn't a mechanism for specifying parameters dynamically during the solver chain. In this case the default value for count will be used:
solver = [
    system_message(...),
    generate(),
    critic(),
    generate()
]
If you need to pass parameters explicitly to the agent execute function, you can curry them using the as_solver() function:
solver = [
    system_message(...),
    generate(),
    as_solver(critic(), count=5),
    generate()
]
Transcripts
Transcripts provide a rich per-sample sequential view of everything that occurs during plan execution and scoring, including:
- Model interactions (including the raw API call made to the provider).
- Tool calls (including a sub-transcript of activity within the tool).
- Changes (in JSON Patch format) to the TaskState for the Sample.
- Scoring (including a sub-transcript of interactions within the scorer).
- Custom info() messages inserted explicitly into the transcript.
- Python logger calls (info level or designated custom log-level).
This information is provided within the Inspect log viewer in the Transcript tab (which sits alongside the Messages, Scoring, and Metadata tabs in the per-sample display).
Custom Info
You can insert custom entries into the transcript via the Transcript info() method (which creates an InfoEvent). Access the transcript for the current sample using the transcript() function, for example:
from inspect_ai.log import transcript

transcript().info("here is some custom info")
Strings passed to info() will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON serialisable objects to info().
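For example (the keys and values here are illustrative):

transcript().info({"phase": "planning", "queries_issued": 3})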
Grouping with Spans
You can create arbitrary groupings of transcript activity using the span() context manager. For example:
from inspect_ai.util import span

async with span("planning"):
    ...
There are two reasons that you might want to create spans:
- Any changes to the store which occur during a span will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred.
- The Inspect log viewer will create a visual delineation for the span, which will make it easier to see the flow of activity within the transcript.
Spans are automatically created for sample initialisation, solvers, scorers, subtasks, tool calls, and agent execution.
Parallelism
You can execute subtasks in parallel using the collect() function. For example, to run 3 web_search() coroutines in parallel:
from inspect_ai.util import collect

results = await collect(
    web_search(keywords="solar power"),
    web_search(keywords="wind power"),
    web_search(keywords="hydro power"),
)
Note that collect() is similar to asyncio.gather(), but also works when Trio is the Inspect async backend.

The Inspect collect() function also automatically includes each task in a span(), which ensures that its events are grouped together in the transcript.

Using collect() in preference to asyncio.gather() is highly recommended for both Trio compatibility and more legible transcript output.
Background Work
The background() function is available only in the development version of Inspect. To install the development version from GitHub:
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
The background() function enables you to execute an async task in the background of the current sample. The task terminates when the sample terminates. For example:
import anyio

from inspect_ai.util import background

async def worker():
    try:
        while True:
            # background work
            ...
            await anyio.sleep(1.0)
    finally:
        # cleanup
        ...

background(worker)
The above code demonstrates a couple of important characteristics of a sample background worker:

- Background workers typically operate in a loop, often polling a sandbox or other endpoint for activity. In a loop like this it's important to sleep at regular intervals so your background work doesn't monopolise CPU resources.
- When the sample ends, background workers are cancelled (which results in a cancelled error being raised in the worker). Therefore, if you need to do cleanup in your worker it should occur in a finally block.
Sandbox Service
The sandbox_service() function is available only in the development version of Inspect. To install the development version from GitHub:
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
Sandbox services make available a set of methods to a sandbox for calling back into the main Inspect process. For example, the Human Agent uses a sandbox service to enable the human agent to start, stop, score, and submit tasks.
Sandbox services are often run using the background() function to make them available for the lifetime of a sample.
For example, here’s a simple calculator service that provides add and subtract methods to Python code within a sandbox:
from inspect_ai.util import background, sandbox, sandbox_service

async def calculator_service():
    async def add(x: int, y: int) -> int:
        return x + y

    async def subtract(x: int, y: int) -> int:
        return x - y

    await sandbox_service(
        name="calculator",
        methods=[add, subtract],
        until=lambda: True,
        sandbox=sandbox()
    )

background(calculator_service)
To use the service from within a sandbox, either add it to the sys path or use importlib. For example, if the service is named ‘calculator’:
import sys

sys.path.append("/var/tmp/sandbox-services/calculator")
import calculator
Or:
import importlib.util

spec = importlib.util.spec_from_file_location(
    "calculator",
    "/var/tmp/sandbox-services/calculator/calculator.py"
)
calculator = importlib.util.module_from_spec(spec)
spec.loader.exec_module(calculator)