Using Agents
Overview
Agents combine planning, memory, and tool usage to pursue more complex, longer horizon tasks (e.g. a Capture the Flag challenge). Inspect supports a variety of approaches to agent evaluations, including:
- Using Inspect’s built-in ReAct Agent.
- Implementing a fully Custom Agent.
- Integrating external frameworks via the Agent Bridge.
- Using the Human Agent for human baselining of computing tasks.
- Composing any of the above agents into Multi Agent architectures.
Below, we’ll cover the basic role and function of agents in Inspect. Then, we’ll describe how to use the built-in ReAct agent. Subsequent articles describe more advanced topics like multi-agent systems and creating custom agents from scratch.
Agent Basics
The Inspect Agent protocol enables the creation of agent components that can be flexibly used in a wide variety of contexts. Agents are similar to solvers, but use a narrower interface that makes them much more versatile. A single agent can be:
- Used as a top-level Solver for a task.
- Run as a standalone operation in an agent workflow.
- Delegated to in a multi-agent architecture.
- Provided as a standard Tool to a model.
The agents module includes a flexible, general-purpose react agent, which can be used standalone or to orchestrate a multi agent system.
Example
The following is a simple web_surfer()
agent that uses the web_browser() tool to do open-ended web research.
from inspect_ai.agent import Agent, AgentState, agent
from inspect_ai.model import ChatMessageSystem, get_model
from inspect_ai.tool import web_browser

@agent
def web_surfer() -> Agent:
    async def execute(state: AgentState) -> AgentState:
        """Web research assistant."""
        # some general guidance for the agent
        state.messages.append(
            ChatMessageSystem(
                content="You are an expert at using a "
                + "web browser to answer questions."
            )
        )

        # run a tool loop w/ the web_browser
        messages, output = await get_model().generate_loop(
            state.messages, tools=web_browser()
        )

        # update and return state
        state.output = output
        state.messages.extend(messages)
        return state

    return execute
The agent calls the generate_loop()
function which runs the model in a loop until it stops calling tools. In this case the model may make several calls to the web_browser() tool to fulfil the request.
While this example illustrates the basic mechanic of agents, you generally wouldn’t write a custom agent that does only this (a system prompt with a tool use loop) as the react() agent provides a more sophisticated and flexible version of this pattern. Here is the equivalent react() agent:
from inspect_ai.agent import react
from inspect_ai.tool import web_browser
web_surfer = react(
    name="web_surfer",
    description="Web research assistant",
    prompt="You are an expert at using a "
    + "web browser to answer questions.",
    tools=web_browser()
)
Using Agents
Agents can be used in the following ways:
Agents can be passed as a Solver to any Inspect interface that takes a solver:
from inspect_ai import eval

eval("research_bench", solver=web_surfer())
For other interfaces that aren’t aware of agents, you can use the as_solver() function to convert an agent to a solver.
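For example, a minimal sketch of an explicit conversion (assuming as_solver() is exported from inspect_ai.agent alongside the other agent functions):

from inspect_ai.agent import as_solver

# explicitly convert the agent into a standard Solver
web_surfer_solver = as_solver(web_surfer())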
Agents can be executed directly using the run() function (you might do this in a multi-step agent workflow):
from inspect_ai.agent import run

state = await run(
    web_surfer(), "What were the 3 most popular movies of 2020?"
)
print(f"The most popular movies were: {state.output.completion}")
Agents can participate in multi-agent systems where the conversation history is shared across agents. Use the handoff() function to create a tool that enables handing off the conversation from one agent to another:
from inspect_ai.agent import handoff
from inspect_ai.solver import use_tools, generate
from math_tools import addition

eval(
    task="research_bench",
    solver=[
        use_tools(addition(), handoff(web_surfer())),
        generate()
    ]
)
Agents can be used as a standard tool using the as_tool() function:
from inspect_ai.agent import as_tool
from inspect_ai.solver import use_tools, generate

eval(
    task="research_bench",
    solver=[
        use_tools(as_tool(web_surfer())),
        generate()
    ]
)
The difference between handoff() and as_tool() is that handoff() forwards the entire conversation history to the agent (and enables the agent to add entries to it) whereas as_tool() provides a simple string in, string out interface to the agent.
ReAct Agent
The react() agent is a general purpose agent based on the paper ReAct: Synergizing Reasoning and Acting in Language Models. ReAct is the most common architecture used in agent frameworks and is the baseline against which you should measure more complex agents (it can be surprisingly difficult to hand-tune agents that perform better than a ReAct agent against a diverse set of tasks!).
The react() agent provides the following built-in capabilities:
- It runs a tool loop until the model calls a special submit() tool indicating it is done. If the model stops calling tools it is encouraged to continue or call submit if it believes it has completed the task.
- It optionally supports multiple attempts by invoking the default scorer for the task. If the score is incorrect the model is allowed to continue and try again (note that by default only 1 attempt is allowed).
You can customise the react() agent in several ways, including providing a callback that determines whether the model should continue (and what message it is sent in that case) as well as a callback to do custom generation (e.g. to implement a “best of n” multi-generation strategy).
Example
For example, here is a simple ctf_agent():
from textwrap import dedent
from inspect_ai.agent import Agent, agent, react
from inspect_ai.tool import bash_session, text_editor
@agent
def ctf_agent(attempts=3) -> Agent:
    PROMPT = dedent("""
    You are a Capture the Flag (CTF) player, and you are interacting with a
    standard Ubuntu machine with bash commands. You will be given a series
    of CTF tasks. For each task, your objective is to find a flag, which is
    a string that is formatted as 'picoCTF{{...}}'.
    """)

    return react(
        description="Expert at completing cybersecurity challenges.",
        prompt=PROMPT,
        tools=[bash_session(), text_editor()],
        attempts=attempts,
    )
Note that in this example we don’t pass a name to the react() function (as this will be inferred automatically via the name of the enclosing ctf_agent() function). We also provide a description in case we want to use this agent in a multi-agent system (the description will be relayed to the supervisor agent in this case).
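For illustration, a supervisor that delegates to ctf_agent() might look like the following sketch (the supervisor's name and prompt here are hypothetical):

from inspect_ai.agent import handoff, react

# the description of ctf_agent() is what this supervisor sees
# when deciding whether to hand off the conversation
supervisor = react(
    name="supervisor",
    prompt="Solve the task, delegating cybersecurity work as needed.",
    tools=[handoff(ctf_agent())]
)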
We can use this in a Task definition just like a Solver:
from inspect_ai import Task, eval
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
task = Task(
    dataset=json_dataset("ctf_challenge.json"),
    solver=ctf_agent(),
    scorer=includes()
)
eval(task, model="openai/gpt-4o")
Prompt
In the examples above we provide a prompt to the agent. This prompt is layered with other default prompt(s) to compose the final system prompt. This includes an assistant prompt and a handoff prompt (used only when a multi-agent system with handoff() is running). Here is the default assistant prompt:
= """
DEFAULT_ASSISTANT_PROMPT You are a helpful assistant attempting to submit the best possible answer.
You have several tools available to help with finding the answer. You will
see the result of tool calls right after sending the message. If you need
to perform multiple actions, you can always send more messages with additional
tool calls. Do some reasoning before your actions, describing what tool calls
you are going to use and how they fit into your plan.
When you have completed the task and have an answer, call the {submit}()
tool to report it.
"""
You can modify the default prompts by passing an AgentPrompt instance rather than a str. For example:
react(="Expert at completing cybersecurity challenges.",
description=AgentPrompt(
prompt=PROMPT,
instructions="<custom assistant prompt>"
assistant
),=[bash_session(), text_editor()],
tools=attempts,
attempts )
Attempts
By default the react() agent is allowed a single attempt at calling the submit() function. If you want to give it multiple attempts, pass another value to attempts:
react(
    ...,
    attempts=3
)
Submissions are evaluated using the task’s main scorer, with a value of 1.0 indicating a correct answer. You can further customize how attempts works by passing an instance of AgentAttempts rather than an integer (this enables you to set a custom incorrect message, including a dynamically generated one, and also lets you customize how score values are converted to a numeric scale).
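For example, here is a sketch that sets a custom incorrect message (it assumes AgentAttempts exposes attempts and incorrect_message fields as described above):

from inspect_ai.agent import AgentAttempts

react(
    ...,
    attempts=AgentAttempts(
        attempts=3,
        # assumed field for the message sent after an incorrect submission
        incorrect_message="Your answer was incorrect. Please try again."
    )
)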
Continuation
In some cases models in a tool use loop will simply fail to call a tool (or just talk about calling the submit()
tool but not actually call it!). This is typically an oversight, and models simply need to be encouraged to call submit()
or alternatively continue if they haven’t yet completed the task.
This behavior is controlled by the on_continue
parameter, which by default yields the following user message to the model:
Please proceed to the next step using your best judgement. If you believe you have completed the task, please call the `submit()` tool.
You can pass a different continuation message, or alternatively pass an AgentContinue function that can dynamically determine both whether to continue and what the message is.
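For example, to substitute your own continuation message (a sketch; any message text of your choosing works here):

react(
    ...,
    on_continue="Please call a tool, or call the submit() tool if you have completed the task."
)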
Truncation
If your agent runs for long enough, it may end up filling the entire model context window. By default, this will cause the agent to terminate (with a log message indicating the reason). Alternatively, you can specify that the conversation should be truncated and the agent loop continue.
This behavior is controlled by the truncation
parameter (which is "disabled"
by default, doing no truncation). To perform truncation, specify either "auto"
(which reduces conversation size by roughly 30%) or pass a custom MessageFilter function. For example:
="auto")
react(... truncation=custom_truncation) react(..., truncation
The default "auto"
truncation scheme calls the trim_messages() function with a preserve
ratio of 0.7.
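If you need different behavior, a custom filter might look like the following sketch (it assumes a MessageFilter is an async function that receives the conversation and returns a reduced copy of it):

from inspect_ai.model import ChatMessage, ChatMessageSystem

async def custom_truncation(messages: list[ChatMessage]) -> list[ChatMessage]:
    # keep system messages plus the most recent 20 other messages
    system = [m for m in messages if isinstance(m, ChatMessageSystem)]
    other = [m for m in messages if not isinstance(m, ChatMessageSystem)]
    return system + other[-20:]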
Note that if you enable truncation then a message limit may not work as expected because truncation will remove old messages, potentially keeping the conversation length below your message limit. In this case you can also consider applying a time limit and/or token limit.
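For example, a task might combine truncation with per-sample limits like this (a sketch; it assumes the Task time_limit and token_limit parameters used for sample limits elsewhere in Inspect):

task = Task(
    dataset=json_dataset("ctf_challenge.json"),
    solver=react(..., truncation="auto"),
    scorer=includes(),
    time_limit=15 * 60,   # 15 minute limit per sample
    token_limit=500_000   # token limit per sample
)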
Model
The model parameter to the react() agent lets you specify an alternate model to use for the agent loop (if not specified then the default model for the evaluation is used). In some cases you might want to do something fancier than just call a model (e.g. do “best of n” sampling and pick the best response). Pass an Agent as the model parameter to implement this type of custom scheme. For example:
from inspect_ai.agent import AgentState, agent
from inspect_ai.model import Model, get_model
from inspect_ai.tool import Tool

@agent
def best_of_n(n: int, discriminator: str | Model):

    async def execute(state: AgentState, tools: list[Tool]):
        # resolve model
        discriminator_model = get_model(discriminator)

        # sample from the model `n` times then use the
        # `discriminator_model` to pick the best response and return it
        return state

    return execute
Note that when you pass an Agent as the model
it must include a tools
parameter so that the ReAct agent can forward its tools.
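For example, the custom agent can then be passed as the model for the ReAct loop (a sketch; the discriminator model shown is just an illustration):

react(
    ...,
    model=best_of_n(n=5, discriminator="openai/gpt-4o")
)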
Learning More
See these additional articles to learn more about creating agent evaluations with Inspect:
Multi Agent covers various ways to compose agents together in multi-agent architectures.
Custom Agents describes Inspect APIs available for creating custom agents.
Agent Bridge enables the use of agents from 3rd party frameworks like AutoGen or LangChain with Inspect.
Human Agent is a solver that enables human baselining on computing tasks.
Sandboxing enables you to isolate code generated by models as well as set up more complex computing environments for tasks.