Standard Tools

Overview

Inspect has several standard tools built-in, including:

  • Bash and Python for executing arbitrary shell and Python code.

  • Bash Session for creating a stateful bash shell that retains its state across calls from the model.

  • Text Editor which enables viewing, creating and editing text files.

  • Web Browser, which provides the model with a headless Chromium web browser that supports navigation, history, and mouse/keyboard interactions.

  • Computer, which provides the model with a desktop computer (viewed through screenshots) that supports mouse and keyboard interaction.

  • Web Search, which uses the Google Search API to execute and summarise web searches.

  • Think, which gives models the ability to include an additional thinking step on the way to their final answer.

Bash and Python

The bash() and python() tools enable execution of arbitrary shell commands and Python code, respectively. These tools require the use of a Sandbox Environment for the execution of untrusted code. For example, here is how you might use them in an evaluation where the model is asked to write code in order to solve capture the flag (CTF) challenges:

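from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash, python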

CMD_TIMEOUT = 180

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([
                bash(CMD_TIMEOUT), 
                python(CMD_TIMEOUT)
            ]),
            generate(),
        ],
        scorer=includes(),
        message_limit=30,
        sandbox="docker",
    )

We specify a 3-minute timeout for execution of the bash and python tools to ensure that they don’t perform extremely long running operations.
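To illustrate what a command timeout guards against, here is a minimal sketch that uses only the Python standard library. It is not how Inspect implements its tools; the run_with_timeout() helper and its error message are hypothetical and exist only for this illustration.

```python
import subprocess
import sys


def run_with_timeout(code: str, timeout: float) -> str:
    """Run Python code in a child process, giving up after `timeout` seconds."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising, so nothing is left
        # running; we report the failure instead of hanging.
        return f"error: command timed out after {timeout} seconds"
    return result.stdout.strip()


# A quick command finishes normally.
print(run_with_timeout("print(2 + 2)", timeout=30))

# A long running command is cut off once the timeout elapses.
print(run_with_timeout("import time; time.sleep(60)", timeout=1))
```

Without a timeout, the second call would block for a full minute; with one, the caller gets an error it can pass back to the model.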

See the Agents section for more details on how to build evaluations that allow models to take arbitrary actions over a longer time horizon.

Bash Session

The bash_session() tool provides a bash shell that retains its state across calls from the model (as distinct from the bash() tool which executes each command in a fresh session). The prompt, working directory, and environment variables are all retained across calls. The tool also supports a restart action that enables the model to reset its state and work in a fresh session.

Note that a separate bash process is created within the sandbox for each instance of the bash session tool. See the bash_session() reference docs for details on customizing this behavior.
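The difference between a fresh shell per call and a shell that retains state can be sketched with the standard library alone. This is an illustration of the concept under simplifying assumptions (a POSIX sh on the PATH, commands whose output ends with a newline), not the bash_session() implementation; ShellSession, run_fresh(), and SENTINEL are hypothetical names.

```python
import subprocess

SENTINEL = "__COMMAND_DONE__"  # hypothetical marker for end of output


def run_fresh(command: str) -> str:
    """Run a command in a brand-new shell (no state carried over)."""
    result = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, timeout=10
    )
    return result.stdout.strip()


class ShellSession:
    """A long-lived shell whose variables and directory survive between calls."""

    def __init__(self) -> None:
        self._start()

    def _start(self) -> None:
        self.proc = subprocess.Popen(
            ["sh"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )

    def run(self, command: str) -> str:
        # Echo a sentinel after the command so we know where its output ends.
        self.proc.stdin.write(f"{command}\necho {SENTINEL}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line or line.strip() == SENTINEL:
                break
            lines.append(line.rstrip("\n"))
        return "\n".join(lines)

    def restart(self) -> None:
        """Discard all state by replacing the shell process."""
        self.close()
        self._start()

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait(timeout=10)
        self.proc.stdout.close()


session = ShellSession()
session.run("export DEMO_GREETING=hello")
print(session.run("echo $DEMO_GREETING"))           # variable is still set
print(run_fresh("echo ${DEMO_GREETING:-unset}"))    # a fresh shell never saw it
session.restart()
print(session.run("echo ${DEMO_GREETING:-unset}"))  # restart clears the state
session.close()
```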

Configuration

Bash sessions require the use of a Sandbox Environment for the execution of untrusted code. In addition, you’ll need some dependencies installed in the sandbox container. Please see Sandbox Dependencies below for additional instructions.

You should add the following to your sandbox Dockerfile in order to use this tool:

ENV PATH="$PATH:/opt/inspect_tool_support/bin"
RUN python -m venv /opt/inspect_tool_support && \
    /opt/inspect_tool_support/bin/pip install inspect-tool-support && \
    /opt/inspect_tool_support/bin/inspect-tool-support post-install

Note that Playwright (used for the web_browser() tool) does not support some versions of Linux (e.g. Kali Linux). If this is the case for your Linux distribution, you should add the --no-web-browser option to the post-install:

RUN inspect-tool-support post-install --no-web-browser

If you don’t have a custom Dockerfile, you can alternatively use the pre-built aisiuk/inspect-tool-support image:

compose.yaml
services:
  default:
    image: aisiuk/inspect-tool-support:latest
    init: true

Task Setup

A task configured to use the bash session tool might look like this:

from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash_session

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([bash_session(timeout=180)]),
            generate(),
        ],
        scorer=includes(),
        sandbox=("docker", "compose.yaml")
    )

Note that we provide a timeout for bash session commands (this is a best practice to guard against extremely long running commands).

Tool Binding

The schema for the bash_session() tool is based on the standard Anthropic bash tool type. The bash_session() tool works with all models that support tool calling, but when using Claude, it will automatically bind to the native Claude tool definition.

Text Editor

The text_editor() tool enables viewing, creating and editing text files. The tool edits files within a protected Sandbox Environment, so tasks that use the text editor should have a sandbox defined and configured as described below.

Configuration

The text editor tool requires the use of a Sandbox Environment. In addition, you’ll need some dependencies installed in the sandbox container. Please see Sandbox Dependencies below for additional instructions.

You should add the following to your sandbox Dockerfile in order to use this tool:

ENV PATH="$PATH:/opt/inspect_tool_support/bin"
RUN python -m venv /opt/inspect_tool_support && \
    /opt/inspect_tool_support/bin/pip install inspect-tool-support && \
    /opt/inspect_tool_support/bin/inspect-tool-support post-install

Note that Playwright (used for the web_browser() tool) does not support some versions of Linux (e.g. Kali Linux). If this is the case for your Linux distribution, you should add the --no-web-browser option to the post-install:

RUN inspect-tool-support post-install --no-web-browser

If you don’t have a custom Dockerfile, you can alternatively use the pre-built aisiuk/inspect-tool-support image:

compose.yaml
services:
  default:
    image: aisiuk/inspect-tool-support:latest
    init: true

Task Setup

A task configured to use the text editor tool might look like this (note that this task is also configured to use the bash_session() tool):

from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash_session, text_editor

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([
                bash_session(timeout=180),
                text_editor(timeout=180)
            ]),
            generate(),
        ],
        scorer=includes(),
        sandbox=("docker", "compose.yaml")
    )

Note that we provide a timeout for the bash session and text editor tools (this is a best practice to guard against extremely long running commands).

Tool Binding

The schema for the text_editor() tool is based on the standard Anthropic text editor tool type. The text_editor() tool works with all models that support tool calling, but when using Claude, it will automatically bind to the native Claude tool definition.

Web Browser

The web browser tools provide models with the ability to browse the web using a headless Chromium browser. Navigation, history, and mouse/keyboard interactions are all supported.

Configuration

Under the hood, the web browser is an instance of Chromium orchestrated by Playwright, and runs in a Sandbox Environment. In addition, you’ll need some dependencies installed in the sandbox container. Please see Sandbox Dependencies below for additional instructions.

Note that Playwright (used for the web_browser() tool) does not support some versions of Linux (e.g. Kali Linux).

You should add the following to your sandbox Dockerfile in order to use this tool:

ENV PATH="$PATH:/opt/inspect_tool_support/bin"
RUN python -m venv /opt/inspect_tool_support && \
    /opt/inspect_tool_support/bin/pip install inspect-tool-support && \
    /opt/inspect_tool_support/bin/inspect-tool-support post-install

If you don’t have a custom Dockerfile, you can alternatively use the pre-built aisiuk/inspect-tool-support image:

compose.yaml
services:
  default:
    image: aisiuk/inspect-tool-support:latest
    init: true

Task Setup

A task configured to use the web browser tools might look like this:

from inspect_ai import Task, task
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python, web_browser

@task
def browser_task():
    return Task(
        dataset=read_dataset(),
        solver=[
            use_tools([bash(), python()] + web_browser()),
            generate(),
        ],
        scorer=match(),
        sandbox=("docker", "compose.yaml"),
    )

Unlike some other tool functions like bash(), the web_browser() function returns a list of tools. Therefore, we concatenate it with a list of the other tools we are using in the call to use_tools().

Note that a separate web browser process is created within the sandbox for each instance of the web browser tool. See the web_browser() reference docs for details on customizing this behavior.

Browsing

If you review the transcripts of a sample with access to the web browser tool, you’ll notice that there are several distinct tools made available for control of the web browser. These tools include:

Tool                                       Description
web_browser_go(url)                        Navigate the web browser to a URL.
web_browser_click(element_id)              Click an element on the page currently displayed by the web browser.
web_browser_type(element_id, text)         Type text into an input on a web browser page.
web_browser_type_submit(element_id, text)  Type text into a form input on a web browser page and press ENTER to submit the form.
web_browser_scroll(direction)              Scroll the web browser up or down by one page.
web_browser_forward()                      Navigate the web browser forward in the browser history.
web_browser_back()                         Navigate the web browser back in the browser history.
web_browser_refresh()                      Refresh the current page of the web browser.

The return value of each of these tools is a web accessibility tree for the page, which provides a clean view of the content, links, and form fields available on the page (you can look at the accessibility tree for any web page using Chrome Developer Tools).
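To give a feel for why an accessibility-style view is easier for a model to act on than raw HTML, here is a rough sketch that reduces a page to a numbered outline of headings, inputs, buttons, and links. The output format, the OutlineParser class, and the outline() helper are invented for this illustration; the tree returned by the web browser tools is produced by the browser itself and will look different.

```python
from html.parser import HTMLParser

# Tags whose text content becomes the node label (illustrative subset).
ROLES = {"a": "link", "button": "button", "h1": "heading"}


class OutlineParser(HTMLParser):
    """Collect headings, inputs, buttons, and links into a numbered outline."""

    def __init__(self) -> None:
        super().__init__()
        self.nodes: list[str] = []
        self._role: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            # Inputs have no text content, so label them by their name.
            self._add("textbox", dict(attrs).get("name") or "")
        elif tag in ROLES:
            self._role = ROLES[tag]
            self._text = []

    def handle_data(self, data):
        if self._role is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._role is not None and ROLES.get(tag) == self._role:
            self._add(self._role, "".join(self._text).strip())
            self._role = None

    def _add(self, role: str, label: str) -> None:
        # The number plays the part of an element id the model can refer to.
        self.nodes.append(f'[{len(self.nodes) + 1}] {role} "{label}"')


def outline(html: str) -> list[str]:
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


PAGE = """
<h1>Search</h1>
<form><input name="q"><button>Go</button></form>
<a href="/about">About us</a>
"""

for node in outline(PAGE):
    print(node)
```

Each entry carries a number that a follow-up action could target, which mirrors the role of the element_id argument in the tools listed above.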

Disabling Interactions

You can use the web browser tools with page interactions disabled by specifying interactive=False, for example:

use_tools(web_browser(interactive=False))

In this mode, the interactive tools (web_browser_click(), web_browser_type(), and web_browser_type_submit()) are not made available to the model.
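The effect of this option on the tool list can be pictured with plain Python. The tool names below come from the table in the Browsing section; the available_tools() helper is hypothetical and only models the filtering described here, not the library's internals.

```python
# Tool names as listed in the Browsing section.
ALL_TOOLS = [
    "web_browser_go",
    "web_browser_click",
    "web_browser_type",
    "web_browser_type_submit",
    "web_browser_scroll",
    "web_browser_forward",
    "web_browser_back",
    "web_browser_refresh",
]

# Tools that interact with page elements.
INTERACTIVE_TOOLS = {
    "web_browser_click",
    "web_browser_type",
    "web_browser_type_submit",
}


def available_tools(interactive: bool = True) -> list[str]:
    """Return the tool names exposed to the model (illustrative only)."""
    if interactive:
        return list(ALL_TOOLS)
    return [name for name in ALL_TOOLS if name not in INTERACTIVE_TOOLS]


print(available_tools(interactive=False))
```

With interactions disabled, the model can still navigate, scroll, and move through history, but it cannot click or type.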

Computer

The computer() tool provides models with a computer desktop environment along with the ability to view the screen and perform mouse and keyboard gestures. The computer tool is based on the Anthropic Computer Use Beta reference implementation and works with any model that supports image input.

Configuration

The computer() tool runs within a Docker container. To use it with a task you need to reference the aisiuk/inspect-computer-tool:latest image in your Docker compose file. For example:

compose.yaml
services:
  default:
    image: aisiuk/inspect-computer-tool:latest

You can configure the container to not have Internet access as follows:

compose.yaml
services:
  default:
    image: aisiuk/inspect-computer-tool:latest
    network_mode: none

Note that if you’d like to be able to view the model’s interactions with the computer desktop in realtime, you will need to also do some port mapping to enable a VNC connection with the container. See the VNC Client section below for details on how to do this.

The aisiuk/inspect-computer-tool:latest image is based on the ubuntu:22.04 image and includes the following additional applications pre-installed:

  • Firefox
  • VS Code
  • Xpdf
  • Xpaint
  • galculator

Task Setup

A task configured to use the computer tool might look like this:

from inspect_ai import Task, task
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import computer

@task
def computer_task():
    return Task(
        dataset=read_dataset(),
        solver=[
            use_tools([computer()]),
            generate(),
        ],
        scorer=match(),
        sandbox=("docker", "compose.yaml"),
    )

Options

The computer tool supports the following options:

Option           Description
max_screenshots  The maximum number of screenshots to play back to the model as input. Defaults to 1 (set to None to have no limit).
timeout          Timeout in seconds for computer tool actions. Defaults to 180 (set to None for no timeout).

For example:

solver=[
    use_tools([computer(max_screenshots=2, timeout=300)]),
    generate()
]

Examples

Two of the Inspect examples demonstrate basic computer use:

  • computer: Three simple computing tasks as a minimal demonstration of computer use.

    inspect eval examples/computer

  • intervention: Computer task driven interactively by a human operator.

    inspect eval examples/intervention -T mode=computer --display conversation

VNC Client

You can use a VNC connection to the container to watch computer use in real-time. This requires some additional port-mapping in the Docker compose file. You can define dynamic port ranges for VNC (5900) and a browser based noVNC client (6080) with the following ports entries:

compose.yaml
services:
  default:
    image: aisiuk/inspect-computer-tool:latest
    ports:
      - "5900"
      - "6080"

To connect to the container for a given sample, locate the sample in the Running Samples UI and expand the sample info panel at the top.

Click on the link for the noVNC browser client, or use a native VNC client to connect to the VNC port. Note that the VNC server will take a few seconds to start up so you should give it some time and attempt to reconnect as required if the first connection fails.

The browser based client provides a view-only interface. If you use a native VNC client (for example, Real VNC Viewer) you should also set it to “view only” so as to not interfere with the model’s use of the computer.

Approval

If the container you are using is connected to the Internet, you may want to configure human approval for a subset of computer tool actions. Here are the possible actions (specified using the action parameter to the computer tool):

  • key: Press a key or key-combination on the keyboard.
  • type: Type a string of text on the keyboard.
  • cursor_position: Get the current (x, y) pixel coordinate of the cursor on the screen.
  • mouse_move: Move the cursor to a specified (x, y) pixel coordinate on the screen.
    • Example: execute(action="mouse_move", coordinate=(100, 200))
  • left_click: Click the left mouse button.
  • left_click_drag: Click and drag the cursor to a specified (x, y) pixel coordinate on the screen.
  • right_click: Click the right mouse button.
  • middle_click: Click the middle mouse button.
  • double_click: Double-click the left mouse button.
  • screenshot: Take a screenshot.

Here is an approval policy that requires approval for key combos (e.g. Enter or a shortcut) and mouse clicks:

approval.yaml
approvers:
  - name: human
    tools:
      - computer(action='key'
      - computer(action='left_click'
      - computer(action='middle_click'
      - computer(action='double_click'

  - name: auto
    tools: "*"

Note that since this is a prefix match and there could be other arguments, we don’t end the tool match pattern with a closing parenthesis.
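The reason for leaving the pattern open can be seen in a small sketch. It assumes that a tool call is rendered as text in the form tool(arg='value', ...) and that patterns are compared as plain prefixes; both the render_call() and needs_human() helpers and that rendering are assumptions made for illustration, and the matching Inspect actually performs may be more general.

```python
def render_call(tool: str, arguments: dict) -> str:
    """Render a tool call as text, e.g. computer(action='key', text='Return')."""
    args = ", ".join(f"{name}={value!r}" for name, value in arguments.items())
    return f"{tool}({args})"


def needs_human(call: str, patterns: list[str]) -> bool:
    """Return True if the rendered call starts with any of the patterns."""
    return any(call.startswith(pattern) for pattern in patterns)


# The patterns from approval.yaml above, left open at the end.
HUMAN_PATTERNS = [
    "computer(action='key'",
    "computer(action='left_click'",
    "computer(action='middle_click'",
    "computer(action='double_click'",
]

key_press = render_call("computer", {"action": "key", "text": "Return"})
mouse_move = render_call("computer", {"action": "mouse_move", "coordinate": [100, 200]})

print(key_press, needs_human(key_press, HUMAN_PATTERNS))
print(mouse_move, needs_human(mouse_move, HUMAN_PATTERNS))

# A pattern closed with ")" would miss calls that carry further arguments.
print(needs_human(key_press, ["computer(action='key')"]))
```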

You can apply this policy using the --approval command line option:

inspect eval computer.py --approval approval.yaml

Tool Binding

The computer tool’s schema is based on the standard Anthropic computer tool type. When using Claude, the computer tool will automatically bind to the native Claude computer tool definition. This presumably provides improved performance due to fine-tuning on the use of the tool, but we have not verified this.

If you want to experiment with bypassing the native Claude computer tool type and just register the computer tool as a normal function-based tool, specify the --no-internal-tools generation option as follows:

inspect eval computer.py --no-internal-tools

Think

The think() tool provides models with the ability to include an additional thinking step on the way to their final answer.

Note that the think() tool is not a substitute for reasoning and extended thinking, but rather an alternate way of letting models express thinking that is better suited to some tool use scenarios.
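Conceptually, a think tool does nothing except give the model a place to write. The plain-Python sketch below conveys that idea; it is not the inspect_ai implementation, and the record_thought() function and thoughts list are hypothetical names used only here.

```python
# Thoughts accumulate here, standing in for the evaluation transcript.
thoughts: list[str] = []


def record_thought(thought: str) -> str:
    """Log a thought without fetching information or changing any state."""
    thoughts.append(thought)
    # Nothing useful is returned: the value of the call is the act of
    # writing the thought down before taking the next action.
    return ""


record_thought("The test failure points at the parser, not the CLI.")
record_thought("Check the tokenizer first since it changed most recently.")
print(len(thoughts))
```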

Usage

You should read the original think tool article in its entirety to understand where and where not to use the think tool. In summary, good contexts for the think tool include:

  1. Tool output analysis. When models need to carefully process the output of previous tool calls before acting and might need to backtrack in their approach;
  2. Policy-heavy environments. When models need to follow detailed guidelines and verify compliance; and
  3. Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multi-step domains).

Use the think() tool alongside other tools like this:

from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash_session, text_editor, think

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([
                bash_session(timeout=180),
                text_editor(timeout=180),
                think()
            ]),
            generate(),
        ],
        scorer=includes(),
        sandbox=("docker", "compose.yaml")
    )

Tool Description

The original think tool article (which was based on experiments with Claude) found that clear instructions on when and how to use the think() tool within a particular problem domain could sometimes be helpful. For example, here’s the prompt they used with SWE-Bench:

from textwrap import dedent

from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash_session, text_editor, think

@task
def swe_bench():

    tools = [
        bash_session(timeout=180),
        text_editor(timeout=180),  
        think(dedent("""
            Use the think tool to think about something. It will not obtain
            new information or make any changes to the repository, but just 
            log the thought. Use it when complex reasoning or brainstorming
            is needed. For example, if you explore the repo and discover
            the source of a bug, call this tool to brainstorm several unique
            ways of fixing the bug, and assess which change(s) are likely to 
            be simplest and most effective. Alternatively, if you receive
            some test results, call this tool to brainstorm ways to fix the
            failing tests.
        """))
    ]

    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools(tools),
            generate(),
        ],
        scorer=includes(),
        sandbox=("docker", "compose.yaml")
    )

System Prompt

The article also found that when tool instructions are long and/or complex, including instructions about the think() tool in the system prompt can be more effective than placing them in the tool description itself.

Here’s an example of moving the custom think() prompt into the system prompt (note that this was not done in the article’s SWE-Bench experiment; this is merely an example):

from textwrap import dedent

from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, system_message, use_tools
from inspect_ai.tool import bash_session, text_editor, think

@task
def swe_bench():

    think_system_message = system_message(dedent("""
        Use the think tool to think about something. It will not obtain
        new information or make any changes to the repository, but just 
        log the thought. Use it when complex reasoning or brainstorming
        is needed. For example, if you explore the repo and discover
        the source of a bug, call this tool to brainstorm several unique
        ways of fixing the bug, and assess which change(s) are likely to 
        be simplest and most effective. Alternatively, if you receive
        some test results, call this tool to brainstorm ways to fix the
        failing tests.
    """))

    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            think_system_message,
            use_tools([
                bash_session(timeout=180),
                text_editor(timeout=180),  
                think(),
            ]),
            generate(),
        ],
        scorer=includes(),
        sandbox=("docker", "compose.yaml")
    )

Note that the effectiveness of using the system prompt will vary considerably across tasks, tools, and models, so it is worth experimenting with.