Inspect

An open-source framework for large language model evaluations

Welcome

Inspect is a framework for frontier AI evaluations developed by the UK AI Security Institute and Meridian Labs. Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:

Composable building blocks—datasets, agents, tools, and scorers—that make evaluations easy to write and reuse.
A collection of over 200 pre-built evaluations ready to run on any model.
Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
Flexible support for tool calling—custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools.
Support for agent evaluations, including flexible built-in agents, multi-agent primitives, and the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, Vagrant, and other systems via an extension API.

We’ll walk through two short “Hello, Inspect” examples below. Read on to learn the basics, then read the documentation on Datasets, Solvers, Scorers, Tools, and Agents to learn how to create more advanced evaluations.

If you are primarily interested in running evaluations rather than developing new ones, see the Evals listing where you’ll find implementations for over 200 popular benchmarks.

Getting Started

To get started using Inspect:

Install Inspect from PyPI with:
```
pip install inspect-ai
```
If you are using VS Code, install the Inspect VS Code Extension (not required but highly recommended).

To develop and run evaluations, you’ll also need access to a model, which typically requires installation of a Python package as well as ensuring that the appropriate API key is available in the environment. For example:

pip install openai
export OPENAI_API_KEY=your-openai-api-key
inspect eval simpleqa.py --model openai/gpt-4o

pip install anthropic
export ANTHROPIC_API_KEY=your-anthropic-api-key
inspect eval simpleqa.py --model anthropic/claude-sonnet-4-0

pip install google-genai
export GOOGLE_API_KEY=your-google-api-key
inspect eval simpleqa.py --model google/gemini-2.5-pro

pip install torch transformers
export HF_TOKEN=your-hf-token
inspect eval simpleqa.py --model hf/meta-llama/Llama-2-7b-chat-hf

Inspect has built-in support for over 20 model providers as well as support for local inference with HuggingFace, vLLM, and SGLang. See the documentation on Model Providers for details on all supported providers.

Hello, Inspect

An Inspect evaluation is a Task that brings together three things:

Dataset that provides labelled samples—typically a table with input and target columns, where input is the prompt and target is the ideal answer or grading guidance.
Solver that produces an answer for each sample. This can be as simple as a single generate() call to the model, or as sophisticated as a full agent that uses tools over many turns.
Scorer that evaluates the output—using text comparisons, model grading, or other custom schemes.

Let’s look at two short examples: a question-answering benchmark and a capture the flag challenge.

Benchmark: SimpleQA

This task evaluates a model on SimpleQA, a benchmark of short, fact-seeking questions(click on the numbers at right for further explanation):

simpleqa.py

from inspect_ai import Task, task
from inspect_ai.dataset import FieldSpec, hf_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate

@task
def simpleqa():
    return Task(
        dataset=hf_dataset(
            "codelion/SimpleQA-Verified",
            split="train",
            sample_fields=FieldSpec(
                input="problem",
                target="answer",
            ),
        ),
        solver=generate(),
        scorer=model_graded_qa(),
    )

1: The @task decorator registers the function with Inspect so that inspect eval can discover and run it by name.
2: hf_dataset() loads samples directly from Hugging Face. Inspect also reads CSV, JSON, and in-memory datasets.
3: FieldSpec declaratively maps the dataset’s problem and answer columns onto the sample’s input and target—no custom conversion function required.
4: The generate() solver simply sends each input to the model and collects its response.
5: Because the answers are free-form text, model_graded_qa() uses a model to grade each response against the target.

Run it from the command line with inspect eval, choosing a model with --model:

inspect eval simpleqa.py --model openai/gpt-5

Use inspect view to view the results:

inspect view

Agent: CTF Challenge

Agent evaluations require the model take actions rather than just answer a question. Here’s a Capture the Flag (CTF) task where the react() agent explores a sandboxed system using bash() and todo_write() tools to find a hidden flag:

ctf.py

from inspect_ai import Task, task
from inspect_ai.agent import react
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.tool import bash, todo_write

@task
def ctf():
    return Task(
        dataset=json_dataset("challenges.json"),
        solver=react(
            prompt=(
                "You are a Capture the Flag player.
                Explore the system and find the flag."
            ),
            tools=[bash(), todo_write()],
            attempts=3,
        ),
        scorer=includes(),
        sandbox="docker",
    )

1: react() is a built-in agent that runs a reason-act-observe loop, giving the model the supplied tools until it submits an answer (here allowing up to 3 attempts).
2: The includes() scorer passes if the target flag appears in the agent’s submitted answer.
3: sandbox="docker" runs all tool calls inside an isolated Docker container (configured by a Dockerfile or compose.yaml alongside the task).

Use inspect view to view the results and look more carefully at individual transcripts:

See the Tutorial to explore more in-depth examples that demonstrate additional Inspect features and techniques.

Python API

Above we demonstrated using inspect eval from CLI to run evaluations—you can perform all of the same operations from directly within Python using the eval() function. For example:

from inspect_ai import eval
from simpleqa import simpleqa

eval(simpleqa(), model="openai/gpt-5")

LLM Assistance

As you learn and use Inspect we recommend you provide an LLM with the documentation required for it to assist. There are two versions of LLM friendly markdown documentation available:

llms.txt: Documentation index, articles fetched as required (~2k tokens).
llms-guide.txt: Full contents of all documentation (~185k tokens).

There is also a Copy Page button at the top of every page that provides a markdown version of the page.

Learning More

To learn more about using Inspect see the following documentation sections:

Tutorial includes several annotated examples demonstrating various features an capabilities.
Components are the building blocks of an evaluation: tasks, datasets, solvers, and scorers.
Models covers specifying models and providers, along with caching, multimodal input, reasoning, batch mode, and concurrency.
Agents combine planning, memory, and tool use for longer-horizon tasks, including the built-in ReAct agent, multi-agent architectures, and bridges to external frameworks.
Tools extend models with custom and built-in tools, MCP integrations, sandboxing, and tool-call approval.
Running covers running larger eval sets, with error handling, limits, parallelism, and early stopping.
Analysis explains how to read eval logs, extract data frames, and scan transcripts for issues.
Extensions shows how to extend Inspect with new model APIs, components, sandboxes, approvers, hooks, and filesystems.

You may also want to explore the Evals listing of ready-to-run benchmark implementations, the Extensions gallery of community packages, and the Reference for the complete Python and CLI API.

Citation

BibTeX citation:

@software{UK_AI_Security_Institute_Inspect_AI_Framework_2024,
  author = {AI Security Institute, UK},
  title = {Inspect {AI:} {Framework} for {Large} {Language} {Model}
    {Evaluations}},
  date = {2024-05},
  url = {https://github.com/UKGovernmentBEIS/inspect_ai},
  langid = {en}
}

For attribution, please cite this work as:

AI Security Institute, UK. 2024. Inspect AI: Framework for Large Language Model Evaluations. Released May. https://github.com/UKGovernmentBEIS/inspect_ai.