Inspect
An open-source framework for large language model evaluations
Welcome
Inspect is a framework for frontier AI evaluations developed by the UK AI Security Institute and Meridian Labs. Inspect can be used for a broad range of evaluations that measure coding, agentic tasks, reasoning, knowledge, behavior, and multi-modal understanding. Core features of Inspect include:
- Composable building blocks—datasets, agents, tools, and scorers—that make evaluations easy to write and reuse.
- A collection of over 200 pre-built evaluations ready to run on any model.
- Extensive tooling, including a web-based Inspect View tool for monitoring and visualizing evaluations and a VS Code Extension that assists with authoring and debugging.
- Flexible support for tool calling—custom and MCP tools, as well as built-in bash, python, text editing, web search, web browsing, and computer tools.
- Support for agent evaluations, including flexible built-in agents, multi-agent primitives, and the ability to run arbitrary external agents like Claude Code, Codex CLI, and Gemini CLI.
- A sandboxing system that supports running untrusted model code in Docker, Kubernetes, Modal, Proxmox, and other systems via an extension API.
We’ll walk through two short “Hello, Inspect” examples below. Read on to learn the basics, then read the documentation on Datasets, Solvers, Scorers, Tools, and Agents to learn how to create more advanced evaluations.
If you are primarily interested in running evaluations rather than developing new ones, see the Evals listing where you’ll find implementations for over 200 popular benchmarks.
Getting Started
To get started using Inspect:
Install Inspect from PyPI with:
pip install inspect-aiIf you are using VS Code, install the Inspect VS Code Extension (not required but highly recommended).
To develop and run evaluations, you’ll also need access to a model, which typically requires installation of a Python package as well as ensuring that the appropriate API key is available in the environment. For example:
pip install openai
export OPENAI_API_KEY=your-openai-api-key
inspect eval simpleqa.py --model openai/gpt-4opip install anthropic
export ANTHROPIC_API_KEY=your-anthropic-api-key
inspect eval simpleqa.py --model anthropic/claude-sonnet-4-0pip install google-genai
export GOOGLE_API_KEY=your-google-api-key
inspect eval simpleqa.py --model google/gemini-2.5-propip install torch transformers
export HF_TOKEN=your-hf-token
inspect eval simpleqa.py --model hf/meta-llama/Llama-2-7b-chat-hfInspect has built-in support for over 20 model providers as well as support for local inference with HuggingFace, vLLM, and SGLang. See the documentation on Model Providers for details on all supported providers.
Hello, Inspect
An Inspect evaluation is a Task that brings together three things:
Dataset that provides labelled samples—typically a table with
inputandtargetcolumns, whereinputis the prompt andtargetis the ideal answer or grading guidance.Solver that produces an answer for each sample. This can be as simple as a single generate() call to the model, or as sophisticated as a full agent that uses tools over many turns.
Scorer that evaluates the output—using text comparisons, model grading, or other custom schemes.
Let’s look at two short examples: a question-answering benchmark and a capture the flag challenge.
Benchmark: SimpleQA
This task evaluates a model on SimpleQA, a benchmark of short, fact-seeking questions(click on the numbers at right for further explanation):
simpleqa.py
from inspect_ai import Task, task
from inspect_ai.dataset import FieldSpec, hf_dataset
from inspect_ai.scorer import model_graded_qa
from inspect_ai.solver import generate
@task
def simpleqa():
return Task(
dataset=hf_dataset(
"codelion/SimpleQA-Verified",
split="train",
sample_fields=FieldSpec(
input="problem",
target="answer",
),
),
solver=generate(),
scorer=model_graded_qa(),
)- 1
-
The
@taskdecorator registers the function with Inspect so thatinspect evalcan discover and run it by name. - 2
- hf_dataset() loads samples directly from Hugging Face. Inspect also reads CSV, JSON, and in-memory datasets.
- 3
-
FieldSpec declaratively maps the dataset’s
problemandanswercolumns onto the sample’sinputandtarget—no custom conversion function required. - 4
-
The generate() solver simply sends each
inputto the model and collects its response. - 5
-
Because the answers are free-form text, model_graded_qa() uses a model to grade each response against the
target.
Run it from the command line with inspect eval, choosing a model with --model:
inspect eval simpleqa.py --model openai/gpt-5Agent: CTF Challenge
Agent evaluations require the model take actions rather than just answer a question. Here’s a Capture the Flag (CTF) task where the react() agent explores a sandboxed system using bash() and todo_write() tools to find a hidden flag:
ctf.py
from inspect_ai import Task, task
from inspect_ai.agent import react
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.tool import bash, todo_write
@task
def ctf():
return Task(
dataset=json_dataset("challenges.json"),
solver=react(
prompt=(
"You are a Capture the Flag player.
Explore the system and find the flag."
),
tools=[bash(), todo_write()],
attempts=3,
),
scorer=includes(),
sandbox="docker",
)- 1
-
react() is a built-in agent that runs a reason-act-observe loop, giving the model the supplied
toolsuntil it submits an answer (here allowing up to 3 attempts). - 2
- The includes() scorer passes if the target flag appears in the agent’s submitted answer.
- 3
-
sandbox="docker"runs all tool calls inside an isolated Docker container (configured by aDockerfileorcompose.yamlalongside the task).
See the Tutorial to explore more in-depth examples that demonstrate additional Inspect features and techniques.
Eval Logs
By default, eval logs are written to the ./logs sub-directory of the current working directory. When the eval is complete you will find a link to the log at the bottom of the task results summary.
If you are using VS Code, we recommend installing the Inspect VS Code Extension and using its integrated log browsing and viewing.
For other editors, you can use the inspect view command to open a log viewer in the browser (you only need to do this once as the viewer will automatically update when new evals are run):
inspect viewSee the Log Viewer section for additional details on using Inspect View.
Python API
Above we demonstrated using inspect eval from CLI to run evaluations—you can perform all of the same operations from directly within Python using the eval() function. For example:
from inspect_ai import eval
from simpleqa import simpleqa
eval(simpleqa(), model="openai/gpt-5")LLM Assistance
As you learn and use Inspect we recommend you provide an LLM with the documentation required for it to assist. There are two versions of LLM friendly markdown documentation available:
llms.txt: Documentation index, articles fetched as required (~2k tokens).
llms-guide.txt: Full contents of all documentation (~185k tokens).
There is also a Copy Page button at the top of every page that provides a markdown version of the page.
Learning More
To learn more about using Inspect see the following documentation sections:
Tutorial includes several annotated examples demonstrating various features an capabilities.
Components are the building blocks of an evaluation: tasks, datasets, solvers, and scorers.
Models covers specifying models and providers, along with caching, multimodal input, reasoning, batch mode, and concurrency.
Agents combine planning, memory, and tool use for longer-horizon tasks, including the built-in ReAct agent, multi-agent architectures, and bridges to external frameworks.
Tools extend models with custom and built-in tools, MCP integrations, sandboxing, and tool-call approval.
Running covers running larger eval sets, with error handling, limits, parallelism, and early stopping.
Analysis explains how to read eval logs, extract data frames, and scan transcripts for issues.
Extensions shows how to extend Inspect with new model APIs, components, sandboxes, approvers, hooks, and filesystems.
You may also want to explore the Evals listing of ready-to-run benchmark implementations, the Extensions gallery of community packages, and the Reference for the complete Python and CLI API.
Citation
@software{UK_AI_Security_Institute_Inspect_AI_Framework_2024,
author = {AI Security Institute, UK},
title = {Inspect {AI:} {Framework} for {Large} {Language} {Model}
{Evaluations}},
date = {2024-05},
url = {https://github.com/UKGovernmentBEIS/inspect_ai},
langid = {en}
}
