Agent Checkpointing
The agent checkpointing feature described below requires the development version of Inspect. You can install the development version from GitHub with:
pip install git+https://github.com/UKGovernmentBEIS/inspect_aiOverview
Evaluations that run for hours (or even days) can take advantage of checkpointing for resilience against unexpected termination (infrastructure failure, running out of memory, etc.). When enabled, Inspect’s checkpointing periodically persists the state of each sample so the evaluation can be resumed mid-sample from its most recent checkpoint.
Checkpointing works by saving three classes of data at a specified interval:
- The current state of the main agent (e.g. messages, compaction, etc.).
- Filesystem state within sandboxes (home + other configured directories).
- The store and event history of the sample.
Checkpointing does not save arbitrary in-memory process state, running tools, or other external side effects.
When Inspect retries a failed sample, checkpointed data is automatically restored and the sample continued from where it left off. Note that checkpointing needs explicit support in agent scaffolds. Built-in agents like react() and deepagent() support checkpointing natively, see the section below on Custom Agents for details on how to add checkpointing to your own agents.
Basic Usage
To enable checkpointing for an evaluation, use the --checkpoint CLI option. For example:
# checkpoint using default trigger (500K tokens)
inspect eval ctf.py --checkpoint
# checkpoint using explicit triggers
inspect eval ctf.py --checkpoint=turn:3
inspect eval ctf.py --checkpoint=time:15m
inspect eval ctf.py --checkpoint=token:100K
# pass a yaml or json config file
inspect eval ctf.py --checkpoint=checkpoint.yaml- 1
-
Time-value suffixes are
s,m,h,d - 2
-
Token-value suffixes are
K,M, andB
Or from Python:
from inspect_ai import eval
from inspect_ai.util import CheckpointConfig, TurnInterval
# checkpoint using default trigger (500K tokens)
eval("ctf.py", model="openai/gpt-5", checkpoint=True)
# checkpoint every 3 turns
eval(
"ctf.py", model="openai/gpt-5",
checkpoint=CheckpointConfig(
trigger=TurnInterval(every=3)
)
)Note that the above example assumes that you are using a checkpointing-aware agent (e.g. the built-in react() and deepagent()). You can add checkpointing to your own agents by following the recipe described in Custom Agents.
Recovery
If a crash occurs during an evaluation that was configured for checkpointing, recovery occurs when the sample is retried. For both eval_set() and eval_retry() this occurs automatically:
inspect eval-set arc.py ctf.py --checkpoint # CRASH!
inspect eval-set arc.py ctf.py --checkpoint # recover
inspect eval arc.py --checkpoint # CRASH!
inspect eval-retry logs/<log-file-name>.eval # recoverFor incomplete samples, Inspect locates the checkpoint directory, restores each sample’s most recent committed checkpoint, and continues. The following is restored before the agent runs:
- Agent State: Any state values the agent registered (e.g. message or compaction history).
- Sandbox: The sandbox setup script is run and the configured sandbox paths are restored from backup into a fresh sandbox container.
- Events and Store: Inspect’s internal per-sample state is rehydrated (the events appear under a synthesized
prior_runspan in the new.evallog).
By default checkpoints are deleted once the eval completes successfully; use retention="retain" in the CheckpointConfig to preserve them.
Triggers
Checkpoints are recorded according to the configured trigger. All triggers fire at the next turn boundary after the trigger condition is reached. Agents are never interrupted mid-turn, and in-flight tool calls are never paused to checkpoint.
For example, here we configure an eval to checkpoint every 1M tokens:
from inspect_ai import eval
from inspect_ai.util import CheckpointConfig, TokenInterval
eval(
"arc.py", model="openai/gpt-5",
checkpoint=CheckpointConfig(
trigger=TokenInterval(every=1_000_000)
)
)You can configure triggers based on turns, time, or tokens, or can alternatively specify manual only checkpointing (applicable for Custom Agents):
| Trigger | Description |
|---|---|
| TurnInterval | Fire every N agent turns. |
| TimeInterval | Fire after approximately N seconds/minutes of time has elapsed since the last checkpoint. |
| TokenInterval | Fire each time the sample’s running total token usage crosses another N-token boundary. Sample total tokens are read from sample_total_tokens(). |
| Manual | Never fires automatically. The agent (or another caller) requests a checkpoint with await cp.checkpoint(). |
Note that time and token intervals effectively checkpoint at ≥ N because firing waits for the next turn boundary.
Configuration
CheckpointConfig is the full configuration object passed to the Python API (it can equivalently be passed as YAML to the CLI). All fields default to None so that partial configurations can be layered (see Configuration Layers below).
| Field | Description |
|---|---|
trigger |
The trigger that decides when to fire (defaults to 500K tokens). |
sandbox_paths |
Per-sandbox-name map of absolute paths inside the sandbox to capture. Defaults to sandbox user’s home directory. |
checkpoints_location |
Override the parent directory under which the per-eval checkpoint directory lands (defaults to a sibling of the eval log file). |
max_consecutive_failures |
If set, fail the sample after N consecutive failed checkpoint attempts. Defaults to None (unlimited tolerance). Specify 0 to make any single failure fatal. |
retention |
"delete" or "retain". "delete" (default) removes the checkpoint directory on successful eval completion; "retain" keeps it. |
from datetime import timedelta
from inspect_ai.util import CheckpointConfig, TimeInterval
eval(
"arc.py", model="openai/gpt-4o",
checkpoint=CheckpointConfig(
trigger=TimeInterval(every=timedelta(minutes=15)),
max_consecutive_failures=3,
retention="retain",
),
)The equivalent YAML for use with --checkpoint=checkpoint.yaml:
trigger:
type: time
every: 15m
max_consecutive_failures: 3
retention: retainTask Configuration
Tasks can directly specify and enable checkpointing in the same manner that evals do (this lets you run a set of evals some of which have checkpointing and some of which don’t). For example:
from inspect_ai import Task, task
from inspect_ai.util import CheckpointConfig, TimeInterval
from datetime import timedelta
@task
def my_task():
return Task(
dataset=...,
solver=...,
checkpoint=CheckpointConfig(
trigger=TimeInterval(every=timedelta(minutes=15)),
sandbox_paths={"default": ["/workspace"]},
),
)Note that when sandbox_paths are specified by a Task or Sample they completely override the default (which is the sandbox home directory) so you should be sure to include the home directory along with any custom paths.
Sample Configuration
Samples can also specify checkpointing configuration, however they cannot enable checkpointing independently (this needs to be done at the task or eval level). For example:
from inspect_ai import eval
from inspect_ai.dataset import Sample
from inspect_ai.util import CheckpointSampleConfig
# specify config at the sample level
Sample(
input="...",
target="...",
checkpoint=CheckpointSampleConfig(
sandbox_paths={
"default": ["/root", "/workspace", "/var/state"],
},
),
)
# enable at the eval or task level (uses the sample config)
eval("ctf.py", model="openai/gpt-5", checkpoint=True)Configuration Layers
When more than one of the sample / task / eval layers supplies a configuration, Inspect merges them per-field at sample-run time. Each layer’s fields default to None; the highest-priority layer with a non-None value wins per field. Precedence is eval > sample > task. Sample beats task because the task defines an agent’s standard policy for all of its samples and an individual sample specializes it. Eval/CLI is always highest because it’s the operator’s run-time override.
Sandbox Paths
sandbox_paths is a map of container name to paths that should be captured for that container. Sandboxes that use only a single container can use the "default" key to refer to that container. Note that sandbox_paths is treated as a single whole-dict value, not key-wise merged. To add a path to one sandbox while inheriting others, the higher-priority layer must redeclare the full map. Also, while sandbox home directories are included by default, if you specify sandbox_paths explicitly you must explicitly include the home directory if you want it checkpointed.
Checkpoint Flow
When checkpointing is enabled, each sample proceeds as follows:
Tick: At each agent turn boundary, the agent calls
cp.tick(). The trigger decides whether this tick is a checkpoint moment.Capture: When the trigger fires, Inspect writes the host-side context files (agent-tracked state, events, and store) and runs a restic backup on the agent scaffold host and each configured sandbox path.
Record: Inspect writes a
ckpt-NNNNN.jsoncheckpoint file at the per-sample root. A checkpoint is available for resumption only once this file is in place.Emit: A structured
CheckpointEventis emitted into the normal event stream (and into the.evallog) carrying the trigger, turn, duration, snapshot ids, and size.
Crashed cycles that didn’t reach step 3 leave orphan snapshots with no checkpoint file; resume discards them automatically.
Custom Agents
A custom agent participates in checkpointing by entering a checkpointer() context and calling tick() at each turn boundary.
The key checkpointing mechanism for agents is the cp.track method, which gives the agent a way to declare “this variable is important state, capture it at each checkpoint and restore it for me on retry.”
On a fresh run, track returns the initial_value the agent passes in. On a retry of the same sample, track returns whatever the registered callback captured at the most recent checkpoint of the prior run, so the agent’s important state comes back automatically and can pick up where it left off.
from inspect_ai.util import Attempt, checkpointer
async def my_agent(state):
async with checkpointer() as cp:
# each `track` call: register a callback that
# captures this variable at every checkpoint
# fire, AND get back the value captured at the
# most recent prior checkpoint (or the initial
# value on a fresh run).
attempt_count = cp.track(
"attempt_count",
lambda: attempt_count,
0,
)
state.messages = cp.track(
"messages",
lambda: state.messages,
state.messages,
)
# scoring-phase resume: agent already finished in
# the prior attempt — tracked state is restored
# above, return so scoring can re-run.
if cp.attempt == Attempt.RESUME_FOR_SCORING:
return state
# agent loop
while True:
# fire checkpoint as per trigger
await cp.tick()
...
# for manual triggers the agent fires explicitly:
# await cp.checkpoint()The Checkpointer context has the following members:
| Member | Description |
|---|---|
await cp.tick() |
Turn-boundary signal; may fire a checkpoint depending on the trigger. |
await cp.checkpoint() |
Force a checkpoint regardless of trigger. |
cp.track(key, callback, initial_value, *, value_type=None) |
Declare a piece of persisted agent state. callback is invoked at every checkpoint fire to capture the current value; the method returns the value captured at the most recent prior checkpoint on a retry, or initial_value on a fresh run. |
cp.attempt |
Attempt.INITIAL (fresh), Attempt.RESUME (mid-agent resume), or Attempt.RESUME_FOR_SCORING (agent completed in the prior attempt — restore tracked state and return). |
cp.track() is generic over the value type. Pydantic models and JSON primitive values are handled automatically; for other shapes (collections, generics, dataclasses, lists of models) pass value_type=....
Limitations
Checkpointing enables restoration of the most important agent context after a crash, but has some limitations you should bear in mind when using it:
Turn-boundary granularity only — A long-running tool call (e.g. a 10-minute subprocess) blocks the next checkpoint until the call returns.
Filesystem state only — Anything outside
sandbox_pathsis not captured. In-memory process state inside the sandbox (running daemons, open sockets, RAM) is lost on resume. The agent is responsible for tolerating this.Failure tolerance — A failed checkpoint attempt records an InfoEvent (
source="checkpoint") and logs a warning, but by default the sample keeps running. (durability is treated as a nicety, not a correctness requirement). Setmax_consecutive_failures=Nto bound this, or0for strict mode.Restic binary — On first use, Inspect fetches a pinned restic binary and caches it. For offline / air-gapped environments, pre-warm the cache with
inspect download restic.