Tasks
Overview
This article documents both basic and advanced use of Inspect tasks, which are the fundamental unit of integration for datasets, solvers, and scorers. The following topics are explored:
- Task Basics describes the core components and options of tasks.
- Parameters covers adding parameters to tasks to make them flexible and adaptable.
- Solvers describes how to create tasks that can be used with many different solvers.
- Task Reuse documents how to flexibly derive new tasks from existing task definitions.
- Configuration explains how to override task options at runtime with task_with(), environment variables, eval(), and the CLI.
- Packaging illustrates how you can distribute tasks within Python packages.
- Exploratory provides guidance on doing exploratory task development.
Task Basics
A task provides a recipe for an evaluation consisting minimally of a dataset, a solver, and a scorer (and possibly other options), and is returned from a function decorated with @task. For example:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate

@task
def security_guide():
    return Task(
        dataset=json_dataset("security_guide.json"),
        solver=[chain_of_thought(), generate()],
        scorer=model_graded_fact()
    )

For convenience, tasks always define a default solver. That said, it is often desirable to design tasks that can work with any solver so that you can experiment with different strategies. The Solvers section below goes into depth on how to create tasks that can be flexibly used with any solver.
Task Options
While many tasks can be defined with only a dataset, solver, and scorer, there are lots of other useful Task options. We won’t describe these options in depth here, but rather provide a list along with links to other sections of the documentation that cover their usage:
| Option | Description | Docs |
|---|---|---|
| epochs | Epochs to run for each dataset sample. | Epochs |
| setup | Setup solver(s) to run prior to the main solver. | Sample Setup |
| cleanup | Cleanup function to call at task completion. | Task Cleanup |
| sandbox | Sandbox configuration for un-trusted code execution. | Sandboxing |
| approval | Approval policy for tool calls. | Tool Approval |
| metrics | Metrics to use in place of scorer metrics. | Metrics |
| model | Model for evaluation (typically specified by eval rather than the task). | Models |
| model_roles | Named models for use with get_model() (e.g. a grader). | Model Roles |
| config | Config for model generation (also typically specified in eval). | Generate Config |
| fail_on_error | Failure tolerance for samples. | Failure Threshold |
| continue_on_fail | Continue running after sample errors, failing only at the end. | Handling Errors |
| score_on_error | Score samples that error rather than failing the run. | Handling Errors |
| message_limit, token_limit, time_limit, working_limit, cost_limit | Limits to apply to sample execution. | Sample Limits |
| early_stopping | Stop a task early based on previously scored samples. | Early Stopping |
| name, display_name, version, metadata, tags | Identifying attributes recorded in the eval log. | Eval Logs |
| viewer | Log viewer config (e.g. how scanner results render). | Task Views |
You by and large don’t need to worry about these options until you want to use the features they are linked to.
Parameters
Task parameters make it easy to run variants of your task without changing its source code. Task parameters are simply the arguments to your @task decorated function. For example, here we provide parameters (and default values) for system and grader prompts, as well as the grader model:
security.py
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

@task
def security_guide(
    system="devops.txt",
    grader="expert.txt",
    grader_model="openai/gpt-4o"
):
    return Task(
        dataset=example_dataset("security_guide"),
        solver=[system_message(system), generate()],
        scorer=model_graded_fact(
            template=grader, model=grader_model
        )
    )

Let’s say we had an alternate system prompt in a file named "researcher.txt". We could run the task with this prompt as follows:
inspect eval security.py -T system="researcher.txt"

The -T CLI flag is used to specify parameter values. You can include multiple -T flags. For example:

inspect eval security.py \
    -T system="researcher.txt" -T grader="hacker.txt"

If you have several task parameters you want to specify together, you can put them in a YAML or JSON file and use the --task-config CLI option. For example:

config.yaml
system: "researcher.txt"
grader: "hacker.txt"

Reference this file from the CLI with:

inspect eval security.py --task-config=config.yaml

If you want to bundle task parameters together with model, generation, and solver settings in a single file, use --run-config instead. See Run Config File.
For a broader view of how task parameters relate to task_with(), environment variables, eval(), and CLI overrides, see Configuration.
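Because --task-config accepts JSON as well as YAML, a parameter file can also be written from Python. The following is a minimal sketch using only the standard library; the helper name write_task_config and the file name task_config.json are arbitrary choices for illustration, and the parameter names mirror the security_guide example above.

```python
import json
from pathlib import Path


def write_task_config(path: str, **params: str) -> Path:
    """Write task parameters to a JSON file for use with --task-config."""
    config_path = Path(path)
    config_path.write_text(json.dumps(params, indent=2))
    return config_path


config_path = write_task_config(
    "task_config.json",  # hypothetical file name
    system="researcher.txt",
    grader="hacker.txt",
)
print(config_path.read_text())
```

The resulting file could then be passed with inspect eval security.py --task-config=task_config.json.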
Solvers
While tasks always include a default solver, you can also vary the solver to explore other strategies and elicitation techniques. This section covers best practices for creating solver-independent tasks.
Solver Parameter
You can substitute an alternate solver for the solver that is built in to your Task using the --solver command line parameter (or solver argument to the eval() function).
For example, let’s start with a simple CTF challenge task:
from inspect_ai import Task, task
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python
from inspect_ai.scorer import includes

@task
def ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            use_tools([
                bash(timeout=180),
                python(timeout=180)
            ]),
            generate()
        ],
        sandbox="docker",
        scorer=includes()
    )

This task uses the most naive solver possible (a simple tool use loop with no additional elicitation). That might be okay for initial task development, but we’ll likely want to try lots of different strategies. We start by breaking the solver into its own function and adding an alternative solver that uses a react() agent:
from inspect_ai import Task, task
from inspect_ai.agent import react
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import chain, generate, solver, use_tools
from inspect_ai.tool import bash, python

@solver
def ctf_tool_loop():
    return chain([
        use_tools([
            bash(timeout=180),
            python(timeout=180)
        ]),
        generate()
    ])

@solver
def ctf_agent(attempts: int = 3):
    return react(
        tools=[bash(timeout=180), python(timeout=180)],
        attempts=attempts,
    )

@task
def ctf():
    # return task
    return Task(
        dataset=read_dataset(),
        solver=ctf_tool_loop(),
        sandbox="docker",
        scorer=includes(),
    )

Note that we use the chain() function to combine multiple solvers into a composite one.
You can now switch between solvers when running the evaluation:
# run with the default solver (ctf_tool_loop)
inspect eval ctf.py

# run with the ctf agent solver
inspect eval ctf.py --solver=ctf_agent

# run with a different number of attempts
inspect eval ctf.py --solver=ctf_agent -S attempts=5

Note the use of the -S CLI option to pass an alternate value for attempts to the ctf_agent() solver.
Setup Parameter
In some cases, there will be important steps in the setup of a task that should not be substituted when another solver is used with the task. For example, you might have a step that does dynamic prompt engineering based on values in the sample metadata or you might have a step that initialises resources in a sample’s sandbox.
In these scenarios you can define a setup solver that is always run even when another solver is substituted. For example, here we adapt our initial example to include a setup step:
from inspect_ai.solver import Solver, solver

# prompt solver which should always be run
@solver
def ctf_prompt():
    async def solve(state, generate):
        # TODO: dynamic prompt engineering
        return state

    return solve

@task
def ctf(solver: Solver | None = None):
    # use default tool loop solver if no solver specified
    if solver is None:
        solver = ctf_tool_loop()

    # return task
    return Task(
        dataset=read_dataset(),
        setup=ctf_prompt(),
        solver=solver,
        sandbox="docker",
        scorer=includes()
    )

Task Cleanup
You can use the cleanup parameter for executing code at the end of each sample run. The cleanup function is passed the TaskState and is called both for successful runs and for runs where an exception is thrown. Extending the example from above:

from inspect_ai.solver import TaskState

async def ctf_cleanup(state: TaskState):
    # perform cleanup
    ...

Task(
    dataset=read_dataset(),
    setup=ctf_prompt(),
    solver=solver,
    cleanup=ctf_cleanup,
    scorer=includes()
)

Note that like solvers, cleanup functions should be async.
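The "called on success and on error" behaviour described above has the same shape as a try/finally block. The sketch below is a stand-alone illustration using only asyncio; run_sample is a hypothetical stand-in for the sample runner and is not Inspect's implementation.

```python
import asyncio

calls: list[str] = []


async def cleanup(state: dict) -> None:
    # Stand-in for a task cleanup function: record which sample it ran for.
    calls.append(f"cleanup:{state['id']}")


async def run_sample(state: dict, fail: bool) -> None:
    # Hypothetical runner: cleanup sits in `finally`, so it runs either way.
    try:
        if fail:
            raise RuntimeError("solver failed")
        calls.append(f"solved:{state['id']}")
    finally:
        await cleanup(state)


async def main() -> None:
    await run_sample({"id": 1}, fail=False)
    try:
        await run_sample({"id": 2}, fail=True)
    except RuntimeError:
        calls.append("error:2")


asyncio.run(main())
print(calls)
```

Cleanup is recorded for both samples, and for the failing sample it runs before the exception reaches the caller.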
Task Reuse
The basic mechanism for task re-use is to create flexible and adaptable base @task functions (which often have many parameters) and then derive new higher-level tasks from them by creating additional @task functions that call the base function.
In some cases though you might not have full control over the base @task function (e.g. it’s published in a Python package you aren’t the maintainer of) but you nevertheless want to flexibly create derivative tasks from it. To do this, you can use the task_with() function, which provides a straightforward way to modify the properties of an existing task. The Configuration section below covers task_with() alongside the other ways to override task options at runtime.
For example, imagine you are dealing with a Task that hard-codes its sandbox to a particular Dockerfile included with the task, and further hard codes its solver to a simple agent:
from inspect_ai import Task, task
from inspect_ai.agent import react
from inspect_ai.tool import bash
from inspect_ai.scorer import includes

@task
def hard_coded():
    return Task(
        dataset=read_dataset(),
        solver=react(tools=[bash()]),
        sandbox=("docker", "compose.yaml"),
        scorer=includes()
    )

Using task_with(), you can adapt this task to use a different solver and sandbox entirely. For example, here we import the original hard_coded() task from a hypothetical ctf_tasks package and provide it with a different solver and sandbox, as well as give it a message_limit (which we in turn also expose as a parameter of the adapted task):
from inspect_ai import task, task_with
from inspect_ai.solver import solver

from ctf_tasks import hard_coded

@solver
def my_custom_agent():
    # custom agent implementation
    ...

@task
def adapted(message_limit: int = 20):
    return task_with(
        hard_coded(),  # original task definition
        solver=my_custom_agent(),
        sandbox=("docker", "custom-compose.yaml"),
        message_limit=message_limit
    )

Tasks are recipes for an evaluation and represent the convergence of many considerations (datasets, solvers, sandbox environments, limits, and scoring). Task variations often lie at the intersection of these, and the task_with() function is intended to help you produce exactly the variation you need for a given evaluation.
Note that task_with() modifies the passed task in-place, so if you want to create multiple variations of a single task using task_with() you should create the underlying task multiple times (once for each call to task_with()). For example:
adapted1 = task_with(hard_coded(), ...)
adapted2 = task_with(hard_coded(), ...)

Configuration
A task definition provides defaults for everything an evaluation needs, but you will often want to run a task with different settings without editing its source. Task options can be set or overridden at four layers, each taking precedence over the ones before it:
- Task definition: defaults baked into the @task function and Task() constructor.
- task_with(): programmatic overrides applied to a task before passing it to eval().
- Environment variables / .env files: project or session defaults set outside code.
- eval() / CLI: runtime overrides, which take highest precedence.

| Lowest | | | Highest |
|---|---|---|---|
| Task definition | task_with() | .env / env vars | eval() / CLI |
The first two layers are described earlier in this article: defaults and parameters in the task definition, and task_with() for adapting a task you don’t control. The sections below cover the remaining two layers, then provide a reference for what can be set where.
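One way to picture the four layers is as a chain of dictionaries in which lookups stop at the first layer that defines a key. The sketch below is a toy model built on collections.ChainMap; the option values are made up for illustration and this is not how Inspect resolves options internally.

```python
from collections import ChainMap

# Hypothetical option values at each layer, from lowest to highest precedence.
task_definition = {"epochs": 1, "message_limit": 20, "temperature": 0.5}
task_with_overrides = {"message_limit": 30}
env_defaults = {"temperature": 0.0, "epochs": 3}
eval_or_cli = {"epochs": 5}

# ChainMap searches its maps left to right, so the highest layer goes first.
resolved = ChainMap(
    eval_or_cli, env_defaults, task_with_overrides, task_definition
)

for option in ("epochs", "message_limit", "temperature"):
    print(option, "=", resolved[option])
```

Here epochs comes from the eval()/CLI layer, temperature from the environment layer, and message_limit from the task_with() layer, since no higher layer sets it.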
Environment Variables
Every CLI flag can be set as an environment variable using the INSPECT_EVAL_ prefix (with hyphens converted to underscores). Set these in the shell, or place them in a .env file that Inspect reads automatically from the current directory (searching parent directories if not found). Use this layer for project or session defaults you want applied across runs without specifying them each time:
.env
INSPECT_EVAL_MODEL=anthropic/claude-sonnet-4-5
INSPECT_EVAL_TEMPERATURE=0.0
INSPECT_EVAL_MAX_CONNECTIONS=20
INSPECT_EVAL_MAX_RETRIES=5

Variables set in the shell take precedence over values in a .env file. See Options for details on .env file handling.
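The naming rule and the shell-over-.env precedence can be sketched in a few lines. The function names below are hypothetical, the upper-casing is inferred from the examples above rather than stated by the rule itself, and plain dictionaries stand in for the real shell environment and .env file.

```python
from typing import Dict, Optional

PREFIX = "INSPECT_EVAL_"


def flag_to_env_var(flag: str) -> str:
    """Map a CLI flag such as --max-connections to its variable name."""
    # Drop leading dashes, convert hyphens to underscores, upper-case.
    return PREFIX + flag.lstrip("-").replace("-", "_").upper()


def resolve(
    name: str, shell_env: Dict[str, str], dotenv: Dict[str, str]
) -> Optional[str]:
    """Prefer the shell value; fall back to the .env value if unset."""
    return shell_env.get(name, dotenv.get(name))


name = flag_to_env_var("--max-connections")
dotenv_values = {"INSPECT_EVAL_MAX_CONNECTIONS": "20"}
shell_values = {"INSPECT_EVAL_MAX_CONNECTIONS": "40"}

print(name, resolve(name, shell_values, dotenv_values))
```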
eval() and CLI
Parameters passed to eval() or on the inspect eval command line take highest precedence, and apply to all tasks being evaluated in the call.
from inspect_ai import eval

eval(
    simpleqa(),
    model="anthropic/claude-sonnet-4-5",
    temperature=0.0,
    max_tokens=4096,
    epochs=5,
    limit=100,
    message_limit=50,
    model_roles={"grader": "google/gemini-2.0-flash"},
)

The same overrides on the command line:

inspect eval inspect_evals/simpleqa \
    --model anthropic/claude-sonnet-4-5 \
    --temperature 0.0 \
    --max-tokens 4096 \
    --epochs 5 \
    --limit 100 \
    --message-limit 50 \
    --model-role grader=google/gemini-2.0-flash

See Eval Options for the full list of CLI flags.
Override Reference
The table below lists task and runtime parameters and the layers at which each can be set:
| Parameter | Task | task_with | eval | CLI flag |
|---|---|---|---|---|
| Task structure | | | | |
| dataset | yes | yes | | |
| setup | yes | yes | | |
| solver | yes | yes | yes | --solver (name or file.py@name) |
| cleanup | yes | yes | | |
| scorer | yes | yes | | |
| metrics | yes | yes | | |
| Model | | | | |
| model | yes | yes | yes | --model |
| config (includes temperature, max_tokens, etc.) | yes | yes | yes (via **kwargs) | individual flags or --generate-config |
| model_roles | yes | yes | yes | --model-role |
| Execution limits | | | | |
| epochs | yes | yes | yes | --epochs |
| message_limit | yes | yes | yes | --message-limit |
| token_limit | yes | yes | yes | --token-limit |
| time_limit | yes | yes | yes | --time-limit |
| working_limit | yes | yes | yes | --working-limit |
| cost_limit | yes | yes | yes | --cost-limit |
| early_stopping | yes | yes | | |
| Error handling | | | | |
| fail_on_error | yes | yes | yes | --fail-on-error |
| continue_on_fail | yes | yes | yes | --continue-on-fail |
| retry_on_error | | | yes | --retry-on-error |
| score_on_error | yes | yes | yes | --score-on-error |
| debug_errors | | | yes | --debug-errors |
| Environment | | | | |
| sandbox | yes | yes | yes | --sandbox |
| sandbox_cleanup | yes | | yes | --no-sandbox-cleanup |
| approval | yes | yes | yes | --approval |
| Task identity | | | | |
| name | yes | yes | | |
| version | yes | yes | | |
| metadata | yes | yes (overwrites) | yes (merges) | --metadata |
| tags | yes | yes (overwrites) | yes (merges) | --tags |
| Sample selection | | | | |
| limit | | | yes | --limit |
| sample_id | | | yes | --sample-id |
| sample_shuffle | | | yes | --sample-shuffle |
| Eval-level controls | | | | |
| task_args | args/kwargs | | yes | -T key=value |
| score | | | yes | --no-score |
| score_display | | | yes | --no-score-display |
| trace | | | yes | --trace |
Blank cells indicate that a parameter cannot be set at that layer. The task_args row refers to setting these fields as arguments of the Task object, as opposed to passing a task_args dictionary.
Generation Config
GenerateConfig parameters (temperature, max_tokens, top_p, and so on) can be set at every layer.
In the task definition via config:
Task(
    ...,
    config=GenerateConfig(temperature=0.5, max_tokens=2048)
)

With task_with() via config:

task_with(my_task(), config=GenerateConfig(temperature=0.0))

With eval() as keyword arguments:

eval(my_task(), temperature=0.0, max_tokens=4096)

On the CLI as individual flags:

inspect eval my_task.py --temperature 0.0 --max-tokens 4096

Or from a YAML/JSON file using --generate-config:

inspect eval my_task.py --generate-config config.yaml

where config.yaml contains GenerateConfig fields:

config.yaml
temperature: 0.5
max_tokens: 2048

Individual CLI flags (e.g. --temperature) take precedence over values in the config file. To bundle generation parameters alongside a full eval configuration (task, model, model roles, solver), use --run-config instead (see Run Config File).
Model Roles
Model roles assign models to named purposes within a task (for example, a “grader” model for scoring). They can be set on Task, with task_with(), with eval(), or on the CLI with --model-role (see the override reference for where each form fits). The most common pattern:
Task(..., model_roles={"grader": "openai/gpt-4o"})
eval(my_task(), model_roles={"grader": "google/gemini-2.0-flash"})

Inside a solver or scorer, resolve the role with get_model():

model = get_model(role="grader", default="openai/gpt-4o")

For inline YAML/JSON examples and role-resolution details, see Model Roles.
Run Config File
The --run-config option specifies a single YAML or JSON file that captures a full eval configuration (task, model, model roles, generation parameters, solver, and eval settings) in one place. CLI flags still override values from the file.
inspect eval --run-config run.yaml

The file schema mirrors the structure of the corresponding eval() parameters:

run.yaml
task:
  task: inspect_evals/simpleqa
  args:
    split: test
model:
  model: anthropic/claude-sonnet-4-5
  args:
    max_retries: 3
model_roles:
  grader:
    model: openai/gpt-4o
    config:
      temperature: 0.0
generate_config:
  temperature: 0.5
  max_tokens: 4096
  seed: 42
solver:
  solver: my_solvers.py@chain_of_thought
  args:
    cot_template: detailed
eval_config:
  limit: 100
  epochs: 3
  message_limit: 50

All top-level keys are optional. This lets you create “paper config” files that record the generation and eval settings from a paper without hard-coding a specific model, leaving the model to be supplied on the CLI:
# paper_config.yaml specifies only generate_config, eval_config, and model_roles
inspect eval inspect_evals/simpleqa \
    --model anthropic/claude-sonnet-4-5 \
    --run-config paper_config.yaml

To run with a different value than the file specifies, pass the corresponding flag:

inspect eval --run-config run.yaml --temperature 0.9

--run-config cannot be combined with --generate-config, --task-config, or --solver-config. Use --run-config for a single file; use the individual options to compose configuration from multiple files.
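Since a run config may be JSON as well as YAML, a “paper config” of the kind described above could also be generated from Python. This is a minimal sketch with the standard library: the top-level keys follow the run.yaml example, while the values and the file name paper_config.json are placeholders rather than settings from any real paper.

```python
import json

# Placeholder settings; top-level keys follow the run.yaml example.
paper_config = {
    "generate_config": {"temperature": 0.5, "max_tokens": 4096, "seed": 42},
    "eval_config": {"epochs": 3, "message_limit": 50},
    "model_roles": {"grader": {"model": "openai/gpt-4o"}},
}

# No "task" or "model" key: both are left for the command line to supply.
with open("paper_config.json", "w") as f:
    json.dump(paper_config, f, indent=2)

with open("paper_config.json") as f:
    print(f.read())
```

The file would then be used in place of paper_config.yaml in the command shown earlier, with --model still given on the CLI.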
To generate a run config from an existing eval log, use inspect log export-config, which writes the realised configuration as --run-config-compatible YAML:
inspect log export-config logs/my_run.eval > run.yaml
inspect eval --run-config run.yaml

See Exporting Run Config for details.
Scorer Override
The scorer can only be overridden with task_with() during a live eval; there is no eval() parameter or CLI flag for it:
task_with(my_task(), scorer=my_custom_scorer())

Some task authors expose scorer selection as a task parameter, which can then be set with -T:

inspect eval my_task.py -T scorer=original

This is a convention rather than a framework feature: the @task function must explicitly handle the parameter.
You can re-score an existing log file with a different scorer using inspect score. The --scorer flag accepts a name (any function decorated with @scorer, see Custom Scorers) or a file.py@name reference:
# scorer registered via @scorer decorator
inspect score log_file.eval --scorer my_scorer

# scorer defined in a file
inspect score log_file.eval --scorer scorers.py@custom_scorer

Common Patterns
When consuming a task from a package (such as inspect_evals) and customising it, here is a recommended approach for each scenario:
| Need | How |
|---|---|
| Different model | eval() / --model |
| Different temperature or max_tokens | eval() / --temperature / --max-tokens |
| Bundle of generation params | --generate-config config.yaml |
| Full run config (paper reproduction) | --run-config run.yaml |
| Different solver | eval(solver=...) / --solver / task_with() |
| Different scorer | task_with(task, scorer=...) |
| Different grader model | --model-role grader=... / eval(model_roles=) |
| Different metrics | task_with(task, metrics=[...]) |
| Subset of samples | --limit / --sample-id |
| Different epochs | --epochs |
Every component except scorer, dataset, and metrics can be overridden without modifying the task’s source. If the task author uses get_model(role="grader") for model-graded scoring, the grader model is also overridable at runtime via --model-role.
Packaging
A convenient way to distribute tasks is to include them in a Python package. This makes it very easy for others to run your task and ensure they have all of the required dependencies.
Tasks in packages can be registered such that users can easily refer to them by name from the CLI. For example, the Inspect Evals package includes a suite of tasks that can be run as follows:
inspect eval inspect_evals/gaia
inspect eval inspect_evals/swe_bench

Example
Here’s an example that walks through all of the requirements for registering tasks in packages. Let’s say your package is named evals and has a task named mytask in the tasks.py file:
evals/
  evals/
    tasks.py
    _registry.py
  pyproject.toml
The _registry.py file serves as a place to import things that you want registered with Inspect. For example:
_registry.py
from .tasks import mytask

You can then register mytask (and anything else imported into _registry.py) as a setuptools entry point. This will ensure that inspect can resolve references to your package from the CLI. Here is how this looks in pyproject.toml:

# setuptools
[project.entry-points.inspect_ai]
evals = "evals._registry"

# Poetry
[tool.poetry.plugins.inspect_ai]
evals = "evals._registry"

Now, anyone that has installed your package can run the task as follows:

inspect eval evals/mytask

The same packaging mechanism works for solvers, scorers, and tools. See Components for how to distribute and reference each component type.
Hugging Face
Datasets hosted on Hugging Face Hub can include an eval.yaml file that provides Inspect task definitions. For example, the OpenEvals/aime_24 dataset can be evaluated with:
inspect eval hf/OpenEvals/aime_24 --model openai/gpt-5

Here are the eval.yaml definitions for several Hugging Face datasets:
A dataset’s eval.yaml file defines a list of tasks. Here are the fields that can be included in a task definition and how they are used in constructing Task instances:
| Field | Default | Usage |
|---|---|---|
| config | "default" | hf_dataset(name) |
| split | "test" | hf_dataset(split) |
| field_spec | None | hf_dataset(sample_fields) |
| shuffle_choices | None | dataset.shuffle_choices() |
| epochs | 1 | Epochs(epochs) |
| epoch_reducer | "mean" | Epochs(epoch_reducer) |
| solvers | None | Task(solver) |
| scorer | None | Task(scorer) |
| id | None | hf/org/dataset/name |
- field_spec.choices can be either a single string (the key for one field in each record) or a list of strings (multiple fields, whose values will form the choices list for each sample).
- field_spec.target can be:
  - A literal value, specified as literal:<value>, where <value> will be used directly as the target.
  - A field name whose value is a letter or an integer; an integer (e.g., 0, 1, 2) will be mapped to a letter (A, B, C, etc.) for use as the target.
- field_spec.input_image is an optional field name for multimodal tasks. When specified, it should reference a field containing image data as a data URI (base64 encoded). The image will be combined with the text input to create a multimodal chat message.
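The integer-to-letter mapping mentioned for field_spec.target can be illustrated with a small helper. This is a sketch of the mapping as described, not Inspect's own code; the function name target_letter and the limit of 26 letters are assumptions made for the example.

```python
def target_letter(value):
    """Map an integer target (0, 1, 2, ...) to a letter (A, B, C, ...).

    Values that are already strings are returned unchanged.
    """
    if isinstance(value, int):
        if not 0 <= value < 26:
            raise ValueError(f"target index out of range: {value}")
        return chr(ord("A") + value)
    return value


for raw in (0, 1, 2, "B"):
    print(repr(raw), "->", target_letter(raw))
```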
Multiple Tasks
Datasets can define multiple named tasks. For example, the OpenEvals/MuSR dataset defines 3 tasks: musr:murder_mysteries, musr:object_placements, and musr:team_allocation. If you call inspect eval with no task qualification, all 3 tasks will be run. If you append a task name, only that task will be run:
# run all 3 tasks defined by OpenEvals/MuSR
inspect eval hf/OpenEvals/MuSR --model openai/gpt-5

# run only the musr:murder_mysteries task
inspect eval hf/OpenEvals/MuSR/musr:murder_mysteries --model openai/gpt-5

Note that when running multiple tasks, you may want to increase --max-tasks for more concurrency:

inspect eval hf/OpenEvals/MuSR --model openai/gpt-5 --max-tasks 3

Revisions
All of the examples above execute evals from the main branch. You can alternatively execute from a branch, tag, or revision hash by appending an @ qualifier. For example:
inspect eval hf/OpenEvals/MuSR@df154a5 --model openai/gpt-5

Exploratory
When developing tasks and solvers, you often want to explore how changing prompts, generation options, solvers, and models affect performance on a task. You can do this by creating multiple tasks with varying parameters and passing them all to the eval_set() function.
Returning to the example from above, the system and grader parameters point to files we are using as system message and grader model templates. At the outset we might want to explore every possible combination of these parameters, along with different models. We can use the itertools.product function to do this:
from itertools import product

from inspect_ai import eval_set

# 'grid' will be the Cartesian product (every combination) of all parameters
params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"],
    "grader_model": ["openai/gpt-4o", "google/gemini-2.5-pro"],
}
grid = list(product(*(params[name] for name in params)))

# run the evals and capture the logs
# (eval_set() returns a success flag along with the logs)
success, logs = eval_set(
    [
        security_guide(system, grader, grader_model)
        for system, grader, grader_model in grid
    ],
    model=["google/gemini-2.5-flash", "mistral/mistral-large-latest"],
    log_dir="security-tasks"
)

# analyze the logs...
plot_results(logs)

Note that we also pass a list of models to try out the task on multiple models. This eval set will produce in total 16 tasks accounting for the parameter and model variation.
See the article on Eval Sets to learn more about using eval sets. See the article on Eval Logs for additional details on working with evaluation logs.
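The task count quoted above can be checked with a few lines of standard-library Python. This sketch reuses the parameter grid from the example and only counts combinations; it does not call eval_set().

```python
from itertools import product

params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"],
    "grader_model": ["openai/gpt-4o", "google/gemini-2.5-pro"],
}
models = ["google/gemini-2.5-flash", "mistral/mistral-large-latest"]

# One tuple per combination of parameter values: 2 * 2 * 2 = 8.
grid = list(product(*params.values()))

# Pair each tuple with the parameter names, e.g. to pass as keyword arguments.
variants = [dict(zip(params, combo)) for combo in grid]

print(len(grid))                # 8 task variants
print(len(grid) * len(models))  # 16 task/model pairs
```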
Inspect Flow
For larger or repeated explorations, Inspect Flow builds on this pattern. It’s a companion package for running and managing evaluations at scale, with declarative configuration, parameter sweeps (matrix patterns across tasks, models, and hyperparameters), reusable defaults, and reuse of evaluation logs across runs.