Tasks

Overview

This article documents both basic and advanced use of Inspect tasks, which are the fundamental unit of integration for datasets, solvers, and scorers. The following topics are explored:

  • Task Basics describes the core components and options of tasks.
  • Parameters covers adding parameters to tasks to make them flexible and adaptable.
  • Solvers describes how to create tasks that can be used with many different solvers.
  • Task Reuse documents how to flexibly derive new tasks from existing task definitions.
  • Configuration explains how to override task options at runtime with task_with(), environment variables, eval(), and the CLI.
  • Packaging illustrates how you can distribute tasks within Python packages.
  • Exploratory provides guidance on doing exploratory task development.

Task Basics

A task provides a recipe for an evaluation consisting minimally of a dataset, a solver, and a scorer (and possibly other options), and is returned from a function decorated with @task. For example:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate

@task
def security_guide():
    return Task(
        dataset=json_dataset("security_guide.json"),
        solver=[chain_of_thought(), generate()],
        scorer=model_graded_fact()
    )

For convenience, tasks always define a default solver. That said, it is often desirable to design tasks that can work with any solver so that you can experiment with different strategies. The Solvers section below goes into depth on how to create tasks that can be flexibly used with any solver.

Task Options

While many tasks can be defined with only a dataset, solver, and scorer, there are lots of other useful Task options. We won’t describe these options in depth here, but rather provide a list along with links to other sections of the documentation that cover their usage:

Option | Description | Docs
epochs | Epochs to run for each dataset sample. | Epochs
setup | Setup solver(s) to run prior to the main solver. | Sample Setup
cleanup | Cleanup function to call at task completion. | Task Cleanup
sandbox | Sandbox configuration for un-trusted code execution. | Sandboxing
approval | Approval policy for tool calls. | Tool Approval
metrics | Metrics to use in place of scorer metrics. | Metrics
model | Model for evaluation (typically specified by eval rather than the task). | Models
model_roles | Named models for use with get_model() (e.g. a grader). | Model Roles
config | Config for model generation (also typically specified in eval). | Generate Config
fail_on_error | Failure tolerance for samples. | Failure Threshold
continue_on_fail | Continue running after sample errors, failing only at the end. | Handling Errors
score_on_error | Score samples that error rather than failing the run. | Handling Errors
message_limit, token_limit, time_limit, working_limit, cost_limit | Limits to apply to sample execution. | Sample Limits
early_stopping | Stop a task early based on previously scored samples. | Early Stopping
name, display_name, version, metadata, tags | Identifying attributes recorded in the eval log. | Eval Logs
viewer | Log viewer config (e.g. how scanner results render). | Task Views

By and large, you don’t need to worry about these options until you want to use the features they are linked to.

Parameters

Task parameters make it easy to run variants of your task without changing its source code. Task parameters are simply the arguments to your @task decorated function. For example, here we provide parameters (and default values) for system and grader prompts, as well as the grader model:

security.py
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

@task
def security_guide(
    system="devops.txt",
    grader="expert.txt",
    grader_model="openai/gpt-4o"
):
    return Task(
        dataset=example_dataset("security_guide"),
        solver=[system_message(system), generate()],
        scorer=model_graded_fact(
            template=grader, model=grader_model
        )
    )

Let’s say we had an alternate system prompt in a file named "researcher.txt". We could run the task with this prompt as follows:

inspect eval security.py -T system="researcher.txt"

The -T CLI flag is used to specify parameter values. You can include multiple -T flags. For example:

inspect eval security.py \
   -T system="researcher.txt" -T grader="hacker.txt"
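
When you launch evaluations from a script, the repeated -T flags can be generated from a dictionary of parameter values. The following is a minimal sketch using only the standard library; build_eval_command is a hypothetical helper written for this illustration and is not part of Inspect.

```python
import shlex


def build_eval_command(task_file: str, task_args: dict) -> list:
    """Build an `inspect eval` argument list with one -T flag per task parameter.

    Hypothetical helper for illustration only; it is not part of Inspect.
    """
    command = ["inspect", "eval", task_file]
    for name, value in task_args.items():
        # each parameter becomes its own "-T name=value" pair
        command += ["-T", f"{name}={value}"]
    return command


command = build_eval_command(
    "security.py", {"system": "researcher.txt", "grader": "hacker.txt"}
)

# shlex.join renders the list as a shell-safe command line
print(shlex.join(command))
# inspect eval security.py -T system=researcher.txt -T grader=hacker.txt
```

The resulting list can be passed directly to subprocess.run(command, check=True), which avoids hand-quoting values.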

If you have several task parameters you want to specify together, you can put them in a YAML or JSON file and use the --task-config CLI option. For example:

config.yaml
system: "researcher.txt"
grader: "hacker.txt"

Reference this file from the CLI with:

inspect eval security.py --task-config=config.yaml

If you want to bundle task parameters together with model, generation, and solver settings in a single file, use --run-config instead. See Run Config File.

For a broader view of how task parameters relate to task_with(), environment variables, eval(), and CLI overrides, see Configuration.

Solvers

While tasks always include a default solver, you can also vary the solver to explore other strategies and elicitation techniques. This section covers best practices for creating solver-independent tasks.

Solver Parameter

You can substitute an alternate solver for the solver that is built in to your Task using the --solver command line parameter (or solver argument to the eval() function).

For example, let’s start with a simple CTF challenge task:

from inspect_ai import Task, task
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python
from inspect_ai.scorer import includes

@task
def ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            use_tools([
                bash(timeout=180),
                python(timeout=180)
            ]),
            generate()
        ],
        sandbox="docker",
        scorer=includes()
    )

This task uses the most naive solver possible (a simple tool use loop with no additional elicitation). That might be okay for initial task development, but we’ll likely want to try lots of different strategies. We start by breaking the solver into its own function and adding an alternative solver that uses a react() agent:

from inspect_ai import Task, task
from inspect_ai.agent import react
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import chain, generate, solver, use_tools
from inspect_ai.tool import bash, python


@solver
def ctf_tool_loop():
    return chain([
        use_tools([
            bash(timeout=180),
            python(timeout=180)
        ]),
        generate()
    ])

@solver
def ctf_agent(attempts: int = 3):
    return react(
        tools=[bash(timeout=180), python(timeout=180)],
        attempts=attempts,
    )


@task
def ctf():
    # return task
    return Task(
        dataset=read_dataset(),
        solver=ctf_tool_loop(),
        sandbox="docker",
        scorer=includes(),
    )

Note that we use the chain() function to combine multiple solvers into a composite one.

You can now switch between solvers when running the evaluation:

# run with the default solver (ctf_tool_loop)
inspect eval ctf.py

# run with the ctf agent solver
inspect eval ctf.py --solver=ctf_agent

# run with a different number of attempts
inspect eval ctf.py --solver=ctf_agent -S attempts=5

Note the use of the -S CLI option to pass an alternate value for attempts to the ctf_agent() solver.

Setup Parameter

In some cases, there will be important steps in the setup of a task that should not be substituted when another solver is used with the task. For example, you might have a step that does dynamic prompt engineering based on values in the sample metadata or you might have a step that initialises resources in a sample’s sandbox.

In these scenarios you can define a setup solver that is always run even when another solver is substituted. For example, here we adapt our initial example to include a setup step:

from inspect_ai.solver import Solver, solver

# prompt solver which should always be run
@solver
def ctf_prompt():
    async def solve(state, generate):
        # TODO: dynamic prompt engineering
        return state

    return solve

@task
def ctf(solver: Solver | None = None):
    # use default tool loop solver if no solver specified
    if solver is None:
        solver = ctf_tool_loop()

    # return task
    return Task(
        dataset=read_dataset(),
        setup=ctf_prompt(),
        solver=solver,
        sandbox="docker",
        scorer=includes()
    )

Task Cleanup

You can use the cleanup parameter to execute code at the end of each sample run. The cleanup function is passed the TaskState and is called both for successful runs and for runs where an exception is thrown. Extending the example from above:

from inspect_ai.solver import TaskState

async def ctf_cleanup(state: TaskState):
    # perform cleanup
    ...

Task(
    dataset=read_dataset(),
    setup=ctf_prompt(),
    solver=solver,
    cleanup=ctf_cleanup,
    scorer=includes()
)

Note that like solvers, cleanup functions should be async.

Task Reuse

The basic mechanism for task re-use is to create flexible and adaptable base @task functions (which often have many parameters) and then derive new higher-level tasks from them by creating additional @task functions that call the base function.

In some cases though you might not have full control over the base @task function (e.g. it’s published in a Python package you aren’t the maintainer of) but you nevertheless want to flexibly create derivative tasks from it. To do this, you can use the task_with() function, which provides a straightforward way to modify the properties of an existing task. The Configuration section below covers task_with() alongside the other ways to override task options at runtime.

For example, imagine you are dealing with a Task that hard-codes its sandbox to a particular configuration file included with the task, and further hard-codes its solver to a simple agent:

from inspect_ai import Task, task
from inspect_ai.agent import react
from inspect_ai.tool import bash
from inspect_ai.scorer import includes

@task
def hard_coded():
    return Task(
        dataset=read_dataset(),
        solver=react(tools=[bash()]),
        sandbox=("docker", "compose.yaml"),
        scorer=includes()
    )

Using task_with(), you can adapt this task to use a different solver and sandbox entirely. For example, here we import the original hard_coded() task from a hypothetical ctf_tasks package and provide it with a different solver and sandbox, as well as give it a message_limit (which we in turn also expose as a parameter of the adapted task):

from inspect_ai import task, task_with
from inspect_ai.solver import solver

from ctf_tasks import hard_coded

@solver
def my_custom_agent():
    ## custom agent implementation
    ...

@task
def adapted(message_limit: int = 20):
    return task_with(
        hard_coded(),  # original task definition
        solver=my_custom_agent(),
        sandbox=("docker", "custom-compose.yaml"),
        message_limit=message_limit
    )

Tasks are recipes for an evaluation and represent the convergence of many considerations (datasets, solvers, sandbox environments, limits, and scoring). Task variations often lie at the intersection of these, and the task_with() function is intended to help you produce exactly the variation you need for a given evaluation.

Note that task_with() modifies the passed task in-place, so if you want to create multiple variations of a single task using task_with() you should create the underlying task multiple times (once for each call to task_with()). For example:

adapted1 = task_with(hard_coded(), ...)
adapted2 = task_with(hard_coded(), ...)
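
To see why the base task should be created once per variation, here is a small standard-library sketch of the in-place behaviour. StandInTask and stand_in_task_with are stand-ins invented for this illustration; they only mimic the documented "modify and return the same object" semantics and are not the real Task or task_with().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInTask:
    """Stand-in for a task object; NOT the real inspect_ai.Task."""
    solver: str
    message_limit: Optional[int] = None


def stand_in_task_with(task: StandInTask, **overrides) -> StandInTask:
    """Mimic in-place modification: mutate the passed object and return it."""
    for name, value in overrides.items():
        setattr(task, name, value)
    return task


def hard_coded() -> StandInTask:
    # each call builds a fresh object, like calling a @task function
    return StandInTask(solver="react")


# Pitfall: reusing one base object means both "variants" are the same object,
# so the second call overwrites the first.
base = hard_coded()
shared1 = stand_in_task_with(base, message_limit=10)
shared2 = stand_in_task_with(base, message_limit=20)
print(shared1 is shared2, shared1.message_limit)  # True 20

# Recommended: create the underlying task once for each variation.
adapted1 = stand_in_task_with(hard_coded(), message_limit=10)
adapted2 = stand_in_task_with(hard_coded(), message_limit=20)
print(adapted1.message_limit, adapted2.message_limit)  # 10 20
```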

Configuration

A task definition provides defaults for everything an evaluation needs, but you will often want to run a task with different settings without editing its source. Task options can be set or overridden at four layers, each taking precedence over the ones before it:

  1. Task definition: defaults baked into the @task function and Task() constructor.
  2. task_with(): programmatic overrides applied to a task before passing it to eval().
  3. Environment variables / .env files: project or session defaults set outside code.
  4. eval() / CLI: runtime overrides, which take highest precedence.
Precedence order, lowest to highest (each layer overrides those before it): Task definition < task_with() < .env / env vars < eval() / CLI
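
The layering can be pictured as a chain of lookups in which the highest-precedence layer is consulted first. The sketch below models that ordering with collections.ChainMap. The option values are illustrative assumptions, and this is a mental model of the precedence rules rather than Inspect's actual implementation.

```python
from collections import ChainMap

# Illustrative values only; each dict stands for the options set at one layer.
task_definition = {"temperature": 0.5, "max_tokens": 2048, "epochs": 1}
task_with_overrides = {"temperature": 0.0}
env_vars = {"epochs": 3}
eval_or_cli = {"max_tokens": 4096}

# ChainMap searches its maps left to right, so the highest-precedence
# layer is listed first and the task definition last.
resolved = dict(
    ChainMap(eval_or_cli, env_vars, task_with_overrides, task_definition)
)

print(resolved["temperature"])  # 0.0  (from task_with)
print(resolved["epochs"])       # 3    (from the environment)
print(resolved["max_tokens"])   # 4096 (from eval() / CLI)
```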

The first two layers are described earlier in this article: defaults and parameters in the task definition, and task_with() for adapting a task you don’t control. The sections below cover the remaining two layers, then provide a reference for what can be set where.

Environment Variables

Every CLI flag can be set as an environment variable using the INSPECT_EVAL_ prefix (with hyphens converted to underscores). Set these in the shell, or place them in a .env file that Inspect reads automatically from the current directory (searching parent directories if not found). Use this layer for project or session defaults you want applied across runs without specifying them each time:

.env
INSPECT_EVAL_MODEL=anthropic/claude-sonnet-4-5
INSPECT_EVAL_TEMPERATURE=0.0
INSPECT_EVAL_MAX_CONNECTIONS=20
INSPECT_EVAL_MAX_RETRIES=5

Variables set in the shell take precedence over values in a .env file. See Options for details on .env file handling.
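
As a rough sketch of the naming rule and the shell-over-.env precedence described above: flag_to_env_var and resolve are hypothetical helpers, and the upper-casing step is an assumption inferred from the .env example rather than something stated explicitly.

```python
PREFIX = "INSPECT_EVAL_"


def flag_to_env_var(flag: str) -> str:
    """Map a CLI flag such as --max-connections to an environment variable name.

    Assumes the name is upper-cased, as in the .env example above.
    """
    return PREFIX + flag.lstrip("-").replace("-", "_").upper()


def resolve(flag: str, dotenv: dict, shell: dict):
    """Return the shell value if set, otherwise the .env value, otherwise None."""
    name = flag_to_env_var(flag)
    return shell.get(name, dotenv.get(name))


# Illustrative values: the shell overrides one of the two .env settings.
dotenv = {"INSPECT_EVAL_MAX_CONNECTIONS": "20", "INSPECT_EVAL_TEMPERATURE": "0.0"}
shell = {"INSPECT_EVAL_MAX_CONNECTIONS": "40"}

print(flag_to_env_var("--max-connections"))         # INSPECT_EVAL_MAX_CONNECTIONS
print(resolve("--max-connections", dotenv, shell))  # 40
print(resolve("--temperature", dotenv, shell))      # 0.0
```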

eval() and CLI

Parameters passed to eval() or on the inspect eval command line take highest precedence, and apply to all tasks being evaluated in the call.

from inspect_ai import eval

eval(
    simpleqa(),
    model="anthropic/claude-sonnet-4-5",
    temperature=0.0,
    max_tokens=4096,
    epochs=5,
    limit=100,
    message_limit=50,
    model_roles={"grader": "google/gemini-2.0-flash"},
)

The same overrides on the command line:

inspect eval inspect_evals/simpleqa \
    --model anthropic/claude-sonnet-4-5 \
    --temperature 0.0 \
    --max-tokens 4096 \
    --epochs 5 \
    --limit 100 \
    --message-limit 50 \
    --model-role grader=google/gemini-2.0-flash

See Eval Options for the full list of CLI flags.

Override Reference

The table below lists task and runtime parameters and the layers at which each can be set:

Parameter Task task_with eval CLI flag
Task structure
dataset yes yes
setup yes yes
solver yes yes yes --solver (name or file.py@name)
cleanup yes yes
scorer yes yes
metrics yes yes
Model
model yes yes yes --model
config (includes temperature, max_tokens, etc.) yes yes yes (via **kwargs) individual flags or --generate-config
model_roles yes yes yes --model-role
Execution limits
epochs yes yes yes --epochs
message_limit yes yes yes --message-limit
token_limit yes yes yes --token-limit
time_limit yes yes yes --time-limit
working_limit yes yes yes --working-limit
cost_limit yes yes yes --cost-limit
early_stopping yes yes
Error handling
fail_on_error yes yes yes --fail-on-error
continue_on_fail yes yes yes --continue-on-fail
retry_on_error yes --retry-on-error
score_on_error yes yes yes --score-on-error
debug_errors yes --debug-errors
Environment
sandbox yes yes yes --sandbox
sandbox_cleanup yes yes --no-sandbox-cleanup
approval yes yes yes --approval
Task identity
name yes yes
version yes yes
metadata yes yes (overwrites) yes (merges) --metadata
tags yes yes (overwrites) yes (merges) --tags
Sample selection
limit yes --limit
sample_id yes --sample-id
sample_shuffle yes --sample-shuffle
Eval-level controls
task_args args/kwargs yes -T key=value
score yes --no-score
score_display yes --no-score-display
trace yes --trace

Blank cells indicate that a parameter cannot be set at that layer. The task_args row refers to setting these fields as arguments of the Task object, as opposed to passing a task_args dictionary.

Generation Config

GenerateConfig parameters (temperature, max_tokens, top_p, and so on) can be set at every layer.

In the task definition via config:

Task(
    ...,
    config=GenerateConfig(temperature=0.5, max_tokens=2048)
)

With task_with() via config:

task_with(my_task(), config=GenerateConfig(temperature=0.0))

With eval() as keyword arguments:

eval(my_task(), temperature=0.0, max_tokens=4096)

On the CLI as individual flags:

inspect eval my_task.py --temperature 0.0 --max-tokens 4096

Or from a YAML/JSON file using --generate-config:

inspect eval my_task.py --generate-config config.yaml

where config.yaml contains GenerateConfig fields:

config.yaml
temperature: 0.5
max_tokens: 2048

Individual CLI flags (e.g. --temperature) take precedence over values in the config file. To bundle generation parameters alongside a full eval configuration (task, model, model roles, solver), use --run-config instead (see Run Config File).

Model Roles

Model roles assign models to named purposes within a task (for example, a “grader” model for scoring). They can be set on Task, with task_with(), with eval(), or on the CLI with --model-role (see the override reference for where each form fits). The most common pattern:

Task(..., model_roles={"grader": "openai/gpt-4o"})
eval(my_task(), model_roles={"grader": "google/gemini-2.0-flash"})

Inside a solver or scorer, resolve the role with get_model():

model = get_model(role="grader", default="openai/gpt-4o")

For inline YAML/JSON examples and role-resolution details, see Model Roles.

Run Config File

The --run-config option specifies a single YAML or JSON file that captures a full eval configuration (task, model, model roles, generation parameters, solver, and eval settings) in one place. CLI flags still override values from the file.

inspect eval --run-config run.yaml

The file schema mirrors the structure of the corresponding eval() parameters:

run.yaml
task:
  task: inspect_evals/simpleqa
  args:
    split: test

model:
  model: anthropic/claude-sonnet-4-5
  args:
    max_retries: 3

model_roles:
  grader:
    model: openai/gpt-4o
    config:
      temperature: 0.0

generate_config:
  temperature: 0.5
  max_tokens: 4096
  seed: 42

solver:
  solver: my_solvers.py@chain_of_thought
  args:
    cot_template: detailed

eval_config:
  limit: 100
  epochs: 3
  message_limit: 50

All top-level keys are optional. This lets you create “paper config” files that record the generation and eval settings from a paper without hard-coding a specific model, leaving the model to be supplied on the CLI:

# paper_config.yaml specifies only generate_config, eval_config, and model_roles
inspect eval inspect_evals/simpleqa \
    --model anthropic/claude-sonnet-4-5 \
    --run-config paper_config.yaml
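
Since the run config may be YAML or JSON, a "paper config" can also be produced from a script. This sketch writes one as JSON with the standard library. It assumes the JSON form uses the same key names as the run.yaml example above, and the values are illustrative.

```python
import json
import tempfile
from pathlib import Path

# Settings to record; the values mirror the run.yaml example and are illustrative.
paper_config = {
    "generate_config": {"temperature": 0.5, "max_tokens": 4096, "seed": 42},
    "eval_config": {"limit": 100, "epochs": 3, "message_limit": 50},
    "model_roles": {
        "grader": {"model": "openai/gpt-4o", "config": {"temperature": 0.0}}
    },
}

# Written to a temporary directory here; in practice, keep it with your project.
config_path = Path(tempfile.mkdtemp()) / "paper_config.json"
config_path.write_text(json.dumps(paper_config, indent=2))

# There is deliberately no "model" key: the model is supplied with --model.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))  # ['eval_config', 'generate_config', 'model_roles']
```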

To run with a different value than the file specifies, pass the corresponding flag:

inspect eval --run-config run.yaml --temperature 0.9

--run-config cannot be combined with --generate-config, --task-config, or --solver-config. Use --run-config for a single file; use the individual options to compose configuration from multiple files.

To generate a run config from an existing eval log, use inspect log export-config, which writes the realised configuration as --run-config-compatible YAML:

inspect log export-config logs/my_run.eval > run.yaml
inspect eval --run-config run.yaml

See Exporting Run Config for details.

Scorer Override

The scorer can only be overridden with task_with() during a live eval; there is no eval() parameter or CLI flag for it:

task_with(my_task(), scorer=my_custom_scorer())

Some task authors expose scorer selection as a task parameter, which can then be set with -T:

inspect eval my_task.py -T scorer=original

This is a convention rather than a framework feature: the @task function must explicitly handle the parameter.

Tip: Re-scoring existing logs

You can re-score an existing log file with a different scorer using inspect score. The --scorer flag accepts a name (any function decorated with @scorer, see Custom Scorers) or a file.py@name reference:

# scorer registered via @scorer decorator
inspect score log_file.eval --scorer my_scorer

# scorer defined in a file
inspect score log_file.eval --scorer scorers.py@custom_scorer

Common Patterns

When consuming a task from a package (such as inspect_evals) and customising it, here is a recommended approach for each scenario:

Need | How
Different model | eval() / --model
Different temperature or max_tokens | eval() / --temperature / --max-tokens
Bundle of generation params | --generate-config config.yaml
Full run config (paper reproduction) | --run-config run.yaml
Different solver | eval(solver=...) / --solver / task_with()
Different scorer | task_with(task, scorer=...)
Different grader model | --model-role grader=... / eval(model_roles=)
Different metrics | task_with(task, metrics=[...])
Subset of samples | --limit / --sample-id
Different epochs | --epochs

Every component except scorer, dataset, and metrics can be overridden without modifying the task’s source. If the task author uses get_model(role="grader") for model-graded scoring, the grader model is also overridable at runtime via --model-role.

Packaging

A convenient way to distribute tasks is to include them in a Python package. This makes it very easy for others to run your task and ensure they have all of the required dependencies.

Tasks in packages can be registered such that users can easily refer to them by name from the CLI. For example, the Inspect Evals package includes a suite of tasks that can be run as follows:

inspect eval inspect_evals/gaia
inspect eval inspect_evals/swe_bench

Example

Here’s an example that walks through all of the requirements for registering tasks in packages. Let’s say your package is named evals and has a task named mytask in the tasks.py file:

evals/
  evals/
    tasks.py
    _registry.py
  pyproject.toml

The _registry.py file serves as a place to import things that you want registered with Inspect. For example:

_registry.py
from .tasks import mytask

You can then register mytask (and anything else imported into _registry.py) as a setuptools entry point. This will ensure that inspect can resolve references to your package from the CLI. Here is how this looks in pyproject.toml:

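# Standard [project] metadata (setuptools and other PEP 621 build backends)
[project.entry-points.inspect_ai]
evals = "evals._registry"

# Poetry (use this form instead if your project is configured under [tool.poetry])
[tool.poetry.plugins.inspect_ai]
evals = "evals._registry"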

Now, anyone that has installed your package can run the task as follows:

inspect eval evals/mytask

The same packaging mechanism works for solvers, scorers, and tools. See Components for how to distribute and reference each component type.

Hugging Face

Datasets hosted on Hugging Face Hub can include an eval.yaml file that provides Inspect task definitions. For example, the OpenEvals/aime_24 dataset can be evaluated with:

inspect eval hf/OpenEvals/aime_24 --model openai/gpt-5


A dataset’s eval.yaml file defines a list of tasks. Here are the fields that can be included in a task definition and how they are used in constructing Task instances:

Field | Default | Usage
config | “default” | hf_dataset(name)
split | “test” | hf_dataset(split)
field_spec | None | hf_dataset(sample_fields)
shuffle_choices | None | dataset.shuffle_choices()
epochs | 1 | Epochs(epochs)
epoch_reducer | “mean” | Epochs(epoch_reducer)
solvers | None | Task(solver)
scorer | None | Task(scorer)
id | None | hf/org/dataset/name
  • field_spec.choices can be either a single string (the key for one field in each record) or a list of strings (multiple fields, whose values will form the choices list for each sample).
  • field_spec.target can be:
    • A literal value, specified as literal:<value>, where <value> will be used directly as the target.
    • The name of a field that holds either a letter or an integer; an integer (e.g., 0, 1, 2) is mapped to a letter (A, B, C, etc.) for use as the target.
  • field_spec.input_image is an optional field name for multimodal tasks. When specified, it should reference a field containing image data as a data URI (base64 encoded). The image will be combined with the text input to create a multimodal chat message.
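
The field_spec.target rules listed above can be sketched in a few lines of Python. resolve_target is a hypothetical helper written to illustrate the literal: prefix and the integer-to-letter mapping; it is not Inspect's implementation.

```python
def resolve_target(target_spec: str, record: dict) -> str:
    """Resolve a sample target from a target spec and a dataset record.

    Hypothetical helper for illustration; not Inspect's implementation.
    """
    prefix = "literal:"
    if target_spec.startswith(prefix):
        # literal:<value> is used directly as the target
        return target_spec[len(prefix):]

    # otherwise the spec names a field in the record
    value = record[target_spec]
    if isinstance(value, int):
        # integer index 0, 1, 2, ... maps to letter A, B, C, ...
        return chr(ord("A") + value)
    return str(value)


print(resolve_target("literal:yes", {}))          # yes
print(resolve_target("answer", {"answer": 2}))    # C
print(resolve_target("answer", {"answer": "B"}))  # B
```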

Multiple Tasks

Datasets can define multiple named tasks. For example, the OpenEvals/MuSR dataset defines 3 tasks: musr:murder_mysteries, musr:object_placements, and musr:team_allocation. If you call inspect eval with no task qualification, all 3 tasks will be run. If you append a task name, only that task will be run:

# run all 3 tasks defined by OpenEvals/MuSR
inspect eval hf/OpenEvals/MuSR --model openai/gpt-5

# run only the musr:murder_mysteries task
inspect eval hf/OpenEvals/MuSR/musr:murder_mysteries --model openai/gpt-5

Note that when running multiple tasks, you may want to increase --max-tasks for more concurrency:

inspect eval hf/OpenEvals/MuSR --model openai/gpt-5 --max-tasks 3

Revisions

All of the examples above execute evals from the main branch. You can alternatively execute from a branch, tag, or revision hash by appending an @ qualifier. For example:

inspect eval hf/OpenEvals/MuSR@df154a5 --model openai/gpt-5

Exploratory

When developing tasks and solvers, you often want to explore how changing prompts, generation options, solvers, and models affect performance on a task. You can do this by creating multiple tasks with varying parameters and passing them all to the eval_set() function.

Returning to the example from above, the system and grader parameters point to files we are using as system message and grader model templates. At the outset we might want to explore every possible combination of these parameters, along with different models. We can use the itertools.product function to do this:

from itertools import product

from inspect_ai import eval_set

# 'grid' will hold every combination (Cartesian product) of the parameter values
params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"],
    "grader_model": ["openai/gpt-4o", "google/gemini-2.5-pro"],
}
grid = list(product(*(params[name] for name in params)))

# run the evals and capture the logs (eval_set returns a success flag and the logs)
success, logs = eval_set(
    [
        security_guide(system, grader, grader_model)
        for system, grader, grader_model in grid
    ],
    model=["google/gemini-2.5-flash", "mistral/mistral-large-latest"],
    log_dir="security-tasks"
)

# analyze the logs...
plot_results(logs)

Note that we also pass a list of models to try out the task on multiple models. In total, this eval set will run 16 tasks (8 parameter combinations × 2 models).
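
The task count can be checked without running anything, using only the standard library and the same parameter lists as the example:

```python
from itertools import product

params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"],
    "grader_model": ["openai/gpt-4o", "google/gemini-2.5-pro"],
}
models = ["google/gemini-2.5-flash", "mistral/mistral-large-latest"]

# Cartesian product of the parameter values: 2 x 2 x 2 combinations
grid = list(product(*params.values()))

print(len(grid))                # 8
print(len(grid) * len(models))  # 16
```

Each additional parameter value multiplies the total, so it is worth computing the grid size before launching a large sweep.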

See the article on Eval Sets to learn more about using eval sets. See the article on Eval Logs for additional details on working with evaluation logs.

Inspect Flow

For larger or repeated explorations, Inspect Flow builds on this pattern. It’s a companion package for running and managing evaluations at scale, with declarative configuration, parameter sweeps (matrix patterns across tasks, models, and hyperparameters), reusable defaults, and reuse of evaluation logs across runs.