inspect_ai
Evaluation
eval
Evaluate tasks using a Model.
def eval(
    tasks: Tasks,
    model: str | Model | list[str] | list[Model] | None | NotGiven = NOT_GIVEN,
    model_base_url: str | None = None,
    model_args: dict[str, Any] | str = dict(),
    model_roles: dict[str, str | Model] | None = None,
    task_args: dict[str, Any] | str = dict(),
    sandbox: SandboxEnvironmentType | None = None,
    sandbox_cleanup: bool | None = None,
    solver: Solver | SolverSpec | Agent | list[Solver] | None = None,
    tags: list[str] | None = None,
    metadata: dict[str, Any] | None = None,
    trace: bool | None = None,
    display: DisplayType | None = None,
    approval: str | list[ApprovalPolicy] | None = None,
    log_level: str | None = None,
    log_level_transcript: str | None = None,
    log_dir: str | None = None,
    log_format: Literal["eval", "json"] | None = None,
    limit: int | tuple[int, int] | None = None,
    sample_id: str | int | list[str] | list[int] | list[str | int] | None = None,
    sample_shuffle: bool | int | None = None,
    epochs: int | Epochs | None = None,
    fail_on_error: bool | float | None = None,
    continue_on_fail: bool | None = None,
    retry_on_error: int | None = None,
    debug_errors: bool | None = None,
    message_limit: int | None = None,
    token_limit: int | None = None,
    time_limit: int | None = None,
    working_limit: int | None = None,
    max_samples: int | None = None,
    max_tasks: int | None = None,
    max_subprocesses: int | None = None,
    max_sandboxes: int | None = None,
    log_samples: bool | None = None,
    log_realtime: bool | None = None,
    log_images: bool | None = None,
    log_buffer: int | None = None,
    log_shared: bool | int | None = None,
    log_header_only: bool | None = None,
    run_samples: bool = True,
    score: bool = True,
    score_display: bool | None = None,
    eval_set_id: str | None = None,
    **kwargs: Unpack[GenerateConfigArgs],
) -> list[EvalLog]
- tasks (Tasks): Task(s) to evaluate. If None, attempt to evaluate a task in the current working directory.
- model (str | Model | list[str] | list[Model] | None | NotGiven): Model(s) for evaluation. If not specified, use the value of the INSPECT_EVAL_MODEL environment variable. Specify None to define no default model(s), which will leave model usage entirely up to tasks.
- model_base_url (str | None): Base URL for communicating with the model API.
- model_args (dict[str, Any] | str): Model creation args (as a dictionary or as a path to a JSON or YAML config file).
- model_roles (dict[str, str | Model] | None): Named roles for use in get_model().
- task_args (dict[str, Any] | str): Task creation arguments (as a dictionary or as a path to a JSON or YAML config file).
- sandbox (SandboxEnvironmentType | None): Sandbox environment type (or optionally a str or tuple with a shorthand spec).
- sandbox_cleanup (bool | None): Cleanup sandbox environments after the task completes (defaults to True).
- solver (Solver | SolverSpec | Agent | list[Solver] | None): Alternative solver for task(s). Optional (uses task solver by default).
- tags (list[str] | None): Tags to associate with this evaluation run.
- metadata (dict[str, Any] | None): Metadata to associate with this evaluation run.
- trace (bool | None): Trace message interactions with the evaluated model to the terminal.
- display (DisplayType | None): Task display type (defaults to ‘full’).
- approval (str | list[ApprovalPolicy] | None): Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.
- log_level (str | None): Level for logging to the console: “debug”, “http”, “sandbox”, “info”, “warning”, “error”, “critical”, or “notset” (defaults to “warning”).
- log_level_transcript (str | None): Level for logging to the log file (defaults to “info”).
- log_dir (str | None): Output path for logging results (defaults to file log in the ./logs directory).
- log_format (Literal["eval", "json"] | None): Format for writing log files (defaults to “eval”, the native high-performance format).
- limit (int | tuple[int, int] | None): Limit evaluated samples (defaults to all samples).
- sample_id (str | int | list[str] | list[int] | list[str | int] | None): Evaluate specific sample(s) from the dataset. Use plain ids or preface with task names as required to disambiguate ids across tasks (e.g. popularity:10).
- sample_shuffle (bool | int | None): Shuffle the order of samples (pass a seed to make the order deterministic).
- epochs (int | Epochs | None): Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to “mean”).
- fail_on_error (bool | float | None): True to fail on the first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if that proportion of total samples fails; a value greater than 1 to fail the eval if that count of samples fails.
- continue_on_fail (bool | None): True to continue running and only fail at the end if the fail_on_error condition is met; False to fail the eval immediately when the fail_on_error condition is met (default).
- retry_on_error (int | None): Number of times to retry samples if they encounter errors (by default, no retries occur).
- debug_errors (bool | None): Raise task errors (rather than logging them) so they can be debugged (defaults to False).
- message_limit (int | None): Limit on total messages used for each sample.
- token_limit (int | None): Limit on total tokens used for each sample.
- time_limit (int | None): Limit on clock time (in seconds) for samples.
- working_limit (int | None): Limit on working time (in seconds) for each sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
- max_samples (int | None): Maximum number of samples to run in parallel (default is max_connections).
- max_tasks (int | None): Maximum number of tasks to run in parallel (defaults to the number of models being evaluated).
- max_subprocesses (int | None): Maximum number of subprocesses to run in parallel (default is os.cpu_count()).
- max_sandboxes (int | None): Maximum number of sandboxes (per-provider) to run in parallel.
- log_samples (bool | None): Log detailed samples and scores (defaults to True).
- log_realtime (bool | None): Log events in realtime (enables live viewing of samples in inspect view). Defaults to True.
- log_images (bool | None): Log base64 encoded versions of images, even if specified as a filename or URL (defaults to False).
- log_buffer (int | None): Number of samples to buffer before writing the log file. If not specified, an appropriate default for the format and filesystem is chosen (10 for most cases, 100 for JSON logs on remote filesystems).
- log_shared (bool | int | None): Sync sample events to the log directory so that users on other systems can see log updates in realtime (defaults to no syncing). Specify True to sync every 10 seconds, or an integer to sync every n seconds.
- log_header_only (bool | None): If True, return only log headers rather than full logs with samples (defaults to False).
- run_samples (bool): Run samples. If False, a log with status == "started" and an empty samples list is returned.
- score (bool): Score output (defaults to True).
- score_display (bool | None): Show scoring metrics in realtime (defaults to True).
- eval_set_id (str | None): Unique id for the eval set (this is passed from eval_set() and should not be specified directly).
- **kwargs (Unpack[GenerateConfigArgs]): Model generation options.
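For orientation, a minimal sketch of calling eval() from Python; the task file and model name are placeholders, and only a few of the options above are exercised:
from inspect_ai import eval

# Evaluate a task file against a single model, limiting the run to 10
# samples and writing logs to a custom directory. "security_guide.py" and
# "openai/gpt-4o" are illustrative values, not API defaults.
logs = eval(
    "security_guide.py",
    model="openai/gpt-4o",
    limit=10,
    log_dir="./logs",
    max_connections=10,  # forwarded via **kwargs (GenerateConfigArgs)
)
print(logs[0].status)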
eval_retry
Retry a previously failed evaluation task.
def eval_retry(
    tasks: str | EvalLogInfo | EvalLog | list[str] | list[EvalLogInfo] | list[EvalLog],
    log_level: str | None = None,
    log_level_transcript: str | None = None,
    log_dir: str | None = None,
    log_format: Literal["eval", "json"] | None = None,
    max_samples: int | None = None,
    max_tasks: int | None = None,
    max_subprocesses: int | None = None,
    max_sandboxes: int | None = None,
    sandbox_cleanup: bool | None = None,
    trace: bool | None = None,
    display: DisplayType | None = None,
    fail_on_error: bool | float | None = None,
    continue_on_fail: bool | None = None,
    retry_on_error: int | None = None,
    debug_errors: bool | None = None,
    log_samples: bool | None = None,
    log_realtime: bool | None = None,
    log_images: bool | None = None,
    log_buffer: int | None = None,
    log_shared: bool | int | None = None,
    score: bool = True,
    score_display: bool | None = None,
    max_retries: int | None = None,
    timeout: int | None = None,
    attempt_timeout: int | None = None,
    max_connections: int | None = None,
) -> list[EvalLog]
- tasks (str | EvalLogInfo | EvalLog | list[str] | list[EvalLogInfo] | list[EvalLog]): Log files for task(s) to retry.
- log_level (str | None): Level for logging to the console: “debug”, “http”, “sandbox”, “info”, “warning”, “error”, “critical”, or “notset” (defaults to “warning”).
- log_level_transcript (str | None): Level for logging to the log file (defaults to “info”).
- log_dir (str | None): Output path for logging results (defaults to file log in the ./logs directory).
- log_format (Literal["eval", "json"] | None): Format for writing log files (defaults to “eval”, the native high-performance format).
- max_samples (int | None): Maximum number of samples to run in parallel (default is max_connections).
- max_tasks (int | None): Maximum number of tasks to run in parallel (defaults to the number of models being evaluated).
- max_subprocesses (int | None): Maximum number of subprocesses to run in parallel (default is os.cpu_count()).
- max_sandboxes (int | None): Maximum number of sandboxes (per-provider) to run in parallel.
- sandbox_cleanup (bool | None): Cleanup sandbox environments after the task completes (defaults to True).
- trace (bool | None): Trace message interactions with the evaluated model to the terminal.
- display (DisplayType | None): Task display type (defaults to ‘full’).
- fail_on_error (bool | float | None): True to fail on a sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if that proportion of total samples fails; a value greater than 1 to fail the eval if that count of samples fails.
- continue_on_fail (bool | None): True to continue running and only fail at the end if the fail_on_error condition is met; False to fail the eval immediately when the fail_on_error condition is met (default).
- retry_on_error (int | None): Number of times to retry samples if they encounter errors (by default, no retries occur).
- debug_errors (bool | None): Raise task errors (rather than logging them) so they can be debugged (defaults to False).
- log_samples (bool | None): Log detailed samples and scores (defaults to True).
- log_realtime (bool | None): Log events in realtime (enables live viewing of samples in inspect view). Defaults to True.
- log_images (bool | None): Log base64 encoded versions of images, even if specified as a filename or URL (defaults to False).
- log_buffer (int | None): Number of samples to buffer before writing the log file. If not specified, an appropriate default for the format and filesystem is chosen (10 for most cases, 100 for JSON logs on remote filesystems).
- log_shared (bool | int | None): Sync sample events to the log directory so that users on other systems can see log updates in realtime (defaults to no syncing). Specify True to sync every 10 seconds, or an integer to sync every n seconds.
- score (bool): Score output (defaults to True).
- score_display (bool | None): Show scoring metrics in realtime (defaults to True).
- max_retries (int | None): Maximum number of times to retry a request.
- timeout (int | None): Request timeout (in seconds).
- attempt_timeout (int | None): Timeout (in seconds) for any given attempt (if exceeded, the attempt is abandoned and retried according to max_retries).
- max_connections (int | None): Maximum number of concurrent connections to the Model API (default is per Model API).
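A hedged sketch of retrying a failed run from its log file; the log path below is a placeholder:
from inspect_ai import eval_retry

# Retry a previously failed evaluation directly from its log file (an
# EvalLog or EvalLogInfo can be passed as well). The path is illustrative.
eval_retry(
    "./logs/2025-05-01T10-00-00_security-guide_abc123.eval",
    retry_on_error=3,    # also retry individual samples that error
    max_connections=5,
)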
eval_set
Evaluate a set of tasks.
def eval_set(
    tasks: Tasks,
    log_dir: str,
    retry_attempts: int | None = None,
    retry_wait: float | None = None,
    retry_connections: float | None = None,
    retry_cleanup: bool | None = None,
    model: str | Model | list[str] | list[Model] | None | NotGiven = NOT_GIVEN,
    model_base_url: str | None = None,
    model_args: dict[str, Any] | str = dict(),
    model_roles: dict[str, str | Model] | None = None,
    task_args: dict[str, Any] | str = dict(),
    sandbox: SandboxEnvironmentType | None = None,
    sandbox_cleanup: bool | None = None,
    solver: Solver | SolverSpec | Agent | list[Solver] | None = None,
    tags: list[str] | None = None,
    metadata: dict[str, Any] | None = None,
    trace: bool | None = None,
    display: DisplayType | None = None,
    approval: str | list[ApprovalPolicy] | None = None,
    score: bool = True,
    log_level: str | None = None,
    log_level_transcript: str | None = None,
    log_format: Literal["eval", "json"] | None = None,
    limit: int | tuple[int, int] | None = None,
    sample_id: str | int | list[str] | list[int] | list[str | int] | None = None,
    sample_shuffle: bool | int | None = None,
    epochs: int | Epochs | None = None,
    fail_on_error: bool | float | None = None,
    continue_on_fail: bool | None = None,
    retry_on_error: int | None = None,
    debug_errors: bool | None = None,
    message_limit: int | None = None,
    token_limit: int | None = None,
    time_limit: int | None = None,
    working_limit: int | None = None,
    max_samples: int | None = None,
    max_tasks: int | None = None,
    max_subprocesses: int | None = None,
    max_sandboxes: int | None = None,
    log_samples: bool | None = None,
    log_realtime: bool | None = None,
    log_images: bool | None = None,
    log_buffer: int | None = None,
    log_shared: bool | int | None = None,
    bundle_dir: str | None = None,
    bundle_overwrite: bool = False,
    log_dir_allow_dirty: bool | None = None,
    **kwargs: Unpack[GenerateConfigArgs],
) -> tuple[bool, list[EvalLog]]
- tasks (Tasks): Task(s) to evaluate. If None, attempt to evaluate a task in the current working directory.
- log_dir (str): Output path for logging results (required to ensure that a unique storage scope is assigned for the set).
- retry_attempts (int | None): Maximum number of retry attempts before giving up (defaults to 10).
- retry_wait (float | None): Time to wait between attempts, increased exponentially (defaults to 30, resulting in waits of 30, 60, 120, 240, etc.). Wait time per retry will in no case be longer than 1 hour.
- retry_connections (float | None): Reduce max_connections at this rate with each retry (defaults to 1.0, which results in no reduction).
- retry_cleanup (bool | None): Cleanup failed log files after retries (defaults to True).
- model (str | Model | list[str] | list[Model] | None | NotGiven): Model(s) for evaluation. If not specified, use the value of the INSPECT_EVAL_MODEL environment variable. Specify None to define no default model(s), which will leave model usage entirely up to tasks.
- model_base_url (str | None): Base URL for communicating with the model API.
- model_args (dict[str, Any] | str): Model creation args (as a dictionary or as a path to a JSON or YAML config file).
- model_roles (dict[str, str | Model] | None): Named roles for use in get_model().
- task_args (dict[str, Any] | str): Task creation arguments (as a dictionary or as a path to a JSON or YAML config file).
- sandbox (SandboxEnvironmentType | None): Sandbox environment type (or optionally a str or tuple with a shorthand spec).
- sandbox_cleanup (bool | None): Cleanup sandbox environments after the task completes (defaults to True).
- solver (Solver | SolverSpec | Agent | list[Solver] | None): Alternative solver(s) for evaluating task(s). Optional (uses task solver by default).
- tags (list[str] | None): Tags to associate with this evaluation run.
- metadata (dict[str, Any] | None): Metadata to associate with this evaluation run.
- trace (bool | None): Trace message interactions with the evaluated model to the terminal.
- display (DisplayType | None): Task display type (defaults to ‘full’).
- approval (str | list[ApprovalPolicy] | None): Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.
- score (bool): Score output (defaults to True).
- log_level (str | None): Level for logging to the console: “debug”, “http”, “sandbox”, “info”, “warning”, “error”, “critical”, or “notset” (defaults to “warning”).
- log_level_transcript (str | None): Level for logging to the log file (defaults to “info”).
- log_format (Literal["eval", "json"] | None): Format for writing log files (defaults to “eval”, the native high-performance format).
- limit (int | tuple[int, int] | None): Limit evaluated samples (defaults to all samples).
- sample_id (str | int | list[str] | list[int] | list[str | int] | None): Evaluate specific sample(s) from the dataset. Use plain ids or preface with task names as required to disambiguate ids across tasks (e.g. popularity:10).
- sample_shuffle (bool | int | None): Shuffle the order of samples (pass a seed to make the order deterministic).
- epochs (int | Epochs | None): Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to “mean”).
- fail_on_error (bool | float | None): True to fail on the first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if that proportion of total samples fails; a value greater than 1 to fail the eval if that count of samples fails.
- continue_on_fail (bool | None): True to continue running and only fail at the end if the fail_on_error condition is met; False to fail the eval immediately when the fail_on_error condition is met (default).
- retry_on_error (int | None): Number of times to retry samples if they encounter errors (by default, no retries occur).
- debug_errors (bool | None): Raise task errors (rather than logging them) so they can be debugged (defaults to False).
- message_limit (int | None): Limit on total messages used for each sample.
- token_limit (int | None): Limit on total tokens used for each sample.
- time_limit (int | None): Limit on clock time (in seconds) for samples.
- working_limit (int | None): Limit on working time (in seconds) for each sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
- max_samples (int | None): Maximum number of samples to run in parallel (default is max_connections).
- max_tasks (int | None): Maximum number of tasks to run in parallel (defaults to the greater of 4 and the number of models being evaluated).
- max_subprocesses (int | None): Maximum number of subprocesses to run in parallel (default is os.cpu_count()).
- max_sandboxes (int | None): Maximum number of sandboxes (per-provider) to run in parallel.
- log_samples (bool | None): Log detailed samples and scores (defaults to True).
- log_realtime (bool | None): Log events in realtime (enables live viewing of samples in inspect view). Defaults to True.
- log_images (bool | None): Log base64 encoded versions of images, even if specified as a filename or URL (defaults to False).
- log_buffer (int | None): Number of samples to buffer before writing the log file. If not specified, an appropriate default for the format and filesystem is chosen (10 for most cases, 100 for JSON logs on remote filesystems).
- log_shared (bool | int | None): Sync sample events to the log directory so that users on other systems can see log updates in realtime (defaults to no syncing). Specify True to sync every 10 seconds, or an integer to sync every n seconds.
- bundle_dir (str | None): If specified, the log viewer and logs generated by this eval set will be bundled into this directory.
- bundle_overwrite (bool): Whether to overwrite files in the bundle_dir (defaults to False).
- log_dir_allow_dirty (bool | None): If True, allow the log directory to contain unrelated logs. If False, ensure that the log directory only contains logs for tasks in this eval set (defaults to False).
- **kwargs (Unpack[GenerateConfigArgs]): Model generation options.
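A sketch of running an eval set with retries; the task files, model, and log directory are placeholders:
from inspect_ai import eval_set

# Run a set of tasks with automatic retries. `success` is True only once
# every task in the set has a successful log in log_dir; re-running with
# the same log_dir resumes any incomplete work.
success, logs = eval_set(
    tasks=["task1.py", "task2.py"],
    model="openai/gpt-4o",
    log_dir="logs/run-01",
    retry_attempts=5,
    retry_wait=60,
)
if not success:
    print("Some tasks are still incomplete; re-run to resume.")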
score
Score an evaluation log.
def score(
    log: EvalLog,
    scorers: "Scorers",
    epochs_reducer: ScoreReducers | None = None,
    action: ScoreAction | None = None,
    display: DisplayType | None = None,
    copy: bool = True,
) -> EvalLog
- log (EvalLog): Evaluation log.
- scorers (Scorers): List of Scorers to apply to the log.
- epochs_reducer (ScoreReducers | None): Reducer function(s) for aggregating scores in each sample. Defaults to previously used reducer(s).
- action (ScoreAction | None): Whether to append or overwrite this score.
- display (DisplayType | None): Progress/status display.
- copy (bool): Whether to deepcopy the log before scoring.
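A hedged sketch of re-scoring an existing log; the log path is a placeholder, and read_eval_log()/write_eval_log() are assumed to come from inspect_ai.log:
from inspect_ai import score
from inspect_ai.log import read_eval_log, write_eval_log
from inspect_ai.scorer import model_graded_fact

# Apply an additional scorer to a completed log, append its results, and
# write the re-scored log back out to a new file.
log = read_eval_log("./logs/2025-05-01T10-00-00_security-guide_abc123.eval")
scored = score(log, scorers=[model_graded_fact()], action="append")
write_eval_log(scored, "./logs/security-guide-rescored.eval")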
Tasks
Task
Evaluation task.
Tasks are the basis for defining and running evaluations.
class Task
Methods
- __init__
Create a task.
def __init__(
    self,
    dataset: Dataset | Sequence[Sample] | None = None,
    setup: Solver | list[Solver] | None = None,
    solver: Solver | Agent | list[Solver] = generate(),
    cleanup: Callable[[TaskState], Awaitable[None]] | None = None,
    scorer: "Scorers" | None = None,
    metrics: list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None = None,
    model: str | Model | None = None,
    config: GenerateConfig = GenerateConfig(),
    model_roles: dict[str, str | Model] | None = None,
    sandbox: SandboxEnvironmentType | None = None,
    approval: str | list[ApprovalPolicy] | None = None,
    epochs: int | Epochs | None = None,
    fail_on_error: bool | float | None = None,
    continue_on_fail: bool | None = None,
    message_limit: int | None = None,
    token_limit: int | None = None,
    time_limit: int | None = None,
    working_limit: int | None = None,
    display_name: str | None = None,
    name: str | None = None,
    version: int | str = 0,
    metadata: dict[str, Any] | None = None,
    **kwargs: Unpack[TaskDeprecatedArgs],
) -> None
- dataset (Dataset | Sequence[Sample] | None): Dataset to evaluate.
- setup (Solver | list[Solver] | None): Setup step (always run even when the main solver is replaced).
- solver (Solver | Agent | list[Solver]): Solver or list of solvers. Defaults to generate(), a normal call to the model.
- cleanup (Callable[[TaskState], Awaitable[None]] | None): Optional cleanup function for the task. Called after all solvers have run for each sample (including if an exception occurs during the run).
- scorer ('Scorers' | None): Scorer used to evaluate model output.
- metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None): Alternative metrics (overrides the metrics provided by the specified scorer).
- model (str | Model | None): Default model for the task (optional, defaults to the eval model).
- config (GenerateConfig): Model generation config for the default model (does not apply to model roles).
- model_roles (dict[str, str | Model] | None): Named roles for use in get_model().
- sandbox (SandboxEnvironmentType | None): Sandbox environment type (or optionally a str or tuple with a shorthand spec).
- approval (str | list[ApprovalPolicy] | None): Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.
- epochs (int | Epochs | None): Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to “mean”).
- fail_on_error (bool | float | None): True to fail on the first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if that proportion of total samples fails; a value greater than 1 to fail the eval if that count of samples fails.
- continue_on_fail (bool | None): True to continue running and only fail at the end if the fail_on_error condition is met; False to fail the eval immediately when the fail_on_error condition is met (default).
- message_limit (int | None): Limit on total messages used for each sample.
- token_limit (int | None): Limit on total tokens used for each sample.
- time_limit (int | None): Limit on clock time (in seconds) for samples.
- working_limit (int | None): Limit on working time (in seconds) for each sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
- display_name (str | None): Task display name (e.g. for plotting). If not specified, defaults to the registered task name.
- name (str | None): Task name. If not specified, it is automatically determined based on the registered name of the task.
- version (int | str): Version of the task (to distinguish evolutions of the task spec or breaking changes to it).
- metadata (dict[str, Any] | None): Additional metadata to associate with the task.
- **kwargs (Unpack[TaskDeprecatedArgs]): Deprecated arguments.
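A minimal sketch of constructing a Task directly with a small inline dataset; the helper imports shown are the standard dataset/solver/scorer modules, and the sample content is illustrative:
from inspect_ai import Task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message

# A tiny task: one inline sample, a two-step solver chain, a simple
# scorer, and a couple of the limits documented above.
arithmetic = Task(
    dataset=[Sample(input="What is 2 + 2?", target="4")],
    solver=[system_message("Answer with just the number."), generate()],
    scorer=match(),
    epochs=3,
    token_limit=2000,
    name="arithmetic_demo",
)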
 
task_with
Task adapted with alternate values for one or more options.
This function modifies the passed task in place and returns it. If you want to create multiple variations of a single task using task_with() you should create the underlying task multiple times.
def task_with(
    task: Task,
    *,
    dataset: Dataset | Sequence[Sample] | None | NotGiven = NOT_GIVEN,
    setup: Solver | list[Solver] | None | NotGiven = NOT_GIVEN,
    solver: Solver | list[Solver] | NotGiven = NOT_GIVEN,
    cleanup: Callable[[TaskState], Awaitable[None]] | None | NotGiven = NOT_GIVEN,
    scorer: "Scorers" | None | NotGiven = NOT_GIVEN,
    metrics: list[Metric | dict[str, list[Metric]]]
    | dict[str, list[Metric]]
    | None
    | NotGiven = NOT_GIVEN,
    model: str | Model | NotGiven = NOT_GIVEN,
    config: GenerateConfig | NotGiven = NOT_GIVEN,
    model_roles: dict[str, str | Model] | NotGiven = NOT_GIVEN,
    sandbox: SandboxEnvironmentType | None | NotGiven = NOT_GIVEN,
    approval: str | list[ApprovalPolicy] | None | NotGiven = NOT_GIVEN,
    epochs: int | Epochs | None | NotGiven = NOT_GIVEN,
    fail_on_error: bool | float | None | NotGiven = NOT_GIVEN,
    continue_on_fail: bool | None | NotGiven = NOT_GIVEN,
    message_limit: int | None | NotGiven = NOT_GIVEN,
    token_limit: int | None | NotGiven = NOT_GIVEN,
    time_limit: int | None | NotGiven = NOT_GIVEN,
    working_limit: int | None | NotGiven = NOT_GIVEN,
    name: str | None | NotGiven = NOT_GIVEN,
    version: int | NotGiven = NOT_GIVEN,
    metadata: dict[str, Any] | None | NotGiven = NOT_GIVEN,
) -> Task
- task (Task): Task to adapt.
- dataset (Dataset | Sequence[Sample] | None | NotGiven): Dataset to evaluate.
- setup (Solver | list[Solver] | None | NotGiven): Setup step (always run even when the main solver is replaced).
- solver (Solver | list[Solver] | NotGiven): Solver or list of solvers. Defaults to generate(), a normal call to the model.
- cleanup (Callable[[TaskState], Awaitable[None]] | None | NotGiven): Optional cleanup function for the task. Called after all solvers have run for each sample (including if an exception occurs during the run).
- scorer ('Scorers' | None | NotGiven): Scorer used to evaluate model output.
- metrics (list[Metric | dict[str, list[Metric]]] | dict[str, list[Metric]] | None | NotGiven): Alternative metrics (overrides the metrics provided by the specified scorer).
- model (str | Model | NotGiven): Default model for the task (optional, defaults to the eval model).
- config (GenerateConfig | NotGiven): Model generation config for the default model (does not apply to model roles).
- model_roles (dict[str, str | Model] | NotGiven): Named roles for use in get_model().
- sandbox (SandboxEnvironmentType | None | NotGiven): Sandbox environment type (or optionally a str or tuple with a shorthand spec).
- approval (str | list[ApprovalPolicy] | None | NotGiven): Tool use approval policies. Either a path to an approval policy config file or a list of approval policies. Defaults to no approval policy.
- epochs (int | Epochs | None | NotGiven): Epochs to repeat samples for and optional score reducer function(s) used to combine sample scores (defaults to “mean”).
- fail_on_error (bool | float | None | NotGiven): True to fail on the first sample error (default); False to never fail on sample errors; a value between 0 and 1 to fail if that proportion of total samples fails; a value greater than 1 to fail the eval if that count of samples fails.
- continue_on_fail (bool | None | NotGiven): True to continue running and only fail at the end if the fail_on_error condition is met; False to fail the eval immediately when the fail_on_error condition is met (default).
- message_limit (int | None | NotGiven): Limit on total messages used for each sample.
- token_limit (int | None | NotGiven): Limit on total tokens used for each sample.
- time_limit (int | None | NotGiven): Limit on clock time (in seconds) for samples.
- working_limit (int | None | NotGiven): Limit on working time (in seconds) for each sample. Working time includes model generation, tool calls, etc. but does not include time spent waiting on retries or shared resources.
- name (str | None | NotGiven): Task name. If not specified, it is automatically determined based on the name of the task directory (or “task” if it is an anonymous task, e.g. created in a notebook and passed to eval() directly).
- version (int | NotGiven): Version of the task (to distinguish evolutions of the task spec or breaking changes to it).
- metadata (dict[str, Any] | None | NotGiven): Additional metadata to associate with the task.
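Because task_with() modifies the task in place, a sketch like the following constructs a fresh base task for each variant (task names and option values are illustrative):
from inspect_ai import Task, task_with
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

def base_task() -> Task:
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],
        solver=generate(),
        scorer=match(),
    )

# Each variant adapts its own copy of the underlying task.
strict = task_with(base_task(), epochs=5, name="arithmetic_strict")
lenient = task_with(base_task(), fail_on_error=False, name="arithmetic_lenient")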
Epochs
Task epochs.
Number of epochs to repeat samples over and optionally one or more reducers used to combine scores from samples across epochs. If not specified the “mean” score reducer is used.
class Epochs
Attributes
- reducer (list[ScoreReducer] | None): One or more reducers used to combine scores from samples across epochs (defaults to “mean”).
Methods
- __init__
Task epochs.
def __init__(self, epochs: int, reducer: ScoreReducers | None = None) -> None
- epochs (int): Number of epochs.
- reducer (ScoreReducers | None): One or more reducers used to combine scores from samples across epochs (defaults to “mean”).
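For example, a task can repeat each sample several times and combine the per-epoch scores with named reducers; the reducer names below are assumed to be among the built-in reducers:
from inspect_ai import Epochs, Task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# Run each sample 4 times and reduce the epoch scores with both "mean"
# and "pass_at_2".
repeated = Task(
    dataset=[Sample(input="What is 2 + 2?", target="4")],
    solver=generate(),
    scorer=match(),
    epochs=Epochs(4, ["mean", "pass_at_2"]),
)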
 
TaskInfo
Task information (file, name, and attributes).
class TaskInfo(BaseModel)
Attributes
- file (str): File path where the task was loaded from.
- name (str): Task name (defaults to function name).
- attribs (dict[str, Any]): Task attributes (arguments passed to @task).
Tasks
One or more tasks.
Tasks to be evaluated. Many forms of task specification are supported including directory names, task functions, task classes, and task instances (a single task or list of tasks can be specified). None is a request to read a task out of the current working directory.
Tasks: TypeAlias = (
    str
    | PreviousTask
    | ResolvedTask
    | TaskInfo
    | Task
    | Callable[..., Task]
    | type[Task]
    | list[str]
    | list[PreviousTask]
    | list[ResolvedTask]
    | list[TaskInfo]
    | list[Task]
    | list[Callable[..., Task]]
    | list[type[Task]]
    | None
)
View
view
Run the Inspect View server.
def view(
    log_dir: str | None = None,
    recursive: bool = True,
    host: str = DEFAULT_SERVER_HOST,
    port: int = DEFAULT_VIEW_PORT,
    authorization: str | None = None,
    log_level: str | None = None,
    fs_options: dict[str, Any] = {},
) -> None
- log_dir (str | None): Directory to view logs from.
- recursive (bool): Recursively list files in log_dir.
- host (str): TCP/IP host (defaults to “127.0.0.1”).
- port (int): TCP/IP port (defaults to 7575).
- authorization (str | None): Validate requests by checking for this authorization header.
- log_level (str | None): Level for logging to the console: “debug”, “http”, “sandbox”, “info”, “warning”, “error”, “critical”, or “notset” (defaults to “warning”).
- fs_options (dict[str, Any]): Additional arguments to pass through to the filesystem provider (e.g. S3FileSystem). Use {"anon": True} if you are accessing a public S3 bucket with no credentials.
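A sketch of starting the viewer from Python; this assumes view() is exported from the top-level inspect_ai package as listed in this reference (the inspect view CLI provides the same functionality):
from inspect_ai import view

# Serve the Inspect View UI for a local log directory; this call blocks
# until the server is stopped. The directory path is a placeholder.
view(log_dir="./logs", host="127.0.0.1", port=7575)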
Decorators
task
Decorator for registering tasks.
def task(*args: Any, name: str | None = None, **attribs: Any) -> Any
- *args (Any): Function returning a Task, targeted by the plain task decorator without attributes (e.g. @task).
- name (str | None): Optional name for the task. If the decorator has no name argument then the name of the function will be used to automatically assign a name.
- **attribs (Any): Additional task attributes (dict[str, Any]).
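A minimal sketch of registering a task with the decorator; the extra category attribute is illustrative and would surface in TaskInfo.attribs:
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import exact
from inspect_ai.solver import generate

# Register a task; the name defaults to the function name and the extra
# keyword attribute is stored as a task attribute.
@task(category="demo")
def hello_world() -> Task:
    return Task(
        dataset=[Sample(input="Say hello.", target="hello")],
        solver=generate(),
        scorer=exact(),
    )

# The registered task can then be passed to eval(), e.g.:
# eval(hello_world, model="openai/gpt-4o")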