inspect_ai.hooks

Registration

Hooks

Base class for hooks.

Hooks are always called inside a try/except block that catches any exception they raise, so a hook failure does not affect the overall execution of the eval. If a hook fails, a warning is logged.

class Hooks

Methods

enabled

Check if the hook should be enabled.

Default implementation returns True.

Hooks may wish to override this to e.g. check the presence of an environment variable or a configuration setting.

Will be called frequently, so consider caching the result if the computation is expensive.

def enabled(self) -> bool
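
The caching advice above could be followed roughly as below. This is a minimal sketch, not Inspect's own code: the AUDIT_HOOKS_ENABLED variable and the AuditHooks class are hypothetical, and the local Hooks class is a stand-in so the snippet runs without Inspect installed (in a real project you would subclass inspect_ai.hooks.Hooks).

```python
import os
from functools import cached_property


class Hooks:
    """Stand-in for inspect_ai.hooks.Hooks so this sketch runs on its own."""

    def enabled(self) -> bool:
        return True


class AuditHooks(Hooks):
    """Hypothetical hook that is active only when AUDIT_HOOKS_ENABLED is set."""

    @cached_property
    def _enabled(self) -> bool:
        # Computed on first access, then served from the instance cache.
        value = os.environ.get("AUDIT_HOOKS_ENABLED", "")
        return value.strip().lower() in {"1", "true", "yes"}

    def enabled(self) -> bool:
        return self._enabled


os.environ["AUDIT_HOOKS_ENABLED"] = "true"
audit = AuditHooks()
first = audit.enabled()

# Later changes to the variable are not observed because the result is cached.
os.environ["AUDIT_HOOKS_ENABLED"] = "false"
second = audit.enabled()
```

The trade-off is that the cached answer is fixed for the lifetime of the hook instance, which suits settings that do not change during an eval.
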

on_eval_set_start

On eval set start.

An “eval set” is an invocation of eval_set() for a log directory. Note that the eval_set_id will be stable across multiple invocations of eval_set() for the same log directory.

async def on_eval_set_start(self, data: EvalSetStart) -> None
data EvalSetStart

Eval set start data.

on_eval_set_end

On eval set end.

async def on_eval_set_end(self, data: EvalSetEnd) -> None
data EvalSetEnd

Eval set end data.

on_run_start

On run start.

A “run” is a single invocation of eval() or eval_retry() which may contain many Tasks, each with many Samples and many epochs. Note that eval_retry() can be invoked multiple times within an eval_set().

async def on_run_start(self, data: RunStart) -> None
data RunStart

Run start data.

on_run_end

On run end.

async def on_run_end(self, data: RunEnd) -> None
data RunEnd

Run end data.
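
As an illustration of pairing on_run_start with on_run_end, the sketch below records how long each run took and whether it failed. The RunStart and RunEnd classes here are local stand-ins that mirror only some of the documented attributes, and RunReportHooks is a hypothetical name; a real hook would subclass inspect_ai.hooks.Hooks and receive Inspect's own data objects.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunStart:
    """Stand-in mirroring the documented RunStart attributes."""
    eval_set_id: Optional[str]
    run_id: str
    task_names: List[str]


@dataclass(frozen=True)
class RunEnd:
    """Stand-in for RunEnd; the `logs` attribute is omitted for brevity."""
    eval_set_id: Optional[str]
    run_id: str
    exception: Optional[BaseException]


class RunReportHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self) -> None:
        self._started = {}
        self.outcomes = {}

    async def on_run_start(self, data: RunStart) -> None:
        self._started[data.run_id] = time.monotonic()

    async def on_run_end(self, data: RunEnd) -> None:
        started = self._started.pop(data.run_id, None)
        elapsed = 0.0 if started is None else time.monotonic() - started
        # A None exception means the run completed successfully.
        status = "ok" if data.exception is None else f"failed: {data.exception!r}"
        self.outcomes[data.run_id] = (status, elapsed)


async def demo() -> dict:
    hook = RunReportHooks()
    await hook.on_run_start(RunStart(None, "run-1", ["task_a"]))
    await hook.on_run_end(RunEnd(None, "run-1", None))
    await hook.on_run_start(RunStart(None, "run-2", ["task_b"]))
    await hook.on_run_end(RunEnd(None, "run-2", RuntimeError("boom")))
    return hook.outcomes


outcomes = asyncio.run(demo())
```
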

on_task_start

On task start.

async def on_task_start(self, data: TaskStart) -> None
data TaskStart

Task start data.

on_task_end

On task end.

async def on_task_end(self, data: TaskEnd) -> None
data TaskEnd

Task end data.

on_sample_init

On sample init.

Called when a sample has been scheduled and is about to begin initialization, before sandbox environments are created. This hook can be used to gate sandbox resource provisioning.

If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

async def on_sample_init(self, data: SampleInit) -> None
data SampleInit

Sample init data.
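
One way to read "gate sandbox resource provisioning" is to make on_sample_init wait for a free slot. The sketch below does this with an asyncio.Semaphore that is released in on_sample_end. It is an assumption-laden illustration: SampleRef is a stand-in carrying only sample_id, SandboxGateHooks and max_concurrent are hypothetical names, and it assumes every on_sample_init is eventually matched by an on_sample_end.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRef:
    """Stand-in carrying only the sample_id shared by SampleInit and SampleEnd."""
    sample_id: str


class SandboxGateHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self, max_concurrent: int) -> None:
        self._max = max_concurrent
        self._slots = None
        self.active = 0
        self.peak = 0

    def _semaphore(self) -> asyncio.Semaphore:
        # Created lazily so it belongs to the event loop that runs the hooks.
        if self._slots is None:
            self._slots = asyncio.Semaphore(self._max)
        return self._slots

    async def on_sample_init(self, data: SampleRef) -> None:
        # Blocks here until a slot is free, before sandboxes are created.
        await self._semaphore().acquire()
        self.active += 1
        self.peak = max(self.peak, self.active)

    async def on_sample_end(self, data: SampleRef) -> None:
        self.active -= 1
        self._semaphore().release()


async def run_sample(hook: SandboxGateHooks, sample_id: str) -> None:
    await hook.on_sample_init(SampleRef(sample_id))
    await asyncio.sleep(0.01)  # pretend the sample runs
    await hook.on_sample_end(SampleRef(sample_id))


async def demo() -> SandboxGateHooks:
    hook = SandboxGateHooks(max_concurrent=2)
    await asyncio.gather(*(run_sample(hook, f"s{i}") for i in range(5)))
    return hook


gate = asyncio.run(demo())
```
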

on_sample_start

On sample start.

Called when a sample is about to start. If the sample errors and retries, this will not be called again.

If a sample is run for multiple epochs, this will be called once per epoch.

async def on_sample_start(self, data: SampleStart) -> None
data SampleStart

Sample start data.

on_sample_event

On sample event.

Called when a sample event is emitted. Pending events are not logged here (i.e. ToolEvent and ModelEvent are not logged until they are complete).

async def on_sample_event(self, data: SampleEvent) -> None
data SampleEvent

Sample event.
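
A simple use of on_sample_event is to tally events per sample. The sketch below keys the tally on the event's Python class name, which avoids assuming anything about the fields of Inspect's event objects. ModelEvent, ToolEvent and SampleEvent are local stand-ins here, and EventCounterHooks is a hypothetical name.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass


class ModelEvent:
    """Stand-in for a completed model event."""


class ToolEvent:
    """Stand-in for a completed tool event."""


@dataclass(frozen=True)
class SampleEvent:
    """Stand-in with only two of the documented attributes."""
    sample_id: str
    event: object


class EventCounterHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self) -> None:
        self.counts = Counter()

    async def on_sample_event(self, data: SampleEvent) -> None:
        # Key by (sample, event class name) so counts stay per sample.
        self.counts[(data.sample_id, type(data.event).__name__)] += 1


async def demo() -> Counter:
    hook = EventCounterHooks()
    for event in (ModelEvent(), ToolEvent(), ModelEvent()):
        await hook.on_sample_event(SampleEvent("s1", event))
    return hook.counts


counts = asyncio.run(demo())
```
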

on_sample_end

On sample end.

Called when a sample has either completed successfully, or when a sample has errored and has no retries remaining.

If a sample is run for multiple epochs, this will be called once per epoch.

async def on_sample_end(self, data: SampleEnd) -> None
data SampleEnd

Sample end data.

on_sample_attempt_start

On sample attempt start.

Fired at the beginning of every attempt (including the first). Unlike on_sample_start which fires once per sample, this fires on retries too.

async def on_sample_attempt_start(self, data: SampleAttemptStart) -> None
data SampleAttemptStart

Sample attempt start data.

on_sample_attempt_end

On sample attempt end.

Fired at the end of every attempt (including the last). Unlike on_sample_end which fires once per sample, this fires on retries too.

async def on_sample_attempt_end(self, data: SampleAttemptEnd) -> None
data SampleAttemptEnd

Sample attempt end data.
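
The attempt hooks make it possible to track retries, which on_sample_start and on_sample_end cannot see. The sketch below uses the documented attempt, error and will_retry attributes to count retries per sample and to note samples that failed with no retry left. SampleAttemptEnd is a local stand-in with a subset of the attributes, and RetryStatsHooks is a hypothetical name.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SampleAttemptEnd:
    """Stand-in with a subset of the documented attributes."""
    sample_id: str
    attempt: int
    error: Optional[object]
    will_retry: bool


class RetryStatsHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self) -> None:
        self.retries = {}
        self.exhausted = []

    async def on_sample_attempt_end(self, data: SampleAttemptEnd) -> None:
        # attempt is 1-based, so the number of retries so far is attempt - 1.
        self.retries[data.sample_id] = data.attempt - 1
        if data.error is not None and not data.will_retry:
            self.exhausted.append(data.sample_id)


async def demo() -> RetryStatsHooks:
    hook = RetryStatsHooks()
    await hook.on_sample_attempt_end(SampleAttemptEnd("s1", 1, "timeout", True))
    await hook.on_sample_attempt_end(SampleAttemptEnd("s1", 2, None, False))
    await hook.on_sample_attempt_end(SampleAttemptEnd("s2", 1, "crash", False))
    return hook


stats = asyncio.run(demo())
```
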

on_model_usage

Called when a call to a model’s generate() method completes successfully without hitting Inspect’s local cache.

Note that this is not called on a hit in Inspect’s local cache (i.e. when no external API call was made). Provider-side caching does still result in this being called.

async def on_model_usage(self, data: ModelUsageData) -> None
data ModelUsageData

Model usage data.
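
A typical use of on_model_usage is to accumulate token counts per model. The sketch below is illustrative only: ModelUsage and ModelUsageData are local stand-ins, the input_tokens and output_tokens field names are assumptions about the usage object, and TokenTotalsHooks is a hypothetical name.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelUsage:
    """Stand-in; the field names here are assumptions."""
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class ModelUsageData:
    """Stand-in with two of the documented attributes."""
    model_name: str
    usage: ModelUsage


class TokenTotalsHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self) -> None:
        self.tokens = defaultdict(int)

    async def on_model_usage(self, data: ModelUsageData) -> None:
        # Only calls that reached the provider arrive here; hits in
        # Inspect's local cache are reported via on_model_cache_usage.
        usage = data.usage
        self.tokens[data.model_name] += usage.input_tokens + usage.output_tokens


async def demo() -> dict:
    hook = TokenTotalsHooks()
    await hook.on_model_usage(ModelUsageData("model-a", ModelUsage(100, 20)))
    await hook.on_model_usage(ModelUsageData("model-a", ModelUsage(50, 5)))
    await hook.on_model_usage(ModelUsageData("model-b", ModelUsage(10, 1)))
    return dict(hook.tokens)


totals = asyncio.run(demo())
```
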

on_model_cache_usage

Called when a call to a model’s generate() method completes successfully by hitting Inspect’s local cache.

async def on_model_cache_usage(self, data: ModelCacheUsageData) -> None
data ModelCacheUsageData

Cached model usage data.
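
Because on_model_usage and on_model_cache_usage are mutually exclusive for a given generate() call, counting both gives a local cache hit rate. The sketch below reads no fields from the data objects, so it makes no assumptions about their contents; CacheStatsHooks is a hypothetical name and the placeholder objects passed in the demo stand in for Inspect's data.

```python
import asyncio


class CacheStatsHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self) -> None:
        self.api_calls = 0
        self.cache_hits = 0

    async def on_model_usage(self, data: object) -> None:
        # Called when generate() completed without a local cache hit.
        self.api_calls += 1

    async def on_model_cache_usage(self, data: object) -> None:
        # Called when generate() was served from Inspect's local cache.
        self.cache_hits += 1

    def hit_rate(self) -> float:
        total = self.api_calls + self.cache_hits
        return self.cache_hits / total if total else 0.0


async def demo() -> CacheStatsHooks:
    hook = CacheStatsHooks()
    for _ in range(3):
        await hook.on_model_usage(object())
    await hook.on_model_cache_usage(object())
    return hook


cache_stats = asyncio.run(demo())
hit_rate = cache_stats.hit_rate()
```
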

on_sample_scoring

Called before the sample is scored.

Can be used by hooks to demarcate the end of solver execution and the start of scoring.

async def on_sample_scoring(self, data: SampleScoring) -> None
data SampleScoring

Sample scoring data.
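
To show how on_sample_scoring can split a sample into a solving phase and a scoring phase, the sketch below timestamps the three hooks and derives two durations. It assumes the scoring hook fires between on_sample_start and on_sample_end for a sample that reaches scoring. SampleRef is a stand-in carrying only sample_id, PhaseTimerHooks is a hypothetical name, and the clock is injectable purely so the demo is deterministic.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleRef:
    """Stand-in carrying only the sample_id shared by the sample data classes."""
    sample_id: str


class PhaseTimerHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        self._marks = {}
        self.durations = {}

    async def on_sample_start(self, data: SampleRef) -> None:
        self._marks[data.sample_id] = {"start": self._clock()}

    async def on_sample_scoring(self, data: SampleRef) -> None:
        # Keep the latest mark in case scoring is reached more than once.
        self._marks.setdefault(data.sample_id, {})["scoring"] = self._clock()

    async def on_sample_end(self, data: SampleRef) -> None:
        marks = self._marks.pop(data.sample_id, {})
        if "start" in marks and "scoring" in marks:
            self.durations[data.sample_id] = {
                "solve": marks["scoring"] - marks["start"],
                "score": self._clock() - marks["scoring"],
            }


async def demo() -> dict:
    ticks = iter([0.0, 5.0, 7.0])  # fake clock readings for the three hooks
    hook = PhaseTimerHooks(clock=lambda: next(ticks))
    ref = SampleRef("s1")
    await hook.on_sample_start(ref)
    await hook.on_sample_scoring(ref)
    await hook.on_sample_end(ref)
    return hook.durations


durations = asyncio.run(demo())
```
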

override_api_key

Optionally override an API key.

When overridden, this method may return a new API key value which will be used in place of the original one during the eval.

def override_api_key(self, data: ApiKeyOverride) -> str | None
data ApiKeyOverride

API key override data.
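
The sketch below shows one possible shape for override_api_key: look the environment variable name up in a secret store and return the stored key if there is one. The ApiKeyOverride class is a local stand-in mirroring the documented attributes, VaultKeyHooks and the dictionary "vault" are hypothetical, and treating a None return as "keep the original key" is an assumption based on the str | None return type.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ApiKeyOverride:
    """Stand-in mirroring the documented attributes."""
    env_var_name: str
    value: str


class VaultKeyHooks:  # in real use: subclass inspect_ai.hooks.Hooks
    def __init__(self, vault: Dict[str, str]) -> None:
        self._vault = vault

    def override_api_key(self, data: ApiKeyOverride) -> Optional[str]:
        # Assumption: returning None leaves the original key in place.
        return self._vault.get(data.env_var_name)


vault_hook = VaultKeyHooks({"OPENAI_API_KEY": "sk-from-vault"})
replaced = vault_hook.override_api_key(ApiKeyOverride("OPENAI_API_KEY", "sk-original"))
untouched = vault_hook.override_api_key(ApiKeyOverride("OTHER_API_KEY", "sk-other"))
```

Note that this method is synchronous, unlike the event hooks, so any lookup it performs should be quick.
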

hooks

Decorator for registering a hook subscriber.

Either decorate a subclass of Hooks, or a function which returns the type of a subclass of Hooks. This decorator will instantiate the hook class and store it in the registry.

def hooks(name: str, description: str) -> Callable[..., Type[T]]
name str

Name of the subscriber (e.g. “audit logging”).

description str

Short description of the hook (e.g. “Copies eval files to S3 bucket for auditing.”).
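
Registration might look like the sketch below, which decorates a Hooks subclass using the name and description parameters documented above. The AuditLogHooks class and its behaviour are hypothetical. The fallback definitions in the except branch are stand-ins that only imitate the documented behaviour, included so the snippet still runs where Inspect is not installed.

```python
try:
    from inspect_ai.hooks import Hooks, hooks
except ImportError:
    # Stand-ins, not Inspect's implementation: instantiate the class and
    # keep the instance in a local registry, as the description above says.
    _registry = {}

    class Hooks:
        def enabled(self) -> bool:
            return True

    def hooks(name: str, description: str):
        def decorate(cls):
            _registry[name] = cls()
            return cls

        return decorate


@hooks(name="audit_logging", description="Hypothetical hook that prints run ids.")
class AuditLogHooks(Hooks):
    async def on_run_start(self, data) -> None:
        # data is a RunStart; run_id is one of its documented attributes.
        print(f"run started: {data.run_id}")
```
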

Hook Data

ApiKeyOverride

API key override hook event data.

@dataclass(frozen=True)
class ApiKeyOverride

Attributes

env_var_name str

The name of the environment variable containing the API key (e.g. OPENAI_API_KEY).

value str

The original value of the environment variable.

ModelUsageData

Model usage hook event data.

@dataclass(frozen=True)
class ModelUsageData

Attributes

model_name str

The name of the model that was used.

usage ModelUsage

The model usage metrics.

call_duration float

The duration of the model call in seconds. If HTTP retries were made, this is the time taken for the successful call. This excludes retry waiting (e.g. exponential backoff) time.

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str | None

The globally unique identifier for the run (if any).

eval_id str | None

The globally unique identifier for the task execution (if any).

task_name str | None

The name of the task that generated this usage (if any).

retries int

The number of HTTP retries made before the successful call.
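
As a worked example of combining these attributes, the sketch below derives output tokens per second from call_duration. Since call_duration covers only the successful call, the figure reflects the provider's speed on that call rather than end-to-end latency including retries. ModelUsage and ModelUsageData are local stand-ins, output_tokens is an assumed field name, and output_tokens_per_second is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelUsage:
    """Stand-in; `output_tokens` is an assumed field name."""
    output_tokens: int


@dataclass(frozen=True)
class ModelUsageData:
    """Stand-in with a subset of the documented attributes."""
    model_name: str
    usage: ModelUsage
    call_duration: float
    retries: int


def output_tokens_per_second(data: ModelUsageData) -> Optional[float]:
    # call_duration excludes failed attempts and backoff waits, so retries
    # do not lower this figure.
    if data.call_duration <= 0:
        return None
    return data.usage.output_tokens / data.call_duration


example = ModelUsageData(
    model_name="example/model",
    usage=ModelUsage(output_tokens=120),
    call_duration=2.0,
    retries=3,
)
rate = output_tokens_per_second(example)
```
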

EvalSetStart

Eval set start hook event data.

@dataclass(frozen=True)
class EvalSetStart

Attributes

eval_set_id str

The globally unique identifier for the eval set. Note that the eval_set_id will be stable across multiple invocations of eval_set() for the same log directory.

log_dir str

The log directory for the eval set.

EvalSetEnd

Eval set end hook event data.

@dataclass(frozen=True)
class EvalSetEnd

Attributes

eval_set_id str

The globally unique identifier for the eval set. Note that the eval_set_id will be stable across multiple invocations of eval_set() for the same log directory.

log_dir str

The log directory for the eval set.

RunEnd

Run end hook event data.

@dataclass(frozen=True)
class RunEnd

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

exception BaseException | None

The exception that occurred during the run, if any. If None, the run completed successfully.

logs EvalLogs

All eval logs generated during the run. Can be headers only if the run was an eval_set().

RunStart

Run start hook event data.

@dataclass(frozen=True)
class RunStart

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

task_names list[str]

The names of the tasks which will be used in the run.

SampleEnd

Sample end hook event data.

@dataclass(frozen=True)
class SampleEnd

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

sample_id str

The globally unique identifier for the sample execution.

sample EvalSample

The sample that has run.

SampleInit

Sample init hook event data.

@dataclass(frozen=True)
class SampleInit

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

sample_id str

The globally unique identifier for the sample execution.

summary EvalSampleSummary

Summary of the sample to be initialized.

SampleStart

Sample start hook event data.

@dataclass(frozen=True)
class SampleStart

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

sample_id str

The globally unique identifier for the sample execution.

summary EvalSampleSummary

Summary of the sample to be run.

SampleAttemptStart

Sample attempt start hook event data.

Fired at the beginning of every attempt (including the first). Unlike on_sample_start which fires once per sample, this fires on retries too.

@dataclass(frozen=True)
class SampleAttemptStart

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

sample_id str

The globally unique identifier for the sample execution.

summary EvalSampleSummary

Summary of the sample to be run.

attempt int

1-based attempt number.

SampleAttemptEnd

Sample attempt end hook event data.

Fired at the end of every attempt (including the last). Unlike on_sample_end which fires once per sample, this fires on retries too.

@dataclass(frozen=True)
class SampleAttemptEnd

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

sample_id str

The globally unique identifier for the sample execution.

summary EvalSampleSummary

Summary of the sample.

attempt int

1-based attempt number.

error EvalError | None

The error from this attempt, if any.

will_retry bool

Whether the sample will be retried after this attempt.

SampleEvent

Sample event hook event data.

@dataclass(frozen=True)
class SampleEvent

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

sample_id str

The globally unique identifier for the sample execution.

event Event

The sample event.

TaskEnd

Task end hook event data.

@dataclass(frozen=True)
class TaskEnd

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for the task execution.

log EvalLog

The log generated for the task. Can be header only if the run was an eval_set().

TaskStart

Task start hook event data.

@dataclass(frozen=True)
class TaskStart

Attributes

eval_set_id str | None

The globally unique identifier for the eval set (if any).

run_id str

The globally unique identifier for the run.

eval_id str

The globally unique identifier for this task execution.

spec EvalSpec

Specification of the task.