Running Evals

Once an evaluation is developed, Inspect provides a number of tools for running it reliably and at scale:

Eval Sets: Describe, run, and analyse larger sets of evaluation tasks with automatic retry and resumption.
Parallelism: Run multiple tasks and models in parallel and tune sandbox concurrency.
Handling Errors: Deal with runtime errors and recover from crashes during evaluation.
Setting Limits: Set time, message, token, and cost limits on tasks, samples, and agent execution.
Early Stopping: End tasks early based on the scores of previously completed samples.
Tracing: Diagnose runtime issues with advanced execution tracing tools.

If you are just getting started running evaluations, begin with the inspect eval command-line interface and the eval() function, both covered in the Welcome tutorial.