Scoring

Scoring turns the raw output a model produces for each sample into a Score, and aggregates those scores into the metrics that summarise an evaluation. The scoring system is documented across the following articles:

| Article | Description |
|---|---|
| Standard Scorers | The built-in scorers (text matching, multiple choice, math, model grading, perplexity) and how to choose among them. |
| Custom Scorers | Write your own scorers using the Score, Value, and Target types, including scorers that call models or inspect a sandbox. |
| Model Grading | Use another model to grade open-ended answers; customise templates, instructions, grader models, and chat history. |
| Scoring Metrics | Built-in metrics, grouping, clustered standard errors, custom metrics, and reducing epochs. |
| Multiple Scorers | Use several scorers together, emit multiple scores from one scorer, and reduce multiple scores into one. |
| Scoring Workflow | Defer scoring with `--no-score`, re-score logs with `inspect score`, and edit scores. |
| Perplexity | Score how well a model predicts text using prompt log probabilities. |
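
As a rough picture of the two stages described at the top of this page (score each sample, then aggregate into metrics), here is a minimal sketch in plain Python. It does not use the library's API: `SampleScore`, `exact_match`, and `aggregate` are hypothetical names chosen for illustration, and the metrics shown (mean accuracy and a simple standard error) are assumptions about what a typical summary looks like.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import mean, stdev


@dataclass
class SampleScore:
    # Hypothetical stand-in for a per-sample score; not the library's Score type.
    value: float
    answer: str


def exact_match(output: str, target: str) -> SampleScore:
    """Score one sample: 1.0 if the trimmed output equals the target, else 0.0."""
    answer = output.strip()
    return SampleScore(value=1.0 if answer == target else 0.0, answer=answer)


def aggregate(scores: list[SampleScore]) -> dict[str, float]:
    """Reduce per-sample scores to summary metrics (mean and standard error)."""
    values = [s.value for s in scores]
    # Standard error of the mean; defined as 0.0 when only one sample exists.
    stderr = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return {"accuracy": mean(values), "stderr": stderr}


# Illustrative (output, target) pairs; made up for this sketch.
samples = [("4", "4"), (" Paris ", "Paris"), ("blue", "red"), ("7", "7")]
scores = [exact_match(output, target) for output, target in samples]
metrics = aggregate(scores)
print(metrics)
```

The articles listed above cover the real counterparts of each stage: the scorer articles describe how per-sample scores are produced, and Scoring Metrics describes how they are aggregated.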

To review transcripts for issues that could undermine results (refusals, evaluation awareness, environment misconfiguration) rather than grading task success, see Scanners. To customise how scores render in the log viewer, see Task Views.