Scoring

Scoring turns the raw output a model produces for each sample into a Score, and aggregates those scores into the metrics that summarise an evaluation. The scoring system is documented across the following articles:

Article            Description
Standard Scorers   The built-in scorers (text matching, multiple choice, math, model grading, perplexity) and how to choose among them.
Custom Scorers     Write your own scorers using the Score, Value, and Target types, including scorers that call models or inspect a sandbox.
Model Grading      Use another model to grade open-ended answers; customise templates, instructions, grader models, and chat history.
Scoring Metrics    Built-in metrics, grouping, clustered standard errors, custom metrics, and reducing epochs.
Multiple Scorers   Use several scorers together, emit multiple scores from one scorer, and reduce multiple scores into one.
Scoring Workflow   Defer scoring with --no-score, re-score logs with inspect score, and edit scores.
Perplexity         Score how well a model predicts text using prompt log probabilities.
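
As a rough illustration of the two stages described at the top of this page (one score per sample, then aggregation into a metric), here is a minimal sketch in plain Python. It deliberately does not use the library's API: SimpleScore, exact_match, and accuracy are hypothetical stand-ins invented for this example, and the sample data is made up.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SimpleScore:
    """Hypothetical stand-in for a per-sample score (not the library's Score type)."""
    value: float
    answer: str


def exact_match(output: str, target: str) -> SimpleScore:
    """Score one sample: 1.0 if the trimmed, case-folded output equals the target."""
    answer = output.strip()
    correct = answer.casefold() == target.strip().casefold()
    return SimpleScore(value=1.0 if correct else 0.0, answer=answer)


def accuracy(scores: list[SimpleScore]) -> float:
    """Aggregate per-sample scores into one metric: the mean of their values."""
    return mean(s.value for s in scores)


# Illustrative (output, target) pairs; assumed data, not from a real evaluation.
samples = [("Paris", "Paris"), (" paris ", "Paris"), ("Lyon", "Paris"), ("Rome", "Rome")]

scores = [exact_match(output, target) for output, target in samples]
metric = accuracy(scores)
print(metric)
```

Real scorers and metrics carry more than this (explanations, metadata, standard errors); the articles above cover the actual types and options.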

To review transcripts for issues that could undermine results (refusals, evaluation awareness, environment misconfiguration) rather than grading task success, see Scanners. To customise how scores render in the log viewer, see Task Views.