Standard Scorers
Overview
A scorer compares a model’s output against the target for each sample and returns a Score. You attach one to a task with the scorer argument. Here match() checks that the model’s answer ends with the target:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate


@task
def capitals():
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),
        scorer=match(),
    )

Available Scorers
Inspect includes both text-matching scorers and model-graded scorers, summarized below. See the inspect_ai.scorer reference for complete function signatures and options.
- includes()
  Check whether the target appears anywhere in the model output (a substring match). Case sensitive or insensitive (defaults to insensitive).
- match()
  Check whether the target appears at a known position: begin, end (the default), or any. With location="exact" the whole output must equal the target. Ignores case and white-space by default. Pass numeric=True to compare numbers rather than text; currency symbols ($, €, £), thousands separators (,), and formatting markers (*, _) are stripped first.
- pattern()
  Extract the answer from model output using a regular expression, for cases where the answer is embedded in templated text. Requires at least one capture group; with multiple groups, set match_all=True to require every captured value to match the target (the default matches any one group). Returns a NOANSWER score when the pattern does not match.
- answer()
  For prompts that instruct the model to end with ANSWER: X. Extracts the letter, word, or remainder of the line that follows.
- model_graded_qa()
  Have another model assess whether the output is a correct answer, based on grading guidance in target. Use it for open-ended answers. The built-in template can be customised; see Model Grading.
- model_graded_fact()
  Like model_graded_qa() but narrower: have another model assess whether the output contains the fact set out in target. Use it when the output is too complex to assess with match() or pattern(). See Model Grading.
- exact()
  Normalize the answer and target(s) and require the whole output to match one or more targets exactly, returning CORRECT on a match. Reports mean and stderr metrics.
- f1()
  Compute the F1 score (the harmonic mean of precision and recall) over token overlap, for short free-text answers such as extractive QA. Accepts an answer_fn to extract the answer from the completion and a stop_words list to exclude from tokenization. Reports mean and stderr metrics.
- choice()
  Score multiple-choice questions produced by the multiple_choice() solver. Unshuffles any choices the solver shuffled before scoring, and supports multiple correct answers via a comma-separated target (e.g. "A,B").
- math()
  Compare answers for mathematical equivalence rather than as text. Extracts answers (supporting both \boxed{} LaTeX notation and plain text), normalizes expressions, and uses SymPy to check equivalence across LaTeX, fractions, roots, percentages, and algebra. Requires the optional sympy dependency (install with pip install sympy).
- perplexity()
  Compute per-token negative log-likelihood (NLL) from prompt log probabilities, for full-text perplexity benchmarks (WikiText, C4). Requires prompt_logprobs in GenerateConfig. See Perplexity.
- target_perplexity()
  Compute NLL of target-completion tokens only, given a prompt context, for benchmarks like ARC-C, MMLU, and HumanEval where only trailing target tokens are scored. See Perplexity.
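
As a rough illustration of two behaviours summarised in the list above, the sketch below re-creates positional matching (as in match() and includes()) and token-overlap F1 in plain Python. It is not Inspect's implementation: the helper names normalize, match_at, and token_f1 are hypothetical, the normalization (lower-casing and collapsing white-space) is an assumption, and details such as punctuation handling and stop words are left out.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lower-case and collapse runs of white-space (an assumed normalization)."""
    return " ".join(text.lower().split())


def match_at(output: str, target: str, location: str = "end") -> bool:
    """Check whether target appears at the given position in output."""
    out, tgt = normalize(output), normalize(target)
    if location == "begin":
        return out.startswith(tgt)
    if location == "end":
        return out.endswith(tgt)
    if location == "any":  # substring match, comparable to includes()
        return tgt in out
    if location == "exact":  # the whole output must equal the target
        return out == tgt
    raise ValueError(f"unknown location: {location!r}")


def token_f1(answer: str, target: str) -> float:
    """Harmonic mean of precision and recall over shared word tokens."""
    answer_tokens = re.findall(r"\w+", answer.lower())
    target_tokens = re.findall(r"\w+", target.lower())
    if not answer_tokens or not target_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(answer_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)
```

With these definitions, match_at("The capital of France is Paris", "paris") is true under the default end position, while token_f1("the cat sat", "the cat") gives 0.8 because precision is 2/3 and recall is 1.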
Metrics
Each scorer provides one or more built-in metrics. Most report accuracy and stderr; exact() and f1() report mean and stderr; and the perplexity scorers report perplexity_per_token and perplexity_per_seq. You can override these by passing your own metrics to the Task:

Task(
    dataset=dataset,
    solver=generate(),
    scorer=match(),
    metrics=[custom_metric()],
)

See Scoring Metrics for the built-in metrics, metric grouping, clustered standard errors, and writing your own.
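
To make the accuracy and stderr metrics mentioned above concrete, here is a minimal sketch of how they can be computed from per-sample values. It assumes that a correct sample is valued 1.0 and an incorrect one 0.0, and that stderr is the usual standard error of the mean based on the sample standard deviation (n - 1 in the denominator). The function names are illustrative, and Inspect's own metrics may differ in detail.

```python
import math


def accuracy(values: list[float]) -> float:
    """Mean of per-sample values (1.0 for correct, 0.0 for incorrect)."""
    return sum(values) / len(values)


def stderr(values: list[float]) -> float:
    """Standard error of the mean, using the sample variance (n - 1)."""
    n = len(values)
    if n < 2:
        return 0.0  # a spread cannot be estimated from fewer than two samples
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


# Three correct samples and one incorrect sample.
values = [1.0, 1.0, 0.0, 1.0]
print(accuracy(values), stderr(values))
```

For these four samples the accuracy is 0.75 and the standard error is 0.25.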
Learn More
The rest of the Scoring section covers everything beyond the standard scorers:
Custom Scorers: write your own scorers using the Score, Value, and Target types.
Model Grading: customise the model graders, use multiple grader models, and present chat history.
Multiple Scorers: use several scorers together, emit multiple scores, and reduce them.
Scoring Workflow: defer scoring, re-score logs with inspect score, and edit scores.
Perplexity: score how well a model predicts text using prompt log probabilities.