Standard Scorers

Overview

A scorer compares a model’s output against the target for each sample and returns a Score. You attach one to a task with the scorer argument. Here match() checks that the model’s answer ends with the target:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def capitals():
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),
        scorer=match(),
    )

Available Scorers

Inspect includes both text matching scorers as well as model graded scorers. Below is a summary of these scorers. See the inspect_ai.scorer reference for complete function signatures and options.

includes(): Check whether the target appears anywhere in the model output (a substring match). Case sensitive or insensitive (defaults to insensitive).
match(): Check whether the target appears at a known position: begin, end (the default), or any. With location="exact" the whole output must equal the target. Ignores case and white-space by default. Pass numeric=True to compare numbers rather than text; currency symbols ($, €, £), thousands separators (,), and formatting markers (*, _) are stripped first.
pattern(): Extract the answer from model output using a regular expression, for cases where the answer is embedded in templated text. Requires at least one capture group; with multiple groups, set match_all=True to require every captured value to match the target (the default matches any one group). Returns a NOANSWER score when the pattern does not match.
answer(): For prompts that instruct the model to end with ANSWER: X. Extracts the letter, word, or remainder of the line that follows.
model_graded_qa(): Have another model assess whether the output is a correct answer, based on grading guidance in target. Use it for open-ended answers. The built-in template can be customised; see Model Grading.
model_graded_fact(): Like model_graded_qa() but narrower: have another model assess whether the output contains the fact set out in target. Use it when the output is too complex to assess with match() or pattern(). See Model Grading.
exact(): Normalize the answer and target(s) and require the whole output to match one or more targets exactly, returning CORRECT on a match. Reports mean and stderr metrics.
f1(): Compute the F1 score (the harmonic mean of precision and recall) over token overlap, for short free-text answers such as extractive QA. Accepts an answer_fn to extract the answer from the completion and a stop_words list to exclude from tokenization. Reports mean and stderr metrics.
choice(): Score multiple-choice questions produced by the multiple_choice() solver. Unshuffles any choices the solver shuffled before scoring, and supports multiple correct answers via a comma-separated target (e.g. "A,B").
math(): Compare answers for mathematical equivalence rather than as text. Extracts answers (supporting both \boxed{} LaTeX notation and plain text), normalizes expressions, and uses SymPy to check equivalence across LaTeX, fractions, roots, percentages, and algebra. Requires the optional sympy dependency (install with pip install sympy).
perplexity(): Compute per-token negative log-likelihood (NLL) from prompt log probabilities, for full-text perplexity benchmarks (WikiText, C4). Requires prompt_logprobs in GenerateConfig. See Perplexity.
target_perplexity(): Compute NLL of target-completion tokens only, given a prompt context, for benchmarks like ARC-C, MMLU, and HumanEval where only trailing target tokens are scored. See Perplexity.

Metrics

Each scorer provides one or more built-in metrics. Most report accuracy and stderr; exact() and f1() report mean and stderr; and the perplexity scorers report perplexity_per_token and perplexity_per_seq. You can override these by passing your own metrics to the Task:

Task(
    dataset=dataset,
    solver=generate(),
    scorer=match(),
    metrics=[custom_metric()],
)

See Scoring Metrics for the built-in metrics, metric grouping, clustered standard errors, and writing your own.

Learn More

The rest of the Scoring section covers everything beyond the standard scorers:

Custom Scorers: write your own scorers using the Score, Value, and Target types.
Model Grading: customise the model graders, use multiple grader models, and present chat history.
Multiple Scorers: use several scorers together, emit multiple scores, and reduce them.
Scoring Workflow: defer scoring, re-score logs with inspect score, and edit scores.
Perplexity: score how well a model predicts text using prompt log probabilities.