Multiple Scorers

Overview

There are several ways to use multiple scorers in an evaluation:

You can provide a list of scorers in a Task definition (this is the best option when scorers are entirely independent)
You can yield multiple scores from a Scorer (this is the best option when scores share code and/or expensive computations).
You can use multiple scorers and then aggregate them into a single scorer (e.g. majority voting).

Example

A single scorer can return several named scores at once, which is useful when the scores share work or a model call. The scorer below returns both whether the target appears in the output and whether the response stayed within a word budget, attaching metrics per score key:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Target, mean, scorer, stderr
from inspect_ai.solver import TaskState, generate


@scorer(metrics={"correct": [mean(), stderr()], "concise": [mean()]})
def answer_quality(max_words: int = 50):
    async def score(state: TaskState, target: Target) -> Score:
        completion = state.output.completion
        correct = 1 if target.text.lower() in completion.lower() else 0
        concise = 1 if len(completion.split()) <= max_words else 0
        return Score(
            value={
                "correct": correct,
                "concise": concise,
            },
            answer=completion,
        )

    return score


@task
def capitals():
    return Task(
        dataset=[
            Sample(
                input="What is the capital of France? Be brief.", 
                target="Paris"
            ),
        ],
        solver=generate(),
        scorer=answer_quality(),
    )

This produces two scores per sample, correct and concise, each aggregated by its own metrics. The sections below cover the individual patterns for combining scorers and scores.

Customise how scores are displayed

Once a task emits several scores per sample, the viewer’s defaults rarely show them the way you want. You can take control of the sample list — choosing which score columns appear, how they’re sorted, and shading numeric cells with a heat scale — so the scores that matter stand out:

A customised sample list with per-score columns and heat-scale shading.

See Task Views to configure columns, sorting, score colours, and the per-sample score panel for your own task.

List of Scorers

Task definitions can specify multiple scorers. For example, the below task will use two different models to grade the results, storing two scores with each sample, one for each of the two models:

Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        generate()
    ],
    scorer=[
        model_graded_qa(model="openai/gpt-4"),
        model_graded_qa(model="google/gemini-2.5-pro")
    ],
)

This is useful when there is more than one way to score a result and you would like preserve the individual score values with each sample (versus reducing the multiple scores to a single value).

Scorer with Multiple Values

You may also create a scorer which yields multiple scores. This is useful when the scores use data that is shared or expensive to compute. For example:

@scorer(
1    metrics={
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
2        return Score(
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count(),
)

1: The metrics for this scorer are a dictionary that defines metrics to be applied to scores (by name).
2: The score value itself is a dictionary, with keys corresponding to the keys defined in the metrics on the @scorer decorator.

The above example will produce two scores, a_count and e_count, each of which will have metrics for mean and stderr.

When working with complex score values and metrics, you may use globs as keys for mapping metrics to scores. For example, a more succinct way to write the previous example:

@scorer(
    metrics={
        "*": [mean(), stderr()],
    }
)

Glob keys will each be resolved and a complete list of matching metrics will be applied to each score key. For example to compute mean for all score keys, and only compute stderr for e_count you could write:

@scorer(
    metrics={
        "*": [mean()],
        "e_count": [stderr()]
    }
)

Scorer with Complex Metrics

Sometime, it is useful for a scorer to compute multiple values (returning a dictionary as the score value) and to have metrics computed both for each key in the score dictionary, but also for the dictionary as a whole. For example:

@scorer(
1    metrics=[{
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }, total_count()]
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
2        return Score(
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

@metric
def total_count() -> Metric:
    def metric(scores: list[SampleScore]) -> int | float:
        total = 0.0
        for score in scores:
            total += (
3                score.score.value["a_count"]
                + score.score.value["e_count"]
            )
        return total
    return metric

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count(),
)

1: The metrics for this scorer are a list. One element is a dictionary that defines metrics to be applied to scores (by name); the other element is a Metric which will receive the entire score dictionary.
2: The score value itself is a dictionary, with keys corresponding to the keys defined in the metrics on the @scorer decorator.
3: The total_count metric will compute a metric based upon the entire score dictionary (since it isn’t being mapped onto the dictionary by key)

Reducing Multiple Scores

It’s possible to use multiple scorers in parallel, then reduce their output into a final overall score. This is done using the multi_scorer() function. For example, this is roughly how the built in model graders use multiple models for grading:

multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = "mode"
)

Use of multi_scorer() requires both a list of scorers as well as a reducer which determines how a list of scores will be turned into a single score. In this case we use the “mode” reducer which returns the score that appeared most frequently in the answers (i.e. a majority vote). See Reducing Epochs for the full set of built-in reducers.

Sandbox Access

If your Solver is an Agent with tool use, you might want to inspect the contents of the tool sandbox to score the task.

The contents of the sandbox for the Sample are available to the scorer; simply call await sandbox().read_file() (or .exec()).

For example:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Plan, TaskState, generate, use_tools
from inspect_ai.tool import bash
from inspect_ai.util import sandbox


@scorer(metrics=[accuracy()])
def check_file_exists():
    async def score(state: TaskState, target: Target):
        try:
            _ = await sandbox().read_file(target.text)
            exists = True
        except FileNotFoundError:
            exists = False
        return Score(value=1 if exists else 0)

    return score


@task
def challenge() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Create a file called hello-world.txt",
                target="hello-world.txt",
            )
        ],
        solver=[use_tools([bash()]), generate()],
        sandbox="local",
        scorer=check_file_exists(),
    )

Scanners as Scorers

If instead of grading task success you want to flag transcripts that exhibit a particular behaviour (refusals, evaluation awareness, reward hacking), you can write a scanner and add it to a task’s scorers. The scanner’s Result is converted to a Score and aggregated like any other scorer. See Scanners as Scorers for details.