Scoring Metrics
Overview
Each scorer provides one or more built-in metrics (typically accuracy and stderr), chosen as the metrics most commonly useful for that scorer.
You can override a scorer's built-in metrics by passing an alternate list of metrics to the Task. For example:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=choice(),
    metrics=[custom_metric()]
)
If you still want to compute the built-in metrics, re-specify them along with the custom metrics:
metrics=[accuracy(), stderr(), custom_metric()]
Built-In Metrics
Inspect includes some simple built-in metrics for calculating accuracy, mean, etc. Built-in metrics can be imported from the inspect_ai.scorer module. Below is a summary of these metrics. See the inspect_ai.scorer reference for complete function signatures and options.
- accuracy(): Compute the proportion of total answers which are correct. For correct/incorrect scores assigned 1 or 0, can optionally assign 0.5 for partially correct answers.
- mean(): Mean of all scores.
- var(): Sample variance over all scores.
- std(): Standard deviation over all scores (see below for details on computing clustered standard errors).
- stderr(): Standard error of the mean.
- bootstrap_stderr(): Standard deviation of a bootstrapped estimate of the mean. 1000 samples are taken by default (modify this using the num_samples option).
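To make the definitions above concrete, here is a minimal standard-library sketch of the statistics these metrics report. It is not Inspect's implementation: the `summarize()` helper and its `seed` argument are hypothetical, and the n - 1 denominator for the standard deviation, as well as the population standard deviation taken over the bootstrap means, are assumptions. When correct/incorrect answers are mapped to 1/0 (with 0.5 for partial credit), accuracy is simply the mean of those values.

```python
import math
import random
import statistics


def summarize(values: list[float], num_samples: int = 1000, seed: int = 0) -> dict[str, float]:
    """Compute mean, variance, std, stderr, and bootstrap stderr for numeric scores."""
    n = len(values)
    mean = statistics.fmean(values)
    var = statistics.variance(values)  # sample variance (n - 1 denominator)
    std = math.sqrt(var)               # assumption: same n - 1 denominator
    stderr = std / math.sqrt(n)        # standard error of the mean

    # Bootstrap: resample with replacement, record the mean of each resample,
    # then report the spread (standard deviation) of those resampled means.
    rng = random.Random(seed)
    boot_means = [
        statistics.fmean(rng.choices(values, k=n)) for _ in range(num_samples)
    ]
    return {
        "mean": mean,
        "var": var,
        "std": std,
        "stderr": stderr,
        "bootstrap_stderr": statistics.pstdev(boot_means),
    }


# Scores of 1 (correct), 0 (incorrect), and 0.5 (partially correct).
scores = [1.0, 0.0, 1.0, 1.0, 0.5, 0.0, 1.0, 1.0]
stats = summarize(scores)
```

For a sample this small the analytic and bootstrapped standard errors will be close but not identical, since the bootstrap estimate depends on the random resamples.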
Metric Grouping
The grouped() function applies a given metric to subgroups of samples defined by a key in sample metadata, creating a separate metric for each group along with an "all" metric that aggregates across all samples or groups. Each sample must have a value for whatever key is used for grouping.
For example, let’s say you wanted to create a separate accuracy metric for each distinct “category” variable defined in Sample metadata:
@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        metrics=[grouped(accuracy(), "category"), stderr()]
    )
The metrics passed to the Task override the default metrics of the choice() scorer.
Note that the "all" metric by default takes the selected metric over all of the samples. If you prefer that it take the mean of the individual grouped values, pass all="groups":
grouped(accuracy(), "category", all="groups")
You can customize the metric names using the name_template parameter. The template uses {group_name} as a placeholder for the group value:
grouped(accuracy(), "category", name_template="category_{group_name}")
This would produce metrics named category_physics, category_chemistry, etc. instead of just physics, chemistry. It does not affect the "all" metric, so that can be named separately.
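The difference between the two "all" modes is easiest to see with unevenly sized groups. The following is a plain-Python sketch of the grouping logic described above, not Inspect's code: `grouped_mean()` and its `all_mode` parameter are hypothetical stand-ins for `grouped()` and its `all` option, and each record is assumed to be a `(score, metadata)` pair.

```python
from collections import defaultdict
from statistics import fmean


def grouped_mean(records, key, all_mode="samples", name_template="{group_name}"):
    """Apply a mean metric per metadata group and add an "all" aggregate.

    `records` is a list of (score, metadata) pairs (an assumed shape).
    """
    groups = defaultdict(list)
    for value, metadata in records:
        if key not in metadata:
            # Every sample must carry the grouping key.
            raise ValueError(f"sample is missing metadata key {key!r}")
        groups[str(metadata[key])].append(value)

    # One metric per group, named via the template.
    result = {
        name_template.format(group_name=name): fmean(values)
        for name, values in groups.items()
    }
    if all_mode == "samples":
        # Metric over every sample, so larger groups carry more weight.
        result["all"] = fmean([value for value, _ in records])
    else:
        # Mean of the per-group values, so each group counts equally.
        result["all"] = fmean([fmean(values) for values in groups.values()])
    return result


records = [
    (1.0, {"category": "physics"}),
    (1.0, {"category": "physics"}),
    (1.0, {"category": "physics"}),
    (0.0, {"category": "chemistry"}),
]
by_samples = grouped_mean(records, "category")
by_groups = grouped_mean(records, "category", all_mode="groups")
renamed = grouped_mean(records, "category", name_template="category_{group_name}")
```

Here the sample-weighted "all" value is 0.75, while the group-weighted value is 0.5, because the single chemistry sample counts as much as the three physics samples.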
Clustered Stderr
The stderr() metric supports computing clustered standard errors via the cluster parameter. Most scorers already include stderr() as a built-in metric, so to compute clustered standard errors you'll need to specify custom metrics for your task (which will override the scorer's built-in metrics).
For example, let’s say you wanted to cluster on a “category” variable defined in Sample metadata:
@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        metrics=[accuracy(), stderr(cluster="category")]
    )
The metrics passed to the Task override the default metrics of the choice() scorer.
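For intuition about what clustering changes, the sketch below implements one common cluster-robust estimator of the standard error of the mean. This is an assumption for illustration: the text above does not specify which estimator Inspect uses, and `clustered_stderr()` here is a hypothetical standalone helper, not the library function. Deviations from the overall mean are summed within each cluster before squaring, so positively correlated scores inside a cluster inflate the estimate.

```python
import math
from collections import defaultdict


def clustered_stderr(values: list[float], clusters: list[str]) -> float:
    """Cluster-robust standard error of the mean (illustrative sketch)."""
    n = len(values)
    mean = sum(values) / n

    # Sum deviations from the overall mean within each cluster.
    dev_sums: dict[str, float] = defaultdict(float)
    for value, cluster in zip(values, clusters):
        dev_sums[cluster] += value - mean

    g = len(dev_sums)
    if g < 2:
        raise ValueError("at least two clusters are required")

    # Squaring per-cluster sums keeps the within-cluster covariance terms;
    # g / (g - 1) is a small-sample correction for the number of clusters.
    variance = (g / (g - 1)) * sum(s * s for s in dev_sums.values()) / (n * n)
    return math.sqrt(variance)


values = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
# Every sample in its own cluster: reduces to the ordinary standard error.
unclustered = clustered_stderr(values, ["a", "b", "c", "d", "e", "f"])
# Perfectly correlated scores within two clusters: a much larger error.
clustered = clustered_stderr(values, ["x", "x", "x", "y", "y", "y"])
```

With one sample per cluster this estimator matches the usual standard error of the mean; grouping the correlated samples roughly doubles it in this example, reflecting that six correlated samples carry less information than six independent ones.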
Custom Metrics
You can also add your own metrics with @metric decorated functions. For example, here is the implementation of the mean metric:
import numpy as np
# SampleScore is used in the type annotation below, so it must be imported
from inspect_ai.scorer import Metric, SampleScore, metric
@metric
def mean() -> Metric:
    """Compute mean of all scores.
    Returns:
        mean metric
    """
    def metric(scores: list[SampleScore]) -> float:
        return np.mean([score.score.as_float() for score in scores]).item()
    return metric
Note that the Score class contains a Value that is a union over several scalar and collection types. As a convenience, Score includes a set of accessor methods to treat the value as a simpler form (e.g. above we use the score.as_float() accessor).
Example
This task pairs a float-valued scorer with a custom pass_rate() metric (the fraction of samples scoring at or above a threshold), reported alongside the built-in mean() and stderr():
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import (
    Metric,
    SampleScore,
    Score,
    Target,
    mean,
    metric,
    scorer,
    stderr,
)
from inspect_ai.solver import TaskState, generate
@metric
def pass_rate(threshold: float = 0.5) -> Metric:
    """Proportion of samples scoring at or above `threshold`."""
    def metric(scores: list[SampleScore]) -> float:
        if not scores:
            return 0.0
        passed = [s for s in scores if s.score.as_float() >= threshold]
        return len(passed) / len(scores)
    return metric
@scorer(metrics=[mean(), stderr(), pass_rate()])
def word_overlap():
    async def score(state: TaskState, target: Target) -> Score:
        output = state.output.completion.lower()
        words = target.text.lower().split()
        hits = sum(1 for word in words if word in output)
        return Score(value=hits / len(words) if words else 0.0)
    return score
@task
def colors():
    return Task(
        dataset=[Sample(input="Name three primary colors.", target="red green blue")],
        solver=generate(),
        scorer=word_overlap(),
    )
The eval log reports mean, stderr, and pass_rate for the word_overlap scorer. Because pass_rate is attached to the scorer via @scorer(metrics=...), it is applied automatically; you can also override a scorer's metrics per-task as shown in the Overview.
Reducing Epochs
If a task is run over more than one epoch, multiple scores will be generated for each sample. These scores are then reduced to a single score representing the score for the sample across all the epochs.
By default, this is done by taking the mean of all sample scores, but you may specify other reduction strategies by passing an Epochs object, which includes both a count and one or more reducers to combine sample scores with. For example:
@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        epochs=Epochs(5, "mode"),
    )
You may also specify more than one reducer, in which case metrics are computed using each of the reducers. For example:
@task
def gpqa():
    return Task(
        ...
        epochs=Epochs(5, ["at_least_2", "at_least_5"]),
    )
Built-in Reducers
Inspect includes several built-in reducers, which are summarised below.
| Reducer | Description |
|---|---|
| mean | Reduce to the average of all scores. |
| median | Reduce to the median of all scores. |
| mode | Reduce to the most common score. |
| max | Reduce to the maximum of all scores. |
| pass_at_{k} | Probability of at least 1 correct sample given k epochs (https://arxiv.org/pdf/2107.03374) |
| pass_k_{k} | Probability that all k epoch attempts succeed (https://arxiv.org/pdf/2406.12045) |
| at_least_{k} | 1 if at least k samples are correct, else 0. |
The built-in reducers compute a reduced value for the score and populate the answer and explanation fields only if their values are equal across all epochs. The metadata field is always reduced to the value of metadata in the first epoch. If your custom metric needs different behavior when reducing these fields, implement your own custom reducer that merges or preserves them as required.
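As a worked illustration of the table above, the sketch below applies each reduction to the scores of one sample over five epochs. These are standalone, hypothetical helpers rather than Inspect's reducers: the estimators for pass_at_{k} and pass_k_{k} follow the formulas in the two cited papers, and treating a score as "correct" when it is at least a `threshold` of 1.0 is an assumption.

```python
from math import comb
from statistics import fmean, median, mode


def count_correct(scores: list[float], threshold: float = 1.0) -> int:
    """Number of epochs whose score counts as correct (assumed threshold)."""
    return sum(1 for s in scores if s >= threshold)


def pass_at_k(scores: list[float], k: int) -> float:
    """Estimate of at least one success in k attempts: 1 - C(n - c, k) / C(n, k)."""
    n, c = len(scores), count_correct(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of epochs")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_k(scores: list[float], k: int) -> float:
    """Estimate of all k attempts succeeding: C(c, k) / C(n, k)."""
    n, c = len(scores), count_correct(scores)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of epochs")
    return comb(c, k) / comb(n, k)


def at_least_k(scores: list[float], k: int) -> float:
    """1 if at least k epochs are correct, else 0."""
    return 1.0 if count_correct(scores) >= k else 0.0


# One sample scored over five epochs: three correct, two incorrect.
epoch_scores = [1.0, 0.0, 1.0, 1.0, 0.0]
reduced = {
    "mean": fmean(epoch_scores),
    "median": median(epoch_scores),
    "mode": mode(epoch_scores),
    "max": max(epoch_scores),
    "pass_at_2": pass_at_k(epoch_scores, 2),
    "pass_k_2": pass_k(epoch_scores, 2),
    "at_least_2": at_least_k(epoch_scores, 2),
    "at_least_5": at_least_k(epoch_scores, 5),
}
```

With three of five epochs correct, pass_at_2 is 1 - 1/10 and pass_k_2 is 3/10, which shows how much stricter the "all attempts succeed" criterion is than "any attempt succeeds".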
Custom Reducers
You can also add your own reducer with @score_reducer decorated functions. Here’s a somewhat simplified version of the code for the mean reducer:
import statistics
from inspect_ai.scorer import (
    Score, ScoreReducer, score_reducer, value_to_float
)
@score_reducer(name="mean")
def mean_score() -> ScoreReducer:
    to_float = value_to_float()
    def reduce(scores: list[Score]) -> Score:
        """Compute a mean value of all scores."""
        values = [to_float(score.value) for score in scores]
        mean_value = statistics.mean(values)
        return Score(value=mean_value)
    return reduce