Custom Scorers
Overview
Custom scorers are functions that take a TaskState and Target, and yield a Score.
```python
async def score(state: TaskState, target: Target):
    # Compare state / model output with target
    # to yield a score
    return Score(value=...)
```

First we’ll talk about the core Score and Value objects, then provide some examples of custom scorers to make things more concrete.
Note that score above is declared as an async function. When creating custom scorers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your scorer is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism before proceeding.
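To make the concurrency point concrete, here is a minimal sketch in plain asyncio of why scorers should await their slow work rather than block. It is not Inspect's scheduler. The names `slow_grade` and `score_all`, the concurrency cap, and the simulated delay are all hypothetical, and `asyncio.sleep()` stands in for a REST call or external process.

```python
import asyncio


async def slow_grade(sample_id: int) -> tuple[int, str]:
    # Hypothetical stand-in for a non-blocking call to an external grader.
    # A blocking call here (e.g. time.sleep) would stall every other scorer.
    await asyncio.sleep(0.01)
    return sample_id, "C" if sample_id % 2 == 0 else "I"


async def score_all(n: int, max_concurrent: int = 4) -> list[tuple[int, str]]:
    # Cap the number of in-flight calls (an assumed limit for illustration).
    limit = asyncio.Semaphore(max_concurrent)

    async def guarded(sample_id: int) -> tuple[int, str]:
        async with limit:
            return await slow_grade(sample_id)

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(guarded(i) for i in range(n)))


results = asyncio.run(score_all(6))
```

Because each call awaits instead of blocking, the six simulated grading calls overlap rather than running one after another.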
Example
This scorer extracts the last number from the model’s output and marks the sample correct when it falls within a relative tolerance of the target. It registers metrics with @scorer, reads the model output from state, compares against target.text, and returns a Score with an answer and explanation:
```python
import re

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import (
    CORRECT,
    INCORRECT,
    Score,
    Target,
    accuracy,
    scorer,
    stderr,
)
from inspect_ai.solver import TaskState, generate


@scorer(metrics=[accuracy(), stderr()])
def close_enough(rel_tol: float = 0.01):
    async def score(state: TaskState, target: Target) -> Score:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", state.output.completion)
        if not numbers:
            return Score(value=INCORRECT, explanation="No number found in output.")

        answer = numbers[-1]
        expected = float(target.text)
        correct = abs(float(answer) - expected) <= rel_tol * abs(expected)
        return Score(
            value=CORRECT if correct else INCORRECT,
            answer=answer,
            explanation=state.output.completion,
        )

    return score


@task
def arithmetic():
    return Task(
        dataset=[
            Sample(input="What is 18 * 7? Reply with just the number.", target="126"),
        ],
        solver=generate(),
        scorer=close_enough(),
    )
```

Run it with inspect eval like any other task:

```bash
inspect eval arithmetic.py --model openai/gpt-4o
```

The sections below describe the pieces this example relies on.
Score
The components of Score include:
| Field | Type | Description |
|---|---|---|
| value | Value | Value assigned to the sample (e.g. “C” or “I”, or a raw numeric value). |
| answer | str | Text extracted from model output for comparison (optional). |
| explanation | str | Explanation of score, e.g. full model output or grader model output (optional). |
| metadata | dict[str,Any] | Additional metadata about the score to record in the log file (optional). |

For example, the following are all valid Score objects:
```python
Score(value="C")
Score(value="I")
Score(value=0.6)
Score(
    value="C" if extracted == target.text else "I",
    answer=extracted,
    explanation=state.output.completion
)
```

Score.value may be any Value that your metrics know how to interpret. Built-in correctness scorers use the constants CORRECT ("C"), INCORRECT ("I"), PARTIAL ("P"), and NOANSWER ("N"). The default value_to_float() converter used by metrics such as accuracy() maps these values to 1.0, 0.0, 0.5, and 0.0 respectively. It also converts numeric values, numeric strings, and common boolean strings such as "yes" / "no" and "true" / "false".
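The mapping described above can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not Inspect's own code: the function name `to_float_sketch` is hypothetical, and raising on unrecognised values is a choice made for this sketch (the real converter's handling of such values is not covered here).

```python
def to_float_sketch(value, correct="C", incorrect="I", partial="P", noanswer="N"):
    """Illustrative sketch of the value-to-float mapping (not Inspect's code)."""
    # bool is a subclass of int, so it must be checked first.
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    # Correctness constants (overridable, as with custom strings like "pass").
    if value == correct:
        return 1.0
    if value == partial:
        return 0.5
    if value in (incorrect, noanswer):
        return 0.0
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("yes", "true"):
            return 1.0
        if text in ("no", "false"):
            return 0.0
        try:
            return float(text)  # numeric strings such as "0.25"
        except ValueError:
            pass
    # Assumption for this sketch only: fail loudly on unknown values.
    raise ValueError(f"Cannot convert {value!r} to float")
```

With the defaults, `to_float_sketch("P")` yields 0.5. Passing `correct="pass", incorrect="fail"` mirrors the idea of supplying custom strings to a converter.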
You can return other strings, but aggregate metrics need a converter that understands them. For example:
```python
from inspect_ai.scorer import accuracy, value_to_float

accuracy(to_float=value_to_float(correct="pass", incorrect="fail"))
```

If you are extracting an answer from within a completion (e.g. looking for text using a regex pattern, looking at the beginning or end of the completion, etc.) you should strive to always return an answer as part of your Score, as this makes it much easier to understand the details of scoring when viewing the eval log file.
Unscored Samples
When a scorer cannot produce a value for a sample (e.g. an external grader returned no result, the model refused, or an error occurred) but you still want to record context, use Score.unscored():
```python
return Score.unscored(
    answer=extracted,
    explanation="grader returned no result",
    metadata={"reason": "timeout"},
)
```

Unscored samples are skipped by aggregate metrics and epoch reducers and are counted toward EvalScore.unscored_samples rather than included as zeros. This works for scalar, dict-valued, and list-valued scorers.
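The difference between skipping unscored samples and counting them as zeros can be illustrated with a small standalone sketch. It uses `None` as a stand-in for an unscored sample. That sentinel and the helper name `mean_of_scored` are assumptions made for illustration and say nothing about how Inspect represents unscored samples internally.

```python
from statistics import mean
from typing import Optional


def mean_of_scored(values: list[Optional[float]]) -> tuple[Optional[float], int]:
    # Keep only samples that actually received a value.
    scored = [v for v in values if v is not None]
    # Count the rest separately instead of folding them in as zeros.
    unscored = len(values) - len(scored)
    return (mean(scored) if scored else None), unscored


values = [1.0, None, 0.0, 1.0]
avg, skipped = mean_of_scored(values)

# For contrast: treating the unscored sample as 0.0 drags the mean down.
avg_with_zeros = mean(0.0 if v is None else v for v in values)
```

Here the mean over scored samples is 2/3 with one sample reported as unscored, whereas treating the missing value as zero would give 0.5.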
Value
Value is a union over the main scalar types as well as a list or dict of the same types:
```python
Value = Union[
    str | int | float | bool,
    Sequence[str | int | float | bool],
    Mapping[str, str | int | float | bool],
]
```

The vast majority of scorers will use str (e.g. for correct/incorrect via “C” and “I”) or float (the other types are there to meet more complex scenarios). One thing to keep in mind is that whatever Value type you use in a scorer must be supported by the metrics declared for the scorer (more on this below).
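To illustrate why the metric must understand the Value type, here is a hedged sketch of an aggregate written specifically for dict-valued scores. The helper `mean_by_key` and the key names are hypothetical and not part of Inspect. The point is only that a metric expecting a single float per sample would not know what to do with these values.

```python
from collections import defaultdict
from typing import Mapping, Sequence


def mean_by_key(values: Sequence[Mapping[str, float]]) -> dict[str, float]:
    # Accumulate a running total and count for each key across samples.
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for value in values:
        for key, number in value.items():
            totals[key] += float(number)
            counts[key] += 1
    # One mean per key, rather than one number for the whole sample set.
    return {key: totals[key] / counts[key] for key in totals}


per_key = mean_by_key(
    [
        {"fluency": 1.0, "accuracy": 0.0},
        {"fluency": 0.5, "accuracy": 1.0},
    ]
)
```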
Next, we’ll take a look at the source code for a couple of the built-in scorers as a jumping-off point for implementing your own scorers. If you are working on custom scorers, you should also review the Scoring Workflow for tips on optimising your development process.
Models in Scorers
You’ll often want to use models in the implementation of scorers. Use the get_model() function to get either the currently evaluated model or another model interface. For example:
```python
from inspect_ai.model import get_model

# use the model being evaluated for grading
grader_model = get_model()

# use another model for grading
grader_model = get_model("google/gemini-2.5-pro")
```

Use the config parameter of get_model() to override default generation options:

```python
from inspect_ai.model import GenerateConfig, get_model

grader_model = get_model(
    "google/gemini-2.5-pro",
    config=GenerateConfig(temperature=0.9, max_connections=10),
)
```

Example: Includes
Here is the source code for the built-in includes() scorer:
```python
@scorer(metrics=[accuracy(), stderr()])  # (1)
def includes(ignore_case: bool = True):

    async def score(state: TaskState, target: Target):  # (2)

        # check for correct
        answer = state.output.completion
        target = target.text  # (3)
        if ignore_case:
            correct = answer.lower().rfind(target.lower()) != -1
        else:
            correct = answer.rfind(target) != -1

        # return score
        return Score(
            value=CORRECT if correct else INCORRECT,  # (4)
            answer=answer,  # (5)
        )

    return score
```

1. The function applies the @scorer decorator and registers two metrics for use with the scorer.
2. The score function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this scorer doesn’t call a model but others will).
3. We make use of the text property on the Target. This is a convenience property to get a simple text value out of the Target (as targets can technically be a list of strings).
4. We use the special constants CORRECT and INCORRECT for the score value, as the accuracy(), stderr(), and bootstrap_stderr() metrics know how to convert these special constants to float values (1.0 and 0.0 respectively).
5. We provide the full model completion as the answer for the score (answer is optional, but highly recommended as it is often useful to refer to during evaluation development).
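The matching logic in includes() reduces to a substring test, which can be tried out without Inspect. In this minimal sketch the helper name `includes_match` is hypothetical, and Python's `in` operator is used in place of rfind(), which is equivalent when only presence matters.

```python
def includes_match(answer: str, target: str, ignore_case: bool = True) -> bool:
    # Same presence check as the scorer: is the target anywhere in the answer?
    if ignore_case:
        return target.lower() in answer.lower()
    return target in answer


case_insensitive = includes_match("The answer is Paris.", "paris")
case_sensitive = includes_match("The answer is Paris.", "paris", ignore_case=False)

# Substring matching is permissive: a short target also matches inside
# a longer token, so "12" is found in "126".
partial_token = includes_match("The result is 126.", "12")
```

The last case is worth keeping in mind when targets are short numbers or single letters, since a substring check cannot tell a full answer from a fragment of one.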
Example: Model Grading
Here’s a somewhat simplified version of the code for the model_graded_qa() scorer:
```python
@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str = DEFAULT_MODEL_GRADED_QA_TEMPLATE,
    instructions: str = DEFAULT_MODEL_GRADED_QA_INSTRUCTIONS,
    grade_pattern: str = DEFAULT_GRADE_PATTERN,
    model: str | Model | None = None,
) -> Scorer:

    # resolve grading template and instructions,
    # (as they could be file paths or URLs)
    template = resource(template)
    instructions = resource(instructions)

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target) -> Score:
        # format the model grading template
        score_prompt = template.format(
            question=state.input_text,
            answer=state.output.completion,
            criterion=target.text,
            instructions=instructions,
        )

        # query the model for the score
        result = await grader_model.generate(score_prompt)

        # extract the grade
        match = re.search(grade_pattern, result.completion)
        if match:
            return Score(
                value=match.group(1),
                answer=match.group(0),
                explanation=result.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Grade not found in model output: "
                + f"{result.completion}",
            )

    return score
```

Note that the call to grader_model.generate() is done with await. This is critical to ensure that the scorer participates correctly in the scheduling of generation work.
Note also we use the input_text property of the TaskState to access a string version of the original user input to substitute it into the grading template. Using the input_text has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in messages); and (2) It normalises the input to a string (as it could have been a message list).
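The grade-extraction step in the scorer above can be exercised on its own. In this sketch the pattern `GRADE_PATTERN` is a hypothetical stand-in, not the value of DEFAULT_GRADE_PATTERN, and `extract_grade` is an illustrative helper. It shows how group(1) becomes the score value and group(0) the recorded answer.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; the built-in default may differ.
GRADE_PATTERN = r"GRADE:\s*([CPI])"


def extract_grade(completion: str) -> Optional[tuple[str, str]]:
    match = re.search(GRADE_PATTERN, completion)
    if match is None:
        # The scorer above falls back to INCORRECT in this case.
        return None
    # group(1) is the captured grade letter, group(0) the full matched text.
    return match.group(1), match.group(0)


graded = extract_grade("The submission matches the criterion.\nGRADE: C")
missing = extract_grade("I am not sure how to grade this.")
```

If the grader model's reply does not follow the expected format, the search finds nothing, which is why the scorer includes the full grader output in the explanation.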
For the full set of customisation options on the built-in graders, see Model Grading.