inspect_ai.scorer
Scorers
match
Scorer which matches text or a number.
@scorer(metrics=[accuracy(), stderr()])
def match(
    location: Literal["begin", "end", "any", "exact"] = "end",
    *,
    ignore_case: bool = True,
    numeric: bool = False,
) -> Scorer

location: Literal["begin", "end", "any", "exact"]
Location to match at. “any” matches anywhere in the output; “exact” requires that the output be exactly equal to the target (modulo whitespace, etc.).

ignore_case: bool
Do case insensitive comparison.

numeric: bool
Is this a numeric match? (in this case different punctuation removal rules are used and numbers are normalized before comparison).
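A minimal usage sketch (the task and sample are illustrative):

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def addition() -> Task:
    # numeric=True normalizes numbers before comparison (e.g. "4" vs "4.0")
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],
        solver=generate(),
        scorer=match(numeric=True),
    )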
includes
Check whether the specified text is included in the model output.
@scorer(metrics=[accuracy(), stderr()])
def includes(ignore_case: bool = True) -> Scorer

ignore_case: bool
Use a case insensitive comparison.
pattern
Scorer which extracts the model answer using a regex.
Note that at least one regex capture group is required to match against the target.
The regex can have a single capture group or multiple groups; in the case of multiple groups, the scorer can be configured to require that either one or all of the extracted groups match the target.
@scorer(metrics=[accuracy(), stderr()])
def pattern(pattern: str, ignore_case: bool = True, match_all: bool = False) -> Scorer

pattern: str
Regular expression for extracting the answer from model output.

ignore_case: bool
Ignore case when comparing the extracted answer to the targets. (Default: True)

match_all: bool
With multiple captures, do all captured values need to match the target? (Default: False)
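For example, a scorer that extracts the digits following "ANSWER:" for comparison against the target (the regex is illustrative):

from inspect_ai.scorer import pattern

# the single capture group (\d+) is what gets compared against the target
final_number = pattern(r"ANSWER:\s*(\d+)")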
answer
Scorer for model output that prefaces answers with ANSWER:.
Some solvers including multiple_choice solicit answers from the model prefaced with “ANSWER:”. This scorer extracts answers of this form for comparison with the target.
Note that you must specify a type for the answer scorer.
@scorer(metrics=[accuracy(), stderr()])
def answer(pattern: Literal["letter", "word", "line"]) -> Scorer

pattern: Literal["letter", "word", "line"]
Type of answer to extract. “letter” is used with multiple choice and extracts a single letter; “word” will extract the next word (often used for yes/no answers); “line” will take the rest of the line (used for more complex answers that may have embedded spaces). Note that when using “line” your prompt should instruct the model to answer with a separate line at the end.
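For example, the “word” variant for yes/no answers (a sketch):

from inspect_ai.scorer import answer

# extracts the word following "ANSWER:" (e.g. "ANSWER: yes" -> "yes")
yes_no = answer("word")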
choice
Scorer for multiple choice answers, required by the multiple_choice solver.
This assumes that the model was called using a template that labels the answers with letters, for example:
What is the capital of France?
A) Paris
B) Berlin
C) London
The target for the dataset will then have a letter corresponding to the correct answer, e.g. the Target would be "A" for the above question. If multiple choices are correct, the Target can be an array of these letters.
@scorer(metrics=[accuracy(), stderr()])
def choice() -> Scorer
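A typical pairing with the multiple_choice solver (the dataset is illustrative):

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def capitals() -> Task:
    return Task(
        dataset=[
            Sample(
                input="What is the capital of France?",
                choices=["Paris", "Berlin", "London"],
                target="A",
            )
        ],
        solver=multiple_choice(),
        scorer=choice(),
    )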
f1
Scorer which produces an F1 score.
Computes the F1 score for the answer (which balances precision and recall by taking their harmonic mean).
@scorer(metrics=[mean(), stderr()])
def f1(
    answer_fn: Callable[[str], str] | None = None, stop_words: list[str] | None = None
) -> Scorer

answer_fn: Callable[[str], str] | None
Custom function to extract the answer from the completion (defaults to using the completion).

stop_words: list[str] | None
Stop words to include in answer tokenization.
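A sketch of supplying a custom answer extractor (take_last_line is illustrative):

from inspect_ai.scorer import f1

def take_last_line(completion: str) -> str:
    # use only the final line of the completion as the answer
    return completion.strip().splitlines()[-1]

qa_scorer = f1(answer_fn=take_last_line)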
exact
Scorer which produces an exact match score.
Normalizes the text of the answer and target(s) and performs an exact matching comparison of the text. This scorer will return CORRECT when the answer is an exact match to one or more targets.
@scorer(metrics=[mean(), stderr()])
def exact() -> Scorer

model_graded_qa
Score a question/answer task using a model.
@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    include_history: bool | Callable[[TaskState], str] = False,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
    model_role: str | None = "grader",
) -> Scorer

template: str | None
Template for grading prompt. This template has four variables: question, criterion, answer, and instructions (which is fed from the instructions parameter). Variables from sample metadata are also available in the template.

instructions: str | None
Grading instructions. This should include a prompt for the model to answer (e.g. with chain of thought reasoning) in a way that matches the specified grade_pattern; for example, the default grade_pattern looks for one of GRADE: C, GRADE: P, or GRADE: I.

grade_pattern: str | None
Regex to extract the grade from the model response. Defaults to looking for e.g. GRADE: C. The regex should have a single capture group that extracts exactly the letter C, P, or I.

include_history: bool | Callable[[TaskState], str]
Whether to include the full chat history in the presented question. Defaults to False, which presents only the original sample input. Optionally provide a function to customise how the chat history is presented.

partial_credit: bool
Whether to allow for “partial” credit for answers (by default assigned a score of 0.5). Defaults to False. Note that this parameter is only used with the default instructions (as custom instructions provide their own prompts for grades).

model: list[str | Model] | str | Model | None
Model or models to use for grading. If a list is provided, each model grades independently and the final grade is computed by majority vote. When this parameter is provided, it takes precedence over model_role.

model_role: str | None
Named model role to use for grading (default: “grader”). Ignored if model is provided. If specified and a model is bound to this role (e.g. via the model_roles argument to eval()), that model is used. If no role-bound model is available, the model being evaluated (the default model) is used.
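A minimal usage sketch (the grader model name is illustrative):

from inspect_ai.scorer import model_graded_qa

grader = model_graded_qa(
    partial_credit=True,    # allow GRADE: P, scored as 0.5
    model="openai/gpt-4o",  # or omit and bind a model to the "grader" role
)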
model_graded_fact
Score a question/answer task with a fact response using a model.
@scorer(metrics=[accuracy(), stderr()])
def model_graded_fact(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    include_history: bool | Callable[[TaskState], str] = False,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
    model_role: str | None = "grader",
) -> Scorer

template: str | None
Template for grading prompt. This template uses four variables: question, criterion, answer, and instructions (which is fed from the instructions parameter). Variables from sample metadata are also available in the template.

instructions: str | None
Grading instructions. This should include a prompt for the model to answer (e.g. with chain of thought reasoning) in a way that matches the specified grade_pattern; for example, the default grade_pattern looks for one of GRADE: C, GRADE: P, or GRADE: I.

grade_pattern: str | None
Regex to extract the grade from the model response. Defaults to looking for e.g. GRADE: C. The regex should have a single capture group that extracts exactly the letter C, P, or I.

include_history: bool | Callable[[TaskState], str]
Whether to include the full chat history in the presented question. Defaults to False, which presents only the original sample input. Optionally provide a function to customise how the chat history is presented.

partial_credit: bool
Whether to allow for “partial” credit for answers (by default assigned a score of 0.5). Defaults to False. Note that this parameter is only used with the default instructions (as custom instructions provide their own prompts for grades).

model: list[str | Model] | str | Model | None
Model or models to use for grading. If a list is provided, each model grades independently and the final grade is computed by majority vote. When this parameter is provided, it takes precedence over model_role.

model_role: str | None
Named model role to use for grading (default: “grader”). Ignored if model is provided. If specified and a model is bound to this role (e.g. via the model_roles argument to eval()), that model is used. If no role-bound model is available, the model being evaluated (the default model) is used.
multi_scorer
Returns a Scorer that runs multiple Scorers in parallel and aggregates their results into a single Score using the provided reducer function.
def multi_scorer(scorers: list[Scorer], reducer: str | ScoreReducer) -> Scorer

scorers: list[Scorer]
A list of Scorers.

reducer: str | ScoreReducer
A function which takes in a list of Scores and returns a single Score.
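For example, a majority vote across three independent graders might look like this (a sketch; any scorer and registered reducer name can be substituted):

from inspect_ai.scorer import model_graded_qa, multi_scorer

# three graders score independently; "mode" keeps the most common value
voted = multi_scorer(
    scorers=[model_graded_qa() for _ in range(3)],
    reducer="mode",
)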
Metrics
accuracy
Compute proportion of total answers which are correct.
@metric
def accuracy(to_float: ValueToFloat = value_to_float()) -> Metric

to_float: ValueToFloat
Function for mapping Value to float for computing metrics. The default value_to_float() maps CORRECT (“C”) to 1.0, INCORRECT (“I”) to 0, PARTIAL (“P”) to 0.5, and NOANSWER (“N”) to 0, casts numeric values to float directly, and prints a warning and returns 0 if the Value is a complex object (list or dict).
mean
Compute mean of all scores.
@metric
def mean() -> Metric

std
Calculates the sample standard deviation of a list of scores.
@metric
def std(to_float: ValueToFloat = value_to_float()) -> Metric

to_float: ValueToFloat
Function for mapping Value to float for computing metrics. The default value_to_float() maps CORRECT (“C”) to 1.0, INCORRECT (“I”) to 0, PARTIAL (“P”) to 0.5, and NOANSWER (“N”) to 0, casts numeric values to float directly, and prints a warning and returns 0 if the Value is a complex object (list or dict).
stderr
Standard error of the mean using Central Limit Theorem.
@metric
def stderr(
    to_float: ValueToFloat = value_to_float(), cluster: str | None = None
) -> Metric

to_float: ValueToFloat
Function for mapping Value to float for computing metrics. The default value_to_float() maps CORRECT (“C”) to 1.0, INCORRECT (“I”) to 0, PARTIAL (“P”) to 0.5, and NOANSWER (“N”) to 0, casts numeric values to float directly, and prints a warning and returns 0 if the Value is a complex object (list or dict).

cluster: str | None
The key from the Sample metadata corresponding to a cluster identifier for computing clustered standard errors.
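For example, to cluster standard errors by a metadata field (the "subject" key is illustrative):

from inspect_ai.scorer import accuracy, stderr

# use in a @scorer declaration; each sample's metadata["subject"]
# identifies its cluster
metrics = [accuracy(), stderr(cluster="subject")]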
bootstrap_stderr
Standard error of the mean using bootstrap.
@metric
def bootstrap_stderr(
    num_samples: int = 1000, to_float: ValueToFloat = value_to_float()
) -> Metric

num_samples: int
Number of bootstrap samples to take.

to_float: ValueToFloat
Function for mapping Value to float for computing metrics. The default value_to_float() maps CORRECT (“C”) to 1.0, INCORRECT (“I”) to 0, PARTIAL (“P”) to 0.5, and NOANSWER (“N”) to 0, casts numeric values to float directly, and prints a warning and returns 0 if the Value is a complex object (list or dict).
Reducers
at_least
Score correct if there are at least k score values greater than or equal to the value.
@score_reducer
def at_least(
    k: int, value: float = 1.0, value_to_float: ValueToFloat = value_to_float()
) -> ScoreReducer

k: int
Number of score values that must exceed value.

value: float
Score value threshold.

value_to_float: ValueToFloat
Function to convert score values to float.
pass_at
Probability of at least 1 correct sample given k epochs (https://arxiv.org/pdf/2107.03374).
@score_reducer
def pass_at(
    k: int, value: float = 1.0, value_to_float: ValueToFloat = value_to_float()
) -> ScoreReducer

k: int
Epochs to compute probability for.

value: float
Score value threshold.

value_to_float: ValueToFloat
Function to convert score values to float.
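Reducers like at_least and pass_at are typically supplied when running a task over multiple epochs, e.g. (a sketch; the task file is hypothetical):

from inspect_ai import Epochs, eval

# run each sample 10 times; report both mean and pass@2 across epochs
eval("my_task.py", epochs=Epochs(10, ["mean", "pass_at_2"]))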
max_score
Take the maximum value from a list of scores.
@score_reducer(name="max")
def max_score(value_to_float: ValueToFloat = value_to_float()) -> ScoreReducer

value_to_float: ValueToFloat
Function to convert the value to a float.
mean_score
Take the mean of a list of scores.
@score_reducer(name="mean")
def mean_score(value_to_float: ValueToFloat = value_to_float()) -> ScoreReducer

value_to_float: ValueToFloat
Function to convert the value to a float.
median_score
Take the median value from a list of scores.
@score_reducer(name="median")
def median_score(value_to_float: ValueToFloat = value_to_float()) -> ScoreReducer

value_to_float: ValueToFloat
Function to convert the value to a float.
mode_score
Take the mode from a list of scores.
@score_reducer(name="mode")
def mode_score() -> ScoreReducer

Types
Scorer
Score model outputs.
Evaluate the passed outputs and targets and return a dictionary with scoring outcomes and context.
class Scorer(Protocol):
    async def __call__(
        self,
        state: TaskState,
        target: Target,
    ) -> Score | None

Examples
@scorer
def custom_scorer() -> Scorer:
    async def score(state: TaskState, target: Target) -> Score:
        # Compare state / model output with target
        # to yield a score
        return Score(value=...)

    return score

Target
Target for scoring against the current TaskState.
Target is a sequence of one or more strings. Use the text property to access the value as a single string.
class Target(Sequence[str])

Score
Score generated by a scorer.
class Score(BaseModel)

Attributes

value: Value
Score value.

answer: str | None
Answer extracted from model output (optional).

explanation: str | None
Explanation of score (optional).

metadata: dict[str, Any] | None
Additional metadata related to the score.

history: list[ScoreEdit]
Edit history - users can access intermediate states.

text: str
Read the score as text.
Methods

as_str
Read the score as a string.
def as_str(self) -> str

as_int
Read the score as an integer.
def as_int(self) -> int

as_float
Read the score as a float.
def as_float(self) -> float

as_bool
Read the score as a boolean.
def as_bool(self) -> bool

as_list
Read the score as a list.
def as_list(self) -> list[str | int | float | bool]

as_dict
Read the score as a dictionary.
def as_dict(self) -> dict[str, str | int | float | bool | None]
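A sketch of constructing and reading a Score:

from inspect_ai.scorer import CORRECT, Score, value_to_float

score = Score(
    value=CORRECT,  # the constant "C"
    answer="Paris",
    explanation="Answer matched the target exactly.",
)
# metrics use a ValueToFloat mapping; the default maps "C" -> 1.0
assert value_to_float()(score.value) == 1.0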
Value
Value provided by a score.
Use the methods of Score to easily treat the Value as a simple scalar of various types.
Value = Union[
    str | int | float | bool,
    Sequence[str | int | float | bool],
    Mapping[str, str | int | float | bool | None],
]

ScoreReducer
Reduce a set of scores to a single score.
class ScoreReducer(Protocol):
    def __call__(self, scores: list[Score]) -> Score

scores: list[Score]
List of scores.
Metric
Metric protocol.
The Metric signature changed in release v0.3.64. Both the previous and new signatures are supported; you should use MetricProtocol for new code, as the deprecated signature will eventually be removed.
Metric = MetricProtocol | MetricDeprecated

MetricProtocol
Compute a metric on a list of scores.
class MetricProtocol(Protocol):
    def __call__(self, scores: list[SampleScore]) -> Value

scores: list[SampleScore]
List of scores.
Examples
@metric
def mean() -> Metric:
    def metric(scores: list[SampleScore]) -> Value:
        return np.mean([score.score.as_float() for score in scores]).item()

    return metric

SampleScore
Score for a Sample.
class SampleScore(BaseModel)

Attributes

score: Score
A score.

sample_id: str | int | None
A sample id.

sample_metadata: dict[str, Any] | None
Metadata from the sample.

scorer: str | None
Registry name of scorer that created this score.
Methods

sample_metadata_as
Pydantic model interface to sample metadata.
def sample_metadata_as(self, metadata_cls: Type[MT]) -> MT | None

metadata_cls: Type[MT]
Pydantic model type.
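A sketch of reading typed metadata from a SampleScore (QuestionMeta and its fields are hypothetical):

from pydantic import BaseModel

from inspect_ai.scorer import SampleScore

class QuestionMeta(BaseModel):
    # hypothetical fields carried in the sample's metadata
    category: str
    difficulty: int

def category_of(sample_score: SampleScore) -> str | None:
    meta = sample_score.sample_metadata_as(QuestionMeta)
    return meta.category if meta is not None else None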
Decorators
scorer
Decorator for registering scorers.
def scorer(
    metrics: Sequence[Metric | Mapping[str, Sequence[Metric]]]
    | Mapping[str, Sequence[Metric]],
    name: str | None = None,
    **metadata: Any,
) -> Callable[[Callable[P, Scorer]], Callable[P, Scorer]]

metrics: Sequence[Metric | Mapping[str, Sequence[Metric]]] | Mapping[str, Sequence[Metric]]
One or more metrics to calculate over the scores.

name: str | None
Optional name for scorer. If the decorator has no name argument then the name of the underlying ScorerType object will be used to automatically assign a name.

**metadata: Any
Additional values to serialize in metadata.
Examples
@scorer
def custom_scorer() -> Scorer:
    async def score(state: TaskState, target: Target) -> Score:
        # Compare state / model output with target
        # to yield a score
        return Score(value=...)

    return score

metric
Decorator for registering metrics.
def metric(
    name: str | Callable[P, Metric],
) -> Callable[[Callable[P, Metric]], Callable[P, Metric]] | Callable[P, Metric]

name: str | Callable[P, Metric]
Optional name for metric. If the decorator has no name argument then the name of the underlying MetricType will be used to automatically assign a name.
Examples
@metric
def mean() -> Metric:
    def metric(scores: list[SampleScore]) -> Value:
        return np.mean([score.score.as_float() for score in scores]).item()

    return metric
score_reducer
Decorator for registering Score Reducers.
def score_reducer(
    func: ScoreReducerType | None = None, *, name: str | None = None
) -> Callable[[ScoreReducerType], ScoreReducerType] | ScoreReducerType

func: ScoreReducerType | None
Function returning a ScoreReducer targeted by the plain decorator without attributes (e.g. @score_reducer).

name: str | None
Optional name for reducer. If the decorator has no name argument then the name of the function will be used to automatically assign a name.
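A sketch of a custom reducer registered with an explicit name (the reducer itself is illustrative):

from inspect_ai.scorer import Score, ScoreReducer, score_reducer, value_to_float

@score_reducer(name="best")
def best_score() -> ScoreReducer:
    to_float = value_to_float()

    def reduce(scores: list[Score]) -> Score:
        # keep the highest-valued score across epochs
        return max(scores, key=lambda s: to_float(s.value))

    return reduce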
Intermediate Scoring
score
Score a model conversation.
Score a model conversation (you may pass TaskState or AgentState as the value for conversation).
async def score(conversation: ModelConversation) -> list[Score]

conversation: ModelConversation
Conversation to submit for scoring. Note that both TaskState and AgentState can be passed as the conversation parameter.
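For example, a solver might use intermediate scoring to retry until the answer is correct (a sketch; the solver and its retry logic are illustrative, and it assumes the task's scorer produces "C"/"I" string values):

from inspect_ai.scorer import score
from inspect_ai.solver import Generate, Solver, TaskState, solver

@solver
def generate_until_correct(max_attempts: int = 3) -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        for _ in range(max_attempts):
            state = await generate(state)
            scores = await score(state)  # intermediate scoring
            if any(s.as_str() == "C" for s in scores):
                break
        return state

    return solve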