inspect_ai.solver

Prompting and elicitation.

Generation

generate

Generate output from the model and append it to task message history.

generate() is the default solver if none is specified for a given task.

def generate(
    tool_calls: Literal['loop', 'single', 'none'] = ...,
    *,
    max_retries: int | None = ...,
    timeout: int | None = ...,
    attempt_timeout: int | None = ...,
    max_connections: int | None = ...,
    system_message: str | None = ...,
    max_tokens: int | None = ...,
    top_p: float | None = ...,
    temperature: float | None = ...,
    stop_seqs: list[str] | None = ...,
    best_of: int | None = ...,
    frequency_penalty: float | None = ...,
    presence_penalty: float | None = ...,
    logit_bias: dict[int, float] | None = ...,
    seed: int | None = ...,
    top_k: int | None = ...,
    num_choices: int | None = ...,
    logprobs: bool | None = ...,
    top_logprobs: int | None = ...,
    prompt_logprobs: int | None = ...,
    parallel_tool_calls: bool | None = ...,
    internal_tools: bool | None = ...,
    max_tool_output: int | None = ...,
    cache_prompt: Literal['auto'] | bool | None = ...,
    verbosity: Literal['low', 'medium', 'high'] | None = ...,
    effort: Literal['low', 'medium', 'high', 'xhigh', 'max'] | None = ...,
    reasoning_effort: Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh', 'max'] | None = ...,
    reasoning_tokens: int | None = ...,
    reasoning_summary: Literal['none', 'concise', 'detailed', 'auto'] | None = ...,
    reasoning_history: Literal['none', 'all', 'last', 'auto'] | None = ...,
    response_schema: ResponseSchema | None = ...,
    extra_headers: dict[str, str] | None = ...,
    extra_body: dict[str, Any] | None = ...,
    modalities: list[OutputModality] | None = ...,
    cache: bool | CachePolicy | None = ...,
    batch: bool | int | BatchConfig | None = ...,
) -> Solver
tool_calls Literal['loop', 'single', 'none']

Resolve tool calls:
  • "loop" resolves tool calls and then invokes generate(), proceeding in a loop which terminates when there are no more tool calls, or message_limit or token_limit is exceeded. This is the default behavior.
  • "single" resolves at most a single set of tool calls and then returns.
  • "none" does not resolve tool calls at all (in this case you will need to invoke call_tools() directly).

max_retries int | None

Maximum number of times to retry request (defaults to unlimited).

timeout int | None

Request timeout (in seconds).

attempt_timeout int | None

Timeout (in seconds) for any given attempt (if exceeded, will abandon attempt and retry according to max_retries).

max_connections int | None

Maximum number of concurrent connections to Model API (default is model specific).

system_message str | None

Override the default system message.

max_tokens int | None

The maximum number of tokens that can be generated in the completion (default is model specific).

top_p float | None

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.

temperature float | None

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

stop_seqs list[str] | None

Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

best_of int | None

Generates best_of completions server-side and returns the ‘best’ (the one with the highest log probability per token). vLLM only.

frequency_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, and vLLM only.

presence_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. OpenAI, Google, Grok, Groq, and vLLM only.

logit_bias dict[int, float] | None

Map token IDs to an associated bias value from -100 to 100 (e.g. “42=10,43=-10”). OpenAI and Grok only.

seed int | None

Random seed. OpenAI, Google, Mistral, Groq, HuggingFace, and vLLM only.

top_k int | None

Randomly sample the next word from the top_k most likely next words. Anthropic, Google, and HuggingFace only.

num_choices int | None

How many chat completion choices to generate for each input message. OpenAI, Grok, Google, and TogetherAI only.

logprobs bool | None

Return log probabilities of the output tokens. OpenAI, Google, Grok, TogetherAI, Huggingface, llama-cpp-python, and vLLM only.

top_logprobs int | None

Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Google, Grok, and Huggingface only.

prompt_logprobs int | None

Number of log probabilities to return per prompt token (1-20). When greater than 1, top-N alternative tokens are also returned. vLLM only.

parallel_tool_calls bool | None

Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only.

internal_tools bool | None

Whether to automatically map tools to model internal implementations (e.g. ‘computer’ for anthropic).

max_tool_output int | None

Maximum tool output (in bytes). Defaults to 16 * 1024.

cache_prompt Literal['auto'] | bool | None

Whether to cache the prompt prefix. Enabled by default. Set to False to disable. Anthropic only.

verbosity Literal['low', 'medium', 'high'] | None

Constrains the verbosity of the model’s response. Lower values will result in more concise responses, while higher values will result in more verbose responses. GPT 5.x models only (defaults to “medium” for OpenAI models).

effort Literal['low', 'medium', 'high', 'xhigh', 'max'] | None

Control how many tokens are used for a response, trading off between response thoroughness and token efficiency. Anthropic Claude Opus 4.5+ only (max only supported on 4.6 and 4.7, xhigh supported only on 4.7).

reasoning_effort Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh', 'max'] | None

Constrains effort on reasoning. Defaults vary by provider and model and not all models support all values (please consult provider documentation for details).

reasoning_tokens int | None

Maximum number of tokens to use for reasoning. Anthropic Claude models only.

reasoning_summary Literal['none', 'concise', 'detailed', 'auto'] | None

Provide summary of reasoning steps (OpenAI reasoning models only). Use ‘auto’ to access the most detailed summarizer available for the current model (defaults to ‘auto’ if your organization is verified by OpenAI).

reasoning_history Literal['none', 'all', 'last', 'auto'] | None

Include reasoning in chat message history sent to generate.

response_schema ResponseSchema | None

Request a response format as JSONSchema (output should still be validated). OpenAI, Google, and Mistral only.

extra_headers dict[str, str] | None

Extra headers to be sent with requests. Not supported for AzureAI, Bedrock, and Grok.

extra_body dict[str, Any] | None

Extra body to be sent with requests to OpenAI compatible servers. OpenAI, vLLM, and SGLang only.

modalities list[OutputModality] | None

Additional output modalities to enable beyond text (e.g. [“image”]). OpenAI and Google only.

cache bool | CachePolicy | None

Policy for caching of model generations.

batch bool | int | BatchConfig | None

Use batching API when available. True to enable batching with default configuration, False to disable batching, a number to enable batching of the specified batch size, or a BatchConfig object specifying the batching configuration.
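
For orientation, here is a minimal sketch of composing generate() into a task with a few of the config overrides described above; the sample, model settings, and scorer are illustrative only:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message

@task
def arithmetic():
    # generate() is the final step; temperature and max_tokens override
    # the model defaults for this task only
    return Task(
        dataset=[Sample(input="What is 2 + 2?", target="4")],
        solver=[
            system_message("Answer with just the number."),
            generate(temperature=0.0, max_tokens=64),
        ],
        scorer=match(),
    )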

use_tools

Inject tools into the task state to be used in generate().

@solver
def use_tools(
    *tools: Tool | ToolDef | ToolSource | Sequence[Tool | ToolDef | ToolSource],
    tool_choice: ToolChoice | None = "auto",
    append: bool = False,
) -> Solver
*tools Tool | ToolDef | ToolSource | Sequence[Tool | ToolDef | ToolSource]

One or more tools or lists of tools to make available to the model. If no tools are passed, then no change to the currently available set of tools is made.

tool_choice ToolChoice | None

Directive indicating which tools the model should use. If None is passed, then no change to tool_choice is made.

append bool

If True, then the passed-in tools are appended to the existing tools; otherwise any existing tools are replaced (the default)
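
A sketch of pairing use_tools() with generate(); the add() tool below is hypothetical and included only to illustrate how tools are injected and then resolved by the generation loop:

from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import tool

@tool
def add():
    async def execute(x: int, y: int) -> int:
        """Add two numbers.

        Args:
            x: First number.
            y: Second number.
        """
        return x + y

    return execute

# make the tool available, then let generate() resolve tool calls in a loop
solver = [use_tools(add()), generate(tool_calls="loop")]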

Prompting

prompt_template

Parameterized prompt template.

Prompt template containing a {prompt} placeholder and any number of additional params. All values contained in sample metadata and store are also automatically included in the params.

@solver
def prompt_template(template: str, **params: Any) -> Solver
template str

Template for prompt.

**params Any

Parameters to fill into the template.
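
For illustration, a sketch of a template that wraps the existing prompt; {prompt} is filled with the current user prompt and persona is a hypothetical extra parameter:

from inspect_ai.solver import prompt_template

wrapped = prompt_template(
    "Answer the following question as a {persona}.\n\n{prompt}",
    persona="patient math tutor",
)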

system_message

Solver which inserts a system message into the conversation.

System message template containing any number of optional params for substitution using the str.format() method. All values contained in sample metadata and store are also automatically included in the params.

The new message will go after other system messages (if there are none it will be inserted at the beginning of the conversation).

@solver
def system_message(template: str, **params: Any) -> Solver
template str

Template for system message.

**params Any

Parameters to fill into the template.
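
A minimal sketch; the {difficulty} placeholder and its value are illustrative and could equally be supplied from sample metadata:

from inspect_ai.solver import system_message

sys = system_message(
    "You are a careful assistant. This task is rated {difficulty}.",
    difficulty="hard",
)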

user_message

Solver which inserts a user message into the conversation.

User message template containing any number of optional params for substitution using the str.format() method. All values contained in sample metadata and store are also automatically included in the params.

@solver
def user_message(template: str, **params: Any) -> Solver
template str

Template for user message.

**params Any

Parameters to fill into the template.

assistant_message

Solver which inserts an assistant message into the conversation.

Assistant message template containing any number of optional params for substitution using the str.format() method. All values contained in sample metadata and store are also automatically included in the params.

@solver
def assistant_message(template: str, **params: Any) -> Solver
template str

Template for assistant message.

**params Any

Parameters to fill into the template.
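
A sketch combining the two message solvers above; the template parameters are illustrative:

from inspect_ai.solver import assistant_message, user_message

follow_up = [
    # append a user turn asking for an explanation
    user_message("Now explain your answer to {audience}.", audience="a beginner"),
    # pre-fill the start of the assistant's reply
    assistant_message("Sure, step by step:"),
]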

chain_of_thought

Solver which modifies the user prompt to encourage chain of thought.

@solver
def chain_of_thought(template: str = DEFAULT_COT_TEMPLATE) -> Solver
template str

String or path to file containing CoT template. The template uses a single variable: prompt.
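
For example, a sketch using the default template and a custom template (a custom template must include the {prompt} variable):

from inspect_ai.solver import chain_of_thought, generate

# default chain of thought prompt
solver = [chain_of_thought(), generate()]

# custom template referencing {prompt}
solver = [
    chain_of_thought("{prompt}\n\nThink through the problem step by step before answering."),
    generate(),
]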

self_critique

Solver which uses a model to critique the original answer.

The critique_template is used to generate a critique and the completion_template is used to play that critique back to the model for an improved response. Note that you can specify an alternate model for critique (you don’t need to use the model being evaluated).

@solver
def self_critique(
    critique_template: str | None = None,
    completion_template: str | None = None,
    model: str | Model | None = None,
) -> Solver
critique_template str | None

String or path to file containing critique template. The template uses two variables: question and completion. Variables from sample metadata are also available in the template.

completion_template str | None

String or path to file containing completion template. The template uses three variables: question, completion, and critique.

model str | Model | None

Alternate model to be used for critique (by default the model being evaluated is used).
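
A sketch of appending self-critique after generation; the critique model name is illustrative:

from inspect_ai.solver import generate, self_critique

solver = [
    generate(),
    # critique and revise the answer using an alternate (hypothetical) model
    self_critique(model="openai/gpt-4o-mini"),
]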

multiple_choice

Multiple choice question solver. Formats a multiple choice question prompt, then calls generate().

Note that due to the way this solver works, it has some constraints:

  1. The Sample must have the choices attribute set.
  2. The only built-in compatible scorer is the choice scorer.
  3. It calls generate() internally, so you don’t need to call it again.
def multiple_choice(
    *,
    template: str | None = ...,
    cot: bool = ...,
    multiple_correct: bool = ...,
    max_tokens: int | None = ...,
    shuffle: bool | Random = ...,
) -> Solver
template str | None

Template to use for the multiple choice question. The defaults vary based on the options and are taken from the MultipleChoiceTemplate enum. The template will have questions and possible answers substituted into it before being sent to the model. Consequently it uses the following template variables:

  • {question}: The question to be asked.
  • {choices}: The choices available, which will be formatted as a list of A) … B) … etc. before sending to the model.
  • {letters}: (optional) A string of letters representing the choices, e.g. “A,B,C”. Used to be explicit to the model about the possible answers.
cot bool

Default False. Whether the solver should perform chain-of-thought reasoning before answering. NOTE: this has no effect if you provide a custom template.

multiple_correct bool

Default False. Whether to allow multiple answers to the multiple choice question. For example, “What numbers are squares? A) 3, B) 4, C) 9” has multiple correct answers, B and C. Leave as False if there’s exactly one correct answer from the choices available. NOTE: this has no effect if you provide a custom template.

max_tokens int | None

Default None. Controls the number of tokens generated through the call to generate().

shuffle bool | Random

Default False. Whether to shuffle the order of the presented choices. Pass a Random instance to control the shuffling (e.g. for reproducibility).
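
A sketch of using this solver with a sample that defines choices and the choice scorer; the data is illustrative:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice

@task
def primes():
    return Task(
        dataset=[
            Sample(
                input="Which of these is a prime number?",
                choices=["4", "6", "7", "9"],
                target="C",
            )
        ],
        solver=multiple_choice(shuffle=True),
        scorer=choice(),
    )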

Composition

chain

Compose a solver from multiple other solvers and/or agents.

Solvers are executed in turn, and a solver step event is added to the transcript for each. If a solver returns a state with completed=True, the chain is terminated early.

@solver
def chain(
    *solvers: Solver | Agent | list[Solver] | list[Solver | Agent],
) -> Solver
*solvers Solver | Agent | list[Solver] | list[Solver | Agent]

One or more solvers or agents to chain together.
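
A sketch of composing several of the solvers above into a single solver:

from inspect_ai.solver import chain, chain_of_thought, generate, system_message

reasoning_solver = chain(
    system_message("You are a careful reasoner."),
    chain_of_thought(),
    generate(),
)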

fork

Fork the TaskState and evaluate it against multiple solvers in parallel.

Run several solvers against independent copies of a TaskState. Each Solver gets its own copy of the TaskState and is run (in parallel) in an independent Subtask (meaning that it also has its own independent Store that doesn’t affect the Store of other subtasks or the parent).

async def fork(
    state: TaskState, solvers: Solver | list[Solver]
) -> TaskState | list[TaskState]
state TaskState

Beginning TaskState

solvers Solver | list[Solver]

Solvers to apply on the TaskState. Each Solver will get a standalone copy of the TaskState.
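
A sketch of forking a state across two generate() configurations inside a custom solver; keeping the first result is purely illustrative (a real solver would compare the forked states):

from inspect_ai.solver import Generate, Solver, TaskState, fork, generate, solver

@solver
def best_of_two() -> Solver:
    # the Generate parameter is renamed to avoid shadowing the generate() solver
    async def solve(state: TaskState, generate_fn: Generate) -> TaskState:
        results = await fork(
            state,
            [generate(temperature=0.0), generate(temperature=1.0)],
        )
        # keep the first forked result
        return results[0]

    return solve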

Types

Solver

Contribute to solving an evaluation task.

Transform a TaskState, returning the new state. Solvers may optionally call the generate() function to create a new state resulting from model generation. Solvers may also do prompt engineering or other types of elicitation.

class Solver(Protocol):
    async def __call__(
        self,
        state: TaskState,
        generate: Generate,
    ) -> TaskState
state TaskState

State for tasks being evaluated.

generate Generate

Function for generating outputs.

Examples

@solver
def prompt_cot(template: str) -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # insert chain of thought prompt (template assumed to contain {prompt})
        state.user_prompt.text = template.format(prompt=state.user_prompt.text)
        return state

    return solve

SolverSpec

Solver specification used to (re-)create solvers.

@dataclass(frozen=True)
class SolverSpec

Attributes

solver str

Solver name (simple name or file.py@name).

args dict[str, Any]

Solver arguments.

args_passed dict[str, Any]

Solver arguments passed for invocation.
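
A minimal sketch of constructing a spec for a registered solver (assuming the name shown and that any remaining fields have defaults):

from inspect_ai.solver import SolverSpec

# name a registered solver and the arguments used to (re-)create it
spec = SolverSpec(solver="chain_of_thought", args={})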

TaskState

The TaskState represents the internal state of the Task being run for a single Sample.

The TaskState is passed to and returned from each solver during a sample’s evaluation. It allows us to maintain the manipulated message history, the tools available to the model, the final output of the model, and whether the task is completed or has hit a limit.

class TaskState

Attributes

model ModelName

Name of model being evaluated.

sample_id int | str

Unique id for sample.

epoch int

Epoch number for sample.

input str | list[ChatMessage]

Input from the Sample, should be considered immutable.

input_text str

Convenience function for accessing the initial input from the Sample as a string.

If the input is a list[ChatMessage], this will return the text from the last chat message

user_prompt ChatMessageUser

User prompt for this state.

Tasks are very general and can have many types of inputs. However, in many cases solvers assume they can interact with the state as a “chat” in a predictable fashion (e.g. prompt engineering solvers). This property enables easy read and write access to the user chat prompt. Raises an exception if there is no user prompt.

metadata dict[str, Any]

Metadata from the Sample for this TaskState

messages list[ChatMessage]

Chat conversation history for sample.

This will generally get appended to every time a generate call is made to the model. Useful for both debug and for solvers/scorers to assess model performance or choose the next step.

output ModelOutput

The ‘final’ model output once we’ve completed all solving.

For simple evals this may just be the last message from the conversation history, but more complex solvers may set this directly.

store Store

Store for shared data

tools list[Tool]

Tools available to the model.

tool_choice ToolChoice | None

Tool choice directive.

message_limit int | None

Limit on total messages allowed per conversation.

token_limit int | None

Limit on total tokens allowed per conversation.

token_usage int

Total tokens used for the current sample.

cost_limit float | None

Limit on total cost (in dollars) allowed per sample.

cost_usage float

Total cost (in dollars) used for the current sample.

completed bool

Is the task completed.

Additionally, checks for an operator interrupt of the sample.

target Target

The scoring target for this Sample.

scores dict[str, Score] | None

Scores yielded by running task.

uuid str

Globally unique identifier for sample run.

Methods

metadata_as

Pydantic model interface to metadata.

def metadata_as(self, metadata_cls: Type[MT]) -> MT
metadata_cls Type[MT]

Pydantic model type

store_as

Pydantic model interface to the store.

def store_as(self, model_cls: Type[SMT], instance: str | None = None) -> SMT
model_cls Type[SMT]

Pydantic model type (must derive from StoreModel)

instance str | None

Optional instance name for store (enables multiple instances of a given StoreModel type within a single sample)
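
A sketch of using these typed interfaces from within a solver; SampleInfo and Progress are hypothetical models defined for illustration:

from pydantic import BaseModel

from inspect_ai.solver import Generate, Solver, TaskState, solver
from inspect_ai.util import StoreModel

class SampleInfo(BaseModel):
    difficulty: str = "easy"

class Progress(StoreModel):
    attempts: int = 0

@solver
def track_attempts() -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        info = state.metadata_as(SampleInfo)  # typed view of sample metadata
        progress = state.store_as(Progress)   # typed (and persisted) view of the store
        progress.attempts += 1
        return await generate(state)

    return solve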

Generate

Generate using the model and add the assistant message to the task state.

class Generate(Protocol):
def __call__(
    self,
    state: TaskState,
    tool_calls: Literal['loop', 'single', 'none'] = ...,
    *,
    max_retries: int | None = ...,
    timeout: int | None = ...,
    attempt_timeout: int | None = ...,
    max_connections: int | None = ...,
    system_message: str | None = ...,
    max_tokens: int | None = ...,
    top_p: float | None = ...,
    temperature: float | None = ...,
    stop_seqs: list[str] | None = ...,
    best_of: int | None = ...,
    frequency_penalty: float | None = ...,
    presence_penalty: float | None = ...,
    logit_bias: dict[int, float] | None = ...,
    seed: int | None = ...,
    top_k: int | None = ...,
    num_choices: int | None = ...,
    logprobs: bool | None = ...,
    top_logprobs: int | None = ...,
    prompt_logprobs: int | None = ...,
    parallel_tool_calls: bool | None = ...,
    internal_tools: bool | None = ...,
    max_tool_output: int | None = ...,
    cache_prompt: Literal['auto'] | bool | None = ...,
    verbosity: Literal['low', 'medium', 'high'] | None = ...,
    effort: Literal['low', 'medium', 'high', 'xhigh', 'max'] | None = ...,
    reasoning_effort: Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh', 'max'] | None = ...,
    reasoning_tokens: int | None = ...,
    reasoning_summary: Literal['none', 'concise', 'detailed', 'auto'] | None = ...,
    reasoning_history: Literal['none', 'all', 'last', 'auto'] | None = ...,
    response_schema: ResponseSchema | None = ...,
    extra_headers: dict[str, str] | None = ...,
    extra_body: dict[str, Any] | None = ...,
    modalities: list[OutputModality] | None = ...,
    cache: bool | CachePolicy | None = ...,
    batch: bool | int | BatchConfig | None = ...,
) -> TaskState
state TaskState

Beginning task state.

tool_calls Literal['loop', 'single', 'none']
  • "loop" resolves tools calls and then invokes generate(), proceeding in a loop which terminates when there are no more tool calls, or message_limit or token_limit is exceeded. This is the default behavior.
  • "single" resolves at most a single set of tool calls and then returns.
  • "none" does not resolve tool calls at all (in this case you will need to invoke call_tools() directly).
max_retries int | None

Maximum number of times to retry request (defaults to unlimited).

timeout int | None

Request timeout (in seconds).

attempt_timeout int | None

Timeout (in seconds) for any given attempt (if exceeded, will abandon attempt and retry according to max_retries).

max_connections int | None

Maximum number of concurrent connections to Model API (default is model specific).

system_message str | None

Override the default system message.

max_tokens int | None

The maximum number of tokens that can be generated in the completion (default is model specific).

top_p float | None

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.

temperature float | None

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

stop_seqs list[str] | None

Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

best_of int | None

Generates best_of completions server-side and returns the ‘best’ (the one with the highest log probability per token). vLLM only.

frequency_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, and vLLM only.

presence_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. OpenAI, Google, Grok, Groq, and vLLM only.

logit_bias dict[int, float] | None

Map token IDs to an associated bias value from -100 to 100 (e.g. “42=10,43=-10”). OpenAI and Grok only.

seed int | None

Random seed. OpenAI, Google, Mistral, Groq, HuggingFace, and vLLM only.

top_k int | None

Randomly sample the next word from the top_k most likely next words. Anthropic, Google, and HuggingFace only.

num_choices int | None

How many chat completion choices to generate for each input message. OpenAI, Grok, Google, and TogetherAI only.

logprobs bool | None

Return log probabilities of the output tokens. OpenAI, Google, Grok, TogetherAI, Huggingface, llama-cpp-python, and vLLM only.

top_logprobs int | None

Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Google, Grok, and Huggingface only.

prompt_logprobs int | None

Number of log probabilities to return per prompt token (1-20). When greater than 1, top-N alternative tokens are also returned. vLLM only.

parallel_tool_calls bool | None

Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only.

internal_tools bool | None

Whether to automatically map tools to model internal implementations (e.g. ‘computer’ for anthropic).

max_tool_output int | None

Maximum tool output (in bytes). Defaults to 16 * 1024.

cache_prompt Literal['auto'] | bool | None

Whether to cache the prompt prefix. Enabled by default. Set to False to disable. Anthropic only.

verbosity Literal['low', 'medium', 'high'] | None

Constrains the verbosity of the model’s response. Lower values will result in more concise responses, while higher values will result in more verbose responses. GPT 5.x models only (defaults to “medium” for OpenAI models).

effort Literal['low', 'medium', 'high', 'xhigh', 'max'] | None

Control how many tokens are used for a response, trading off between response thoroughness and token efficiency. Anthropic Claude Opus 4.5+ only (max only supported on 4.6 and 4.7, xhigh supported only on 4.7).

reasoning_effort Literal['none', 'minimal', 'low', 'medium', 'high', 'xhigh', 'max'] | None

Constrains effort on reasoning. Defaults vary by provider and model and not all models support all values (please consult provider documentation for details).

reasoning_tokens int | None

Maximum number of tokens to use for reasoning. Anthropic Claude models only.

reasoning_summary Literal['none', 'concise', 'detailed', 'auto'] | None

Provide summary of reasoning steps (OpenAI reasoning models only). Use ‘auto’ to access the most detailed summarizer available for the current model (defaults to ‘auto’ if your organization is verified by OpenAI).

reasoning_history Literal['none', 'all', 'last', 'auto'] | None

Include reasoning in chat message history sent to generate.

response_schema ResponseSchema | None

Request a response format as JSONSchema (output should still be validated). OpenAI, Google, and Mistral only.

extra_headers dict[str, str] | None

Extra headers to be sent with requests. Not supported for AzureAI, Bedrock, and Grok.

extra_body dict[str, Any] | None

Extra body to be sent with requests to OpenAI compatible servers. OpenAI, vLLM, and SGLang only.

modalities list[OutputModality] | None

Additional output modalities to enable beyond text (e.g. [“image”]). OpenAI and Google only.

cache bool | CachePolicy | None

Policy for caching of model generations.

batch bool | int | BatchConfig | None

Use batching API when available. True to enable batching with default configuration, False to disable batching, a number to enable batching of the specified batch size, or a BatchConfig object specifying the batching configuration.
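
A sketch of calling the injected generate function directly from a solver, overriding a couple of the config options above for this call only:

from inspect_ai.solver import Generate, Solver, TaskState, solver

@solver
def generate_once() -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # single generation with no tool call resolution and a capped output
        return await generate(state, tool_calls="none", max_tokens=256)

    return solve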

Decorators

solver

Decorator for registering solvers.

def solver(
    name: str | Callable[P, SolverType],
) -> Callable[[Callable[P, Solver]], Callable[P, Solver]] | Callable[P, Solver]
name str | Callable[P, SolverType]

Optional name for solver. If the decorator has no name argument then the name of the underlying Callable[P, SolverType] object will be used to automatically assign a name.

Examples

@solver
def prompt_cot(template: str) -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # insert chain of thought prompt (template assumed to contain {prompt})
        state.user_prompt.text = template.format(prompt=state.user_prompt.text)
        return state

    return solve