Using Models

Overview

Inspect has support for a wide variety of language model APIs and can be extended to support arbitrary additional ones. Support for the following providers is built into Inspect:

Lab APIs: OpenAI, Anthropic, Google, Grok, Mistral, DeepSeek, Perplexity
Cloud APIs: AWS Bedrock, Azure AI, Vertex AI
Open (Hosted): Groq, Together AI, Cloudflare
Open (Local): Hugging Face, vLLM, Ollama, llama-cpp-python, SGLang


If the provider you are using is not listed above, you may still be able to use it if:

  1. It provides an OpenAI-compatible API endpoint. In this scenario, use the Inspect OpenAI Compatible API interface.

  2. It is available via OpenRouter (see the docs on using OpenRouter with Inspect).

You can also create Model API Extensions to add model providers using their native interface.

Below we’ll describe various ways to specify and provide options to models in Inspect evaluations. Review this first, then see the provider-specific sections for additional usage details and available options.

Selecting a Model

To select a model for an evaluation, pass its name on the command line or use the model argument of the eval() function:

inspect eval arc.py --model openai/gpt-4o-mini
inspect eval arc.py --model anthropic/claude-3-5-sonnet-latest

Or:

eval("arc.py", model="openai/gpt-4o-mini")
eval("arc.py", model="anthropic/claude-3-5-sonnet-latest")

Alternatively, you can set the INSPECT_EVAL_MODEL environment variable (either in the shell or a .env file) to select a model externally:

INSPECT_EVAL_MODEL=google/gemini-1.5-pro
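
With the variable set, evaluations will use that model without --model being passed explicitly, for example:

inspect eval arc.py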

No Model

Some evaluations will either not make use of models or call the lower-level get_model() function to explicitly access models for different roles (see the Model API section below for details on this).

In these cases, you are not required to specify a --model. If you happen to have an INSPECT_EVAL_MODEL defined and you want to prevent your evaluation from using it, you can explicitly specify no model as follows:

inspect eval arc.py --model none

Or from Python:

eval("arc.py", model=None)

Generation Config

A variety of configuration options affect the behaviour of model generation. Some affect the generated tokens (temperature, top_p, etc.), while others govern the connection to model providers (timeout, max_retries, etc.).

You can specify generation options either on the command line or in direct calls to eval(). For example:

inspect eval arc.py --model openai/gpt-4 --temperature 0.9
inspect eval arc.py --model google/gemini-1.5-pro --max-connections 20

Or:

eval("arc.py", model="openai/gpt-4", temperature=0.9)
eval("arc.py", model="google/gemini-1.5-pro", max_connections=20)

Use inspect eval --help to learn about all of the available generation config options.

Model Args

If there is an additional aspect of a model you want to tweak that isn’t covered by the GenerateConfig, you can use model args to pass additional arguments to model clients. For example, here we specify the location option for a Google Gemini model:

inspect eval arc.py --model google/gemini-1.5-pro -M location=us-east5
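
From Python, the equivalent is to pass a model_args dictionary to eval() (a sketch; the location value is the same illustrative example as above):

eval(
    "arc.py",
    model="google/gemini-1.5-pro",
    model_args={"location": "us-east5"}
)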

See the documentation for the requisite model provider for information on how model args are passed through to model clients.

Max Connections

Inspect uses an asynchronous architecture to run task samples in parallel. If your model provider can handle 100 concurrent connections, then Inspect can utilise all of those connections to get the highest possible throughput. The limiting factor on parallelism is therefore not typically local parallelism (e.g. number of cores) but rather what the underlying rate limit is for your interface to the provider.

By default, Inspect uses a max_connections value of 10. You can increase this value consistent with your account limits. If you are experiencing rate-limit errors, you will need to experiment with the max_connections option to find the optimal value that keeps you under the rate limit (the section on Parallelism includes additional documentation on how to do this).

Model API

The --model which is set for an evaluation is automatically used by the generate() solver, as well as for other solvers and scorers built to use the currently evaluated model. If you are implementing a Solver or Scorer and want to use the currently evaluated model, call get_model() with no arguments:

from inspect_ai.model import get_model

model = get_model()
response = await model.generate("Say hello")
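
The result of generate() is a model output object; for example, you can read the generated text from its completion property:

print(response.completion)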

If you want to use other models in your solvers and scorers, call get_model() with an alternate model name, along with optional generation config. For example:

model = get_model("openai/gpt-4o")

model = get_model(
    "openai/gpt-4o",
    config=GenerateConfig(temperature=0.9)
)

You can also pass provider-specific parameters as additional arguments to get_model(). For example:

model = get_model("hf/openai-community/gpt2", device="cuda:0")

Model Caching

By default, calls to get_model() are memoized, meaning that calls with identical parameters resolve to a cached version of the model. You can disable this by passing memoize=False:

model = get_model("openai/gpt-4o", memoize=False)

Finally, if you prefer to create and fully close model clients at their place of use, you can use the async context manager built into the Model class. For example:

async with get_model("openai/gpt-4o") as model:
    eval(mytask(), model=model)

If you are not in an async context there is also a sync context manager available:

with get_model("hf/Qwen/Qwen2.5-72B") as model:
    eval(mytask(), model=model)

Note though that this won’t work with model providers that require an async close operation (OpenAI, Anthropic, Grok, Together, Groq, Ollama, llama-cpp-python, and Cloudflare).

Model Roles

Model roles enable you to create aliases for the various models used in your tasks, and then dynamically vary those roles when running an evaluation. For example, you might have a “critic” or “monitor” role, or perhaps “red_team” and “blue_team” roles. Roles are included in the log and displayed in model events within the transcript.

Here is a scorer that utilises a “grader” role when binding to a model:

from inspect_ai.model import get_model
from inspect_ai.scorer import Scorer, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def model_grader() -> Scorer:
    async def score(state: TaskState, target: Target):
        model = get_model(role="grader")
        ...
    return score

By default, if there is no “grader” role specified, the default model for the evaluation will be returned. Model roles can be specified when using inspect eval or calling the eval() function:

inspect eval math.py --model-role grader=google/gemini-2.0-flash

Or with eval():

eval("math.py", model_roles = { "grader": "google/gemini-2.0-flash" })

Role Defaults

By default, if a role is not explicitly defined, then get_model(role="...") will return the default model for the evaluation. You can specify an alternate default model as follows:

model = get_model(role="grader", default="openai/gpt-4o")

This means that you can use model roles as a means of external configurability even if you aren’t yet explicitly taking advantage of them.
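
For example, the grader above will fall back to openai/gpt-4o unless a grader role is provided externally:

inspect eval math.py --model-role grader=anthropic/claude-3-5-sonnet-latest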

Roles for Tasks

In some cases it may not be convenient to specify model_roles in the top-level call to eval(). For example, you might be running an Eval Set to explore the behaviour of different models for a given role. In this case, do not specify model_roles at the eval level; rather, specify them at the task level.

For example, imagine we have a task named blues_clues that we want to vary the red and blue teams for in an eval set:

from inspect_ai import eval_set, task_with
from ctf_tasks import blues_clues 

tasks = [
    task_with(blues_clues(), model_roles = {
        "red_team": "openai/gpt-4o",
        "blue_team": "google/gemini-2.0-flash"
    }),
    task_with(blues_clues(), model_roles = {
        "red_team": "google/gemini-2.0-flash",
        "blue_team": "openai/gpt-4o"
    })
]

eval_set(tasks, log_dir="...")

Note that we also don’t specify a model for this eval (it doesn’t have a main model but rather just the red and blue team roles).

As illustrated above, you can define as many named roles as you need. When using eval() or Task, roles are specified using a dictionary. When using inspect eval, you can include multiple --model-role options on the command line:

inspect eval math.py \
   --model-role red_team=google/gemini-2.0-flash \
   --model-role blue_team=openai/gpt-4o-mini

Batch Processing

Note

Batch processing is available only in the development version of Inspect. To install the development version from GitHub:

pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

Inspect supports calling the batch processing APIs for OpenAI and Anthropic models. Batch processing has lower token costs (typically 50% of normal costs) and higher rate limits, but also substantially longer processing times (batched generations typically complete within an hour but can take longer).

When batch processing is enabled, individual model requests are automatically collected and sent as batches to the provider’s batch API rather than making individual API calls.

Important

When considering whether to use batch processing for an evaluation, you should assess whether your usage pattern is a good fit for batch APIs. Generally evaluations that have a small number of sequential generations (e.g. a QA eval with a model scorer) are a good fit, as these will often complete in a small number of batches without taking many hours.

On the other hand, evaluations with a large and/or variable number of generations (e.g. agentic tasks) can often take many hours or days due to both the large number of batches that must be waited on and the path dependency created between requests in a batch.

Enabling Batch Mode

Pass the --batch CLI option or batch=True to eval() in order to enable batch processing for providers that support it. The --batch option supports several formats:

# Enable batching with default configuration
inspect eval arc.py --model openai/gpt-4o --batch

# Specify a batch size (e.g. 1000 requests per batch)
inspect eval arc.py --model openai/gpt-4o --batch 1000

# Pass a YAML or JSON config file with batch configuration
inspect eval arc.py --model openai/gpt-4o --batch batch.yml

Or from Python:

eval("arc.py", model="openai/gpt-4o", batch=True)
eval("arc.py", model="openai/gpt-4o", batch=1000)

If a provider does not support batch processing, the batch option is ignored for that provider.

Batch Configuration

For more advanced batch processing configuration, you can specify a BatchConfig object in Python or pass a YAML/JSON config file via the --batch option. For example:

from inspect_ai.model import BatchConfig
eval(
    "arc.py", model="openai/gpt-4o", 
    batch=BatchConfig(size=200, send_delay=60)
)

Available BatchConfig options include:

size: Target number of requests to include in each batch. If not specified, uses provider-specific defaults (OpenAI: 100, Anthropic: 100). Batches may be smaller if the timeout is reached or if requests don’t fit within size limits.
send_delay: Maximum time (in seconds) to wait before sending a partially filled batch. If not specified, uses a default of 15 seconds. This prevents indefinite waiting when request volume is low.
tick: Time interval (in seconds) between checking for new batch requests and batch completion status. If not specified, uses a default of 15 seconds.
max_batches: Maximum number of batches to have in flight at once for a provider (defaults to 100).
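
For example, a batch.yml config file passed via --batch might look like this (a sketch assuming the YAML keys mirror the BatchConfig fields; values are illustrative):

size: 200
send_delay: 60
tick: 15
max_batches: 50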

Batch Processing Flow

When batch processing is enabled, the following steps are taken when handling generation requests:

  1. Request Queuing: Individual model requests are queued rather than sent immediately.

  2. Batch Formation: Requests are grouped into batches based on size limits and timeouts.

  3. Batch Submission: Complete batches are submitted to the provider’s batch API.

  4. Status Monitoring: Inspect periodically checks batch completion status.

  5. Result Distribution: When batches complete, results are distributed back to the original requests.

These steps are transparent to the caller; however, they do have implications for total evaluation time, as discussed above.

Details and Limitations

See the provider documentation for OpenAI and Anthropic for additional provider-specific details on batch processing, including token costs, rate limits, and limitations.

In general, you should keep the following limitations in mind when using batch processing:

  • Batches may take up to 24 hours to complete.

  • Evaluations with many turns will wait for many batches (each potentially taking many hours), and samples will generally take longer as requests need to additionally wait on the other requests in their batch before proceeding to the next turn.

  • If you are using sandboxes, then your machine’s resources may place an upper limit on the number of concurrent samples you can run (correlated with the number of CPU cores), which will reduce batch sizes.

Learning More

  • Providers covers usage details and available options for the various supported providers.

  • Caching explains how to cache model output to reduce the number of API calls made.

  • Multimodal describes the APIs available for creating multimodal evaluations (including images, audio, and video).

  • Reasoning documents the additional options and data available for reasoning models.

  • Structured Output explains how to constrain model output to a particular JSON schema.