Reasoning
Overview
Reasoning models like OpenAI GPT-5, Claude 4 and 5, and Gemini 3 have some additional options that can be used to tailor their behaviour. In some cases they also expose full or summarized reasoning traces for the chains of thought that led to their response.
Reasoning Effort
The reasoning_effort option controls how much reasoning is performed. Inspect supports a superset of what the various provider APIs accept and maps values as required (as documented below). Available options include: none, minimal, low, medium, high, xhigh, and max.
For example:
inspect eval math.py --model openai/gpt-5 --reasoning-effort high
Or from Python:
eval("math.py", model="openai/gpt-5", reasoning_effort="high")
Provider Mapping
OpenAI
| Inspect input | API value |
|---|---|
| none | reasoning omitted |
| minimal / low / medium / high / xhigh | identical |
| max | xhigh |
Anthropic Claude 4.6+ and Claude 5
Opus 4.6, Opus 4.7, Opus 4.8, Sonnet 4.6, and the Claude 5 models all use adaptive thinking with the effort parameter. When reasoning_effort is not set, Opus 4.6/4.7 and Sonnet 4.6 let the model auto-select effort, while Opus 4.8 and the Claude 5 models default to high server-side.
For the Claude 5 models thinking is always on and cannot be disabled: passing none does not turn reasoning off — Inspect omits the effort and the model continues to reason at its server-side default.
| Inspect input | API value |
|---|---|
| none | reasoning omitted (Claude 5: not disabled; see above) |
| minimal / low | low |
| medium | medium |
| high | high |
| xhigh | xhigh on Claude 4.7+ and Claude 5; otherwise high |
| max | max |
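The table above can be read as a small lookup with one version-dependent branch. The following is a minimal sketch of that reading, not Inspect's actual implementation: the function name `claude_adaptive_effort` and the `supports_xhigh` flag are hypothetical, standing in for whatever model-version check is really used.

```python
from typing import Optional

# Values copied from the table above. The names here are hypothetical
# and are not part of Inspect's API.
_BASE_MAP = {
    "minimal": "low",
    "low": "low",
    "medium": "medium",
    "high": "high",
    "max": "max",
}


def claude_adaptive_effort(effort: Optional[str], supports_xhigh: bool) -> Optional[str]:
    """Return the effort value to send, or None when effort is omitted."""
    if effort is None or effort == "none":
        # Effort is omitted from the request. On Claude 5 the model
        # still reasons at its server-side default.
        return None
    if effort == "xhigh":
        # Claude 4.7+ and Claude 5 accept xhigh; other models fall back to high.
        return "xhigh" if supports_xhigh else "high"
    if effort not in _BASE_MAP:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return _BASE_MAP[effort]
```

For example, `claude_adaptive_effort("xhigh", supports_xhigh=False)` falls back to `"high"`, matching the "otherwise high" cell in the table.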
Anthropic Claude 3.7 / 4.0 / 4.1 / 4.5
These models do not accept effort natively, so Inspect automatically bridges reasoning_effort to an extended thinking token budget as follows:
| Effort | Token budget |
|---|---|
| minimal | 2,048 |
| low | 4,096 |
| medium | 10,000 |
| high | 16,000 |
| xhigh / max | 32,000 |
Note that you can also pass reasoning_tokens explicitly for these models.
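As an illustration of the bridging described above, the effort-to-budget table can be expressed as a dictionary lookup. This is a sketch under stated assumptions: `EFFORT_TO_BUDGET` and `thinking_budget` are hypothetical names, and because the table has no row for none, the sketch rejects that value rather than guessing how it is handled.

```python
# Budgets copied from the table above. The names are hypothetical and
# are not part of Inspect's API.
EFFORT_TO_BUDGET = {
    "minimal": 2_048,
    "low": 4_096,
    "medium": 10_000,
    "high": 16_000,
    "xhigh": 32_000,
    "max": 32_000,
}


def thinking_budget(effort: str) -> int:
    """Look up the extended thinking token budget for an effort level."""
    try:
        return EFFORT_TO_BUDGET[effort]
    except KeyError:
        # The table does not list this value (for example "none"),
        # so this sketch refuses to invent a budget for it.
        raise ValueError(f"no token budget listed for effort {effort!r}") from None
```

Passing reasoning_tokens explicitly, as noted above, would bypass a lookup like this entirely.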
Google Gemini 3
Gemini 3 Flash exposes four thinking levels (MINIMAL, LOW, MEDIUM, HIGH); Gemini 3 Pro and 3.1 Pro omit MINIMAL and otherwise share the same scale.
| Inspect input | API value (Flash) | API value (Pro) |
|---|---|---|
| none | thinking disabled | thinking disabled |
| minimal | MINIMAL | LOW |
| low | LOW | LOW |
| medium | MEDIUM | MEDIUM |
| high / xhigh / max | HIGH | HIGH |
Google Gemini 2.5
Gemini 2.5 models do not accept effort levels; instead they support a thinking_budget. Inspect bridges reasoning_effort to the following budgets:
| Effort | Token budget |
|---|---|
| minimal | 2,048 |
| low | 4,096 |
| medium | 10,000 |
| high | 16,000 |
| xhigh / max | 32,000 |
Note that you can also pass reasoning_tokens explicitly for these models.
Grok
Grok 3 Mini and Grok 4.X variants (grok-4-fast-reasoning, grok-4.1-fast-reasoning, grok-4.20, grok-4.3) accept reasoning_effort. The original grok-4 reasons but does not accept the parameter — Inspect omits effort for that model. Inspect maps reasoning_effort as follows:
| Inspect input | API value |
|---|---|
| none | reasoning omitted |
| minimal / low | low |
| medium | medium |
| high / xhigh / max | high |
OpenRouter
Passes through to the underlying model; OpenRouter itself maps effort to budget_tokens for models that need it, using the formula budget = clamp(max_tokens × ratio, 1024, 128000).
| Input | API value | Ratio |
|---|---|---|
| none | reasoning omitted | n/a |
| minimal | minimal | 0.1 |
| low | low | 0.2 |
| medium | medium | 0.5 |
| high | high | 0.8 |
| max / xhigh | xhigh | 0.95 |
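To make the formula budget = clamp(max_tokens × ratio, 1024, 128000) concrete, here is a short worked sketch using the ratios from the table. It is illustrative only: `openrouter_budget` is a hypothetical helper, and truncating the product to an integer is an assumption, since the rounding rule is not stated above.

```python
# Ratios copied from the table above; "max" is sent as "xhigh".
RATIOS = {
    "minimal": 0.1,
    "low": 0.2,
    "medium": 0.5,
    "high": 0.8,
    "xhigh": 0.95,
}

MIN_BUDGET = 1024
MAX_BUDGET = 128_000


def openrouter_budget(max_tokens: int, effort: str) -> int:
    """Compute clamp(max_tokens * ratio, 1024, 128000) for an effort level."""
    if effort == "max":
        effort = "xhigh"
    if effort not in RATIOS:
        raise ValueError(f"no ratio listed for effort {effort!r}")
    # Assumption: the fractional part is truncated before clamping.
    raw = int(max_tokens * RATIOS[effort])
    return max(MIN_BUDGET, min(raw, MAX_BUDGET))
```

With `max_tokens=16000` and medium effort the budget is 8,000 tokens. With `max_tokens=4000` and minimal effort the raw value of 400 is raised to the 1,024 floor, and with `max_tokens=200000` and high effort the raw value of 160,000 is capped at 128,000.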
Groq / Ollama / SageMaker
Upstream APIs accept only low / medium / high. Inspect clamps the extended values:
| Inspect input | API value |
|---|---|
| none | reasoning omitted |
| minimal / low | low |
| medium | medium |
| high / xhigh / max | high |
Bedrock
Varies by hosted model family. Claude on Bedrock accepts only reasoning_tokens (no effort); Nova uses its own reasoningConfig.maxReasoningEffort scale; GPT-OSS passes effort through.
Model Defaults
When Inspect does not pass reasoning_effort, each provider applies its own default. The table below records the documented provider default per model. Models with no entry have either no documented default or no effort scale at all.
| Model | Default effort |
|---|---|
| anthropic/claude-fable-5 | high |
| anthropic/claude-mythos-5 | high |
| anthropic/claude-opus-4-6 | adaptive |
| anthropic/claude-opus-4-7 | adaptive |
| anthropic/claude-opus-4-8 | high |
| anthropic/claude-sonnet-4-6 | adaptive |
| deepseek/deepseek-reasoner | no effort scale |
| google/gemini-3-flash-preview | medium |
| google/gemini-3-pro | high |
| google/gemini-3.1-flash-lite-preview | medium |
| google/gemini-3.1-pro | high |
| google/gemini-3.5-flash | medium |
| grok/grok-3-mini | low |
| grok/grok-4 | no effort scale |
| grok/grok-4.3 | low |
| mistral/magistral-medium-2506 | no effort scale |
| mistral/magistral-small-2506 | no effort scale |
| openai/gpt-5 | medium |
| openai/gpt-5-mini | medium |
| openai/gpt-5-nano | medium |
| openai/gpt-5.1 | medium |
| openai/gpt-5.1-codex | medium |
| openai/gpt-5.2 | medium |
| openai/gpt-5.2-codex | medium |
| openai/gpt-5.2-pro | high |
| openai/gpt-5.3-codex | medium |
| openai/gpt-5.4 | medium |
| openai/gpt-5.4-mini | medium |
| openai/gpt-5.4-nano | medium |
| openai/gpt-5.4-pro | high |
| openai/gpt-5.5 | medium |
| openai/gpt-5.5-pro | high |
Reasoning Content
Many reasoning models surface their underlying chain of thought in a special “thinking” or reasoning block. Inspect normalises these into ContentReasoning blocks alongside ContentText, ContentImage, etc., and displays them in their own region in Inspect View and the terminal conversation view.
Reasoning content is captured using several heuristics: a reasoning or reasoning_content field on the assistant message, content wrapped in <think></think> tags, or explicit APIs for models that support them (e.g. Anthropic extended thinking blocks).
Some models also return reasoning_tokens usage, which is included in the standard ModelUsage object.
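The tag-based heuristic mentioned above can be sketched in a few lines. This is a simplified illustration rather than Inspect's own parser: `split_think_tags` is a hypothetical helper that handles only the first <think></think> pair and returns the reasoning and the remaining answer separately.

```python
import re
from typing import Optional, Tuple

# Match the first <think>...</think> span, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think_tags(text: str) -> Tuple[Optional[str], str]:
    """Split raw model output into (reasoning, answer).

    Returns (None, text) unchanged when no think tags are present.
    """
    match = _THINK_RE.search(text)
    if match is None:
        return None, text
    reasoning = match.group(1).strip()
    # Everything outside the tags is treated as the visible answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

For instance, `split_think_tags("<think>2+2 is 4</think>The answer is 4.")` separates the reasoning `"2+2 is 4"` from the answer `"The answer is 4."`.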
Reasoning Options
The following reasoning options are available from the CLI and within GenerateConfig:
| Option | Description |
|---|---|
| reasoning_effort | Constrains effort on reasoning. Accepts none, minimal, low, medium, high, xhigh, max. See Reasoning Effort for per-provider mapping. Supported by all reasoning models: Inspect automatically bridges effort to a token budget for legacy Claude (3.7–4.5) and Gemini 2.5. Default is provider-defined. |
| reasoning_tokens | Deprecated. Prefer reasoning_effort. Explicit token budget for reasoning. Both Anthropic (budget_tokens) and Google (thinking_budget) have deprecated this control in favour of effort-based reasoning. On Anthropic Claude 4.7+ and Claude 5 it is unsupported (those models removed the token-budget control) and raises an error; use reasoning_effort instead, which works across all Claude versions. |
| reasoning_summary | OpenAI only. Provide a summary of reasoning steps. Accepts none, concise, detailed, auto. Use auto to access the most detailed summarizer available. Some OpenAI accounts require organization verification. |
| reasoning_history | How much prior reasoning to replay in conversation history. Accepts none, all, last, auto. Use last to keep reasoning from dominating the context window. Defaults to auto. |
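To illustrate what the reasoning_history modes could mean for a conversation, the sketch below filters reasoning out of a message list. It rests on several assumptions: messages are modelled as plain dictionaries rather than Inspect's message classes, last is taken to mean "keep reasoning only on the most recent assistant message that has any", and auto is left out because its behaviour is not described above. `replay_reasoning` is a hypothetical name.

```python
from typing import Dict, List

# Hypothetical message shape: {"role": ..., "text": ..., "reasoning": ...}
Message = Dict[str, str]


def replay_reasoning(messages: List[Message], mode: str) -> List[Message]:
    """Return copies of messages with reasoning kept according to mode."""
    if mode not in ("none", "all", "last"):
        raise ValueError(f"mode not modelled in this sketch: {mode!r}")
    # Positions of assistant messages that carry reasoning.
    reasoning_indexes = [
        i for i, m in enumerate(messages)
        if m.get("role") == "assistant" and m.get("reasoning")
    ]
    last_index = reasoning_indexes[-1] if reasoning_indexes else None
    replayed = []
    for i, message in enumerate(messages):
        copy = dict(message)  # leave the caller's messages untouched
        keep = mode == "all" or (mode == "last" and i == last_index)
        if not keep:
            copy.pop("reasoning", None)
        replayed.append(copy)
    return replayed
```

Under these assumptions, last drops reasoning from every earlier turn, which is why it helps keep long conversations from filling the context window with old reasoning.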
vLLM / SGLang
vLLM and SGLang both support reasoning outputs, but the configuration is model-specific. See the vLLM and SGLang docs for details.
For vLLM, configure the model’s reasoning parser using -M model arguments. For example, Qwen3:
inspect eval math.py --model vllm/Qwen/Qwen3-8B -M reasoning_parser=qwen3
Thinking mode is model-specific and controlled separately from --reasoning-effort. For models where vLLM exposes template switches such as enable_thinking or thinking, pass them as chat-template kwargs:
inspect eval math.py --model vllm/Qwen/Qwen3-8B \
-M reasoning_parser=qwen3 \
-M default_chat_template_kwargs='{"enable_thinking": true}'
To override per-request:
inspect eval math.py --model vllm/Qwen/Qwen3-8B \
-M reasoning_parser=qwen3 \
-M extra_body='{"chat_template_kwargs": {"enable_thinking": true}}'
Open-weights reasoning models do not all support adjustable effort levels. In those cases --reasoning-effort is a no-op, even though a reasoning parser is still required for vLLM to separate reasoning from the final answer.
If the model already emits reasoning between <think></think> tags (as with R1 or via prompt engineering), Inspect captures it automatically without any vLLM or SGLang configuration.