Reasoning
Overview
Reasoning models like OpenAI o-series, Claude Sonnet 3.7, Gemini 2.5 Flash, Grok 3, and DeepSeek-R1 have some additional options that can be used to tailor their behaviour. In some cases they also make available full or partial reasoning traces for the chains of thought that led to their response.
In this article we’ll first cover the basics of Reasoning Content and Reasoning Options, then cover the usage and options supported by various reasoning models.
Reasoning Content
Many reasoning models allow you to see their underlying chain of thought in a special “thinking” or reasoning block. While reasoning is presented in different ways depending on the model, in the Inspect API it is normalised into ContentReasoning blocks which are parallel to ContentText, ContentImage, etc.
Reasoning blocks are presented in their own region in both Inspect View and in terminal conversation views.
While reasoning content isn’t made available in a standard fashion across models, Inspect does attempt to capture it using several heuristics, including responses that include a reasoning or reasoning_content field in the assistant message, assistant content that includes <think></think> tags, as well as using explicit APIs for models that support them (e.g. Claude 3.7).
In addition, some models make available reasoning_tokens which will be added to the standard ModelUsage object returned along with output.
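The field and tag heuristics described above can be illustrated with a small standalone sketch. This is not Inspect's implementation: `extract_reasoning` and the raw message dict shape are hypothetical, chosen only to show the general idea of preferring an explicit reasoning field and falling back to `<think></think>` tags.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper (not part of the Inspect API): illustrates a
# "reasoning field first, then <think> tags" fallback.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(message: dict) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a raw assistant message dict."""
    content = message.get("content") or ""

    # 1. Prefer an explicit reasoning field when the provider supplies one.
    for field in ("reasoning", "reasoning_content"):
        value = message.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip(), content.strip()

    # 2. Otherwise look for a <think>...</think> block inside the content.
    match = THINK_PATTERN.search(content)
    if match:
        reasoning = match.group(1).strip()
        answer = (content[: match.start()] + content[match.end():]).strip()
        return reasoning, answer

    # 3. No reasoning could be identified.
    return None, content.strip()
```

In this sketch the reasoning text is separated from the answer so that each could be stored in its own content block, which mirrors the ContentReasoning / ContentText split in spirit only.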
Reasoning Options
The following reasoning options are available from the CLI and within GenerateConfig:
| Option | Description | Default | Models |
|---|---|---|---|
| reasoning_effort | Constrains effort on reasoning for reasoning models (minimal, low, medium, or high) | medium | OpenAI o-series, Grok 3 |
| reasoning_tokens | Maximum number of tokens to use for reasoning. | (none) | Claude 3.7+ and Gemini 2.5+ |
| reasoning_summary | Provide summary of reasoning steps (none, concise, detailed, auto). Use “auto” to access the most detailed summarizer available for the current model. | (none) | OpenAI o-series |
| reasoning_history | Include reasoning in message history sent to model (none, all, last, or auto) | auto | All models |
As you can see from above, models have different means of specifying the tokens to allocate for reasoning (reasoning_effort and reasoning_tokens). The two options don’t map precisely into each other, so if you are doing an evaluation with multiple reasoning models you should specify both. For example:
eval(
    task,
    model=["openai/o3-mini", "anthropic/claude-3-7-sonnet-20250219"],
    reasoning_effort="medium",  # openai and grok specific
    reasoning_tokens=4096,      # anthropic and gemini specific
    reasoning_summary="auto",   # openai and gemini specific
)
The reasoning_history option lets you control how much of the model’s previous reasoning is presented in the message history sent to generate(). The default is auto, which uses a provider-specific recommended default (normally all). Use last to keep earlier reasoning from overwhelming the context window.
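To make the none / all / last settings concrete, here is a minimal sketch of how such a filter could behave. The `Message` class and `apply_reasoning_history` function are hypothetical stand-ins, not Inspect types, and `auto` is left out because it resolves to a provider-specific choice.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    """Hypothetical, simplified chat message (not an Inspect class)."""
    role: str
    text: str
    reasoning: Optional[str] = None


def apply_reasoning_history(messages: List[Message], mode: str) -> List[Message]:
    """Return a copy of the history with reasoning kept per `mode`."""
    if mode == "all":
        return list(messages)
    if mode not in ("none", "last"):
        raise ValueError(f"unsupported mode: {mode!r}")

    # For "last", remember the index of the final assistant message
    # that carries reasoning; for "none", nothing is kept.
    keep_index = None
    if mode == "last":
        for i, m in enumerate(messages):
            if m.role == "assistant" and m.reasoning is not None:
                keep_index = i

    # Strip reasoning everywhere except the index we decided to keep.
    return [
        m if (i == keep_index or m.reasoning is None) else Message(m.role, m.text)
        for i, m in enumerate(messages)
    ]
```

With a two-turn history, `last` would retain only the second assistant turn's reasoning, which is what keeps long multi-turn conversations from filling the context window with old chains of thought.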
OpenAI Models
OpenAI has several reasoning models available including the GPT-5 and o-series models. Learn more about the specific models available in the OpenAI Models documentation.
Reasoning Effort
You can condition the amount of reasoning done via the reasoning_effort option, which can be set to none, minimal, low, medium, or high. For example:
inspect eval math.py --model openai/o3 --reasoning-effort high
Defaults vary by provider and model, and not all models support all values (please consult provider documentation for details).
Reasoning Summary
You can see a summary of the model’s reasoning by specifying the reasoning_summary option. Available options are none, concise, detailed, and auto (auto is recommended to access the most detailed summarizer available for the current model). For example:
inspect eval math.py --model openai/o3 --reasoning-summary auto
Anthropic Claude
Anthropic’s Claude 3.7 Sonnet and Claude 4/4.5 Sonnet/Opus models include optional support for extended thinking. These are hybrid models that support both normal and reasoning modes. This means that you need to explicitly request reasoning by specifying the reasoning_tokens option, for example:
inspect eval math.py \
  --model anthropic/claude-3-7-sonnet-latest \
  --reasoning-tokens 4096
Tokens
The max_tokens for any given request is determined as follows:
- If you only specify reasoning_tokens, then max_tokens will be set to 4096 + reasoning_tokens (as 4096 is the standard Inspect default for Anthropic max tokens).
- If you explicitly specify max_tokens, that value will be used as the max tokens without modification (so it should accommodate sufficient space for both your reasoning_tokens and normal output).
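The two rules above amount to a small piece of arithmetic. The following sketch restates them; `resolve_max_tokens` is a hypothetical name used for illustration, and the behavior when neither option is given (falling back to 4096) is an assumption based on the default mentioned above.

```python
from typing import Optional

# The Inspect default for Anthropic max tokens cited in the text above.
DEFAULT_ANTHROPIC_MAX_TOKENS = 4096


def resolve_max_tokens(
    reasoning_tokens: Optional[int],
    max_tokens: Optional[int] = None,
) -> int:
    """Hypothetical helper restating the max_tokens rules for Claude."""
    # An explicit max_tokens is used as-is, so it must already leave room
    # for both reasoning and the normal output.
    if max_tokens is not None:
        return max_tokens
    # Only reasoning_tokens given: add it on top of the default budget.
    if reasoning_tokens is not None:
        return DEFAULT_ANTHROPIC_MAX_TOKENS + reasoning_tokens
    # Assumption: with neither option set, the plain default applies.
    return DEFAULT_ANTHROPIC_MAX_TOKENS
```

For example, `--reasoning-tokens 4096` on its own would yield a request budget of 8192 under these rules, while adding an explicit max tokens of 6000 would leave only 1904 tokens for the visible answer.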
Inspect will automatically use response streaming whenever extended thinking is enabled to mitigate networking issues that can occur for long-running requests. You can override the default behavior using the streaming model argument. For example:
inspect eval math.py \
  --model anthropic/claude-3-7-sonnet-latest \
  --reasoning-tokens 4096 \
  -M streaming=false
History
Note that Anthropic requests that all reasoning blocks be played back to the model in chat conversations (although they will only use the last reasoning block and will not bill for tokens on previous ones). Consequently, the reasoning_history option has no effect for Claude models (it effectively always uses last).
Google Gemini
Google currently makes available several Gemini reasoning models, the most recent of which are:
- Gemini 2.5 Flash: google/gemini-2.5-flash
- Gemini 2.5 Pro: google/gemini-2.5-pro
- Gemini 3.0 Pro: google/gemini-3-pro-preview
For Gemini 3.0, you can use the --reasoning-effort option to control the amount of reasoning used by the model. For example:
inspect eval math.py \
  --model google/gemini-3-pro-preview \
  --reasoning-effort low
Gemini 3 supports thinking_levels “low” and “high”, so Inspect maps reasoning effort levels “minimal” or “low” to “low” and “medium” or “high” to “high” (Gemini support for “medium” is coming soon).
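The effort-to-level mapping just described is simple enough to write as a lookup table. This is a sketch of the mapping as stated in the text, not Inspect's own code; `GEMINI_3_THINKING_LEVEL` and `to_thinking_level` are hypothetical names.

```python
# Hypothetical lookup restating the mapping described above:
# reasoning_effort value -> Gemini 3 thinking level.
GEMINI_3_THINKING_LEVEL = {
    "minimal": "low",
    "low": "low",
    "medium": "high",
    "high": "high",
}


def to_thinking_level(reasoning_effort: str) -> str:
    """Map a reasoning_effort value onto a Gemini 3 thinking level."""
    try:
        return GEMINI_3_THINKING_LEVEL[reasoning_effort]
    except KeyError:
        raise ValueError(
            f"unsupported reasoning_effort: {reasoning_effort!r}"
        ) from None
```

One consequence worth noting: because “medium” rounds up to “high”, an evaluation run with the default medium effort would use the model's highest thinking level under this mapping.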
For Gemini 2.5, you can use the --reasoning-tokens option to control the amount of reasoning used by these models (this option is deprecated for Gemini 3 models). For example:
inspect eval math.py \
  --model google/gemini-2.5-flash \
  --reasoning-tokens 4096
Note that for Flash models you can disable reasoning with --reasoning-tokens=0 (Gemini 2.5 Pro does not support disabling reasoning).
The most recent Gemini models also include support for including a reasoning summary in model output.
Grok
Grok currently makes available several reasoning models:
- grok/grok-4
- grok/grok-3
- grok/grok-3-mini
You can condition the amount of reasoning done by Grok 3 using the [reasoning_effort](https://docs.x.ai/docs/guides/reasoning) option, which can be set to low or high.
inspect eval math.py --model grok/grok-3-mini --reasoning-effort high
Note that Grok 4 does not support the --reasoning-effort parameter so it is ignored if specified.
DeepSeek-R1
DeepSeek-R1 is an open-weights reasoning model from DeepSeek. It is generally available either in its original form or as a distillation of R1 based on another open weights model (e.g. Qwen or Llama-based models).
DeepSeek models can be accessed directly using their OpenAI interface. Further, a number of model hosting providers supported by Inspect make DeepSeek available, for example:
| Provider | Model |
|---|---|
| Together AI | together/deepseek-ai/DeepSeek-R1 (docs) |
| Groq | groq/deepseek-r1-distill-llama-70b (docs) |
| Ollama | ollama/deepseek-r1:<tag> (docs) |
There isn’t currently a way to customise the reasoning_effort of DeepSeek models, although they have indicated that this will be available soon.
Reasoning content from DeepSeek models is captured using either the reasoning_content field made available by the hosted DeepSeek API or the <think> tags used by various hosting providers.
vLLM/SGLang
vLLM and SGLang both support reasoning outputs; however, the usage is often model dependent and requires additional configuration. See the vLLM and SGLang documentation for details.
If the model already outputs its reasoning between <think></think> tags such as with the R1 models or through prompt engineering, then Inspect will capture it automatically without any additional configuration of vLLM or SGLang.