SageMaker: Add inference_component_name model argument for routing requests to specific inference components on multi-model endpoints.
Computer Use: Map PRINTSCREEN (OpenAI vocab) to xdotool Print keysym so key combos like ALT+PRINTSCREEN work correctly.
Inspect View: Metadata with more than 5 children will be collapsed by default.
Bugfix: Fix race condition in eval_set with retry_immediate=True that could cause ClosedResourceError when a task entered the retry path while other workers were completing concurrently.
Bugfix: Fix regression in realtime event stream introduced by message condensing.
0.3.209 (20 April 2026)
Capture compaction strategy params in eval log.
Inspect View: Display results of Scout scanners used as scorers next to transcripts.
Inspect View: New columns in task and log view: tags, % completed, sample errors, and error.
Inspect View: Fix regression in messages view which causes excessive whitespace between messages.
Inspect View: Fix error when attempting to collapse all or expand all events in transcripts.
Inspect View: Improvements to expand / collapse behavior in transcripts.
Inspect View: Don’t show empty entries in messages view when a message is retried.
0.3.208 (19 April 2026)
Google: Correct counting for cached input tokens.
Model API: Log model retries at WARNING when backoff >= 60s.
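The severity escalation can be sketched with stdlib logging; the helper names and logger name here are illustrative, not the actual inspect_ai code:

```python
import logging

logger = logging.getLogger("inspect.model_retry")

def retry_log_level(backoff_seconds: float) -> int:
    # Long waits are surfaced at WARNING so they stand out in task logs;
    # shorter backoffs stay at INFO (60s threshold per the changelog entry).
    return logging.WARNING if backoff_seconds >= 60 else logging.INFO

def log_retry(attempt: int, backoff_seconds: float) -> None:
    logger.log(
        retry_log_level(backoff_seconds),
        "model retry %d scheduled in %.0fs", attempt, backoff_seconds,
    )
```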
Model API: Enrich retry log messages with task/sample/model context and error summary.
Text Editor: Return OSError from path validation (e.g. ENAMETOOLONG) to the model as a tool error instead of crashing the eval.
Task Display: Add cancel button to cancel individual tasks during parallel execution.
0.3.207 (16 April 2026)
Anthropic: Auto-detect correct context window and max tokens for Opus 4.7.
Anthropic: Support for new xhigh value for effort.
Anthropic: Support for max value for reasoning_effort.
0.3.206 (15 April 2026)
Eval Set: Display explicit --id in task panel headers when provided.
Eval Set: Add --retry-immediate option to retry failed tasks immediately without waiting for all tasks to complete, reusing completed samples from the failed run.
Eval Logs: Add header_only parameter to write_eval_log() for writing only the header to .eval files without rewriting samples.
Eval Logs: Condense sample events when writing logs.
Eval Logs: Enable zstd compression by default for writing logs.
Eval Logs: New inspect log recover command for recovering crashed eval logs from the sample buffer database. Recovers both completed (unflushed) and in-progress samples. Automatic recovery is integrated into eval_set() and eval_retry().
Eval Logs: Save recent events (up to last ModelEvent) when retrying samples.
Bash tool: Change name of argument from cmd to command.
Sandboxes: Pass sample_id to sandbox providers via metadata.
Sandboxes: INSPECT_SANDBOX_MAX_READ_FILE_SIZE and INSPECT_SANDBOX_MAX_EXEC_OUTPUT_SIZE environment variables for overriding limits.
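For example, the limits could be raised via the environment (the 50 MiB value here is hypothetical, chosen for illustration):

```shell
# Hypothetical values: raise both sandbox I/O limits to 50 MiB (in bytes)
export INSPECT_SANDBOX_MAX_READ_FILE_SIZE=52428800
export INSPECT_SANDBOX_MAX_EXEC_OUTPUT_SIZE=52428800
```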
Docker Sandbox: Implement in-sandbox timeout enforcement using timeout command.
Hooks: Add on_before_model_generate() hook.
Model API: Support extended json schema fields (validation and examples).
Model API: Handle special token strings in tiktoken encoding.
Task Display: Truncate all content to a maximum of 50 lines.
Scoring: Convert score value of None to NaN during deserialization.
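A minimal sketch of the conversion (the helper name is illustrative, not the actual deserializer):

```python
import math

def deserialize_score_value(value):
    # Scores stored as null become NaN so numeric aggregation
    # doesn't fail on missing values.
    return math.nan if value is None else value
```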
Computer Use: Map comma character to xdotool comma keysym so key combos like CTRL+, work correctly.
Computer Use: Restore sudo package to computer tool Docker image.
OpenAI Compatible: Pad response with content block when only content is reasoning.
OpenAI Compatible: Return server_error when server returns non-ChatCompletion (which can occur in some cases for OpenRouter).
Anthropic: Pass display="summarized" in thinking configuration.
Anthropic: Use request level “auto” caching mode for improved prompt caching.
vLLM: Allow vLLM provider to restart after close().
Schemas: Remove old json-schema-to-typescript codegen in favor of new pipeline.
Schemas: Fix OpenAPI schema generation for samples/reductions (give them independent field serializers to preserve types).
Inspect View: Use FastAPI server when fastapi and uvicorn packages are available.
Inspect View: Transcript viewing improvements for complex transcripts (timeline and other fixes).
Inspect View: Introduce new ‘Tasks’ view of log directory which shows tasks recursively as a flat list.
Inspect View: Fix error when viewing the API information for a running Model Event.
Bugfix: Fix eval_results() producing identical aggregate scores for multiple instances of the same scorer due to incorrect name resolution using dimension names instead of scorer names.
Bugfix: Fix eval_results() mutating the reducers parameter inside a loop, causing inconsistent reducer assignment across scorer instances.
Timelines: Consolidate TimelineBranch into TimelineSpan via forked_at property.
0.3.202 (31 March 2026)
Google: Update to google-genai v1.69.0 to address type changes (async_http_client can now be None for Vertex with Google Auth).
Approval: New read_approval_policies() function for reading approval policies from a config file.
Approval: Add metadata field to Approval which is in turn forwarded to ApprovalEvent.
Cache results of parse_tool_info() to improve performance when there are many tools defined.
Cache Pydantic TypeAdapters in condense_events for performance.
Model API: Add required field to get_model() for ensuring that model roles are specified.
Model API: Export model_roles() function to get model roles for the active task.
Timelines: Improved forked_at detection for forking on non-assistant messages.
Installation: Ensure that all required static assets are included in bundle.
Inspect View: Fix printing for samples with large transcripts or many messages.
Inspect View: Fix issues that would cause a running sample display to wait for the task to complete before showing final score.
Inspect View: Fix regression that could hide assistant messages with only tool calls.
Inspect View: Move log viewer frontend from src/inspect_ai/_view/www/ into ts-mono/apps/inspect/ monorepo (pnpm + Vite + Jest). Built assets are copied to src/inspect_ai/_view/dist/ via a Vite plugin. No user-facing changes.
Inspect View: Built TypeScript code is now minified and committed via git lfs.
Bugfix: Handle recursive references when resolving $ref targets in JSON schema.
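A standalone sketch of cycle-safe $ref resolution (simplified; a real resolver handles more of JSON Pointer and does not share the visited set across siblings):

```python
def resolve_refs(schema, root=None, seen=None):
    """Resolve local $ref targets, guarding against recursive references.

    Sketch of the fix: track visited ref paths so a schema that refers
    to itself (directly or via a cycle) doesn't cause infinite recursion;
    cyclic refs are left in place rather than expanded forever.
    """
    root = schema if root is None else root
    seen = set() if seen is None else seen
    if isinstance(schema, dict):
        ref = schema.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            if ref in seen:
                return {"$ref": ref}  # leave the cycle in place
            seen.add(ref)
            target = root
            for part in ref[2:].split("/"):
                target = target[part]
            return resolve_refs(target, root, seen)
        return {k: resolve_refs(v, root, seen) for k, v in schema.items()}
    if isinstance(schema, list):
        return [resolve_refs(item, root, seen) for item in schema]
    return schema
```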
Bugfix: Accept numeric cpus in compose deploy resources.
0.3.201 (25 March 2026)
Google: Remove deprecated gemini-3-pro-preview from computer use model check and replace with gemini-3.1-pro-preview in tests and docs.
SageMaker: Add completion_mode for CPT/base models, sending completions-style request payloads with logprobs and prompt_logprobs support.
SageMaker: Fix streaming metadata tracking to accumulate across chunks instead of relying on the last chunk.
Bedrock: Add read_timeout and connect_timeout model args.
HuggingFace: Add do_sample model arg for overriding default sampling behavior.
VLLM: Add client_timeout to OpenAICompatibleAPI and VLLMAPI.
Computer Use: Fix argparse error when typing non-numeric text starting with - (e.g. -0.07") by using the = form for the --text argument.
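The underlying argparse behavior can be reproduced with the stdlib alone:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--text")

# The `=` form binds the value unambiguously, even when it begins with `-`
args = parser.parse_args(['--text=-0.07"'])

# Passing the value as a separate token fails: argparse treats a leading
# `-` on a token that doesn't parse as a negative number as another option.
try:
    parser.parse_args(["--text", '-0.07"'])
except SystemExit:
    separate_form_rejected = True
```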
Eval Set: Embed viewer before evals run when using embed_viewer=True, and keep listing.json updated as logs are created.
Eval Set: Fix run_multiple silently swallowing task finalisation errors and returning success=True with no results.
Model API: Handle tool_calls and source when combining assistant messages.
Hooks: Increase event buffer to math.inf so it never blocks.
Inspect View: Copy button for log files now copies the absolute path (or S3 URI) rather than the relative serving path.
0.3.200 (20 March 2026)
Google: Fix intermittent FAILED_PRECONDITION error when using native code execution by omitting function calling system instruction hint.
AzureAI: Add explicit org prefix support for AzureAI third-party models.
Agent Bridge: Ensure that sandbox model proxy errors end the sample with a clear runtime error.
Agent Bridge: Provide streaming ‘ping’ responses to keep Anthropic clients alive.
Agent Bridge: Increase timeouts associated with model proxy server.
Bash Session: Further increase bash session transport timeout to 180s.
Sandboxes: Fail the sample when sandbox timeout errors occur outside of the context of tool calls.
Sandboxes: Bounded output buffering in exec_remote to prevent OOM from unbounded subprocess output.
Sandboxes: Enable specification of timeout and timeout_retry for exec_remote() requests.
Sandboxes: Increase timeouts associated with Docker execution.
Sandboxes: Fix intermittent server startup failure caused by socket deletion racing with bind()/listen() in sandbox tools server.
Logging: INSPECT_SUBPROCESS_REDIRECT_TO_LOGGER env var to pipe subprocess output to logging.
Eval Logs: Add condense_events/expand_events API pair and EventsData TypedDict for deduplicating and restoring repeated model event inputs and call messages.
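The deduplication idea can be sketched standalone (illustrative condense/expand helpers on plain dicts; the real API operates on typed events and returns an EventsData structure):

```python
def condense(events):
    # Replace each repeated "input" payload with an index into a shared
    # pool, so identical message lists are stored once.
    pool, index, condensed = [], {}, []
    for event in events:
        key = repr(event["input"])
        if key not in index:
            index[key] = len(pool)
            pool.append(event["input"])
        condensed.append({**event, "input": index[key]})
    return condensed, pool

def expand(condensed, pool):
    # Inverse operation: restore the full payloads from the pool.
    return [{**event, "input": pool[event["input"]]} for event in condensed]
```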
Task Execution: Defer loading full task states until samples actually execute, reducing sample memory usage from O(total_samples × epochs) to O(concurrent_samples).
Task Execution: Add --max-dataset-memory option to limit the size of datasets held in memory during execution. When exceeded, samples are paged to disk.
Inspect Score: Add support for an optional list of metrics when rescoring a log.
Inspect View: Embed viewer directly in the log directory instead of a viewer/ subdirectory, fixing permission issues when serving logs.
Inspect View: Fix incorrectly themed sample column header text in VS Code (especially dark themes).
Inspect View: Fix issue where expanding one message in a sample chat would expand all messages.
Bugfix: Handle dicts with numeric keys in json_changes.
Bugfix: Raise error when computer use is requested with an incompatible model/bridge combination.
Bugfix: Catch NotADirectoryError when locating sandbox tools binary so S3 download/build fallbacks run on Python < 3.13.
Bugfix: Fix mutation of reused GenerateConfig values during request assembly.
Bugfix: Fix sandbox tools Docker build failure caused by staticx incompatibility with setuptools 82+ (removed pkg_resources).
0.3.193 (13 March 2026)
OpenAI: Don’t serialize unspecified fields in ResponseCustomToolCallParam.
Anthropic: Update input token limit for Sonnet/Opus 4.6 to 1M tokens.
Bugfix: Handle missing ‘content’ content key in parse_reasoning_content for OpenAI.
0.3.192 (13 March 2026)
Anthropic: Fallback to summary compaction when native compaction fails to compact.
Compaction: Improve logging message for native compaction failures.
Model Args: Support for specifying model_args using the --model-role CLI flag.
0.3.191 (12 March 2026)
Mistral: Update to v2.0 of mistralai package.
Inspect View: Fix regression that prevented proper display of running samples.
0.3.190 (11 March 2026)
Anthropic: Add insert_text parameter to text_editor() tool (matches Claude 4.6 schema).
0.3.189 (07 March 2026)
Anthropic: Preserve OAuth beta header when per-request betas are set.
Timelines: Detect agent_result for timeline spans.
Sandboxes: exec_remote() now auto-injects sandbox tools CLI if needed.
Web Search: Add query parameter encoding for Google provider.
Hooks: Add on_sample_event() hook.
Hooks: Add MLflow tracking example hook.
AsyncFilesystem: Support writing files larger than 5GB to S3 using upload_fileobj for automatic multipart uploads.
File Utils: Strip trailing separators from S3 paths in absolute_file_path (matches local fs behavior).
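A simplified sketch of the normalization (hypothetical helper name; the real function handles more cases):

```python
def normalize_path(path: str) -> str:
    # Strip trailing separators from S3-style paths so they compare
    # consistently with local filesystem paths.
    if path.startswith("s3://"):
        return "s3://" + path[len("s3://"):].rstrip("/")
    # For local paths, preserve the root "/" itself.
    return path.rstrip("/") or "/"
```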
Inspect View: Add inspect view embed command and --embed-viewer eval-set option to embed a log viewer into a log directory.
Inspect View: Display proper message when transcript events are removed to reduce eval log size.
Inspect View: Properly compute nested span parents when rendering the event tree.
Bugfix: Eval set task_identifier is now computed using redacted model_args.
0.3.188 (05 March 2026)
OpenAI: Fix mypy issues with OpenAI SDK 2.26 (now required).
Improve error serialization for model call objects.
0.3.187 (05 March 2026)
OpenAI: Detect some additional “content_filter” stop reason conditions.
OpenAI: Handle web_search_call response with no action field.
OpenAI: Support for OpenAI SDK 2.25 (GPT 5.4, image detail “original”).
Anthropic: Handle continuations that split server tool use and its result across messages.
Grok: Support for batch inference.
Refusals: Added refusal counter to task display and add option to log warnings when refusals occur (--log-refusals).
Eval: Add --generate-config CLI option for specifying config via YAML or JSON file.
Sandboxes: Longer default timeout (120) for sandbox RPC polling.
Timelines: Detect k/v warmup calls as utility agents.
Inspect View: Fix truncation of the bottom of events and messages panels.
Inspect View: Improve appearance of model events in transcript.
Testing: Fix trio skip/select logic for parameterized tests whose node IDs contain [trio-...] instead of [trio].
0.3.186 (03 March 2026)
Anthropic: Handle updated Anthropic compaction not supported error message.
OpenAI: Use fallback for token counting and compaction endpoints when running in environments (e.g. AzureAI) where they are not supported.
Google: Use httpx instead of aiohttp when running under trio async backend for compatibility.
Grok: Raise clear error when using the grok provider under the trio async backend (gRPC is asyncio-only).
Serialization: Remove dependency on frozendict as fallback; update jsonpath-ng dependency.
Task view: Extract and print <summary> from <details> tags in tool views.
Timelines: Don’t attempt to automatically detect branches (require explicit creation by user in custom timelines).
Timelines: Improved automatic detection of utility agents and automatically unwrap solver/agent pairs.
AsyncFilesystem: Add anonymous and region_name parameters to support credential-free access to public S3 buckets.
Inspect View: Add support for find in log list.
Inspect View: Fix regression displaying running samples when switching samples.
Testing: Fix “Event loop is closed” error in bridge compaction tests by properly closing AsyncOpenAI client.
Eval logs: Deduplicate repeated model event inputs and call messages into shared pools, reducing .eval file sizes.
Eval logs: Stream deduplicated message pools to the viewer during in-progress evaluations.
0.3.185 (01 March 2026)
Anthropic: Use text_editor_20250728 for all Claude 4.x models per Anthropic docs.
Events: Add agent_span_id property to tool events for associating them with their associated agent.
0.3.184 (28 February 2026)
Model API: By default, only log raw model api request/response when an error occurs. Override to log all model api calls with --log-model-api.
Model API: Truncate the model request to a maximum of 200 lines when printing to the console after an error.
Model API: Add SageMaker provider for invoking models hosted on AWS SageMaker endpoints.
Model API: Normalize handling of cached tokens in ModelUsage (input tokens now excludes cached tokens whereas previously it included them for some providers).
Model API: Track model usage by model role in addition to globally.
OpenAI: Capture system and user messages in compaction responses.
OpenAI: Warn user when reasoning options are passed to non-reasoning model.
OpenAI: Pass through phase for gpt-5.3-codex models.
OpenAI Compatible: Re-create closed httpx client after disconnect.
Anthropic: Support ANTHROPIC_AUTH_TOKEN for OAuth Bearer authentication.
Anthropic: Enable computer_20251124 tool version (with zoom) for Claude Sonnet 4.6.
vLLM: Support for LoRA (Low-Rank Adaptation) via --enable-lora server option and LoRA-tuned server startup logic.
Tools: Enable parameter placeholders in tool views.
OpenRouter: Improved capture of reasoning summaries for Gemini models.
Scoring: Add math() scorer which handles comparing mathematical expressions.
ReAct Agent: Break out of react() agent loop if the model refuses three times without choosing to call the submit() tool.
Agent Bridge: Only require openai package when bridging the OpenAI completions or responses API.
Sandbox Tools: Increase server startup timeout from 20 seconds to 120 seconds.
Performance: Share a single AsyncFilesystem via ContextVar within each async context, eliminating redundant S3 client creation and connection pool fragmentation.
Inspect View: Improve virtualized find in transcript by matching event titles as well as contents.
Testing: Migrate async tests from pytest-asyncio to anyio, enabling dual-backend (asyncio/trio) test execution via --runtrio flag.
Testing: Run --runtrio as trio-only in a separate process to prevent cross-backend global state contamination; convert batch tests from asyncio to anyio.
Bugfix: Strip surrounding quotes from S3 ETag in .eval header-only reads so it is consistent with full reads.
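S3 returns ETags wrapped in double quotes in HTTP headers; the normalization is essentially (illustrative helper name):

```python
def normalize_etag(etag: str) -> str:
    # Strip the surrounding quotes S3 includes in ETag headers so
    # header-only reads compare equal to full reads.
    return etag.strip('"')
```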
Inspect View: Presigned URL support for S3 log files, enabling direct browser-to-S3 byte-range fetches with parallel chunk downloads and a determinate progress bar for large samples.
Eval Logs: Batch log-headers validation and file mapping concurrently.
0.3.183 (24 February 2026)
Improved naming and type info for exec_remote() output stream events.
Scoring: Don’t use task level metric overrides when appending scores to an existing log.
Inspect View: Add support for downloading sample JSON.
Inspect View: Server returns correct content length for responses.
0.3.182 (24 February 2026)
AzureAI: Pass max_completion_tokens to gpt-5 and o-series models.
Events: Add timeline functions for providing additional structure for event viewing and traversal.
Bugfix: Fix Inspect View showing stale sample data when rapidly switching between samples.
0.3.181 (23 February 2026)
Hooks: New on_sample_init() hook that fires before sandbox environments are created, enabling hooks to gate sandbox resource provisioning.
Model API: Add content_list property to ChatMessage for consistent access to content as a list.
OpenAI Compatible: Send max_completion_tokens when interacting with gpt-5 or o-series models.
Anthropic: Use output_config directly (rather than via extra_body) which is compatible with batch mode.
Google: Add latest Gemini models to model info database.
Sandboxes: Verify execute result size automatically for all sandbox exec calls.
Sandboxes: Export exec_remote() types from root namespace and add docs.
Eval Set: Add TASK_IDENTIFIER_VERSION to support persistence of task identifiers in inspect_flow.
Eval Retry: Don’t retry with model_base_url unless it was explicitly specified by the user.
Agent Bridge: Add model_aliases to agent bridge and pass Model to GenerateFilter.
Dependencies: Update to nest-asyncio2 v1.7.2 to address anyio threading issue.
Inspect View: Display all non-undefined edited score values.
Bugfix: Don’t reuse eval_set logs when sample_shuffle changes and limit constrains sample selection.
Bugfix: eval_set now correctly handles pending tasks and incomplete tasks (e.g. limit/epoch changes) in a single pass, instead of skipping incomplete tasks when new tasks were present.
Bugfix: Reuse S3 clients in log recorders to fix session leak.
Bugfix: Create eval set bundle even when all logs are already complete.
Bugfix: Fix epochs_changed false positives in eval_set caused by comparing reducer closure __name__ instead of registry log name.
Bugfix: Fix async ZIP parser crash on valid .eval files whose compressed data contained a false ZIP64 EOCD Locator signature.
Bugfix: Skip non-JSON lines in MCP server stdout parsing.
Bugfix: Remove doubled MIME prefix in MCP content conversion.
Bugfix: Ensure that eval() specified model_roles override task-level roles.
Bugfix: Improve max sample size error.
0.3.180 (20 February 2026)
Agent Bridge: Google Gemini API is now supported for in-process and sandbox bridges.
Task Execution: Cancelled samples are now logged in the same fashion as samples with errors.
Anthropic: Increase max_tokens caps for Claude 4.5 and 4.6 models.
Anthropic: Update to new sdk types released with Sonnet 4.6 (v0.80.0 of anthropic package is now required).
Anthropic: Remove uses of Sonnet 3.7 from tests (no longer available).
Hugging Face: More flexible control over application of chat templates (enables support for generation from base models).
VLLM: Don’t retry when the error indicates that the VLLM server has crashed.
Analysis: Async reading of logs/samples in samples_df() (now 50x faster).
Sandboxes: Don’t require Docker compatible sandboxes to implement config_deserialize().
Sandboxes: New exec_remote() method for async execution of long-running commands.
Compaction: Add type field to CompactionEvent to record compaction type.
Web Search: Treat Tavily query character limits as ToolErrors.
Limits: New cost_limit() context manager for scoped application of cost limits.
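An illustrative re-implementation of a scoped cost limit (the actual cost_limit() signature and semantics in inspect_ai may differ; this sketches only the idea of a context-scoped budget):

```python
from contextlib import contextmanager

class CostLimitExceeded(Exception):
    pass

class _CostScope:
    def __init__(self, limit: float) -> None:
        self.limit, self.spent = limit, 0.0

    def record(self, cost: float) -> None:
        # Accumulate spend and raise once the scoped budget is exceeded.
        self.spent += cost
        if self.spent > self.limit:
            raise CostLimitExceeded(f"{self.spent:.2f} > {self.limit:.2f}")

@contextmanager
def cost_limit(limit: float):
    # Everything inside the `with` block shares one budget.
    yield _CostScope(limit)
```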
Performance: Disable expensive per-sample options when running high-throughput workloads.
Events: Rename EventNode to EventTreeNode and SpanNode to EventTreeSpan (old type names will still work at runtime with a deprecation warning).
Inspect View: Make samples in task detail sortable, inline epoch filter, show sample status.
Bugfix: Shield sandbox cleanup after cancelled exception.
Bugfix: Protect against leading zero-width characters when printing tool output to the terminal.
Bugfix: Google batch JSONL serialization now correctly nests generation config fields (e.g. thinking_config) under generation_config in the REST schema.
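The shape of the fix, sketched on plain dicts (key names beyond thinking_config/generation_config are illustrative):

```python
GENERATION_KEYS = {"temperature", "top_p", "max_output_tokens", "thinking_config"}

def to_batch_request(config: dict) -> dict:
    # Generation parameters (including nested thinking_config) must live
    # under "generation_config" in the REST JSONL schema, not at the top
    # level of the request.
    request = {k: v for k, v in config.items() if k not in GENERATION_KEYS}
    request["generation_config"] = {
        k: v for k, v in config.items() if k in GENERATION_KEYS
    }
    return request
```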
Bugfix: Google batch polling no longer hangs forever when a batch job reaches EXPIRED or PARTIALLY_SUCCEEDED state.
0.3.179 (12 February 2026)
Bugfix: Fix shutdown hang by draining nest_asyncio event loop.
Bugfix: Fix regression in live sample display in Inspect View.
0.3.178 (11 February 2026)
Google: Hard failure for quota exceeded errors with limit: 0 (indicating the model or feature is fully restricted).
Compaction: Improve token counting by using input tokens reported from call to generate().
Model API: For 400 errors, print the error after the request payload rather than before.
Eval Logs: Add progress callback interface for reading eval logs.
Sandboxes: Added http_proxy example for intercepting and remapping HTTP requests from agents using mitmproxy.
Inspect View: Fix regression in log viewer navigation in VSCode.
Inspect View: Improve transcript display appearance in VSCode.
Inspect View: Improve log events display in transcripts.
Bugfix: Fix off-by-one in _read_all_summaries that skipped the last sample summary.
Inspect View: Fix stale state read in logsSlice after syncLogs.
0.3.177 (10 February 2026)
Anthropic: Do not pass through unrecognized extra_body fields.
0.3.176 (10 February 2026)
Eval Logs: Async parallel reading of eval log headers from S3, reducing time from 12+ minutes to ~12 seconds for ~600 files.
Bugfix: Correct handling of ComposeConfig for Docker sandbox provider.
0.3.175 (10 February 2026)
OpenRouter: Retry 500 and 504 errors returned in request JSON body.
Scoring: Allow customisation of grouped metric names.
Model API: Don’t strictly require OpenAI and Anthropic versions when they aren’t in use.
Inspect View: Show live requests and responses for model call events in the transcript.
Inspect View: Improve scroll performance when viewing sample transcripts and messages.
0.3.174 (09 February 2026)
Compaction: Remove reasoning blocks from compact() result for Anthropic provider.
0.3.173 (08 February 2026)
Compaction: Correct capture of compaction results for Anthropic streaming mode.
Compaction: Improved prefix handling (drop all but system messages) for native compaction.
Compaction: Improved display of OpenAI and Anthropic compaction data in the viewer.
0.3.172 (06 February 2026)
Inspect View: Fix a regression which affected the display of samples within VSCode.
0.3.171 (06 February 2026)
Compaction: New CompactionNative strategy which uses provider-native compaction (currently only available for OpenAI and Anthropic Claude 4.6).
Sandboxes: Enable sandbox providers to declare Docker compatibility, which will result in Docker config files being passed to them.
Docker Sandbox: Store auto-compose files in centralized project-keyed location (rather than alongside tasks).
Inspect View: Improve reliability of code syntax highlighting in messages and events.
Inspect View: Support zstd compression of eval log file contents.
Inspect View: Fix issue where viewing sample events could result in flashing and scroll oscillation.
Inspect View: Render <think> tags when included in user messages.
Bugfix: Correct handling for --reasoning-history CLI argument (don’t parse as boolean).
Bugfix: Submit to human_cli() with no answer now correctly completes task.
0.3.169 (01 February 2026)
Anthropic: Correct handling of beta server tool use blocks for bridge clients that use the beta API (e.g. PydanticAI).
OpenAI: Workaround for openai Python SDK inability to round trip ‘find_in_page’ web search actions.
Reasoning: Don’t process <think> tags in assistant message loading (now all done directly by model providers).
Web Search: Use internal search providers by default when no external provider is defined (previously they required explicit enabling).
Web Search: Fallback to Google CSE provider only when Google CSE environment variables are defined (the CSE service has been deprecated by Google).
Eval Logs: Improve eval log loading performance with JSON cache key for messages.
Eval Logs: Support Zstd compression of eval logs for improved performance via INSPECT_USE_ZSTD environment variable.
Agent Bridge: Make sandbox_agent_bridge cleanup errors non-fatal when agent completes.
Compaction: Add source="compaction" to InfoEvent created by compaction.
0.3.168 (31 January 2026)
New nnterp model provider enabling use of StandardizedTransformer models with Inspect.
OpenAI Compatible: More generic handling for reasoning payloads (playback reasoning in exactly the same body field it was captured from).
Eval Logs: Add EvalStatus type alias for evaluation status literals ("started", "success", "cancelled", "error").
Bugfix: Raise PrerequisiteError when bundling to a subdirectory of the log dir (instead of deleting the logs from the log dir).
0.3.167 (29 January 2026)
Early Stopping: Check for early stopping after sample semaphore is acquired rather than before.
Revert use of json.dumps for message cache keys (incompatible with BaseModel types).
0.3.166 (29 January 2026)
Scoring: Add model_usage field to ScoreEvent for tracking token usage vs score.
Compaction: Compact server tool uses in CompactionEdit strategy (previously only client tool uses were compacted).
Docker: Avoid mutable default env arguments in execution helpers.
Eval Logs: Add exclude_fields parameter to read_eval_log_sample() for memory-efficient loading of large samples.
Inspect View: Fix issue where switching from a running to a non-running evaluation could display incorrect metrics in the title region.
Inspect View: Fix sample switching when viewing live transcripts.
0.3.165 (26 January 2026)
Eval Logs: Improve load time by using JSON in duplicate message cache rather than frozendict.
Compaction: Remove citations after compaction to avoid dangling citation references (updated trim_message() to use the same behavior).
Inspect View: Fix “Cannot add property timestamp, object is not extensible” error when viewing live transcripts.
0.3.164 (24 January 2026)
Google: Provide JSON schema directly rather than converting it to Google Schema type.
Agent Bridge: Support bridge clients that use the Anthropic Beta API.
Agent Bridge: Serialize ContentReasoning as <think> with attributes to prevent bridge clients from doing a more lossy <think> tag conversion.
Compaction: Correct handling of thinking mode in Anthropic count_tokens() method.
Compaction: Correct handling of consecutive tool messages in Anthropic count_tokens() method.
Bash Session: Increase bash session transport timeout and make new session timeouts fatal.
Inspect View: Show timestamps (yyyy-mm-dd hh:mm:ss, local time zone) for USER and ASSISTANT messages in model event transcripts.
Inspect View: Remove events from JSON before parsing if Sample JSON is too large.
Bugfix: Include type field in JSON Schema for Literal and Enum types.
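A sketch of deriving such a schema for a Literal type (illustrative helper, not the actual codegen):

```python
from typing import Literal, get_args

def literal_schema(literal_type) -> dict:
    # Emit a "type" field alongside "enum" so strict JSON Schema
    # validators accept the schema.
    values = list(get_args(literal_type))
    json_types = {str: "string", int: "integer", float: "number", bool: "boolean"}
    types = {json_types[type(v)] for v in values}
    schema = {"enum": values}
    if len(types) == 1:
        schema["type"] = types.pop()
    return schema
```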
Bugfix: Handle maps and lists in registry_kwargs().
0.3.163 (21 January 2026)
Anthropic: Only re-order reasoning blocks for Claude 3 (as we use interleaved thinking for Claude 4).
Analysis: Read all samples at once in implementation of samples_df().
Agent Bridge: Handle OpenAI assistant message params with no ‘type’ field (Pydantic AI compatibility).
Inspect View: Improve sample summary truncation (use markdown truncation instead of line clamping).
Inspect View: Fix issue with typing over an existing selection in ‘Find’.
Inspect View: Fix issues with ‘Find’ scrolling, keyboard behavior, and restoration of scroll position / panel expansion.
Inspect View: Support find using JSON-like syntax.
0.3.162 (18 January 2026)
Google: Add streaming model arg to opt-in to streaming generation.
TogetherAI: Support for parsing logprobs returned in OpenAI format (e.g. for gpt-oss-20b).
HF Tasks: Support for image_input (data URI) in field spec for multimodal tasks.
Scoring: Enable editing scores for samples that do not yet have a score.
Task Display: Throttle updates to running samples according to total samples.
Sandbox: Support passing a ComposeConfig directly to Docker sandbox provider.
Sandbox: Remove supported_fields parameter from parse_compose_yaml() (packages handle their own validation).
Sandbox Service: Provide option to trigger request processing manually.
Inspect View: Fix regression where viewing samples with identical id/epoch would re-use the previous sample details.
Inspect View: Show event timestamp in tooltips in all types of events in transcripts.
Inspect View: Show sample invalidation status in sample header.
Bugfix: Compose models now correctly handle x- extensions at all levels (inner models discarded them, outer models accepted non-extensions).
0.3.161 (10 January 2026)
Sandbox: parse_compose_yaml() for parsing Docker Compose files into typed configuration for sandbox providers.
Google: Yield system_instructions as list of str (improved compatibility with opentelemetry capture).
Google: Raise error if batch processing is used with Vertex hosted models.
OpenAI Compatible: Always pass function definitions with strict=True. This is required by HF Inference Providers and Fireworks (and possibly others).
OpenAI Compatible: Convert function arguments to JSON if they are provided as a string (as is done by xAI and perhaps other providers).
Model API: Improvements in model detection for hosting providers (e.g. Azure, Bedrock, etc.).
Eval Log: Add version of the package exporting the task (if any) to the eval log.
Analysis: Convert mixed-type object columns to string for PyArrow conversion.
Sandboxing: Add INSPECT_SANDBOX_SETUP_TIMEOUT env var to override default 300s setup timeout.
Human Agent: Fixed non-scalar intermediate score values breaking task commands like task status and task stop.
Bugfix: Print only enabled hooks at CLI startup.
Bugfix: Fix eval_set log reuse when setting limits as eval set args.
0.3.160 (09 January 2026)
Agent Bridge: Consolidate bridged tools implementation into the existing sandbox model proxy service (eliminate Python requirement for using bridged tools).
Anthropic: Correctly replay reasoning when sourced from Inspect cache.
Anthropic: Tolerate {} as value for additionalProperties in tool schema.
OpenAI Compatible: Don’t ever send background parameter as this is OpenAI service-specific.
OpenAI Compatible: Added support for disabling reasoning history emulation.
Grok: Correctly replay tool calling errors in message history.
VLLM and SGLang: Don’t require API key environment variable to be set when running in local mode.
Google: Support minimal and medium reasoning effort levels for Gemini 3 Flash.
Fireworks: Use streaming when max_tokens is greater than 16000.
Model API: Add combined_from metadata field when combining consecutive user or assistant messages for call to generate.
HF Tasks: Require >1.0.0 of huggingface_hub package.
Eval Set: Include task version and limits in task identifier hash to prevent incorrect log reuse.
Scoring: Match only last line of output in answer(pattern="line").
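A minimal sketch of the last-line matching behavior (hypothetical helper; not the actual answer() implementation):

```python
# Sketch: test the expected target against the final non-empty line of the
# model output, rather than scanning the whole completion.
def match_last_line(output: str, target: str) -> bool:
    lines = [line for line in output.splitlines() if line.strip()]
    return bool(lines) and target in lines[-1]
```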
JSON Datasets: Support passing arbitrary kwargs to JSON readers (built-in reader and jsonlines reader).
Filesystems: Use default_fs_options() for async_connection().
Inspect View: Don’t attempt to display events when the events are too large for the browser to deserialize (e.g. 350MB+ of events).
Inspect View: Improve rendering of tool output with ANSI codes. Support viewing raw/unrendered ANSI output.
Inspect View: Scale ANSI display in messages view to preserve row/column layout without wrapping.
Inspect View: Render custom tool view when viewing messages.
Inspect View: Fix cmd+click on tasks/samples to open in new tab.
Inspect View: Only stream log bytes when requested chunks are large (>50MB).
Inspect View: Add Show Retried Logs button when inside an eval set and some logs were retried (both Tasks and Samples are now de-duplicated by default).
Inspect View: Improved non-native find for virtualized lists (better CTRL-f).
Bugfix: Prevent component not found error during Human Agent transition.
Bugfix: Use builtins module rather than __builtins__ when parsing tool function types.
0.3.159 (03 January 2026)
Compaction: Support for compacting message histories of long-running agents that exceed the context window.
Model API: count_tokens() method for estimating token usage for messages.
Model API: ModelInfo for retrieving information about models (e.g. organization, context window, reasoning, release date, etc.).
Eval Retry: Initialize model usage from usage recorded in retried eval log.
Anthropic: Use service model name when detecting tool compatibility.
Google: Various mitigations for Gemini returning MALFORMED_FUNCTION_CALL.
OpenRouter: Improved integration with reasoning_details (map onto standard reasoning fields for viewer).
Human CLI Agent: Ability to add custom instructions and .bashrc commands to agent shell.
Properly handle working time reporting for overlapping coroutines waiting on semaphores.
Eval Logs: Support reading from IO[bytes] via read_eval_log().
Inspect View: Properly display dict scores in sample list.
Inspect View: Improve display of Codex shell_command tool calls.
Inspect View: Improve the display of very wide metrics results in the results dialog.
0.3.158 (24 December 2025)
skill() tool to make agent skills available to models.
Bugfix: Fix log file cache lookup using incorrect comparison key.
0.3.157 (22 December 2025)
Eval Set: Correct log reuse behavior when epochs and limit change.
Solvers: Capture all parameters (including defaults) used to create solvers and agents.
Tasks: Improved validation of Hugging Face Hub task definitions.
HF Inference Providers: Specify “strict” for function tool definitions.
Agent API: Improved capture of agent name with nested @agent decorators.
Agent Bridge: Ensure that OpenAI responses params have an “id” field before validation.
Sandbox Service: Continue with warning if request polling raises a RuntimeError.
0.3.156 (20 December 2025)
Anthropic: Treat reasoning text as a summary (true for all models after Sonnet 3.7).
OpenAI: Remove custom transport to respect HTTP proxy settings.
Bugfix: Copy metadata field to new eval for eval-retry.
Bugfix: Retry when parsing an incomplete bridged tool call response.
Bugfix: Delay after launching bridged tool service to prevent asyncio race condition.
0.3.153 (05 December 2025)
Agent Bridge: Don’t print serialization warnings when going from Pydantic -> JSON (as we use beta types that can cause warnings even though serialization works as intended).
Batch Processing: Enable customizing of batch status rendering.
Inspect View: Expand dictionary scores into separate scores when viewing samples.
0.3.152 (04 December 2025)
Update Plan tool for tracking steps and progress across longer horizon tasks.
Code Execution tool for executing Python code in a stateless sandbox running on model provider servers.
Anthropic: Support for new Effort setting (--effort) for trading off between response thoroughness and token efficiency.
Anthropic: Include native web_fetch tool as part of web_search() implementation (matching capability of other providers that have native web search).
Anthropic: Use required caller field for server tool uses (required by package version 0.75, which is now the minimum version).
OpenAI: Check for mismatches between specified model and Azure deployment URL.
Mistral: Use the new Conversation API by default (disable with -M conversation_api=False).
Mistral: Added support for native web_search and code_execution tools (executed server side).
Mistral: Added support for document input.
Grok: Support for server-side MCP tool calling.
VLLM and SGLang: Default to 5 second retry policy when server rejects requests due to saturated GPU (customize with model arg retry_delay).
Model API: Assign new message ID when combining messages for replay to providers.
MCP Tools Bridge: Added BridgedToolsSpec and bridged_tools parameter to sandbox_agent_bridge() for exposing host-side Inspect tools to sandboxed agents via MCP protocol.
Dependencies: Update to mcp package version 1.23.0.
Inspect View: Fix regression where the display of samples with errors would result in unusably wide sample list view.
Inspect View: Properly compute sample list columns for running evaluations that return dictionary scores.
Bugfix: Ensure that entry points are not scanned repeatedly when there are no targets.
0.3.151 (30 November 2025)
Memory tool: Added memory() tool and bound it to native definitions for providers that support it (currently only Anthropic).
Grok: Correctly reconstruct assistant tool calls when replaying messages to API.
Grok: Round trip encrypted reasoning (made available in v1.4.0 of xai_sdk, which is now required).
Anthropic: Protect against signature not being replayed (can occur for agent bridge) by saving a side list of signatures.
Sandboxes: For “local” and “docker” sandbox providers, treat output_limit as a cap enforced with a circular buffer (rather than a limit that results in killing the process and raising).
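The rolling cap described above can be sketched with a bounded deque (an illustrative sketch, not the sandbox providers' actual implementation):

```python
from collections import deque

# Sketch: enforce output_limit as a cap with a circular buffer that keeps only
# the most recent bytes, instead of killing the process when the limit is hit.
class RollingOutput:
    def __init__(self, limit: int) -> None:
        self._buf: deque[int] = deque(maxlen=limit)

    def write(self, data: bytes) -> None:
        self._buf.extend(data)  # older bytes fall off the front when full

    def value(self) -> bytes:
        return bytes(self._buf)
```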
Sandboxes: Added evals_in_eval example for running Inspect evaluations inside other evaluations.
Model API: Enable model providers to have custom retry wait strategies (use 5 second fixed wait for vllm).
Prevent querying of local timezone and forbid naïve datetimes via the DTZ lint rule.
Dependencies: Change jsonpath-ng requirement to >=1.6.0 (formerly required >= 1.7.0).
Dependencies: Move from unmaintained nest_asyncio, which is fundamentally incompatible with Python 3.14, to nest_asyncio2, which has explicit 3.14 compatibility.
Agent bridge: Ensure that bridge filters also take advantage of retry_refusals loop.
Agent bridge: Workaround Codex CLI not passing detail along with images.
OpenAI: Automatically switch to the completions API when --num-choices is specified.
Model APIs: Improve legibility/clarity of error messages when updated versions of anthropic or openai packages are required.
Dataframes: Added SampleScores column group for extracting score answer, explanation, and metadata.
Sandbox tools: Rewrite inspect-ai package installation type detection code.
Task: Support mixed metrics (both direct metrics and dict groupings in the same list), matching the flexibility of the @scorer decorator.
Inspect View: Fix regression sorting folders and logs in list (folders should sort to the front of the list).
Inspect View: Properly reset page when navigating between folders.
Inspect View: Always show reasoning blocks (previously we hid them when there was no content, i.e. Responses API store=True).
Inspect View: Improve the display of Codex Agent update_plan and shell tool inputs.
Inspect View: Fix flash of error message when initially viewing a log file in VS Code.
Inspect View: Properly create tree for transcripts when tasks include async work generating spans and events.
Bugfix: Properly deserialize EvalSet when optional values are missing.
Bugfix: Fix “auto” message truncation in react agent.
Bugfix: Update various tests to react to Google’s deprecation of old models.
0.3.133 (22 September 2025)
Sandbox tools: bash_session, text_editor, and sandbox MCP servers no longer require a separate pipx install (they are now automatically injected into sandbox as a static binary with no Python dependencies).
Agent bridge: Python is no longer required within containers using the sandbox agent bridge.
Agent bridge: Enhance automatic state tracking by ignoring shorter sub-agent generations.
Agent bridge: Add retry_refusals option for automatically retrying refusals a set number of times.
Scoring: inspect score now supports streaming via the --stream argument.
Inspect View: Starting the view server with a path to a specific log file will automatically open that log file (if it exists) rather than showing the log list.
Inspect View: Fix error that caused ‘json too large’ message to appear incorrectly for sample JSON.
Inspect View: Improve filtering of log files in log list (improve performance and loading progress).
Inspect View: Add cmd+F shortcut for filtering log in log list.
Inspect View: Fix regression in tool input syntax highlighting.
Inspect View: Focus transcript or messages when sample dialog is loaded, allowing use of keyboard shortcuts like cmd + arrow down for scrolling.
Inspect View: Focus log list when the log list is shown, allowing use of keyboard shortcuts like cmd + F.
Bugfix: Ensure ETags always match content when reading S3 logs to prevent write conflicts.
0.3.129 (03 September 2025)
Agent Bridge: Don’t use concurrency() for agent bridge interactions (not required for long-running proxy server or cheap polling requests).
Sandboxes: Add concurrency parameter to exec() to advise whether the execution should be subject to local process concurrency limits.
0.3.128 (02 September 2025)
Agent Bridge: Correctly dispatch LimitExceededError which occurs during proxied model calls.
Agent Bridge: Respect reference vs. value semantics of agent caller (enables preservation of messages when agent is run via as_solver()).
OpenAI: Update types to match openai v1.104.1 (which is now the minimum required version).
Mistral: Support for updated use of ThinkChunk types in mistralai v1.9.10.
Groq: Support for --reasoning-effort parameter (works w/ gpt-oss models).
Scoring: Use fallback unicode numeric string parser when default str_to_float() fails.
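A sketch of such a fallback parser (the function name is hypothetical; this is not the actual str_to_float() fallback):

```python
import unicodedata

# Sketch: try float() first, then fall back to unicodedata.numeric() for
# single unicode numeric characters such as vulgar fractions (e.g. "½").
def parse_numeric(text: str) -> float:
    text = text.strip()
    try:
        return float(text)
    except ValueError:
        if len(text) == 1:
            return unicodedata.numeric(text)  # raises ValueError if not numeric
        raise
```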
Bugfix: Work around OpenAI breaking change that renamed “find” web search action to “find_in_page” (bump required version of openai package to v1.104.0).
0.3.127 (01 September 2025)
Bugfix: Preserve sample list state (e.g. scroll position, selection) across sample open/close.
0.3.126 (01 September 2025)
Agent Bridge: OpenAI Responses API and Anthropic API are now supported alongside the OpenAI Completions API for both in-process and sandbox-based agent bridges.
Agent Bridge: Bridge can now automatically track AgentState changes by inspecting model traffic running over the bridge.
Agent Bridge: Improved id stability across generations to prevent duplicated messages in messages_df().
Agent Bridge: Ensure that explicitly specified GenerateConfig values for models override bridged agent config.
Agent handoff(): Use content_only() filter by default for handoff output and improve detection of new content from handed off to agents.
Model API: Refine available tool types for ContentToolUse ("web_search" or "mcp_call").
Model API: Remove internal field from ChatMessageBase (no longer used).
OpenAI: Added responses_store model arg for explicitly enabling or disabling the responses API.
Google: Pass tool parameter descriptions for nullable and enum typed fields.
Google: Support thought_signature for thought parts.
Google: Use role="user" for tool call results rather than role="function".
MCP: Export MCP server configuration types (MCPServerConfig and Stdio and HTTP variants).
Sandbox Service: New instance option for multiple services of the same type in a single container.
Sandbox Service: New polling_interval option for controlling polling interval from sandbox to scaffold (defaults to 2 seconds, overridden to 0.2 seconds for Docker sandbox).
ReAct Agent: Add submit tool content to assistant message (in addition to setting the completion).
Metrics: Compute metrics when an empty list of reducers is provided (do not reduce the scores before computing metrics). Add --no-epochs-reducer CLI flag for specifying no reducers.
Scoring: Make match more lenient when numeric matches contain markdown formatting.
Concurrency: Add visible option for concurrency() contexts to control display in status bar.
Inspect View: Add support for filtering sample transcripts by event types. By default, filter out sample_init, sandbox, store, and state events.
Inspect View: Add support for displaying raw markdown source when viewing sample data.
Inspect View: Remove sample list / title content when sample is displaying (prevents find from matching content behind the sample detail).
Inspect View: Custom rendering for TodoWrite tool calls.
Bugfix: Fix error in reducing scores when all scores for a sample are NaN.
Bugfix: Correctly extract authorization token from header in MCP remote server config.
0.3.125 (25 August 2025)
Scoring: Refactor inspect score to call same underlying code as score().
Bugfix: Fix regression in CLI scoring.
0.3.124 (24 August 2025)
Agent Bridge: New context-manager based agent_bridge() that replaces the deprecated bridge() function.
Agent Bridge: sandbox_agent_bridge() to integrate with CLI based agents running inside sandboxes.
Agent Bridge: Inspect model roles can now be addressed by bridged agents (e.g. “inspect/red-team”).
ReAct Agent: Allow for a ToolDef to be passed to an AgentSubmit type.
Model API: user_prompt() function for getting the last user message from a list of messages.
Scoring: Add copy option to score_async() (defaults to True) to control whether the log is deep copied before scoring.
Inspect View: Convert samples in the sample list to use plain anchor (a) tags for navigation. This allows typical user gestures like cmd+click to work correctly.
Inspect View: Update document titles when viewing a sample, log, or log dir to better disambiguate tabs or windows. Use reverse pyramid to place details at the head of the title.
Inspect View: Increase sample size limit to 100MB (samples larger than that are not browsable in the viewer).
Tool Support: Converted to a new runtime reconnaissance and injection architecture for inspect_tool_support.
Bugfix: Properly handle surrogates in JSON serialization.
Bugfix: Google and Mistral providers now generate unique tool call IDs to prevent collisions when calling the same tool multiple times.
Bugfix: Enable use of custom reducers with eval-retry by delaying their creation until after task creation.
Bugfix: Fix custom json schema generation code for CitationBase so that it no longer leads to an invalid schema.
Bugfix: Only pass background to OpenAI Responses if specified.
Bugfix: Do not pass unsupported tool_choice to Anthropic thinking models.
Google: Pass timeout generation config option through to API Client.
Google: Ability to specify a custom GOOGLE_VERTEX_BASE_URL.
OpenAI: Add background, safety_identifier and prompt_cache_key custom model args (bump required version of openai package to v1.98).
OpenAI: Set client_timeout to 900s when flex processing is enabled.
Ollama: Forward reasoning_effort option to reasoning dict.
MCP: Support for mcp_server_http() (which replaces the deprecated SSE server mode).
MCP: Added authorization to provide OAuth Bearer token for HTTP based servers.
Task display: Sample cancel button now works immediately (no longer needs to wait for a cooperative check).
Limits: Sample working limit is now enforced even during long running generations and sandbox operations.
Store: Support for serializing complex nested types (e.g. to read in an offline scorer).
Tools: Code viewer now handles function calls with list[str] rather than str without crashing.
Basic Agent: Only set message_limit to 50 when both message_limit and token_limit are None.
Tests: Improve sandbox self_check to handle test failure via with pytest.raises, add test for env vars.
Tests: Added the ability to provide a generator like callback function for MockLLM.
Scoring: Improve multiple_choice answer parsing, making it more strict in interpreting answers like ANSWER: None of the above. Allow answers to end with full stop (.).
Tool Support: Converted inspect_tool_support to use a Unix socket rather than a tcp port for intra-container RPC.
Bugfix: background() task is now scoped to the sample lifetime in the presence of retry_on_error.
Bugfix: Correct recording of waiting_time from within coroutines spawned from the main sample coroutine.
Bugfix: Update inspect-tool-support reference container to support executing tool code with non-root accounts.
Bugfix: Correct forwarding of reasoning_effort and reasoning_tokens for OpenRouter provider.
Bugfix: bridge() no longer causes a recursion error when running a large number of samples with OpenAI models.
Bugfix: Ensure that model_roles are available within task initialization code.
0.3.120 (07 August 2025)
OpenAI: Update model version checks for GPT-5.
OpenAI: Support for specifying “minimal” for reasoning_effort.
Bugfix: Conform to breaking changes in openai package (1.99.2).
Bugfix: Ensure that sample_shuffle is None (rather than 0) when not specified on the command line.
0.3.119 (04 August 2025)
Analysis functions are out of beta (inspect_ai.analysis.beta is deprecated in favor of inspect_ai.analysis).
Scoring: Provide access to sample store for scorers run on existing log files.
0.3.118 (02 August 2025)
Remove support for vertex provider as the google-cloud-aiplatform package has deprecated its support for Vertex generative models. Vertex can still be used via the native google and anthropic providers.
Tool calling: Added support for emulated tool calling (emulate_tools model arg) to OpenAI API compatible providers.
Task display: Improved display for multiple scorers/metrics in task results summary.
Scoring: Improved error message for scorers missing a return type annotation.
Datasets: Added --sample-shuffle eval option to control sample shuffling (takes an optional seed for determinism).
Batch Processing: Enable batch support when using Google model provider.
ReAct Agent: Require the submit tool to complete without errors before exiting the react loop.
Mistral: Type updates for ThinkChunk and AudioChunk in package v1.9.3 (which is now the minimum required version).
Inspect View: Use MathJax rather than Katex for math rendering.
Inspect View: Fix issue with scores ‘More…’ link not being displayed in some configurations.
Inspect View: Fix issue displaying tool calls in transcript in some configurations.
Bugfix: Strip smuggled <think> and <internal> tags from tool messages to prevent leakage in multi-agent scenarios where an inner assistant message can be coerced into a tool message.
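The stripping can be sketched with a simple regex (illustrative only; not the actual sanitization code):

```python
import re

# Sketch: remove <think>...</think> and <internal>...</internal> spans from
# tool message content before it re-enters the conversation.
def strip_internal_tags(content: str) -> str:
    return re.sub(
        r"<(think|internal)>.*?</\1>", "", content, flags=re.DOTALL
    ).strip()
```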
Bugfix: Handle descriptions of nested BaseModel types in tool call schemas.
Bugfix: Update workaround of OpenAI reasoning issue to retain only the last (rather than the first) in a run of consecutive reasoning items.
0.3.114 (17 July 2025)
OpenAI: Move model classification functions into ModelAPI class so that subclasses can override them.
Azure: Support for authenticating with Microsoft Entra ID managed identities.
Analysis: prepare() function for doing common data preparation tasks and log_viewer() operation for adding log viewer URLs to data frames.
Inspect View: Support linking to events via uuid field (or event_id in analysis data frames).
Bugfix: Use the output filesystem when creating directories in inspect log convert.
ReAct agent: Add keep_in_messages option to AgentSubmit to preserve calls to submit() in message history.
Scoring: Change Value type to use covariant types (Mapping and Sequence).
Scoring: Add display parameter to score() to control display type.
Scoring: NaN values returned from scorers will be excluded from computation of metrics. Scorers in results include scored_samples and unscored_samples fields to indicate how many samples were scored and how many were not. The viewer will display these values if there are unscored samples.
Eval Log: Protect against removing excessive numbers of samples at once from realtime database.
Eval Log: Add --resolve-attachments option to inspect log dump.
Hooks: Provide full EvalSample (rather than only the summary) to on_sample_end() hook.
Inspect View: Compatibility with sites published to GitHub Pages for inspect view bundle.
Inspect View: The bundle produced for deployment now includes a much more compact manifest, improving support for bundling large numbers of files.
Bugfix: Fix failure to allow Anthropic native web search for some model names such as claude-3-7-sonnet-latest.
Bugfix: Fix Anthropic citation support code when it encounters citations created by external search providers such as Tavily.
Bugfix: Break after finding final assistant message when implementing fallback for AgentState output field.
Bugfix: Fix run_in_background allowing it to properly function outside the context of a task.
Bugfix: None out TaskLogger’s SampleBufferDatabase after cleaning it up to avoid crashing on subsequent logging attempts.
Bugfix: Disassociate the logger used by batch processing’s background task from any particular sample.
Bugfix: Improve the compactness and efficiency of eval files with extremely large text user inputs.
Bugfix: Fixed bugs in batch process as the size of a batch approached the model provider’s maximum batch size of 256MB.
Bugfix: Fix regression that allowed computer tool screenshot truncation to occur despite not being valid for OpenAI.
Bugfix: Fix agent bridge scenarios that failed when used with reasoning models.
Bugfix: Fix cases where blocks are dropped in OpenAI choices because they are not at the front of text content.
0.3.112 (03 July 2025)
Hooks: Generic lifecycle hooks for Inspect extensions.
Datasets: Expand glob wildcards when processing --sample_id filter for datasets.
OpenAI: Enable web search for o3 and o4-mini models.
OpenAI: Enable emulated tool call image results for o-series.
Analysis: Provide score_headline_stderr field in standard evals column definitions.
Analysis: Provide task_name without package namespace by default.
Analysis: Don’t show dataframe import progress by default in notebooks (leaves empty cell output artifact).
ChatMessage: Add metadata field for arbitrary additional metadata.
Content: Added ContentData for model specific content blocks.
Citations: Added Citation suite of types and included citations in ContentText (supported for OpenAI and Anthropic models).
Eval log: task_args now includes defaulted args (formerly it only included explicitly passed args).
Eval set: retry_connections now defaults to 1.0 (resulting in no reduction in connections across passes).
OpenAI: Work around OpenAI Responses API issue by filtering out leading consecutive reasoning blocks.
OpenAI compatible provider: Substitute - with _ when looking up provider environment variables.
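The lookup amounts to a simple name transformation; a sketch (the variable naming convention and example provider are illustrative assumptions):

```python
# Sketch: derive the API key environment variable for an OpenAI-compatible
# provider name, substituting "-" with "_" before uppercasing.
def provider_api_key_var(provider: str) -> str:
    return f"{provider.replace('-', '_').upper()}_API_KEY"
```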
MCP: Update to types in latest release (1.9.4, which is now required).
Added development container (.devcontainer) configuration.
trim_messages() now removes any trailing assistant message after compaction.
Task display: Ensure that full path to log file is always displayed (wrap as required).
Task display: Wrap scorers and scores in the task detail display.
Inspect View: Add support for displaying citations for web searches in the transcript.
Inspect View: Correctly update browser URL when navigation between samples.
Bugfix: Properly honor responses_api=False when passed as an OpenAI model config arg.
Bugfix: Limits passed to handoffs can be used multiple times (if agent is handed off to multiple times).
Bugfix: Replace invalid surrogate characters when serializing strings to JSON.
Bugfix: Prevent error writing NaN values to the logs.json summary file during bundling.
v0.3.103 (06 June 2025)
Eval set: Do not read full eval logs into memory at task completion.
v0.3.102 (05 June 2025)
OpenAI: Use responses API for codex models.
Bugfix: Temporarily revert change to eval set header reading to investigate regression.
v0.3.101 (05 June 2025)
Eval set: Default max_tasks to the greater of 4 and the number of models being evaluated.
Eval set: Do not read full eval logs into memory at task completion.
pass_at_k: Treat threshold as the minimum inclusive value for passing (rather than checking equality).
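For illustration, a self-contained sketch combining the inclusive threshold with the standard unbiased pass@k estimator (the helper name and signature are hypothetical, not Inspect's API):

```python
from math import comb

# Sketch: a sample "passes" when its score is >= threshold (inclusive, not
# equality), then the standard pass@k estimator 1 - C(n-c, k)/C(n, k) is
# applied over n attempts with c passes.
def pass_at_k(scores: list[float], k: int, threshold: float = 1.0) -> float:
    n = len(scores)
    c = sum(1 for s in scores if s >= threshold)  # inclusive comparison
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```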
Web search: Include links specified by providers in the results.
Inspect View: Display sample id & epoch in sample dialog title bar.
Inspect View: Don’t open sample dialog when simply navigating the sample list.
Inspect View: Fix error that could occur when determining transcript outline collapse state.
Inspect View: Show the correct sample when opening a sample from a sorted list.
Bugfix: Ensure that dataset shuffle_choices=True always uses a distinct random seed.
Bugfix: Don’t attempt to use OpenAI’s web search preview against models that are known to not support it.
Ability to query current usage for scoped limits (e.g. time or tokens).
Added native OpenAI web search to web_search() tool.
Limit docker compose concurrency to 2 * os.cpu_count() by default (override with INSPECT_DOCKER_CLI_CONCURRENCY).
ReAct agent: Only send custom on_continue message to the model if the model made no tool calls.
Tool calling: Support for Enum types in tool arguments.
AzureAI: Automatically fold user and tool messages for Mistral models.
Task display: Simplify task display for plain mode (no outline, don’t expand tables to console width).
Task display: Truncate task config to prevent overflow (collapse dicts, limit individual values to 50 chars, limit overall output to 500 chars).
Task display: Always show the sample init event in the task transcript display.
Task display: Fix mouse support on ghostty (and possibly other terminals).
Inspect View: Outline view for transcript which enables high level navigation to solvers, agents, scorers, etc.
Inspect View: Fix an issue that prevented the display of the viewer in VSCode when the viewer tab was moved to the background.
Inspect View: Don’t error when metadata contains null values.
v0.3.99 (22 May 2025)
Exported view() function for running Inspect View from Python.
Always return tasks in the same order they were passed to eval() or eval_set().
Google: Updated required version of google-genai to 1.16.1 (which includes support for reasoning summaries and is now compatible with the trio async backend).
Anthropic: More flexible detection of “overloaded_error” for retries.
Inspect View: Improve text zooming and wrapping when rendering sample errors.
Inspect View: Preserve log mtime-ordering in the bundle output directory.
v0.3.98 (18 May 2025)
Google: Disable reasoning when reasoning_tokens is set to 0.
Temporarily pin to textual < 3.0.0 to work around event loop breakage.
CLI display: Improve performance of sample rendering by only rendering the 10 most recent events.
Dataframes: Use native pyarrow column storage with pd.NA for missing values.
Inspect View: Improve the performance and memory efficiency of the viewer when viewing large samples with long, complex transcripts.
Inspect View: Improve the performance of the viewer when viewing large, complex sample or task metadata.
Inspect View: Live display of subtask, tool and other child events when viewing a running evaluation.
Inspect View: Transcript rendering improvements including less complex overall layout, more collapsible entities, and improved rendering of sandbox events, tool calls, and other events.
Inspect View: Message rendering improvement including coloring user messages, reducing layout complexity, and other minor improvements.
Inspect View: Render metadata for samples and tasks as an interactive tree.
Inspect View: When deployed via inspect view bundle, support linking to individual transcript events or messages.
Inspect View: Reduce the maximum size of the header (before it is collapsed) when evals have large numbers of metrics.
Bugfix: More robust handling of non-529 “overloaded_error” for Anthropic.
Bugfix: More robust handling of no result returned from tool call.
Updated vLLM provider to use local server rather than in process vllm package (improved concurrency and resource utilization).
New SGLang provider (using similar local server architecture as vLLM provider).
Anthropic: Added streaming model argument to control whether streaming API is used (by default, streams when using extended thinking).
--sample-id option can now include task prefixes (e.g. --sample-id=popularity:10,security:5).
Improved write performance for realtime event logging.
--no-log-realtime option for disabling realtime event logging (live viewing of logs is disabled when this is specified).
Packaging: Exclude _resources directories from package (reduces pressure on path lengths for Windows).
Inspect View: Split info tab into task, models, and info for improved layout.
Bugfix: Avoid validation errors when loading old log files which contain “output_limit” tool errors.
v0.3.92 (26 April 2025)
OpenAI: In responses API, don’t pass back assistant output that wasn’t part of the output included in the server response (e.g. output generated from a call to a submit() tool).
Bugfix: Correctly pass tool arguments back to model for OpenAI responses API.
OpenAI: responses_store model argument to control whether the store option is enabled (it is enabled by default for reasoning models to support reasoning playback).
OpenAI: Support for flex processing, which provides lower inference costs in exchange for slower response times and occasional resource unavailability (added in v1.75.0, which is now required).
OpenAI: Responses API is now used by default for all reasoning models.
OpenAI: Automatically alias reserved internal tool names (e.g. python) for responses API.
Anthropic: Warn only once if unable to call count_tokens() for a model.
Google: Update to 1.12.1 of google-genai (which is now required).
Google: Support for reasoning_tokens option for Gemini 2.5 models.
Grok: Support for reasoning_effort option and capturing reasoning content.
OpenRouter: Forward reasoning_effort and reasoning_tokens to reasoning field.
Model API: ToolSource for dynamic tool inputs (can be used in calls to model.generate() and execute_tools()).
ReAct Agent: Ability to fully replace the default submit() tool.
Human Agent: Added user parameter for running the human agent cli as a given user.
Anthropic: Don’t include side count of reasoning_tokens in total_tokens (they are already included).
Anthropic: Update string matching to correctly handle BadRequestErrors related to prompts being too long.
v0.3.87 (10 April 2025)
Eval: Fix an error when attempting to display realtime metrics for an evaluation.
Log Viewer: Fix an error when displaying a running log with a null metric value.
v0.3.86 (09 April 2025)
OpenAI: Treat UnprocessableEntityError as bad request so we can include the request payload in the error message.
Eval Retry: Correctly restore model-specific generation config on retry.
Inspect View: Resolve sample attachments before including in realtime event stream.
Bugfix: Properly handle special characters in IDs during event database cleanup.
v0.3.85 (08 April 2025)
Remove support for goodfire model provider (dependency conflicts).
React Agent: Enable specification of description without name.
v0.3.84 (07 April 2025)
Bugfix: Suppress link click behavior in vscode links.
v0.3.83 (07 April 2025)
Inspect View: Live updates to running evaluation logs.
Agent protocol and inspect_ai.agent module with new system for creating, composing, and executing agents.
Scoring: New grouped() metric wrapper function, which applies a given metric to subgroups of samples defined by a key in sample metadata.
Basic Agent: New submit_append option to append the submit tool output to the completion rather than replacing the completion (note that the new react() agent appends by default).
Model API: New execute_tools() function (replaces deprecated call_tools() function) which handles agent handoffs that occur during tool calling.
Model API: generate_loop() method for calling generate with a tool use loop.
Model API: Provide optional sync context manager for Model (works only with providers that don’t require an async close).
Anthropic: Add support for tool_choice="none" (added in v0.49.0, which is now required).
Together AI: Updated logprobs to pass 1 rather than True (protocol change).
Tools: bash_session() and web_browser() now create a distinct sandbox process each time they are instantiated.
Computer Tool: Support for use of the native OpenAI computer tool (available in the model openai/computer-use-preview).
Task API: task_with() and tool_with() no longer copy the input task or tool (rather, they modify it in place and return it).
Eval Set: Resolve tasks before each pass (ensure that each pass runs against an entirely new task instance).
Eval Retry: Ability to retry any task in the registry, even if it has a custom name (save registry_name separately).
Human Agent: Start task with clock paused and then automatically start it on container logins.
Typed Store: instance option for store_as() for using multiple instances of a StoreModel within a sample.
Typed Store: Raise error if attempting to embed a StoreModel within another StoreModel.
Sandbox: New sandbox_default() context manager for temporarily changing the default sandbox.
Docker: write_file() function now gracefully handles larger input file sizes (was failing on files > 2MB).
Docker: Prevent low timeout values (e.g. 1 second) from disabling timeout entirely when they are retried.
Display: Print warnings after task summaries for improved visibility.
Inspect View: Fallback to content range request if initial HEAD request fails.
Inspect View: Improve error message when view bundles are served from incompatible servers.
Inspect View: Render messages in user and assistant solver events.
Inspect View: Improved support for display of nested arrays.
Inspect View: Improved rendering of complex scores and metrics.
Inspect View: Properly handle filtering of dictionary scores.
Inspect View: Render math in model input and output using KaTeX.
Inspect View: Improve sample score rendering (single scoring tab with scores rendered in a table).
Inspect View: Improve sample count display in sample list footer.
Inspect View: Properly refresh running evals when restoring from being backgrounded.
Bugfix: Support for calling the score() function within Jupyter notebooks.
Bugfix: Handle process lookup errors that can occur during timeout race conditions.
Bugfix: Correctly capture and return logs from eval() when a cancellation occurs.
Bugfix: Correctly handle custom api_version model argument for OpenAI on Azure.
Bugfix: Correct handling for None passed to tool call by model for optional parameters.
Bugfix: Cleanup automatically created .compose.yml when not in working directory.
Bugfix: Prevent exception when navigating to sample that no longer exists in running samples display.
v0.3.82 (02 April 2025)
Bugfix: Correct handling of backward compatibility for inspect-web-browser-tool image.
Bugfix: Eval now properly exits when max_tasks is greater than total tasks.
v0.3.81 (30 March 2025)
Requirements: Temporarily upper-bound rich to < 14.0.0 to workaround issue.
v0.3.80 (30 March 2025)
Google: Compatibility with httpx client in google-genai >= 1.8.0 (which is now required).
Mistral: Compatibility with tool call schema for mistralai >= v1.6.0 (which is now required).
Inspect View: Correctly parse NaN values (use JSON5 for all JSON parsing).
v0.3.79 (26 March 2025)
Google: Compatibility with v1.7 of google-genai package (create client per-generate request).
Bugfix: Properly record scorer and metrics when there are multiple tasks run in an eval.
v0.3.78 (25 March 2025)
OpenAI: Ensure that assistant messages always have the msg_ prefix in responses API.
v0.3.77 (25 March 2025)
New think() tool that provides models with the ability to include an additional thinking step.
OpenAI: Remove base64-encoded audio content from API call JSON in ModelEvent.
AzureAI: Support for use of native OpenAI and Mistral clients using service qualifiers (e.g. openai/azure/gpt-4o-mini or mistral/azure/Mistral-Large-2411).
OpenRouter: Handle “error” field in response object and retry for empty responses.
Added --metadata option to eval for associating metadata with eval runs.
Task display: Show reasoning tokens for models that report them.
Anthropic: Include reasoning tokens in computation of total tokens.
Inspect View: Properly wrap tool input for non-code inputs like think.
v0.3.76 (23 March 2025)
bash_session() tool for creating a stateful bash shell that retains its state across calls from the model.
text_editor() tool which enables viewing, creating and editing text files.
Structured Output: Properly handle Pydantic BaseModel that contains other BaseModel definitions in its schema.
OpenAI: Support for .wav files in audio inputs for gpt-4o-audio-preview.
OpenAI: Strip ‘azure’ prefix from model_name so that model type checks all work correctly.
OpenAI: Don’t send reasoning_effort parameter to o1-preview (as it is not supported).
Inspect View: Fix error sorting numeric or categorical score results.
Inspect View: Properly wrap model API call text in the transcript.
Bugfix: Only initialise display in eval_set if it wasn’t initialised from the CLI.
Bugfix: Set the global log level based on the specified Inspect log level.
Bugfix: Resolve issue when deserialising a SubtaskEvent from a log file which does not have a completed time.
Bugfix: Fix unnecessary warnings about task arguments.
Bugfix: When a task does not take a kwargs argument, only warn if the provided argument is not valid.
v0.3.75 (18 March 2025)
Model API: Specifying a default model (e.g. --model) is no longer required (as some evals have no model or use get_model() for model access).
Tasks can now directly specify a model, and model is no longer a required axis for parallel tasks.
Eval Set: Improved parallelisation in scheduler (all pending tasks are now run together rather than in model groups).
Don’t generate id for ChatMessage when deserialising (id is now str | None and is only populated when messages are directly created).
Log: Support for zip64 extensions required to read some log files that are larger than 4GB.
Anthropic: Provide reasoning_tokens for standard thinking blocks (redacted thinking not counted).
Google: Improve checking of APIError status codes for retry.
CLI: Added --env option for defining environment variables for the duration of the inspect process.
Inspect View: Fix issue generating diffs for nested arrays.
Inspect View: Fix layout issue with sample error display in sample detail summary.
Inspect View: Better support large eval files (in excess of 4GB).
Inspect View: Correctly display ‘None’ when passed in tool calls.
Inspect View: Fix ‘Access Denied’ error when using inspect view and viewing the log in a browser.
Bugfix: Properly handle nested Pydantic models when reading typed store (store_as()) from log.
Bugfix: Enable passing solver list to eval() (decorate chain function with @solver).
Bugfix: Support deserializing custom sandbox configuration objects when said sandbox plugin is not installed.
Bugfix: Fix error in sample filtering autocomplete (could cause autocomplete to fail and show an error in js console).
v0.3.74 (15 March 2025)
Bugfix: Exclude chat message id from cache key (fixes regression in model output caching).
v0.3.73 (14 March 2025)
Constrain model output to a particular JSON schema using Structured Output (supported for OpenAI, Google, and Mistral).
New “HTTP Retries” display (replacing the “HTTP Rate Limits” display) which counts all retries and does so much more consistently and accurately across providers.
The ModelAPI class now has a should_retry() method that replaces the deprecated is_rate_limit() method.
The “Generate…” progress message in the Running Samples view now shows the number of retries for the active call to generate().
New inspect trace http command which will show all HTTP requests for a run.
More consistent use of max_retries and timeout configuration options. These options now exclusively control Inspect’s outer retry handler; model providers use their default behaviour for the inner request, which is typically 2-4 retries and a service-appropriate timeout.
Improved async implementation using AnyIO (can now optionally run Trio rather than asyncio as the async backend).
Agent Bridge: Correct handling for tool_choice option.
Model API: ChatMessage now includes an id field (defaults to auto-generated uuid).
OpenAI: More flexible parsing of content parts (some providers omit the “type” field); support for “reasoning” content parts.
Anthropic: Retry api connection errors and remote protocol errors that occur during streaming.
Mistral: Update to new Mistral API (v1.5.1 of mistralai is now required).
Logging: Inspect no longer sets the global log level nor does it allow its own messages to propagate to the global handler (eliminating the possibility of duplicate display). This should improve compatibility with applications that have their own custom logging configured.
Tasks: For filesystem based tasks, no longer switch to the task file’s directory during execution (directory switching still occurs during task loading). Specify @task(chdir=True) to preserve the previous behavior.
Bugfix: Fix issue with deserializing custom sandbox configuration objects.
Bugfix: Handle parallel_tool_calls correctly for OpenAI models served through Azure.
v0.3.72 (03 March 2025)
Computer: Updated tool definition to match improvements in Claude Sonnet 3.7.
v0.3.71 (01 March 2025)
Anthropic: Support for extended thinking features of Claude Sonnet 3.7 (minimum version of anthropic package bumped to 0.47.1).
Reasoning: ContentReasoning type for representing model reasoning blocks.
Reasoning: reasoning_tokens for setting maximum reasoning tokens (currently only supported by Claude Sonnet 3.7).
Reasoning: reasoning_history can now be specified as “none”, “all”, “last”, or “auto” (which yields a provider specific recommended default).
Web Browser: Various improvements to performance and robustness along with several bug fixes.
OpenAI: Provide long-connection (reasoning friendly) socket defaults in the HTTP client.
OpenAI: Capture reasoning_tokens when reported.
OpenAI: Retry on rate limit requests with “Request too large”.
OpenAI: Tolerate None for assistant content (can happen when there is a refusal).
Google: Retry requests on more HTTP status codes (selected 400 errors and all 500 errors).
Event Log: Add working_start attribute to events and completed and working_time to model, tool, and subtask events.
Human Agent: Add task quit command for giving up on tasks.
Human Agent: Don’t emit sandbox events for the human agent.
Inspect View: Improve rendering of JSON within logging events.
Inspect View: Improve virtualized rendering of Sample List, Sample Transcript, and Sample Messages.
Task Display: Let plugins display counters (‘rich’ and ‘full’ display modes only).
Inspect View: Fix layout issues with human agent terminal session playback.
Inspect View: Improve tool input / output appearance when rendered in VSCode.
Inspect View: Display reasoning tokens in model usage for the samples and for the complete eval.
Inspect View: Improve model api request / response output when rendered in VSCode.
Inspect View: Improve rendering of some tool calls in the transcript.
Bugfix: Fix audio and video inputs for new Google GenAI client.
Bugfix: Ensure that token limits are not enforced during model graded scoring.
Bugfix: Catch standard TimeoutError for running shell commands in the computer tool container.
Bugfix: Correct combination of consecutive string based user messages for Anthropic provider.
v0.3.70 (25 February 2025)
working_limit option for specifying a maximum working time (e.g. model generation, tool calls, etc.) for samples.
Added SandboxEvent to transcript for recording sandbox execution and I/O.
Sandboxes: as_type() function for checked downcasting of SandboxEnvironment.
Remove root logging handlers upon Inspect logger initialisation (as they result in lots of log spam if left installed).
Only explicitly set state.completed=True when entering scoring (basic_agent() no longer sets completed so can be used in longer compositions of solvers).
Add uuid property to TaskState and EvalSample (globally unique identifier for sample run).
Add cleanup to tasks for executing a function at the end of each sample run.
Agent bridge() is now compatible with the use of a custom OPENAI_BASE_URL.
Mistral: Bump required version of mistralai package to 1.5 (required for working_limit).
Truncate tracebacks included in evaluation log to a maximum of 1MB.
Compatibility with textual version 2.0 (remove upper bound).
Align with HF datasets fsspec version constraints to avoid pip errors when installing alongside datasets.
Bugfix: Fix issue with tools that had an ordinary dict as a parameter.
Bugfix: Print the correct container sample_id for --no-sandbox-cleanup.
v0.3.69 (20 February 2025)
Google provider updated to use the Google Gen AI SDK, which is now the recommended API for Gemini 2.0 models.
Task display: Use cooperative cancellation for cancel buttons in task display.
Task display: Print task progress every 5 seconds for ‘plain’ display mode.
Task display: Handle click on running samples tab when there is no transcript.
Docker: Print stderr from compose up when no services startup successfully.
Docker: Print sample id and epoch for each container when using --no-sandbox-cleanup.
Mistral: Create and destroy client within generate.
Inspect View: Fix display of score dictionaries containing boolean values.
Bugfix: Catch standard TimeoutError for subprocess timeouts (ensure kill/cleanup of timed out process).
v0.3.68 (19 February 2025)
Task display: Improve spacing/layout of final task display.
Textual: Specify broader range of compatible versions (v0.86.2 to v1.0.0).
v0.3.67 (18 February 2025)
Memoize calls to get_model() so that model instances with the same parameters are cached and re-used (pass memoize=False to disable).
Async context manager for Model class for optional scoped usage of model clients.
display_type() function for detecting the current display type (e.g. “full”, “rich”, etc.).
Trace: improved handling of eval() running in multiple processes at once (trace file per-process).
Docker: don’t apply timeouts to docker build and docker pull commands.
Bugfix: Fix issue w/ store.get() not auto-inserting default value.
v0.3.55 (29 December 2024)
Bedrock: redact authentication model args from eval logs.
OpenAI: warn when temperature is used with o1 models (as it is not supported).
Bugfix: spread args for cache trace logging.
v0.3.54 (26 December 2024)
Tracing for diagnosing runs with unterminated action (e.g. model calls, docker commands, etc.).
Provide default timeout/retry for docker compose commands to mitigate unreliability in some configurations.
Switch to sync S3 writes to overcome unreliability observed when using async interface.
Task display: Added --no-score-display option to disable realtime scoring metrics.
Bugfix: Fix failure to fully clone samples that have message lists as input.
llama-cpp-python: Support for logprobs.
v0.3.53 (20 December 2024)
OpenAI: Support for o1 including native tool calling and reasoning_effort generation option.
Task API: Introduce setup step that always runs even if solver is replaced.
Bedrock: Support for tool calling on Nova models.
Bedrock: Support for custom model_args passed through to session.Client.
Bedrock: Support for jpeg images.
Bedrock: Correct max_tokens for llama3-8b, llama3-70b models on Bedrock.
Inspect View: Various improvements to appearance of tool calls in transcript.
Task display: Ensure that widths of progress elements are kept consistent across tasks.
Sandboxes: New max_sandboxes option for (per-provider) maximum number of running sandboxes.
Sandboxes: Remove use of aiofiles to mitigate potential for threading deadlocks.
Concurrency: Do not use max_tasks as a lower bound for max_samples.
Log recorder: Always re-open log buffer for eval format logs.
Bugfix: Proper handling of text find for eval raw JSON display.
Bugfix: Correct handling for --sample-id integer comparisons.
Bugfix: Proper removal of model_args with falsey values (explicit check for None).
Bugfix: Properly handle custom metrics that return dictionaries or lists.
Bugfix: Proper sample count display when retrying an evaluation.
Bugfix: Fix inability to define and run tasks in a notebook.
v0.3.52 (13 December 2024)
Eval: --sample-id option for evaluating specific sample id(s).
Bedrock: Detect and report HTTP rate limit errors.
Azure AI: Add emulate_tools model arg to force tool emulation (emulation is enabled by default for Llama models).
Basic Agent: Add max_tool_output parameter to override default max tool output from generate config.
Inspect View: Correct display of sample ID for single sample tasks.
Trace: Show custom tool views in --trace mode.
Bugfix: Support for dynamic metric names in realtime scoring display.
v0.3.51 (13 December 2024)
Bugfix: Task display fails to load when no scorers are defined for a task.
v0.3.50 (12 December 2024)
Tools: Improved typing/schema support (unions, optional params, enums).
Tools: Added append argument to use_tools() for adding (rather than replacing) the currently available tools.
Docker sandbox: Streamed reads of stderr/stdout (enabling us to enforce output limits for read_file and exec at the source).
Sandbox API: Enable passing BaseModel types for sandbox config (formerly only a file path could be passed).
Task display: Show all task scores in realtime (expand task progress to see scores).
Task display: Show completed samples and align progress more closely to completed samples (as opposed to steps).
Task display: Show sample messages/tokens used (plus limits if specified).
Task display: Resolve issue where task display would lose mouse input after VS Code reload.
Datasets: Validate that all IDs in datasets are unique (as several downstream problems occur w/ duplicate IDs).
Inspect View: Fix issue with incorrectly displayed custom tool views.
Human approval: Use fullscreen display (makes approval UI async and enables rapid processing of approvals via the Enter key).
Added input_panel() API for adding custom panels to the fullscreen task display.
Log recorder: Methods are now async which will improve performance for fsspec filesystems with async implementations (e.g. S3).
Log recorder: Improve .eval log reading performance for remote filesystem (eagerly fetch log to local buffer).
Add token_usage property to TaskState which has current total tokens used across all calls to generate() (same value that is used for enforcing token limits).
Add time field to ModelOutput that records total time spent within call to ModelAPI generate().
Web browser: Remove base64 images from web page contents (prevent filling up model context with large images).
Match scorer: If the target of a match isn’t numeric, ignore the numeric flag and instead use text matching (improved handling for percentages).
Hugging Face: Support for native HF tool calling for Llama, Mistral, Qwen, and others if they conform to various standard schemas.
Hugging Face: tokenizer_call_args dict to specify custom args during tokenization, such as max_length and truncation.
Azure AI: Fix schema validation error that occurred when model API returns None for content.
Display: Throttle updating of sample list based on number of samples.
Display: Add explicit ‘ctrl+c’ keybinding (as textual now disables this by default).
Bugfix: Correct rate limit error display when running in fullscreen mode.
Bugfix: hf_dataset now explicitly requires the split argument (previously, it would crash when not specified).
Bugfix: Prevent cascading textual error when an error occurs during task initialisation.
Bugfix: Correctly restore sample summaries from log file after amend.
Bugfix: Report errors that occur during task finalisation.
v0.3.49 (03 December 2024)
Logging: Only call CreateBucket on Amazon S3 when the bucket does not already exist.
Improve cancellation feedback and prevent multiple cancellations when using fullscreen display.
Inspect View: Resolve display issue with sorting by sample then epoch.
v0.3.48 (01 December 2024)
Realtime display of sample transcripts (including ability to cancel running samples).
Scoring: When using a dictionary to map metrics to score value dictionaries, you may now use globs as keys. See our scorer documentation for more information.
EvalLog now includes a location property indicating where it was read from.
Use tool views when rendering tool calls in Inspect View.
Consistent behavior for max_samples across sandbox and non-sandbox evals (both now apply max_samples per task, formerly evals with sandboxes applied max_samples globally).
Log files now properly deal with scores that produce NaN (fixes #834).
Bash tool: add --login option so that e.g. .bashrc is read before executing the command.
Google: Support for tools/functions that have no parameters.
Google/Vertex: Support for logprobs and other new 1.5 (002 series) options.
AzureAI: Change default max_tokens for Llama models to 2048 (4096 currently yields an error w/ Llama 3.1).
Mistral: Various compatibility changes for their client and tool calling implementation.
Handle exponents in numeric normalisation for match, include, and answer scorers.
hf_dataset: Added cached argument to control whether to use a previously cached version of the dataset if available (defaults to True).
hf_dataset: Added revision option to load a specific branch or commit SHA (when using revision datasets are always revalidated on Hugging Face, i.e. cached is ignored).
Log viewer: Display sample ids rather than indexes.
Log viewer: Add timestamps to transcript events.
Log viewer: Metadata which contains images will now render the images.
Log viewer: Show custom tool call views in messages display.
Bugfix: Correctly read and forward image detail property.
Bugfix: Correct resolution of global eval override of task or sample sandboxes.
Bugfix: Don’t do eval log listing on background threads (s3fs can deadlock when run from multiple threads).
v0.3.47 (18 November 2024)
Basic agent: Ensure that the scorer is only run once when max_attempts = 1.
Basic agent: Support custom function for incorrect_message reply to model.
Tool calling: Execute multiple tool calls serially (some models assume that multiple calls are executed this way rather than in parallel).
Google: Combine consecutive tool messages into single content part; ensure no empty text content parts.
AzureAI: Create and close client with each call to generate (fixes issue w/ using azureai on multiple passes of eval).
Bedrock: Migrate to the Converse API, which supports many more features including tool calling and multimodal models.
Sample limit events will now appear in the transcript if a limit (e.g. message, token, or time limit) halts a sample. The sample list and sample detail also display the limit, if applicable.
v0.3.46 (12 November 2024)
eval is now the default log format (use --log-format=json to use old format).
Base 64 images are now logged by default for all log formats (disable with --no-log-images).
The log viewer now properly displays sample errors in the sample list for eval format log files.
Improve path handling when using inspect log convert to convert a single log file.
Web browser tool: Subtasks now each have independent web browser sessions.
Anthropic: Ensure that assistant messages created in generate never have empty content lists.
Increase sandbox exec() output limit from 1 MiB to 10 MiB.
v0.3.45 (11 November 2024)
time_limit option for specifying a maximum execution time for samples.
Always treat .eval files as logs (don’t apply file name pattern restrictions as we do with .json).
Log model calls when model providers return bad request errors.
Better lay out large numbers of configuration and parameters when displaying log files.
The log viewer now properly displays sample scores for running tasks.
Add metadata field to ModelOutput and provide various fields for the Groq provider.
v0.3.44 (04 November 2024)
Revert change to single epoch reducer behavior (regressed some scoring scenarios).
v0.3.43 (04 November 2024)
New binary log format which yields substantial size and speed improvements (JSON format log files are still fully supported and utilities for converting between the formats are provided).
Limit SandboxEnvironment.exec() output streams to 1 MiB. Limit SandboxEnvironment.read_file() to 100 MiB.
Add INSPECT_DISABLE_MODEL_API environment variable for disabling all Model APIs save for mockllm.
Add optional tool_call_id param to ModelOutput.for_tool_call().
Support all JSON and CSV dataset arguments in file_dataset() function.
v0.3.42 (23 October 2024)
ToolDef class for dynamically creating tool definitions.
Added --tags option to eval for tagging evaluation runs.
Added APIs for accessing sample event transcripts and for creating and resolving attachments for larger content items.
Cleanup Docker Containers immediately for samples with errors.
Support Dockerfile as config path for Docker sandboxes (previously only supported compose files).
Anthropic: remove stock tool use chain of thought prompt (many Anthropic models now do this internally, in other cases it’s better for this to be explicit rather than implicit).
Anthropic: ensure that we never send empty text content to the API.
Google: compatibility with google-generativeai v0.8.3.
Llama: remove extraneous <|start_header_id|>assistant<|end_header_id|> if it appears in an assistant message.
OpenAI: Remove tool call id in user message reporting tool calls to o1-series models.
Use Dockerhub aisiuk/inspect-web-browser-tool image for web browser tool.
Use ParamSpec to capture types of decorated solvers, tools, scorers, and metrics.
Support INSPECT_EVAL_MODEL_ARGS environment variable for calls to eval().
Requirements: add lower bounds to various dependencies based on usage, compatibility, and stability.
Added include_history option to model graded scorers to optionally include the full chat history in the presented question.
Added delimiter option to csv_dataset() (defaults to “,”).
Improve answer detection in multiple choice scorer.
Open log files in binary mode when reading headers (fixes ijson deprecation warning).
Capture list and dict of registry objects when logging plan.
Add model_usage field to EvalSample to record token usage by model for each sample.
Correct directory handling for tasks that are imported as local (non-package) modules.
Basic agent: terminate agent loop when the context window is exceeded.
Call tools sequentially when they have opted out of parallel calling.
Inspect view bundle: support for bundling directories with nested subdirectories.
Bugfix: strip protocol prefix when resolving eval event content.
Bugfix: switch to run directory when running multiple tasks with the same run directory.
Bugfix: ensure that log directories don’t end in forward/back slash.
v0.3.41 (11 October 2024)
Approval mode for extensible approvals of tool calls (human and auto-approvers built in, arbitrary other approval schemes via extensions).
Trace mode for printing model interactions to the terminal.
Sample limits (token_limit and message_limit) for capping the number of tokens or messages used per sample ( message_limit replaces deprecated max_messages).
Add metadata field to Task and record in log EvalSpec.
Include datetime and level in file logger.
Correct llama3 and o1 tool calling when empty arguments passed.
Allow resolution of any sandbox name when there is only a single environment.
Introduce --log-level-transcript option for separate control of log entries recorded in the eval log file.
Improve mime type detection for image content encoding (fixes issues w/ webp images).
Fix memory leak in Inspect View worker-based JSON parsing.
Add fail_on_error option for eval_retry() and inspect eval-retry.
Eval Sets for running groups of tasks with automatic retries.
Per-sample Sandbox environments can now be specified (e.g. allowing for a distinct Dockerfile or Docker compose file for each sample).
input_screen() context manager to temporarily clear task display for user input.
Introduce two new scorers, f1() (precision and recall in text matching) and exact() (whether normalized text matches exactly).
Task metrics now override built in scorer metrics (previously they were merged). This enables improved re-use of existing scorers where the only change required is a different set of metrics.
Reduce scores in multi-epoch tasks before computing metrics (defaults to averaging sample values).
Replace the use of the bootstrap_std metric with stderr for built in scorers (see rationale for details).
Option to write Python logger entries to an external file.
Rename ToolEnvironment to SandboxEnvironment and tool_environment() to sandbox() (moving the renamed types from inspect_ai.tool to inspect_ai.util). Existing symbols will continue to work but will print deprecation errors.
Moved the bash(), python(), and web_search() functions from inspect_ai.solver to inspect_ai.tool. Existing symbols will continue to work but will print deprecation errors.
Enable parallel execution of tasks that share a working directory.
Add chdir option to @task to opt-out of changing the working directory during task execution.
Enable overriding of default safety settings for Google models.
Use Python type annotations as the first source of type info for tool functions (fallback to docstrings only if necessary).
Support for richer types (list, TypedDict, dataclass, Pydantic, etc.) in tool calling.
Change ToolInfo parameters to be directly expressed in JSON Schema (making it much easier to pass them to model provider libraries).
Validate tool call inputs using JSON Schema and report errors to the model.
Gracefully handle tool calls that include only a single value (rather than a named dict of parameters).
Support tool_choice="any" for OpenAI models (requires >= 1.24.0 of openai package).
Make multiple tool calls in parallel. Parallel tool calls occur by default for OpenAI, Anthropic, Mistral, and Groq. You can disable this behavior for OpenAI and Groq with --parallel-tool-calls false.
Invoke rate limit retry for OpenAI APITimeoutError (which they have recently begun returning a lot more of as a result of httpx.ConnectTimeout, whose default is only 5 seconds).
Add cwd argument to SandboxEnvironment.exec().
Use tee rather than docker cp for Docker sandbox environment implementation of write_file().
Handle duplicate tool call ids in Inspect View.
Handle sorting sample ids of different types in Inspect View.
Correctly resolve default model based on CLI --model argument.
Fix issue with propagating API keys to Azure OpenAI provider.
Add azure model arg for OpenAI provider to force binding (or not binding) to the Azure OpenAI back-end.
Support for Llama 3 models with the Azure AI provider.
Add setup field to Sample for providing a per-sample setup script.
Score multiple choice questions without parsed answers as incorrect (rather than being an error). Llama 3 and 3.1 models especially often fail to yield an answer.
Read JSON encoded metadata field from samples.
Show task/display progress immediately (rather than waiting for connections to fill).
Reduce foreground task contention for Inspect View history loading.
Ability to host standalone version of Inspect View to view single log files.
Throw TimeoutError if a call to subprocess() or sandbox().exec() times out (formerly a textual error was returned along with a non-zero exit code).
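With this change, callers handle timeouts as exceptions rather than inspecting exit codes. A self-contained illustration of the pattern using asyncio (not the library's own subprocess code):

```python
import asyncio

async def run_with_timeout():
    """Illustrative only: a timed-out operation now raises TimeoutError
    instead of returning a textual error with a non-zero exit code."""
    async def slow():
        await asyncio.sleep(10)
    try:
        await asyncio.wait_for(slow(), timeout=0.05)
        return "completed"
    except (TimeoutError, asyncio.TimeoutError):
        # asyncio.TimeoutError is an alias of builtin TimeoutError on 3.11+
        return "timed out"

result = asyncio.run(run_with_timeout())
```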
Validate name passed to example_dataset() (and print available example dataset names).
Resolve relative image paths within Dataset samples against the directory containing the dataset.
Preserve tool_error text for Anthropic tool call responses.
Fix issue with rate limit reporting being per task not per eval.
Set maximum rate limit backoff time to 30 minutes.
Retry with exponential backoff for web_search Google provider.
Multiple Models can now be evaluated in parallel by passing a list of models to eval().
Add api_key to get_model() for explicitly specifying an API key for a model.
Improved handling of very large (> 100MB) log files in Inspect View.
Use network_mode: none for disabling networking by default in Docker tool environments.
Shorten the default shutdown grace period for Docker container cleanup to 1 second.
Allow sandbox environment providers to specify a default max_samples (set to 25 for the Docker provider).
Prevent concurrent calls to eval_async() (unsafe because of need to change directories for tasks). Parallel task evaluation will instead be implemented as a top-level feature of eval() and eval_async().
Match scorers now return answers consistently even when there is no match.
Relocate tool-related types into a new top-level inspect_ai.tool module (previous imports still work for now, but result in a runtime deprecation warning).
Decouple tools entirely from solvers and task state (previously they had ways to interact with metadata, removing this coupling will enable tool use in lower level interactions with models). Accordingly, the call_tools() function now operates directly on messages rather than task state.
Support token usage for Google models (Inspect now requires google-generativeai v0.5.3).
v0.3.17 (25 June 2024)
Optional increased control over the tool use loop via the call_tools() function and new tool_calls parameter for generate().
New per_epoch option for CachePolicy to allow caching to ignore epochs.
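One way to picture the per_epoch option is as a switch for whether the epoch participates in the cache key; this `cache_key` helper is a hypothetical sketch, not the library's cache implementation:

```python
import hashlib

def cache_key(prompt: str, model: str, epoch: int, per_epoch: bool = True) -> str:
    """Build a cache key. When per_epoch is False the epoch is ignored,
    so identical requests across epochs share one cached response."""
    parts = [model, prompt]
    if per_epoch:
        parts.append(str(epoch))
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

k1 = cache_key("What is 2+2?", "mock", epoch=0, per_epoch=False)
k2 = cache_key("What is 2+2?", "mock", epoch=1, per_epoch=False)  # same key
k3 = cache_key("What is 2+2?", "mock", epoch=1, per_epoch=True)   # distinct
```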
Correctly handle choices and files when converting Sample images to base64.
v0.3.16 (24 June 2024)
Various fixes for the use of Docker tool environments on Windows.
Ability to disable cleanup of tool environments via --no-toolenv-cleanup.
New inspect toolenv cleanup command for manually cleaning up tool environments.
ToolError exception type for explicitly raising tool errors to the model. Formerly, any exception would be surfaced as a tool error to the model. Now, the ToolError exception is required for reporting to the model (otherwise other exception types go through the call stack and result in an eval error).
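The distinction can be sketched with a minimal stand-in for ToolError and a tool-call wrapper (illustrative names, not the inspect_ai source):

```python
import os

class ToolError(Exception):
    """Raised by a tool to report an error message back to the model."""

def file_size(path: str) -> int:
    try:
        return os.path.getsize(path)
    except FileNotFoundError:
        # Explicitly surface this to the model rather than crashing the eval
        raise ToolError(f"file not found: {path}")

def call_tool(func, *args):
    """Only ToolError is reported to the model; any other exception
    propagates up the call stack and becomes an eval error."""
    try:
        return {"result": func(*args)}
    except ToolError as ex:
        return {"error": str(ex)}

outcome = call_tool(file_size, "/definitely/missing/file")
```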
Resolve INSPECT_LOG_DIR in .env file relative to .env file parent directory.
Use - for delimiting --limit ranges rather than ,.
Use HF model device for generate (compatibility with multi-GPU).
Caching to reduce the number of model API calls made.
The multiple_choice() solver now has support for questions with multiple correct answers.
More fine grained handling of Claude BadRequestError (400) errors (which were formerly all treated as content moderation errors).
Filter out empty TextBlockParam when playing messages back to Claude.
Automatically combine Claude user messages that include tool content.
Revert to “auto” rather than “none” after forced tool call.
Provide TaskState.tools getter/setter (where the setter automatically syncs the system messages to the specified set of tools).
The use_tools() function now uses the TaskState.tools setter, so replaces the current set of tools entirely rather than appending to it.
Set state.completed = False when max_messages is reached.
Allow tools to be declared with no parameters.
Allow for null bytes field in Logprobs and TopLogprobs.
Support all Llama series models on Bedrock.
Added truthfulqa benchmark.
Added intercode-ctf example.
v0.3.14 (04 June 2024)
Stream samples to the evaluation log as they are completed (subject to the new --log-buffer option). Always write completed samples in the case of an error or cancelled task.
New "cancelled" status in eval log for tasks interrupted with SIGINT (e.g. Ctrl-C). Logs are now written for cancellations (previously they were not).
Default --max-samples (maximum concurrent samples) to --max-connections, which will result in samples being more frequently completed and written to the log file.
For eval_retry(), copy previously completed samples in the log file being retried so that work is not unnecessarily repeated.
New inspect eval-retry command to retry a log file from a task that ended in error or cancellation.
New retryable_eval_logs() function and --retryable option for inspect list logs to query for tasks not yet completed within a log directory.
Add shuffled property to datasets to determine if they were shuffled.
Bugfix: Inspect view was not reliably updating when new evaluation logs were written.
v0.3.12 (31 May 2024)
Bugfix: results was not defined when no scorer was provided, resulting in an error being thrown. Fixed by setting results = EvalResults() when no scorer is provided.
Bugfix: The viewer was not properly handling samples without scores.
v0.3.11 (30 May 2024)
Update to non-beta version of Anthropic tool use (remove legacy xml tools implementation).
v0.3.10 (29 May 2024)
BREAKING: The pattern scorer has been modified to match against any (or all) regex match groups. This replaces the previous behaviour when there was more than one group, which would only match the second group.
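The any/all group behaviour can be illustrated with plain `re` (a sketch of the idea, not the scorer's actual code; the `all_groups` flag is a hypothetical name):

```python
import re

def match_groups(pattern: str, text: str, all_groups: bool = False):
    """Return all captured groups when all_groups is True, otherwise
    the first non-empty group (or the whole match if there are none)."""
    m = re.search(pattern, text)
    if m is None:
        return None
    groups = [g for g in m.groups() if g is not None]
    if all_groups:
        return groups
    return groups[0] if groups else m.group(0)

answer = match_groups(r"ANSWER:\s*(\w+)", "Reasoning... ANSWER: B")
both = match_groups(r"(\d+)\s*\+\s*(\d+)", "2 + 3", all_groups=True)
```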
Improved performance for Inspect View on very large datasets (virtualized sample list).
ToolChoice any option to indicate the model should use at least one tool (supported by Anthropic and Mistral, mapped to auto for OpenAI).
Tool calls can now return a simple scalar or list[ContentText | ContentImage].
Support for updated Anthropic tools beta (tool_choice and image tool results).
Report tool_error back to model if it provides invalid JSON for tool calls arguments (formerly this halted the entire eval with an error).
New max_samples option to control how many samples are run in parallel (still defaults to running all samples in parallel).
Registry name lookups are now case sensitive (fixes issue with loading tasks with mixed case names).
Resiliency to Python syntax errors that occur when enumerating tasks in a directory.
Do not throw error if unable to parse or load .ipynb file due to lack of dependencies (e.g. nbformat).
Various additions to log viewer display (log file name, dataset/scorer in listing, filter by complex score types).
Improvements to markdown rendering in log viewer (don’t render intraword underscores, escape html tags).
v0.3.3 (28 April 2024)
inspect view command for viewing eval log files.
Score now has an optional answer field, which denotes the answer text extracted from model output.
Accuracy metrics now take an optional ValueToFloat function for customising how textual values are mapped to float.
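A ValueToFloat is just a callable from score values to floats. A plausible sketch (the letter-grade conventions below are assumptions for illustration, not necessarily the library defaults):

```python
def default_value_to_float(value) -> float:
    """Map textual score values to floats: 'C' (correct) -> 1.0,
    'I' (incorrect) -> 0.0, 'P' (partial) -> 0.5, numeric strings
    to their value, and anything unrecognised to 0.0."""
    mapping = {"C": 1.0, "I": 0.0, "P": 0.5}
    if isinstance(value, str):
        if value in mapping:
            return mapping[value]
        try:
            return float(value)
        except ValueError:
            return 0.0
    return float(value)

scores = [default_value_to_float(v) for v in ["C", "I", "P", "0.75"]]
accuracy = sum(scores) / len(scores)
```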
Made model_graded_qa more flexible with separate instruction template and grade_pattern, as well as providing partial_credit as an option.
Modify the default templates for chain_of_thought() and self_critique() to instruct the model to reply with ANSWER: $ANSWER at the end on its own line.
Improved numeric extraction for match(numeric=True) (better currency and decimal handling).
Improve answer() patterns so that they detect letter and word answers both within and at the end of model output.
Plan now has an optional cleanup function which can be used to free per-sample resources (e.g. Docker containers) even in the case of an evaluation error.
Add Dataset.filter method for filtering samples using a predicate.
Dataset slices (e.g. dataset[0:100]) now return a Dataset rather than list[Sample].
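The filter/slice behaviour can be sketched with a minimal stand-in for the Dataset class (illustrative only, not the real implementation):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    input: str
    target: str

class Dataset:
    """Minimal stand-in showing filter() and Dataset-returning slices."""
    def __init__(self, samples):
        self._samples = list(samples)
    def __len__(self):
        return len(self._samples)
    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing yields another Dataset, not a plain list[Sample]
            return Dataset(self._samples[key])
        return self._samples[key]
    def filter(self, predicate):
        return Dataset(s for s in self._samples if predicate(s))

ds = Dataset(Sample(f"q{i}", "A") for i in range(10))
subset = ds[0:3]                                  # still a Dataset
hard = ds.filter(lambda s: s.input.endswith("9"))
```

Because slices stay Datasets, chained operations like `ds.filter(...)[0:100]` keep working without conversion.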
Relative path to INSPECT_LOG_DIR in .env file is now correctly resolved for execution within subdirectories.
inspect list tasks and list_tasks() now only parse source files (rather than loading them), ensuring they are fast even for task files that have non-trivial global initialisation.
inspect list logs and list_eval_logs() now enumerate log files recursively by default, and only enumerate json files that match log file naming conventions.
Provide header_only option for read_eval_log() and inspect info log-file for bypassing the potentially expensive reading of samples.
Provide filter option for list_eval_logs() to filter based on log file header info (i.e. anything but samples).
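The header-info filter is a predicate over everything in the log except samples. A self-contained sketch of the idea (the `EvalLogHeader` stand-in and `list_logs` helper are hypothetical, not the library's types):

```python
from dataclasses import dataclass

@dataclass
class EvalLogHeader:
    """Minimal stand-in for log-file header info (everything but samples)."""
    task: str
    status: str
    model: str

def list_logs(headers, filter=None):
    """Return headers matching the predicate, mirroring the shape of
    list_eval_logs(filter=...)."""
    return [h for h in headers if filter is None or filter(h)]

headers = [
    EvalLogHeader("math", "success", "mockllm/model"),
    EvalLogHeader("code", "error", "mockllm/model"),
]
errored = list_logs(headers, filter=lambda h: h.status == "error")
```

Filtering on headers avoids reading samples at all, which is the expensive part of large log files.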
Added __main__.py entry point for invocation via python3 -m inspect_ai.
Removed prompt and callable from model ToolDef (renamed to ToolInfo).
Fix issue with accesses of completion property on ModelOutput with no choices.