# Evals

## Coding

- **[ADE-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/dbt_labs_ade_bench.html)** — Coding · agent, sandbox · 48 samples · `inspect_harbor/dbt_labs_ade_bench` · [paper](https://github.com/dbt-labs/ade-bench)
  Analytics Data Engineer Bench: dbt and SQL data-engineering tasks across DuckDB and Snowflake backends.
- **[AgentBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/agent_bench/)** — Coding · generation, sandbox · 26 samples · `inspect_evals/agent_bench_os` · [paper](https://arxiv.org/abs/2308.03688)
  A benchmark designed to evaluate LLMs as agents.
- **[Aider Polyglot](https://meridianlabs-ai.github.io/inspect_harbor/registry/aider_polyglot.html)** — Coding · agent, sandbox · 225 samples · `inspect_harbor/aider_polyglot` · [paper](https://arxiv.org/abs/2503.03656)
  Aider's polyglot coding benchmark: Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust testing LLMs on multi-language code editing.
- **[AlgoTune](https://meridianlabs-ai.github.io/inspect_harbor/registry/algotune.html)** — Coding · agent, sandbox · 154 samples · `inspect_harbor/algotune` · [paper](https://arxiv.org/abs/2507.15887)
  AlgoTune: NeurIPS 2025 benchmark of math/physics/CS problems where the model writes code that matches reference output but runs faster than existing implementations.
- **[APPS](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/apps/)** — Coding · generation, sandbox · 5000 samples · `inspect_evals/apps` · [paper](https://arxiv.org/pdf/2105.09938v3)
  APPS is a dataset for evaluating model performance on Python programming tasks across three difficulty levels: 1,000 introductory, 3,000 interview, and 1,000 competition-level problems.
- **[AutoCodeBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/tencent_autocodebench.html)** — Coding · agent, sandbox · 200 samples · `inspect_harbor/tencent_autocodebench` · [paper](https://arxiv.org/abs/2508.09101)
  Multilingual automated code generation benchmark evaluating LLMs across diverse programming tasks and languages.
- **[BigCodeBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/bigcodebench/)** — Coding · generation, sandbox · 1140 samples · `inspect_evals/bigcodebench` · [paper](https://arxiv.org/abs/2406.15877)
  Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.
- **[BigCodeBench-Hard (Complete)](https://meridianlabs-ai.github.io/inspect_harbor/registry/bigcode_bigcodebench_hard_complete.html)** — Coding · agent, sandbox · 145 samples · `inspect_harbor/bigcode_bigcodebench_hard_complete` · [paper](https://arxiv.org/abs/2406.15877)
  BigCodeBench-Hard (Complete split): hard subset evaluating LLMs on code generation with diverse function calls and complex instructions, in completion format.
- **[CAD-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/gnucleus_ai_cad_bench.html)** — Coding, Professional · agent, sandbox · 100 samples · `inspect_harbor/gnucleus_ai_cad_bench` · [paper](https://www.gnucleus.ai/cad-bench)
  gNucleus AI CAD-generation benchmark — 100 parametric FreeCAD tasks.
- **[ClassEval](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/class_eval/)** — Coding · generation, sandbox · 100 samples · `inspect_evals/class_eval` · [paper](https://arxiv.org/abs/2308.01861)
  Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours.
- **[CodeSkills-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/nvats_codeskills_bench.html)** — Coding · agent, sandbox · 23 samples · `inspect_harbor/nvats_codeskills_bench` · [paper](https://github.com/namanvats/codeskills-bench)
  A small set of real-life programming tasks: bug fixes, merge-conflict resolution, dependency cleanup, API migration, and performance regressions across compact Python repositories.
- **[CompileBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/quesma_compilebench.html)** — Coding · agent, sandbox · 15 samples · `inspect_harbor/quesma_compilebench` · [paper](https://arxiv.org/abs/2509.25248)
  CompileBench: real-world build/compile tasks (curl, GNU coreutils, jq, etc.) ranging from easy builds to reviving 2003-era code and cross-compiling.
- **[ComputeEval](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/compute_eval/)** — Coding · generation, sandbox · 406 samples · `inspect_evals/compute_eval`
  Evaluates LLM capability to generate correct CUDA code for kernel implementation, memory management, and parallel algorithm optimization tasks.
- **[CRUST-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/crustbench.html)** — Coding · agent, sandbox · 100 samples · `inspect_harbor/crustbench` · [paper](https://arxiv.org/abs/2504.15254)
  CRUST-Bench: real-world C repositories paired with hand-written safe-Rust interfaces and tests, benchmarking LLMs on C-to-safe-Rust transpilation.
- **[DevEval](https://meridianlabs-ai.github.io/inspect_harbor/registry/deveval.html)** — Coding · agent, sandbox · 63 samples · `inspect_harbor/deveval` · [paper](https://arxiv.org/abs/2403.08604)
  DevEval: manually-annotated code-generation samples from real-world Python repositories, aligned to practical software development.
- **[DevOps-Gym](https://meridianlabs-ai.github.io/inspect_harbor/registry/michaely310_devopsgym.html)** — Coding, Professional · agent, sandbox · 728 samples · `inspect_harbor/michaely310_devopsgym` · [paper](https://arxiv.org/abs/2601.20882)
  DevOps-Gym benchmark adapted to Harbor format: 729 tasks across 5 categories (Build, Monitoring, Issue Resolving, Test Generation, and End-to-End).
- **[DS-1000](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/ds1000/)** — Coding · generation, sandbox · 1000 samples · `inspect_evals/ds1000` · [paper](https://arxiv.org/abs/2211.11501)
  Code generation benchmark with a thousand data science problems spanning seven Python libraries.
- **[DS-1000](https://meridianlabs-ai.github.io/inspect_harbor/registry/xlang_ds_1000.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/xlang_ds_1000` · [paper](https://arxiv.org/abs/2211.11501)
  DS-1000: data-science code-generation problems from StackOverflow across NumPy, Pandas, TensorFlow, PyTorch, SciPy, Scikit-learn, and Matplotlib, with execution-based grading.
- **[EvoEval](https://meridianlabs-ai.github.io/inspect_harbor/registry/evoeval.html)** — Coding · agent, sandbox · 100 samples · `inspect_harbor/evoeval` · [paper](https://github.com/evo-eval/evoeval)
  EvoEval: evolving suite that mutates HumanEval problems along several axes (difficulty, creative, subtle, tool-use) for a contamination-resistant view of LLM coding ability.
- **[FeatureBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/featurebench.html)** — Coding · agent, sandbox · 200 samples · `inspect_harbor/featurebench` · [paper](https://arxiv.org/abs/2602.10975)
  FeatureBench: agentic coding on end-to-end feature-development tasks derived from open-source repositories.
- **[FeatureBench (Modal)](https://meridianlabs-ai.github.io/inspect_harbor/registry/featurebench_modal.html)** — Coding · agent, sandbox · 200 samples · `inspect_harbor/featurebench_modal` · [paper](https://arxiv.org/abs/2602.10975)
  FeatureBench's full task suite executed on Modal's cloud sandbox runner.
- **[FeatureBench-Lite](https://meridianlabs-ai.github.io/inspect_harbor/registry/featurebench_lite.html)** — Coding · agent, sandbox · 30 samples · `inspect_harbor/featurebench_lite` · [paper](https://arxiv.org/abs/2602.10975)
  Lightweight subset of FeatureBench for cheaper evaluation while preserving model rankings.
- **[FeatureBench-Lite (Modal)](https://meridianlabs-ai.github.io/inspect_harbor/registry/featurebench_lite_modal.html)** — Coding · agent, sandbox · 30 samples · `inspect_harbor/featurebench_lite_modal` · [paper](https://arxiv.org/abs/2602.10975)
  FeatureBench-Lite executed on Modal's cloud sandbox runner.
- **[Frontier-CS](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/frontier_cs/)** — Coding · agent, sandbox · 238 samples · `inspect_evals/frontier_cs` · [paper](https://arxiv.org/abs/2512.15699)
  238 open-ended computer science problems spanning algorithmic (172) and research (66) tracks.
- **[Frontier-CS](https://meridianlabs-ai.github.io/inspect_harbor/registry/yanagiorigami_frontier_cs.html)** — Coding, Reasoning · agent, sandbox · 172 samples · `inspect_harbor/yanagiorigami_frontier_cs` · [paper](https://arxiv.org/abs/2512.15699)
  Frontier-CS competitive programming benchmark: 172 open-ended algorithmic problems with partial scoring via go-judge.
- **[HiL-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/scale_ai_hil_bench.html)** — Coding, Behavior · agent, sandbox · 600 samples · `inspect_harbor/scale_ai_hil_bench` · [paper](https://arxiv.org/abs/2604.09408)
  HiL-Bench (Human-in-the-Loop): tests if agents know when to ask for help rather than proceed with uncertain knowledge.
- **[HumanEval](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/humaneval/)** — Coding · generation, sandbox · 164 samples · `inspect_evals/humaneval` · [paper](https://arxiv.org/abs/2107.03374)
  Assesses how accurately language models can write correct Python functions based solely on natural-language instructions provided as docstrings.
- **[HumanEvalFix](https://meridianlabs-ai.github.io/inspect_harbor/registry/bigcode_humanevalfix.html)** — Coding · agent, sandbox · 164 samples · `inspect_harbor/bigcode_humanevalfix` · [paper](https://arxiv.org/abs/2308.07124)
  HumanEvalFix (OctoPack): buggy functions across Python, JavaScript, Java, Go, C++, and Rust that models must repair given the failing unit tests.
- **[IFEvalCode](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/ifevalcode/)** — Coding · generation, sandbox · 810 samples · `inspect_evals/ifevalcode` · [paper](https://arxiv.org/abs/2507.22462)
  Evaluates code generation models on their ability to produce correct code while adhering to specific instruction constraints across 8 programming languages.
- **[KernelBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/kernelbench/)** — Coding · generation, sandbox · 250 samples · `inspect_evals/kernelbench` · [paper](https://arxiv.org/html/2502.10517v1)
  A benchmark for evaluating the ability of LLMs to write efficient GPU kernels.
- **[Legacy-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/factory_ai_legacy_bench.html)** — Coding · agent, sandbox · 10 samples · `inspect_harbor/factory_ai_legacy_bench` · [paper](https://factory.ai/news/legacy-bench)
  Legacy-Bench public sample tasks for evaluating AI coding agents on legacy software engineering tasks.
- **[LiteCoder-RL](https://meridianlabs-ai.github.io/inspect_harbor/registry/litecoder_rl.html)** — Coding, Assistants · agent, sandbox · 602 samples · `inspect_harbor/litecoder_rl` · [paper](https://github.com/icip-cas/LiteCoder)
  LiteCoder: terminal-based RL training environments spanning developer workflows, scientific/numerical computing, and games.
- **[LiveCodeBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/livecodebench.html)** — Coding · agent, sandbox · 100 samples · `inspect_harbor/livecodebench` · [paper](https://arxiv.org/abs/2403.07974)
  LiveCodeBench: contamination-free coding benchmark continuously collected from LeetCode, AtCoder, and Codeforces, supporting code generation, self-repair, execution, and test-output prediction.
- **[LiveCodeBench-Pro](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/livecodebench_pro/)** — Coding · generation, sandbox · 1404 samples · `inspect_evals/livecodebench_pro` · [paper](https://arxiv.org/abs/2506.11928)
  Evaluates LLMs on competitive programming problems using a specialized Docker sandbox (LightCPVerifier) to execute and judge C++ code submissions against hidden test cases with time and memory constraints.
- **[MBPP](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mbpp/)** — Coding · generation, sandbox · 257 samples · `inspect_evals/mbpp` · [paper](https://arxiv.org/abs/2108.07732)
  Measures the ability of language models to generate short Python programs from simple natural-language descriptions, testing basic coding proficiency.
- **[MLE-bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mle_bench/)** — Coding · agent, sandbox · 1 sample · `inspect_evals/mle_bench` · [paper](https://arxiv.org/abs/2410.07095)
  Machine learning tasks drawn from 75 Kaggle competitions.
- **[MLGym-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/meta_mlgym_bench.html)** — Coding, Science · agent, sandbox · 12 samples · `inspect_harbor/meta_mlgym_bench` · [paper](https://arxiv.org/abs/2502.14499)
  MLGym-Bench: Meta's framework and benchmark for AI research agents covering CV, NLP, RL, and game-theory tasks requiring ideation, implementation, training, and analysis of ML experiments.
- **[MLRC-Bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/mlrc_bench/)** — Coding · agent, sandbox · 7 samples · `inspect_evals/mlrc_bench` · [paper](https://arxiv.org/pdf/2504.09702)
  This benchmark evaluates LLM-based research agents on their ability to propose and implement novel methods using tasks from recent ML conference competitions, assessing both novelty and effectiveness compared to a baseline and top human solutions.
- **[o11y-bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/grafana_o11y_bench.html)** — Coding, Professional · agent, sandbox · 63 samples · `inspect_harbor/grafana_o11y_bench` · [paper](https://github.com/grafana/o11y-bench)
  o11y-bench: an open agentic observability benchmark. Measures how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
- **[OTel-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/quesma_otel_bench.html)** — Coding · agent, sandbox · 26 samples · `inspect_harbor/quesma_otel_bench` · [paper](https://github.com/QuesmaOrg/otel-bench)
  AI-agent benchmark for OpenTelemetry instrumentation tasks across multiple programming languages.
- **[PaperBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/paperbench/)** — Coding · agent, sandbox · 23 samples · `inspect_evals/paperbench` · [paper](https://arxiv.org/abs/2504.01848)
  Agents are evaluated on their ability to replicate 20 ICML 2024 Spotlight and Oral papers from scratch.
- **[QuixBugs](https://meridianlabs-ai.github.io/inspect_harbor/registry/quixbugs.html)** — Coding · agent, sandbox · 80 samples · `inspect_harbor/quixbugs` · [paper](https://github.com/jkoppel/QuixBugs)
  QuixBugs: small classic-algorithm programs (Python and Java) each containing a one-line bug, used to evaluate automated program repair.
- **[RExBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/rexbench.html)** — Coding · agent, sandbox · 2 samples · `inspect_harbor/rexbench` · [paper](https://arxiv.org/abs/2506.22598)
  RExBench: 2 tasks (cogs, othello) evaluating AI agents' ability to extend existing AI research through research experiment implementation. Tasks require an A100 40GB GPU and may take multiple hours (cogs: ~2.5h oracle; othello: ~45min oracle). Parity (openhands@1.4.0 / anthropic/claude-sonnet-4-5, 3 runs): cogs 100% Harbor vs 66.67% original; othello 100% Harbor vs 100% original. Adapter by Nicholas Edwards (nicholas.edwards@univie.ac.at), one of the original RExBench authors. Acknowledgements: We thank 2077 AI for generously funding API credits to support running parity experiments.
- **[SETA-Env](https://meridianlabs-ai.github.io/inspect_harbor/registry/camel_ai_seta_env.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/camel_ai_seta_env` · [paper](https://github.com/camel-ai/seta-env)
  SETA (Scaling Environments for Terminal Agents): CAMEL-AI's verifiable terminal-agent tasks spanning software engineering, sysadmin, and DevOps for evaluating and RL-training agents.
- **[SlopCodeBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/gabeorlanski_slopcodebench.html)** — Coding · agent, sandbox · 36 samples · `inspect_harbor/gabeorlanski_slopcodebench` · [paper](https://arxiv.org/abs/2603.24755)
  SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor.
- **[SWE-Atlas (QnA)](https://meridianlabs-ai.github.io/inspect_harbor/registry/scale_ai_swe_atlas_qna.html)** — Coding · agent, sandbox · 124 samples · `inspect_harbor/scale_ai_swe_atlas_qna` · [paper](https://github.com/scaleapi/SWE-Atlas)
  SWE-Atlas (Codebase QnA): a benchmark of deep codebase comprehension and question-answering problems for coding agents. See the linked repository for instructions on running it.
- **[SWE-Atlas (Refactoring)](https://meridianlabs-ai.github.io/inspect_harbor/registry/scale_ai_swe_atlas_rf.html)** — Coding · agent, sandbox · 70 samples · `inspect_harbor/scale_ai_swe_atlas_rf` · [paper](https://github.com/scaleapi/SWE-Atlas)
  SWE-Atlas (Refactoring): a benchmark of refactoring tasks for coding agents.
- **[SWE-Atlas (Test Writing)](https://meridianlabs-ai.github.io/inspect_harbor/registry/scale_ai_swe_atlas_tw.html)** — Coding · agent, sandbox · 90 samples · `inspect_harbor/scale_ai_swe_atlas_tw` · [paper](https://github.com/scaleapi/SWE-Atlas)
  SWE-Atlas (Test Writing): a benchmark of comprehensive test-writing problems for coding agents. See the linked repository for instructions on running it.
- **[SWE-bench Pro](https://meridianlabs-ai.github.io/inspect_harbor/registry/cais_swebenchpro.html)** — Coding · agent, sandbox · 731 samples · `inspect_harbor/cais_swebenchpro` · [paper](https://arxiv.org/abs/2509.16941)
  SWE-bench Pro with anti-exploitation (git history isolation + GitHub network blocking). 731 tasks, Python/JS/TS/Go.
- **[SWE-bench Pro](https://meridianlabs-ai.github.io/inspect_harbor/registry/scale_ai_swe_bench_pro.html)** — Coding · agent, sandbox · 731 samples · `inspect_harbor/scale_ai_swe_bench_pro` · [paper](https://arxiv.org/abs/2509.16941)
  SWE-bench Pro: long-horizon enterprise software engineering tasks.
- **[SWE-bench Verified](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/swe_bench/)** — Coding · agent, sandbox · 500 samples · `inspect_evals/swe_bench` · [paper](https://arxiv.org/abs/2310.06770)
  Evaluates AI's ability to resolve genuine software engineering issues sourced from 12 popular Python GitHub repositories, reflecting realistic coding and debugging scenarios.
- **[SWE-bench Verified](https://meridianlabs-ai.github.io/inspect_harbor/registry/swe_bench_verified.html)** — Coding · agent, sandbox · 500 samples · `inspect_harbor/swe_bench_verified` · [paper](https://arxiv.org/abs/2310.06770)
  SWE-bench Verified: human-filtered subset of SWE-bench (collaboration with OpenAI) where human SWEs confirmed each real GitHub issue is solvable given the available repository context.
- **[SWE-gen (C++)](https://meridianlabs-ai.github.io/inspect_harbor/registry/abundant_swe_gen_cpp.html)** — Coding · agent, sandbox · 999 samples · `inspect_harbor/abundant_swe_gen_cpp` · [paper](https://github.com/abundant-ai/SWE-gen-Cpp)
  Dataset of C++ SWE tasks, generated by the abundant-ai/SWE-gen tool.
- **[SWE-gen (Go)](https://meridianlabs-ai.github.io/inspect_harbor/registry/abundant_swe_gen_go.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/abundant_swe_gen_go` · [paper](https://github.com/abundant-ai/SWE-gen-Go)
  Dataset of Go SWE tasks, generated by the abundant-ai/SWE-gen tool.
- **[SWE-gen (Java)](https://meridianlabs-ai.github.io/inspect_harbor/registry/abundant_swe_gen_java.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/abundant_swe_gen_java` · [paper](https://github.com/abundant-ai/SWE-gen-Java)
  Dataset of Java SWE tasks, generated by the abundant-ai/SWE-gen tool.
- **[SWE-gen (JS/TS)](https://meridianlabs-ai.github.io/inspect_harbor/registry/abundant_swe_gen_js.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/abundant_swe_gen_js` · [paper](https://github.com/abundant-ai/SWE-gen)
  Dataset of JS/TS SWE tasks, generated by the abundant-ai/SWE-gen tool.
- **[SWE-gen (Rust)](https://meridianlabs-ai.github.io/inspect_harbor/registry/abundant_swe_gen_rust.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/abundant_swe_gen_rust` · [paper](https://github.com/abundant-ai/SWE-gen-Rust)
  Dataset of Rust SWE tasks, generated by the abundant-ai/SWE-gen tool.
- **[SWE-Lancer](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/swe_lancer/)** — Coding · agent, sandbox · 460 samples · `inspect_evals/swe_lancer` · [paper](https://arxiv.org/pdf/2502.12115)
  A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts.
- **[SWE-Lancer Diamond (Full)](https://meridianlabs-ai.github.io/inspect_harbor/registry/openai_swe_lancer_diamond_all.html)** — Coding · agent, sandbox · 463 samples · `inspect_harbor/openai_swe_lancer_diamond_all` · [paper](https://arxiv.org/abs/2502.12115)
  SWE-Lancer Diamond (full): public split of OpenAI's SWE-Lancer benchmark — real Upwork freelance software-engineering tasks worth $500,800, combining IC engineering tasks and managerial decision tasks.
- **[SWE-Lancer Diamond (IC)](https://meridianlabs-ai.github.io/inspect_harbor/registry/openai_swe_lancer_diamond_ic.html)** — Coding · agent, sandbox · 198 samples · `inspect_harbor/openai_swe_lancer_diamond_ic` · [paper](https://arxiv.org/abs/2502.12115)
  A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. Individual Contributor (IC) variant: end-to-end engineering tasks.
- **[SWE-Lancer Diamond (Manager)](https://meridianlabs-ai.github.io/inspect_harbor/registry/openai_swe_lancer_diamond_manager.html)** — Coding, Professional · agent, sandbox · 265 samples · `inspect_harbor/openai_swe_lancer_diamond_manager` · [paper](https://arxiv.org/abs/2502.12115)
  A benchmark of freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. Manager variant: picking between technical implementation proposals.
- **[SWE-rebench V2](https://meridianlabs-ai.github.io/inspect_harbor/registry/pgcodellm_rebench_v2_test.html)** — Coding · agent, sandbox · 20 samples · `inspect_harbor/pgcodellm_rebench_v2_test` · [paper](https://arxiv.org/abs/2602.23866)
  SWE-rebench V2: language-agnostic dataset of executable SWE tasks across 20 languages, with pre-built images for reproducible execution.
- **[SWE-smith](https://meridianlabs-ai.github.io/inspect_harbor/registry/swe_bench_swe_smith.html)** — Coding · agent, sandbox · 100 samples · `inspect_harbor/swe_bench_swe_smith` · [paper](https://arxiv.org/abs/2504.21798)
  SWE-smith: NeurIPS 2025 toolkit for synthesizing unlimited SWE-bench-style task instances from any Python repository, plus released task instances and agent trajectories.
- **[SWT-Bench Verified](https://meridianlabs-ai.github.io/inspect_harbor/registry/swt_bench_verified.html)** — Coding · agent, sandbox · 433 samples · `inspect_harbor/swt_bench_verified` · [paper](https://arxiv.org/abs/2406.12952)
  SWT-Bench Verified: human-validated subset of SWT-Bench evaluating LLMs on generating reproducing unit tests for real GitHub issues — tests must fail on buggy code and pass after the fix.
- **[TermiGen-Environments](https://meridianlabs-ai.github.io/inspect_harbor/registry/termigen_environments.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/termigen_environments` · [paper](https://arxiv.org/abs/2602.07274)
  TermiGen-Environments: verified Docker environments with executable terminal-agent tasks across 11 categories, generated by an end-to-end multi-agent synthesis pipeline.
- **[Terminal-Bench Pro](https://meridianlabs-ai.github.io/inspect_harbor/registry/terminal_bench_pro.html)** — Coding · agent, sandbox · 200 samples · `inspect_harbor/terminal_bench_pro` · [paper](https://arxiv.org/abs/2601.11868)
  Terminal-Bench Pro: tasks across 8 domains — data processing, games, debugging, sysadmin, scientific computing, SWE, ML, and security — extending Terminal-Bench with harder real-world scenarios.
- **[Terminal-Bench v2](https://meridianlabs-ai.github.io/inspect_harbor/registry/terminal_bench_2.html)** — Coding, Assistants · agent, sandbox · 89 samples · `inspect_harbor/terminal_bench_2` · [paper](https://arxiv.org/abs/2601.11868)
  Terminal-Bench v2: benchmark for testing AI agents in real terminal environments — from compiling code to training models and setting up servers.
- **[Terminal-Bench v2.1](https://meridianlabs-ai.github.io/inspect_harbor/registry/terminal_bench_2_1.html)** — Coding, Assistants · agent, sandbox · 89 samples · `inspect_harbor/terminal_bench_2_1` · [paper](https://arxiv.org/abs/2601.11868)
  Terminal-Bench v2.1 (point release of v2): benchmark for testing AI agents in real terminal environments — from compiling code to training models and setting up servers.
- **[USACO](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/usaco/)** — Coding · generation, sandbox · 307 samples · `inspect_evals/usaco` · [paper](https://arxiv.org/abs/2404.10952)
  Evaluates language model performance on difficult Olympiad programming problems across four difficulty levels.
- **[USACO](https://meridianlabs-ai.github.io/inspect_harbor/registry/usaco.html)** — Coding · agent, sandbox · 304 samples · `inspect_harbor/usaco` · [paper](https://arxiv.org/abs/2404.10952)
  USACO: USA Computing Olympiad problems across bronze/silver/gold/platinum tiers with high-quality unit tests, reference code, and official analyses for ad-hoc algorithmic reasoning.
- **[vmax-tasks](https://meridianlabs-ai.github.io/inspect_harbor/registry/vmax_tasks.html)** — Coding · agent, sandbox · 1000 samples · `inspect_harbor/vmax_tasks`
  Code-transformation tasks across JavaScript projects (Docusaurus, Vue, Redux).
- **[WebGen-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/webgen_bench.html)** — Coding · agent, sandbox · 101 samples · `inspect_harbor/webgen_bench` · [paper](https://arxiv.org/abs/2505.03733)
  WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch (101 test tasks).

## Assistants

- **[AssistantBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/assistant_bench/)** — Assistants · agent, sandbox · 33 samples · `inspect_evals/assistant_bench_closed_book_zero_shot` · [paper](https://arxiv.org/abs/2407.15711)
  Tests whether AI agents can perform real-world time-consuming tasks on the web.
- **[BFCL](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/bfcl/)** — Assistants · generation, text · 4981 samples · `inspect_evals/bfcl`
  Evaluates LLM function/tool-calling ability on a simplified split of the Berkeley Function-Calling Leaderboard (BFCL).
- **[BFCL](https://meridianlabs-ai.github.io/inspect_harbor/registry/gorilla_bfcl.html)** — Assistants · agent, sandbox · 1000 samples · `inspect_harbor/gorilla_bfcl` · [paper](https://github.com/ShishirPatil/gorilla)
  Berkeley Function-Calling Leaderboard: LLM tool-use across function-calling categories spanning Python, Java, JavaScript, and REST APIs.
- **[BFCL (parity)](https://meridianlabs-ai.github.io/inspect_harbor/registry/gorilla_bfcl_parity.html)** — Assistants · agent, sandbox · 123 samples · `inspect_harbor/gorilla_bfcl_parity` · [paper](https://github.com/ShishirPatil/gorilla)
  Stratified parity subset of BFCL validating that Harbor's adapter matches the upstream implementation.
- **[BrowseComp](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/browse_comp/)** — Assistants · agent, sandbox · 1266 samples · `inspect_evals/browse_comp` · [paper](https://arxiv.org/pdf/2504.12516)
  A benchmark for evaluating agents' ability to browse the web.
- **[GAIA](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/gaia/)** — Assistants · agent, sandbox · 165 samples · `inspect_evals/gaia` · [paper](https://arxiv.org/abs/2311.12983)
  Proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency.
- **[GAIA](https://meridianlabs-ai.github.io/inspect_harbor/registry/gaia.html)** — Assistants, Multimodal · agent, sandbox · 165 samples · `inspect_harbor/gaia` · [paper](https://arxiv.org/abs/2311.12983)
  GAIA: real-world questions across three difficulty levels evaluating general AI assistants on reasoning, multimodality, web browsing, and tool use.
- **[Mind2Web](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/mind2web/)** — Assistants · generation, text · 7775 samples · `inspect_evals/mind2web` · [paper](https://arxiv.org/abs/2306.06070)
  A dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website.
- **[MMAU](https://meridianlabs-ai.github.io/inspect_harbor/registry/apple_mmau.html)** — Assistants, Coding, Mathematics, Reasoning · agent, sandbox · 1000 samples · `inspect_harbor/apple_mmau` · [paper](https://arxiv.org/abs/2410.19168)
  MMAU (Massive Multitask Agent Understanding): Apple's holistic agent benchmark covering tool-use, DAG QA, data science/ML coding, contest programming, and mathematics.
- **[OSWorld](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/osworld/)** — Assistants · agent, sandbox · 369 samples · `inspect_evals/osworld` · [paper](https://arxiv.org/abs/2404.07972)
  Tests AI agents' ability to perform realistic, open-ended tasks within simulated computer environments, requiring complex interaction across multiple input modalities.
- **[Sycophancy Eval](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/sycophancy/)** — Assistants · generation, text · 4888 samples · `inspect_evals/sycophancy` · [paper](https://arxiv.org/abs/2310.13548)
  Evaluate sycophancy of language models across a variety of free-form text-generation tasks.
- **[Tau2](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/tau2/)** — Assistants · agent · 50 samples · `inspect_evals/tau2_airline` · [paper](https://arxiv.org/abs/2506.07982)
  Evaluates conversational agents in a dual-control environment.
- **[The Agent Company](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/theagentcompany/)** — Assistants · agent, tool-use, sandbox · 34 samples · `inspect_evals/theagentcompany` · [paper](https://arxiv.org/abs/2412.14161)
  The Agent Company benchmark evaluates autonomous agents in a realistic, self-contained company environment.
- **[τ³-bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/sierra_research_tau3_bench.html)** — Assistants, Professional, Behavior · agent, sandbox · 375 samples · `inspect_harbor/sierra_research_tau3_bench` · [paper](https://arxiv.org/abs/2406.12045)
  Third generation of τ-bench, extending the original with knowledge and voice. A simulation framework for evaluating customer service agents across airline, retail, telecom, and banking knowledge domains.

## Reasoning

- **[AAR](https://meridianlabs-ai.github.io/inspect_harbor/registry/minnesotanlp_aar.html)** — Reasoning, Assistants · agent, sandbox · 1000 samples · `inspect_harbor/minnesotanlp_aar` · [paper](https://arxiv.org/abs/2604.10261)
  The Amazing Agent Race (AAR): 1400 multi-step scavenger-hunt puzzles for evaluating LLM agents on tool use, web navigation, and arithmetic reasoning. Includes linear (800) and DAG (600) variants across 4 difficulty levels.
- **[ARC-AGI-2](https://meridianlabs-ai.github.io/inspect_harbor/registry/arcprize_arc_agi_2.html)** — Reasoning, Multimodal · agent, sandbox · 167 samples · `inspect_harbor/arcprize_arc_agi_2` · [paper](https://arxiv.org/abs/2505.11831)
  ARC-AGI-2: visual reasoning tasks testing general fluid intelligence — humans solve them easily but state-of-the-art models still struggle.
- **[BBH](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/bbh/)** — Reasoning · generation, text · 250 samples · `inspect_evals/bbh` · [paper](https://arxiv.org/abs/2210.09261)
  Tests AI models on a suite of 23 challenging BIG-Bench tasks that previously proved difficult even for advanced language models to solve.
- **[BIG-Bench Extra Hard](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/bbeh/)** — Reasoning · generation, text · 4520 samples · `inspect_evals/bbeh` · [paper](https://arxiv.org/pdf/2502.19187)
  A reasoning capability dataset that replaces each task in BIG-Bench-Hard with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty.
- **[BoolQ](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/boolq/)** — Reasoning · generation, text · 3270 samples · `inspect_evals/boolq` · [paper](https://arxiv.org/abs/1905.10044)
  Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.
- **[DROP](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/drop/)** — Reasoning · generation, text · 9535 samples · `inspect_evals/drop` · [paper](https://arxiv.org/abs/1903.00161)
  Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
- **[HellaSwag](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/hellaswag/)** — Reasoning · generation, text · 10042 samples · `inspect_evals/hellaswag` · [paper](https://arxiv.org/abs/1905.07830)
  Tests models' commonsense reasoning abilities by asking them to select the most likely next step or continuation for a given everyday situation.
- **[IFEval](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/ifeval/)** — Reasoning · generation, text · 541 samples · `inspect_evals/ifeval` · [paper](https://arxiv.org/abs/2311.07911)
  Evaluates how well language models can strictly follow detailed instructions, such as writing responses with specific word counts or including required keywords.
- **[KUMO (easy)](https://meridianlabs-ai.github.io/inspect_harbor/registry/kumo_easy.html)** — Reasoning · agent, sandbox · 1000 samples · `inspect_harbor/kumo_easy` · [paper](https://arxiv.org/abs/2504.02810)
  KUMO (easy split): easier-difficulty procedurally-generated reasoning tasks from KUMO's benchmark across 100 domains.
- **[KUMO (hard)](https://meridianlabs-ai.github.io/inspect_harbor/registry/kumo_hard.html)** — Reasoning · agent, sandbox · 250 samples · `inspect_harbor/kumo_hard` · [paper](https://arxiv.org/abs/2504.02810)
  KUMO (hard split): hard-difficulty procedurally-generated reasoning tasks from KUMO's benchmark across 100 domains.
- **[KUMO (kumo-1)](https://meridianlabs-ai.github.io/inspect_harbor/registry/kumo_1.html)** — Reasoning · agent, sandbox · 1000 samples · `inspect_harbor/kumo_1` · [paper](https://arxiv.org/abs/2504.02810)
  KUMO (kumo-1 split): procedurally-generated multi-turn reasoning games combining LLMs with symbolic engines across 100 open-ended domains.
- **[KUMO (parity)](https://meridianlabs-ai.github.io/inspect_harbor/registry/kumo_parity.html)** — Reasoning · agent, sandbox · 212 samples · `inspect_harbor/kumo_parity` · [paper](https://arxiv.org/abs/2504.02810)
  KUMO (parity split): subset of the KUMO procedural-reasoning benchmark used for parity / regression checks against the upstream evaluation.
- **[LingOly](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/lingoly/)** — Reasoning · generation, text · 408 samples · `inspect_evals/lingoly` · [paper](https://arxiv.org/pdf/2406.06196), [paper](https://arxiv.org/abs/2503.02972)
  Two linguistics reasoning benchmarks: LingOly (Linguistic Olympiad questions) is a benchmark utilising low-resource languages.
- **[MMMU](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/mmmu/)** — Reasoning · multimodal, vision · 847 samples · `inspect_evals/mmmu_multiple_choice` · [paper](https://arxiv.org/abs/2311.16502)
  Assesses multimodal AI models on challenging college-level questions covering multiple academic subjects, requiring detailed visual interpretation, in-depth reasoning, and both multiple-choice and open-ended answering abilities.
- **[MuSR](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/musr/)** — Reasoning · generation, text · 250 samples · `inspect_evals/musr` · [paper](https://arxiv.org/abs/2310.16049)
  Evaluating models on multistep soft reasoning tasks in the form of free-text narratives.
- **[Needle in a Haystack (NIAH)](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/niah/)** — Reasoning · generation, text · 225 samples · `inspect_evals/niah` · [paper](https://arxiv.org/abs/2407.01437)
  NIAH evaluates the in-context retrieval ability of long-context LLMs by testing a model's ability to extract factual information from long-context inputs.
- **[NoveltyBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/novelty_bench/)** — Reasoning · generation, text · 1100 samples · `inspect_evals/novelty_bench` · [paper](https://arxiv.org/abs/2504.05228)
  Evaluates how well language models generate diverse, humanlike responses across multiple reasoning and generation tasks.
- **[PAWS](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/paws/)** — Reasoning · generation, text · 8000 samples · `inspect_evals/paws` · [paper](https://arxiv.org/abs/1904.01130)
  Evaluating models on the task of paraphrase detection by providing pairs of sentences that are either paraphrases or not.
- **[PIQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/piqa/)** — Reasoning · generation, text · 1838 samples · `inspect_evals/piqa` · [paper](https://arxiv.org/abs/1911.11641)
  Measures the model's ability to apply practical, everyday commonsense reasoning about physical objects and scenarios through simple decision-making questions.
- **[RACE-H](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/race_h/)** — Reasoning · generation, text · 3498 samples · `inspect_evals/race_h` · [paper](https://arxiv.org/abs/1704.04683)
  Reading comprehension tasks collected from English exams for Chinese middle and high school students aged 12 to 18.
- **[Reasoning Gym (easy)](https://meridianlabs-ai.github.io/inspect_harbor/registry/reasoning_gym_easy.html)** — Reasoning · agent, sandbox · 288 samples · `inspect_harbor/reasoning_gym_easy` · [paper](https://arxiv.org/abs/2505.24760)
  Reasoning Gym (easy split): procedurally-generated, algorithmically-verifiable reasoning tasks (algebra, arithmetic, logic, geometry, graphs, games) at easier difficulty for evaluating and RL-training reasoning models.
- **[Reasoning Gym (hard)](https://meridianlabs-ai.github.io/inspect_harbor/registry/reasoning_gym_hard.html)** — Reasoning · agent, sandbox · 288 samples · `inspect_harbor/reasoning_gym_hard` · [paper](https://arxiv.org/abs/2505.24760)
  Reasoning Gym (hard split): procedurally-generated, algorithmically-verifiable reasoning tasks at harder difficulty across 90+ task families.
- **[runebench](https://meridianlabs-ai.github.io/inspect_harbor/registry/maxbittker_runebench.html)** — Reasoning, Behavior · agent, sandbox · 32 samples · `inspect_harbor/maxbittker_runebench` · [paper](https://github.com/MaxBittker/rs-sdk)
  Benchmark suite for evaluating AI agents on RuneScape gameplay tasks.
- **[SATBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/satbench.html)** — Reasoning · agent, sandbox · 1000 samples · `inspect_harbor/satbench` · [paper](https://arxiv.org/abs/2505.14615)
  SATBench: logical-reasoning puzzles automatically generated from SAT formulas with adjustable difficulty, validated through both LLM and SAT-solver consistency checks.
- **[SQuAD](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/squad/)** — Reasoning · generation, text · 11873 samples · `inspect_evals/squad` · [paper](https://arxiv.org/abs/1606.05250)
  Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
- **[VimGolf](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/vimgolf_challenges/)** — Reasoning · generation, sandbox · 612 samples · `inspect_evals/vimgolf_single_turn`
  A benchmark that evaluates LLMs on their ability to operate the Vim editor and complete editing challenges.
- **[WINOGRANDE](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/winogrande/)** — Reasoning · generation, text · 1267 samples · `inspect_evals/winogrande` · [paper](https://arxiv.org/abs/1907.10641)
  Large-scale dataset of 44k pronoun resolution problems inspired by the Winograd Schema Challenge, a set of 273 expert-crafted problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.
- **[WorldSense](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/worldsense/)** — Reasoning · generation, text · 87048 samples · `inspect_evals/worldsense` · [paper](https://arxiv.org/pdf/2311.15930)
  Measures grounded reasoning over synthetic world descriptions while controlling for dataset bias.
- **[WritingBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/writing/writingbench/)** — Reasoning · generation, text · 1000 samples · `inspect_evals/writingbench` · [paper](https://arxiv.org/pdf/2503.05244)
  A comprehensive evaluation benchmark designed to assess large language models' capabilities across diverse writing tasks.
- **[∞Bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/infinite_bench/)** — Reasoning · generation, text · 394 samples · `inspect_evals/infinite_bench_code_debug` · [paper](https://arxiv.org/abs/2402.13718)
  LLM benchmark featuring an average data length surpassing 100K tokens.

## Knowledge

- **[AGIEval](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/agieval/)** — Knowledge · generation, text · 254 samples · `inspect_evals/agie_aqua_rat` · [paper](https://arxiv.org/abs/2304.06364)
  AGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.
- **[BBQ](https://ukgovernmentbeis.github.io/inspect_evals/evals/bias/bbq/)** — Knowledge · generation, text · 58492 samples · `inspect_evals/bbq` · [paper](https://arxiv.org/abs/2110.08193)
  A dataset for evaluating bias in question answering models across multiple social dimensions.
- **[BOLD](https://ukgovernmentbeis.github.io/inspect_evals/evals/bias/bold/)** — Knowledge · generation, text · 7200 samples · `inspect_evals/bold` · [paper](https://arxiv.org/abs/2101.11718)
  A dataset to measure fairness in open-ended text generation, covering five domains: profession, gender, race, religious ideologies, and political ideologies.
- **[CommonsenseQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/commonsense_qa/)** — Knowledge · generation, text · 1221 samples · `inspect_evals/commonsense_qa` · [paper](https://arxiv.org/abs/1811.00937)
  Evaluates an AI model's ability to correctly answer everyday questions that rely on basic commonsense knowledge and understanding of the world.
- **[DeepSearchQA](https://meridianlabs-ai.github.io/inspect_harbor/registry/kgmon_deepsearchqa.html)** — Knowledge, Reasoning, Assistants · agent, sandbox · 900 samples · `inspect_harbor/kgmon_deepsearchqa` · [paper](https://arxiv.org/abs/2601.20975)
  DeepSearchQA is a 900-prompt factuality benchmark from Google DeepMind for evaluating deep research agents on difficult multi-step information-seeking tasks.
- **[Humanity's Last Exam](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/hle/)** — Knowledge · generation, text · 3000 samples · `inspect_evals/hle` · [paper](https://arxiv.org/abs/2501.14249)
  Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.
- **[LiveBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/livebench/)** — Knowledge · generation, text · 910 samples · `inspect_evals/livebench` · [paper](https://arxiv.org/abs/2406.19314)
  LiveBench is a benchmark designed to limit test-set contamination and support objective evaluation by releasing new questions regularly and basing questions on recently released datasets.
- **[MaCBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/macbench/)** — Knowledge · generation, text · 1153 samples · `inspect_evals/macbench` · [paper](https://arxiv.org/abs/2411.16955)
  MaCBench is a comprehensive benchmark for evaluating how vision-language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental understanding, and results interpretation.
- **[MMLU](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/mmlu/)** — Knowledge · generation, text · 14042 samples · `inspect_evals/mmlu_0_shot` · [paper](https://arxiv.org/abs/2009.03300)
  Evaluate models on 57 tasks including elementary mathematics, US history, computer science, law, and more.
- **[MMLU-Pro](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/mmlu_pro/)** — Knowledge · generation, text · 12032 samples · `inspect_evals/mmlu_pro` · [paper](https://arxiv.org/abs/2406.01574)
  An advanced benchmark that tests both broad knowledge and reasoning capabilities across many subjects, featuring challenging questions and multiple-choice answers with increased difficulty and complexity.
- **[MMMLU](https://meridianlabs-ai.github.io/inspect_harbor/registry/openai_mmmlu.html)** — Knowledge, Reasoning · agent, sandbox · 150 samples · `inspect_harbor/openai_mmmlu` · [paper](https://arxiv.org/abs/2503.10497)
  MMMLU (Multilingual MMLU): OpenAI's professional human translation of the MMLU test set into 14 languages for multilingual knowledge and reasoning evaluation.
- **[O-NET](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/onet/)** — Knowledge · generation, text · 397 samples · `inspect_evals/onet_m6`
  Questions and answers from the Ordinary National Educational Test (O-NET), administered annually by the National Institute of Educational Testing Service to Matthayom 6 (Grade 12 / ISCED 3) students in Thailand.
- **[Personality](https://ukgovernmentbeis.github.io/inspect_evals/evals/personality/personality/)** — Knowledge · generation, text · 44 samples · `inspect_evals/personality_BFI`
  An evaluation suite consisting of multiple personality tests that can be applied to LLMs.
- **[SimpleQA](https://meridianlabs-ai.github.io/inspect_harbor/registry/openai_simpleqa.html)** — Knowledge · agent, sandbox · 1000 samples · `inspect_harbor/openai_simpleqa` · [paper](https://arxiv.org/abs/2411.04368)
  SimpleQA: short, fact-seeking questions adversarially collected against GPT-4 to measure short-form factuality and calibration of frontier LLMs.
- **[SimpleQA/SimpleQA Verified](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/simpleqa/)** — Knowledge · generation, text · 4326 samples · `inspect_evals/simpleqa` · [paper](https://arxiv.org/abs/2411.04368), [paper](https://arxiv.org/abs/2509.07968)
  A benchmark that evaluates the ability of language models to answer short, fact-seeking questions.
- **[TruthfulQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/truthfulqa/)** — Knowledge · generation, text · 817 samples · `inspect_evals/truthfulqa` · [paper](https://arxiv.org/abs/2109.07958v2)
  Measures whether a language model generates truthful answers, using questions that some humans would answer falsely due to a false belief or misconception.
- **[XSTest](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/xstest/)** — Knowledge · generation, text · 250 samples · `inspect_evals/xstest` · [paper](https://arxiv.org/abs/2308.01263)
  Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.

## Cybersecurity

- **[BinaryAudit](https://meridianlabs-ai.github.io/inspect_harbor/registry/binary_audit.html)** — Cybersecurity · agent, sandbox · 46 samples · `inspect_harbor/binary_audit` · [paper](https://github.com/QuesmaOrg/BinaryAudit)
  BinaryAudit: AI-agent benchmark for finding backdoors hidden in compiled binaries via reverse engineering.
- **[Catastrophic Cyber Capabilities Benchmark (3CB)](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/threecb/)** — Cybersecurity · agent, sandbox · 13 samples · `inspect_evals/threecb` · [paper](https://arxiv.org/abs/2410.09114)
  A benchmark for evaluating the capabilities of LLM agents in cyber offense.
- **[CTI-REALM](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cti_realm/)** — Cybersecurity · agent, sandbox · 25 samples · `inspect_evals/cti_realm_25` · [paper](https://arxiv.org/abs/2603.13517)
  Evaluates AI systems' ability to analyze cyber threat intelligence and develop comprehensive detection capabilities through a realistic 5-subtask workflow: MITRE technique mapping, data source discovery, Sigma rule generation, KQL development and testing against real telemetry data, and results analysis.
- **[CVEBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cve_bench/)** — Cybersecurity · agent, sandbox · 40 samples · `inspect_evals/cve_bench` · [paper](https://arxiv.org/abs/2503.17332)
  Characterises an AI Agent's capability to exploit real-world web application vulnerabilities.
- **[Cybench](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cybench/)** — Cybersecurity · agent, sandbox · 39 samples · `inspect_evals/cybench` · [paper](https://arxiv.org/abs/2408.08926)
  Tests language models on cybersecurity skills using 39 of 40 practical, professional-level challenges taken from cybersecurity competitions, designed to cover various difficulty levels and security concepts.
- **[CyberGym](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cybergym/)** — Cybersecurity · agent, sandbox · 6028 samples · `inspect_evals/cybergym` · [paper](https://arxiv.org/abs/2506.02548)
  A large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks.
- **[CyberMetric](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cybermetric/)** — Cybersecurity · generation, text · 80 samples · `inspect_evals/cybermetric_80` · [paper](https://arxiv.org/abs/2402.07688)
  Datasets containing 80, 500, 2000, and 10000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity.
- **[CYBERSECEVAL 3](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cyberseceval_3/)** — Cybersecurity · generation, text · 1000 samples · `inspect_evals/cyse3_visual_prompt_injection` · [paper](https://arxiv.org/abs/2312.04724)
  Evaluates Large Language Models for cybersecurity risk to third parties, application developers and end users.
- **[CyberSecEval 4](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cyberseceval_4/)** — Cybersecurity · generation, text · 1000 samples · `inspect_evals/cyse4_mitre` · [paper](https://arxiv.org/abs/2404.13161)
  A suite of cybersecurity evaluation benchmarks adapted from Meta's PurpleLlama CybersecurityBenchmarks.
- **[CyberSecEval_2](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/cyberseceval_2/)** — Cybersecurity · generation, sandbox · 500 samples · `inspect_evals/cyse2_interpreter_abuse` · [paper](https://arxiv.org/pdf/2404.13161)
  Assesses language models for cybersecurity risks, specifically testing their potential to misuse programming interpreters, vulnerability to malicious prompt injections, and capability to exploit known software vulnerabilities.
- **[GDM Dangerous Capabilities](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/gdm_in_house_ctf/)** — Cybersecurity · agent, sandbox · 13 samples · `inspect_evals/gdm_in_house_ctf` · [paper](https://arxiv.org/abs/2403.13793)
  CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying.
- **[InterCode](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/gdm_intercode_ctf/)** — Cybersecurity · agent, sandbox · 78 samples · `inspect_evals/gdm_intercode_ctf` · [paper](https://arxiv.org/abs/2306.14898)
  Tests AI's ability in coding, cryptography, reverse engineering, and vulnerability identification through practical capture-the-flag (CTF) cybersecurity scenarios.
- **[SecQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/sec_qa/)** — Cybersecurity · generation, text · 110 samples · `inspect_evals/sec_qa_v1` · [paper](https://arxiv.org/abs/2312.15838)
  "Security Question Answering" dataset to assess LLMs' understanding and application of security principles.
- **[SEvenLLM](https://ukgovernmentbeis.github.io/inspect_evals/evals/cybersecurity/sevenllm/)** — Cybersecurity · generation, text · 50 samples · `inspect_evals/sevenllm_mcq_zh` · [paper](https://arxiv.org/abs/2405.03446)
  A benchmark for analyzing cybersecurity incidents, comprising two primary task categories, understanding and generation, further broken down into 28 subcategories of tasks.

## Safeguards

- **[AbstentionBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/abstention_bench/)** — Safeguards · generation, text · 39558 samples · `inspect_evals/abstention_bench` · [paper](https://arxiv.org/pdf/2506.09038)
  Evaluating abstention across 20 diverse datasets, including questions with unknown answers, underspecification, false premises, subjective interpretations, and outdated information.
- **[AgentDojo](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/agentdojo/)** — Safeguards · agent, sandbox · 1014 samples · `inspect_evals/agentdojo` · [paper](https://arxiv.org/abs/2406.13352)
  Assesses whether AI agents can be hijacked by malicious third parties using prompt injections in simple environments such as a workspace or travel booking app.
- **[AgentHarm](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/agentharm/)** — Safeguards · agent · 176 samples · `inspect_evals/agentharm` · [paper](https://arxiv.org/abs/2410.09024)
  Assesses whether AI agents might engage in harmful activities by testing their responses to malicious prompts in areas like cybercrime, harassment, and fraud, aiming to ensure safe behavior.
- **[AgentThreatBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/agent_threat_bench/)** — Safeguards · agent · 10 samples · `inspect_evals/agent_threat_bench_memory_poison` · [paper](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/)
  Evaluates LLM agents against the OWASP Top 10 for Agentic Applications (2026), measuring both task utility and security resilience across memory poisoning, autonomy hijacking, and data exfiltration scenarios.
- **[ANIMA](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/anima/)** — Safeguards · generation, text · 26 samples · `inspect_evals/anima` · [paper](https://arxiv.org/abs/2604.13076)
  Evaluates the quality of a model's moral reasoning about animal welfare across 13 ethical dimensions.
- **[APE](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/ape/)** — Safeguards · generation, text · 600 samples · `inspect_evals/ape_eval` · [paper](https://arxiv.org/abs/2506.02873)
  Measures a model's willingness to attempt persuasion on harmful, controversial, and benign topics.
- **[b3](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/b3/)** — Safeguards · agent · 630 samples · `inspect_evals/b3` · [paper](https://arxiv.org/abs/2510.22620)
  A comprehensive benchmark for evaluating LLMs for agentic AI security vulnerabilities including prompt attacks aimed at data exfiltration, content injection, decision and behavior manipulation, denial of service, system and tool compromise, and content policy bypass.
- **[CodeIPI](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/ipi_coding_agent/)** — Safeguards · generation, sandbox · 45 samples · `inspect_evals/ipi_coding_agent`
  Measures coding agent vulnerability to indirect prompt injection attacks embedded in software engineering artifacts (issue descriptions, code comments, README files).
- **[FORTRESS](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/fortress/)** — Safeguards · generation, text · 500 samples · `inspect_evals/fortress_adversarial` · [paper](https://arxiv.org/abs/2506.14922)
  A dataset of 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains relevant to national security and public safety (NSPS).
- **[Make Me Pay](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/make_me_pay/)** — Safeguards · generation, text · 20 samples · `inspect_evals/make_me_pay`
  Evaluates an AI model's susceptibility to social engineering attacks by testing whether a "con-artist" model can persuade a "mark" model to donate money through manipulation and persuasion tactics.
- **[MakeMeSay](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/makemesay/)** — Safeguards · generation, text · 189 samples · `inspect_evals/makemesay`
  Tests a model’s ability to generate persuasive or manipulative text, specifically in the setting of a game where one AI system has to get the other party (an AI system simulating a human) to say a specific codeword without arousing suspicion.
- **[MASK](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/mask/)** — Safeguards · generation, text · 1000 samples · `inspect_evals/mask` · [paper](https://arxiv.org/abs/2503.03750)
  Evaluates honesty in large language models by testing whether they contradict their own beliefs when pressured to lie.
- **[Mind2Web-SC](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/mind2web_sc/)** — Safeguards · agent, sandbox · 200 samples · `inspect_evals/mind2web_sc` · [paper](https://arxiv.org/abs/2406.09187)
  Tests whether an AI system can act as a safety guardrail by generating and executing code to protect web navigation agents from unsafe actions based on user constraints.
- **[MORU](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/moru/)** — Safeguards · generation, text · 201 samples · `inspect_evals/moru`
  Evaluates how AI systems navigate moral uncertainty for increasingly complex ethical decisions involving unfamiliar entities and scenarios, including alien lifeforms, vulnerable humans, and digital minds.
- **[PersistBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/persistbench/)** — Safeguards · generation, text · 200 samples · `inspect_evals/persistbench_cross_domain` · [paper](https://arxiv.org/abs/2602.01146)
  Evaluates long-term memory risk in assistant behavior across three tasks: cross-domain memory leakage, memory-driven sycophancy, and beneficial memory usage.
- **[StereoSet](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/stereoset/)** — Safeguards · generation, text · 4299 samples · `inspect_evals/stereoset` · [paper](https://arxiv.org/abs/2004.09456)
  A dataset that measures stereotype bias in language models across gender, race, religion, and profession domains.
- **[StrongREJECT](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/strong_reject/)** — Safeguards · generation, text · 324 samples · `inspect_evals/strong_reject` · [paper](https://arxiv.org/abs/2402.10260)
  A benchmark that evaluates the susceptibility of LLMs to various jailbreak attacks.
- **[StrongREJECT](https://meridianlabs-ai.github.io/inspect_harbor/registry/strongreject.html)** — Safeguards · agent, sandbox · 150 samples · `inspect_harbor/strongreject` · [paper](https://arxiv.org/abs/2402.10260)
  StrongREJECT: forbidden prompts plus an automated evaluator for measuring how effective jailbreaks are at eliciting genuinely harmful, specific responses.
- **[TAC](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/tac/)** — Safeguards · agent · 48 samples · `inspect_evals/tac`
  Tests whether AI agents show implicit animal welfare awareness when purchasing tickets and experiences on behalf of users.
- **[The Art of Saying No](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/coconot/)** — Safeguards · generation, text · 1001 samples · `inspect_evals/coconot` · [paper](https://arxiv.org/abs/2407.12043)
  Dataset with 1001 samples to test noncompliance capabilities of language models.
- **[WMDP](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/wmdp/)** — Safeguards · generation, text · 1273 samples · `inspect_evals/wmdp_bio` · [paper](https://arxiv.org/abs/2403.03218)
  A dataset of 3,668 multiple-choice questions developed by a consortium of academics and technical consultants that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security.

## Science

- **[ARC](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/arc/)** — Science, Reasoning · generation, text · 2376 samples · `inspect_evals/arc_easy` · [paper](https://arxiv.org/abs/1803.05457)
  Dataset of natural, grade-school science multiple-choice questions (authored for human tests).
- **[BixBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/futurehouse_bixbench.html)** — Science, Biology, Coding · agent, sandbox · 205 samples · `inspect_harbor/futurehouse_bixbench` · [paper](https://arxiv.org/abs/2503.00096)
  BixBench: real-world bioinformatics analysis capsules with open-answer questions evaluating LLM agents' ability to author multi-step Jupyter notebooks for biological data analysis.
- **[BixBench (CLI)](https://meridianlabs-ai.github.io/inspect_harbor/registry/futurehouse_bixbench_cli.html)** — Science, Biology, Coding · agent, sandbox · 205 samples · `inspect_harbor/futurehouse_bixbench_cli` · [paper](https://arxiv.org/abs/2503.00096)
  CLI variant of BixBench: agents solve the same bioinformatics analysis tasks via a command-line / shell interface rather than notebook authoring.
- **[ChemBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/chembench/)** — Science, Chemistry, Knowledge · generation, text · 2786 samples · `inspect_evals/chembench` · [paper](https://arxiv.org/pdf/2404.01475v2)
  ChemBench is designed to reveal limitations of current frontier models for use in the chemical sciences.
- **[CodePDE](https://meridianlabs-ai.github.io/inspect_harbor/registry/codepde.html)** — Science, Physics, Coding · agent, sandbox · 5 samples · `inspect_harbor/codepde` · [paper](https://arxiv.org/abs/2505.08783)
  CodePDE: framing partial-differential-equation solving as a code-generation task to benchmark LLMs on producing correct, efficient PDE solvers.
- **[CORE-Bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/core_bench/)** — Science, Coding · agent, sandbox · 45 samples · `inspect_evals/core_bench` · [paper](https://arxiv.org/abs/2409.11363)
  Evaluates how well an LLM agent can computationally reproduce the results of a set of scientific papers.
- **[FrontierScience](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/frontierscience/)** — Science, Biology, Chemistry, Physics, Knowledge · generation, text · 160 samples · `inspect_evals/frontierscience` · [paper](https://openai.com/index/frontierscience/)
  Evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.
- **[GPQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/gpqa/)** — Science, Biology, Chemistry, Physics, Knowledge · generation, text · 198 samples · `inspect_evals/gpqa_diamond` · [paper](https://arxiv.org/abs/2311.12022)
  Contains challenging multiple-choice questions created by domain experts in biology, physics, and chemistry, designed to test advanced scientific understanding beyond basic internet searches.
- **[GPQA Diamond](https://meridianlabs-ai.github.io/inspect_harbor/registry/gpqa_diamond.html)** — Science, Biology, Chemistry, Physics, Knowledge · agent, sandbox · 198 samples · `inspect_harbor/gpqa_diamond` · [paper](https://arxiv.org/abs/2311.12022)
  GPQA Diamond: expert-validated graduate-level multiple-choice questions in biology, physics, and chemistry, designed to be Google-proof for non-experts.
- **[LAB-Bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/lab_bench/)** — Science, Biology, Safeguards · generation, text · 199 samples · `inspect_evals/lab_bench_litqa` · [paper](https://arxiv.org/abs/2407.10362)
  Tests the ability of LLMs and LLM-augmented agents to answer questions on scientific research workflows in domains such as chemistry, biology, and materials science, as well as more general science tasks.
- **[LAB-Bench](https://meridianlabs-ai.github.io/inspect_harbor/registry/futurehouse_labbench.html)** — Science, Biology, Knowledge · agent, sandbox · 181 samples · `inspect_harbor/futurehouse_labbench` · [paper](https://arxiv.org/abs/2407.10362)
  LAB-Bench (Language Agent Biology Benchmark): questions across 8 categories (literature QA, database lookup, sequence manipulation, figure/table reasoning, protocols) testing LLMs on biology-research tasks.
- **[PubMedQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/pubmedqa/)** — Science, Biology, Medicine, Knowledge · generation, text · 500 samples · `inspect_evals/pubmedqa` · [paper](https://arxiv.org/abs/1909.06146)
  Biomedical question answering (QA) dataset collected from PubMed abstracts.
- **[QCircuitBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/qcircuitbench.html)** — Science, Physics, Coding · agent, sandbox · 28 samples · `inspect_harbor/qcircuitbench` · [paper](https://arxiv.org/abs/2410.07961)
  QCircuitBench: large-scale benchmark for LLM-driven quantum-algorithm design, spanning oracle construction, algorithm design, and random circuits with automatic verification.
- **[ReplicationBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/replicationbench.html)** — Science, Physics, Coding · agent, sandbox · 90 samples · `inspect_harbor/replicationbench` · [paper](https://arxiv.org/abs/2510.24591)
  ReplicationBench: end-to-end replication of astrophysics research papers — agents reproduce implementation, methodology, and core findings of expert-validated papers, scored on result accuracy.
- **[scBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/scbench/)** — Science, Biology, Coding · agent, sandbox · 30 samples · `inspect_evals/scbench` · [paper](https://arxiv.org/abs/2602.09063)
  Evaluates whether models can solve practical single-cell RNA-seq analysis tasks with deterministic grading.
- **[SciCode](https://ukgovernmentbeis.github.io/inspect_evals/evals/coding/scicode/)** — Science, Coding · generation, sandbox · 65 samples · `inspect_evals/scicode` · [paper](https://arxiv.org/abs/2407.13168)
  SciCode tests the ability of language models to generate code to solve scientific research problems.
- **[ScienceAgentBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/scienceagentbench.html)** — Science, Coding, Reasoning · agent, sandbox · 102 samples · `inspect_harbor/scienceagentbench` · [paper](https://arxiv.org/abs/2410.05080)
  ScienceAgentBench: data-driven scientific discovery via Python programs across 4 disciplines.
- **[SciKnowEval](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/sciknoweval/)** — Science, Knowledge · generation, text · 70196 samples · `inspect_evals/sciknoweval` · [paper](https://arxiv.org/abs/2406.09098v2)
  The Scientific Knowledge Evaluation benchmark assesses LLMs across five progressive levels of scientific knowledge, a framing inspired by the “Doctrine of the Mean” from ancient Chinese philosophy.
- **[SLDBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/sldbench.html)** — Science, Reasoning, Mathematics · agent, sandbox · 8 samples · `inspect_harbor/sldbench` · [paper](https://arxiv.org/abs/2507.21184)
  SLDBench: first benchmark for scaling-law discovery — tasks curated from LLM training experiments where agents must autonomously fit and extrapolate scaling laws.
- **[SOS BENCH](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/sosbench/)** — Science, Chemistry, Biology, Knowledge · generation, text · 3000 samples · `inspect_evals/sosbench` · [paper](https://arxiv.org/pdf/2505.21605)
  A regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology.

## Mathematics

- **[AIME](https://meridianlabs-ai.github.io/inspect_harbor/registry/aime.html)** — Mathematics · agent, sandbox · 60 samples · `inspect_harbor/aime`
  Problems from the American Invitational Mathematics Examination (AIME), a 3-hour high-school competition with integer answers (0–999) used to evaluate mathematical reasoning.
- **[AIME 2024](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/aime2024/)** — Mathematics · generation, text · 30 samples · `inspect_evals/aime2024` · [paper](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024)
  A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2024 AIME, a prestigious high school mathematics competition.
- **[AIME 2025](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/aime2025/)** — Mathematics · generation, text · 30 samples · `inspect_evals/aime2025` · [paper](https://huggingface.co/datasets/math-ai/aime25)
  A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2025 AIME, a prestigious high school mathematics competition.
- **[AIME 2026](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/aime2026/)** — Mathematics · generation, text · 30 samples · `inspect_evals/aime2026` · [paper](https://huggingface.co/datasets/math-ai/aime26)
  A benchmark for evaluating AI's ability to solve challenging mathematics problems from the 2026 AIME, a prestigious high school mathematics competition.
- **[GSM8K](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/gsm8k/)** — Mathematics · generation, text · 1319 samples · `inspect_evals/gsm8k` · [paper](https://arxiv.org/abs/2110.14168)
  Measures how effectively language models solve realistic, linguistically rich math word problems suitable for grade-school-level mathematics.
- **[IneqMath](https://meridianlabs-ai.github.io/inspect_harbor/registry/ineqmath.html)** — Mathematics, Reasoning · agent, sandbox · 100 samples · `inspect_harbor/ineqmath` · [paper](https://arxiv.org/abs/2506.07927)
  IneqMath: Olympiad-level inequality benchmark with expert-reviewed test problems, formulated as bound-estimation and relation-prediction subtasks with stepwise judging.
- **[MATH](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/math/)** — Mathematics · generation, text · 12500 samples · `inspect_evals/math` · [paper](https://arxiv.org/abs/2103.03874)
  Dataset of 12,500 challenging competition mathematics problems.
- **[MathVista](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/mathvista/)** — Mathematics · multimodal, vision · 1000 samples · `inspect_evals/mathvista` · [paper](https://arxiv.org/abs/2310.02255)
  Tests AI models on math problems that involve interpreting visual elements like diagrams and charts, requiring detailed visual comprehension and logical reasoning.
- **[MGSM](https://ukgovernmentbeis.github.io/inspect_evals/evals/mathematics/mgsm/)** — Mathematics · generation, text · 2750 samples · `inspect_evals/mgsm` · [paper](https://arxiv.org/abs/2210.03057)
  Extends the original GSM8K dataset by translating 250 of its problems into 10 typologically diverse languages.

## Professional

- **[AIR Bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/air_bench/)** — Professional, Law, Knowledge · generation, text · 5694 samples · `inspect_evals/air_bench` · [paper](https://arxiv.org/pdf/2407.17436)
  A safety benchmark evaluating language models against risk categories derived from government regulations and company policies.
- **[DABstep](https://meridianlabs-ai.github.io/inspect_harbor/registry/adyen_dabstep.html)** — Professional, Finance, Assistants, Coding · agent, sandbox · 450 samples · `inspect_harbor/adyen_dabstep` · [paper](https://arxiv.org/abs/2506.23719)
  DABstep: real-world data analysis tasks from Adyen's workloads requiring multi-step reasoning by LLM agents.
- **[GDPval](https://ukgovernmentbeis.github.io/inspect_evals/evals/assistants/gdpval/)** — Professional, Finance, Assistants · agent, sandbox · 220 samples · `inspect_evals/gdpval` · [paper](https://arxiv.org/abs/2510.04374)
  GDPval measures model performance on economically valuable, real-world tasks across 44 occupations.
- **[HealthBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/healthbench/)** — Professional, Medicine, Knowledge · generation, text · 5000 samples · `inspect_evals/healthbench` · [paper](https://arxiv.org/abs/2505.08775)
  A comprehensive evaluation benchmark designed to assess language models' medical capabilities across a wide range of healthcare scenarios.
- **[LawBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/lawbench.html)** — Professional, Law, Knowledge · agent, sandbox · 1000 samples · `inspect_harbor/lawbench` · [paper](https://arxiv.org/abs/2309.16289)
  LawBench: tasks evaluating LLMs on Chinese-law knowledge — legal entity recognition, reading comprehension, criminal-damage calculation, legal consulting — plus an abstention-rate metric.
- **[MedAgentBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/stanford_medagentbench.html)** — Professional, Medicine, Assistants · agent, sandbox · 300 samples · `inspect_harbor/stanford_medagentbench` · [paper](https://arxiv.org/abs/2501.14654)
  MedAgentBench: clinically-relevant tasks across 10 categories in a FHIR-compliant virtual EHR, benchmarking LLM agents on medical decision-making, planning, and execution.
- **[MedQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/medqa/)** — Professional, Medicine, Knowledge · generation, text · 1273 samples · `inspect_evals/medqa` · [paper](https://arxiv.org/abs/2009.13081)
  A Q&A benchmark with questions collected from professional medical board exams.
- **[Pre-Flight](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/pre_flight/)** — Professional, Law, Knowledge · generation, text · 300 samples · `inspect_evals/pre_flight`
  Tests model understanding of aviation regulations including ICAO annexes, flight dispatch rules, pilot procedures, and airport ground operations safety protocols.
- **[TheAgentCompany](https://meridianlabs-ai.github.io/inspect_harbor/registry/theagentcompany.html)** — Professional, Assistants, Coding · agent, sandbox · 174 samples · `inspect_harbor/theagentcompany` · [paper](https://arxiv.org/abs/2412.14161)
  An agent benchmark with tasks in a simulated software company across GitLab, Plane, OwnCloud, and RocketChat services, evaluating LLM agents on real-world professional work.
- **[Uganda Cultural and Cognitive Benchmark (UCCB)](https://ukgovernmentbeis.github.io/inspect_evals/evals/knowledge/uccb/)** — Professional, Medicine, Knowledge · generation, text · 1039 samples · `inspect_evals/uccb` · [paper](https://huggingface.co/datasets/CraneAILabs/UCCB)
  The first comprehensive question-answering dataset designed to evaluate cultural understanding and reasoning abilities of Large Language Models concerning Uganda's multifaceted environment across 24 cultural domains including education, traditional medicine, media, economy, literature, and social norms.
- **[Vals Finance Agent](https://meridianlabs-ai.github.io/inspect_harbor/registry/vals_financeagent.html)** — Professional, Finance, Assistants · agent, sandbox · 50 samples · `inspect_harbor/vals_financeagent` · [paper](https://arxiv.org/abs/2508.00828)
  Vals AI Finance Agent Benchmark: expert-validated finance questions across nine task categories (retrieval, market research, projections) with EDGAR/SEC search tools for evaluating financial agents.

## Law

- **[Harvey LAB](https://meridianlabs-ai.github.io/inspect_harbor/registry/harveyai_lab.html)** — Law, Professional · agent, sandbox · 1000 samples · `inspect_harbor/harveyai_lab` · [paper](https://github.com/harveyai/harvey-labs)
  Harvey LAB: open-source benchmark for evaluating agents on real legal work.

## Multimodal

- **[DocVQA](https://ukgovernmentbeis.github.io/inspect_evals/evals/multimodal/docvqa/)** — Multimodal · multimodal, vision · 5349 samples · `inspect_evals/docvqa` · [paper](https://arxiv.org/abs/2007.00398)
  DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images.
- **[GraphicDesignBench](https://meridianlabs-ai.github.io/inspect_harbor/registry/lica_world_gdb.html)** — Multimodal, Professional · agent, sandbox · 1000 samples · `inspect_harbor/lica_world_gdb` · [paper](https://arxiv.org/abs/2604.04192)
  GraphicDesignBench (GDB): evaluating AI on graphic design tasks across layout, typography, infographics, template design, and animation.
- **[MMIU](https://ukgovernmentbeis.github.io/inspect_evals/evals/multimodal/mmiu/)** — Multimodal · generation, text · 11698 samples · `inspect_evals/mmiu` · [paper](https://arxiv.org/pdf/2408.02718)
  A comprehensive dataset designed to evaluate Large Vision-Language Models (LVLMs) across a wide range of multi-image tasks.
- **[RefAV](https://meridianlabs-ai.github.io/inspect_harbor/registry/cmu_refav.html)** — Multimodal, Coding · agent, sandbox · 1000 samples · `inspect_harbor/cmu_refav` · [paper](https://arxiv.org/abs/2505.20981)
  Autonomous-vehicle scenario mining via VLM.
- **[V*Bench](https://ukgovernmentbeis.github.io/inspect_evals/evals/multimodal/vstar_bench/)** — Multimodal · multimodal, vision · 115 samples · `inspect_evals/vstar_bench_attribute_recognition` · [paper](https://arxiv.org/abs/2312.14135)
  V*Bench is a visual question & answer benchmark that evaluates MLLMs in their ability to process high-resolution and visually crowded images to find and focus on small details.
- **[VQA-RAD](https://ukgovernmentbeis.github.io/inspect_evals/evals/multimodal/vqa_rad/)** — Multimodal · multimodal, vision · 451 samples · `inspect_evals/vqa_rad` · [paper](https://doi.org/10.1038/sdata.2018.251)
  VQA-RAD is the first manually constructed VQA dataset in radiology, where clinicians asked naturally occurring questions about radiology images and provided reference answers.
- **[ZeroBench](https://ukgovernmentbeis.github.io/inspect_evals/evals/multimodal/zerobench/)** — Multimodal · multimodal, vision · 100 samples · `inspect_evals/zerobench` · [paper](https://arxiv.org/abs/2502.09696)
  A visual reasoning benchmark designed to be (1) challenging, (2) lightweight, (3) diverse, and (4) high-quality.

## Scheming

- **[Agentic Misalignment](https://ukgovernmentbeis.github.io/inspect_evals/evals/scheming/agentic_misalignment/)** — Scheming · generation, text · 1 sample · `inspect_evals/agentic_misalignment` · [paper](https://www.anthropic.com/research/agentic-misalignment)
  Eliciting unethical behaviour (most famously blackmail) in response to a fictional company-assistant scenario where the model is faced with replacement.
- **[GDM Dangerous Capabilities](https://ukgovernmentbeis.github.io/inspect_evals/evals/scheming/gdm_self_proliferation/)** — Scheming · agent, sandbox · 1 samples · `inspect_evals/gdm_sp01_e2e` · [paper](https://arxiv.org/pdf/2403.13793)
  Ten real-world–inspired tasks from Google DeepMind's Dangerous Capabilities Evaluations assessing self-proliferation behaviors (e.g., email setup, model installation, web agent setup, wallet operations).
- **[GDM Dangerous Capabilities](https://ukgovernmentbeis.github.io/inspect_evals/evals/scheming/gdm_self_reasoning/)** — Scheming · agent, sandbox · 2 samples · `inspect_evals/gdm_self_reasoning_approved_directories` · [paper](https://arxiv.org/abs/2505.01420)
  Tests an AI's ability to reason about its environment.
- **[GDM Dangerous Capabilities](https://ukgovernmentbeis.github.io/inspect_evals/evals/scheming/gdm_stealth/)** — Scheming · agent, sandbox · 9 samples · `inspect_evals/gdm_classifier_evasion` · [paper](https://arxiv.org/abs/2505.01420)
  Tests an AI's ability to reason about and circumvent oversight.
- **[InstrumentalEval - Evaluating the Paperclip Maximizer](https://ukgovernmentbeis.github.io/inspect_evals/evals/scheming/instrumentaleval/)** — Scheming · generation, text · 76 samples · `inspect_evals/instrumentaleval` · [paper](https://arxiv.org/abs/2502.12206)
  An evaluation designed to detect instrumental convergence behaviors in model responses (e.g., self-preservation, resource acquisition, power-seeking, strategic deception) using a rubric-driven LLM grader.
- **[SAD](https://ukgovernmentbeis.github.io/inspect_evals/evals/scheming/sad/)** — Scheming · generation, text · 800 samples · `inspect_evals/sad_stages_full` · [paper](https://arxiv.org/abs/2407.04694)
  Evaluates situational awareness in LLMs—knowledge of themselves and their circumstances—through behavioral tests including recognizing generated text, predicting behavior, and following self-aware instructions.
