Log Dataframes

Note

Dataframe functions are currently in beta and are exported from the inspect_ai.analysis.beta module. The beta module will be preserved after final release so that code written against it now will continue to work after the beta.

Overview

Inspect eval logs have a hierarchical structure which is well suited to flexibly capturing all the elements of an evaluation. However, when analysing or visualising log data you will often want to transform logs into a dataframe. The inspect_ai.analysis module includes a variety of functions for extracting Pandas dataframes from logs, including:

| Function | Description |
|---|---|
| evals_df() | Evaluation level data (e.g. task, model, scores, etc.). One row per log file. |
| samples_df() | Sample level data (e.g. input, metadata, scores, errors, etc.). One row per sample, where each log file contains many samples. |
| messages_df() | Message level data (e.g. role, content, etc.). One row per message, where each sample contains many messages. |
| events_df() | Event level data (e.g. type, timing, content, etc.). One row per event, where each sample contains many events. |

Each function extracts a default set of columns; however, you can tailor column reading to work in whatever way you need for your analysis. Extracted dataframes can either be denormalised (e.g. if you want to immediately summarise or plot them) or normalised (e.g. if you are importing them into a SQL database).

Below we’ll walk through a few examples, then after that provide more in-depth documentation on customising how dataframes are read for various scenarios.

Basics

Reading Data

Use the evals_df() function to read a dataframe containing a row for each log file (note that we import from inspect_ai.analysis.beta since the dataframe functions are currently in beta):

# read logs from a given log directory
from inspect_ai.analysis.beta import evals_df
evals_df("logs")   
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 51 entries, eval_id to score_model_graded_qa_stderr

The default configuration for evals_df() reads a predefined set of columns. You can customise column reading in a variety of ways (covered below in Columns).

Use the samples_df() function to read a dataframe with a record for each sample across a set of log files. For example, here we read all of the samples in the “logs” directory:

from inspect_ai.analysis.beta import samples_df

samples_df("logs")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Columns: 13 entries, sample_id to retries

By default, samples_df() reads all of the columns in the EvalSampleSummary data structure (12 columns), along with the eval_id for linking back to the parent eval log file.
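
Because eval_id appears in both dataframes, sample rows can be joined back to their parent evals with a standard Pandas merge. For example (a minimal sketch):

from inspect_ai.analysis.beta import evals_df, samples_df

evals = evals_df("logs")
samples = samples_df("logs")

# join each sample to its parent eval (shared column names get suffixes)
merged = samples.merge(evals, on="eval_id", suffixes=("_sample", "_eval"))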

Column Groups

When reading dataframes, there are a number of pre-built column groups you can use to read various subsets of columns. For example:

from inspect_ai.analysis.beta import (
    EvalInfo, EvalModel, EvalResults, evals_df
)

evals_df(
    logs="logs", 
    columns=EvalInfo + EvalModel + EvalResults
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 23 entries, eval_id to score_headline_value

This dataframe has 23 columns rather than the 51 we saw when using the default evals_df() configuration, reflecting the explicit column groups specified.

You can also use column groups to join columns from multiple tables for analysis or plotting. For example, here we include eval level data along with each sample:

from inspect_ai.analysis.beta import (
    EvalInfo, EvalModel, SampleSummary, samples_df
)

samples_df(
    logs="logs", 
    columns=EvalInfo + EvalModel + SampleSummary
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Columns: 27 entries, sample_id to retries

This dataframe has 27 columns rather than the 13 we saw for the default samples_df() behaviour, reflecting the additional eval level columns. You can create your own column groups and definitions to further customise reading (see Columns for details).

Filtering Logs

The above examples read all of the logs within a given directory. You can also use the list_eval_logs() function to filter the list of logs based on arbitrary criteria, as well as control whether log listings are recursive.

For example, here we read only log files with a status of “success”:

# read only successful logs from a given log directory
from inspect_ai.log import list_eval_logs

logs = list_eval_logs("logs", filter=lambda log: log.status == "success")
evals_df(logs)

Here we read only logs with the task name “popularity”:

# read only logs with task name 'popularity'
from inspect_ai.log import EvalLog

def task_filter(log: EvalLog) -> bool:
    return log.eval.task == "popularity"
    
logs = list_eval_logs("logs", filter=task_filter)
evals_df(logs)

We can also choose to read a directory non-recursively:

# read only the logs at the top level of 'logs'
logs = list_eval_logs("logs", recursive=False)
evals_df(logs)

Databases

You can also read multiple dataframes and combine them into a relational database. Imported dataframes automatically include fields that can be used to join them (e.g. eval_id is in both the evals and samples tables).

For example, here we read eval and sample level data from a log directory and import both tables into a DuckDB database:

import duckdb
from inspect_ai.analysis.beta import evals_df, samples_df

con = duckdb.connect()
con.register('evals', evals_df("logs"))
con.register('samples', samples_df("logs"))

We can now execute a query to find all samples generated using the google provider:

result = con.execute("""
    SELECT * 
    FROM evals e
    JOIN samples s ON e.eval_id = s.eval_id
    WHERE e.model LIKE 'google/%'
""").fetchdf()

Columns

The examples above all use built-in column specifications (e.g. EvalModel, EvalResults, SampleSummary, etc.). These specifications exist as a convenient starting point but can be replaced fully or partially by your own custom definitions.

Column definitions specify how JSON data is mapped into dataframe columns, and are specified using subclasses of the Column class (e.g. EvalColumn, SampleColumn). For example, here is the definition of the built-in EvalTask column group:

EvalTask: list[Column] = [
    EvalColumn("task_name", path="eval.task", required=True),
    EvalColumn("task_version", path="eval.task_version", required=True),
    EvalColumn("task_file", path="eval.task_file"),
    EvalColumn("task_attribs", path="eval.task_attribs"),
    EvalColumn("task_arg_*", path="eval.task_args"),
    EvalColumn("solver", path="eval.solver"),
    EvalColumn("solver_args", path="eval.solver_args"),
    EvalColumn("sandbox_type", path="eval.sandbox.type"),
    EvalColumn("sandbox_config", path="eval.sandbox.config"),
]

Columns are defined with a name, a path (the location within the JSON to read their value from), and other options (e.g. required, type, etc.). Column paths use JSON Path expressions to indicate how values should be read from the JSON.

Many fields within eval logs are optional, and path expressions will automatically resolve to None when they include a missing field (unless the required=True option is specified).

Here are all of the options available for Column definitions:

Column Options

| Parameter | Type | Description |
|---|---|---|
| name | str | Column name for the dataframe. Can include wildcard characters (e.g. task_arg_*) for mapping dictionaries into multiple columns. |
| path | str \| JSONPath | Path into the JSON to extract the column from (uses JSON Path expressions). Subclasses also implement path handlers that take e.g. an EvalLog and return a value. |
| required | bool | Is the field required (i.e. should an error occur if it is not found)? |
| default | JsonValue | Default value to yield if the field or its parents are not found in the JSON. |
| type | Type[ColumnType] | Validation check and directive to attempt to coerce the data into the specified type. Coercion from str to other types is done after interpreting the string using YAML (e.g. "true" -> True). |
| value | Callable[[JsonValue], JsonValue] | Function used to transform the value read from the JSON into a value for the dataframe (e.g. converting a list to a comma-separated str). |

Here are some examples that demonstrate the use of various options:

# required field
EvalColumn("run_id", path="eval.run_id", required=True)

# coerce field from int to str
SampleColumn("id", path="id", required=True, type=str)

# split metadata dict into multiple columns
SampleColumn("metadata_*", path="metadata")

# transform list[str] to str
SampleColumn("target", path="target", value=list_as_str)

Column Merging

If a column name is repeated within a list of columns, then the column definition encountered last is utilised. This makes it straightforward to override default column definitions. For example, here we override the behaviour of the default sample metadata columns (keeping metadata as JSON rather than splitting it into multiple columns):

samples_df(
    logs="logs",
    columns=SampleSummary + [SampleColumn("metadata", path="metadata")]
)

Strict Mode

By default, dataframes are read in strict mode, which means that if fields are missing or paths are invalid, an error is raised and the import is aborted. You can optionally set strict=False, in which case importing will proceed and return a tuple containing the pd.DataFrame and a list of any errors encountered. For example:

from inspect_ai.analysis.beta import evals_df

evals, errors = evals_df("logs", strict=False)
if len(errors) > 0:
    print(errors)

Evals

EvalColumns defines a default set of roughly 50 columns to read from the top level of an eval log. EvalColumns is in turn composed of several sets of column definitions that can be used independently, including:

| Type | Description |
|---|---|
| EvalInfo | Descriptive information (e.g. created, tags, metadata, git commit, etc.). |
| EvalTask | Task configuration (name, file, args, solver, etc.). |
| EvalModel | Model name, args, generation config, etc. |
| EvalDataset | Dataset name, location, sample ids, etc. |
| EvalConfig | Epochs, approval, sample limits, etc. |
| EvalResults | Status, errors, samples completed, headline metric. |
| EvalScores | All scores and metrics broken into separate columns. |

Multi-Columns

The task_args dictionary and eval scores data structure are both expanded into multiple columns by default:

EvalColumn("task_arg_*", path="eval.task_args")
EvalColumn("score_*_*", path=eval_log_scores_dict)

Note that scores are a two-level dictionary (scorer, then metric), expanded into columns named score_<scorer>_<metric> and extracted using a custom function. If you want to handle scores in a different way you can build your own set of eval columns with a custom scores handler. For example, here we take a subset of eval columns along with our own custom handler (custom_scores_fn) for scores:

evals_df(
    logs="logs", 
    columns=(
        EvalInfo
        + EvalModel
        + EvalResults
        + [EvalColumn("score_*_*", path=custom_scores_fn)]
    )
)
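
A custom handler could look something like this minimal sketch, which keeps only each scorer's accuracy metric (the choice of metric is illustrative):

def custom_scores_fn(log: EvalLog) -> JsonValue:
    # no scores if the evaluation has no results
    if log.results is None:
        return None

    # one dict per scorer, keeping only the 'accuracy' metric
    return [
        {score.name: {"accuracy": metric.value}}
        for score in log.results.scores
        for metric in score.metrics.values()
        if metric.name == "accuracy"
    ]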

Custom Extraction

The example above demonstrates the use of custom extraction functions, which take an EvalLog and return a JsonValue.

For example, here is the default extraction function for the dictionary of scores/metrics:

def scores_dict(log: EvalLog) -> JsonValue:
    if log.results is None:
        return None
    
    metrics: JsonValue = [
        {
            score.name: {
                metric.name: metric.value for metric in score.metrics.values()
            }
        }
        for score in log.results.scores
    ]
    return metrics

This function is then used in the definition of the EvalScores column group as follows:

EvalScores: list[Column] = [
    EvalColumn("score_*_*", path=scores_dict),
]

Samples

The samples_df() function can read from either sample summaries (EvalSampleSummary) or full sample records (EvalSample).

By default, the SampleSummary column group is used, which reads only from summaries, resulting in considerably higher performance than reading full samples.

SampleSummary: list[Column] = [
    SampleColumn("id", path="id", required=True, type=str),
    SampleColumn("epoch", path="epoch", required=True),
    SampleColumn("input", path=sample_input_as_str, required=True),
    SampleColumn("target", path="target", required=True, value=list_as_str),
    SampleColumn("metadata_*", path="metadata"),
    SampleColumn("score_*", path="scores", value=score_values),
    SampleColumn("model_usage", path="model_usage"),
    SampleColumn("total_time", path="total_time"),
    SampleColumn("working_time", path="total_time"),
    SampleColumn("error", path="error"),
    SampleColumn("limit", path="limit"),
    SampleColumn("retries", path="retries"),
]

If you want to read all of the messages contained in a sample into a string column, use the SampleMessages column group. For example, here we read the summary columns along with the messages:

from inspect_ai.analysis.beta import (
    SampleMessages, SampleSummary, samples_df
)

samples_df(
    logs="logs", 
    columns = SampleSummary + SampleMessages
)

Note that reading SampleMessages requires reading full sample content, so will take considerably longer than reading only summaries.

When you create a samples dataframe, the eval_id of its parent evaluation is automatically included. You can additionally include other fields from the evals table, for example:

samples_df(
    logs="logs", 
    columns = EvalModel + SampleSummary + SampleMessages
)

Multi-Columns

Note that the metadata and score columns are both dictionaries that are expanded into multiple columns:

SampleColumn("metadata_*", path="metadata")
SampleColumn("score_*", path="scores", value=score_values)

This might or might not be what you want for your data frame. To preserve them as JSON, remove the _*:

SampleColumn("metadata", path="metadata")
SampleColumn("score", path="scores")

You could also write a custom extraction handler to read them in some other way.
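
For example, a custom handler along these lines could read only selected metadata keys (a sketch; the key names are hypothetical, and custom extraction is covered in more detail below):

def metadata_subset(summary: EvalSampleSummary) -> JsonValue:
    # keep only a couple of (hypothetical) keys from sample metadata
    metadata = summary.metadata or {}
    return {key: metadata.get(key) for key in ("category", "difficulty")}

SampleColumn("metadata_*", path=metadata_subset)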

Full Samples

SampleColumn will automatically determine whether it references a field that requires a full sample read (for example, messages or store). Five fields in sample summaries have a reduced footprint in the summary (input, metadata, scores, error, and limit); for these fields, specify full=True to force reading from the full sample record. For example:

SampleColumn("limit_type", path="limit.type", full=True)
SampleColumn("limit_value", path="limit.limit", full=True)

Custom Extraction

As with EvalColumn, you can also extract data from a sample using a callback function passed as the path:

def model_reasoning_tokens(summary: EvalSampleSummary) -> JsonValue:
    # sum reasoning tokens across all models used by the sample
    # (assumes the ModelUsage.reasoning_tokens field)
    return sum(
        usage.reasoning_tokens or 0
        for usage in (summary.model_usage or {}).values()
    )

SampleColumn("model_reasoning_tokens", path=model_reasoning_tokens)

Sample summaries were enhanced in version 0.3.93 (May 1, 2025) to include the metadata, model_usage, total_time, working_time, and retries fields. If you need to read any of these values you can update older logs with the new fields by round-tripping them through inspect log convert. For example:

$ inspect log convert ./logs --to eval --output-dir ./logs-amended

Messages

The messages_df() function enables reading message level data from a set of eval logs. Each row corresponds to a message, and includes a sample_id and eval_id for linking back to its parents.

The messages_df() function takes a filter parameter which can either be a list of role designations or a function that performs filtering. For example:

assistant_messages = messages_df("logs", filter=["assistant"])
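
Filter functions receive each message and return a bool. For example, here is a minimal sketch that keeps only assistant messages containing tool calls (assuming, as in the extraction functions shown below, that the filter is passed ChatMessage objects):

from inspect_ai.model import ChatMessage, ChatMessageAssistant

def has_tool_calls(message: ChatMessage) -> bool:
    # keep assistant messages that made at least one tool call
    return isinstance(message, ChatMessageAssistant) and bool(message.tool_calls)

tool_call_messages = messages_df("logs", filter=has_tool_calls)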

Default Columns

The default MessageColumns includes MessageContent and MessageToolCalls:

MessageContent: list[Column] = [
    MessageColumn("role", path="role", required=True),
    MessageColumn("content", path=message_text),
    MessageColumn("source", path="source"),
]

MessageToolCalls: list[Column] = [
    MessageColumn("tool_calls", path=message_tool_calls),
    MessageColumn("tool_call_id", path="tool_call_id"),
    MessageColumn("tool_call_function", path="function"),
    MessageColumn("tool_call_error", path="error.message"),
]

MessageColumns: list[Column] = MessageContent + MessageToolCalls

When you create a messages dataframe, the parent sample_id and eval_id are automatically included in each record. You can additionally include other fields from these tables, for example:

messages = messages_df(
    logs="logs",
    columns=EvalModel + MessageColumns             
)

Custom Extraction

Two of the fields above are resolved using custom extraction functions (content and tool_calls). Here is the source code for those functions:

def message_text(message: ChatMessage) -> str:
    return message.text

def message_tool_calls(message: ChatMessage) -> str | None:
    if isinstance(message, ChatMessageAssistant) and message.tool_calls is not None:
        tool_calls = "\n".join(
            [
                format_function_call(
                    tool_call.function, tool_call.arguments, width=1000
                )
                for tool_call in message.tool_calls
            ]
        )
        return tool_calls
    else:
        return None

Events

The events_df() function enables reading event level data from a set of eval logs. Each row corresponds to an event, and includes a sample_id and eval_id for linking back to its parents.

Because events are so heterogeneous, there is no default column specification for calls to events_df(). Rather, you can compose columns from the following pre-built groups:

| Type | Description |
|---|---|
| EventInfo | Event type and span id. |
| EventTiming | Start and end times (both clock time and working time). |
| ModelEventColumns | Read data from model events. |
| ToolEventColumns | Read data from tool events. |

The events_df() function also takes a filter parameter which can either be a list of event types or a function that performs filtering. For example, to read all model events:

model_events = events_df(
    logs="logs", 
    columns=EventTiming + ModelEventColumns,
    filter=["model"]
)

To read all tool events:

tool_events = events_df(
    logs="logs", 
    columns=EvalModel + EventTiming + ToolEventColumns,
    filter=["tool"]
)

Note that for tool events we also include the EvalModel column group, as model information is not directly embedded in tool events (whereas it is within model events).
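
As with messages, the filter can be a function rather than a list of types. For example, here is a sketch that keeps only model events for a single provider (this assumes each event object carries an event discriminator field and, for model events, a model attribute):

openai_model_events = events_df(
    logs="logs",
    columns=EventTiming + ModelEventColumns,
    filter=lambda event: event.event == "model" and event.model.startswith("openai/")
)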

Custom

You can create custom column types that extract data based on additional parameters. For example, imagine you want to write a set of extraction functions that are passed a ReportConfig and an EvalLog (the report configuration might specify scores to extract, normalisation constraints, etc.).

Here we define a new ReportColumn class that derives from EvalColumn:

import functools
from typing import Callable
from pydantic import BaseModel, JsonValue

from inspect_ai.log import EvalLog
from inspect_ai.analysis.beta import EvalColumn

class ReportConfig(BaseModel):
    # config fields
    ...

class ReportColumn(EvalColumn):
    def __init__(
        self,
        name: str,
        config: ReportConfig,
        extract: Callable[[ReportConfig, EvalLog], JsonValue],
        *,
        required: bool = False,
    ) -> None:
        super().__init__(
            name=name,
            path=functools.partial(extract, config),
            required=required,
        )

The key here is using functools.partial to adapt the function that takes config and log into a function that takes log (which is what the EvalColumn class works with).

We can now create extraction functions that take a ReportConfig and an EvalLog and pass them to ReportColumn:

# read dict scores from log according to config
def read_scores(config: ReportConfig, log: EvalLog) -> JsonValue:
    ...

# config for a given report
config = ReportConfig(...)

# column that reads scores from log based on config
ReportColumn("score_*", config, read_scores)