Log Dataframes

Note

Dataframe functions are currently in beta and are exported from the inspect_ai.analysis.beta module. The beta module will be preserved after final release so that code written against it now will continue to work after the beta.

Overview

Inspect eval logs have a hierarchical structure which is well suited to flexibly capturing all the elements of an evaluation. However, when analysing or visualising log data you will often want to transform logs into a dataframe. The inspect_ai.analysis module includes a variety of functions for extracting Pandas dataframes from logs, including:

| Function | Description |
|---|---|
| evals_df() | Evaluation level data (e.g. task, model, scores, etc.). One row per log file. |
| samples_df() | Sample level data (e.g. input, metadata, scores, errors, etc.). One row per sample, where each log file contains many samples. |
| messages_df() | Message level data (e.g. role, content, etc.). One row per message, where each sample contains many messages. |
| events_df() | Event level data (e.g. type, timing, content, etc.). One row per event, where each sample contains many events. |

Each function extracts a default set of columns; however, you can tailor column reading to work in whatever way you need for your analysis. Extracted dataframes can either be denormalised (e.g. if you want to immediately summarise or plot them) or normalised (e.g. if you are importing them into a SQL database).

Below we’ll walk through a few examples, then after that provide more in-depth documentation on customising how dataframes are read for various scenarios.

Basics

Reading Data

Use the evals_df() function to read a dataframe containing a row for each log file (note that we import from inspect_ai.analysis.beta since the dataframe functions are currently in beta):

# read logs from a given log directory
from inspect_ai.analysis.beta import evals_df
evals_df("logs")   
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 51 entries, eval_id to score_model_graded_qa_stderr

The default configuration for evals_df() reads a predefined set of columns. You can customise column reading in a variety of ways (covered below in Columns).

Use the samples_df() function to read a dataframe with a record for each sample across a set of log files. For example, here we read all of the samples in the “logs” directory:

from inspect_ai.analysis.beta import samples_df

samples_df("logs")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Columns: 13 entries, sample_id to retries

By default, samples_df() reads all of the columns in the EvalSampleSummary data structure (12 columns), along with the eval_id for linking back to the parent eval log file.
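
Because eval_id appears in both dataframes, sample rows can be joined back to their parent evals with a standard Pandas merge. For example (a minimal sketch):

from inspect_ai.analysis.beta import evals_df, samples_df

evals = evals_df("logs")
samples = samples_df("logs")

# join each sample to its parent eval (shared column names get suffixes)
merged = samples.merge(evals, on="eval_id", suffixes=("_sample", "_eval"))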

Column Groups

When reading dataframes, there are a number of pre-built column groups you can use to read various subsets of columns. For example:

from inspect_ai.analysis.beta import (
    EvalInfo, EvalModel, EvalResults, evals_df
)

evals_df(
    logs="logs", 
    columns=EvalInfo + EvalModel + EvalResults
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 23 entries, eval_id to score_headline_value

This dataframe has 23 columns rather than the 51 we saw when using the default evals_df() configuration, reflecting the explicit column groups specified.

You can also use column groups to join columns from multiple tables for analysis or plotting. For example, here we include eval level data along with each sample:

from inspect_ai.analysis.beta import (
    EvalInfo, EvalModel, SampleSummary, samples_df
)

samples_df(
    logs="logs", 
    columns=EvalInfo + EvalModel + SampleSummary
)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Columns: 27 entries, sample_id to retries

This dataframe has 27 columns rather than the 13 we saw for the default samples_df() behaviour, reflecting the additional eval level columns. You can create your own column groups and definitions to further customise reading (see Columns for details).

Filtering Logs

The above examples read all of the logs within a given directory. You can also use the list_eval_logs() function to filter the list of logs based on arbitrary criteria, as well as control whether log listings are recursive.

For example, here we read only log files with a status of “success”:

# read only successful logs from a given log directory
from inspect_ai.log import list_eval_logs

logs = list_eval_logs("logs", filter=lambda log: log.status == "success")
evals_df(logs)

Here we read only logs with the task name “popularity”:

# read only logs with task name 'popularity'
from inspect_ai.log import EvalLog

def task_filter(log: EvalLog) -> bool:
    return log.eval.task == "popularity"
    
logs = list_eval_logs("logs", filter=task_filter)
evals_df(logs)

We can also choose to read a directory non-recursively:

# read only the logs at the top level of 'logs'
logs = list_eval_logs("logs", recursive=False)
evals_df(logs)

Databases

You can also read multiple dataframes and combine them into a relational database. Imported dataframes automatically include fields that can be used to join them (e.g. eval_id is in both the evals and samples tables).

For example, here we read eval and sample level data from a log directory and import both tables into a DuckDB database:

import duckdb
from inspect_ai.analysis.beta import evals_df, samples_df

con = duckdb.connect()
con.register('evals', evals_df("logs"))
con.register('samples', samples_df("logs"))

We can now execute a query to find all samples generated using the google provider:

result = con.execute("""
    SELECT * 
    FROM evals e
    JOIN samples s ON e.eval_id = s.eval_id
    WHERE e.model LIKE 'google/%'
""").fetchdf()

Columns

The examples above all use built-in column specifications (e.g. EvalModel, EvalResults, SampleSummary, etc.). These specifications exist as a convenient starting point but can be replaced fully or partially by your own custom definitions.

Column definitions specify how JSON data is mapped into dataframe columns, and are specified using subclasses of the Column class (e.g. EvalColumn, SampleColumn). For example, here is the definition of the built-in EvalTask column group:

EvalTask: list[Column] = [
    EvalColumn("task_name", path="eval.task", required=True),
    EvalColumn("task_version", path="eval.task_version", required=True),
    EvalColumn("task_file", path="eval.task_file"),
    EvalColumn("task_attribs", path="eval.task_attribs"),
    EvalColumn("task_arg_*", path="eval.task_args"),
    EvalColumn("solver", path="eval.solver"),
    EvalColumn("solver_args", path="eval.solver_args"),
    EvalColumn("sandbox_type", path="eval.sandbox.type"),
    EvalColumn("sandbox_config", path="eval.sandbox.config"),
]

Columns are defined with a name, a path (the location within the JSON to read their value from), and other options (e.g. required, type, etc.). Column paths use JSON Path expressions to indicate how values should be read from the JSON.

Many fields within eval logs are optional, and path expressions will automatically resolve to None when they include a missing field (unless the required=True option is specified).

Here are all of the options available for Column definitions:

Column Options

| Parameter | Type | Description |
|---|---|---|
| name | str | Column name for the dataframe. Can include wildcard characters (e.g. task_arg_*) for mapping dictionaries into multiple columns. |
| path | str \| JSONPath | Path into the JSON to extract the column from (uses JSON Path expressions). Subclasses also implement path handlers that take e.g. an EvalLog and return a value. |
| required | bool | Is the field required (i.e. should an error occur if it is not found)? |
| default | JsonValue | Default value to yield if the field or its parents are not found in the JSON. |
| type | Type[ColumnType] | Validation check and directive to attempt to coerce the data into the specified type. Coercion from str to other types is done after interpreting the string using YAML (e.g. "true" -> True). |
| value | Callable[[JsonValue], JsonValue] | Function used to transform the value read from the JSON into a value for the dataframe (e.g. converting a list to a comma-separated str). |

Here are some examples that demonstrate the use of various options:

# required field
EvalColumn("run_id", path="eval.run_id", required=True)

# coerce field from int to str
SampleColumn("id", path="id", required=True, type=str)

# split metadata dict into multiple columns
SampleColumn("metadata_*", path="metadata")

# transform list[str] to str
SampleColumn("target", path="target", value=list_as_str)

Column Merging

If a column name is repeated within a list of columns, then the column definition encountered last is utilised. This makes it straightforward to override default column definitions. For example, here we override the behaviour of the default sample metadata columns (keeping metadata as JSON rather than splitting it into multiple columns):

samples_df(
    logs="logs",
    columns=SampleSummary + [SampleColumn("metadata", path="metadata")]
)

Strict Mode

By default, dataframes are read in strict mode, which means that if fields are missing or paths are invalid, an error is raised and the import is aborted. You can optionally set strict=False, in which case importing will proceed and return a tuple containing the pd.DataFrame and a list of any errors encountered. For example:

from inspect_ai.analysis.beta import evals_df

evals, errors = evals_df("logs", strict=False)
if len(errors) > 0:
    print(errors)

Evals

EvalColumns defines a default set of roughly 50 columns to read from the top level of an eval log. EvalColumns is in turn composed of several sets of column definitions that can be used independently, including:

| Type | Description |
|---|---|
| EvalInfo | Descriptive information (e.g. created, tags, metadata, git commit, etc.). |
| EvalTask | Task configuration (name, file, args, solver, etc.). |
| EvalModel | Model name, args, generation config, etc. |
| EvalDataset | Dataset name, location, sample ids, etc. |
| EvalConfig | Epochs, approval, sample limits, etc. |
| EvalResults | Status, errors, samples completed, headline metric. |
| EvalScores | All scores and metrics broken into separate columns. |

Multi-Columns

The task_args dictionary and eval scores data structure are both expanded into multiple columns by default:

EvalColumn("task_arg_*", path="eval.task_args")
EvalColumn("score_*_*", path=eval_log_scores_dict)

Note that scores are a two-level dictionary (scorer, then metric), expanded into columns named score_<scorer>_<metric> and extracted using a custom function. If you want to handle scores in a different way you can build your own set of eval columns with a custom scores handler. For example, here we take a subset of eval columns along with our own custom handler (custom_scores_fn) for scores:

evals_df(
    logs="logs", 
    columns=(
        EvalInfo
        + EvalModel
        + EvalResults
        + [EvalColumn("score_*_*", path=custom_scores_fn)]
    )
)
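
A custom handler could look something like this minimal sketch, which keeps only each scorer's accuracy metric (the choice of metric is illustrative):

def custom_scores_fn(log: EvalLog) -> JsonValue:
    # no scores if the evaluation has no results
    if log.results is None:
        return None

    # one dict per scorer, keeping only the 'accuracy' metric
    return [
        {score.name: {"accuracy": metric.value}}
        for score in log.results.scores
        for metric in score.metrics.values()
        if metric.name == "accuracy"
    ]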

Custom Extraction

The example above demonstrates the use of custom extraction functions, which take an EvalLog and return a JsonValue.

For example, here is the default extraction function for the dictionary of scores/metrics:

def scores_dict(log: EvalLog) -> JsonValue:
    if log.results is None:
        return None
    
    metrics: JsonValue = [
        {
            score.name: {
                metric.name: metric.value for metric in score.metrics.values()
            }
        }
        for score in log.results.scores
    ]
    return metrics

This function is then used in the definition of the EvalScores column group as follows:

EvalScores: list[Column] = [
    EvalColumn("score_*_*", path=scores_dict),
]

Samples

The samples_df() function can read from either sample summaries (EvalSampleSummary) or full sample records (EvalSample).

By default, the SampleSummary column group is used, which reads only from summaries, resulting in considerably higher performance than reading full samples.

SampleSummary: list[Column] = [
    SampleColumn("id", path="id", required=True, type=str),
    SampleColumn("epoch", path="epoch", required=True),
    SampleColumn("input", path=sample_input_as_str, required=True),
    SampleColumn("target", path="target", required=True, value=list_as_str),
    SampleColumn("metadata_*", path="metadata"),
    SampleColumn("score_*", path="scores", value=score_values),
    SampleColumn("model_usage", path="model_usage"),
    SampleColumn("total_time", path="total_time"),
    SampleColumn("working_time", path="total_time"),
    SampleColumn("error", path="error"),
    SampleColumn("limit", path="limit"),
    SampleColumn("retries", path="retries"),
]

If you want to read all of the messages contained in a sample into a string column, use the SampleMessages column group. For example, here we read the summary columns along with the messages:

from inspect_ai.analysis.beta import (
    SampleMessages, SampleSummary, samples_df
)

samples_df(
    logs="logs", 
    columns = SampleSummary + SampleMessages
)

Note that reading SampleMessages requires reading full sample content, so will take considerably longer than reading only summaries.

When you create a samples dataframe, the eval_id of its parent evaluation is automatically included. You can additionally include other fields from the evals table, for example:

samples_df(
    logs="logs", 
    columns = EvalModel + SampleSummary + SampleMessages
)

Multi-Columns

Note that the metadata and score columns are both dictionaries that are expanded into multiple columns:

SampleColumn("metadata_*", path="metadata")
SampleColumn("score_*", path="scores", value=score_values)

This might or might not be what you want for your data frame. To preserve them as JSON, remove the _*:

SampleColumn("metadata", path="metadata")
SampleColumn("score", path="scores")

You could also write a custom extraction handler to read them in some other way.
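
For example, a custom handler along these lines could read only selected metadata keys (a sketch; the key names are hypothetical, and custom extraction is covered in more detail below):

def metadata_subset(summary: EvalSampleSummary) -> JsonValue:
    # keep only a couple of (hypothetical) keys from sample metadata
    metadata = summary.metadata or {}
    return {key: metadata.get(key) for key in ("category", "difficulty")}

SampleColumn("metadata_*", path=metadata_subset)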

Full Samples

SampleColumn will automatically determine whether it references a field that requires a full sample read (for example, messages or store). Five fields in sample summaries have a reduced footprint in the summary (input, metadata, scores, error, and limit); for these fields, specify full=True to force reading from the full sample record. For example:

SampleColumn("limit_type", path="limit.type", full=True)
SampleColumn("limit_value", path="limit.limit", full=True)

Custom Extraction

As with EvalColumn, you can also extract data from a sample using a callback function passed as the path:

def model_reasoning_tokens(summary: EvalSampleSummary) -> JsonValue:
    # sum reasoning tokens across all models used by the sample
    # (assumes the ModelUsage.reasoning_tokens field)
    return sum(
        usage.reasoning_tokens or 0
        for usage in (summary.model_usage or {}).values()
    )

SampleColumn("model_reasoning_tokens", path=model_reasoning_tokens)

Sample summaries were enhanced in version 0.3.93 (May 1, 2025) to include the metadata, model_usage, total_time, working_time, and retries fields. If you need to read any of these values you can update older logs with the new fields by round-tripping them through inspect log convert. For example:

$ inspect log convert ./logs --to eval --output-dir ./logs-amended

Messages

The messages_df() function enables reading message level data from a set of eval logs. Each row corresponds to a message, and includes a sample_id and eval_id for linking back to its parents.

The messages_df() function takes a filter parameter which can either be a list of role designations or a function that performs filtering. For example:

assistant_messages = messages_df("logs", filter=["assistant"])
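
Filter functions receive each message and return a bool. For example, here is a minimal sketch that keeps only assistant messages containing tool calls (assuming, as in the extraction functions shown below, that the filter is passed ChatMessage objects):

from inspect_ai.model import ChatMessage, ChatMessageAssistant

def has_tool_calls(message: ChatMessage) -> bool:
    # keep assistant messages that made at least one tool call
    return isinstance(message, ChatMessageAssistant) and bool(message.tool_calls)

tool_call_messages = messages_df("logs", filter=has_tool_calls)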

Default Columns

The default MessageColumns includes MessageContent and MessageToolCalls:

MessageContent: list[Column] = [
    MessageColumn("role", path="role", required=True),
    MessageColumn("content", path=message_text),
    MessageColumn("source", path="source"),
]

MessageToolCalls: list[Column] = [
    MessageColumn("tool_calls", path=message_tool_calls),
    MessageColumn("tool_call_id", path="tool_call_id"),
    MessageColumn("tool_call_function", path="function"),
    MessageColumn("tool_call_error", path="error.message"),
]

MessageColumns: list[Column] = MessageContent + MessageToolCalls

When you create a messages dataframe, the parent sample_id and eval_id are automatically included in each record. You can additionally include other fields from these tables, for example:

messages = messages_df(
    logs="logs",
    columns=EvalModel + MessageColumns             
)

Custom Extraction

Two of the fields above are resolved using custom extraction functions (content and tool_calls). Here is the source code for those functions:

def message_text(message: ChatMessage) -> str:
    return message.text

def message_tool_calls(message: ChatMessage) -> str | None:
    if isinstance(message, ChatMessageAssistant) and message.tool_calls is not None:
        tool_calls = "\n".join(
            [
                format_function_call(
                    tool_call.function, tool_call.arguments, width=1000
                )
                for tool_call in message.tool_calls
            ]
        )
        return tool_calls
    else:
        return None

Events

The events_df() function enables reading event level data from a set of eval logs. Each row corresponds to an event, and includes a sample_id and eval_id for linking back to its parents.

Because events are so heterogeneous, there is no default column specification for calls to events_df(). Rather, you can compose columns from the following pre-built groups:

| Type | Description |
|---|---|
| EventInfo | Event type and span id. |
| EventTiming | Start and end times (both clock time and working time). |
| ModelEventColumns | Read data from model events. |
| ToolEventColumns | Read data from tool events. |

The events_df() function also takes a filter parameter which can either be a list of event types or a function that performs filtering. For example, to read all model events:

model_events = events_df(
    logs="logs", 
    columns=EventTiming + ModelEventColumns,
    filter=["model"]
)

To read all tool events:

tool_events = events_df(
    logs="logs", 
    columns=EvalModel + EventTiming + ToolEventColumns,
    filter=["tool"]
)

Note that for tool events we also include the EvalModel column group, as model information is not directly embedded in tool events (whereas it is within model events).
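
As with messages, the filter can be a function rather than a list of types. For example, here is a sketch that keeps only model events for a single provider (this assumes each event object carries an event discriminator field and, for model events, a model attribute):

openai_model_events = events_df(
    logs="logs",
    columns=EventTiming + ModelEventColumns,
    filter=lambda event: event.event == "model" and event.model.startswith("openai/")
)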

Custom

You can create custom column types that extract data based on additional parameters. For example, imagine you want to write a set of extraction functions that are passed a ReportConfig and an EvalLog (the report configuration might specify scores to extract, normalisation constraints, etc.).

Here we define a new ReportColumn class that derives from EvalColumn:

import functools
from typing import Callable
from pydantic import BaseModel, JsonValue

from inspect_ai.log import EvalLog
from inspect_ai.analysis.beta import EvalColumn

class ReportConfig(BaseModel):
    # config fields
    ...

class ReportColumn(EvalColumn):
    def __init__(
        self,
        name: str,
        config: ReportConfig,
        extract: Callable[[ReportConfig, EvalLog], JsonValue],
        *,
        required: bool = False,
    ) -> None:
        super().__init__(
            name=name,
            path=functools.partial(extract, config),
            required=required,
        )

The key here is using functools.partial to adapt the function that takes config and log into a function that takes log (which is what the EvalColumn class works with).

We can now create extraction functions that take a ReportConfig and an EvalLog and pass them to ReportColumn:

# read dict scores from log according to config
def read_scores(config: ReportConfig, log: EvalLog) -> JsonValue:
    ...

# config for a given report
config = ReportConfig(...)

# column that reads scores from log based on config
ReportColumn("score_*", config, read_scores)