Log Dataframes
Dataframe functions are currently in beta and are exported from the inspect_ai.analysis.beta module. The beta module will be preserved after final release so that code written against it now will continue to work after the beta.
Overview
Inspect eval logs have a hierarchical structure which is well suited to flexibly capturing all the elements of an evaluation. However, when analysing or visualising log data you will often want to transform logs into a dataframe. The inspect_ai.analysis module includes a variety of functions for extracting Pandas dataframes from logs, including:
Function | Description |
---|---|
evals_df() | Evaluation level data (e.g. task, model, scores, etc.). One row per log file. |
samples_df() | Sample level data (e.g. input, metadata, scores, errors, etc.). One row per sample, where each log file contains many samples. |
messages_df() | Message level data (e.g. role, content, etc.). One row per message, where each sample contains many messages. |
events_df() | Event level data (type, timing, content, etc.). One row per event, where each sample contains many events. |
Each function extracts a default set of columns, however you can tailor column reading to work in whatever way you need for your analysis. Extracted dataframes can either be denormalised (e.g. if you want to immediately summarise or plot them) or normalised (e.g. if you are importing them into a SQL database).
Below we’ll walk through a few examples, then after that provide more in-depth documentation on customising how dataframes are read for various scenarios.
Basics
Reading Data
Use the evals_df() function to read a dataframe containing a row for each log file (note that we import from inspect_ai.analysis.beta
since the dataframe functions are currently in beta):
# read logs from a given log directory
from inspect_ai.analysis.beta import evals_df
"logs") evals_df(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 51 entries, eval_id to score_model_graded_qa_stderr
The default configuration for evals_df() reads a predefined set of columns. You can customise column reading in a variety of ways (covered below in Columns).
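Since the result is a standard Pandas dataframe you can summarise it directly. For example, a minimal sketch that counts evals per model and status (this assumes the default model and status columns are present):

from inspect_ai.analysis.beta import evals_df

df = evals_df("logs")

# number of evals for each model/status combination
print(df.groupby(["model", "status"]).size())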
Use the samples_df() function to read a dataframe with a record for each sample across a set of log files. For example, here we read all of the samples in the “logs” directory:
from inspect_ai.analysis.beta import samples_df
"logs") samples_df(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Columns: 13 entries, sample_id to retries
By default, samples_df() reads all of the columns in the EvalSampleSummary data structure (12 columns), along with the eval_id for linking back to the parent eval log file.
Column Groups
When reading dataframes, there are a number of pre-built column groups you can use to read various subsets of columns. For example:
from inspect_ai.analysis.beta import (
EvalInfo, EvalModel, EvalResults, evals_df
)
evals_df(="logs",
logs=EvalInfo + EvalModel + EvalResults
columns )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 23 entries, eval_id to score_headline_value
This dataframe has 23 columns rather than the 51 we saw when using the default evals_df() configuration, reflecting the explicit column groups specified.
You can also use column groups to join columns for doing analysis or plotting. For example, here we include eval level data along with each sample:
from inspect_ai.analysis.beta import (
EvalInfo, EvalModel, SampleSummary, samples_df
)
samples_df(="logs",
logs=EvalInfo + EvalModel + SampleSummary
columns )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408 entries, 0 to 407
Columns: 27 entries, sample_id to retries
This dataframe has 27 columns rather than the 13 we saw for the default samples_df() behaviour, reflecting the additional eval level columns. You can create your own column groups and definitions to further customise reading (see Columns for details).
Filtering Logs
The above examples read all of the logs within a given directory. You can also use the list_eval_logs() function to filter the list of logs based on arbitrary criteria, as well as control whether log listings are recursive.
For example, here we read only log files with a status
of “success”:
from inspect_ai.log import list_eval_logs

# read only successful logs from a given log directory
logs = list_eval_logs("logs", filter=lambda log: log.status == "success")
evals_df(logs)
Here we read only logs with the task name “popularity”:
# read only logs with task name 'popularity'
def task_filter(log: EvalLog) -> bool:
    return log.eval.task == "popularity"

logs = list_eval_logs("logs", filter=task_filter)
evals_df(logs)
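You can filter on any field available in the log header. For example, a sketch that keeps only logs for a particular model provider (the openai/ prefix is purely illustrative):

# read only logs for models from a given provider
logs = list_eval_logs("logs", filter=lambda log: log.eval.model.startswith("openai/"))
evals_df(logs)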
We can also choose to read a directory non-recursively:
# read only the logs at the top level of 'logs'
= list_eval_logs("logs", recursive=False)
logs evals_df(logs)
Databases
You can also read multiple dataframes and combine them into a relational database. Imported dataframes automatically include fields that can be used to join them (e.g. eval_id
is in both the evals and samples tables).
For example, here we read eval and sample level data from a log directory and import both tables into a DuckDb database:
import duckdb
from inspect_ai.analysis.beta import evals_df, samples_df
con = duckdb.connect()
con.register('evals', evals_df("logs"))
con.register('samples', samples_df("logs"))
We can now execute a query to find all samples generated using the google
provider:
= con.execute("""
result SELECT *
FROM evals e
JOIN samples s ON e.eval_id = s.eval_id
WHERE e.model LIKE 'google/%'
""").fetchdf()
Columns
The examples above all use built-in column specifications (e.g. EvalModel, EvalResults, SampleSummary, etc.). These specifications exist as a convenient starting point but can be replaced fully or partially by your own custom definitions.
Column definitions specify how JSON data is mapped into dataframe columns, and are specified using subclasses of the Column class (e.g. EvalColumn, SampleColumn). For example, here is the definition of the built-in EvalTask column group:
EvalTask: list[Column] = [
    EvalColumn("task_name", path="eval.task", required=True),
    EvalColumn("task_version", path="eval.task_version", required=True),
    EvalColumn("task_file", path="eval.task_file"),
    EvalColumn("task_attribs", path="eval.task_attribs"),
    EvalColumn("task_arg_*", path="eval.task_args"),
    EvalColumn("solver", path="eval.solver"),
    EvalColumn("solver_args", path="eval.solver_args"),
    EvalColumn("sandbox_type", path="eval.sandbox.type"),
    EvalColumn("sandbox_config", path="eval.sandbox.config"),
]
Columns are defined with a name, a path (the location within the JSON to read their value from), and other options (e.g. required, type, etc.). Column paths use JSON Path expressions to indicate how they should be read from JSON.
Many fields within eval logs are optional, and path expressions will automatically resolve to None
when they include a missing field (unless the required=True
option is specified).
Here are all of the options available for Column definitions:
Column Options
Parameter | Type | Description |
---|---|---|
name | str | Column name for dataframe. Can include wildcard characters (e.g. task_arg_*) for mapping dictionaries into multiple columns. |
path | str \| JSONPath | Path into JSON to extract the column from (uses JSON Path expressions). Subclasses also implement path handlers that take e.g. an EvalLog and return a value. |
required | bool | Is the field required (i.e. should an error occur if it is not found). |
default | JsonValue | Default value to yield if the field or its parents are not found in JSON. |
type | Type[ColumnType] | Validation check and directive to attempt to coerce the data into the specified type. Coercion from str to other types is done after interpreting the string using YAML (e.g. "true" -> True). |
value | Callable[[JsonValue], JsonValue] | Function used to transform the value read from JSON into a value for the dataframe (e.g. converting a list to a comma-separated str). |
Here are some examples that demonstrate the use of various options:
# required field
EvalColumn("run_id", path="eval.run_id", required=True)

# coerce field from int to str
SampleColumn("id", path="id", required=True, type=str)

# split metadata dict into multiple columns
SampleColumn("metadata_*", path="metadata")

# transform list[str] to str
SampleColumn("target", path="target", value=list_as_str)
Column Merging
If a column name is repeated within a list of columns then the column definition encountered last is utilised. This makes it straightforward to override default column definitions. For example, here we override the behaviour of the default sample metadata
columns (keeping it as JSON rather than splitting it into multiple columns):
samples_df(="logs",
logs=SampleSummary + [SampleColumn("metadata", path="metadata")]
columns )
Strict Mode
By default, dataframes are read in strict
mode, which means that if fields are missing or paths are invalid an error is raised and the import is aborted. You can optionally set strict=False
, in which case importing will proceed and a tuple containing pd.DataFrame
and a list of any errors encountered is returned. For example:
from inspect_ai.analysis.beta import evals_df
= evals_df("logs", strict=False)
evals, errors if len(errors) > 0:
print(errors)
Evals
EvalColumns defines a default set of roughly 50 columns to read from the top level of an eval log. EvalColumns is in turn composed of several sets of column definitions that can be used independently. These include:
Type | Description |
---|---|
EvalInfo | Descriptive information (e.g. created, tags, metadata, git commit, etc.) |
EvalTask | Task configuration (name, file, args, solver, etc.) |
EvalModel | Model name, args, generation config, etc. |
EvalDataset | Dataset name, location, sample ids, etc. |
EvalConfig | Epochs, approval, sample limits, etc. |
EvalResults | Status, errors, samples completed, headline metric. |
EvalScores | All scores and metrics broken into separate columns. |
Multi-Columns
The task_args
dictionary and eval scores data structure are both expanded into multiple columns by default:
"task_arg_*", path="eval.task_args")
EvalColumn("score_*_*", path=eval_log_scores_dict) EvalColumn(
Note that scores are a two-level dictionary of score_<scorer>_<metric>
and are extracted using a custom function. If you want to handle scores a different way you can build your own set of eval columns with a custom scores handler. For example, here we take a subset of eval columns along with our own custom handler (custom_scores_fn
) for scores:
evals_df(="logs",
logs=(
columns
EvalInfo+ EvalModel
+ EvalResults
+ ([EvalColumn("score_*_*", path=custom_scores_fn)])
) )
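The custom_scores_fn referenced above is just a function that takes an EvalLog and returns a JsonValue. For example, a hypothetical implementation that keeps only the 'accuracy' metric for each scorer:

from pydantic import JsonValue
from inspect_ai.log import EvalLog

# hypothetical handler: one 'accuracy' entry per scorer
def custom_scores_fn(log: EvalLog) -> JsonValue:
    if log.results is None:
        return None
    return [
        {score.name: {"accuracy": score.metrics["accuracy"].value}}
        for score in log.results.scores
        if "accuracy" in score.metrics
    ]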
Custom Extraction
The example above demonstrates the use of custom extraction functions, which take an EvalLog and return a JsonValue
.
For example, here is the default extraction function for the dictionary of scores/metrics:
def scores_dict(log: EvalLog) -> JsonValue:
    if log.results is None:
        return None
    metrics: JsonValue = [
        {
            score.name: {
                metric.name: metric.value
                for metric in score.metrics.values()
            }
        }
        for score in log.results.scores
    ]
    return metrics
Which is then used in the definition of the EvalScores column group as follows:
EvalScores: list[Column] = [
    EvalColumn("score_*_*", path=scores_dict),
]
Samples
The samples_df() function can read from either sample summaries (EvalSampleSummary) or full sample records (EvalSample).
By default, the SampleSummary column group is used, which reads only from summaries, resulting in considerably higher performance than reading full samples.
SampleSummary: list[Column] = [
    SampleColumn("id", path="id", required=True, type=str),
    SampleColumn("epoch", path="epoch", required=True),
    SampleColumn("input", path=sample_input_as_str, required=True),
    SampleColumn("target", path="target", required=True, value=list_as_str),
    SampleColumn("metadata_*", path="metadata"),
    SampleColumn("score_*", path="scores", value=score_values),
    SampleColumn("model_usage", path="model_usage"),
    SampleColumn("total_time", path="total_time"),
    SampleColumn("working_time", path="working_time"),
    SampleColumn("error", path="error"),
    SampleColumn("limit", path="limit"),
    SampleColumn("retries", path="retries"),
]
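Because sample scores are expanded into score_* columns, per-scorer summaries fall out of standard Pandas operations. A minimal sketch (assuming the default columns above):

from inspect_ai.analysis.beta import samples_df

df = samples_df("logs")

# mean of each expanded score column across all samples
score_cols = [col for col in df.columns if col.startswith("score_")]
print(df[score_cols].mean(numeric_only=True))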
If you want to read all of the messages contained in a sample into a string column, use the SampleMessages column group. For example, here we read the summary field and the messages:
from inspect_ai.analysis.beta import (
SampleMessages, SampleSummary, samples_df
)
samples_df(="logs",
logs= SampleSummary + SampleMessages
columns )
Note that reading SampleMessages requires reading full sample content, so will take considerably longer than reading only summaries.
When you create a samples data frame the eval_id
of its parent evaluation is automatically included. You can additionally include other fields from the evals table, for example:
samples_df(="logs",
logs= EvalModel + SampleSummary + SampleMessages
columns )
Multi-Columns
Note that the metadata
and score
columns are both dictionaries that are expanded into multiple columns:
"metadata_*", path="metadata")
SampleColumn("score_*", path="scores", value=score_values) SampleColumn(
This might or might not be what you want for your data frame. To preserve them as JSON, remove the _*
:
"metadata", path="metadata")
SampleColumn("score", path="scores") SampleColumn(
You could also write a custom extraction handler to read them in some other way.
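For example, a hypothetical handler that reads the scores dict but keeps only each scorer's raw value:

from pydantic import JsonValue
from inspect_ai.analysis.beta import SampleColumn
from inspect_ai.log import EvalSampleSummary

# hypothetical handler: map scorer name -> raw score value
def score_values_only(summary: EvalSampleSummary) -> JsonValue:
    if summary.scores is None:
        return None
    return {name: score.value for name, score in summary.scores.items()}

SampleColumn("score_*", path=score_values_only)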
Full Samples
SampleColumn will automatically determine whether it is referencing a field that requires a full sample read (for example, messages or store). Five fields in sample summaries have a reduced footprint in the summary (input, metadata, scores, error, and limit); for these fields, specify full=True to force reading from the full sample record. For example:
"limit_type", path="limit.type", full=True)
SampleColumn("limit_value", path="limit.limit", full=True) SampleColumn(
Custom Extraction
As with EvalColumn, you can also extract data from a sample using a callback function passed as the path
:
def model_reasoning_tokens(summary: EvalSampleSummary) -> JsonValue:
    ## extract reasoning tokens from summary.model_usage
    ...

SampleColumn("model_reasoning_tokens", path=model_reasoning_tokens)
Sample summaries were enhanced in version 0.3.93 (May 1, 2025) to include the metadata
, model_usage
, total_time
, working_time
, and retries
fields. If you need to read any of these values you can update older logs with the new fields by round-tripping them through inspect log convert
. For example:
$ inspect log convert ./logs --to eval --output-dir ./logs-amended
Messages
The messages_df() function enables reading message level data from a set of eval logs. Each row corresponds to a message, and includes a sample_id
and eval_id
for linking back to its parents.
The messages_df() function takes a filter
parameter which can either be a list of role
designations or a function that performs filtering. For example:
= messages_df("logs", filter=["assistant"]) assistant_messages
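The filter can also be a function that receives each ChatMessage. For example, a sketch that keeps only assistant messages which made tool calls:

from inspect_ai.analysis.beta import messages_df
from inspect_ai.model import ChatMessage, ChatMessageAssistant

# keep only assistant messages that include tool calls
def has_tool_calls(message: ChatMessage) -> bool:
    return isinstance(message, ChatMessageAssistant) and bool(message.tool_calls)

tool_call_messages = messages_df("logs", filter=has_tool_calls)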
Default Columns
The default MessageColumns includes MessageContent and MessageToolCalls:
MessageContent: list[Column] = [
    MessageColumn("role", path="role", required=True),
    MessageColumn("content", path=message_text),
    MessageColumn("source", path="source"),
]

MessageToolCalls: list[Column] = [
    MessageColumn("tool_calls", path=message_tool_calls),
    MessageColumn("tool_call_id", path="tool_call_id"),
    MessageColumn("tool_call_function", path="function"),
    MessageColumn("tool_call_error", path="error.message"),
]

MessageColumns: list[Column] = MessageContent + MessageToolCalls
When you create a messages data frame the parent sample_id
and eval_id
are automatically included in each record. You can additionally include other fields from these tables, for example:
messages = messages_df(
    logs="logs",
    columns=EvalModel + MessageColumns
)
Custom Extraction
Two of the fields above are resolved using custom extraction functions (content
and tool_calls
). Here is the source code for those functions:
def message_text(message: ChatMessage) -> str:
    return message.text

def message_tool_calls(message: ChatMessage) -> str | None:
    if isinstance(message, ChatMessageAssistant) and message.tool_calls is not None:
        tool_calls = "\n".join(
            [
                format_function_call(
                    tool_call.function, tool_call.arguments, width=1000
                )
                for tool_call in message.tool_calls
            ]
        )
        return tool_calls
    else:
        return None
Events
The events_df() function enables reading event level data from a set of eval logs. Each row corresponds to an event, and includes a sample_id
and eval_id
for linking back to its parents.
Because events are so heterogeneous, there is no default columns
specification for calls to events_df(). Rather, you can compose columns from the following pre-built groups:
Type | Description |
---|---|
EventInfo | Event type and span id. |
EventTiming | Start and end times (both clock time and working time) |
ModelEventColumns | Read data from model events. |
ToolEventColumns | Read data from tool events. |
The events_df() function also takes a filter
parameter which can either be a list of event types or a function that performs filtering. For example, to read all model events:
model_events = events_df(
    logs="logs",
    columns=EventTiming + ModelEventColumns,
    filter=["model"]
)
To read all tool events:
tool_events = events_df(
    logs="logs",
    columns=EvalModel + EventTiming + ToolEventColumns,
    filter=["tool"]
)
Note that for tool events we also include the EvalModel column group as model information is not directly embedded in tool events (whereas it is within model events).
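Once the model column is joined in, per-model aggregation is straightforward. A minimal sketch using the tool_events dataframe above:

# count tool events per model (the 'model' column comes from the EvalModel group)
print(tool_events.groupby("model").size())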
Custom
You can create custom column types that extract data based on additional parameters. For example, imagine you want to write a set of extraction functions that are passed a ReportConfig and an EvalLog (the report configuration might specify scores to extract, normalisation constraints, etc.).
Here we define a new ReportColumn
class that derives from EvalColumn:
import functools
from typing import Callable
from pydantic import BaseModel, JsonValue
from inspect_ai.log import EvalLog
from inspect_ai.analysis.beta import EvalColumn
class ReportConfig(BaseModel):
    # config fields
    ...

class ReportColumn(EvalColumn):
    def __init__(
        self,
        name: str,
        config: ReportConfig,
        extract: Callable[[ReportConfig, EvalLog], JsonValue],
        *,
        required: bool = False,
    ) -> None:
        super().__init__(
            name=name,
            path=functools.partial(extract, config),
            required=required,
        )
The key here is using functools.partial to adapt the function that takes config
and log
into a function that takes log
(which is what the EvalColumn class works with).
We can now create extraction functions that take a ReportConfig
and an EvalLog and pass them to ReportColumn
:
# read dict scores from log according to config
def read_scores(config: ReportConfig, log: EvalLog) -> JsonValue:
    ...

# config for a given report
config = ReportConfig(...)

# column that reads scores from log based on config
ReportColumn("score_*", config, read_scores)