Dataset config
A dataset config is an optional YAML file that describes a dataset's
semantics to binoc and, if you want grouped output, tells a renderer how
to present the resulting changes. You do not need a config to run binoc diff:
the defaults handle built-in formats through the correspondence engine.
A config becomes useful when you want to:
- Declare dataset semantics, such as logical file correspondence and row identity, for plugins that understand those fields.
- Teach the Markdown renderer how to group plugin-specific tags for your domain.
- Configure a renderer's behavior (HTML theme, CI failure rules, …) without changing code.
Work in progress
Config key coverage is currently partial and will expand as renderer-specific config grows. If a key you need is missing here, check the sources referenced from each section or file an issue.
Top-level shape
dataset:
  files:
    correspondences:
      - name: running-list
        left:
          path_regex: '^(?P<list>running_list)_as_of_[0-9]{4}\.csv$'
        right:
          path_regex: '^(?P<list>running_list)_as_of_[0-9]{4}\.csv$'
        key: '${list}'
        logical_path: '${list}.csv'
        on_null_key: diagnostic
        on_duplicate_key: diagnostic
  tables:
    - logical_name: products
      columns: ['BLA Number', 'Product Number']
  correspondence:
    expand_renamed_unchanged_collections: true
output:
  markdown:
    groups:
      - heading: "Substantive changes"
        tags:
          - binoc.column-addition
          - binoc.column-removal
          - binoc.row-addition
          - binoc.content-changed
      - heading: "Clerical changes"
        tags:
          - binoc.column-reorder
          - binoc.whitespace-change
Passing this file via binoc diff A B --config dataset.yaml (or
through binoc.Config.from_file(path) in Python) applies it to the
run.
Rejected pipeline keys
The CLI no longer uses comparators, transformers, or
transformer_config to build the diff pipeline. Those keys are rejected if
they appear in config files; correspondence rules now own expansion, parsing,
pairing, edit writing, compaction, and projection.
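A pre-flight check for those keys can be sketched in a few lines of Python. This is only an illustration and not part of binoc: it assumes the config has already been parsed into a dictionary and that the rejected keys sit at the top level, which the text above does not spell out. The helper name find_rejected_keys is hypothetical.

```python
# Keys named in the section above as rejected by the CLI.
REJECTED_PIPELINE_KEYS = ("comparators", "transformers", "transformer_config")


def find_rejected_keys(config):
    """Return the rejected pipeline keys found at the top level of a parsed config.

    Assumption: only top-level keys are checked; binoc's own validation may
    look elsewhere.
    """
    return [key for key in REJECTED_PIPELINE_KEYS if key in config]


# A legacy-style config (hypothetical content) that still carries an old key.
legacy = {"comparators": ["csv"], "dataset": {"tables": []}}
print(find_rejected_keys(legacy))  # ['comparators']
```

Running a check like this before invoking binoc diff gives a migration hint earlier than the CLI's own rejection would.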
dataset
The dataset block is a top-level semantic description of the dataset being
compared. Core carries this value through unchanged and exposes it to plugins
under the dataset key in their run config; core does not interpret paths,
tables, delimiters, or keys.
The SDK owns a shared v1 shape so independently authored plugins can agree on common dataset semantics:
- dataset.files.correspondences declares that files with different snapshot paths are the same logical file.
- dataset.tables declares table row keys. It can be a list for the common case, or an object with defaults and entries when shared policy is useful.
- dataset.correspondence.expand_renamed_unchanged_collections controls a correspondence-engine performance tradeoff for renamed unchanged containers.
dataset:
  files:
    correspondences:
      - name: state-records
        left:
          path_regex: '^data/state_(?P<state>[A-Z]{2})\.csv$'
        right:
          path_regex: '^by-state/(?P<state>[A-Z]{2})/records\.csv$'
        key: '${state}'
        logical_path: 'states/${state}.csv'
        cardinality: one-to-one
        on_null_key: diagnostic
        on_duplicate_key: diagnostic
        report_path_change: false
  tables:
    - logical_name: products
      columns: ['BLA Number', 'Product Number']
    - path: data.csv
      columns: ['id']
  correspondence:
    expand_renamed_unchanged_collections: true
cardinality currently accepts only one-to-one. on_null_key and
on_duplicate_key accept diagnostic, error, or ignore; plugins decide how
to apply those policies for the semantics they implement.
dataset.files.correspondences
Declared correspondence rules tell binoc that an unmatched removed file and an
unmatched added file are the same logical file even though their paths differ.
When a rule produces one unambiguous left match and one unambiguous right match
for the same key, binoc links those side items in the correspondence engine and
lets normal writers explain the linked content. If report_path_change is
false and the content is identical, the projected node is pruned like any other
identical change. If report_path_change is true, binoc keeps a move-style path
change node even when the content is identical.
dataset:
  files:
    correspondences:
      - name: state-records
        left:
          path_regex: '^data/state_(?P<state>[A-Z]{2})\.csv$'
        right:
          path_regex: '^by-state/(?P<state>[A-Z]{2})/records\.csv$'
        key: '${state}'
        logical_path: 'states/${state}.csv'
        on_null_key: diagnostic
        on_duplicate_key: diagnostic
        report_path_change: false
path_regex uses named capture groups. key and logical_path are templates
that substitute captures as ${name}. If logical_path is omitted, the right
path is used. Null keys, duplicate keys, one-to-many, and many-to-one matches
are skipped by default with warning diagnostics; use error to emit an
error-severity diagnostic or ignore to silence those diagnostics. Error
diagnostics do not stop the snapshot comparison.
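The capture, template, and pairing steps described in this section can be sketched in Python, whose re module accepts the same (?P<name>...) syntax and whose string.Template substitutes ${name}. This is a stand-in for illustration, not binoc's implementation: the function names are hypothetical, and rendering logical_path from the right-side captures is an assumption.

```python
import re
from collections import defaultdict
from string import Template

# Rule fields copied from the state-records example above.
RULE = {
    "left": r"^data/state_(?P<state>[A-Z]{2})\.csv$",
    "right": r"^by-state/(?P<state>[A-Z]{2})/records\.csv$",
    "key": "${state}",
    "logical_path": "states/${state}.csv",
}


def keyed_paths(path_regex, key_template, paths):
    """Group the paths that match path_regex by their rendered key."""
    pattern = re.compile(path_regex)
    by_key = defaultdict(list)
    for path in paths:
        match = pattern.search(path)
        if match:
            key = Template(key_template).substitute(match.groupdict())
            by_key[key].append(path)
    return by_key


def pair_one_to_one(rule, removed_paths, added_paths):
    """Return {logical_path: (left_path, right_path)} for unambiguous keys only."""
    left = keyed_paths(rule["left"], rule["key"], removed_paths)
    right = keyed_paths(rule["right"], rule["key"], added_paths)
    right_pattern = re.compile(rule["right"])
    pairs = {}
    for key in left.keys() & right.keys():
        # Keys with more than one candidate on either side are skipped.
        if len(left[key]) == 1 and len(right[key]) == 1:
            # Assumption: logical_path is rendered from the right-side captures.
            captures = right_pattern.search(right[key][0]).groupdict()
            logical = Template(rule["logical_path"]).substitute(captures)
            pairs[logical] = (left[key][0], right[key][0])
    return pairs


pairs = pair_one_to_one(
    RULE,
    removed_paths=["data/state_CA.csv", "data/state_NY.csv", "data/readme.txt"],
    added_paths=["by-state/CA/records.csv", "by-state/TX/records.csv"],
)
print(pairs)  # {'states/CA.csv': ('data/state_CA.csv', 'by-state/CA/records.csv')}
```

Only CA is linked: NY has no right-side match, TX has no left-side match, and readme.txt matches no pattern, so those stay ordinary removals and additions.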
dataset.correspondence
The correspondence engine defaults to
expand_renamed_unchanged_collections: true. This is the correct-by-default
setting: if a folder or archive is renamed but its own contents are unchanged,
binoc still expands inside it so it can detect facts such as a file copied out
of that renamed collection.
Set it to false for the faster short-circuit posture. In that mode, renamed
unchanged collections can be settled without looking beneath them, so copy or
move provenance involving their children may be reported less specifically.
Decompression size caps
When binoc expands a .zip, .tar/.tgz, or .gz, it bounds the decompressed
output as a decompression-bomb defense. If a bundle exceeds a cap, expansion
fails with a clear diagnostic and binoc falls back to comparing the archive as
opaque bytes — so dataset semantics such as row keys never get applied inside it.
The defaults sit at GiB scale (per-entry 4 GiB, archive total 8 GiB, gzip
4 GiB) and handle multi-GB government bundles, but a very large bundle may still
need a higher ceiling. Each cap is a byte count and can be raised
independently:
dataset:
  correspondence:
    max_archive_entry_bytes: 6442450944 # 6 GiB: largest single member
    max_archive_total_bytes: 17179869184 # 16 GiB: whole-archive output
    max_gzip_bytes: 6442450944 # 6 GiB: single .gz stream
Omit a key to keep its default. Raise a cap only as high as your real data requires; the bound is what protects you from a maliciously crafted archive that expands to far more than its compressed size.
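The byte counts above are plain GiB multiples, and the cap itself is an instance of a general technique: decompress in chunks and stop once the output passes the bound. The Python sketch below shows both, using only the standard library. It is not binoc's code; gib, bounded_gunzip, and CapExceeded are hypothetical names chosen for this example.

```python
import gzip
import io

GIB = 1024 ** 3  # one gibibyte in bytes


def gib(count):
    """Convert a GiB count into the byte value the config keys expect."""
    return count * GIB


class CapExceeded(Exception):
    """Raised when decompressed output grows past the configured cap."""


def bounded_gunzip(compressed, max_bytes, chunk_size=64 * 1024):
    """Decompress a gzip payload, aborting once output exceeds max_bytes."""
    out = bytearray()
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            out.extend(chunk)
            # Checking after each chunk keeps the overshoot below chunk_size.
            if len(out) > max_bytes:
                raise CapExceeded(f"output exceeds {max_bytes} bytes")
    return bytes(out)


print(gib(6))   # 6442450944
print(gib(16))  # 17179869184
```

Because the check runs while streaming, a small compressed input that would expand enormously is rejected long before the full output is materialized.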
dataset.tables
Tabular row identity is optional. Without it, binoc.tabular_analyzer keeps
the positional fallback: row additions/removals are counted by row count
differences, and changed cells are compared at the same row offset. With
columns, the analyzer builds a table-local key and matches rows by that key
before reporting row additions, removals, and modified cells.
dataset:
  tables:
    - path: data.csv
      columns: ['id']
    - logical_name: products
      columns: ['BLA Number', 'Product Number']
    - path: workbook.xlsx
      logical_name: Products
      columns: ['id']
Use logical_name for logical table children produced by multi-table parsers
or collection parse rules. Use path or path_regex for ordinary
single-file CSVs. When an entry includes both a source selector and
logical_name, both must match; for example, the workbook.xlsx entry above
targets the Products logical table inside that source item.
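The selector rules above can be read as a small matching function. The Python sketch below is one possible reading, not binoc's implementation: it assumes exact, case-sensitive comparison for path and logical_name, a regex search for path_regex, and that the first matching entry wins. entry_matches and key_columns are hypothetical names.

```python
import re

# The three entries from the example above, as already-parsed dictionaries.
TABLES = [
    {"path": "data.csv", "columns": ["id"]},
    {"logical_name": "products", "columns": ["BLA Number", "Product Number"]},
    {"path": "workbook.xlsx", "logical_name": "Products", "columns": ["id"]},
]


def entry_matches(entry, source_path, logical_name=None):
    """Every selector present on the entry must match (assumed exact comparison)."""
    if "path" in entry and entry["path"] != source_path:
        return False
    if "path_regex" in entry and not re.search(entry["path_regex"], source_path):
        return False
    if "logical_name" in entry and entry["logical_name"] != logical_name:
        return False
    return True


def key_columns(source_path, logical_name=None):
    """Return the key columns of the first matching entry, or None."""
    for entry in TABLES:
        if entry_matches(entry, source_path, logical_name):
            return entry["columns"]
    return None


print(key_columns("data.csv"))                   # ['id']
print(key_columns("workbook.xlsx", "Products"))  # ['id']
print(key_columns("workbook.xlsx", "Summary"))   # None
```

The last call returns None because the workbook.xlsx entry requires both selectors to match, so that table would keep the positional fallback.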
When several entries should share the same identity failure policy, expand
tables to an object with defaults and entries:
dataset:
  tables:
    defaults:
      row_identity:
        on_null_key: diagnostic
        on_duplicate_key: error
    entries:
      - logical_name: applications
        columns: ['ApplNo']
      - path_regex: '^data/products\.csv$'
        columns: ['BLA Number', 'Product Number']
For unusual cases, entries also accept explicit match and row_identity
blocks, but the flat selector and columns form above is preferred.
Rows with blank key components are null-key rows. Keys that appear more
than once on either side are duplicate-key rows. The default
diagnostic policy emits warnings and leaves those rows unmatched so
normal add/remove counts still reflect them. error emits an error-severity
diagnostic without stopping the comparison. ignore suppresses the diagnostic
while keeping the same conservative matching behavior.
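Keyed matching with the conservative null-key and duplicate-key handling described above can be sketched as follows. This is an illustration under stated assumptions, not the analyzer's code: rows are dictionaries of strings, a component counts as blank when it is empty or whitespace only, and a key duplicated on one side leaves the rows on that side unmatched. All function names are hypothetical.

```python
from collections import Counter


def row_key(row, columns):
    """Build the key tuple, or None when any component is blank."""
    parts = tuple(row.get(column) or "" for column in columns)
    return None if any(not part.strip() for part in parts) else parts


def index_rows(rows, columns):
    """Split rows into uniquely keyed rows and rows left unmatched."""
    keys = [row_key(row, columns) for row in rows]
    counts = Counter(key for key in keys if key is not None)
    keyed, unmatched = {}, []
    for row, key in zip(rows, keys):
        if key is None or counts[key] > 1:
            unmatched.append(row)  # null-key or duplicate-key row
        else:
            keyed[key] = row
    return keyed, unmatched


def diff_rows(left, right, columns):
    """Count added and removed rows and list changed columns per matched key."""
    left_keyed, left_unmatched = index_rows(left, columns)
    right_keyed, right_unmatched = index_rows(right, columns)
    shared = left_keyed.keys() & right_keyed.keys()
    modified = {}
    for key in shared:
        old, new = left_keyed[key], right_keyed[key]
        changed = sorted(c for c in old.keys() | new.keys() if old.get(c) != new.get(c))
        if changed:
            modified[key] = changed
    return {
        # Unmatched rows still count toward the add/remove totals.
        "removed": len(left_keyed) - len(shared) + len(left_unmatched),
        "added": len(right_keyed) - len(shared) + len(right_unmatched),
        "modified": modified,
    }


left = [{"id": "1", "name": "alpha"}, {"id": "2", "name": "beta"}, {"id": "", "name": "orphan"}]
right = [{"id": "2", "name": "BETA"}, {"id": "1", "name": "alpha"}, {"id": "3", "name": "gamma"}]
print(diff_rows(left, right, ["id"]))
# {'removed': 1, 'added': 1, 'modified': {('2',): ['name']}}
```

Note that the reordering of rows 1 and 2 produces no change here, whereas a positional comparison at the same row offsets would report both rows as modified.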
output.<renderer>
Each renderer gets its own config section, keyed by the renderer's short name. Unknown sections are ignored, and any renderer without a section receives an empty object and applies its own defaults.
The Markdown renderer is the most interesting case today.
output.markdown.verbosity
Controls how much renderer-visible evidence the Markdown changelog shows:
- summary renders only the main one-line bullet for each reportable node.
- examples renders the summary plus bounded inline examples from any detail_blocks attached to the node. This is the default.
- full renders all captured detail blocks and examples from the changeset, still subject to the renderer's hard safety budget.
The renderer never reopens source data. If a node advertises an extract aspect,
the changelog points you at binoc extract for the exhaustive content.
output.markdown.max_examples_per_block
Only used at verbosity: examples. Caps how many examples the renderer shows
from each structured detail block before it switches to a "showing N of M"
message and an extract hint.
output.markdown.max_detail_blocks_per_node
Only used at verbosity: examples. Caps how many structured detail blocks the
renderer shows under a single changelog bullet.
output.markdown.max_value_chars
Caps how many characters of a single example value the renderer prints inline
before truncating it with "...".
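The effect of max_examples_per_block and max_value_chars can be approximated in a few lines of Python. This is a rough sketch, not the renderer's code: whether the ellipsis counts toward the character limit and the exact wording of the "showing N of M" message are assumptions, and both function names are hypothetical.

```python
def truncate_value(value, max_value_chars):
    """Cut a long example value and mark the cut with '...'.

    Assumption: the ellipsis is appended after max_value_chars characters
    rather than counted inside the limit.
    """
    if len(value) <= max_value_chars:
        return value
    return value[:max_value_chars] + "..."


def cap_examples(examples, max_examples_per_block):
    """Return the examples to show plus a note when some were left out."""
    shown = examples[:max_examples_per_block]
    if len(shown) == len(examples):
        return shown, None
    return shown, f"showing {len(shown)} of {len(examples)}"


print(truncate_value("abcdefghij", 4))          # abcd...
print(cap_examples(["a", "b", "c", "d", "e"], 2))  # (['a', 'b'], 'showing 2 of 5')
```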
output.markdown.max_rendered_detail_bytes
Hard safety budget for all rendered detail lines across the whole Markdown output. When the renderer hits this budget it stops printing further inline detail and leaves the summary bullets intact.
output.markdown.groups
An ordered list of group definitions. Each group has a literal heading
string and a tags list. The renderer looks up each tagged node against this
list and places the change under the first matching heading.
output:
  markdown:
    groups:
      - heading: "Review first"
        tags:
          - bio.cross-contamination
      - heading: "Substantive changes"
        tags:
          - binoc.column-addition
          - binoc.row-addition
          - bio.sequence-change # custom tag from a plugin
      - heading: "Clerical changes"
        tags:
          - binoc.column-reorder
          - binoc.whitespace-change
          - bio.header-change # custom tag from a plugin
A node with multiple tags goes to the first matching group; declared order is
both display order and priority order. Anything unmapped falls under
Other Changes, but only when at least one group is configured. If groups
is omitted or empty, the default Markdown output is a flat factual list with
no section headings. This is intentionally a renderer concern, not an IR
concern: a single changeset can be rendered with different grouping policies
for different audiences. See Significance classification and
Renderer config ADR for the rationale.
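The first-match rule for groups can be sketched in Python. This is a minimal illustration of the behavior described in this section, not the renderer's code; assign_heading is a hypothetical name, and returning None stands in for the flat, heading-free output.

```python
# Group definitions copied from the example in this section.
GROUPS = [
    {"heading": "Review first", "tags": ["bio.cross-contamination"]},
    {"heading": "Substantive changes",
     "tags": ["binoc.column-addition", "binoc.row-addition", "bio.sequence-change"]},
    {"heading": "Clerical changes",
     "tags": ["binoc.column-reorder", "binoc.whitespace-change", "bio.header-change"]},
]


def assign_heading(node_tags, groups):
    """Return the heading a node falls under, or None for flat output."""
    if not groups:
        return None  # no groups configured: flat list, no headings
    tags = set(node_tags)
    for group in groups:  # declared order doubles as priority order
        if tags & set(group["tags"]):
            return group["heading"]
    return "Other Changes"


print(assign_heading(["binoc.row-addition", "bio.cross-contamination"], GROUPS))  # Review first
print(assign_heading(["binoc.whitespace-change"], GROUPS))  # Clerical changes
print(assign_heading(["vendor.unmapped"], GROUPS))          # Other Changes
print(assign_heading(["binoc.row-addition"], []))           # None
```

The first call shows why order matters: the node carries a tag from "Substantive changes" as well, but "Review first" is declared earlier and wins.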
Other renderer config
The output block can hold config for any registered renderer. For
the shape of an HTML renderer config, a CI-check renderer config,
etc., consult the renderer's documentation (for third-party
renderers) or source (for binoc-stdlib). Each renderer deserializes
its own section.
Where to go next
- Diff two snapshots — the default pipeline in action.
- Install and use plugins — adding third-party plugin names to the config.
- Plugin discovery — how plugin names become running code.
- Renderer config ADR — the decision record for per-renderer sections.