Unified Dataset Config and Identity Policy
Date: 2026-06-01
Status: Accepted; implementation notes superseded in part by Correspondence-First Engine
Note: the Markdown-renderer grouping example below is superseded by
2026-06-02-renderer_groups.md. Current config
uses output.markdown.groups, not output.markdown.significance.
Note: the dataset-semantics shape remains current, but implementation passages
that mention comparator-to-transformer orchestration or pending_recompare are
superseded by the correspondence engine. Declared file identity is now pair-rule
evidence over side items.
Context
Several tabular features need the same concept but have been phrased as separate requests:
- keyed row identity for tables, such as FDA product rows keyed by ["BLA Number", "Product Number"]
- declared file correspondence across snapshots, where two different paths should be treated as the same logical file
- tabular parse options, especially delimiter selection
- different table keys and parse options within one multi-table dataset
- clear behavior when keys are null, duplicated, one-to-many, or many-to-one
Treating each feature as plugin-specific configuration would make real datasets awkward to describe. A workbook, a SQLite database, and a directory full of CSVs can all be "the same dataset" from the user's perspective. The configuration surface should let users describe dataset semantics once, while plugins consume the parts they understand.
At the same time, Binoc's core rules still apply:
- the controller must remain type-ignorant
- parse rules publish source data as artifacts
- pair, writer, compaction, and projection rules optimize the correspondence result before changeset projection
- significance remains a renderer concern
- configuration is passed into the run, not read from global state
Decision
Binoc will add a unified, SDK-owned dataset semantics section to the dataset config. It describes file identity, table identity, row identity, and parse options in one place. Core may carry this config and pass it to plugins, but it does not interpret paths, tables, delimiters, or keys.
The existing orchestration sections stay:
```yaml
comparators:
  - binoc.zip
  - binoc.tar
  - binoc.directory
  - binoc.csv
  - binoc.text
  - binoc.binary
transformers:
  - binoc.declared_correspondence
  - binoc.correlation_detector
  - binoc.fuzzy_correlation_detector
  - binoc.folder_move_detector
  - binoc.tabular_analyzer
  - binoc.column_reorder_detector
output:
  markdown:
    groups:
      - heading: "Substantive changes"
        tags: [binoc.schema-change, binoc.row-addition]
      - heading: "Clerical follow-up"
        tags: [binoc.column-reorder]
```
The new semantic section is separate from plugin order:
```yaml
dataset:
  files:
    correspondences:
      - name: fda-quarterly-csvs
        left:
          path_regex: '^raw/(?P<table>[^/]+)/(?P<year>[0-9]{4})\.csv$'
        right:
          path_regex: '^normalized/(?P<year>[0-9]{4})/(?P<table>[^/]+)\.csv$'
        key: '${table}:${year}'
        logical_path: 'tables/${table}-${year}.csv'
        cardinality: one-to-one
        on_null_key: diagnostic
        on_duplicate_key: diagnostic
        report_path_change: false
  tables:
    defaults:
      parse:
        header: true
        delimiter: ','
      row_identity:
        on_null_key: diagnostic
        on_duplicate_key: diagnostic
    entries:
      applications:
        match:
          logical_name: applications
        parse:
          delimiter: ','
        row_identity:
          columns: ['ApplNo']
      products:
        match:
          logical_name: products
        row_identity:
          columns: ['BLA Number', 'Product Number']
      adverse_events:
        match:
          source:
            path_regex: '^events/.*\.tsv$'
        parse:
          # Double-quoted so YAML reads the escape as a tab character;
          # a single-quoted '\t' would be a literal backslash followed by t.
          delimiter: "\t"
        row_identity:
          columns: ['case_id', 'event_seq']
```
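One plausible reading of the key and logical_path templates is that each ${name} placeholder is replaced by the same-named capture group from the selector's path_regex. The following is a minimal sketch under that assumption. expand_template is a hypothetical helper, not an SDK function, and the captures are supplied by hand instead of coming from a regex engine.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: substitutes `${name}` placeholders with captured values.
/// Returns None when a placeholder has no matching capture or is unterminated,
/// which a caller could treat as a null key.
pub fn expand_template(template: &str, captures: &BTreeMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = template;
    while let Some(start) = rest.find("${") {
        // Copy the literal text before the placeholder.
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find('}')?;
        let name = &after[..end];
        out.push_str(captures.get(name)?);
        rest = &after[end + 1..];
    }
    // Copy whatever literal text follows the last placeholder.
    out.push_str(rest);
    Some(out)
}
```

With captures table = products and year = 2024, both sides of the fda-quarterly-csvs rule would expand to the same key, which is what lets two differently laid out paths be indexed together.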
Type sketch
The SDK owns the schema and helper types. The exact Rust names can move during implementation, but the shape is:
```rust
pub struct DatasetSemanticsV1 {
    pub files: FileIdentityConfig,
    pub tables: TableConfig,
}

pub struct FileIdentityConfig {
    pub correspondences: Vec<FileCorrespondenceRule>,
}

pub struct FileCorrespondenceRule {
    pub name: String,
    pub left: FileSelector,
    pub right: FileSelector,
    pub key: Template,
    pub logical_path: Option<Template>,
    pub cardinality: Cardinality,
    pub on_null_key: IdentityFailurePolicy,
    pub on_duplicate_key: IdentityFailurePolicy,
    pub report_path_change: bool,
}

pub struct TableConfig {
    pub defaults: TableDefaults,
    pub entries: BTreeMap<String, TableEntry>,
}

pub struct TableEntry {
    pub match_: TableSelector,
    pub parse: TabularParseOptions,
    pub row_identity: RowIdentity,
}

pub struct RowIdentity {
    pub columns: Vec<String>,
    pub cardinality: Cardinality,
    pub on_null_key: IdentityFailurePolicy,
    pub on_duplicate_key: IdentityFailurePolicy,
}

pub enum Cardinality {
    OneToOne,
}

pub enum IdentityFailurePolicy {
    Diagnostic,
    Error,
    Ignore,
}
```
Cardinality is intentionally narrow in v1. User language can name the hazard
as "one-to-many" or "many-to-one", but Binoc will not auto-match those shapes
until a concrete aggregate matching design exists.
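To illustrate how narrow the v1 vocabulary is, here is a hedged sketch of turning the config strings into the two enums. It assumes that unsupported cardinalities are rejected when the config is read. The ADR does not state where the rejection happens, and the FromStr implementations and error messages are illustrative, not the SDK's actual parsing code.

```rust
use std::str::FromStr;

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Cardinality {
    OneToOne,
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum IdentityFailurePolicy {
    Diagnostic,
    Error,
    Ignore,
}

impl FromStr for Cardinality {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "one-to-one" => Ok(Cardinality::OneToOne),
            // Assumption: these shapes are named by users but not accepted in v1.
            "one-to-many" | "many-to-one" | "many-to-many" => {
                Err(format!("cardinality '{}' is not supported in v1", s))
            }
            other => Err(format!("unknown cardinality '{}'", other)),
        }
    }
}

impl FromStr for IdentityFailurePolicy {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "diagnostic" => Ok(IdentityFailurePolicy::Diagnostic),
            "error" => Ok(IdentityFailurePolicy::Error),
            "ignore" => Ok(IdentityFailurePolicy::Ignore),
            other => Err(format!("unknown identity failure policy '{}'", other)),
        }
    }
}
```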
Plugin-specific config remains available for plugin knobs that are not dataset
semantics. The current transformer_config pattern should be mirrored for
comparators as comparator_config; renderers keep using output.<renderer>.
The semantic dataset section is the preferred surface for keys, parse options,
and correspondence rules. The lower-level per-plugin sections remain escape
hatches, not the main UX.
Comparators and transformers receive run config
Comparators need configuration for parse options, and transformers need the same semantic config for keyed row analysis and declared file correspondence. The follow-on implementation should pass a run-config view to plugins, analogous to the current renderer and transformer config values:
```rust
pub struct PluginRunConfig<'a> {
    /// The comparator_config / transformer_config / output value selected for
    /// this plugin, depending on plugin kind.
    pub plugin: &'a serde_json::Value,
    /// The top-level dataset semantics value, passed through unchanged by core.
    pub dataset: &'a serde_json::Value,
}
```
Core only selects the plugin's own config value and passes the dataset semantics
value through. It does not deserialize or act on table/file fields. SDK/stdlib
helpers deserialize dataset into DatasetSemanticsV1 for plugins that opt in.
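The pass-through described here could look roughly like the sketch below. To stay dependency-free it uses a String as a stand-in for serde_json::Value. run_config_for, the per-plugin map, and the empty fallback are all hypothetical names chosen for illustration. The point is only that core performs a lookup by plugin name and forwards both values without inspecting them.

```rust
use std::collections::BTreeMap;

/// Stand-in for serde_json::Value so the sketch compiles without dependencies.
pub type Value = String;

pub struct PluginRunConfig<'a> {
    pub plugin: &'a Value,
    pub dataset: &'a Value,
}

/// Hypothetical core-side helper: selects the plugin's own config by name and
/// forwards the dataset value untouched. Neither payload is parsed here.
pub fn run_config_for<'a>(
    plugin_name: &str,
    per_plugin: &'a BTreeMap<String, Value>,
    dataset: &'a Value,
    empty: &'a Value,
) -> PluginRunConfig<'a> {
    PluginRunConfig {
        // Plugins without an entry receive the caller-provided empty value.
        plugin: per_plugin.get(plugin_name).unwrap_or(empty),
        dataset,
    }
}
```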
This is a trait/config plumbing change, not a comparator-to-transformer pipeline rework. The existing pipeline remains:
- comparators build the initial tree and publish artifacts
- root-scope correlation transformers rewrite or inflate the tree
- data-shape transformers analyze artifacts
- renderers classify and format tags
File correspondence is an explicit correlation pass
Declared file correspondence belongs before heuristic move/copy detection. It is
implemented as a root-scope transformer, tentatively
binoc.declared_correspondence, ordered before binoc.correlation_detector and
binoc.fuzzy_correlation_detector.
The transformer:
- walks residual add/remove leaves
- applies user-declared correspondence rules
- builds a key index from the left and right selectors
- pairs only unambiguous one-to-one matches
- creates a node at the declared logical_path or the right path
- sets pending_recompare so the normal comparator pipeline parses the pair
This uses the existing transformer-initiated re-dispatch mechanism from the rename-and-modify ADR. No new controller phase is required.
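The indexing and pairing steps of the transformer can be sketched as follows. This is an illustration under stated assumptions, not the stdlib implementation: pair_declared and Pairing are hypothetical names, the inputs are (path, key) pairs for leaves that already matched the rule's selectors and had their key template expanded, and the sketch stops before node creation and re-dispatch.

```rust
use std::collections::BTreeMap;

/// Outcome of applying one declared rule to the residual add/remove pool.
#[derive(Debug, Default, PartialEq)]
pub struct Pairing {
    /// (key, left path, right path) for unambiguous one-to-one matches.
    pub matched: Vec<(String, String, String)>,
    /// Keys present on both sides with more than one member on either side.
    pub ambiguous: Vec<String>,
}

/// `removed` holds left-side leaves and `added` holds right-side leaves,
/// each as (path, key).
pub fn pair_declared(removed: &[(&str, &str)], added: &[(&str, &str)]) -> Pairing {
    // Key index: key -> (left paths, right paths).
    let mut index: BTreeMap<&str, (Vec<&str>, Vec<&str>)> = BTreeMap::new();
    for (path, key) in removed {
        index.entry(*key).or_default().0.push(*path);
    }
    for (path, key) in added {
        index.entry(*key).or_default().1.push(*path);
    }

    let mut out = Pairing::default();
    for (key, (left, right)) in index {
        match (left.len(), right.len()) {
            (1, 1) => out.matched.push((
                key.to_string(),
                left[0].to_string(),
                right[0].to_string(),
            )),
            // One-sided keys stay in the pool as ordinary removals or additions.
            (_, 0) | (0, _) => {}
            // Anything else is ambiguous and is never paired by guessing.
            _ => out.ambiguous.push(key.to_string()),
        }
    }
    out
}
```

Only the entries in matched would leave the residual pool. Everything else remains visible to the heuristic detectors that run afterwards.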
Declared correspondence is not a replacement for the move detectors:
- declared correspondence is user-supplied identity
- exact/fuzzy correlation is inferred content similarity
- folder_move_detector is a reporting rollup over already-detected file moves
Declared matches are removed from the residual add/remove pool before heuristic
detectors run, so the two layers do not double count. If
report_path_change: false, a path difference is treated as snapshot layout and
is not rendered as a move. If report_path_change: true, the node gets a factual
path-change tag and source/destination details; renderers may surface that
without treating it as a heuristic binoc.move.
Row identity is table-local
Row keys live on table entries, not globally. A dataset can have different keys per logical table:
```yaml
dataset:
  tables:
    entries:
      applications:
        row_identity:
          columns: ['ApplNo']
      products:
        row_identity:
          columns: ['BLA Number', 'Product Number']
```
The tabular analyzer resolves the current table identity from the table collection artifact or from the single-table source location. It then applies the matching table entry and compares rows by key instead of position.
No configured key means the current positional behavior is retained. That is important for simple CSVs without stable IDs.
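To make the difference between keyed and positional comparison concrete, here is a small sketch of a key-based row diff. It is not the tabular analyzer's code: Row, RowChange, row, and diff_by_key are hypothetical names, rows are modelled as column-to-value maps, and the sketch assumes keys are unique, skipping rows that lack a key column instead of applying the null and duplicate policies.

```rust
use std::collections::{BTreeMap, BTreeSet};

pub type Row = BTreeMap<String, String>;

#[derive(Debug, PartialEq)]
pub enum RowChange {
    Added(Vec<String>),
    Removed(Vec<String>),
    Modified(Vec<String>),
}

/// Convenience constructor for example rows.
pub fn row(pairs: &[(&str, &str)]) -> Row {
    pairs.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect()
}

/// Extracts the key values; None if any key column is missing from the row.
fn key_of(row: &Row, columns: &[&str]) -> Option<Vec<String>> {
    columns.iter().map(|c| row.get(*c).cloned()).collect()
}

pub fn diff_by_key(left: &[Row], right: &[Row], columns: &[&str]) -> Vec<RowChange> {
    // Index right-hand rows by key (assumes unique keys in this sketch).
    let right_index: BTreeMap<Vec<String>, &Row> = right
        .iter()
        .filter_map(|r| key_of(r, columns).map(|k| (k, r)))
        .collect();

    let mut seen = BTreeSet::new();
    let mut changes = Vec::new();
    for left_row in left {
        let key = match key_of(left_row, columns) {
            Some(k) => k,
            None => continue,
        };
        match right_index.get(&key) {
            None => changes.push(RowChange::Removed(key)),
            Some(right_row) => {
                if *right_row != left_row {
                    changes.push(RowChange::Modified(key.clone()));
                }
                seen.insert(key);
            }
        }
    }
    // Keys that only exist on the right are additions.
    for key in right_index.keys() {
        if !seen.contains(key) {
            changes.push(RowChange::Added(key.clone()));
        }
    }
    changes
}
```

A row that merely moves to a different position produces no change here, whereas a positional comparison would report it as modified.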
Null, duplicate, and many-to-many policy
File keys and row keys use the same identity-index rule:
| Left count | Right count | Result |
|---|---|---|
| 1 | 1 | match |
| 1 | 0 | removal |
| 0 | 1 | addition |
| 0 | N | additions, unless null/duplicate policy says otherwise |
| N | 0 | removals, unless null/duplicate policy says otherwise |
| 1 | N | ambiguous one-to-many |
| N | 1 | ambiguous many-to-one |
| N | M | ambiguous many-to-many |
N means more than one item for the same normalized key. The default
diagnostic policy does not guess. It emits an identity diagnostic tag and
details, leaves the ambiguous members unmatched, and lets normal add/remove
reporting continue. error fails the run with a config/data quality error.
ignore suppresses the diagnostic and treats the members as if no key existed
for them.
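The table and the three policies can be restated as a pair of small functions. This is a sketch for illustration only. IndexResult, classify, and on_ambiguous are hypothetical names, the policy enum is repeated so the block compiles on its own, and using binoc.identity-diagnostic as the emitted tag is an assumption about which of the standard tags applies.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum IndexResult {
    Match,
    Removal,  // also covers N | 0, subject to the null/duplicate policy
    Addition, // also covers 0 | N, subject to the null/duplicate policy
    OneToMany,
    ManyToOne,
    ManyToMany,
}

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum IdentityFailurePolicy {
    Diagnostic,
    Error,
    Ignore,
}

/// Classifies one normalized key by how many items carry it on each side.
pub fn classify(left: usize, right: usize) -> Option<IndexResult> {
    match (left, right) {
        (0, 0) => None, // key absent on both sides: nothing to report
        (1, 1) => Some(IndexResult::Match),
        (_, 0) => Some(IndexResult::Removal),
        (0, _) => Some(IndexResult::Addition),
        (1, _) => Some(IndexResult::OneToMany),
        (_, 1) => Some(IndexResult::ManyToOne),
        _ => Some(IndexResult::ManyToMany),
    }
}

/// For an ambiguous key: Ok(Some(tag)) emits a diagnostic, Ok(None) stays
/// silent, Err fails the run. Members stay unmatched in both Ok cases.
pub fn on_ambiguous(
    policy: IdentityFailurePolicy,
    key: &str,
) -> Result<Option<&'static str>, String> {
    match policy {
        IdentityFailurePolicy::Diagnostic => Ok(Some("binoc.identity-diagnostic")),
        IdentityFailurePolicy::Error => Err(format!("ambiguous identity for key '{}'", key)),
        IdentityFailurePolicy::Ignore => Ok(None),
    }
}
```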
Standard tags:
- binoc.identity-diagnostic
- binoc.null-key
- binoc.duplicate-key
- binoc.ambiguous-key
- binoc.file-correspondence-ambiguous
- binoc.row-identity-ambiguous
The renderer may group these as warnings or review-first changes through normal significance config. The IR still carries factual tags only.
Consequences
- Users get one place to describe dataset semantics, even when those semantics affect multiple plugins.
- CSV delimiter and other parse options now have a proper config path; the CSV comparator will need comparator config plumbing.
- Keyed row diffs and file correspondence share the same conservative identity failure policy.
- Declared correspondence runs before heuristic move/copy detectors but does not replace them.
- The controller remains type-ignorant: it passes JSON config through and handles pending_recompare, but it does not understand tables or paths.
- Follow-on implementation can proceed without a new pipeline architecture.
Alternatives Considered
Put everything in per-plugin config. This fits the current
transformer_config model but creates a scattered UX: CSV delimiter in one
place, row keys in another, file correspondence in a third. It also makes
multi-table datasets awkward because the same logical table metadata would need
to be repeated for each plugin that cares.
Teach the controller about file correspondence before directory comparison.
This would pair files earlier, but it makes the controller understand paths and
dataset identity. A root-scope declared-correspondence transformer preserves the
existing type-ignorant controller and reuses pending_recompare.
Treat declared correspondence as a generalization that replaces move detectors. Rejected because declared identity and inferred similarity answer different questions. A user declaration can say two changing paths are the same logical file even when their contents differ. A heuristic detector can still find moves that users did not configure.
Auto-match duplicate keys greedily. Rejected for both rows and files. Greedy pairing can fabricate precise-looking cell or file changes from ambiguous identity. Binoc should report the ambiguity and require either better keys or a future explicit aggregate matching mode.
Promote key failures directly to significance levels. Rejected for the same reason significance is not in the IR. Key failures are factual tags; renderers decide whether they are warnings, clerical issues, or substantive review items.