Tiered Artifact Metadata: Column, Table, and a parser_metadata_v1 Artifact
Date: 2026-06-15 Status: Implemented (channels + producers in CFM-80; rendering + significance in CFM-82)
Context
A parser cracks a source format (CSV, Stata, SAS, Excel, SQLite, …) into one
tabular_v1 artifact, or — for multi-table containers — a set of tabular_v1
child nodes. But the container format carries facts beyond the rows and columns:
per-variable labels and display formats, value-label dictionaries, a dataset
label, the source-format version, file encoding, creator/tooling provenance.
Before this change, the stat-binary plugin computed a rich per-variable and
file-level metadata JSON during parsing and then discarded it — it had been
attached to a tabular_collection_v1 parent manifest that no longer exists.
Deleting the dead extraction (the immediately preceding cleanup) raised the real
question: these facts are genuinely useful to downstream consumers — rewrite
rules and renderers — and a user may well consider a change to them (a relabeled
variable, a dropped value-label set, a new file encoding) a relevant change.
So a metadata channel should exist. The question was its shape.
Two observations narrowed the design:
- The facts come in three grains, keyed differently. Per-column facts are keyed to a column; table facts to a table; file-level facts to the parse/node as a whole. Jamming all three into one blob is exactly what made the old metadata homeless when its single host (the collection manifest) went away.
- Artifacts are the parser's output channel. In binoc, "a parser attaches something to a node" is spelled "a parse rule publishes an artifact." A node already carries multiple artifacts of different formats, each diffed independently by a format-matched writer. So metadata is not a new concept to bolt onto the node type; it is more of the existing artifact mechanism.
Decision
Carry metadata in three tiers, by key:
| Tier | Grain | Home | Codec |
|---|---|---|---|
| 1 | per-column | TabularData.column_metadata (parallel to headers) | tabular_v1 |
| 2 | per-table | TabularData.table_metadata (an open bag) | tabular_v1 |
| 3 | per-parse / file-level | a second artifact on the parsed node | parser_metadata_v1 |
- Tiers 1 and 2 ride on tabular_v1. column_metadata: Vec<serde_json::Value> is parallel to headers (mirroring column_types); each entry is an open object, or Null for a column with no metadata. table_metadata: serde_json::Value is a table-scoped bag. Both are optional and skipped when empty (skip_serializing_if), so a table with no metadata serializes byte-identically to before and existing tabular producers are unaffected.
- Tier 3 is a new parser_metadata_v1 artifact, codec ParserMetadata { format, value }. format names the source format ("stata_dta", "sas7bdat", "sas_xport") so a consumer can interpret value without guessing; value is an open bag diffed generically. It rides as a second artifact on the parsed node, published via a new ParseOutput.artifacts field (the parent-node analogue of ParsedChild.artifacts). A single-table leaf carries tabular_v1 + parser_metadata_v1; a multi-table container (which has no table of its own) carries parser_metadata_v1 alongside its tabular_v1 children.
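As a rough illustration of how the three tiers sit on the data, the sketch below models the shapes in plain Rust. It is not the binoc-sdk definition: the Value enum is a reduced stand-in for serde_json::Value, table_metadata is modeled as an Option, and the obj helper, the column_meta lookup, and the length check inside with_column_metadata are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Reduced stand-in for `serde_json::Value` (assumption: only what the sketch needs).
#[derive(Clone, Debug, PartialEq)]
pub enum Value {
    Null,
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical helper: build an open object bag from string pairs.
pub fn obj(pairs: &[(&str, &str)]) -> Value {
    Value::Object(
        pairs
            .iter()
            .map(|(k, v)| (k.to_string(), Value::Str(v.to_string())))
            .collect(),
    )
}

/// Tiers 1 and 2 live on the table itself.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct TabularData {
    pub headers: Vec<String>,
    /// Tier 1: empty, or one entry per header (`Null` = no metadata).
    pub column_metadata: Vec<Value>,
    /// Tier 2: table-scoped bag; `None` models "absent" in this sketch.
    pub table_metadata: Option<Value>,
}

impl TabularData {
    pub fn new(headers: &[&str]) -> Self {
        TabularData {
            headers: headers.iter().map(|h| h.to_string()).collect(),
            ..Default::default()
        }
    }

    /// Assumed builder shape: rejects a non-empty vector that is not parallel to `headers`.
    pub fn with_column_metadata(mut self, meta: Vec<Value>) -> Result<Self, String> {
        if !meta.is_empty() && meta.len() != self.headers.len() {
            return Err(format!(
                "expected {} entries, got {}",
                self.headers.len(),
                meta.len()
            ));
        }
        self.column_metadata = meta;
        Ok(self)
    }

    pub fn with_table_metadata(mut self, meta: Value) -> Self {
        self.table_metadata = Some(meta);
        self
    }

    /// Hypothetical lookup: tier-1 bag for a column, `None` if absent or `Null`.
    pub fn column_meta(&self, header: &str) -> Option<&Value> {
        let idx = self.headers.iter().position(|h| h == header)?;
        match self.column_metadata.get(idx) {
            Some(Value::Null) | None => None,
            Some(v) => Some(v),
        }
    }
}

/// Tier 3: a separate artifact; `format` tells a consumer how to read `value`.
#[derive(Clone, Debug, PartialEq)]
pub struct ParserMetadata {
    pub format: String,
    pub value: Value,
}
```

Under these assumptions, a consumer that knows nothing about Stata can still fetch a column's bag by header name, and a table built without metadata leaves both tier fields empty.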
Metadata does not need a current consumer to be useful. The deliverable here
is the channel, populated by a real producer (stat-binary, whose previously
discarded labels/formats/value-labels/version facts are restored into the right
tiers). Carrying the facts on artifacts already makes them available to any
future rewrite rule, renderer, extractor, or third-party plugin, and visible in
traces and debugging — exactly as tabular_v1 is useful independent of which
writer consumes it. A changeset is the projected edit list, so
carried-but-unconsumed metadata produces zero output churn until a writer reads
it; the full test suite confirmed no snapshot changed.
Volatile fields (file created/modified timestamps) are excluded from the populated metadata. They are wall-clock noise that would make a metadata diff fire on every re-export; if a use case wants them later, they can be added under a clearly low-significance key.
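A minimal sketch of that exclusion, assuming the file-level bag is a flat string map and that the timestamps appear under the hypothetical keys created and modified (the real key names are not given here):

```rust
use std::collections::BTreeMap;

/// Hypothetical key names for the wall-clock fields a source file may carry.
const VOLATILE_KEYS: [&str; 2] = ["created", "modified"];

/// Return a copy of a file-level bag with the volatile keys removed,
/// so that two exports of the same data compare equal.
pub fn strip_volatile(bag: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    bag.iter()
        .filter(|(k, _)| !VOLATILE_KEYS.contains(&k.as_str()))
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```

Two bags that differ only in a timestamp become identical after filtering, which is the property that keeps a re-export from producing a metadata diff.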
Rendering is deliberately deferred
The engine runs one writer per link (first match wins), so a node's owning
writer — TabularWriter for a leaf, ContainerWriter for a container — would be
the single place to render metadata changes, by reading the relevant artifacts.
That is a real, in-grain extension, but it needs its own significance design (a
relabeled column vs. a dropped value-label set vs. a creator rename are not
equally interesting, and significance is a renderer/config concern, not an IR
one). Rather than bundle that judgment into the channel work, this ADR ships the
channel and populated producers; a later change adds the writer-side rendering
and significance mapping. The "no current consumer needed" principle is what
makes this staging legitimate rather than half-done.
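To make the "one writer per link, first match wins" constraint concrete, here is a small sketch of writer selection. The Node struct, the Writer trait, and the match conditions are assumptions for illustration, not the engine's actual types; only the writer names come from the text above.

```rust
/// Minimal stand-in for a node: whether it is a container, plus the
/// formats of the artifacts it carries (hypothetical shape).
pub struct Node {
    pub is_container: bool,
    pub artifact_formats: Vec<&'static str>,
}

/// Hypothetical writer interface: a name and a match predicate.
pub trait Writer {
    fn name(&self) -> &'static str;
    fn matches(&self, node: &Node) -> bool;
}

pub struct TabularWriter;
pub struct ContainerWriter;

impl Writer for TabularWriter {
    fn name(&self) -> &'static str {
        "TabularWriter"
    }
    // Assumed rule: owns any non-container node that carries a table.
    fn matches(&self, node: &Node) -> bool {
        !node.is_container && node.artifact_formats.contains(&"tabular_v1")
    }
}

impl Writer for ContainerWriter {
    fn name(&self) -> &'static str {
        "ContainerWriter"
    }
    // Assumed rule: owns any container node.
    fn matches(&self, node: &Node) -> bool {
        node.is_container
    }
}

/// First match wins: return the name of the single writer that owns the node.
pub fn owning_writer(writers: &[Box<dyn Writer>], node: &Node) -> Option<&'static str> {
    writers.iter().find(|w| w.matches(node)).map(|w| w.name())
}
```

In this sketch no writer is selected by parser_metadata_v1 itself, so a metadata change would surface only if the owning writer chose to read that artifact.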
Alternatives Considered
- A new attribute on the node (ItemRef) instead of an artifact. Rejected. ItemRef is identity/locator (path, hash, size, media type) plus the one projection channel; it exists before any parser runs. A metadata bag there would be null for the vast majority of nodes, carry no diff machinery (core is type-ignorant and could not diff it), and leak a domain concept into the core node type. Artifacts already solve diffing and keep domain knowledge behind the opaque format + bytes seam.
- Reusing structured_document_v1 for tier 3. Rejected. That format and its writer mean "this node is a document the user authored" (it emits document.value_change). Metadata is facts about a different artifact; reusing the document format would mislabel a creator-name change as a document edit and conflate significance routing. A distinct parser_metadata_v1 keeps the vocabulary honest.
- Folding tier 3 into tier 2 for single-table leaves (no separate artifact on leaves, only on containers). Rejected for consistency: parser_metadata_v1 should mean the same thing and render the same way wherever it rides, so leaves and containers both carry it. The cost is a second artifact on leaf nodes, which the multi-artifact-per-node model already supports.
- A synthetic metadata child node. Rejected: it pollutes the logical tree with a child the container does not actually contain and needs stable cross-snapshot naming. Metadata is a property of a node, not a member of it.
- Artifact-format inheritance / parser_metadata as a subtype of a record artifact. Considered and deferred. parser_metadata_v1 is kept a flat, standalone format for now; if it grows typed structure, that is a v2 of the codec rather than new inheritance machinery in the artifact-format system.
Consequences
- binoc-sdk: TabularData gains column_metadata + table_metadata (with with_column_metadata / with_table_metadata builders); new parser_metadata_v1() format and ParserMetadata codec; ParseOutput gains artifacts for secondary parent-node artifacts.
- binoc-core: the parse driver publishes ParseOutput.artifacts on the parsed node, after the primary bytes artifact.
- binoc-stat-binary: restores the previously discarded metadata into the three tiers: Stata/SAS variable labels, formats, and value-label set names into column_metadata; dataset name/label into table_metadata; source-format identity, version/encoding, and value-label dictionaries into a parser_metadata_v1 artifact (on each leaf, and on the .xpt container).
- Tabular producers in other plugins gain the new (empty) fields with no behavior change.
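The publishing order described for binoc-core can be sketched as follows. This is an illustration under stated assumptions rather than the driver's code: only the artifacts field is named in this document, while the Artifact struct, the primary field, and the publish and formats functions are hypothetical.

```rust
/// A published artifact: an opaque format + bytes pair (simplified).
#[derive(Clone, Debug, PartialEq)]
pub struct Artifact {
    pub format: &'static str,
    pub bytes: Vec<u8>,
}

/// Simplified parse result. `primary` is a hypothetical stand-in for the
/// primary bytes artifact; a container with no table of its own has none.
pub struct ParseOutput {
    pub primary: Option<Artifact>,
    pub artifacts: Vec<Artifact>,
}

/// Hypothetical driver step: publish the primary artifact first,
/// then the secondary artifacts in the order the parser listed them.
pub fn publish(node_artifacts: &mut Vec<Artifact>, output: ParseOutput) {
    if let Some(primary) = output.primary {
        node_artifacts.push(primary);
    }
    node_artifacts.extend(output.artifacts);
}

/// Convenience view: the formats carried by a node, in publish order.
pub fn formats(node_artifacts: &[Artifact]) -> Vec<&'static str> {
    node_artifacts.iter().map(|a| a.format).collect()
}
```

A single-table leaf then ends up with tabular_v1 followed by parser_metadata_v1, and a container such as an .xpt file with parser_metadata_v1 alone.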