IR and changesets
The IR (intermediate representation) is binoc's lingua franca. Every comparator emits IR nodes; every transformer rewrites them; every renderer consumes them. The IR is the contract — both within a single binoc run and across releases — so understanding its shape is essential to writing plugins or building a pipeline that consumes binoc's output.
A changeset is a tree of DiffNode values
Each changeset is a single DiffNode (the root) with children, grandchildren,
and so on. The shape mirrors the input snapshots: a directory becomes a
container node with file children; a zip archive becomes a container node
with the archive's contents as children; a CSV file becomes a leaf with
column / row details.

    flowchart TD
        Root["root: directory (modify)"]
        Root --> A["data/extra.csv (add)"]
        Root --> B["data/records.csv (modify, +1 row)"]
        Root --> C["docs/readme.txt (modify, +2/-1 lines)"]

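The tree in the diagram can be modelled with a small recursive type. The following is a minimal sketch in Rust, not the definition from binoc-core/src/ir.rs: the struct is trimmed to four of the documented fields, and the `new`, `flatten`, and `example_tree` helpers, the root path `"."`, and the `item_type` chosen for each file are all assumptions made for illustration.

```rust
/// Trimmed-down, hypothetical stand-in for binoc's `DiffNode`.
/// Only four of the documented fields are modelled here.
#[derive(Debug, Clone, PartialEq)]
pub struct DiffNode {
    pub action: String,
    pub item_type: String,
    pub path: String,
    pub children: Vec<DiffNode>,
}

impl DiffNode {
    /// Illustrative constructor (not part of binoc's API).
    pub fn new(action: &str, item_type: &str, path: &str, children: Vec<DiffNode>) -> Self {
        DiffNode {
            action: action.to_string(),
            item_type: item_type.to_string(),
            path: path.to_string(),
            children,
        }
    }
}

/// Depth-first, pre-order walk: each node is visited before its children.
/// Returns (path, action) pairs.
pub fn flatten(node: &DiffNode) -> Vec<(String, String)> {
    let mut out = vec![(node.path.clone(), node.action.clone())];
    for child in &node.children {
        out.extend(flatten(child));
    }
    out
}

/// The changeset from the diagram; the root path "." and the item types
/// are assumed, since the diagram does not state them.
pub fn example_tree() -> DiffNode {
    DiffNode::new(
        "modify",
        "directory",
        ".",
        vec![
            DiffNode::new("add", "tabular", "data/extra.csv", vec![]),
            DiffNode::new("modify", "tabular", "data/records.csv", vec![]),
            DiffNode::new("modify", "file", "docs/readme.txt", vec![]),
        ],
    )
}
```

Calling `flatten(&example_tree())` yields the root first and then the three children in the order they were added, which is how a renderer that walks the tree top-down would encounter them.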
DiffNode fields
The full set of fields, defined in
`binoc-core/src/ir.rs`:

| Field | Type | Purpose |
|---|---|---|
| `action` | open string | What happened: `"add"`, `"remove"`, `"modify"`, `"move"`, `"reorder"`, `"identical"`, … Plugins may define new values. |
| `item_type` | open string | What the item is: `"directory"`, `"file"`, `"tabular"`, `"zip_archive"`, … The core never interprets it. |
| `path` | string | Logical path within the snapshot, e.g. `"archive.zip/data/file.csv"`. |
| `source_path` | optional | For moves and renames: the original (left-side) path. |
| `summary` | optional | Human-readable one-liner (`"2 lines added, 1 removed"`). Set by comparators or transformers; rendered by renderers. |
| `tags` | set of strings | Semantic observations: `binoc.column-reorder`, `binoc.content-changed`, … Open and namespaced by convention. |
| `children` | list | Child diff nodes forming the tree structure. |
| `details` | map | Comparator-specific structured data (column lists, row counts, hashes). |
| `annotations` | map | Transformer-added metadata, kept separate from `details`. |
| `comparator` | optional | Which comparator produced this node (provenance for the extract chain). |
| `transformed_by` | list | Transformers that modified this node, in order (provenance for the extract chain). |
| `artifacts` | list of `ArtifactDescriptor` | Pointers to typed payloads published by comparators. See Artifacts and composition. |

A few of these fields are transient — they exist during the live diff session but are stripped on serialization. See the transient fields on wire ADR for what crosses the boundary and what doesn't.
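To make the table concrete, here is one way its rows could map onto Rust types. This is a sketch under stated assumptions, not a copy of the real definition: the concrete types (`Option<String>`, `BTreeSet`, `BTreeMap<String, String>`), the single `name` field on `ArtifactDescriptor`, and the `moved` helper are all hypothetical, and the real `details` and `annotations` maps may well hold richer values than plain strings.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Placeholder: the real descriptor's fields are not documented here,
/// so a single hypothetical `name` stands in for them.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ArtifactDescriptor {
    pub name: String,
}

/// One field per row of the table; every concrete type is an assumption.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct DiffNode {
    pub action: String,                         // open string
    pub item_type: String,                      // open string
    pub path: String,                           // logical path in the snapshot
    pub source_path: Option<String>,            // only set for moves and renames
    pub summary: Option<String>,                // human-readable one-liner
    pub tags: BTreeSet<String>,                 // set of strings
    pub children: Vec<DiffNode>,                // the tree structure
    pub details: BTreeMap<String, String>,      // comparator-specific data (simplified)
    pub annotations: BTreeMap<String, String>,  // transformer-added metadata (simplified)
    pub comparator: Option<String>,             // provenance
    pub transformed_by: Vec<String>,            // provenance, in order
    pub artifacts: Vec<ArtifactDescriptor>,     // pointers to typed payloads
}

/// Hypothetical helper: builds a "move" node, the case where `source_path`
/// carries the original (left-side) path. All other fields stay empty.
pub fn moved(from: &str, to: &str) -> DiffNode {
    DiffNode {
        action: "move".to_string(),
        item_type: "file".to_string(),
        path: to.to_string(),
        source_path: Some(from.to_string()),
        ..Default::default()
    }
}
```

Deriving `Default` keeps the optional, list, set, and map fields empty unless a comparator or transformer fills them in, which matches how sparse most nodes are likely to be in practice.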
Three design commitments behind the IR
Everything is openly typed
`action`, `item_type`, and `tags` are plain strings. They are conventions,
not enforced enums. A genomics plugin can emit `action: "gap-shift"` without
touching core. A pipeline integrator's downstream code can match on
`item_type == "tabular"` without binoc dictating an enum it has to keep up
with.

The trade-off is that the IR offers no static guarantees about the values of
these fields. The convention is to namespace custom values
(`biobinoc.fasta-alignment`, not `fasta-alignment`) so that pipelines that
care about precise dispatch can match on them safely. See
Vocabulary.
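Because these fields are open strings, a consumer should always have a fallback arm. The sketch below shows one way a downstream pipeline might dispatch on `item_type`; the `Handling` enum and both functions are hypothetical, and splitting on the first `.` is simply one reading of the namespacing convention described above.

```rust
/// How a hypothetical downstream consumer decides to treat a node.
#[derive(Debug, PartialEq, Eq)]
pub enum Handling {
    Tabular,
    Container,
    /// A namespaced value from a plugin; carries the namespace.
    Plugin(String),
    /// Anything else: generic handling instead of a hard failure.
    Generic,
}

/// Returns the part before the first '.', if the value is namespaced.
pub fn namespace(value: &str) -> Option<&str> {
    value.split_once('.').map(|(ns, _rest)| ns)
}

/// Matches the values this consumer knows and degrades gracefully on
/// everything else, since no enum limits what plugins may emit.
pub fn dispatch(item_type: &str) -> Handling {
    match item_type {
        "tabular" => Handling::Tabular,
        "directory" | "zip_archive" => Handling::Container,
        other => match namespace(other) {
            Some(ns) => Handling::Plugin(ns.to_string()),
            None => Handling::Generic,
        },
    }
}
```

With this shape, `dispatch("biobinoc.fasta-alignment")` can be routed to plugin-aware code by namespace, while an unnamespaced `fasta-alignment` falls through to generic handling.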
Tags are facts, not judgments
Every tag in the IR is a factual observation: `binoc.column-reorder`
means "the columns were reordered", not "this is unimportant." Whether a
column reorder counts as clerical or substantive is a renderer concern,
mapped from tags via configuration. See
Significance classification.
This split is the reason a single dataset's binoc output can be read differently by different audiences without re-running the diff: the same changeset JSON, fed through two renderer configs, can produce two changelogs that disagree about what mattered.
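The separation can be sketched as a pure function from tags plus configuration to a significance level. Everything named below (`Significance`, `RendererConfig`, `classify`) is hypothetical, as is the rule that the most significant tag wins; binoc's actual classification logic and config format may differ.

```rust
use std::collections::{BTreeSet, HashMap};

/// Declared in increasing order so that `max()` picks the weightier class.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Significance {
    Clerical,
    Substantive,
}

/// Hypothetical renderer configuration: tag -> significance, plus a fallback.
pub struct RendererConfig {
    pub by_tag: HashMap<String, Significance>,
    pub fallback: Significance,
}

impl RendererConfig {
    pub fn new(entries: &[(&str, Significance)], fallback: Significance) -> Self {
        RendererConfig {
            by_tag: entries
                .iter()
                .map(|(tag, sig)| (tag.to_string(), *sig))
                .collect(),
            fallback,
        }
    }
}

/// Classifies a node by its tags. Assumed rule: the most significant tag
/// wins; unmapped tags and untagged nodes use the fallback.
pub fn classify(tags: &BTreeSet<String>, config: &RendererConfig) -> Significance {
    tags.iter()
        .map(|tag| config.by_tag.get(tag).copied().unwrap_or(config.fallback))
        .max()
        .unwrap_or(config.fallback)
}
```

Two configs that disagree about `binoc.column-reorder` classify the same tag set differently, and the tags themselves never change.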
The tree is structural, not just additive
The controller keeps `identical` nodes in the tree during transformer
execution and prunes them only at the end. This means a
copy-detection transformer can correlate an `add` node with an
unchanged file elsewhere in the snapshot, because that file is still in the tree.
See the
full comparison tree ADR
for why.

    flowchart LR
        subgraph Transform["Transformer view (full tree)"]
            T0["root (modify)"]
            T1["docs/guide.txt (identical)"]
            T2["data/records.csv (modify, +1 row)"]
            T0 --> T1
            T0 --> T2
        end
        subgraph Render["Renderer view (after prune_identical)"]
            R0["root (modify)"]
            R2["data/records.csv (modify, +1 row)"]
            R0 --> R2
        end

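The two views in the diagram differ by a single pruning pass. The sketch below is an illustrative guess at what such a pass could look like, not binoc's implementation: the trimmed `DiffNode`, the `node` and `count_action` helpers, and the decision to always keep the root are assumptions. Only the name `prune_identical` comes from the diagram.

```rust
/// Trimmed-down, hypothetical stand-in for binoc's `DiffNode`.
#[derive(Debug, Clone, PartialEq)]
pub struct DiffNode {
    pub action: String,
    pub path: String,
    pub children: Vec<DiffNode>,
}

/// Illustrative constructor (not part of binoc's API).
pub fn node(action: &str, path: &str, children: Vec<DiffNode>) -> DiffNode {
    DiffNode {
        action: action.to_string(),
        path: path.to_string(),
        children,
    }
}

/// Counts nodes with the given action anywhere in the tree. Stands in for
/// the kind of lookup a transformer can do while the tree is still full.
pub fn count_action(root: &DiffNode, action: &str) -> usize {
    let own = usize::from(root.action == action);
    own + root
        .children
        .iter()
        .map(|child| count_action(child, action))
        .sum::<usize>()
}

/// Removes every descendant whose action is "identical", together with
/// its subtree. The root itself is always kept (an assumption here).
pub fn prune_identical(root: &mut DiffNode) {
    root.children.retain(|child| child.action != "identical");
    for child in &mut root.children {
        prune_identical(child);
    }
}

/// The transformer view from the diagram; the root path "." is assumed.
pub fn example_tree() -> DiffNode {
    node(
        "modify",
        ".",
        vec![
            node("identical", "docs/guide.txt", vec![]),
            node("modify", "data/records.csv", vec![]),
        ],
    )
}
```

Before `prune_identical` runs, `docs/guide.txt` is still reachable for correlation; afterwards only `data/records.csv` remains under the root, matching the renderer view.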
What a changeset looks like on disk
A changeset on disk is JSON. The root node is the top-level object; children nest naturally. The default Markdown renderer reads the tree, applies significance classification, and produces grouped output:

    # Changelog: snapshot-a → snapshot-b
    ## Substantive Changes
    - **data/records.csv**: 1 row added
    - **data/extra.csv**: New table (2 columns, 1 row)
    ## Clerical Changes
    - **summary.csv**: Columns reordered (content unchanged)

Pipeline integrators consume the JSON directly. The changeset JSON schema is the canonical contract.
Combining changesets
`binoc changelog changeset-1.json changeset-2.json …` reads multiple stored
changesets and produces a single changelog spanning all of them. The
combinator is just another renderer; it does not modify the IR.
This is the model for "dataset history" use cases: store one changeset per release, combine them on demand to produce a release-spanning view.
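The "just another renderer" property can be illustrated with a function that only borrows its inputs. This is a hypothetical sketch, not the `binoc changelog` implementation: the trimmed `DiffNode`, the release labels, and the output layout are all invented for the example, and loading the stored JSON is left out.

```rust
/// Trimmed-down, hypothetical stand-in for binoc's `DiffNode`.
#[derive(Debug, Clone, PartialEq)]
pub struct DiffNode {
    pub action: String,
    pub path: String,
    pub summary: Option<String>,
    pub children: Vec<DiffNode>,
}

/// Renders several changesets into one changelog. Taking shared references
/// means the function cannot modify any of the trees it reads.
pub fn render_combined(changesets: &[(&str, &DiffNode)]) -> String {
    let mut out = String::new();
    for (label, root) in changesets {
        out.push_str(&format!("## {}\n", label));
        render_node(root, &mut out);
    }
    out
}

/// Emits one bullet per leaf, preferring the summary and falling back to
/// the bare action when no summary was set.
fn render_node(node: &DiffNode, out: &mut String) {
    if node.children.is_empty() {
        let text = node.summary.as_deref().unwrap_or(node.action.as_str());
        out.push_str(&format!("- **{}**: {}\n", node.path, text));
    }
    for child in &node.children {
        render_node(child, out);
    }
}
```

Storing one changeset per release and passing them in release order gives the release-spanning view; since the inputs are never mutated, the same stored changesets can be recombined in any grouping later.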
Where to go next
- For the cross-plugin composition mechanism that lives alongside `details` and `tags` → Artifacts and composition.
- For how a node's `comparator` and `transformed_by` provenance fields are used → Extract and provenance.
- For the long-form ADRs: transient fields on wire, opportunistic ItemRef metadata, full comparison tree.
- For the JSON contract → changeset schema reference.