IR and changesets

The IR (intermediate representation) is binoc's lingua franca. Every comparator emits IR nodes; every transformer rewrites them; every renderer consumes them. The IR is the contract — both within a single binoc run and across releases — so understanding its shape is essential to writing plugins or building a pipeline that consumes binoc's output.

A changeset is a tree of DiffNode values

Each changeset is a single DiffNode (the root) with children, grandchildren, and so on. The shape mirrors the input snapshots: a directory becomes a container node with file children; a zip archive becomes a container node with the archive's contents as children; a CSV file becomes a leaf with column / row details.

flowchart TD
    Root["root: directory (modify)"]
    Root --> A["data/extra.csv (add)"]
    Root --> B["data/records.csv (modify, +1 row)"]
    Root --> C["docs/readme.txt (modify, +2/-1 lines)"]
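
The tree in the diagram can be sketched as nested Python dictionaries, which is roughly what a pipeline would see after parsing the changeset. This is an illustration, not binoc's API: it assumes the serialized keys match the field names described in this section, the root path and the `item_type` values are placeholders, and `walk` is a hypothetical helper.

```python
# Hypothetical parsed form of the changeset drawn above.
# Assumptions: keys match the DiffNode field names, the root path is the
# empty string, and the item_type values are illustrative only.
changeset = {
    "action": "modify",
    "item_type": "directory",
    "path": "",
    "children": [
        {"action": "add", "item_type": "tabular", "path": "data/extra.csv"},
        {"action": "modify", "item_type": "tabular", "path": "data/records.csv",
         "summary": "1 row added"},
        {"action": "modify", "item_type": "file", "path": "docs/readme.txt",
         "summary": "2 lines added, 1 removed"},
    ],
}


def walk(node, depth=0):
    """Yield (depth, node) for every node, parents before children."""
    yield depth, node
    for child in node.get("children", []):
        yield from walk(child, depth + 1)


# Print an indented outline of the tree.
for depth, node in walk(changeset):
    label = node["path"] or "(root)"
    print("  " * depth + f"{label} [{node['item_type']}]: {node['action']}")
```

Because every level has the same shape, one recursive function is enough to visit directories, archives, and leaves alike.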

DiffNode fields

The full set of fields, defined in binoc-core/src/ir.rs:

| Field | Type | Purpose |
| --- | --- | --- |
| action | open string | What happened: "add", "remove", "modify", "move", "reorder", "identical", … Plugins may define new values. |
| item_type | open string | What the item is: "directory", "file", "tabular", "zip_archive", … The core never interprets it. |
| path | string | Logical path within the snapshot, e.g. "archive.zip/data/file.csv". |
| source_path | optional | For moves and renames: the original (left-side) path. |
| summary | optional | Human-readable one-liner ("2 lines added, 1 removed"). Set by comparators or transformers; rendered by renderers. |
| tags | set of strings | Semantic observations: binoc.column-reorder, binoc.content-changed, … Open and namespaced by convention. |
| children | list | Child diff nodes forming the tree structure. |
| details | map | Comparator-specific structured data (column lists, row counts, hashes). |
| annotations | map | Transformer-added metadata, kept separate from details. |
| comparator | optional | Which comparator produced this node; provenance for the extract chain. |
| transformed_by | list | Transformers that modified this node, in order; provenance for the extract chain. |
| artifacts | list of ArtifactDescriptor | Pointers to typed payloads published by comparators. See Artifacts and composition. |
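
As a reading aid, the field list can be mirrored in a small Python dataclass for pipeline code. This is a sketch, not a binding shipped by binoc: the authoritative definition is the Rust type in binoc-core/src/ir.rs, and the code assumes that serialized keys use the same names as the fields and that an absent key means "empty" or "unset".

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Set


@dataclass
class DiffNode:
    # Open strings: any value is legal, so plain str rather than an Enum.
    action: str
    item_type: str
    path: str
    source_path: Optional[str] = None
    summary: Optional[str] = None
    tags: Set[str] = field(default_factory=set)
    children: List["DiffNode"] = field(default_factory=list)
    details: Dict[str, Any] = field(default_factory=dict)
    annotations: Dict[str, Any] = field(default_factory=dict)
    comparator: Optional[str] = None
    transformed_by: List[str] = field(default_factory=list)
    # Artifact descriptors are kept as raw dicts; their shape is not modeled here.
    artifacts: List[Dict[str, Any]] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "DiffNode":
        """Build a node tree from a parsed-JSON dict, ignoring unknown keys."""
        return cls(
            action=raw["action"],
            item_type=raw["item_type"],
            path=raw["path"],
            source_path=raw.get("source_path"),
            summary=raw.get("summary"),
            tags=set(raw.get("tags", [])),
            children=[cls.from_dict(c) for c in raw.get("children", [])],
            details=dict(raw.get("details", {})),
            annotations=dict(raw.get("annotations", {})),
            comparator=raw.get("comparator"),
            transformed_by=list(raw.get("transformed_by", [])),
            artifacts=list(raw.get("artifacts", [])),
        )


# Hand-written example input (not real binoc output).
example = DiffNode.from_dict({
    "action": "modify",
    "item_type": "directory",
    "path": "",
    "children": [
        {"action": "move", "item_type": "file", "path": "docs/b.txt",
         "source_path": "docs/a.txt", "tags": ["binoc.content-changed"]},
    ],
})
```

Keeping `details` and `annotations` as separate maps in the mirror preserves the distinction the table draws between comparator output and transformer metadata.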

A few of these fields are transient — they exist during the live diff session but are stripped on serialization. See the transient fields on wire ADR for what crosses the boundary and what doesn't.

Three design commitments behind the IR

Everything is openly typed

action, item_type, and tags are plain strings. They are conventions, not enforced enums. A genomics plugin can emit action: "gap-shift" without touching core. A pipeline integrator's downstream code can match on item_type == "tabular" without binoc dictating an enum it has to keep up with.

The trade-off is that the IR offers no static guarantees about the values of these fields. The convention is to namespace custom values (biobinoc.fasta-alignment, not fasta-alignment) so that pipelines that dispatch on exact values can do so without name collisions. See Vocabulary.
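
One way a pipeline might dispatch on open strings is a lookup table with an explicit fallback, so that an unknown value degrades gracefully instead of raising. The sketch below is hypothetical throughout: the handlers, `describe`, and `namespace_of` are invented for illustration, and treating `biobinoc.fasta-alignment` as an `item_type` is an assumption.

```python
from typing import Optional


def namespace_of(value: str) -> Optional[str]:
    """Return the prefix before the first dot, or None if there is no dot."""
    prefix, dot, _ = value.partition(".")
    return prefix if dot else None


# Hypothetical handlers keyed by exact item_type value.
HANDLERS = {
    "tabular": lambda node: f"table changed: {node['path']}",
    "directory": lambda node: f"directory changed: {node['path']}",
    "biobinoc.fasta-alignment": lambda node: f"alignment changed: {node['path']}",
}


def describe(node: dict) -> str:
    """Dispatch on item_type; unknown values fall through to a generic message."""
    handler = HANDLERS.get(node["item_type"])
    if handler is None:
        return f"unhandled item_type {node['item_type']!r}: {node['path']}"
    return handler(node)


print(describe({"item_type": "tabular", "path": "data/records.csv"}))
print(describe({"item_type": "acme.sensor-log", "path": "logs/run1.bin"}))
```

Matching on the full namespaced string means a second plugin that also defines a "fasta-alignment" concept cannot be mistaken for the first.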

Tags are facts, not judgments

Every tag in the IR is a factual observation: binoc.column-reorder means "the columns were reordered" — not "this is unimportant." Whether a column reorder counts as clerical or substantive is a renderer concern, mapped from tags via configuration. See Significance classification.

This split is the reason a single dataset's binoc output can be read differently by different audiences without re-running the diff: the same changeset JSON, fed through two renderer configs, can produce two changelogs that disagree about what mattered.
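
A minimal sketch of that split, assuming a renderer configuration is nothing more than a mapping from tag to significance class. The config shape, the `classify` helper, the "unclassified" fallback, and the rule that "substantive" outranks "clerical" are all assumptions made for this example; the actual format is covered in Significance classification.

```python
# One node, as it might appear in a stored changeset (hand-written example).
node = {"path": "summary.csv", "tags": ["binoc.column-reorder"]}

# Two hypothetical audiences that disagree about column order.
archivist_config = {
    "binoc.column-reorder": "clerical",
    "binoc.content-changed": "substantive",
}
schema_owner_config = {
    "binoc.column-reorder": "substantive",
    "binoc.content-changed": "substantive",
}


def classify(node, config, default="unclassified"):
    """Map a node's tags to one class; the most significant class wins."""
    classes = {config.get(tag, default) for tag in node.get("tags", [])}
    for level in ("substantive", "clerical"):
        if level in classes:
            return level
    return default


print(classify(node, archivist_config))
print(classify(node, schema_owner_config))
```

The node is never modified: the tag stays a fact, and only the mapping applied at render time differs.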

The tree is structural, not just additive

The controller keeps identical nodes in the tree during transformer execution and prunes them only at the end. This means a copy-detection transformer can correlate an add node with an unchanged file elsewhere in the snapshot, because that file's identical node is still in the tree. See the full comparison tree ADR for why.

flowchart LR
    subgraph Transform["Transformer view (full tree)"]
        T0["root (modify)"]
        T1["docs/guide.txt (identical)"]
        T2["data/records.csv (modify, +1 row)"]
        T0 --> T1
        T0 --> T2
    end

    subgraph Render["Renderer view (after prune_identical)"]
        R0["root (modify)"]
        R2["data/records.csv (modify, +1 row)"]
        R0 --> R2
    end
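
The two views can be sketched in Python on parsed-JSON dicts. `prune_identical` is the name used in the diagram, but the body below is only an illustration of the idea, not binoc's code. The `hash` key inside `details` and the `find_copies` helper are assumptions invented for this example.

```python
def iter_nodes(node):
    """Yield every node in the tree, parents before children."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def find_copies(root):
    """Map each added path to an identical path with the same content hash."""
    by_hash = {}
    for n in iter_nodes(root):
        if n["action"] == "identical" and "hash" in n.get("details", {}):
            by_hash.setdefault(n["details"]["hash"], n["path"])
    return {
        n["path"]: by_hash[n["details"]["hash"]]
        for n in iter_nodes(root)
        if n["action"] == "add" and n.get("details", {}).get("hash") in by_hash
    }


def prune_identical(node):
    """Return a copy of the tree without identical nodes (the root is kept)."""
    kept = [
        prune_identical(child)
        for child in node.get("children", [])
        if child["action"] != "identical"
    ]
    return {**node, "children": kept}


# Hand-written full tree, as a transformer would see it.
tree = {
    "action": "modify", "item_type": "directory", "path": "",
    "children": [
        {"action": "identical", "item_type": "file", "path": "docs/guide.txt",
         "details": {"hash": "abc123"}},
        {"action": "add", "item_type": "file", "path": "docs/guide-copy.txt",
         "details": {"hash": "abc123"}},
        {"action": "modify", "item_type": "tabular", "path": "data/records.csv"},
    ],
}

copies_before = find_copies(tree)                  # sees the identical node
copies_after = find_copies(prune_identical(tree))  # nothing left to match
```

Running the same detection after pruning finds nothing, which is why pruning has to wait until transformers have finished.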

What a changeset looks like on disk

A changeset on disk is JSON. The root node is the top-level object; children nest naturally. The default Markdown renderer reads the tree, applies significance classification, and produces grouped output:

# Changelog: snapshot-a → snapshot-b

## Substantive Changes

- **data/records.csv**: 1 row added
- **data/extra.csv**: New table (2 columns, 1 row)

## Clerical Changes

- **summary.csv**: Columns reordered (content unchanged)

Pipeline integrators consume the JSON directly. The changeset JSON schema is the canonical contract.
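
For instance, a pipeline step might load a stored changeset and collect the tables that changed. This sketch uses only Python's standard `json` module. The inline document is a hand-written stand-in rather than real binoc output, and the assumption that JSON keys match the field names should be checked against the changeset JSON schema.

```python
import json

# Stand-in for the contents of a stored changeset file (assumed key names).
raw = """
{
  "action": "modify", "item_type": "directory", "path": "",
  "children": [
    {"action": "modify", "item_type": "tabular", "path": "data/records.csv",
     "summary": "1 row added", "tags": ["binoc.content-changed"]},
    {"action": "add", "item_type": "tabular", "path": "data/extra.csv"},
    {"action": "modify", "item_type": "tabular", "path": "summary.csv",
     "tags": ["binoc.column-reorder"]}
  ]
}
"""


def changed_paths(node, item_type):
    """Collect paths of non-identical nodes with the given item_type."""
    found = []
    if node["item_type"] == item_type and node["action"] != "identical":
        found.append(node["path"])
    for child in node.get("children", []):
        found.extend(changed_paths(child, item_type))
    return found


root = json.loads(raw)
tables = changed_paths(root, "tabular")
print(tables)
```

In a real pipeline, `json.loads(raw)` would be replaced by `json.load` on the opened changeset file.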

Combining changesets

binoc changelog changeset-1.json changeset-2.json … reads multiple stored changesets and produces a single changelog spanning all of them. The combinator is just another renderer; it does not modify the IR.

This is the model for "dataset history" use cases: store one changeset per release, combine them on demand to produce a release-spanning view.
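
The storage pattern can be sketched as follows. This is not the `binoc changelog` implementation; it only illustrates that combining is a read-only pass over stored changesets. The release labels, the `combine` helper, and the flat entry format are hypothetical.

```python
import copy


def leaf_changes(node):
    """Yield the leaf nodes of a changeset tree, in document order."""
    children = node.get("children", [])
    if not children:
        yield node
    for child in children:
        yield from leaf_changes(child)


def combine(changesets):
    """Flatten (label, root) pairs, given in release order, into one history."""
    entries = []
    for label, root in changesets:
        for node in leaf_changes(root):
            entries.append((label, node["path"], node["action"]))
    return entries


# One hand-written changeset per release (assumed key names).
history = [
    ("v1.1", {"action": "modify", "item_type": "directory", "path": "",
              "children": [{"action": "add", "item_type": "tabular",
                            "path": "data/extra.csv"}]}),
    ("v1.2", {"action": "modify", "item_type": "directory", "path": "",
              "children": [{"action": "modify", "item_type": "tabular",
                            "path": "data/extra.csv"},
                           {"action": "remove", "item_type": "file",
                            "path": "docs/old.txt"}]}),
]

snapshot = copy.deepcopy(history)  # kept only to show the inputs are untouched
combined = combine(history)
for label, path, action in combined:
    print(f"{label}: {path} ({action})")
```

Since the stored changesets are never rewritten, the same files can be recombined later over a different range of releases.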

Where to go next