Derive Parse-Rule Link Gating from Pair Reads
Date: 2026-06-13
Status: Implemented
Context
A parse rule turns raw bytes into a published artifact. Most parse rules
should only run on nodes that have already been paired (linked): there is
no point parsing a SQLite file or a stacked-CSV collection on a leaf that
will never have a counterpart. The engine modeled this with a per-rule
ParseDescriptor.requires_link: bool flag.
The cross-format tabular pairing feature broke that model. TabularPair
compares parsed tabular_v1 content to detect a CSV→TSV reformat as a
single reformatted table. To do that it must read the tabular_v1
artifact before any link exists — pairing is exactly what produces the
link. So CsvParse.requires_link was hand-flipped to false.
That hand-flip exposed a deeper fact: requires_link is not a local
property of a parse rule. It is a statement about the whole ruleset —
"no pre-link consumer needs my output." A parse rule author cannot
correctly set it in isolation, because the correct value depends on which
pair rules happen to be configured alongside it. The same fact was encoded
in two places that must always agree (the parse rule's flag and the pair
rules' declared reads), which is a latent inconsistency waiting to bite.
Decision
Remove ParseDescriptor.requires_link and derive the gate from the
ruleset. Once per run, before the fixed-point loop, the driver
(binoc-core/src/correspondence/driver.rs) computes:
    preconsumed_formats: BTreeSet<ArtifactFormat>
        = union of every CoreRule::Pair rule's descriptor().reads
A parse rule is link-gated iff its descriptor.output is not in
preconsumed_formats:
    if !preconsumed_formats.contains(&descriptor.output)
        && store.links.of_node(id).is_empty()
    {
        continue;
    }
Pair rules are the only pre-link artifact consumers today: writers and
annotators run post-link, and expand rules read raw data, not artifacts.
So the union of pair reads is the complete set of formats that must
exist on unlinked nodes. The derivation reproduces the prior behavior
exactly. CsvParse (output tabular_v1) is read by TabularPair, so it is
un-gated (the former hand-flip, now formalized). Every other parse rule
(CsvStackedTablesParse, SqliteParseRule, and the stat-binary collection
parsers) outputs tabular_collection_v1, which no pair rule reads, so each
stays link-gated.
This stays a static pre-pass: it sets a per-format gate and does not reorder or schedule rule execution. The engine remains a fixed-point loop.
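The derivation above can be sketched in a few lines of Rust. This is a minimal, self-contained illustration rather than the driver's actual code: the `Rule` enum, the string-typed `ArtifactFormat`, `example_ruleset`, and the `node_has_links` parameter are all simplifications assumed here for the example. Only the shape of the computation (union of pair reads, then a per-format membership test combined with the unlinked-node check) is taken from the text.

```rust
use std::collections::BTreeSet;

// Assumption: formats are modeled as plain strings for this sketch.
type ArtifactFormat = &'static str;

// Hypothetical, simplified rule model holding only what the derivation needs.
enum Rule {
    Parse { name: &'static str, output: ArtifactFormat },
    Pair { reads: Vec<ArtifactFormat> },
}

/// Union of every pair rule's declared reads, computed once per run.
fn preconsumed_formats(rules: &[Rule]) -> BTreeSet<ArtifactFormat> {
    let mut formats = BTreeSet::new();
    for rule in rules {
        if let Rule::Pair { reads } = rule {
            formats.extend(reads.iter().copied());
        }
    }
    formats
}

/// A parse rule is link-gated iff no pair rule reads its output format.
fn is_link_gated(output: ArtifactFormat, preconsumed: &BTreeSet<ArtifactFormat>) -> bool {
    !preconsumed.contains(&output)
}

/// Mirrors the gate condition: skip a gated parse rule on a node with no links.
fn skip_parse(
    output: ArtifactFormat,
    preconsumed: &BTreeSet<ArtifactFormat>,
    node_has_links: bool,
) -> bool {
    is_link_gated(output, preconsumed) && !node_has_links
}

/// A tiny ruleset using rule and format names from this document.
fn example_ruleset() -> Vec<Rule> {
    vec![
        Rule::Parse { name: "CsvParse", output: "tabular_v1" },
        Rule::Parse { name: "SqliteParseRule", output: "tabular_collection_v1" },
        // Stands in for TabularPair, which reads tabular_v1 before any link exists.
        Rule::Pair { reads: vec!["tabular_v1"] },
    ]
}

fn main() {
    let rules = example_ruleset();
    let preconsumed = preconsumed_formats(&rules);
    for rule in &rules {
        if let Rule::Parse { name, output } = rule {
            println!("{}: link-gated = {}", name, is_link_gated(*output, &preconsumed));
        }
    }
}
```

In this sketch, removing the pair rule from the ruleset would make `CsvParse` link-gated again without touching the parse rule itself, which is the property the hand-set flag could not provide.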
Alternatives Considered
- Keep requires_link and add a lint that cross-checks it against pair reads. Rejected: validating a redundant field is strictly worse than deriving the only consistent value. The field could still be authored wrong; the lint would just catch it later.
- Schedule parse rules explicitly relative to pair rules instead of a static gate. Rejected as over-engineering. The fixed-point loop already converges; a per-format boolean gate is sufficient and keeps the engine's execution model unchanged.
Known Limitation
descriptor.output is a parse rule's primary format only. Child
artifacts published via ParsedChild (e.g. the stacked-table parser
emits child tabular_v1 artifacts) are not declared in the descriptor,
so the derivation cannot see them. For the current ruleset this is
harmless: no pair rule needs a format that is produced only as an
undeclared child pre-link. If a future pair rule needed to read such a
child format on unlinked nodes, the producing parser would have to declare
that format (e.g. via descriptor-level child-format declarations) so the
derivation could un-gate it. Noted here as a future edge rather than
solved speculatively.
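To make the blind spot concrete, the following sketch contrasts the current primary-output check with one possible extension. Everything here is hypothetical: the `child_outputs` field does not exist on the real descriptor, the format names `collection_fmt` and `child_only_fmt` are invented for illustration, and `gated_with_child_decls` is only one way such a declaration might feed the derivation.

```rust
use std::collections::BTreeSet;

// Assumption: formats are modeled as plain strings for this sketch.
type ArtifactFormat = &'static str;

// Hypothetical descriptor: `child_outputs` is NOT part of the engine today.
struct ParseDescriptor {
    output: ArtifactFormat,
    child_outputs: Vec<ArtifactFormat>,
}

/// Current derivation: only the primary output format is visible.
fn gated_today(d: &ParseDescriptor, preconsumed: &BTreeSet<ArtifactFormat>) -> bool {
    !preconsumed.contains(&d.output)
}

/// Possible extension: un-gate when any declared format, primary or child,
/// is read by a pair rule.
fn gated_with_child_decls(d: &ParseDescriptor, preconsumed: &BTreeSet<ArtifactFormat>) -> bool {
    let child_is_read = d.child_outputs.iter().any(|f| preconsumed.contains(f));
    gated_today(d, preconsumed) && !child_is_read
}

fn main() {
    // Invented formats: a parser whose child format is read by some future pair rule.
    let parser = ParseDescriptor {
        output: "collection_fmt",
        child_outputs: vec!["child_only_fmt"],
    };
    let mut preconsumed: BTreeSet<ArtifactFormat> = BTreeSet::new();
    preconsumed.insert("child_only_fmt");

    // The current check keeps the parser gated, so the child never appears pre-link.
    println!("gated today: {}", gated_today(&parser, &preconsumed));
    // With declared child formats, the derivation could un-gate it.
    println!("gated with child decls: {}", gated_with_child_decls(&parser, &preconsumed));
}
```

Note that un-gating the parser in this way would also publish its primary format on unlinked nodes, so any real design would need to weigh that cost.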