Work on performance optimization
Goal. Measure Binoc performance with fixtures large enough to expose real engine costs, then make optimizations only when the measurements show a net win and the projected changeset remains deterministic.
This page is for core contributors changing the correspondence engine or stdlib rule pack. For the original CFM-44 decision record, see CFM-44 Measured Correspondence Performance.
Ground rules
Performance work is gated by two checks:
- Outputs are invariant. Serial and optimized runs must produce the same
  projected changeset. The perf report exposes a changeset_json_hash; hash
  changes are regressions unless the code change intentionally changes
  semantics and the test vectors are updated with that rationale.
- Structural metrics are exact; timing is a trend. Compare rounds, per-rule
  invocations, fires, links, compaction counts, writer usage, and
  description_cost exactly. Treat wall time and per-rule elapsed time as noisy
  evidence that needs repeated runs or a before/after gap large enough to
  survive noise.
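The two checks above can be sketched as a small comparison helper. This is an illustrative sketch only, not the harness's actual code: the struct layout, field types, the `compare` function, and the `noise_fraction` threshold are all assumptions. Only the names `changeset_json_hash` and `description_cost` come from the report described above.

```rust
/// Hypothetical slice of one perf report. Field types are guesses; only the
/// names `changeset_json_hash` and `description_cost` appear in the docs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct StructuralMetrics {
    pub changeset_json_hash: String,
    pub rounds: u64,
    pub links: u64,
    pub description_cost: u64,
}

/// Hypothetical report: exact structural metrics plus repeated wall-time samples.
#[derive(Debug, Clone)]
pub struct RunReport {
    pub structural: StructuralMetrics,
    pub wall_ms_samples: Vec<f64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Hash or a structural counter changed: a regression unless intentional.
    StructuralMismatch,
    WithinNoise,
    Faster,
    Slower,
}

/// Median of the samples; panics on an empty slice.
fn median(samples: &[f64]) -> f64 {
    assert!(!samples.is_empty(), "need at least one timing sample");
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    }
}

/// Structural metrics are compared exactly; wall time is compared by median
/// and only counts when the relative change exceeds `noise_fraction`.
pub fn compare(before: &RunReport, after: &RunReport, noise_fraction: f64) -> Verdict {
    if before.structural != after.structural {
        return Verdict::StructuralMismatch;
    }
    let base = median(&before.wall_ms_samples);
    let change = (median(&after.wall_ms_samples) - base) / base;
    if change.abs() <= noise_fraction {
        Verdict::WithinNoise
    } else if change < 0.0 {
        Verdict::Faster
    } else {
        Verdict::Slower
    }
}
```

The structural check runs first on purpose: a timing win is not worth reading until the hash and counters match.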
Do not optimize against tiny fixtures. They are useful for correctness tests, but they hide the costs that dominate real snapshots: filesystem traversal, hashing, parse work, and cross-tree pairing.
Run the harness
The reusable report command emits one JSON object per execution mode:
With no arguments, just perf uses the same default synthetic shape:
The same binary can measure real snapshot pairs:
The underlying binary is:
cargo run --release -q -p binoc-stdlib --bin perf_report -- \
--groups 1 --files-per-group 200 --rows-per-file 1000
Run a single execution mode when you need attributable wall time, CPU, and RSS:
just perf --mode serial --groups 1 --files-per-group 200 --rows-per-file 1000
just perf --mode parallel_parse --groups 1 --files-per-group 200 --rows-per-file 1000
Use named fixture families for the standard scaling matrix:
just perf --family row-scale
just perf --family file-count-scale
just perf --family directory-scale
just perf --family fuzzy-threshold --mode serial
The ignored baseline test is useful when you want human-readable stderr and a hard serial-vs-parallel equality assertion:
cargo test --release -p binoc-stdlib --test performance_baseline \
performance_baseline_reports_driver_hotspots -- --ignored --nocapture
Override its fixture shape with environment variables:
BINOC_PERF_GROUPS=80 BINOC_PERF_FILES_PER_GROUP=20 BINOC_PERF_ROWS_PER_FILE=25 \
cargo test --release -p binoc-stdlib --test performance_baseline \
performance_baseline_reports_driver_hotspots -- --ignored --nocapture
The report includes resources.user_cpu_ms, resources.system_cpu_ms,
resources.max_rss_kb, and resources.max_rss_delta_kb on Unix/macOS.
max_rss_kb is the process high-water mark after the run; use --mode serial
or --mode parallel_parse when you need single-mode attribution.
Drill into a hot phase with a sampling profiler
perf_report attributes time to phases (expand_ms, parse_ms, pair_ms)
and to each rule (rule_elapsed_nanos). That is enough to tell you which phase
dominates, but not which function inside it. When a phase looks pathological,
switch to a sampling profiler for function-level attribution. The two tools
compose: just perf finds the hot phase; the profiler finds the hot
function.
just profile-diff data/snapshots/foo/2025-01 data/snapshots/foo/2025-02
just profile-diff LEFT RIGHT path/to/dataset.binoc.yaml # with a config
profile-diff builds the native binoc-cli under the profiling Cargo profile
(release optimizations plus line-table symbols, so the flame graph is readable
without distorting timings) and records it with
samply, which opens the Firefox Profiler
UI. Install once with cargo install samply.
Profile the real CLI command, not perf_report: perf_report always runs
the default engine config, so it cannot measure a keyed/--config run, which is the
exact path most likely to be slow on real data. Reach for sampling when phase
metrics point at one phase but not at one rule, when the cost is suspected in
shared helpers (CSV field parsing, row-key construction, changeset
serialization) that span rules, or when wall time is large but per-rule numbers
look unremarkable.
Scaling dimensions
The synthetic fixture has three knobs:
| Dimension | Fixture control | What it stresses |
|---|---|---|
| Rows per file | --rows-per-file | CSV parse CPU, bytes read, row/field artifact construction. |
| Flat file count | Increase --files-per-group while keeping --groups 1 | File registration, hashing, sibling scans, and whole-view pair proposal overhead. |
| Directory/group count | Increase --groups while holding files per group mostly constant | Directory expansion, child registration, path handling, and repeated directory-level work. |
Fuzzy rename detection needs a separate fixture family because it is driven by unmatched remove/add candidates, not CSV row volume. The current stdlib guard allows fuzzy scoring through 20 files per side, or 400 candidate pairs, and skips scoring once a 21x21 candidate set would exceed that cap.
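The cap arithmetic can be written out as a tiny guard. This is a sketch under one assumption: that the cap applies to the product of unmatched removes and adds. The text only states the symmetric cases (20x20 allowed, 21x21 skipped), and the constant and function names here are hypothetical, not the stdlib's.

```rust
/// Cap stated in the text: 20 files per side, i.e. 400 candidate pairs.
pub const MAX_FUZZY_CANDIDATE_PAIRS: usize = 400;

/// Returns true when fuzzy scoring would run for this candidate set.
/// Assumption: the guard is on removed * added, not on each side separately.
pub fn should_score_fuzzy(removed: usize, added: usize) -> bool {
    // saturating_mul avoids overflow on absurdly large candidate sets.
    removed.saturating_mul(added) <= MAX_FUZZY_CANDIDATE_PAIRS
}
```

With this model, 20x20 gives 400 pairs and is scored, while 21x21 gives 441 pairs and is skipped, which matches the thresholds in the fixture family.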
For each axis, hold the others constant and run two or three sizes. A useful matrix is:
| Axis | Example sequence |
|---|---|
| Rows per file | 1x200x250, 1x200x1000, 1x200x8000 |
| Flat file count | 1x50x1000, 1x200x250, 1x1000x100 |
| Directory/group count | 10x20x250, 80x20x250, 320x20x250 |
| Fuzzy candidates | 5, 10, 20, 21, and 50 renamed text files per side |
The shorthand is groups x files_per_group x rows_per_file.
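For notes and scripts that pass shapes around in this shorthand, a small parser makes the sizes explicit. This is a hypothetical helper, not part of perf_report; the `FixtureShape` type and its methods are invented for illustration, and only the three flag names come from the commands above.

```rust
/// Hypothetical representation of a `groups x files_per_group x rows_per_file` shape.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FixtureShape {
    pub groups: u64,
    pub files_per_group: u64,
    pub rows_per_file: u64,
}

impl FixtureShape {
    /// Parses shorthand such as "320x20x250"; returns None on any malformed input.
    pub fn parse(shorthand: &str) -> Option<Self> {
        let mut parts = shorthand.trim().split('x').map(|p| p.trim().parse::<u64>());
        let groups = parts.next()?.ok()?;
        let files_per_group = parts.next()?.ok()?;
        let rows_per_file = parts.next()?.ok()?;
        if parts.next().is_some() {
            return None; // more than three components
        }
        Some(Self { groups, files_per_group, rows_per_file })
    }

    /// Files on each side of the comparison.
    pub fn files_per_side(&self) -> u64 {
        self.groups * self.files_per_group
    }

    /// Rows on each side of the comparison.
    pub fn rows_per_side(&self) -> u64 {
        self.files_per_side() * self.rows_per_file
    }

    /// Flags in the form the harness commands above use.
    pub fn cli_args(&self) -> String {
        format!(
            "--groups {} --files-per-group {} --rows-per-file {}",
            self.groups, self.files_per_group, self.rows_per_file
        )
    }
}
```

For example, parsing 320x20x250 yields 6,400 files per side, and 1x200x1000 renders as the flags used in the default harness command.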
Current measurements
The June 13, 2026 follow-up pass found:
| Scaling axis | Current result |
|---|---|
| Rows per file | CSV parsing scales roughly with bytes. Parallel parse stayed deterministic and improved wall time by 11-37%; the largest row-heavy run was 420 ms serial vs 265 ms parallel over 61 MB total input. |
| Flat file count | Whole-view pairing did not become the bottleneck. Pair time stayed around 0-4 ms; directory expansion and file registration/hash cost dominated many-file cases. The 1x1000x100 tiny-row run regressed under parallel parse, 329 ms serial vs 384 ms parallel, because saved parse time was outweighed by expansion and system noise. |
| Directory/group count | Expansion is the clearest current limit. At 320x20x250, or 6,400 files per side and 13,442 total nodes, serial wall time was 937 ms with 622 ms in expand and 30 ms in pair. Parallel parse was 879 ms with 635 ms expand and 30 ms pair. |
| Fuzzy rename candidates | The fuzzy cap is effective. At 20x20 candidates, scoring and linking took about 1.3 ms; at 21x21, fuzzy scoring skipped entirely. No unbounded fuzzy-path behavior appeared. |
| Duplicate-hash renames | Whole-view scheduling was not the issue, but HashPair candidate selection was. A 4,000-file identical-content rename stress case initially spent 385 ms in pair rules; pre-sorted hash buckets and cursors reduced that to 13 ms with the same rounds and link counts. |
| Resource posture | No immediate memory issue showed up. A 61 MB row-heavy two-mode run reported 118 MB max RSS; a 13k-node directory-heavy two-mode run reported 69 MB max RSS. Row-heavy runs were user-CPU dominated; directory-heavy runs spent more time in system calls. |
Every serial/parallel pair in that pass produced identical changeset hashes.
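The duplicate-hash fix in the table relies on pre-sorted hash buckets walked with cursors. The sketch below shows that general technique under loud assumptions: it is not the actual HashPair rule, a `u64` stands in for the content hash, and the function names and tuple shapes are invented for illustration.

```rust
use std::collections::BTreeMap;

/// Groups paths by content hash and sorts each bucket once.
/// A u64 stands in for the real content hash (assumption).
fn bucket_by_hash(files: &[(String, u64)]) -> BTreeMap<u64, Vec<&str>> {
    let mut buckets: BTreeMap<u64, Vec<&str>> = BTreeMap::new();
    for (path, hash) in files {
        buckets.entry(*hash).or_default().push(path.as_str());
    }
    for paths in buckets.values_mut() {
        paths.sort_unstable();
    }
    buckets
}

/// Links removed and added files with identical content. Each bucket is
/// sorted once and consumed with a cursor, so no candidate list is rescanned
/// per file. BTreeMap iteration order keeps the output deterministic.
pub fn pair_identical_content(
    removed: &[(String, u64)],
    added: &[(String, u64)],
) -> Vec<(String, String)> {
    let removed_buckets = bucket_by_hash(removed);
    let added_buckets = bucket_by_hash(added);
    let mut links = Vec::new();
    for (hash, removed_paths) in &removed_buckets {
        if let Some(added_paths) = added_buckets.get(hash) {
            let mut cursor = 0; // next unclaimed path in the added bucket
            for removed_path in removed_paths {
                if cursor == added_paths.len() {
                    break; // bucket exhausted; leftovers stay unmatched
                }
                links.push(((*removed_path).to_owned(), added_paths[cursor].to_owned()));
                cursor += 1;
            }
        }
    }
    links
}
```

Because the buckets are sorted before pairing, the result does not depend on the order in which files were registered, which is the property the changeset hash check needs.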
Optimization priority
Work through this order unless a new fixture clearly changes the profile:
- Improve the harness first. Add single-mode execution or per-mode
  resource capture so reports can include max RSS plus user and system CPU
  without wrapping the whole two-mode process. Add named fixture families for
  row-scale, file-count-scale, directory-scale, and fuzzy-threshold runs.
  This is implemented in perf_report; preserve it when extending the harness.
- Prototype cheaper directory expansion and child registration. This is
  the current top hotspot. Start with replacing depth-1 WalkDir use in
  DirectoryExpand with std::fs::read_dir plus a stable sort, then measure
  whether traversal, metadata, hashing, or registration is actually dominant.
  That first read_dir spike regressed on both flat and directory-heavy
  fixtures, so keep WalkDir for now. Next likely spike: deterministic batched
  hashing or registration for large sibling sets.
- Keep parallel parse, but tune only with the matrix. It helps row-heavy
  CSV fixtures and is mostly harmless elsewhere, but tiny-file shapes can
  regress overall. The current 32..=1024 job guard held up under threshold
  experiments; do not change it until the named fixture matrix shows a net
  wall-time win.
- Keep dirty-set/frontier pairing parked. Pair time is currently a small
  share of stdlib runs, and the current pair trait has no declared read set.
  Revisit when scheduler time becomes material or while redesigning the
  EngineView transit shape. If pair time spikes inside a specific rule,
  optimize that rule first, as with the duplicate-hash HashPair bucket fix.
- Keep fuzzy caps. They are part of the performance contract, not merely a
  temporary guard. Raising the cap needs a better prefilter or a fixture that
  proves the larger candidate set is safe.
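For reference, the shape of the read_dir spike mentioned in the list can be sketched with the standard library alone. This is not the DirectoryExpand code, and the list notes that this approach regressed in measurement, so treat it as a record of what was tried rather than a recommendation. The function name is hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists the immediate children of `dir` in a deterministic order.
/// `fs::read_dir` makes no ordering guarantee, so the explicit sort is what
/// keeps the downstream changeset stable across platforms and filesystems.
pub fn list_children_sorted(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut children = fs::read_dir(dir)?
        .map(|entry| entry.map(|e| e.path()))
        .collect::<io::Result<Vec<PathBuf>>>()?;
    // slice::sort is a stable sort; PathBuf orders by path components.
    children.sort();
    Ok(children)
}
```

Any replacement for the current traversal needs the same property: identical child order on every run, regardless of how the operating system returns entries.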
After an optimization, run the relevant matrix, the ignored baseline test, and the normal verification loop: