Tutorial
Binoc is most useful when a plain filesystem diff would be noisy. In this tutorial you will use the sample snapshots in this repository, run a few increasingly useful diffs, and finish by teaching binoc one new format.
Clone the repository so you have the tutorial fixtures:
git clone https://github.com/harvard-lil/binoc
cd binoc
Install binoc so the plain binoc command is available:
pip install binoc
Or run it without installing:
uvx binoc diff path/to/snapshot-a path/to/snapshot-b
The examples below show the plain commands a user would run after
installing binoc.
Build the example snapshots
The repository keeps some fixtures as source trees and materializes the real archives on demand. Run this once from the repository root:
You now have a test-vectors-materialized/ folder full of ready-to-diff
snapshots.
Run your first diff
Start with the simplest possible case: two identical snapshots.
binoc diff ./test-vectors-materialized/trivial-identical/snapshot-a ./test-vectors-materialized/trivial-identical/snapshot-b
# Changelog: ./test-vectors-materialized/trivial-identical/snapshot-a → ./test-vectors-materialized/trivial-identical/snapshot-b
No changes detected.
See a text-file change
Now compare two snapshots where one text file changed:
cat ./test-vectors-materialized/single-file-modify-text/snapshot-a/story.txt
printf '\n---\n'
cat ./test-vectors-materialized/single-file-modify-text/snapshot-b/story.txt
binoc diff ./test-vectors-materialized/single-file-modify-text/snapshot-a ./test-vectors-materialized/single-file-modify-text/snapshot-b
# Changelog: ./test-vectors-materialized/single-file-modify-text/snapshot-a → ./test-vectors-materialized/single-file-modify-text/snapshot-b
## Substantive Changes
- **story.txt**: 2 lines added, 1 removed
So far this is just a changelog-style summary of what you would get with a textual diff.
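To make the "lines added, removed" summary concrete, here is a minimal sketch that derives the same kind of counts with Python's `difflib`. It is not binoc's implementation; the helper name `count_line_changes` and the sample text are assumptions made for illustration, not the contents of the tutorial fixture.

```python
import difflib


def count_line_changes(old_text, new_text):
    """Return (added, removed) line counts between two texts."""
    added = removed = 0
    for line in difflib.ndiff(old_text.splitlines(), new_text.splitlines()):
        if line.startswith("+ "):
            added += 1
        elif line.startswith("- "):
            removed += 1
    return added, removed


# Hypothetical file contents, not the actual story.txt fixture.
old = "Once upon a time\nthere was a file.\nThe end.\n"
new = "Once upon a time\nthere was a changed file.\nIt grew a line.\nThe end.\n"

added, removed = count_line_changes(old, new)
print(f"{added} lines added, {removed} removed")
```

An edited line counts as one removal plus one addition, which is how a single edit and a single new line add up to two additions and one removal.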
CSV-aware diffing
Binoc gets more interesting when the format itself matters. Here the same rows are present in both snapshots, but the CSV columns were reordered:
cat ./test-vectors-materialized/csv-column-reorder/snapshot-a/data.csv
printf '\n---\n'
cat ./test-vectors-materialized/csv-column-reorder/snapshot-b/data.csv
binoc diff ./test-vectors-materialized/csv-column-reorder/snapshot-a ./test-vectors-materialized/csv-column-reorder/snapshot-b
# Changelog: ./test-vectors-materialized/csv-column-reorder/snapshot-a → ./test-vectors-materialized/csv-column-reorder/snapshot-b
## Clerical Changes
- **data.csv**: Columns reordered (content unchanged)
That is the first genuinely useful binoc result. A line-oriented diff would treat this like a rewrite. Binoc understands the header row and recognizes that the data itself did not change.
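The idea behind a header-aware comparison can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not how binoc itself is implemented: the function name `columns_only_reordered` and the sample data are hypothetical.

```python
import csv
import io


def read_csv(text):
    """Parse CSV text into (header list, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    # Plain dicts compare equal regardless of key order.
    rows = [dict(row) for row in reader]
    return list(reader.fieldnames or []), rows


def columns_only_reordered(old_text, new_text):
    """True when both files hold the same data and only the column order differs."""
    old_header, old_rows = read_csv(old_text)
    new_header, new_rows = read_csv(new_text)
    same_columns = sorted(old_header) == sorted(new_header)
    order_changed = old_header != new_header
    return same_columns and order_changed and old_rows == new_rows


# Hypothetical data, not the actual data.csv fixture.
old = "id,name,city\n1,Ada,London\n2,Grace,New York\n"
new = "name,city,id\nAda,London,1\nGrace,New York,2\n"

print(columns_only_reordered(old, new))
```

Because each row is keyed by column name rather than by position, moving a column changes every line of the file but none of the parsed rows.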
Now look at a more realistic update: one new column, a reorder, and a new row in the same file.
binoc diff ./test-vectors-materialized/csv-mixed-changes/snapshot-a ./test-vectors-materialized/csv-mixed-changes/snapshot-b
# Changelog: ./test-vectors-materialized/csv-mixed-changes/snapshot-a → ./test-vectors-materialized/csv-mixed-changes/snapshot-b
## Substantive Changes
- **data.csv**: Column added: 'email'; columns reordered; 1 row added
The changelog expresses a complex change in terms a dataset maintainer can act on.
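A summary like the one above can be assembled from three independent checks: which columns are new, whether the shared columns changed order, and which rows are new. The sketch below assumes rows are identified by a hypothetical `id` key column and uses made-up data; binoc's own row matching may work differently.

```python
import csv
import io


def load(text):
    """Parse CSV text into (header list, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = [dict(row) for row in reader]
    return list(reader.fieldnames or []), rows


def summarize_csv_change(old_text, new_text, key="id"):
    """Describe column and row changes; `key` is an assumed row identifier."""
    old_header, old_rows = load(old_text)
    new_header, new_rows = load(new_text)
    parts = []

    added_cols = [c for c in new_header if c not in old_header]
    if added_cols:
        parts.append("Column added: " + ", ".join(repr(c) for c in added_cols))

    # Compare the order of only those columns present on both sides.
    shared_old = [c for c in old_header if c in new_header]
    shared_new = [c for c in new_header if c in old_header]
    if shared_old != shared_new:
        parts.append("columns reordered")

    old_keys = {row[key] for row in old_rows}
    added_rows = sum(1 for row in new_rows if row[key] not in old_keys)
    if added_rows:
        parts.append(f"{added_rows} row(s) added")

    return "; ".join(parts) or "no changes"


# Hypothetical data, not the actual csv-mixed-changes fixture.
old = "id,name,city\n1,Ada,London\n2,Grace,New York\n"
new = (
    "name,id,city,email\n"
    "Ada,1,London,ada@example.org\n"
    "Grace,2,New York,grace@example.org\n"
    "Alan,3,Manchester,alan@example.org\n"
)

print(summarize_csv_change(old, new))
# prints: Column added: 'email'; columns reordered; 1 row(s) added
```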
Look inside a zip
The same diff command works on nested content too:
binoc diff ./test-vectors-materialized/zip-simple/snapshot-a ./test-vectors-materialized/zip-simple/snapshot-b
# Changelog: ./test-vectors-materialized/zip-simple/snapshot-a → ./test-vectors-materialized/zip-simple/snapshot-b
## Substantive Changes
- **archive.zip/data.txt**: 1 line added, 1 removed
- **archive.zip/extra.txt**: New file (1 line)
Binoc expands container formats like zip before dispatching their contents.
Paths such as archive.zip/data.txt point inside the archive.
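The logical-path idea can be illustrated with Python's `zipfile` module. This sketch is not binoc's code: `expand_zip` and `make_zip` are hypothetical helpers, and the archives are built in memory with made-up contents.

```python
import io
import zipfile


def expand_zip(archive_name, data):
    """Map logical paths such as 'archive.zip/data.txt' to member bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {
            f"{archive_name}/{info.filename}": zf.read(info)
            for info in zf.infolist()
            if not info.is_dir()
        }


def make_zip(files):
    """Build an in-memory zip from a {name: text} mapping (test helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


# Hypothetical archive contents, not the actual zip-simple fixture.
old = expand_zip("archive.zip", make_zip({"data.txt": "a\nb\n"}))
new = expand_zip("archive.zip", make_zip({"data.txt": "a\nc\n", "extra.txt": "x\n"}))

for path in sorted(set(old) | set(new)):
    if path not in old:
        status = "new file"
    elif path not in new:
        status = "removed"
    elif old[path] != new[path]:
        status = "content changed"
    else:
        status = "unchanged"
    print(f"{path}: {status}")
```

Once members are keyed by logical path, each pair can be handed to whatever comparison fits its type, which is why the zip output reads like the earlier text-file summaries.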
Teaching binoc a new format
Binoc can learn domain formats through plugins. For example, binoc's standard library does not understand FASTA:
binoc diff ./docs/examples/fasta-demo/snapshot-a/sequences.fasta ./docs/examples/fasta-demo/snapshot-b/sequences.fasta
# Changelog: ./docs/examples/fasta-demo/snapshot-a/sequences.fasta → ./docs/examples/fasta-demo/snapshot-b/sequences.fasta
## Substantive Changes
- **sequences.fasta**: Content changed (92 bytes → 102 bytes)
All the default tool can say is that the two FASTA files are different.
With a short plugin, we can report the useful fact instead of the generic byte change. This example defines the comparator in one Python script; the same class could later be packaged as a reusable plugin.
python - <<'PY'
from pathlib import Path

import binoc


class FastaComparator(binoc.Comparator):
    name = 'bio.fasta'
    extensions = ['.fasta', '.fa']

    def compare(self, pair):
        # A side with no path parses as an empty set of records.
        left = self._parse(Path(pair.left_path).read_text()) if pair.left_path else {}
        right = self._parse(Path(pair.right_path).read_text()) if pair.right_path else {}
        ids = sorted(set(left) | set(right))
        # Count records whose sequence differs between the two sides.
        sequences_changed = sum(
            1
            for record_id in ids
            if left.get(record_id, {}).get('seq') != right.get(record_id, {}).get('seq')
        )
        # Count records whose header line differs between the two sides.
        headers_changed = sum(
            1
            for record_id in ids
            if left.get(record_id, {}).get('hdr') != right.get(record_id, {}).get('hdr')
        )
        if not sequences_changed and not headers_changed:
            return binoc.Identical()
        summary = (
            f'{sequences_changed} sequence(s) changed'
            if sequences_changed
            else f'Headers updated ({headers_changed} records); sequences unchanged'
        )
        tags = ['bio.header-change'] if headers_changed else []
        if sequences_changed:
            tags.append('bio.sequence-change')
        return binoc.Leaf(
            binoc.DiffNode(
                action='modify',
                item_type='fasta',
                path=pair.logical_path,
                summary=summary,
                tags=tags,
            )
        )

    @staticmethod
    def _parse(text):
        # Map each record ID (first token after '>') to its header line
        # and its sequence with line breaks removed.
        records = {}
        current = None
        for line in text.strip().split('\n'):
            if line.startswith('>'):
                current = line.split()[0][1:]
                records[current] = {'hdr': line, 'seq': ''}
            elif current:
                records[current]['seq'] += line.strip()
        return records


config = binoc.Config(comparators=['binoc.text'])
config.add_comparator(FastaComparator())
changeset = binoc.diff(
    './docs/examples/fasta-demo/snapshot-a/sequences.fasta',
    './docs/examples/fasta-demo/snapshot-b/sequences.fasta',
    config=config,
)
print(binoc.to_markdown([changeset]))
PY
# Changelog: ./docs/examples/fasta-demo/snapshot-a/sequences.fasta → ./docs/examples/fasta-demo/snapshot-b/sequences.fasta
## Other Changes
- **sequences.fasta**: Headers updated (2 records); sequences unchanged
Now binoc reports that only the headers changed and lists the file under Other Changes rather than Substantive Changes: these two FASTA files do not substantively differ.
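The parsing logic at the heart of the comparator can be tried without binoc installed. The sketch below lifts the same record-splitting approach into a standalone function; the name `parse_fasta` and the sample records are assumptions for illustration, not the contents of the fasta-demo fixture.

```python
def parse_fasta(text):
    """Map each record ID to its header line and unwrapped sequence."""
    records = {}
    current = None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            # The record ID is the first token after '>'.
            current = line.split()[0][1:]
            records[current] = {"hdr": line, "seq": ""}
        elif current:
            records[current]["seq"] += line.strip()
    return records


# Hypothetical records: headers are annotated and one sequence is re-wrapped.
old = ">seq1 Homo sapiens\nACGT\nACGT\n>seq2 Mus musculus\nGGCC\n"
new = ">seq1 Homo sapiens (human)\nACGTACGT\n>seq2 Mus musculus (mouse)\nGGCC\n"

a, b = parse_fasta(old), parse_fasta(new)
headers_changed = sum(1 for rid in a if a[rid]["hdr"] != b[rid]["hdr"])
sequences_changed = sum(1 for rid in a if a[rid]["seq"] != b[rid]["seq"])
print(f"headers changed: {headers_changed}, sequences changed: {sequences_changed}")
```

Joining sequence lines before comparing means a file that only re-wraps its sequences at a different line width is not counted as a sequence change.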
Where next
If you want to keep working from the command line, continue with Diff two snapshots, Save and render changesets, and Extract changed data.
If you want to use packaged extensions, go to Install and use plugins. If you want to turn the FASTA script into a real plugin, continue with Write a Python comparator.
If you want to learn more about the design of binoc, start with the Architecture overview.