Revision-prone datasets: a cross-field catalog

Research note, not normative documentation

This is a background survey of the datasets binoc is most likely to be pointed at, not a description of how binoc currently behaves or a commitment to support any of them. It is a curation aid for the showcase pipeline and a coverage check for the rule families. Roughly 110 datasets across ten fields were web-verified against primary sources in June 2026; load-bearing claims carry a link to the source. Maintenance status, release URLs, and acquirability drift — re-verify before building a target on any single entry. Poke holes in it.

binoc generates a changelog by diffing two snapshots of the same logical dataset. The datasets that matter are therefore not the ones that change (almost everything changes) but the ones that silently overwrite already-published records: a value edited, a category reassigned, a column renamed, a historical series restated, a label corrected, or a boundary redrawn, with no per-record changelog shipped.

The single most important distinction in this survey is revision vs. vintage. A revision changes a record that was already there (BEA restates 2015 GDP; ClinVar flips a variant from pathogenic to benign; the tz database corrects Mexico's 1921 offset). A vintage is a fresh cohort wearing last year's name (NHANES samples new people each cycle; the ACS 1-year file is a different population). Diff a revision and binoc produces a true changelog; diff a vintage and it produces a wall of meaningless churn.

The headline finding: in every field, authoritative reference data revises in place without a per-record changelog in the payload, yet a surprising fraction of publishers ship a partial, machine-readable answer key for those revisions (a NEWS file, a prev_symbol column, a delisting table), which doubles as a ground-truth oracle for evaluating binoc's output.
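One rough way to tell a revision from a vintage before diffing is to measure how many record keys the two snapshots share. The sketch below is a hypothetical pre-check, not a description of how binoc works; the function names, the 0.5 threshold, and the toy keys are all assumptions made for illustration.

```python
def key_overlap(old_keys, new_keys):
    """Jaccard overlap between the record keys of two snapshots (0.0 to 1.0)."""
    old, new = set(old_keys), set(new_keys)
    if not old and not new:
        return 1.0  # two empty snapshots are trivially identical
    return len(old & new) / len(old | new)


def looks_like_revision(old_keys, new_keys, threshold=0.5):
    """Hypothetical heuristic: mostly-shared keys suggest restated records."""
    return key_overlap(old_keys, new_keys) >= threshold


# Toy keys, invented for illustration.
release_a = ["var-1", "var-2", "var-3"]
release_b = ["var-1", "var-2", "var-3", "var-4"]  # same records plus one arrival
cycle_a = ["resp-001", "resp-002"]
cycle_b = ["resp-101", "resp-102"]                # disjoint cohort

print(key_overlap(release_a, release_b))
print(looks_like_revision(cycle_a, cycle_b))
```

High overlap is necessary but not sufficient: a survey that reuses sequence numbers as respondent IDs would score high while still being a fresh cohort, so the check only helps when the key genuinely identifies the same entity over time.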

What counts: revision, not vintage

The showcase already encodes this distinction the hard way. The accepted targets are all true revisions of a static-or-restated record: VA veterans by sex and age (a Gender column relabeled Sex, every number intact); CDC BRFSS pre-2011 prevalence (a frozen historical series silently word-edited in 2025); the FDA Purple Book and USDA FoodData Central (the same catalog re-released monthly and semi-annually, respectively, with no notes). The targets the team had to rescope are exactly the vintage traps. The HMDA target's first draft compared different reporting years (disjoint cohorts of loans) before being narrowed to a single reporting year restated across its Snapshot → 1-Year → 3-Year releases. The NSCH target is parked because each year is a fresh sample of children, so only the schema (questions appearing or disappearing) is diffable, not the rows.
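The Gender-to-Sex case above is a rename with the values intact, and that shape is easy to detect mechanically. The following is a minimal sketch under stated assumptions: tables are modeled as plain dicts of column name to cell list, the sample values are invented, and `detect_renames` is a hypothetical helper rather than anything binoc exposes.

```python
def detect_renames(old_table, new_table):
    """Pair a dropped column with an added one when their values are identical.

    Tables are dicts mapping column name -> list of cell values (toy model).
    """
    dropped = [c for c in old_table if c not in new_table]
    added = [c for c in new_table if c not in old_table]
    renames = {}
    for old_name in dropped:
        for new_name in added:
            already_used = new_name in renames.values()
            if not already_used and old_table[old_name] == new_table[new_name]:
                renames[old_name] = new_name
                break
    return renames


# Invented sample rows; only the column label differs between snapshots.
old = {"Gender": ["F", "M", "F"], "Age": [34, 61, 47]}
new = {"Sex": ["F", "M", "F"], "Age": [34, 61, 47]}

print(detect_renames(old, new))
```

Exact equality is the strictest possible test; a real comparison would probably need to tolerate reordered rows or a small fraction of edited cells, which this sketch deliberately ignores.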

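This catalog generalizes that lesson. Each entry is tagged with a revision character. The core vocabulary is below; some entries use a field-specific variant (re-fit, reprocessing, re-annotation) where that describes the change more precisely: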

Tag	Meaning	Canonical example
silent value edit	a published number/string is overwritten in place	BEA restates a prior quarter's GDP
reclassification	a record's category/status flips	ClinVar pathogenic → benign
reformat / rename	the value is stable but its label/shape changes	`Gender` → `Sex`; HGNC `BAI1` → `ADGRB1`
restatement / backfill	a whole historical span is re-stated at once	Our World in Data recomputes rolling metrics
retroactive methodology re-release	a new method re-derives the entire back-series	IHME GBD re-models 1990→present each release
accession merge / deprecation	identifiers are merged, retired, or remapped	dbSNP `RsMergeArch`; GO term obsoletion
relabeling	ground-truth labels are corrected on fixed inputs	ImageNet/MNIST label-error fixes
continuous editing	the same entities mutate constantly	Wikidata, OSM, OFAC SDN
versioned silent update	a fixed name points at moving content	Hugging Face `main`; Kaggle "latest"

Legend used in every table below. Verdict — ✅ true-revision · ◑ dual-axis (true-revision on one axis, vintage on another; diff only the right artifacts) · 🚫 vintage trap (listed as a contrast). Acquire — 🟢 turnkey (publisher hosts dated/numbered snapshots, deltas, or Git history) · 🟡 DIY (stable-schema file, but you must capture dated copies yourself / Wayback) · 🔴 hard (gated, paywalled, deleted, or no snapshots).
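The legend maps naturally onto a small data structure, which makes it possible to filter the catalog by verdict and acquirability. This is a sketch of one possible encoding; the class and function names are hypothetical, and the four sample entries copy their markers from the tables in this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    TRUE_REVISION = "✅"
    DUAL_AXIS = "◑"
    VINTAGE_TRAP = "🚫"


class Acquire(Enum):
    TURNKEY = "🟢"
    DIY = "🟡"
    HARD = "🔴"


@dataclass(frozen=True)
class Entry:
    name: str
    verdict: Verdict
    acquire: Optional[Acquire]  # None where the table lists no tier


def showcase_candidates(entries):
    """Names of entries that are both true-revision and turnkey to acquire."""
    return [
        e.name
        for e in entries
        if e.verdict is Verdict.TRUE_REVISION and e.acquire is Acquire.TURNKEY
    ]


catalog = [
    Entry("ClinVar", Verdict.TRUE_REVISION, Acquire.TURNKEY),
    Entry("World Bank WDI", Verdict.TRUE_REVISION, Acquire.DIY),
    Entry("NCI SEER", Verdict.DUAL_AXIS, Acquire.HARD),
    Entry("NHANES", Verdict.VINTAGE_TRAP, None),
]

print(showcase_candidates(catalog))
```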

Public health, epidemiology & medicine

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
CDC BRFSS historical files	silent value edit / rename — named in the Lancet 2025 analysis of 114 federal datasets silently edited Jan–Mar 2025 (106 swapped "gender"→"sex"), with DataRescue mirrors	SAS/XPT, CSV · annual + silent re-release	🟢	◑
IHME Global Burden of Disease	retroactive methodology re-release — each edition re-models 1990→present; vitamin-A deaths went 233k→28k for the same years between GBD 2017 and 2019 (PMC9991746)	CSV, API · per-release	🟢	✅
ClinVar	reclassification — ~6% of variants reclassified; 40% of common pathogenic variants downgraded (Sci Rep); stable variant IDs	VCF/XML · monthly archived	🟢	✅
FDA Purple Book	continuous editing — monthly full-DB re-release tags rows U/N/R but ships no narrative changelog	CSV · monthly	🟢	✅
FDA Orange Book	reclassification — products silently moved to/from the Discontinued section; patent/exclusivity rows edited	`~`-delimited TXT · monthly	🟢	✅
DailyMed	continuous editing — labels superseded daily under a stable SetID; "as-of-date" archive	HL7 SPL XML · daily delta	🟢	✅
RxNorm	accession merge — RXCUIs go active→obsolete→remapped; a Terminology-Status API exists to chase them	RRF · monthly + weekly	🟢	✅
ICD-10-CM	reclassification / rename — FY2026 alone revised 38 existing codes; stable code keys	tabular · annual (Oct 1)	🟢	✅
WHO ICD-11 (MMS)	reclassification — continuous-maintenance model with annual versioned releases; titles/hierarchy edited	API, spreadsheet · annual versioned	🟢	✅
NCI SEER	retroactive recoding — Apr 2021 dropped ~10k 1973–2000 cases from all DBs via a behavior recode (change log)	SEER*Stat, ASCII · annual	🔴	◑
Our World in Data COVID-19	restatement/backfill — source corrections back-fill history; rolling metrics recomputed; every edit is a Git commit	CSV/JSON · daily (Git)	🟢	✅
JHU CSSE COVID-19	restatement/backfill — prior days revised continuously 2020–23; frozen since Mar 2023, full Git history preserved	CSV · frozen	🟢	✅ (historical)
CDC WONDER mortality	restatement — provisional counts "continually revised"; cells <10 suppressed (a count can vanish between pulls)	query/TXT export · weekly→annual	🟡	✅
CMS Care Compare	silent value edit — star ratings/thresholds recomputed each period with no per-provider note; stable CCN	CSV · quarterly	🟢	✅
NHANES	(contrast) — a new independent sample each 2-year cycle; positional diff is noise	XPT, CSV · biennial	—	🚫

Economics, macro, finance & trade

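Statistical agencies revise published history as a matter of routine, and many maintain a real-time ("vintage") archive precisely so the overwritten numbers survive, which makes before/after snapshots unusually easy to obtain. Note that in this archival sense a "vintage" is a dated edition of the same series, which is a revision in this catalog's terms, not a fresh cohort.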

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
ALFRED (Archival FRED)	silent value edit — FRED overwrites series in place; ALFRED keeps every vintage (>206k revisions tracked for Z.1 alone). A two-vintage download service by design	CSV/XLS/API · continuous	🟢	✅ (gold standard)
BEA GDP / NIPA	retroactive benchmark — the 2023 comprehensive update revised GDI back to 1979Q1	CSV/XLS/API · quarterly + 5-yr	🟢	✅
BLS CES payrolls	benchmark restatement — the preliminary 2025 benchmark was −911,000 jobs, restated across months	CSV/XLS/API · monthly + annual	🟢	✅
BLS CPI (seasonally adjusted)	silent value edit — SA factors recomputed each January, restating ~5 years of SA history (NSA unchanged)	CSV/XLS/API · monthly	🟢	✅
Fed Financial Accounts (Z.1)	continuous editing — "all data subject to revision on an ongoing basis"; major revisions flagged each release	CSV (DDP)/API · quarterly	🟢	✅
IMF World Economic Outlook DB	retroactive restatement — each Apr/Oct vintage rewrites 1980→ history; vintages archived to 2007	XLS/SDMX/CSV · biannual	🟢	✅
World Bank WDI	restatement / rebasing — Nigeria's 2014 rebase made 2010–12 GDP 60–75% higher; PPP/ref-year rebased silently	CSV/XLS/API · quarterly + annual	🟡	✅
Penn World Table	retroactive methodology — v10.01 changed the investment deflator, altering 1950–2019 capital/TFP series; v11.0 current	XLSX/Stata · numbered versions	🟢	✅
Maddison Project DB	retroactive methodology — the 2023 update revised long-run GDP-pc for 169 countries	XLSX/Stata · versioned (2020, 2023)	🟢	✅
HMDA national loan-level	restatement — the same year published as Snapshot → 1-Year → 3-Year as late filings/resubmissions arrive	CSV (pipe)/API · staged	🟢	✅
OECD Main Economic Indicators	continuous editing — OECD ships a dedicated "Original Release Data and Revisions" (MEI-ORDR) DB to track first-release vs current	CSV/SDMX/XLS · monthly	🟢	✅
USDA WASDE	continuous editing — prior-month balance-sheet estimates revised in place; USDA hosts a historical-revisions tool	PDF/XML/XLS · monthly	🟢	✅
EIA Petroleum Supply (PSM/PSA)	restatement — the annual PSA revises up to 10 years of production history (use PSM/PSA, not the un-revised weekly WPSR)	CSV/XLS/API · monthly + annual	🟡	✅
UN Comtrade	silent value edit — reporters resubmit revised prior-period trade, overwriting earlier figures; no official vintage archive	CSV/JSON/API · continuous	🔴	✅
Census ACS	(mostly contrast) — year-over-year is fresh sampling and 5-yr windows overlap; only the COVID-era 2020 reweighting is same-period revision	CSV/API · annual	🟡	🚫
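For the restated series in this table, the core comparison is simple: take two dated editions of the same series and list the periods whose published value moved. The sketch below assumes each edition has already been loaded into a dict of period to value; the numbers are invented, and `restated_observations` is a hypothetical helper, not part of any agency API or of binoc.

```python
def restated_observations(earlier, later, tolerance=0.0):
    """Return {period: (old_value, new_value)} for periods whose value moved.

    Periods present in only one edition are ignored: those are additions,
    not restatements.
    """
    changes = {}
    for period in sorted(earlier.keys() & later.keys()):
        old, new = earlier[period], later[period]
        if abs(new - old) > tolerance:
            changes[period] = (old, new)
    return changes


# Invented index values for two editions of one quarterly series.
edition_1 = {"2023Q1": 100.0, "2023Q2": 101.5, "2023Q3": 102.0}
edition_2 = {"2023Q1": 100.0, "2023Q2": 101.9, "2023Q3": 102.4, "2023Q4": 103.1}

for period, (old, new) in restated_observations(edition_1, edition_2).items():
    print(period, old, "->", new)
```

The `tolerance` parameter is there because some publishers re-round values between editions; setting it above zero suppresses changes that are only rounding noise.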

Geospatial, Earth observation, climate & weather

Temperature records carry homogenization adjustments that change past months; satellite archives are reprocessed into new "Collections" that overwrite the science values for already-observed dates; boundary files are re-released with silent geometry edits.

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
NOAA GHCN-Monthly v4	silent value edit — each run re-applies the Pairwise Homogenization Algorithm over the whole record, changing past adjusted months	fixed-width text · monthly	🟢	✅
NASA GISTEMP v4	restatement — a dated update log records concrete corrections (e.g. a 2025-09 fix to a station off by ~12°C); a ready-made oracle	CSV/NetCDF · monthly	🟢	✅
HadCRUT5 / CRUTEM5 / HadSST4	methodology re-release — SST bias corrections restate the whole record between versions	NetCDF/CSV ensembles · ~monthly	🟢	✅
Berkeley Earth	silent value edit — the entire record is re-estimated each monthly run	text/NetCDF · monthly	🟢	✅
ERA5 / ERA5.1	reprocessing — ERA5.1 replaced 2000–2006 to fix a stratospheric cold bias; served as an explicit paired dataset	NetCDF/GRIB · continuous + corrections	🟡	✅
MODIS Collections (C6→C6.1)	collection re-release — recalibration re-derives values for already-observed tiles/dates	HDF-EOS/GeoTIFF · per-collection	🟡	✅
Landsat Collections (C1→C2)	collection re-release — geometry + radiometry re-derived under the same scene IDs	GeoTIFF · per-collection	🟡	✅
Sentinel-2 Collection-1	reprocessing — same scene/date, new values; old baselines were deleted Oct–Nov 2024, so the acquire window is closing	JPEG2000 (SAFE) · campaign	🔴	✅
Satellite GMSL	reprocessing — altimeter retracking restates the 1993→ trend between versions	NetCDF/CSV · versioned	🟡	✅
NOAA nClimGrid-Monthly	restatement — preliminary replaced by final for the same period	NetCDF/text · monthly	🟢	✅
USGS ComCat earthquakes	silent value edit — a stable event ID's preferred magnitude/location is revised from auto → reviewed → ISC reconciliation	GeoJSON/QuakeML/CSV · continuous	🟡	✅
WDPA / Protected Planet	continuous editing — the same WDPAID's boundary geometry is silently replaced; monthly snapshots back to 2017-07	Shapefile/File GDB · monthly	🟢	✅
OpenStreetMap (full-history)	continuous editing — the full-history PBF already contains every version of every node/way/relation	`.osm.pbf`/XML · weekly + full-history	🟢	✅
GeoNames	continuous editing — the same geonameId's coords/population change; daily `modifications-*` deltas shipped	TSV · daily	🟢	✅
EPA AQS (AirData)	silent value edit — old samples altered on audit/reanalysis; a "was certified but data changed" status exists	CSV · continuous + annual cert	🟡	✅
Natural Earth	continuous editing — same features re-shaped between versions, but a Git `CHANGELOG` already exists (weak motivation, good oracle)	Shapefile/GeoJSON · semver	🟢	✅ (weak)
GADM	mixed — major versions re-shape geometries (Kashmir split) but also fold in genuinely new subdivisions	GeoPackage/Shapefile · major versions	🟡	◑
Census TIGER/Line	(mostly contrast) — annual vintages are dominated by legitimately-new boundaries; same-entity geometry shifts are a minority	Shapefile/GeoPackage · annual	🟢	🚫
NOAA Storm Events	(mostly contrast) — NCEI reformats but states it does not change values; revisions are append-only late reports	CSV · monthly	🟢	🚫

Genomics & life-science reference

The richest field by acquirability: nearly all publish dated/numbered releases on open FTP, and several ship their own diff artifact (a built-in answer key). The differentiator is whether already-present entries change (true-revision) or releases mostly bolt on new sequences (vintage).

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
ClinVar	reclassification — clinical significance flips P↔VUS↔B on a stable variant ID; monthly VCFs archived by year	VCF/XML/TSV · monthly	🟢	✅ (flagship)
dbSNP	accession merge — rsIDs merge/deprecate; the `RsMergeArch` table is the publisher's own map	VCF/JSON/flat · per-build	🟡	✅
GRCh38 + patches	mixed — GRCh37→GRCh38 is a true coordinate revision; the p1–p14 patches mostly add fix-/alt-loci without changing main-chromosome bases	FASTA/AGP/BED · ~annual patch	🟢	◑
RefSeq	silent value edit — NM_/NP_ sequences revised with a version-suffix bump (NM_005656.1→.6)	FASTA/GenBank/GFF · bi-monthly	🟢	✅
Ensembl / GENCODE	re-annotation — gene/transcript models revised, stable-ID versions bump, IDs retired/merged	GTF/GFF3/FASTA · ~quarterly	🟢	✅
UniProt / Swiss-Prot	re-annotation + accession merge — sequences corrected; UniSave gives per-entry history	flat/FASTA/XML · 8-weekly	🟢	✅
Pfam	accession deprecation — families "killed"/merged into clans; `dead_families` list shipped	Stockholm HMM · numbered	🟢	✅
InterPro	restatement — member-DB signatures re-integrated; entries change	XML/TSV · 8-weekly	🟢	✅
PDB (wwPDB)	re-refinement — entries re-versioned; the 2007 remediation + a 2022–23 268-entry re-release transform coordinates	PDB/mmCIF · weekly + campaigns	🟢	✅
NCBI Taxonomy	merge + rename — taxids merged to secondary; names/ranks change; the taxid-changelog tool is an oracle	taxdump (flat) · ~daily	🟢	✅
GTDB	reclassification — organisms renamed/moved across releases (e.g. Shigella folded into E. coli)	TSV/FASTA/trees · numbered	🟢	✅
Gene Ontology	accession deprecation — ~4,173 terms obsoleted in 3 years; the `go-ontology-changes` file is a ready answer key	OBO/OWL/GAF · monthly	🟢	✅
HGNC	rename — official gene-symbol changes (`BAI1`→`ADGRB1`); a `prev_symbol` field is built in	TSV/JSON · continuous	🟢	✅
miRBase	rename/renumber — miRNAs renamed (miR-422b→miR-378) and re-bounded; ships `miRNA.diff` + `miRNA.dead`	FASTA/EMBL/GFF · numbered	🟢	✅ (notorious)
gnomAD	(mostly contrast) — cross-version AF changes are driven by new samples, not reprocessing the same variants	VCF/Hail/TSV · major versions	🟢	◑
OMIM	continuous editing — entries/allelic-variant classifications edited nightly under stable MIM numbers; registration-gated, no clean FTP archive	flat/API · continuous	🔴	✅
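A built-in answer key such as HGNC's `prev_symbol` field can be turned into explicit rename pairs with a few lines of parsing. The sketch below assumes a tab-separated file with `symbol` and `prev_symbol` columns and pipe-separated multiple values; that layout is an assumption to verify against the actual download. `GENEX` is a made-up row with no previous symbol.

```python
import csv
import io


def rename_pairs(tsv_text):
    """Return (previous_symbol, current_symbol) pairs from an HGNC-style TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = []
    for row in reader:
        # An empty prev_symbol cell splits to [""], which filter() drops.
        for prev in filter(None, row["prev_symbol"].split("|")):
            pairs.append((prev, row["symbol"]))
    return pairs


# Two-row sample; the column layout and the GENEX row are assumptions.
SAMPLE = "symbol\tprev_symbol\nADGRB1\tBAI1\nGENEX\t\n"

print(rename_pairs(SAMPLE))
```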

Physical-science reference (chemistry, materials, physics, astronomy)

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
CODATA fundamental constants	re-fit — the 2022 adjustment moved α by 4.5× its 2018 uncertainty, shifting 15 dependent constants; archived ASCII per adjustment	ASCII/HTML · ~4-yearly	🟢	✅
IUPAC standard atomic weights	value edit + reclassification — argon went from 39.948±0.001 to the interval [39.792, 39.963] in 2021	HTML/PDF · biennial-ish	🟢	✅
Particle Data Group RPP	re-fit — the neutron-lifetime world average drifted 885.7 s (≤2010) → 878.6 s (2026) on a stable node; machine-readable `mass_width` files per year	web/PDF/CSV · annual/biennial	🟢	✅
HITRAN	methodology re-release — HITRAN2020 completely replaced the CO₂ line list for all 12 isotopologues vs 2016	`.par` fixed-width · major editions	🟡	✅
Materials Project	recompute — v2021.05.13 silently changed formation energies for many existing `mp-id`s via a new correction scheme	JSON/API/dumps · dated versions	🟡	✅
NASA Exoplanet Archive	reclassification — re-selecting a planet's "default parameter set" changes its headline mass/radius/period	CSV/VOTable/TAP · weekly	🟡	◑
Gaia data releases	re-derivation — the same source_id's astrometry/photometry is re-derived (EDR3→DR3 photometry correction folded in)	VOTable/FITS/TAP · major DR + errata	🟡	◑
ChEMBL	re-curation — ChEMBL_33 re-annotated ~250k existing activities; full dumps kept indefinitely	DB dumps/RDF/SDF · numbered	🟢	✅
DrugBank	corrections — invalid structures/FASTA headers fixed under a stable DBID; academic license required	XML/CSV/SDF · semver	🟡	✅
Crystallography Open Database	continuous editing — every CIF is under SVN; each correction is a new revision (already version-controlled; binoc adds the human summary)	CIF/MySQL/SVN · continuous	🟢	✅
NIST Atomic Spectra DB	re-compilation — energy levels/wavelengths revised across versions; but only the current version is served	web/ASCII export · numbered	🔴	✅
Minor Planet Center MPCORB	re-derivation — a designation's orbital elements re-fit daily as observations arrive; MPC hosts no archive of past dailies	fixed-width/JSON/SQLite · daily	🟡	✅
PubChem Compound	re-standardization — existing CIDs re-canonicalized as the structure pipeline re-runs; dominated by appends	SDF/XML/ASN.1 · rolling + monthly dump	🟡	◑
SIMBAD / VizieR (CDS)	continuous editing — SIMBAD revises an object's coords/cross-IDs as literature is folded in; no dated dumps	TAP/VOTable/ASCII · continuous	🔴	◑
NIST-JANAF tables	(contrast) — historically revised across editions but frozen since 1998; no live before/after	HTML/PDF · frozen	—	🚫

Standards, identifier registries & knowledge bases

Chosen because these domains revise in place by construction — there is no append-only trap here. Many ship a publisher-authored changelog that doubles as ground truth.

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
IANA tz database	retroactive correction — 2025a corrected Philippine offsets before 1900 & 1937–90; 2024b corrected Mexico 1921–1997. The `NEWS` file is a built-in answer key	text source · ~3–6/yr	🟢	✅ (flagship)
OFAC SDN list	value edit + delisting — entities added, removed, and silently edited (aliases, passport numbers) with no per-record changelog	XML/CSV/PIP · ~daily	🟢	✅ (flagship)
GLEIF LEI	value edit + status — legal names/addresses revised; status ISSUED→LAPSED→RETIRED; daily delta files shipped	XML/CSV/JSON · daily ×3	🟢	✅
CVE / NVD	rescore + reclassification — CVSS scores revised, descriptions edited, records flipped to REJECT; per-CVE change history exposed	JSON (CVE 5.0) · hourly	🟢	✅
Unicode CLDR	value edit — a locale's translations/number/date formats change between tagged releases	XML (LDML)/JSON · ~2/yr	🟢	✅
MITRE ATT&CK	reclassification + revoke — techniques revoked/merged/renamed (T1574.002 → renamed T1574.001); official detailed changelog	STIX 2.1 JSON · ~2/yr	🟢	✅
ISO 3166 country codes	reassignment + rename — names change (Turkey→Türkiye, Macedonia→North Macedonia); official DB paywalled, GitHub mirrors carry history	DB/newsletters · ad hoc	🟡	✅
ISO 4217 currency codes	reassignment + delisting — numbered amendments retire/replace codes	XML/PDF · per amendment	🟢	✅
OurAirports	reassignment — a persistent integer ID survives an airport code change or a status→closed; full Git history	CSV · nightly (Git)	🟢	✅
Wikidata	continuous editing — the same entity's statements change (and get vandalized + reverted); weekly JSON dumps	JSON/RDF · weekly + live	🟢	✅
MusicBrainz	continuous editing + merges — MBID redirects record merges; twice-weekly dumps	PostgreSQL/JSON · ~2×/wk	🟢	✅
Public Suffix List	edit + delisting — rules edited/removed under a single file; full Git history	`.dat` · a few/wk	🟢	✅
IEEE MAC OUI registry	reassignment/rename — org names change on M&A under a fixed prefix	TXT/CSV · ~daily	🟡	✅
IANA Root Zone DB	reassignment + delisting — registry-operator changes, ccTLD retirements	HTML/`root.zone` · continuous	🟡	✅
DBpedia	continuous editing — re-extracted from Wikipedia each release, so the same entity's facts shift	RDF/TTL · periodic	🟢	✅

Government, legal, civic & ML benchmarks

Dataset (publisher)	Revises	Format · cadence	Acq.	Verdict
eCFR	continuous editing — the same regulation text is amended in place; the live "current XML" overwrites with no section-level diff exposed	XML/JSON/PDF · daily	🟢	✅ (flagship)
US Code (OLRC)	continuous editing + reclassification — sections renumbered in place; OLRC ships editorial-reclassification tables as an answer key	USLM XML · release points	🟢	✅ (flagship)
SEC EDGAR Financial Statement Sets	restatement — a fiscal period is refiled (10-K/A) with restated figures; EDGAR keeps the original and every amendment forever	TSV/ZIP (XBRL) · quarterly	🟢	✅ (flagship)
labelerrors.com corrected sets	relabeling — given vs. corrected labels keyed to original indices across 10 benchmarks; ≥6% of the ImageNet val set, 2,916 val errors (arXiv). A pre-built gold diff	JSON/CSV overlay · one-shot	🟢	✅ (ML flagship)
ImageNet + ReaL/ReLabel	relabeling — same images, single → corrected/multi-label; 30–34% of images have multiple valid labels	label files · multiple relabelings	🟢	✅
MS COCO	relabeling — ~273k annotation errors found; MJ-COCO-2025 is a corrected re-release sharing image IDs	JSON annotations · patched + forks	🟢	✅
Hugging Face Hub datasets	versioned silent update — a dataset is a Git repo; `main` advances and `load_dataset` pulls new content unless `revision=` is pinned	Parquet/Arrow/CSV · per-commit	🟢	✅
CourtListener / Free Law	silent value edit — opinions corrected/withdrawn/superseded; text re-OCR'd over time, under a stable cluster/opinion ID	JSON bulk/API · rolling	🟢	✅
FEC filings	restatement — amendments (F3 amend-1, -2…) supersede the original for the same committee/period	`.FEC`/CSV/JSON · nightly	🟢	✅
MIT Election Lab returns	silent value edit — parallel "unofficial" and "official/certified" repos hold the same contest's revised totals	CSV · per cycle (Git)	🟢	✅
USPTO Patent Assignment	reclassification — assignee/role disambiguation re-resolved across annual editions for the same patents	CSV bulk · annual	🟡	✅
Congressional bill text	reformat/version progression — Introduced→Engrossed→Enrolled under one bill ID (but versions are labeled, so partly already changelogged)	USLM XML · per stage	🟢	◑
LAION-5B → Re-LAION-5B	re-release with deletions — Re-LAION removed 2,236 links (a safety scrub) under a refreshed identity	Parquet index · re-release	🟡	✅ (deletion-only)
Kaggle datasets	versioned silent update — immutable numbered versions under one slug; consumers pull "latest"; API resists fetching prior versions	any · per-version	🟡	✅
USAspending.gov	(mostly contrast) — mods are reported as new records by design; only the "Correction Delete Indicator = D" path is true revision	CSV/ZIP/API · quarterly	🟢	🚫
Common Crawl / C4	(mostly contrast) — each monthly crawl is a fresh web cohort; only C4's changing cleaning heuristics are a minor revision angle	WARC/WET · monthly	🟢	🚫

Cross-cutting findings

1. Acquirability is the gating constraint, and it sorts cleanly

The revision behavior is nearly universal; the ability to get two comparable snapshots is what separates a buildable target from a research curiosity. Three tiers recur across every field:

🟢 Turnkey: the publisher hosts the history. This is either a purpose-built vintage store (ALFRED, which also covers BEA/BLS/Fed series; IMF WEO), numbered or dated releases on open FTP (the entire genomics column; ChEMBL; CODATA; PDG), a monthly or daily full re-release (FDA Purple Book, GLEIF, OFAC, WDPA), or a retained full history, in Git or an equivalent archive (OSM full-history, Wikidata dumps, Our World in Data, Hugging Face, OurAirports, US Code release points, SEC EDGAR). Most of the catalog falls in this tier, and it is where showcase targets should come from.
🟡 DIY: stable schema, but you must capture dated copies yourself. The data is a clean keyed file, but the publisher serves only "current" or keeps old editions out of easy reach (NASA Exoplanet Archive TAP, MPCORB dailies, Materials Project old versions, World Bank old editions; ISO 3166, where the official DB is paywalled but GitHub mirrors carry history). Snapshots come from your own scheduled pulls or the Wayback Machine. binoc works fine; the collector carries the burden.
🔴 Hard: gated, paywalled, deleted, or served with no snapshots at all. SEER (data-use agreement), OMIM (registration), Sentinel-2 old baselines (actively deleted Oct–Nov 2024), CSD/ICSD (commercial), and the current-only services with no official archive (UN Comtrade, NIST ASD, SIMBAD). Worth naming; not first targets.

2. A surprising number of publishers ship their own answer key

The most useful pattern for testing binoc: many of these datasets revise silently in the payload but ship a separate, machine-readable record of what changed. That artifact is a ground-truth oracle — run binoc on two snapshots, then check its generated changelog against the publisher's:

Dataset	Publisher-shipped answer key
IANA tz database	the `NEWS` file (per-release retroactive corrections)
Gene Ontology	`go-ontology-changes`
miRBase	`miRNA.diff` / `miRNA.dead`
HGNC	`prev_symbol` column
NCBI Taxonomy / dbSNP	merged-id lists / `RsMergeArch`
NVD	per-CVE change history + "Last Modified"
GeoNames	daily `modifications-YYYY-MM-DD` deltas
MITRE ATT&CK	the detailed version-to-version changelog
US Code (OLRC)	editorial-reclassification tables
labelerrors.com / MJ-COCO	the corrected-label overlay itself
GISTEMP	the dated "Updates to Analysis" log
Natural Earth, COD	Git `CHANGELOG` (Natural Earth); SVN revision history (COD)

These should be the first datasets used to build binoc's quality regression: the desired output already exists in structured form.
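Grading a generated changelog against a publisher's answer key reduces to set comparison once both are expressed as the same kind of item, for example (old, new) rename pairs. The sketch below is one possible scoring function, not binoc's evaluation harness; `grade` is a hypothetical name, the toy key mixes two renames quoted in the tables above, and the `FOO1`/`BAR1` pair is invented to stand in for a false positive.

```python
def grade(reported, answer_key):
    """Score reported changes against a publisher-shipped answer key."""
    reported, answer_key = set(reported), set(answer_key)
    hits = reported & answer_key
    return {
        "precision": len(hits) / len(reported) if reported else 1.0,
        "recall": len(hits) / len(answer_key) if answer_key else 1.0,
        "missed": sorted(answer_key - reported),     # in the key, not reported
        "spurious": sorted(reported - answer_key),   # reported, not in the key
    }


# Toy answer key and toy tool output, for illustration only.
answer_key = [("BAI1", "ADGRB1"), ("miR-422b", "miR-378")]
reported = [("BAI1", "ADGRB1"), ("FOO1", "BAR1")]

result = grade(reported, answer_key)
print(result)
```

Because the publisher's key is described above as partial, a "spurious" item is not automatically a tool error; it may be a real revision the publisher never documented, so that bucket deserves manual review rather than a blanket penalty.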

3. The dual-axis trap is the recurring curation hazard

Several of the most cited datasets are revisions on one axis and vintages on another, and a naive run diffs the wrong one. The discipline, in every case, is to diff the right artifact and join on the stable key:

BRFSS / SEER: the annual new-cohort axis is a vintage trap; the silently re-released historical files (BRFSS) and the retroactively recoded full series (SEER) are the true-revision targets.
Gaia / NASA Exoplanet Archive / ChEMBL / PubChem / MPCORB: all also append new objects heavily. Diff on the stable key (source_id for Gaia, CID for PubChem, designation for MPCORB; mp-id plays the same role for Materials Project) and ignore pure additions, which is exactly the add-vs-revise distinction binoc's correspondence engine must surface.
gnomAD / GRCh38 patches: cross-version change is dominated by new samples / added alt-loci, not in-place revision of existing entries — scope tightly or skip.
ACS / TIGER / NHANES / USAspending / Common Crawl: predominantly vintage; include only as contrast, or restrict to the narrow true-revision sliver (ACS 2020 reweighting; the USAspending "D" correction path).
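The add-vs-revise distinction in the list above can be sketched as a join on the stable key that sorts every key into one of four buckets. This is an illustration under stated assumptions, not binoc's correspondence engine: snapshots are modeled as dicts keyed by the stable identifier, and the object names and period values are invented.

```python
def classify(old, new):
    """Sort every stable key into added / removed / revised / unchanged."""
    buckets = {"added": [], "removed": [], "revised": [], "unchanged": []}
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            buckets["added"].append(key)      # pure addition, not a revision
        elif key not in new:
            buckets["removed"].append(key)
        elif old[key] != new[key]:
            buckets["revised"].append(key)    # same entity, changed record
        else:
            buckets["unchanged"].append(key)
    return buckets


# Invented records keyed by a stable identifier.
old = {
    "obj-1": {"period_days": 3.52},
    "obj-2": {"period_days": 11.2},
    "obj-3": {"period_days": 88.0},
}
new = {
    "obj-1": {"period_days": 3.52},
    "obj-2": {"period_days": 11.4},
    "obj-4": {"period_days": 1.3},
}

print(classify(old, new))
```

On an append-heavy dataset the "added" bucket will dwarf the others; reporting only "revised" and "removed" is what keeps the resulting changelog about revisions rather than growth.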

4. Format coverage: what pointing binoc here actually demands

The catalog stress-tests the rule families well beyond the CSV/ZIP showcase. By frequency, the formats a "top-100" run must handle: delimited tabular (CSV/TSV, the bulk), then bioinformatics flat formats (VCF, GFF/GTF, FASTA, GenBank, mmCIF/PDB, OBO/OWL, Stockholm) — a large, underserved cluster; XBRL-derived TSV (EDGAR); fixed-width scientific text (GHCN, PDG mass_width, MPCORB, HITRAN .par, CODATA ASCII); gridded binary (NetCDF/HDF/GRIB for climate and reanalysis); geospatial vector (shapefile, GeoJSON, GeoPackage, .osm.pbf); structured documents (USLM/legislative XML, STIX 2.1 JSON, HL7 SPL XML, LDML); and versioned columnar (Parquet/Arrow on the Hugging Face Hub). This intersects the data.gov format landscape findings: XML and geospatial vector are the largest gaps; the genomics flat formats and gridded scientific binary are the largest new demand surfaced by this revision lens specifically. None is out of architectural scope; several (VCF, GTF, NetCDF, mmCIF) would each unlock a whole field's worth of turnkey, answer-key-bearing targets.

The strongest targets, and how they extend the showcase

The existing showcase leads with US-government tabular CSVs. This catalog says the highest-signal additions, ranked by signal × acquirability × a documented incident, are:

IANA tz database — the platonic case: a tiny text dataset that retroactively rewrites the past, ships its own NEWS answer key, and has full Git history. The clearest possible "they edited history and you'd never know" story.
ClinVar — stable keys, monthly dated archives, a quantified reclassification rate, and life-or-death stakes. The flagship for a scientific audience.
US Code release points / eCFR — "the same law, silently amended, no diff shipped," in clean USLM XML, with OLRC's reclassification tables as ground truth. The flagship for a legal audience.
SEC EDGAR restatements — a 10-K vs its 10-K/A: same period, restated numbers, real financial stakes, everything retained forever.
OFAC SDN / GLEIF — daily-cadence registries that edit and delist entities with no per-record note; GLEIF even ships delta files.
NOAA GHCN / NASA GISTEMP — the politically loaded climate case: past months' temperatures change with each homogenization run, and GISTEMP's update log is the oracle.
labelerrors.com / ImageNet relabeling — the ML-audience hook: a pre-built before/after over the most famous benchmarks in the field.

Each is a turnkey acquire, each has a citable incident, and several ship the answer key binoc's output can be graded against. They are the natural next wave of showcase targets once the formats they need (text/tz source, VCF, USLM XML, XBRL TSV) are in reach — and they extend the showcase's reach from "US open-data CSVs" to law, science, finance, and machine learning, which is where the "I want changelogs like that for my data" reaction is most likely to land.

Next steps

Promote the turnkey, tabular-or-text true-revision targets into the showcase pipeline first (the FDA Purple Book is already in; add tz, OFAC, and GLEIF, then ClinVar once VCF lands and US Code/eCFR once USLM XML lands).
Build the quality regression on the answer-key datasets (§2): they give a structured target to diff binoc's generated changelog against, which the current showcase (verbatim-output-only) lacks.
Treat the dual-axis datasets as correspondence-engine tests (§3): they are the cleanest real-world exercises of add-vs-revise discrimination on a stable key.
Use the format demand (§4) to prioritize parse rules: VCF, GFF/GTF, USLM/legislative XML, and NetCDF each convert a whole field from "interesting but unbuildable" to a stack of turnkey targets.