Revision-prone datasets: a cross-field catalog¶
Research note, not normative documentation
This is a background survey of the datasets binoc is most likely to be pointed at, not a description of how binoc currently behaves or a commitment to support any of them. It is a curation aid for the showcase pipeline and a coverage check for the rule families. Roughly 110 datasets across ten fields were web-verified against primary sources in June 2026; load-bearing claims carry a link to the source. Maintenance status, release URLs, and acquirability drift — re-verify before building a target on any single entry. Poke holes in it.
binoc generates a changelog by diffing two snapshots of the same logical
dataset. So the datasets that matter are not the ones that change — almost
everything changes — but the ones that silently overwrite already-published
records: a value edited, a category reassigned, a column renamed, a
historical series restated, a label corrected, a boundary redrawn, with no
per-record changelog shipped. The single most important distinction in this
whole survey is revision vs. vintage. A revision changes a record that was
already there (BEA restates 2015 GDP; ClinVar flips a variant from pathogenic
to benign; the tz database corrects Mexico's 1921 offset). A vintage is a fresh
cohort wearing last year's name (NHANES samples new people each cycle; the ACS
1-year file is a different population). Diff a revision and binoc produces a
true changelog; diff a vintage and it produces a wall of meaningless churn. The
headline finding: in every field, authoritative reference data revises in
place and ships no changelog — and a surprising fraction of publishers ship a
partial, machine-readable answer key for those revisions (a NEWS file, a
prev_symbol column, a delisting table), which doubles as a ground-truth
oracle for evaluating binoc's output.
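The revision-versus-vintage distinction can be approximated mechanically by checking how many record keys two snapshots share. The sketch below is an illustration only, not a description of how binoc decides anything: the function names, the 0.5 threshold, and the sample keys are all assumptions made for the example.

```python
def key_overlap(old_keys, new_keys):
    """Jaccard overlap between the record keys of two snapshots (0.0 to 1.0)."""
    old, new = set(old_keys), set(new_keys)
    if not old and not new:
        return 1.0  # two empty snapshots trivially describe the same records
    return len(old & new) / len(old | new)


def looks_like_revision(old_keys, new_keys, threshold=0.5):
    """Heuristic only: high key overlap suggests the same records were republished."""
    return key_overlap(old_keys, new_keys) >= threshold


# Illustrative keys, not real identifiers.
revision_old = ["var-1", "var-2", "var-3"]
revision_new = ["var-1", "var-2", "var-3", "var-4"]     # same records plus one addition
vintage_old = ["resp-2019-001", "resp-2019-002"]
vintage_new = ["resp-2021-001", "resp-2021-002"]        # a fresh sample, no shared keys

print(key_overlap(revision_old, revision_new))          # 0.75
print(looks_like_revision(revision_old, revision_new))  # True
print(looks_like_revision(vintage_old, vintage_new))    # False
```

A near-zero overlap is a strong hint that the pair is a vintage and that a row-level diff would be churn; a high overlap says nothing about whether values actually changed, only that a keyed diff is meaningful.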
What counts: revision, not vintage¶
The showcase already encodes this distinction the hard way. The accepted
targets are all true revisions of a static-or-restated record:
VA veterans by sex and age
(a Gender column relabeled Sex, every number intact);
CDC BRFSS pre-2011 prevalence
(a frozen historical series silently word-edited in 2025); the
FDA Purple Book and
USDA FoodData Central (the same
catalog re-released monthly / semi-annually with no notes). And the targets the
team had to rescope are exactly the vintage traps: the HMDA target's first
draft compared different reporting years (disjoint cohorts of loans) before
being rescoped to the same year restated across snapshot → one-year →
three-year vintages; the NSCH target is parked because year-over-year is a fresh
sample of children, so only the schema (questions appearing/disappearing) is
diffable, not the rows.
This catalog generalizes that lesson. Each entry is tagged with a revision character — the vocabulary the rest of the document uses:
| Tag | Meaning | Canonical example |
|---|---|---|
| silent value edit | a published number/string is overwritten in place | BEA restates a prior quarter's GDP |
| reclassification | a record's category/status flips | ClinVar pathogenic → benign |
| reformat / rename | the value is stable but its label/shape changes | Gender → Sex; HGNC BAI1 → ADGRB1 |
| restatement / backfill | a whole historical span is re-stated at once | Our World in Data recomputes rolling metrics |
| retroactive methodology re-release | a new method re-derives the entire back-series | IHME GBD re-models 1990→present each release |
| accession merge / deprecation | identifiers are merged, retired, or remapped | dbSNP RsMergeArch; GO term obsoletion |
| relabeling | ground-truth labels are corrected on fixed inputs | ImageNet/MNIST label-error fixes |
| continuous editing | the same entities mutate constantly | Wikidata, OSM, OFAC SDN |
| versioned silent update | a fixed name points at moving content | Hugging Face main; Kaggle "latest" |
Legend used in every table below. Verdict — ✅ true-revision · ◑ dual-axis (true-revision on one axis, vintage on another; diff only the right artifacts) · 🚫 vintage trap (listed as a contrast). Acquire — 🟢 turnkey (publisher hosts dated/numbered snapshots, deltas, or Git history) · 🟡 DIY (stable-schema file, but you must capture dated copies yourself / Wayback) · 🔴 hard (gated, paywalled, deleted, or no snapshots).
Public health, epidemiology & medicine¶
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| CDC BRFSS historical files | silent value edit / rename — named in the Lancet 2025 analysis of 114 federal datasets silently edited Jan–Mar 2025 (106 swapped "gender"→"sex"), with DataRescue mirrors | SAS/XPT, CSV · annual + silent re-release | 🟢 | ◑ |
| IHME Global Burden of Disease | retroactive methodology re-release — each edition re-models 1990→present; vitamin-A deaths went 233k→28k for the same years between GBD 2017 and 2019 (PMC9991746) | CSV, API · per-release | 🟢 | ✅ |
| ClinVar | reclassification — ~6% of variants reclassified; 40% of common pathogenic variants downgraded (Sci Rep); stable variant IDs | VCF/XML · monthly archived | 🟢 | ✅ |
| FDA Purple Book | continuous editing — monthly full-DB re-release tags rows U/N/R but ships no narrative changelog | CSV · monthly | 🟢 | ✅ |
| FDA Orange Book | reclassification — products silently moved to/from the Discontinued section; patent/exclusivity rows edited | ~-delimited TXT · monthly | 🟢 | ✅ |
| DailyMed | continuous editing — labels superseded daily under a stable SetID; "as-of-date" archive | HL7 SPL XML · daily delta | 🟢 | ✅ |
| RxNorm | accession merge — RXCUIs go active→obsolete→remapped; a Terminology-Status API exists to chase them | RRF · monthly + weekly | 🟢 | ✅ |
| ICD-10-CM | reclassification / rename — FY2026 alone revised 38 existing codes; stable code keys | tabular · annual (Oct 1) | 🟢 | ✅ |
| WHO ICD-11 (MMS) | reclassification — continuous-maintenance model with annual versioned releases; titles/hierarchy edited | API, spreadsheet · annual versioned | 🟢 | ✅ |
| NCI SEER | retroactive recoding — Apr 2021 dropped ~10k 1973–2000 cases from all DBs via a behavior recode (change log) | SEER*Stat, ASCII · annual | 🔴 | ◑ |
| Our World in Data COVID-19 | restatement/backfill — source corrections back-fill history; rolling metrics recomputed; every edit is a Git commit | CSV/JSON · daily (Git) | 🟢 | ✅ |
| JHU CSSE COVID-19 | restatement/backfill — prior days revised continuously 2020–23; frozen since Mar 2023, full Git history preserved | CSV · frozen | 🟢 | ✅ (historical) |
| CDC WONDER mortality | restatement — provisional counts "continually revised"; cells <10 suppressed (a count can vanish between pulls) | query/TXT export · weekly→annual | 🟡 | ✅ |
| CMS Care Compare | silent value edit — star ratings/thresholds recomputed each period with no per-provider note; stable CCN | CSV · quarterly | 🟢 | ✅ |
| NHANES | (contrast) — a new independent sample each 2-year cycle; positional diff is noise | XPT, CSV · biennial | — | 🚫 |
Economics, macro, finance & trade¶
Statistical agencies revise published history as a matter of routine, and many maintain a real-time / vintage archive precisely so the overwritten numbers survive — which makes before/after snapshots unusually easy.
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| ALFRED (Archival FRED) | silent value edit — FRED overwrites series in place; ALFRED keeps every vintage (>206k revisions tracked for Z.1 alone). A two-vintage download service by design | CSV/XLS/API · continuous | 🟢 | ✅ (gold standard) |
| BEA GDP / NIPA | retroactive benchmark — the 2023 comprehensive update revised GDI back to 1979Q1 | CSV/XLS/API · quarterly + 5-yr | 🟢 | ✅ |
| BLS CES payrolls | benchmark restatement — the preliminary 2025 benchmark was −911,000 jobs, restated across months | CSV/XLS/API · monthly + annual | 🟢 | ✅ |
| BLS CPI (seasonally adjusted) | silent value edit — SA factors recomputed each January, restating ~5 years of SA history (NSA unchanged) | CSV/XLS/API · monthly | 🟢 | ✅ |
| Fed Financial Accounts (Z.1) | continuous editing — "all data subject to revision on an ongoing basis"; major revisions flagged each release | CSV (DDP)/API · quarterly | 🟢 | ✅ |
| IMF World Economic Outlook DB | retroactive restatement — each Apr/Oct vintage rewrites 1980→ history; vintages archived to 2007 | XLS/SDMX/CSV · biannual | 🟢 | ✅ |
| World Bank WDI | restatement / rebasing — Nigeria's 2014 rebase made 2010–12 GDP 60–75% higher; PPP/ref-year rebased silently | CSV/XLS/API · quarterly + annual | 🟡 | ✅ |
| Penn World Table | retroactive methodology — v10.01 changed the investment deflator, altering 1950–2019 capital/TFP series; v11.0 current | XLSX/Stata · numbered versions | 🟢 | ✅ |
| Maddison Project DB | retroactive methodology — the 2023 update revised long-run GDP-pc for 169 countries | XLSX/Stata · versioned (2020, 2023) | 🟢 | ✅ |
| HMDA national loan-level | restatement — the same year published as Snapshot → 1-Year → 3-Year as late filings/resubmissions arrive | CSV (pipe)/API · staged | 🟢 | ✅ |
| OECD Main Economic Indicators | continuous editing — OECD ships a dedicated "Original Release Data and Revisions" (MEI-ORDR) DB to track first-release vs current | CSV/SDMX/XLS · monthly | 🟢 | ✅ |
| USDA WASDE | continuous editing — prior-month balance-sheet estimates revised in place; USDA hosts a historical-revisions tool | PDF/XML/XLS · monthly | 🟢 | ✅ |
| EIA Petroleum Supply (PSM/PSA) | restatement — the annual PSA revises up to 10 years of production history (use PSM/PSA, not the un-revised weekly WPSR) | CSV/XLS/API · monthly + annual | 🟡 | ✅ |
| UN Comtrade | silent value edit — reporters resubmit revised prior-period trade, overwriting earlier figures; no official vintage archive | CSV/JSON/API · continuous | 🔴 | ✅ |
| Census ACS | (mostly contrast) — year-over-year is fresh sampling and 5-yr windows overlap; only the COVID-era 2020 reweighting is same-period revision | CSV/API · annual | 🟡 | 🚫 |
Geospatial, Earth observation, climate & weather¶
Temperature records carry homogenization adjustments that change past months; satellite archives are reprocessed into new "Collections" that overwrite the science values for already-observed dates; boundary files are re-released with silent geometry edits.
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| NOAA GHCN-Monthly v4 | silent value edit — each run re-applies the Pairwise Homogenization Algorithm over the whole record, changing past adjusted months | fixed-width text · monthly | 🟢 | ✅ |
| NASA GISTEMP v4 | restatement — a dated update log records concrete corrections (e.g. a 2025-09 fix to a station off by ~12°C); a ready-made oracle | CSV/NetCDF · monthly | 🟢 | ✅ |
| HadCRUT5 / CRUTEM5 / HadSST4 | methodology re-release — SST bias corrections restate the whole record between versions | NetCDF/CSV ensembles · ~monthly | 🟢 | ✅ |
| Berkeley Earth | silent value edit — the entire record is re-estimated each monthly run | text/NetCDF · monthly | 🟢 | ✅ |
| ERA5 / ERA5.1 | reprocessing — ERA5.1 replaced 2000–2006 to fix a stratospheric cold bias; served as an explicit paired dataset | NetCDF/GRIB · continuous + corrections | 🟡 | ✅ |
| MODIS Collections (C6→C6.1) | collection re-release — recalibration re-derives values for already-observed tiles/dates | HDF-EOS/GeoTIFF · per-collection | 🟡 | ✅ |
| Landsat Collections (C1→C2) | collection re-release — geometry + radiometry re-derived under the same scene IDs | GeoTIFF · per-collection | 🟡 | ✅ |
| Sentinel-2 Collection-1 | reprocessing — same scene/date, new values; old baselines were deleted Oct–Nov 2024, so the acquire window is closing | JPEG2000 (SAFE) · campaign | 🔴 | ✅ |
| Satellite GMSL | reprocessing — altimeter retracking restates the 1993→ trend between versions | NetCDF/CSV · versioned | 🟡 | ✅ |
| NOAA nClimGrid-Monthly | restatement — preliminary replaced by final for the same period | NetCDF/text · monthly | 🟢 | ✅ |
| USGS ComCat earthquakes | silent value edit — a stable event ID's preferred magnitude/location is revised from auto → reviewed → ISC reconciliation | GeoJSON/QuakeML/CSV · continuous | 🟡 | ✅ |
| WDPA / Protected Planet | continuous editing — the same WDPAID's boundary geometry is silently replaced; monthly snapshots back to 2017-07 | Shapefile/File GDB · monthly | 🟢 | ✅ |
| OpenStreetMap (full-history) | continuous editing — the full-history PBF already contains every version of every node/way/relation | .osm.pbf/XML · weekly + full-history | 🟢 | ✅ |
| GeoNames | continuous editing — the same geonameId's coords/population change; daily modifications-* deltas shipped | TSV · daily | 🟢 | ✅ |
| EPA AQS (AirData) | silent value edit — old samples altered on audit/reanalysis; a "was certified but data changed" status exists | CSV · continuous + annual cert | 🟡 | ✅ |
| Natural Earth | continuous editing — same features re-shaped between versions, but a Git CHANGELOG already exists (weak motivation, good oracle) | Shapefile/GeoJSON · semver | 🟢 | ✅ (weak) |
| GADM | mixed — major versions re-shape geometries (Kashmir split) but also fold in genuinely new subdivisions | GeoPackage/Shapefile · major versions | 🟡 | ◑ |
| Census TIGER/Line | (mostly contrast) — annual vintages are dominated by legitimately-new boundaries; same-entity geometry shifts are a minority | Shapefile/GeoPackage · annual | 🟢 | 🚫 |
| NOAA Storm Events | (mostly contrast) — NCEI reformats but states it does not change values; revisions are append-only late reports | CSV · monthly | 🟢 | 🚫 |
Genomics & life-science reference¶
The richest field by acquirability: nearly all publish dated/numbered releases on open FTP, and several ship their own diff artifact (a built-in answer key). The differentiator is whether already-present entries change (true-revision) or releases mostly bolt on new sequences (vintage).
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| ClinVar | reclassification — clinical significance flips P↔VUS↔B on a stable variant ID; monthly VCFs archived by year | VCF/XML/TSV · monthly | 🟢 | ✅ (flagship) |
| dbSNP | accession merge — rsIDs merge/deprecate; the RsMergeArch table is the publisher's own map | VCF/JSON/flat · per-build | 🟡 | ✅ |
| GRCh38 + patches | mixed — GRCh37→GRCh38 is a true coordinate revision; the p1–p14 patches mostly add fix-/alt-loci without changing main-chromosome bases | FASTA/AGP/BED · ~annual patch | 🟢 | ◑ |
| RefSeq | silent value edit — NM_/NP_ sequences revised with a version-suffix bump (NM_005656.1→.6) | FASTA/GenBank/GFF · bi-monthly | 🟢 | ✅ |
| Ensembl / GENCODE | re-annotation — gene/transcript models revised, stable-ID versions bump, IDs retired/merged | GTF/GFF3/FASTA · ~quarterly | 🟢 | ✅ |
| UniProt / Swiss-Prot | re-annotation + accession merge — sequences corrected; UniSave gives per-entry history | flat/FASTA/XML · 8-weekly | 🟢 | ✅ |
| Pfam | accession deprecation — families "killed"/merged into clans; dead_families list shipped | Stockholm HMM · numbered | 🟢 | ✅ |
| InterPro | restatement — member-DB signatures re-integrated; entries change | XML/TSV · 8-weekly | 🟢 | ✅ |
| PDB (wwPDB) | re-refinement — entries re-versioned; the 2007 remediation + a 2022–23 268-entry re-release transform coordinates | PDB/mmCIF · weekly + campaigns | 🟢 | ✅ |
| NCBI Taxonomy | merge + rename — taxids merged to secondary; names/ranks change; the taxid-changelog tool is an oracle | taxdump (flat) · ~daily | 🟢 | ✅ |
| GTDB | reclassification — organisms renamed/moved across releases (e.g. Shigella folded into E. coli) | TSV/FASTA/trees · numbered | 🟢 | ✅ |
| Gene Ontology | accession deprecation — ~4,173 terms obsoleted in 3 years; the go-ontology-changes file is a ready answer key | OBO/OWL/GAF · monthly | 🟢 | ✅ |
| HGNC | rename — official gene-symbol changes (BAI1→ADGRB1); a prev_symbol field is built in | TSV/JSON · continuous | 🟢 | ✅ |
| miRBase | rename/renumber — miRNAs renamed (miR-422b→miR-378) and re-bounded; ships miRNA.diff + miRNA.dead | FASTA/EMBL/GFF · numbered | 🟢 | ✅ (notorious) |
| gnomAD | (mostly contrast) — cross-version AF changes are driven by new samples, not reprocessing the same variants | VCF/Hail/TSV · major versions | 🟢 | ◑ |
| OMIM | continuous editing — entries/allelic-variant classifications edited nightly under stable MIM numbers; registration-gated, no clean FTP archive | flat/API · continuous | 🔴 | ✅ |
Physical-science reference (chemistry, materials, physics, astronomy)¶
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| CODATA fundamental constants | re-fit — the 2022 adjustment moved α by 4.5× its 2018 uncertainty, shifting 15 dependent constants; archived ASCII per adjustment | ASCII/HTML · ~4-yearly | 🟢 | ✅ |
| IUPAC standard atomic weights | value edit + reclassification — argon went from 39.948±0.001 to the interval [39.792, 39.963] in 2021 | HTML/PDF · biennial-ish | 🟢 | ✅ |
| Particle Data Group RPP | re-fit — the neutron-lifetime world average drifted 885.7 s (≤2010) → 878.6 s (2026) on a stable node; machine-readable mass_width files per year | web/PDF/CSV · annual/biennial | 🟢 | ✅ |
| HITRAN | methodology re-release — HITRAN2020 completely replaced the CO₂ line list for all 12 isotopologues vs 2016 | .par fixed-width · major editions | 🟡 | ✅ |
| Materials Project | recompute — v2021.05.13 silently changed formation energies for many existing mp-ids via a new correction scheme | JSON/API/dumps · dated versions | 🟡 | ✅ |
| NASA Exoplanet Archive | reclassification — re-selecting a planet's "default parameter set" changes its headline mass/radius/period | CSV/VOTable/TAP · weekly | 🟡 | ◑ |
| Gaia data releases | re-derivation — the same source_id's astrometry/photometry is re-derived (EDR3→DR3 photometry correction folded in) | VOTable/FITS/TAP · major DR + errata | 🟡 | ◑ |
| ChEMBL | re-curation — ChEMBL_33 re-annotated ~250k existing activities; full dumps kept indefinitely | DB dumps/RDF/SDF · numbered | 🟢 | ✅ |
| DrugBank | corrections — invalid structures/FASTA headers fixed under a stable DBID; academic license required | XML/CSV/SDF · semver | 🟡 | ✅ |
| Crystallography Open Database | continuous editing — every CIF is under SVN; each correction is a new revision (already version-controlled; binoc adds the human summary) | CIF/MySQL/SVN · continuous | 🟢 | ✅ |
| NIST Atomic Spectra DB | re-compilation — energy levels/wavelengths revised across versions; but only the current version is served | web/ASCII export · numbered | 🔴 | ✅ |
| Minor Planet Center MPCORB | re-derivation — a designation's orbital elements re-fit daily as observations arrive; MPC hosts no archive of past dailies | fixed-width/JSON/SQLite · daily | 🟡 | ✅ |
| PubChem Compound | re-standardization — existing CIDs re-canonicalized as the structure pipeline re-runs; dominated by appends | SDF/XML/ASN.1 · rolling + monthly dump | 🟡 | ◑ |
| SIMBAD / VizieR (CDS) | continuous editing — SIMBAD revises an object's coords/cross-IDs as literature is folded in; no dated dumps | TAP/VOTable/ASCII · continuous | 🔴 | ◑ |
| NIST-JANAF tables | (contrast) — historically revised across editions but frozen since 1998; no live before/after | HTML/PDF · frozen | — | 🚫 |
Standards, identifier registries & knowledge bases¶
Chosen because these domains revise in place by construction — there is no append-only trap here. Many ship a publisher-authored changelog that doubles as ground truth.
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| IANA tz database | retroactive correction — 2025a corrected Philippine offsets before 1900 & 1937–90; 2024b corrected Mexico 1921–1997. The NEWS file is a built-in answer key | text source · ~3–6/yr | 🟢 | ✅ (flagship) |
| OFAC SDN list | value edit + delisting — entities added, removed, and silently edited (aliases, passport numbers) with no per-record changelog | XML/CSV/PIP · ~daily | 🟢 | ✅ (flagship) |
| GLEIF LEI | value edit + status — legal names/addresses revised; status ISSUED→LAPSED→RETIRED; daily delta files shipped | XML/CSV/JSON · daily ×3 | 🟢 | ✅ |
| CVE / NVD | rescore + reclassification — CVSS scores revised, descriptions edited, records flipped to REJECT; per-CVE change history exposed | JSON (CVE 5.0) · hourly | 🟢 | ✅ |
| Unicode CLDR | value edit — a locale's translations/number/date formats change between tagged releases | XML (LDML)/JSON · ~2/yr | 🟢 | ✅ |
| MITRE ATT&CK | reclassification + revoke — techniques revoked/merged/renamed (T1574.002 revoked into the renamed T1574.001); official detailed changelog | STIX 2.1 JSON · ~2/yr | 🟢 | ✅ |
| ISO 3166 country codes | reassignment + rename — names change (Turkey→Türkiye, Macedonia→North Macedonia); official DB paywalled, GitHub mirrors carry history | DB/newsletters · ad hoc | 🟡 | ✅ |
| ISO 4217 currency codes | reassignment + delisting — numbered amendments retire/replace codes | XML/PDF · per amendment | 🟢 | ✅ |
| OurAirports | reassignment — a persistent integer ID survives an airport code change or a status→closed; full Git history | CSV · nightly (Git) | 🟢 | ✅ |
| Wikidata | continuous editing — the same entity's statements change (and get vandalized + reverted); weekly JSON dumps | JSON/RDF · weekly + live | 🟢 | ✅ |
| MusicBrainz | continuous editing + merges — MBID redirects record merges; twice-weekly dumps | PostgreSQL/JSON · ~2×/wk | 🟢 | ✅ |
| Public Suffix List | edit + delisting — rules edited/removed under a single file; full Git history | .dat · a few/wk | 🟢 | ✅ |
| IEEE MAC OUI registry | reassignment/rename — org names change on M&A under a fixed prefix | TXT/CSV · ~daily | 🟡 | ✅ |
| IANA Root Zone DB | reassignment + delisting — registry-operator changes, ccTLD retirements | HTML/root.zone · continuous | 🟡 | ✅ |
| DBpedia | continuous editing — re-extracted from Wikipedia each release, so the same entity's facts shift | RDF/TTL · periodic | 🟢 | ✅ |
Government, legal, civic & ML benchmarks¶
| Dataset (publisher) | Revises | Format · cadence | Acq. | Verdict |
|---|---|---|---|---|
| eCFR | continuous editing — the same regulation text is amended in place; the live "current XML" overwrites with no section-level diff exposed | XML/JSON/PDF · daily | 🟢 | ✅ (flagship) |
| US Code (OLRC) | continuous editing + reclassification — sections renumbered in place; OLRC ships editorial-reclassification tables as an answer key | USLM XML · release points | 🟢 | ✅ (flagship) |
| SEC EDGAR Financial Statement Sets | restatement — a fiscal period is refiled (10-K/A) with restated figures; EDGAR keeps the original and every amendment forever | TSV/ZIP (XBRL) · quarterly | 🟢 | ✅ (flagship) |
| labelerrors.com corrected sets | relabeling — given vs. corrected labels keyed to original indices across 10 benchmarks; ≥6% of the ImageNet val set, 2,916 val errors (arXiv). A pre-built gold diff | JSON/CSV overlay · one-shot | 🟢 | ✅ (ML flagship) |
| ImageNet + ReaL/ReLabel | relabeling — same images, single → corrected/multi-label; 30–34% of images have multiple valid labels | label files · multiple relabelings | 🟢 | ✅ |
| MS COCO | relabeling — ~273k annotation errors found; MJ-COCO-2025 is a corrected re-release sharing image IDs | JSON annotations · patched + forks | 🟢 | ✅ |
| Hugging Face Hub datasets | versioned silent update — a dataset is a Git repo; main advances and load_dataset pulls new content unless revision= is pinned | Parquet/Arrow/CSV · per-commit | 🟢 | ✅ |
| CourtListener / Free Law | silent value edit — opinions corrected/withdrawn/superseded; text re-OCR'd over time, under a stable cluster/opinion ID | JSON bulk/API · rolling | 🟢 | ✅ |
| FEC filings | restatement — amendments (F3 amend-1, -2…) supersede the original for the same committee/period | .FEC/CSV/JSON · nightly | 🟢 | ✅ |
| MIT Election Lab returns | silent value edit — parallel "unofficial" and "official/certified" repos hold the same contest's revised totals | CSV · per cycle (Git) | 🟢 | ✅ |
| USPTO Patent Assignment | reclassification — assignee/role disambiguation re-resolved across annual editions for the same patents | CSV bulk · annual | 🟡 | ✅ |
| Congressional bill text | reformat/version progression — Introduced→Engrossed→Enrolled under one bill ID (but versions are labeled, so partly already changelogged) | USLM XML · per stage | 🟢 | ◑ |
| LAION-5B → Re-LAION-5B | re-release with deletions — Re-LAION removed 2,236 links (a safety scrub) under a refreshed identity | Parquet index · re-release | 🟡 | ✅ (deletion-only) |
| Kaggle datasets | versioned silent update — immutable numbered versions under one slug; consumers pull "latest"; API resists fetching prior versions | any · per-version | 🟡 | ✅ |
| USAspending.gov | (mostly contrast) — mods are reported as new records by design; only the "Correction Delete Indicator = D" path is true revision | CSV/ZIP/API · quarterly | 🟢 | 🚫 |
| Common Crawl / C4 | (mostly contrast) — each monthly crawl is a fresh web cohort; only C4's changing cleaning heuristics are a minor revision angle | WARC/WET · monthly | 🟢 | 🚫 |
Cross-cutting findings¶
1. Acquirability is the gating constraint, and it sorts cleanly¶
The revision behavior is nearly universal; the ability to get two comparable snapshots is what separates a buildable target from a research curiosity. Three tiers recur across every field:
- 🟢 Turnkey — the publisher hosts the history. Either a purpose-built vintage store (ALFRED, IMF WEO, ALFRED-fed BEA/BLS/Fed), numbered/dated releases on open FTP (nearly all of the genomics table; ChEMBL; CODATA; PDG), a monthly/daily full re-release (FDA Purple Book, GLEIF, OFAC, WDPA), or Git or an equivalent publisher-retained history (OSM full-history, Wikidata dumps, Our World in Data, Hugging Face, OurAirports, US Code release points, SEC EDGAR). Most of the catalog falls in this tier, and it is where showcase targets should come from.
- 🟡 DIY — stable schema, but you must capture dated copies yourself. The data is a clean keyed file but the publisher serves only "current" (NASA Exoplanet Archive TAP, MPCORB dailies, UN Comtrade, NIST ASD, SIMBAD, Materials Project old versions, World Bank old editions). Snapshots come from your own scheduled pulls or the Wayback Machine. binoc works fine; the collector carries the burden.
- 🔴 Hard — gated, paywalled, or deleted. SEER (data-use agreement), OMIM (registration), ISO 3166 (paywalled official DB), Sentinel-2 old baselines (actively deleted Oct–Nov 2024), CSD/ICSD (commercial). Worth naming; not first targets.
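For the DIY tier, the collector's job reduces to keeping dated copies and not storing a new one when nothing changed. The following is a minimal sketch under stated assumptions: `store_snapshot`, the `.bin` naming, and the directory layout are hypothetical, fetching the payload (a scheduled pull or a Wayback retrieval) is left out, and snapshots are assumed to arrive in date order.

```python
import hashlib
import tempfile
from datetime import date
from pathlib import Path
from typing import Optional


def _sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def store_snapshot(payload: bytes, root: Path, name: str, day: date) -> Optional[Path]:
    """Save payload as root/name/YYYY-MM-DD.bin; skip it if identical to the latest copy."""
    folder = root / name
    folder.mkdir(parents=True, exist_ok=True)
    # ISO dates sort lexicographically, so the last entry is the most recent snapshot.
    existing = sorted(folder.glob("*.bin"))
    if existing and _sha256(existing[-1].read_bytes()) == _sha256(payload):
        return None  # unchanged since the last capture, nothing new to diff
    target = folder / f"{day.isoformat()}.bin"
    target.write_bytes(payload)
    return target


# Demo with made-up CSV payloads in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = store_snapshot(b"id,value\n1,10\n", root, "example", date(2026, 1, 1))
    same = store_snapshot(b"id,value\n1,10\n", root, "example", date(2026, 1, 2))
    changed = store_snapshot(b"id,value\n1,11\n", root, "example", date(2026, 1, 3))
    print(first.name, same, changed.name)  # 2026-01-01.bin None 2026-01-03.bin
```

Every stored file then marks a date on which the publisher's "current" copy actually differed, which is exactly the pair boundary a before/after diff needs.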
2. A surprising number of publishers ship their own answer key¶
The most useful pattern for testing binoc: many of these datasets revise silently in the payload but ship a separate, machine-readable record of what changed. That artifact is a ground-truth oracle — run binoc on two snapshots, then check its generated changelog against the publisher's:
| Dataset | Publisher-shipped answer key |
|---|---|
| IANA tz database | the NEWS file (per-release retroactive corrections) |
| Gene Ontology | go-ontology-changes |
| miRBase | miRNA.diff / miRNA.dead |
| HGNC | prev_symbol column |
| NCBI Taxonomy / dbSNP | merged-id lists / RsMergeArch |
| NVD | per-CVE change history + "Last Modified" |
| GeoNames | daily modifications-YYYY-MM-DD deltas |
| MITRE ATT&CK | the detailed version-to-version changelog |
| US Code (OLRC) | editorial-reclassification tables |
| labelerrors.com / MJ-COCO | the corrected-label overlay itself |
| GISTEMP | the dated "Updates to Analysis" log |
| Natural Earth, Crystallography Open Database (COD) | Git CHANGELOG / SVN revision history |
These should be the first datasets used to build binoc's quality regression: the desired output already exists in structured form.
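Grading against an answer key can be sketched as a set comparison between the changes a diff reports and the changes the publisher lists. Everything here is an assumption for illustration: the `grade` function, the representation of a rename as an `(old, new)` pair, and the placeholder symbols other than BAI1→ADGRB1 are not taken from binoc or from HGNC's actual files.

```python
def grade(detected: set, answer_key: set) -> dict:
    """Compare detected changes with a publisher answer key (both sets of hashable items)."""
    true_positives = len(detected & answer_key)
    return {
        # Share of reported changes that the publisher also lists.
        "precision": true_positives / len(detected) if detected else 1.0,
        # Share of publisher-listed changes that were reported.
        "recall": true_positives / len(answer_key) if answer_key else 1.0,
        "missed": sorted(answer_key - detected),
        "extra": sorted(detected - answer_key),
    }


# Renames as (previous symbol, current symbol). Only the first pair comes from
# the catalog above; the OLD*/NEW* pairs are placeholders.
publisher_key = {("BAI1", "ADGRB1"), ("OLD1", "NEW1"), ("OLD2", "NEW2")}
reported = {("BAI1", "ADGRB1"), ("OLD1", "NEW1"), ("OLD3", "NEW3")}

result = grade(reported, publisher_key)
print(result["missed"])  # [('OLD2', 'NEW2')]
print(result["extra"])   # [('OLD3', 'NEW3')]
```

Because publisher answer keys are often partial, an "extra" item is not automatically a false positive; it is a candidate for manual review, while a "missed" item is a clearer regression signal.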
3. The dual-axis trap is the recurring curation hazard¶
Several of the most cited datasets revise on one axis and vintage on another, and a naive run hits the wrong one. The discipline, in every case, is to diff the right artifact and join on the stable key:
- BRFSS / SEER: the annual new-cohort axis is a vintage trap; the silently re-released historical files (BRFSS) and the retroactively recoded full series (SEER) are the true-revision targets.
- Gaia / NASA Exoplanet Archive / ChEMBL / PubChem / MPCORB: all also append new objects heavily — diff on the stable key (source_id, mp-id, CID, designation) and ignore pure additions, which is exactly the add-vs-revise distinction binoc's correspondence engine must surface.
- gnomAD / GRCh38 patches: cross-version change is dominated by new samples / added alt-loci, not in-place revision of existing entries — scope tightly or skip.
- ACS / TIGER / NHANES / USAspending / Common Crawl: predominantly vintage; include only as contrast, or restrict to the narrow true-revision sliver (ACS 2020 reweighting; the USAspending "D" correction path).
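The add-vs-revise discipline described in this list can be illustrated with a keyed comparison that reports additions, removals, and in-place revisions separately. This is a sketch of the idea only, not binoc's correspondence engine; `split_changes`, the object keys, and the values are made up for the example.

```python
def split_changes(old: dict, new: dict):
    """Split a keyed diff into (added, removed, revised) lists of keys."""
    added = sorted(new.keys() - old.keys())      # keys that only exist in the new snapshot
    removed = sorted(old.keys() - new.keys())    # keys that disappeared
    revised = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, revised


# Records keyed by a stable identifier; keys and values are illustrative.
old_snapshot = {"obj-1": {"a": 2.77}, "obj-2": {"a": 1.46}}
new_snapshot = {"obj-1": {"a": 2.77}, "obj-2": {"a": 1.47}, "obj-3": {"a": 3.10}}

added, removed, revised = split_changes(old_snapshot, new_snapshot)
print(added)    # ['obj-3']
print(removed)  # []
print(revised)  # ['obj-2']
```

For an append-heavy dataset the `added` list will dominate; the revision story lives in `revised`, which stays small and readable once pure additions are set aside.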
4. Format coverage: what pointing binoc here actually demands¶
The catalog stress-tests the rule families well beyond the CSV/ZIP showcase. By
frequency, the formats a "top-100" run must handle: delimited tabular (CSV/TSV,
the bulk), then bioinformatics flat formats (VCF, GFF/GTF, FASTA,
GenBank, mmCIF/PDB, OBO/OWL, Stockholm) — a large, underserved cluster;
XBRL-derived TSV (EDGAR); fixed-width scientific text (GHCN, PDG
mass_width, MPCORB, HITRAN .par, CODATA ASCII); gridded binary
(NetCDF/HDF/GRIB for climate and reanalysis); geospatial vector
(shapefile, GeoJSON, GeoPackage, .osm.pbf); structured documents
(USLM/legislative XML, STIX 2.1 JSON, HL7 SPL XML, LDML); and versioned
columnar (Parquet/Arrow on the Hugging Face Hub). This intersects the
data.gov format landscape findings: XML and
geospatial vector are the largest gaps; the genomics flat formats and gridded
scientific binary are the largest new demand surfaced by this revision lens
specifically. None is out of architectural scope; several (VCF, GTF, NetCDF,
mmCIF) would each unlock a whole field's worth of turnkey, answer-key-bearing
targets.
The strongest targets, and how they extend the showcase¶
The existing showcase leads with US-government tabular CSVs. This catalog says the highest-signal additions, ranked by signal × acquirability × a documented incident, are:
- IANA tz database — the platonic case: a tiny text dataset that retroactively rewrites the past, ships its own NEWS answer key, and has full Git history. The clearest possible "they edited history and you'd never know" story.
- ClinVar — stable keys, monthly dated archives, a quantified reclassification rate, and life-or-death stakes. The flagship for a scientific audience.
- US Code release points / eCFR — "the same law, silently amended, no diff shipped," in clean USLM XML, with OLRC's reclassification tables as ground truth. The flagship for a legal audience.
- SEC EDGAR restatements — a 10-K vs its 10-K/A: same period, restated numbers, real financial stakes, everything retained forever.
- OFAC SDN / GLEIF — daily-cadence registries that edit and delist entities with no per-record note; GLEIF even ships delta files.
- NOAA GHCN / NASA GISTEMP — the politically loaded climate case: past months' temperatures change with each homogenization run, and GISTEMP's update log is the oracle.
- labelerrors.com / ImageNet relabeling — the ML-audience hook: a pre-built before/after over the most famous benchmarks in the field.
Each is a turnkey acquire, each has a citable incident, and several ship the answer key binoc's output can be graded against. They are the natural next wave of showcase targets once the formats they need (text/tz source, VCF, USLM XML, XBRL TSV) are in reach — and they extend the showcase's reach from "US open-data CSVs" to law, science, finance, and machine learning, which is where the "I want changelogs like that for my data" reaction is most likely to land.
Next steps¶
- Promote the turnkey, tabular-or-text true-revision targets into the showcase pipeline first (tz, FDA already in; add OFAC and GLEIF next, then ClinVar once VCF lands and US Code/eCFR once USLM XML lands).
- Build the quality regression on the answer-key datasets (§2): they give a structured target to diff binoc's generated changelog against, which the current showcase (verbatim-output-only) lacks.
- Treat the dual-axis datasets as correspondence-engine tests (§3): they are the cleanest real-world exercises of add-vs-revise discrimination on a stable key.
- Use the format demand (§4) to prioritize parse rules: VCF, GFF/GTF, USLM/legislative XML, and NetCDF each convert a whole field from "interesting but unbuildable" to a stack of turnkey targets.