Documentation Platform and Information Design
Date: 2026-04-17 Status: Proposed
Context
Binoc's documentation has grown organically and now includes:
- A user-facing README.md doing double duty as marketing landing page and architectural overview.
- An auto-regenerated docs/tutorial.md (Showboat-verified, regenerated by just docs; see 2026-03-06-tutorial_regeneration_lifecycle.md).
- A long-form docs/writing_plugins.md that mixes step-by-step instruction, task recipes, and reference material in one file.
- A release runbook (docs/release.md).
- 28 ADRs under docs/adr/ covering rationale and rejected alternatives.
- Worked examples (docs/examples/fasta-demo/).
- Implicit documentation in test-vectors/: each vector demonstrates a capability and ships a manifest.toml describing it.
- An AGENTS.md at the repo root encoding the project's architectural rules for both human and AI contributors.
There is no docs site. Everything is read on GitHub. There is no rendered
view of the ADR cross-references, no search across the corpus, no
auto-generated reference for the Python API, the Rust SDK, the CLI, or the
changeset JSON schema. The published packages — binoc, binoc-sqlite,
and binoc-sdk (see 2026-04-08-release_surface_and_automated_publishing.md)
— have no documentation home outside their PyPI / crates.io pages.
The information design is also implicitly mode-mixed. writing_plugins.md
opens by teaching, then transitions into a concept reference, then
descends into per-method API tables. New contributors and plugin authors
have to construct the mental model themselves from the ADR backlog,
which is also where most of the architectural reasoning lives.
This ADR commits to a documentation platform and a content-organization discipline before that backlog grows further.
Decision
1. Platform: MkDocs with the Material theme
mkdocs-material is the documentation generator. Reasons specific to this
repo:
- Toolchain alignment. The project is Rust + Python + uv + just. MkDocs is pure Python, runnable through uv (for example, uvx --with mkdocs-material mkdocs build), and adds no Node.js dependency. just docs-serve and just docs-build slot in next to the existing just docs.
- Markdown in, markdown out. Every existing docs/*.md file becomes a page with no rewriting. The Showboat-regenerated docs/tutorial.md is consumed unchanged on each build, so the two pipelines stay uncoupled.
- First-class GitHub Pages deploy. mkdocs gh-deploy and the standard actions/deploy-pages workflow are both well-supported.
- Mermaid, admonitions, code-tab UI built in. pymdownx.superfences and the Material defaults give us inline architecture diagrams, "Note / Warning / See also" callouts, and tabbed code samples without bespoke authoring tools. This is the substrate the architecture-visuals proposal will sit on top of.
- Used by adjacent projects in the Python data ecosystem (FastAPI, Pydantic, Typer, uv, mkdocstrings itself), which keeps both the contributor pool and the LLM training data familiar with our patterns.
mkdocs.yml lives at the repo root. The site is served from docs/
unchanged; the only required addition is a single-line docs/index.md
that includes README.md via mkdocs-include-markdown-plugin so the
landing page is not a duplicate.
2. Information design: Diátaxis with a binoc-specific mapping
The site is organized along Daniele Procida's Diátaxis quadrants. The two axes are learning vs. doing and concrete vs. abstract; mixing modes on one page is the most reliable way to produce bad docs.
| Mode | Binoc home | Status |
|---|---|---|
| Tutorial (learning, concrete) | docs/tutorial.md | Keep as-is; regenerated via Showboat. |
| How-to (doing, concrete) | docs/howto/*.md | New. Short, task-titled recipes. |
| Reference (doing, abstract) | docs/reference/{cli,python,changeset,sdk}.md | New, mostly auto-generated. |
| Explanation (learning, abstract) | docs/explanation/architecture.md plus docs/adr/*.md | New top-level overview; existing ADRs become navigable. |
The existing docs/writing_plugins.md is mode-mixed and is split:
- docs/howto/write-a-python-comparator.md, write-a-rust-comparator.md, write-a-transformer.md, write-a-renderer.md: each a copy-pasteable recipe for one task, ending with a working plugin.
- docs/reference/python.md and docs/reference/sdk.md: exhaustive API surface, auto-generated.
- docs/explanation/plugin-architecture.md: what a plugin is, why the three-axis split exists, when to choose Python vs. Rust, what artifacts buy you. Links into the relevant ADRs rather than repeating them.
3. The audience map drives entry points and cross-linking, not the file tree
Each page declares the audience it is primarily written for in its frontmatter. Some pages may also list secondary audiences, because a well-cut explanation or reference page often serves more than one kind of reader. The five recurring audiences are:
- Data steward / archivist. Lands on a how-to. "Diff two snapshots of a federal dataset and produce a CHANGELOG."
- Pipeline integrator. Lands on the changeset JSON reference. Cares about schema stability and exit codes.
- Domain-format plugin author (Python or Rust). Lands on a how-to, reaches into the SDK reference and the plugin-architecture explanation.
- Core contributor. Lands on AGENTS.md, the architecture explanation, and the ADR index.
- AI agent / LLM-driven workflow. A real audience now. The site's structure should let an agent route a user's query to the single correct page rather than synthesizing from many. This is a primary reason to enforce the Diátaxis split.
The top-level navigation in mkdocs.yml is organized by Diátaxis mode
(Tutorial / How-to / Reference / Explanation), not by audience. Audience
metadata is used for three lighter-weight routing aids instead:
- a single Start here page organized by role
- section-index guidance ("if you are X, start with Y")
- per-page introductory copy and related links
This keeps the global structure stable and task-shaped while still giving first-time readers a role-based way in.
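As a rough illustration of how audience frontmatter could feed the role-based Start here page, the sketch below groups pages by declared role. The frontmatter keys (audience, secondary_audiences), the role slugs, and both function names are assumptions made for this example; the ADR does not fix a frontmatter schema.

```python
from collections import defaultdict


def read_audiences(markdown_text):
    """Return (primary, secondaries) from a page's leading frontmatter block.

    Assumed layout (hypothetical): a '---' fenced block holding
    'audience: <role>' and an optional 'secondary_audiences: a, b' line.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, []
    primary, secondary = None, []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "audience":
            primary = value or None
        elif key == "secondary_audiences":
            secondary = [v.strip() for v in value.split(",") if v.strip()]
    return primary, secondary


def start_here(pages):
    """Group page paths by role; primary pages sort ahead of secondary ones."""
    by_role = defaultdict(list)
    for path, text in pages.items():
        primary, secondary = read_audiences(text)
        if primary:
            by_role[primary].append((0, path))
        for role in secondary:
            by_role[role].append((1, path))
    return {role: [p for _, p in sorted(entries)] for role, entries in by_role.items()}


# Hypothetical pages and role slugs, used only to exercise the grouping.
pages = {
    "docs/howto/diff-two-snapshots.md": "---\naudience: data-steward\n---\n",
    "docs/reference/changeset-schema.md": (
        "---\naudience: pipeline-integrator\nsecondary_audiences: plugin-author\n---\n"
    ),
    "docs/howto/write-a-python-comparator.md": "---\naudience: plugin-author\n---\n",
}
routes = start_here(pages)
print(routes["plugin-author"])
```

The grouping only produces routing cues; the navigation tree itself stays organized by Diátaxis mode.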
4. Reference is generated, not written
Reference pages decay if hand-written. Three generators:
- Python API: mkdocstrings[python] reads docstrings from the binoc package and renders docs/reference/python.md at site-build time. This is the canonical API surface for binoc-python.
- Rust SDK: cargo doc --no-deps --package binoc-sdk is built separately and published under /sdk/ on the same site. The MkDocs build copies the generated HTML into the output tree as a static subpath. This is the only Rust crate published to crates.io; it is also the only one that gets a dedicated reference site. Internal crates (binoc-core, binoc-stdlib, etc.) do not get hosted reference docs to discourage external dependencies on unstable surfaces, consistent with 2026-04-08-release_surface_and_automated_publishing.md.
- CLI: a small Rust binary in binoc-cli emits Markdown for every subcommand and option (using clap_markdown or equivalent), producing docs/reference/cli.md. The just docs recipe runs this generator alongside Showboat. CLI reference is a build artifact, not an authored file.
The fourth reference page — the changeset JSON schema — is also
generated, from a JSON Schema emitted by binoc-sdk via a schemars
derive gated behind an opt-in schema cargo feature (see Open Question 1).
This page is the contract for pipeline integrators. It needs to be
stable and exhaustive, which is exactly what generation gives.
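To make "generated, not written" concrete for the schema page, here is a minimal sketch of rendering a JSON Schema document into one Markdown table per type. It is not the project's scripts/build_changeset_schema_page.py; the function names and the Node type with its fields are hypothetical, and the sketch handles only flat object definitions.

```python
def field_kind(prop):
    """Describe a property as its inline type(s) or the name of a referenced type."""
    kind = prop.get("type")
    if isinstance(kind, list):  # e.g. ["string", "null"] for an optional field
        return " or ".join(kind)
    if kind:
        return kind
    return prop.get("$ref", "any").rsplit("/", 1)[-1]


def render_schema_tables(schema):
    """Render one Markdown table per object type defined in the schema."""
    # Older generator output uses "definitions"; newer JSON Schema drafts use "$defs".
    types = schema.get("$defs") or schema.get("definitions") or {}
    out = []
    for type_name in sorted(types):
        spec = types[type_name]
        required = set(spec.get("required", []))
        out += [f"## {type_name}", "", "| Field | Type | Required |", "|---|---|---|"]
        for field, prop in spec.get("properties", {}).items():
            flag = "yes" if field in required else "no"
            out.append(f"| {field} | {field_kind(prop)} | {flag} |")
        out.append("")
    return "\n".join(out)


# Hypothetical schema fragment; the real changeset types are not shown in this ADR.
schema = {
    "definitions": {
        "Node": {
            "type": "object",
            "required": ["path"],
            "properties": {
                "path": {"type": "string"},
                "summary": {"type": ["string", "null"]},
                "children": {"type": "array"},
            },
        }
    }
}
page = render_schema_tables(schema)
print(page)
```

Because the table is derived from the emitted schema, a field added to the Rust types shows up on the page at the next docs build without a manual edit.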
5. ADRs are first-class explanation content
The 28 existing ADRs are already the most thorough explanation layer in
the project. Treating them as an internal-only backlog
underserves them. They become a top-level section of the site
(Explanation → Architectural Decisions), with docs/adr/README.md
auto-extended at build time by a small script that reads each ADR's
front matter (Date, Status) and produces the index entry — the current
hand-maintained index.md is a candidate for replacement here.
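A minimal sketch of such an index script follows, assuming each ADR carries plain "Date: YYYY-MM-DD" and "Status: Word" text near the top (as this ADR does) and that the first non-empty line is the title. The function names, the entry format, and the example filename are illustrative assumptions, not the project's script.

```python
import re
from pathlib import Path

DATE = re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})")
STATUS = re.compile(r"Status:\s*([A-Za-z]+)")


def index_entry(filename, text):
    """Build one index line from an ADR's title, Date, and Status."""
    title = next(
        (line.lstrip("# ").strip() for line in text.splitlines() if line.strip()),
        filename,
    )
    date = DATE.search(text)
    status = STATUS.search(text)
    return "- [{}]({}) ({}, {})".format(
        title,
        filename,
        date.group(1) if date else "undated",
        status.group(1) if status else "unknown",
    )


def build_index(adr_dir):
    """Return index lines for every ADR in a directory, sorted by filename."""
    entries = []
    for path in sorted(Path(adr_dir).glob("*.md")):
        if path.name == "README.md":  # the index itself is not an ADR
            continue
        entries.append(index_entry(path.name, path.read_text(encoding="utf-8")))
    return "\n".join(entries)


# Hypothetical filename; the header mirrors the one at the top of this ADR.
entry = index_entry(
    "2026-04-17-docs_platform.md",
    "# Documentation Platform and Information Design\n\nDate: 2026-04-17 Status: Proposed\n",
)
print(entry)
```

Since ADR filenames begin with their date, sorting by filename also yields chronological order.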
ADR cross-references already use relative markdown links and continue to work unchanged. New rule: when an ADR is canonical for a concept, the long-form explanation page links to it rather than restating the rationale. ARID over DRY — short prerequisite restatements are fine; full parallel explanations are not.
A separate "Architecture overview" explanation page (docs/explanation/architecture.md)
is the single entry point to the architecture story, and it is the
natural home for the diagrammatic visuals proposed elsewhere (see
the project's architecture-visuals plan, tracked outside this
repository). That overview links into the ADR set; the ADRs are the
long-form record.
6. The docs build has multiple regeneratable upstreams; just orchestrates them
The docs site is a consumer of generated markdown, not a participant in
its generation. Several upstreams produce input files, all coordinated by
just with cache-aware recipes so a clean rebuild is cheap and a no-op
rebuild is free:
- Showboat regenerates docs/tutorial.md (and, in time, executable blocks in how-tos) by re-running embedded shell. Boundary set by 2026-03-06-tutorial_regeneration_lifecycle.md. The default authoring path for runnable code samples; reach for custom generators only when Showboat is genuinely insufficient.
- CLI markdown is emitted from binoc-cli into docs/reference/cli.md.
- Python API is rendered into the site at MkDocs build time by mkdocstrings, sourced from binoc-python docstrings.
- Rust SDK reference is built by cargo doc --no-deps --package binoc-sdk and copied into the site under /sdk/.
- ADR index (docs/adr/README.md) is regenerated from the front matter of docs/adr/*.md.
- Test-vector gallery is emitted from shared workspace manifests and committed snapshot layouts into docs/explanation/test-vectors-gallery.md.
Each upstream is a just recipe (just docs-tutorial, just docs-cli,
just docs-sdk, just docs-adr-index, just docs-vectors) with
explicit input dependencies, fronted by an aggregating just docs that
runs only what's stale. The MkDocs build itself (just docs-build →
mkdocs build --strict) is a separate recipe that depends on just docs
and never invokes a generator directly.
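One way a generator could decide whether it is stale is a timestamp comparison between its output and its declared inputs, sketched below. just does not compare file timestamps the way make does, so a check of this kind would live inside each recipe or script. The helper name and the choice of modification times over content hashes are assumptions for illustration.

```python
import os
import tempfile
from pathlib import Path


def is_stale(output, inputs):
    """True when the output is missing or older than any declared input."""
    out = Path(output)
    if not out.exists():
        return True
    built = out.stat().st_mtime
    return any(Path(src).stat().st_mtime > built for src in inputs)


# Demo with explicit timestamps so the result does not depend on timing.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp, "manifest.toml")
    gallery = Path(tmp, "gallery.md")
    manifest.write_text("description = 'demo'\n", encoding="utf-8")
    missing_output = is_stale(gallery, [manifest])  # no output yet

    gallery.write_text("# Gallery\n", encoding="utf-8")
    os.utime(manifest, (1000, 1000))
    os.utime(gallery, (2000, 2000))
    fresh_output = is_stale(gallery, [manifest])  # output newer than input

    os.utime(manifest, (3000, 3000))
    touched_input = is_stale(gallery, [manifest])  # input changed afterwards

print(missing_output, fresh_output, touched_input)
```

Timestamps are cheap but reset on a fresh checkout, so a content-hash comparison may suit CI better.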
CI runs just docs && just docs-build on every PR (PR fails on broken
links or stale generated files); the main-branch workflow additionally
deploys the site. --strict is non-negotiable: given how heavily the
ADRs cross-reference each other, broken-link CI is the single most
valuable guardrail the platform adds.
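The class of error this guardrail catches can be illustrated with a small standalone checker for relative Markdown links. In practice mkdocs build --strict performs the validation; this sketch only shows the idea, and its regular expression covers plain inline links, not reference-style links or HTML.

```python
import re
import tempfile
from pathlib import Path

LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
SCHEME = re.compile(r"[a-z][a-z0-9+.-]*:")


def broken_links(docs_dir):
    """Return (page, target) pairs for relative links whose file is missing."""
    root = Path(docs_dir)
    missing = []
    for page in sorted(root.rglob("*.md")):
        for target in LINK.findall(page.read_text(encoding="utf-8")):
            if SCHEME.match(target) or target.startswith("#"):
                continue  # external URL, mailto:, or same-page anchor
            path = target.split("#", 1)[0]
            if not (page.parent / path).exists():
                missing.append((page.relative_to(root).as_posix(), target))
    return missing


# Demo tree: a.md links to an existing page, a missing page, and an external URL.
with tempfile.TemporaryDirectory() as tmp:
    adr = Path(tmp, "adr")
    adr.mkdir()
    (adr / "b.md").write_text("# B\n", encoding="utf-8")
    (adr / "a.md").write_text(
        "See [B](b.md#context), [C](c.md), and [site](https://example.com).\n",
        encoding="utf-8",
    )
    report = broken_links(tmp)

print(report)
```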
7. Authoring conventions
These are project-specific norms that Diátaxis does not address:
- Mermaid is the default for inline diagrams. Hand-authored SVG only when mermaid is genuinely insufficient (animations, custom layouts). Material's pymdownx.superfences renders mermaid natively.
- Site-level "in active design" banner. Nothing in binoc is stable yet, so per-page stability badges would be uniformly "experimental" and add noise. Instead, the site renders a single compact banner on every page: "Binoc is in a collaborative design phase. The CLI is ready to use; internals are unstable and expected to change. Feedback and collaboration welcome: [link]." This frames the project as malleable rather than unsafe and points contributors at the input channels. Per-page stable/experimental badges are deferred until at least one surface is genuinely stable (likely the changeset JSON schema first, then binoc-sdk).
- Every page is page one. Most readers arrive from search, not the navigation. Each page begins with what it's for and who it's for, and links to its prerequisites inline rather than assuming the reader has read upstream pages.
- Task-oriented titles in How-to. "Diff a zip of CSVs against a SQLite database" beats "The SQLite plugin." How-to titles are written for Google.
- Code samples are runnable. Where possible, how-tos cite snippets from test-vectors/ or docs/examples/ rather than embedding hand-written code that drifts. The Showboat pattern (executable blocks) may be extended to how-tos in a follow-up. Caveat: test-vectors/ ships source trees, not built artifacts (see 2026-04-16-test_vector_materialization.md), so how-tos that demo a vector point at test-vectors-materialized/… and ask the reader to run just materialize first. This is a known cost of keeping opaque binaries out of source control; revisit if it becomes an onboarding friction point.
- One primary audience per page declared in frontmatter, with optional secondary audiences when the page genuinely serves multiple roles. Audience data is used for routing cues, not as a second site taxonomy.
8. Versioning: latest only, for now
Each published package versions independently, so a unified docs version
would be a fiction. We ship latest only — what's on main — and revisit
when a real user hits a version-skew problem. The Status line and
Date already carried by every ADR provide adequate within-corpus
versioning.
9. Deployment
GitHub Pages via the workflow-based deployment path: the docs workflow
builds site/ with just docs-build, uploads it with
actions/upload-pages-artifact, and a gated deploy job publishes it
with actions/deploy-pages. Site URL:
https://harvard-lil.github.io/binoc/. PR builds run just docs plus a
git diff --exit-code -- docs/ staleness check and just docs-build for
link/lint validation, but do not deploy. No gh-pages branch and no
local deploy command — the workflow is the only path to production, so
contributors can't accidentally publish from a laptop. The existing
release.md runbook gains a one-line note that docs deploy is automatic
on push to main and is independent of package releases.
Alternatives Considered
mdBook. Excellent for Rust-only projects shipped via cargo. A
poor fit here because binoc's primary user-facing distribution is
pip install binoc, the bulk of the ecosystem (entry points, plugin
discovery, the CLI bridge) is Python, and mdBook has no story for
mkdocstrings-equivalent Python API generation. We would still need a
second generator for the Python surface, which defeats the unification.
Sphinx + Furo + MyST. The default for "serious" Python docs and strong on cross-reference tooling (intersphinx). Rejected because the project is markdown-native and adopting Sphinx would force MyST or RST on every existing file. The depth of cross-referencing Sphinx provides is overkill for a project whose main reference surfaces are CLI, a small Python API, and a Rust SDK with its own native generator.
Docusaurus / Starlight. Strong sites in the JS ecosystem. Both introduce a Node.js toolchain and a JSX/MDX authoring substrate that the project does not otherwise need, and both are more product-marketing oriented than this project's content actually warrants. Reconsider only if a marketing landing page becomes a separate need from the documentation site.
Read the Docs hosting. A reasonable host for any of the above. The
project already deploys plenty of artifacts via GitHub Actions
(publish.yml, soon docs.yml) so adding RTD is a second deploy
target with no clear advantage. GitHub Pages is sufficient.
Skip the platform; keep reading on GitHub. Tempting because the markdown is already there. Rejected because: there is no search across the corpus; ADR cross-references are not visualized; no place to host generated reference; no link validation in CI; and pipeline integrators have nowhere to find the changeset JSON schema. The project has outgrown read-on-GitHub.
Diátaxis as a soft suggestion rather than enforced structure.
Tried implicitly already (the existing writing_plugins.md is the
artifact). Mode-mixing produced a single 570-line page that serves
nobody's primary need well. Enforcing the four directories is cheap
discipline that prevents the next 570-line file.
One unified Rust API site (all crates). Hosting cargo doc for
the entire workspace. Rejected per 2026-04-08-release_surface_and_automated_publishing.md:
publishing reference for unpublished crates encourages external
dependencies on unstable internals. Only binoc-sdk gets a hosted
reference page.
Consequences
- A real docs URL. https://harvard-lil.github.io/binoc/ becomes the canonical reference, taking pressure off the README to do everything.
- The README slims down. It stays a marketing landing page and a pointer to the docs site. Architectural overview moves to docs/explanation/architecture.md.
- docs/writing_plugins.md is split into four how-to recipes, one reference page (or two: Python vs. SDK), and one explanation page. This is a real authoring task, not a redirect.
- CLI, Python API, and changeset JSON schema all become generated reference. New code in binoc-cli and binoc-python to emit the generators' inputs is the first non-prose work this ADR creates.
- CI gains a docs build job that fails on broken links. ADR authors get immediate feedback on cross-reference typos.
- The 28 ADRs become navigable, with search and a generated index. Their value as the project's reasoning corpus increases sharply when they stop being a list-of-files.
- Plugin authors get a real reference, not a paragraph in a long guide. This is the highest-impact downstream effect: the binoc-sdk audience is precisely the audience most underserved by the current setup.
- The architecture-visuals proposal has a substrate to land on: mermaid renders out of the box, and animations or interactives can be embedded as raw HTML in markdown.
Bootstrap: single migration pass to the Diátaxis layout
The migration ran as one pass rather than incremental restructuring of
the existing prose: the old tutorial / writing_plugins / release pages
were moved to a temporary docs/legacy/ holding directory, the new
Diátaxis frame was scaffolded under docs/howto/, docs/reference/,
and docs/explanation/, and content was distilled from the legacy
sources into the new files. After the new pages landed, docs/legacy/
was removed. README, docs/adr/, and AGENTS.md were already
mode-correct and stayed in place.
Page inventory and provenance
Each page below was authored from the listed source(s). Tags: lift = move material with light editing; compose = synthesize across sources; split = extract one section from a larger legacy file; new = net-new authoring; generated = machine-emitted.
Landing + tutorial
| File | Sources | Tag |
|---|---|---|
| docs/index.md | README (lead, example, quick start) | compose |
| docs/tutorial.md | legacy/tutorial.md trimmed to actual tutorial scope (architecture sections move out) | compose; Showboat regen |
How-to (task-titled recipes; one focused job each)
| File | Sources | Tag |
|---|---|---|
| docs/howto/diff-two-snapshots.md | README quick start + legacy/tutorial | compose |
| docs/howto/save-and-render-changesets.md | output_routing_and_cli_ux ADR + README | compose |
| docs/howto/extract-changed-data.md | provenance_and_extract ADR + README extract section | compose |
| docs/howto/install-and-use-plugins.md | README plugins section + plugin_discovery ADR | compose |
| docs/howto/write-a-python-comparator.md | legacy/writing_plugins (Python comparator) | split |
| docs/howto/write-a-python-transformer.md | legacy/writing_plugins (Python transformer) | split |
| docs/howto/write-a-python-renderer.md | legacy/writing_plugins + binoc-html model plugin | split |
| docs/howto/write-a-rust-comparator.md | legacy/writing_plugins (Rust) + binoc-sqlite model plugin | split |
| docs/howto/write-a-rust-transformer.md | legacy/writing_plugins + binoc-row-reorder model plugin | split |
| docs/howto/publish-a-plugin.md | legacy/writing_plugins (packaging + entry points) | split |
| docs/howto/test-a-plugin-with-vectors.md | plugin_test_vector_harness + test_vector_materialization ADRs | compose |
| docs/howto/cut-a-release.md | legacy/release.md + release_surface_and_automated_publishing ADR | rename |
| docs/howto/contribute-to-binoc.md | AGENTS.md + legacy/tutorial dev-setup section + README development | compose |
Reference (stable shape; mostly generated)
| File | Sources | Tag |
|---|---|---|
| docs/reference/cli.md | clap_markdown emitter in binoc-cli | generated |
| docs/reference/python.md | mkdocstrings against binoc-python | generated |
| docs/reference/sdk.md | one-page link into the cargo doc subpath at /sdk/ | stub |
| docs/reference/changeset-schema.md | schemars-derived schema (schema feature on binoc-sdk, rendered via scripts/build_changeset_schema_page.py) | generated |
| docs/reference/dataset-config.md | config keys scattered across legacy/tutorial + ADRs | new |
| docs/reference/plugin-discovery.md | legacy/writing_plugins entry-point spec + plugin_discovery ADR | compose |
Explanation (the architectural narrative; ADRs remain the long-form record)
| File | Sources | Tag |
|---|---|---|
| docs/explanation/architecture.md | README "Why" + Workspace Layout + AGENTS.md key rules + the most-cross-referenced ADRs | new (entry point) |
| docs/explanation/why-binoc-exists.md | README "Why It Exists" + m×n×o framing from the architecture-visuals plan | new |
| docs/explanation/vocabulary.md | terminology ADR | lift |
| docs/explanation/plugin-model.md | plugin_sdk_and_abi + stdlib_boundary + plugin_discovery ADRs | compose |
| docs/explanation/ir-and-changesets.md | full_comparison_tree_and_content_hashes + transient_fields_on_wire + opportunistic_itemref_metadata ADRs | compose |
| docs/explanation/artifacts-and-composition.md | published_artifacts_for_cross_plugin_composition + transformer_composition_and_artifact_flow ADRs | compose |
| docs/explanation/dispatch-model.md | transformer_dispatch_refinement + transformer_scope_yagni + media_type_detection ADRs | compose |
| docs/explanation/significance-classification.md | renderer_config ADR + terminology clerical/substantive section | compose |
| docs/explanation/extract-and-provenance.md | provenance_and_extract + cross_phase_data_cache ADRs | compose |
| docs/explanation/test-vectors.md | snapshot_testing_for_test_vectors + test_vector_materialization + test_vector_defaults_and_plugin_vectors + plugin_test_vector_harness ADRs | compose |
| docs/explanation/test-vectors-gallery.md | test-vectors/*/manifest.toml + committed snapshot trees | generated |
| docs/explanation/security-and-trust.md | security_posture_and_auditing ADR | lift |
docs/adr/* and docs/adr/README.md stay where they are; the site nav
exposes them as Explanation → Architectural decisions. Compose
pages link to ADR sources rather than restating rationale (ARID over
DRY); the ADRs remain the canonical record.
The README was slimmed alongside the migration to a one-paragraph
product description, install line, link to the docs site, and link to
the ADR index. Its previous architectural and quick-start content moved
into docs/index.md, the tutorial, and the explanation set.
Platform + nav + CI
mkdocs.yml lives at the repo root. The nav follows the audience
sections (Users / Plugin Developers / Core Developers) with Diátaxis
modes (How-to / Reference / Explanation) nested inside each, plus a
top-level Tutorial, Examples, and Start here. The include-markdown,
mkdocstrings, pymdownx.superfences (with mermaid), and admonition
extensions are enabled; --strict mode catches broken links and orphan
pages.
The justfile exposes docs-serve, docs-build, docs-tutorial,
docs-cli, docs-adr-index, docs-schema, docs-vectors, docs-sdk,
and docs-plugins, aggregated by just docs. Each generator declares
its inputs so re-running is a no-op when nothing changed.
CI in .github/workflows/docs.yml runs just docs (with a
git diff --exit-code -- docs/ drift check) and just docs-build on
every PR; on main, it additionally uploads site/ as a Pages artifact
and publishes via actions/deploy-pages. GitHub Pages is configured
with Source: GitHub Actions.
Open Questions
- Changeset JSON schema source. ~~Hand-written schema vs. generated from serde_json types via schemars.~~ Resolved: generated from the Rust IR types via schemars, gated behind an opt-in schema feature on binoc-sdk so downstream users of the SDK don't pay for a dependency they don't need. just docs-schema invokes the gen-changeset-schema binary (Rust) to emit docs/reference/changeset-schema.json, then runs scripts/build_changeset_schema_page.py to render docs/reference/changeset-schema.md with a table per type. Rejected alternatives:
  - Hand-written schema. Low dependency cost but requires manual sync every time the IR changes, which is the exact failure mode the Diátaxis Reference quadrant is supposed to rule out.
  - schemars as a [dev-dependencies] entry with #[cfg(test)] derives. Doesn't compose with #[derive] on type definitions outside tests/, and routing emission through a test that writes files is a code smell.
  - Separate binoc-schema-gen crate with wrapper types. Duplicates the field list, defeating the point of generation.
- Plugin pages owned by external repos. When a plugin lives in its own repo (a real possibility for binoc-sqlite and future plugins), should its docs be linked, mirrored, or absent from this site? Tentative default: linked from a "Plugin index" page; mirroring is too much coupling.
- Search beyond the corpus. Material's built-in lunr-based search is good. If the corpus grows to need ranked relevance or vector search, revisit (Algolia DocSearch, Meilisearch).
- Test-vector gallery. Resolved for the shared workspace vectors: scripts/build_test_vector_gallery.py renders a manifest-first gallery from test-vectors/*/manifest.toml plus committed snapshot layouts into docs/explanation/test-vectors-gallery.md, and just docs-vectors participates in just docs. Inline rendered diffs remain a follow-up if manifests-only proves insufficient.