
Documentation Platform and Information Design

Date: 2026-04-17 Status: Proposed

Context

Binoc's documentation has grown organically and now includes:

  • A user-facing README.md doing double duty as marketing landing page and architectural overview.
  • An auto-regenerated docs/tutorial.md (Showboat-verified, regenerated by just docs — see 2026-03-06-tutorial_regeneration_lifecycle.md).
  • A long-form docs/writing_plugins.md that mixes step-by-step instruction, task recipes, and reference material in one file.
  • A release runbook (docs/release.md).
  • 28 ADRs under docs/adr/ covering rationale and rejected alternatives.
  • Worked examples (docs/examples/fasta-demo/).
  • Implicit documentation in test-vectors/ — each vector demonstrates a capability and ships a manifest.toml describing it.
  • An AGENTS.md at the repo root encoding the project's architectural rules for both human and AI contributors.

There is no docs site. Everything is read on GitHub. There is no rendered view of the ADR cross-references, no search across the corpus, no auto-generated reference for the Python API, the Rust SDK, the CLI, or the changeset JSON schema. The published packages — binoc, binoc-sqlite, and binoc-sdk (see 2026-04-08-release_surface_and_automated_publishing.md) — have no documentation home outside their PyPI / crates.io pages.

The information design is also implicitly mode-mixed. writing_plugins.md opens by teaching, then transitions into a concept reference, then descends into per-method API tables. New contributors and plugin authors have to construct the mental model themselves from the ADR backlog, which is also where most of the architectural reasoning lives.

This ADR commits to a documentation platform and a content-organization discipline before that backlog grows further.

Decision

1. Platform: MkDocs with the Material theme

mkdocs-material is the documentation generator. Reasons specific to this repo:

  • Toolchain alignment. The project is Rust + Python + uv + just. MkDocs is pure Python, runs through uv without a separate install step (for example, uvx --with mkdocs-material mkdocs), and adds no Node.js dependency. just docs-serve and just docs-build slot in next to the existing just docs.
  • Markdown in, markdown out. Every existing docs/*.md file becomes a page with no rewriting. The Showboat-regenerated docs/tutorial.md is consumed unchanged on each build — no coupling between the two pipelines.
  • First-class GitHub Pages deploy. mkdocs gh-deploy and the standard actions/deploy-pages workflow are both well-supported.
  • Mermaid, admonitions, code-tab UI built in. pymdownx.superfences plus the Material defaults give us inline architecture diagrams, "Note / Warning / See also" callouts, and tabbed code samples without bespoke authoring tools. This is the substrate the architecture-visuals proposal will sit on top of.
  • Used by adjacent projects in the Python data ecosystem (FastAPI, Pydantic, Typer, uv, mkdocstrings itself), which keeps both the contributor pool and the LLM training data familiar with our patterns.

mkdocs.yml lives at the repo root. The site is served from docs/ unchanged; the only required addition is a single-line docs/index.md that includes README.md via mkdocs-include-markdown-plugin so the landing page is not a duplicate.

2. Information design: Diátaxis with a binoc-specific mapping

The site is organized along Daniele Procida's Diátaxis quadrants. The two axes are learning vs. doing and concrete vs. abstract; mixing modes on a single page is the single most reliable way to produce bad docs.

| Mode | Binoc home | Status |
| --- | --- | --- |
| Tutorial (learning, concrete) | docs/tutorial.md | Keep as-is; regenerated via Showboat. |
| How-to (doing, concrete) | docs/howto/*.md | New. Short, task-titled recipes. |
| Reference (doing, abstract) | docs/reference/{cli,python,changeset,sdk}.md | New, mostly auto-generated. |
| Explanation (learning, abstract) | docs/explanation/architecture.md plus docs/adr/*.md | New top-level overview; existing ADRs become navigable. |

The existing docs/writing_plugins.md is mode-mixed and is split:

  • docs/howto/write-a-python-comparator.md, write-a-rust-comparator.md, write-a-transformer.md, write-a-renderer.md — each a copy-pasteable recipe for one task, ending with a working plugin.
  • docs/reference/python.md and docs/reference/sdk.md — exhaustive API surface, auto-generated.
  • docs/explanation/plugin-architecture.md — what a plugin is, why the three-axis split exists, when to choose Python vs. Rust, what artifacts buy you. Links into the relevant ADRs rather than repeating them.

3. The audience map drives entry points and cross-linking, not the file tree

Each page declares the audience it is primarily written for in its frontmatter. Some pages may also list secondary audiences, because a well-cut explanation or reference page often serves more than one kind of reader. The five recurring audiences are:

  • Data steward / archivist. Lands on a how-to. "Diff two snapshots of a federal dataset and produce a CHANGELOG."
  • Pipeline integrator. Lands on the changeset JSON reference. Cares about schema stability and exit codes.
  • Domain-format plugin author (Python or Rust). Lands on a how-to, reaches into the SDK reference and the plugin-architecture explanation.
  • Core contributor. Lands on AGENTS.md, the architecture explanation, and the ADR index.
  • AI agent / LLM-driven workflow. A real audience now. The site's structure should let an agent route a user's query to the single correct page rather than synthesizing from many. This is a primary reason to enforce the Diátaxis split.

The top-level navigation in mkdocs.yml is organized by Diátaxis mode (Tutorial / How-to / Reference / Explanation), not by audience. Audience metadata is used for three lighter-weight routing aids instead:

  • a single Start here page organized by role
  • section-index guidance ("if you are X, start with Y")
  • per-page introductory copy and related links

This keeps the global structure stable and task-shaped while still giving first-time readers a role-based way in.
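
As a rough illustration of how audience metadata could feed the Start here page, the sketch below groups pages by their primary audience. The frontmatter keys (`title`, `audience`), the role identifiers, and both function names are assumptions made for this example; the ADR does not fix a frontmatter schema.

```python
from collections import defaultdict


def read_front_matter(text: str) -> dict[str, str]:
    """Parse a leading '---' delimited block of simple 'key: value' lines.

    Assumption: frontmatter is flat; a real build would likely use a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def start_here(pages: dict[str, str]) -> str:
    """Render a role-based index from a mapping of page path to page text."""
    by_role: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for path, text in sorted(pages.items()):
        meta = read_front_matter(text)
        role = meta.get("audience")  # hypothetical key for the primary audience
        if role:
            by_role[role].append((meta.get("title", path), path))
    out = ["# Start here", ""]
    for role in sorted(by_role):
        out.append(f"## {role}")
        out.append("")
        out.extend(f"- [{title}]({path})" for title, path in by_role[role])
        out.append("")
    return "\n".join(out)
```

Pages without an `audience` key are simply left off the page, which matches the intent that audience data is a routing aid rather than a second taxonomy.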

4. Reference is generated, not written

Reference pages decay if hand-written. Three generators:

  • Python API: mkdocstrings[python] reads docstrings from the binoc package and renders docs/reference/python.md at site-build time. This is the canonical API surface for binoc-python.
  • Rust SDK: cargo doc --no-deps --package binoc-sdk is built separately and published under /sdk/ on the same site. The MkDocs build copies the generated HTML into the output tree as a static subpath. This is the only Rust crate published to crates.io; it is also the only one that gets a dedicated reference site. Internal crates (binoc-core, binoc-stdlib, etc.) do not get hosted reference docs to discourage external dependencies on unstable surfaces, consistent with 2026-04-08-release_surface_and_automated_publishing.md.
  • CLI: a small Rust binary in binoc-cli emits Markdown for every subcommand and option (using clap_markdown or equivalent), producing docs/reference/cli.md. The just docs recipe runs this generator alongside Showboat. CLI reference is a build artifact, not an authored file.

The fourth reference page — the changeset JSON schema — is also generated, from a JSON Schema emitted by binoc-sdk via a schemars derive gated behind an opt-in schema cargo feature (see Open Question 1). This page is the contract for pipeline integrators. It needs to be stable and exhaustive, which is exactly what generation gives.
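
To make "a table per type" concrete, here is a minimal sketch of rendering a JSON Schema document into Markdown tables. It is not the project's rendering script; the function names are hypothetical, and it assumes type definitions sit under `$defs` or `definitions` with ordinary `properties` and `required` keys.

```python
def describe_type(spec) -> str:
    """Give a short human-readable type name for one property schema."""
    if not isinstance(spec, dict):  # JSON Schema allows bare true/false schemas
        return "any"
    if "$ref" in spec:
        return spec["$ref"].rsplit("/", 1)[-1]
    kind = spec.get("type", "any")
    if isinstance(kind, list):  # e.g. ["string", "null"] for optional fields
        return " or ".join(kind)
    if kind == "array":
        return f"array of {describe_type(spec.get('items', {}))}"
    return kind


def render_schema_tables(schema: dict) -> str:
    """Emit one Markdown table per named type definition in the schema."""
    # Assumption: definitions live under "$defs" (newer drafts) or "definitions".
    defs = schema.get("$defs") or schema.get("definitions") or {}
    out: list[str] = []
    for name in sorted(defs):
        body = defs[name]
        required = set(body.get("required", []))
        out += [f"## {name}", "", "| Field | Type | Required |", "| --- | --- | --- |"]
        for field, spec in body.get("properties", {}).items():
            flag = "yes" if field in required else "no"
            out.append(f"| {field} | {describe_type(spec)} | {flag} |")
        out.append("")
    return "\n".join(out)
```

Because the page is derived entirely from the emitted schema, a change to the IR types shows up in the reference on the next build without anyone editing prose.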

5. ADRs are first-class explanation content

The 28 existing ADRs are already the most thorough explanation layer in the project. Treating them as an internal-only backlog, as the previous design did, underserves them. They become a top-level section of the site (Explanation → Architectural Decisions), with docs/adr/README.md auto-extended at build time by a small script that reads each ADR's front matter (Date, Status) and produces the index entry. The current hand-maintained index is a candidate for replacement here.
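
A sketch of what such an index script might look like follows. It assumes each ADR opens with a title line and carries `Date:` and `Status:` fields somewhere in its text (on one line or two); the function names and the entry format are illustrative, not the project's actual script.

```python
import re

DATE_RE = re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})")
STATUS_RE = re.compile(r"Status:\s*([A-Za-z]+)")


def index_entry(filename: str, text: str) -> str:
    """Build one index line from an ADR's title, Date, and Status."""
    # Assumption: the first non-empty line is the title, optionally a '#' heading.
    title = next(
        (line.lstrip("# ").strip() for line in text.splitlines() if line.strip()),
        filename,
    )
    date = DATE_RE.search(text)
    status = STATUS_RE.search(text)
    return "- [{}]({}) ({}, {})".format(
        title,
        filename,
        date.group(1) if date else "undated",
        status.group(1) if status else "unknown",
    )


def build_index(adrs: dict[str, str]) -> str:
    """Render entries sorted by filename; date-prefixed names sort chronologically."""
    return "\n".join(index_entry(name, adrs[name]) for name in sorted(adrs))
```

Falling back to "undated" and "unknown" rather than raising keeps the index build from failing on an older ADR that predates the front matter convention.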

ADR cross-references already use relative markdown links and continue to work unchanged. New rule: when an ADR is canonical for a concept, the long-form explanation page links to it rather than restating the rationale. ARID over DRY — short prerequisite restatements are fine; full parallel explanations are not.

A separate "Architecture overview" explanation page (docs/explanation/architecture.md) is the single entry point to the architecture story, and it is the natural home for the diagrammatic visuals proposed elsewhere (see the project's architecture-visuals plan, tracked outside this repository). That overview links into the ADR set; the ADRs are the long-form record.

6. The docs build has multiple regeneratable upstreams; just orchestrates them

The docs site is a consumer of generated markdown, not a participant in its generation. Several upstreams produce input files, all coordinated by just with cache-aware recipes so a clean rebuild is cheap and a no-op rebuild is free:

  • Showboat regenerates docs/tutorial.md (and, in time, executable blocks in how-tos) by re-running embedded shell. Boundary set by 2026-03-06-tutorial_regeneration_lifecycle.md. The default authoring path for runnable code samples; reach for custom generators only when Showboat is genuinely insufficient.
  • CLI markdown is emitted from binoc-cli into docs/reference/cli.md.
  • Python API is rendered into the site at MkDocs build time by mkdocstrings, sourced from binoc-python docstrings.
  • Rust SDK reference is built by cargo doc --no-deps --package binoc-sdk and copied into the site under /sdk/.
  • ADR index (docs/adr/README.md) is regenerated from the front matter of docs/adr/*.md.
  • Test-vector gallery is emitted from shared workspace manifests and committed snapshot layouts into docs/explanation/test-vectors-gallery.md.

Each upstream is a just recipe (just docs-tutorial, just docs-cli, just docs-sdk, just docs-adr-index, just docs-vectors) with explicit input dependencies, fronted by an aggregating just docs that runs only what's stale. The MkDocs build itself (just docs-build, which runs mkdocs build --strict) is a separate recipe that depends on just docs and never invokes a generator directly.
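
One way a recipe could decide whether its output is stale is a modification-time comparison between the generated file and its declared inputs. The helper below is a minimal sketch of that idea; the function name is hypothetical, and the real recipes may use content hashes or another mechanism instead.

```python
from pathlib import Path


def is_stale(output: Path, inputs: list[Path]) -> bool:
    """Return True when the output is missing or older than any input."""
    if not output.exists():
        return True
    built = output.stat().st_mtime
    # Any input modified after the last build means the generator must rerun.
    return any(src.stat().st_mtime > built for src in inputs)
```

A generator wrapper would call this with its declared inputs and exit early when it returns False, which is what makes a repeated just docs a no-op.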

CI runs just docs && just docs-build on every PR (PR fails on broken links or stale generated files); the main-branch workflow additionally deploys the site. --strict is non-negotiable: given how heavily the ADRs cross-reference each other, broken-link CI is the single most valuable guardrail the platform adds.
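
To show the class of error the broken-link guardrail catches, here is a small stand-alone sketch that resolves relative Markdown links against a set of known pages. It only illustrates the check; it is not how MkDocs implements validation, and the function name and link regex are simplifications chosen for this example.

```python
import posixpath
import re

# Matches inline links of the form [text](target); reference-style links are ignored.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (page, target) pairs whose relative target is not a known page."""
    broken = []
    for page, text in sorted(pages.items()):
        for target in LINK_RE.findall(text):
            # External URLs and same-page anchors are out of scope here.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            path = target.split("#", 1)[0]
            resolved = posixpath.normpath(
                posixpath.join(posixpath.dirname(page), path)
            )
            if resolved not in pages:
                broken.append((page, target))
    return broken
```

An ADR that links to a renamed or mistyped sibling file would show up in the returned list, which is the feedback ADR authors get from the strict build in CI.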

7. Authoring conventions

These are project-specific norms that Diátaxis does not address:

  • Mermaid is the default for inline diagrams. Hand-authored SVG only when mermaid is genuinely insufficient (animations, custom layouts). Material's pymdownx.superfences renders mermaid natively.
  • Site-level "in active design" banner. Nothing in binoc is stable yet, so per-page stability badges would be uniformly "experimental" and add noise. Instead, the site renders a single compact banner on every page: "Binoc is in a collaborative design phase. The CLI is ready to use; internals are unstable and expected to change. Feedback and collaboration welcome: [link]." This frames the project as malleable rather than unsafe and points contributors at the input channels. Per-page stable/experimental badges are deferred until at least one surface is genuinely stable (likely the changeset JSON schema first, then binoc-sdk).
  • Every page is page one. Most readers arrive from search, not the navigation. Each page begins with what it's for and who it's for, and links to its prerequisites inline rather than assuming the reader has read upstream pages.
  • Task-oriented titles in How-to. "Diff a zip of CSVs against a SQLite database" beats "The SQLite plugin." How-to titles are written for Google.
  • Code samples are runnable. Where possible, how-tos cite snippets from test-vectors/ or docs/examples/ rather than embedding hand-written code that drifts. The Showboat pattern (executable blocks) may be extended to how-tos in a follow-up. Caveat: test-vectors/ ships source trees, not built artifacts (see 2026-04-16-test_vector_materialization.md), so how-tos that demo a vector point at test-vectors-materialized/… and ask the reader to run just materialize first. This is a known cost of keeping opaque binaries out of source control; revisit if it becomes an onboarding friction point.
  • One primary audience per page declared in frontmatter, with optional secondary audiences when the page genuinely serves multiple roles. Audience data is used for routing cues, not as a second site taxonomy.

8. Versioning: latest only, for now

Each published package versions independently, so a unified docs version would be a fiction. We ship latest only — what's on main — and revisit when a real user hits a version-skew problem. The Status line and Date already carried by every ADR provide adequate within-corpus versioning.

9. Deployment

GitHub Pages via the workflow-based deployment path: the docs workflow builds site/ with just docs-build, uploads it with actions/upload-pages-artifact, and a gated deploy job publishes it with actions/deploy-pages. Site URL: https://harvard-lil.github.io/binoc/. PR builds run just docs plus a git diff --exit-code -- docs/ staleness check and just docs-build for link/lint validation, but do not deploy. No gh-pages branch and no local deploy command — the workflow is the only path to production, so contributors can't accidentally publish from a laptop. The existing release.md runbook gains a one-line note that docs deploy is automatic on push to main and is independent of package releases.

Alternatives Considered

mdBook. Excellent for Rust-only projects shipped via cargo. A poor fit here because binoc's primary user-facing distribution is pip install binoc, the bulk of the ecosystem (entry points, plugin discovery, the CLI bridge) is Python, and mdBook has no story for mkdocstrings-equivalent Python API generation. We would still need a second generator for the Python surface, which defeats the unification.

Sphinx + Furo + MyST. The default for "serious" Python docs and strong on cross-reference tooling (intersphinx). Rejected because the project is markdown-native and adopting Sphinx would force MyST or RST on every existing file. The depth of cross-referencing Sphinx provides is overkill for a project whose main reference surfaces are CLI, a small Python API, and a Rust SDK with its own native generator.

Docusaurus / Starlight. Strong sites in the JS ecosystem. Both introduce a Node.js toolchain and a JSX/MDX authoring substrate that the project does not otherwise need, and both are more product-marketing oriented than this project's content actually warrants. Reconsider only if a marketing landing page becomes a separate need from the documentation site.

Read the Docs hosting. A reasonable host for any of the above. The project already deploys plenty of artifacts via GitHub Actions (publish.yml, soon docs.yml) so adding RTD is a second deploy target with no clear advantage. GitHub Pages is sufficient.

Skip the platform; keep reading on GitHub. Tempting because the markdown is already there. Rejected because: there is no search across the corpus; ADR cross-references are not visualized; no place to host generated reference; no link validation in CI; and pipeline integrators have nowhere to find the changeset JSON schema. The project has outgrown read-on-GitHub.

Diátaxis as a soft suggestion rather than enforced structure. Tried implicitly already (the existing writing_plugins.md is the artifact). Mode-mixing produced a single 570-line page that serves nobody's primary need well. Enforcing the four directories is cheap discipline that prevents the next 570-line file.

One unified Rust API site (all crates). Hosting cargo doc for the entire workspace. Rejected per 2026-04-08-release_surface_and_automated_publishing.md: publishing reference for unpublished crates encourages external dependencies on unstable internals. Only binoc-sdk gets a hosted reference page.

Consequences

  • A real docs URL. https://harvard-lil.github.io/binoc/ becomes the canonical reference, taking pressure off the README to do everything.
  • The README slims down. It stays a marketing landing page and a pointer to the docs site. Architectural overview moves to docs/explanation/architecture.md.
  • docs/writing_plugins.md is split into four how-to recipes, one reference page (or two — Python vs. SDK), and one explanation page. This is a real authoring task, not a redirect.
  • CLI, Python API, and changeset JSON schema all become generated reference. New code in binoc-cli and binoc-python to emit the generators' inputs is the first non-prose work this ADR creates.
  • CI gains a docs build job that fails on broken links. ADR authors get immediate feedback on cross-reference typos.
  • The 28 ADRs become navigable, with search and a generated index. Their value as the project's reasoning corpus increases sharply when they stop being a list-of-files.
  • Plugin authors get a real reference, not a paragraph in a long guide. This is the highest-impact downstream effect: the binoc-sdk audience is precisely the audience most underserved by the current setup.
  • The architecture-visuals proposal has a substrate to land on: mermaid renders out of the box, and animations or interactives can be embedded as raw HTML in markdown.

Bootstrap: single migration pass to the Diátaxis layout

The migration ran as one pass rather than incremental restructuring of the existing prose: the old tutorial / writing_plugins / release pages were moved to a temporary docs/legacy/ holding directory, the new Diátaxis frame was scaffolded under docs/howto/, docs/reference/, and docs/explanation/, and content was distilled from the legacy sources into the new files. After the new pages landed, docs/legacy/ was removed. README, docs/adr/, and AGENTS.md were already mode-correct and stayed in place.

Page inventory and provenance

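Each page below was authored from the listed source(s). Tags: lift = move material with light editing; compose = synthesize across sources; split = extract one section from a larger legacy file; rename = move a legacy file to a new path with its content largely intact; stub = a one-page pointer to content hosted elsewhere; new = net-new authoring; generated = machine-emitted.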

Landing + tutorial

| File | Sources | Tag |
| --- | --- | --- |
| docs/index.md | README (lead, example, quick start) | compose |
| docs/tutorial.md | legacy/tutorial.md trimmed to actual tutorial scope (architecture sections move out) | compose; Showboat regen |

How-to (task-titled recipes; one focused job each)

| File | Sources | Tag |
| --- | --- | --- |
| docs/howto/diff-two-snapshots.md | README quick start + legacy/tutorial | compose |
| docs/howto/save-and-render-changesets.md | output_routing_and_cli_ux ADR + README | compose |
| docs/howto/extract-changed-data.md | provenance_and_extract ADR + README extract section | compose |
| docs/howto/install-and-use-plugins.md | README plugins section + plugin_discovery ADR | compose |
| docs/howto/write-a-python-comparator.md | legacy/writing_plugins (Python comparator) | split |
| docs/howto/write-a-python-transformer.md | legacy/writing_plugins (Python transformer) | split |
| docs/howto/write-a-python-renderer.md | legacy/writing_plugins + binoc-html model plugin | split |
| docs/howto/write-a-rust-comparator.md | legacy/writing_plugins (Rust) + binoc-sqlite model plugin | split |
| docs/howto/write-a-rust-transformer.md | legacy/writing_plugins + binoc-row-reorder model plugin | split |
| docs/howto/publish-a-plugin.md | legacy/writing_plugins (packaging + entry points) | split |
| docs/howto/test-a-plugin-with-vectors.md | plugin_test_vector_harness + test_vector_materialization ADRs | compose |
| docs/howto/cut-a-release.md | legacy/release.md + release_surface_and_automated_publishing ADR | rename |
| docs/howto/contribute-to-binoc.md | AGENTS.md + legacy/tutorial dev-setup section + README development | compose |

Reference (stable shape; mostly generated)

| File | Sources | Tag |
| --- | --- | --- |
| docs/reference/cli.md | clap_markdown emitter in binoc-cli | generated |
| docs/reference/python.md | mkdocstrings against binoc-python | generated |
| docs/reference/sdk.md | one-page link into the cargo doc subpath at /sdk/ | stub |
| docs/reference/changeset-schema.md | schemars-derived schema (schema feature on binoc-sdk, rendered via scripts/build_changeset_schema_page.py) | generated |
| docs/reference/dataset-config.md | config keys scattered across legacy/tutorial + ADRs | new |
| docs/reference/plugin-discovery.md | legacy/writing_plugins entry-point spec + plugin_discovery ADR | compose |

Explanation (the architectural narrative; ADRs remain the long-form record)

| File | Sources | Tag |
| --- | --- | --- |
| docs/explanation/architecture.md | README "Why" + Workspace Layout + AGENTS.md key rules + the most-cross-referenced ADRs | new (entry point) |
| docs/explanation/why-binoc-exists.md | README "Why It Exists" + m×n×o framing from the architecture-visuals plan | new |
| docs/explanation/vocabulary.md | terminology ADR | lift |
| docs/explanation/plugin-model.md | plugin_sdk_and_abi + stdlib_boundary + plugin_discovery ADRs | compose |
| docs/explanation/ir-and-changesets.md | full_comparison_tree_and_content_hashes + transient_fields_on_wire + opportunistic_itemref_metadata ADRs | compose |
| docs/explanation/artifacts-and-composition.md | published_artifacts_for_cross_plugin_composition + transformer_composition_and_artifact_flow ADRs | compose |
| docs/explanation/dispatch-model.md | transformer_dispatch_refinement + transformer_scope_yagni + media_type_detection ADRs | compose |
| docs/explanation/significance-classification.md | renderer_config ADR + terminology clerical/substantive section | compose |
| docs/explanation/extract-and-provenance.md | provenance_and_extract + cross_phase_data_cache ADRs | compose |
| docs/explanation/test-vectors.md | snapshot_testing_for_test_vectors + test_vector_materialization + test_vector_defaults_and_plugin_vectors + plugin_test_vector_harness ADRs | compose |
| docs/explanation/test-vectors-gallery.md | test-vectors/*/manifest.toml + committed snapshot trees | generated |
| docs/explanation/security-and-trust.md | security_posture_and_auditing ADR | lift |

docs/adr/* and docs/adr/README.md stay where they are; the site nav exposes them as Explanation → Architectural decisions. Compose pages link to ADR sources rather than restating rationale (ARID over DRY); the ADRs remain the canonical record.

The README was slimmed alongside the migration to a one-paragraph product description, install line, link to the docs site, and link to the ADR index. Its previous architectural and quick-start content moved into docs/index.md, the tutorial, and the explanation set.

Platform + nav + CI

mkdocs.yml lives at the repo root. The nav follows the audience sections (Users / Plugin Developers / Core Developers) with Diátaxis modes (How-to / Reference / Explanation) nested inside each, plus a top-level Tutorial, Examples, and Start here. The include-markdown, mkdocstrings, pymdownx.superfences (with mermaid), and admonition extensions are enabled; --strict mode catches broken links and orphan pages.

The justfile exposes docs-serve, docs-build, docs-tutorial, docs-cli, docs-adr-index, docs-schema, docs-vectors, docs-sdk, and docs-plugins, aggregated by just docs. Each generator declares its inputs so re-running is a no-op when nothing changed.

CI in .github/workflows/docs.yml runs just docs (with a git diff --exit-code -- docs/ drift check) and just docs-build on every PR; on main, it additionally uploads site/ as a Pages artifact and publishes via actions/deploy-pages. GitHub Pages is configured with Source: GitHub Actions.

Open Questions

  1. Changeset JSON schema source. ~~Hand-written schema vs. generated from serde_json types via schemars.~~ Resolved: generated from the Rust IR types via schemars, gated behind an opt-in schema feature on binoc-sdk so downstream users of the SDK don't pay for a dependency they don't need. just docs-schema invokes the gen-changeset-schema binary (Rust) to emit docs/reference/changeset-schema.json, then runs scripts/build_changeset_schema_page.py to render docs/reference/changeset-schema.md with a table per type. Rejected alternatives:
    • Hand-written schema. Low dependency cost but requires manual sync every time the IR changes — the exact failure mode the Diátaxis Reference quadrant is supposed to rule out.
    • schemars as a [dev-dependencies] entry with #[cfg(test)] derives. Doesn't compose with #[derive] on type definitions outside tests/, and routing emission through a test that writes files is a code smell.
    • Separate binoc-schema-gen crate with wrapper types. Duplicates the field list, defeating the point of generation.
  2. Plugin pages owned by external repos. When a plugin lives in its own repo (a real possibility for binoc-sqlite and future plugins), should its docs be linked, mirrored, or absent from this site? Tentative default: linked from a "Plugin index" page; mirroring is too much coupling.
  3. Search beyond the corpus. Material's built-in lunr-based search is sufficient for a corpus of this size. If the corpus grows to need ranked relevance or vector search, revisit (Algolia DocSearch, Meilisearch).
  4. Test-vector gallery. Resolved for the shared workspace vectors: scripts/build_test_vector_gallery.py renders a manifest-first gallery from test-vectors/*/manifest.toml plus committed snapshot layouts into docs/explanation/test-vectors-gallery.md, and just docs-vectors participates in just docs. Inline rendered diffs remain a follow-up if manifests-only proves insufficient.