Polyglot Text Extraction: A Survey for binoc's Fallback Projection
Research / survey document — 2026-06-15
Status: survey / research input, not a decision record. This document presents tradeoffs and a recommended phased path to inform a later design decision. It does not commit binoc to any tool.
1. Why binoc needs this
binoc generates changelogs/diffs for datasets by comparing snapshots. Today, when a file has a native model plugin (SQLite, Parquet, Excel, DBF, Avro, HTML, statistical binaries, …) binoc produces a rich semantic diff. When a file has no native parser, the diff degrades to an opaque "the bytes changed" signal, which is not human-readable.
We want a polyglot fallback: a two-tier degradation so that every file produces some human-readable projection of its content for diffing.
- Text-extraction tier. If the file is a recognizable document/format that isn't natively modeled, fall back to extracting its text (PDF → text, HTML → text, DOCX → text, spreadsheet → cell text, …) so the diff shows readable content changes.
- strings(1) tier. For truly unreadable / unknown bytes, fall back to a strings(1)-style extraction of printable runs so that something shows up in the diff (embedded labels, paths, version strings, etc.).
This survey covers "dump the text from anything" libraries so we can choose what to build on for each tier.
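The two-tier degradation described above can be sketched as a small dispatch function. This is a minimal illustration, not binoc's actual API: the `Tier` and `Detected` types, the `detect` and `choose_tier` functions, and the magic-byte checks are all hypothetical names and assumptions introduced here.

```rust
/// Which layer produced the human-readable projection (illustrative names).
#[derive(Debug, PartialEq)]
enum Tier {
    /// A native model plugin produced a semantic diff.
    Native,
    /// Tier 1 fallback: text extracted from a recognizable document format.
    TextExtraction,
    /// Tier 2 fallback: strings(1)-style printable runs.
    Strings,
}

/// Hypothetical result of format sniffing.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Detected {
    /// A format that has a native model plugin (SQLite is used as the example).
    Modeled,
    /// A recognizable document format without a native model (PDF as the example).
    Document,
    /// Nothing recognizable.
    Unknown,
}

/// Assumption: sniffing by leading magic bytes only, with two example formats.
fn detect(bytes: &[u8]) -> Detected {
    if bytes.starts_with(b"SQLite format 3\0") {
        Detected::Modeled
    } else if bytes.starts_with(b"%PDF-") {
        Detected::Document
    } else {
        Detected::Unknown
    }
}

/// Pick the richest tier available, degrading one step at a time.
fn choose_tier(bytes: &[u8]) -> Tier {
    match detect(bytes) {
        Detected::Modeled => Tier::Native,
        Detected::Document => Tier::TextExtraction,
        Detected::Unknown => Tier::Strings,
    }
}
```

The point of the shape is that every input maps to some tier, so no file is left with only an opaque "bytes changed" signal.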
2. binoc's constraints (the lens for everything below)
These constraints are what make this a non-obvious choice; a tool that is great for RAG ingestion can be a poor fit here.
- Rust-first workspace. binoc is a Rust workspace (binoc-core, binoc-sdk, binoc-stdlib, Rust model-plugins under model-plugins/). A pure-Rust crate has the lowest integration friction.
- Python plugins are viable but cost packaging. binoc ships a PyO3 layer (binoc-python/) and the public install path is pip install binoc / uvx binoc. So a Python extractor can live as a plugin, but it drags Python runtime + transitive dependency weight into a tool that is otherwise a self-contained CLI/library.
- Determinism is mandatory. binoc diffs the extractor's output. If the same input bytes can yield different text (reordered blocks, nondeterministic OCR, floating-point layout heuristics, model-version drift), the extractor will manufacture phantom diffs. This is the single most important axis below.
- Version-pinnable / reproducible. The extraction toolchain must pin to an exact version so a re-run months later yields byte-identical text. Tools that auto-download models at runtime (whose weights can change) are a reproducibility hazard.
- Additive over a byte-hash equality oracle. Extraction is never the source of truth for "did the file change"; a content hash already answers that. Extraction is only a human-readable projection layered on top. This lowers the bar (we don't need perfect extraction) but it does not excuse non-determinism: a projection that reshuffles itself run-to-run still produces noisy, untrustworthy diffs.
- Low build/deploy friction preferred. A JVM, a GraalVM build step, a Pandoc binary, a Tesseract install, or a multi-hundred-MB model download are each a real cost for a CLI/library that today installs with a single pip/uvx.
- License must be MIT/Apache-2.0 compatible. This rules out copyleft-for-the-extractor and source-available licenses (AGPL, SSPL, Elastic License 2.0) for anything we link or vendor.
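The "additive over a byte-hash equality oracle" constraint can be illustrated with a short sketch: the hash decides whether anything changed, and the projection is only computed to describe a change that the hash has already established. Everything here is an assumption for illustration: `fnv1a64` is a dependency-free stand-in for whatever content hash binoc actually uses, and `Change` and `compare` are hypothetical names.

```rust
/// FNV-1a 64-bit. Stand-in only: a real content oracle would use a
/// cryptographic hash, since FNV offers no collision resistance.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

#[derive(Debug, PartialEq)]
enum Change {
    /// The oracle says the bytes are equal; no projection is computed.
    Unchanged,
    /// The oracle says the bytes differ; the texts only describe the change.
    Changed { old_text: String, new_text: String },
}

/// The hash is the source of truth; `project` is the human-readable lens.
fn compare<F>(old: &[u8], new: &[u8], project: F) -> Change
where
    F: Fn(&[u8]) -> String,
{
    if fnv1a64(old) == fnv1a64(new) {
        return Change::Unchanged;
    }
    Change::Changed {
        old_text: project(old),
        new_text: project(new),
    }
}
```

Note that a nondeterministic `project` cannot turn an unchanged file into a reported change under this shape, but it can still make the text of a real change noisy, which is why determinism remains a requirement.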
3. Comparison table
| Tool | Lang / runtime | Formats | Native vs delegated (system deps) | Build / deploy cost | Determinism | Metadata | OCR | License | Maintenance (as of 2026-06) |
|---|---|---|---|---|---|---|---|---|---|
| Apache Tika | JVM (Java) | 1000+ | Delegates to POI, PDFBox, etc. Needs JVM; OCR needs Tesseract | High (JVM at runtime) | Mostly deterministic for text; OCR/auto-detect can vary | Yes (rich) | Yes (Tesseract) | Apache-2.0 | Very active; 3.3.x stable, 4.0 line in progress |
| Extractous | Rust core + FFI | ~most Tika formats | Compiles Tika to native libs via GraalVM at build time → no JVM at runtime; OCR via Tesseract | High build (GraalVM AOT), moderate deploy | Inherits Tika text behavior; OCR nondeterminism if used | Yes | Yes (Tesseract) | Apache-2.0 | Slower cadence: latest 0.3.0 (Dec 2024), ~1.8k stars |
| Kreuzberg | Rust core, 14 lang bindings | 96 | Native Rust extractors; OCR optional (Tesseract / PaddleOCR / EasyOCR / VLM); ONNX for embeddings | Moderate (OCR/ONNX optional) | Text deterministic-ish; OCR/VLM paths nondeterministic | Yes | Yes (multiple) | Elastic License 2.0 ⚠️ | Very active; v5 RC (Jun 2026), ~8.5k stars |
| unstructured | Python | Many (PDF, HTML, Office, …) | Heavy Python deps; layout models; OCR (Tesseract) | High (deps + models) | Low (ML layout/partition heuristics) | Yes (elements) | Yes | Apache-2.0 core, but AGPL/LGPL transitive deps ⚠️ | Active |
| Docling (IBM) | Python | PDF, Office, images, … | Vision-language + layout models (Granite-Docling 258M), auto-downloaded; OCR | Very high (model downloads, optional GPU) | Low (model inference) | Yes (rich structure) | Yes | MIT (lib) / Apache-2.0 (models) | Very active; ~30k+ stars |
| textract (Py) | Python | Many (shells out) | Delegates to antiword, pdftotext, Tesseract, … external binaries | Medium-high (many CLIs) | Depends on each sub-tool | Limited | Yes (via others) | MIT | Largely stale; textract-py3 fork also frozen |
| pdf-extract (Rust) | Rust | PDF (text) | Native (lopdf) | Low | Generally deterministic (single-threaded text) | Minimal | No | MIT/Apache-2.0 | Active (0.10.x) |
| lopdf / pdf (Rust) | Rust | PDF (low-level) | Native | Low | Deterministic | Yes (objects) | No | MIT (lopdf) | Active |
| html2text (Rust) | Rust | HTML | Native (html5ever) | Low | Deterministic | No | No | MIT | Active (0.16.x, 2026) |
| scraper (Rust) | Rust | HTML | Native (html5ever) | Low | Deterministic | No | No | MIT/ISC | Active |
| calamine (Rust) | Rust | XLS/XLSX/ODS/… | Native | Low | Deterministic | Some | No | MIT/Apache-2.0 | Active |
| quick-xml (Rust) | Rust | XML | Native | Low | Deterministic | No | No | MIT | Active |
| docx-rs (Rust) | Rust | DOCX (writer-centric) | Native | Low | Deterministic | Some | No | MIT/Apache-2.0 | Active (2026) |
| dotext (Rust) | Rust | DOCX/ODT/… (read) | Native | Low | Deterministic | No | No | MIT | Stale (~2017) |
| rust-strings | Rust | any bytes | Native | Very low | Fully deterministic | No | No | MIT | Active (0.6.x) |
4. Per-tool notes
Apache Tika: the reference
Apache Tika is the canonical "extract text + metadata
from over a thousand file types" toolkit. It is a JVM library that delegates to
format-specific parsers (Apache POI for Office, PDFBox for PDF, etc.) and can OCR via
Tesseract. Current stable is the 3.3.x line (3.2.3 released Sept 2025), with a 4.0 line
in progress; Tika 2.x and Java 8 reached EOL in April 2025
(formats list,
releases,
roadmap).
Tika is Apache-2.0 licensed and the gold standard for coverage. The cost for binoc is the
JVM at runtime, which is a heavy dependency for a pip-installed CLI. Tika's plain-text
extraction is largely deterministic, but its auto-detection and OCR paths introduce
variability. We are unlikely to embed Tika directly, but it matters because Extractous,
and historically much of this space, is Tika under the hood.
Extractous: Tika without the JVM
Extractous is a Rust crate (with Python bindings) that gets Tika's coverage without a runtime JVM by compiling Tika to native shared libraries via GraalVM ahead-of-time compilation at build time, then calling it over FFI (README, crates.io). It does text + metadata and OCR via Tesseract, and is Apache-2.0 — license-clean for binoc. This is the most attractive "broad coverage, Rust-native, permissive license" option on paper.
Caveats:
- Build complexity. The GraalVM AOT step is non-trivial; the build script installs a GraalVM JDK. This is a real CI/packaging cost and complicates reproducible, pinned builds.
- Maintenance cadence. Latest release is 0.3.0 (Dec 2024) with ~1.8k stars (lib.rs): healthy download numbers but a slower release pace than the alternatives. Worth confirming the project is still actively maintained before betting on it.
- Determinism inherits Tika's behavior; plain text should be stable for a pinned Tika/GraalVM build, but OCR (if enabled) is not. For binoc we'd keep OCR off.
Kreuzberg: survey carefully (license is the catch)
Kreuzberg is, technically, the most exciting option: a Rust core with SIMD + parallelism, bindings for ~14 languages, 96 formats across documents/office/images/web/email/archives/academic/code (306 languages via tree-sitter), sync and async APIs, optional OCR via Tesseract / PaddleOCR / EasyOCR / VLM, and claims of "10–100× faster than Python alternatives." It is very active — v5.0.0-rc (June 2026), ~8.5k stars (format support, lib.rs, v4 announcement).
However — the license is the catch. The current Kreuzberg is licensed under the Elastic License 2.0 (ELv2), a source-available license with commercial-use restrictions, not an OSI-approved open-source license (LICENSE). Search results suggest earlier versions were MIT and the project moved to ELv2 in the v4/v5 era — a license regression of the same kind that drove the Elasticsearch/OpenSearch fork (context).
For binoc this is close to disqualifying for anything we link or vendor, given the MIT/Apache-2.0 requirement. ELv2 forbids providing the software "as a managed service" and other uses; even if binoc's CLI use might be permissible, taking an ELv2 dependency into an Apache/MIT project is a licensing-hygiene problem and a supply-chain risk (the license can change again). Recommendation: do not depend on Kreuzberg unless/until it returns to a permissive license, or unless legal confirms ELv2 is acceptable for our use. We should re-check the license on each release because the older MIT lineage means the situation is genuinely in flux. (If invoked purely as a separate CLI subprocess the analysis differs, but that's a heavier integration than a crate and still inherits the license-volatility risk.)
unstructured.io: ML partitioning, AGPL transitive deps
unstructured is a Python ETL toolkit
that partitions documents into typed "elements" for LLM pipelines. The core is Apache-2.0,
but it pulls layout/detection models and transitive dependencies that include
AGPLv3+ / LGPL components (e.g. ultralytics)
(dependency-license issue).
Its partitioning is ML-heuristic, so output ordering/segmentation is not guaranteed
stable across versions or even runs — a poor fit for a diff oracle. Heavy deps + model
downloads + license tangle make this a weak fit for binoc.
Docling (IBM): model-driven, heaviest cost, lowest determinism
Docling is IBM Research's document-understanding toolkit; it uses vision-language + layout models (Granite-Docling 258M, Idefics3-based) that are auto-downloaded and run inference to reconstruct structure, tables, reading order, etc. (model card, IBM Research blog). The library is MIT and the models Apache-2.0, so license is fine — but everything else is the opposite of what binoc wants: large model downloads, optional GPU, and model-inference non-determinism plus model-version drift. Excellent for rich ingestion; wrong tool for a reproducible byte-diff projection.
textract (Python) and equivalents
textract is the classic "extract text from any
document, no muss no fuss" Python package, implemented by shelling out to external binaries
(antiword, pdftotext, Tesseract, …). It is MIT but largely stale; the
textract-py3 fork that fixed dependency pinning is itself now frozen and points users
back upstream (textract-py3). Determinism
varies per sub-tool, and the external-binary sprawl is a deploy headache. Not recommended
as a foundation. The actively-maintained "extract from anything" successors today are
Extractous, Kreuzberg, Docling, unstructured, and Microsoft's MarkItDown
(2025 benchmark).
Native Rust per-format crates: the deterministic, low-friction core
These are the building blocks for the text-extraction tier if we want maximum determinism and minimal deps:
- PDF: pdf-extract (built on lopdf) extracts text from PDFs natively, MIT/Apache, active (0.10.x). For lower-level control, lopdf and the pdf crate expose the object model. Newer entrants emphasize deterministic page ordering explicitly (e.g. unpdf documents an internal reorder buffer emitting pages in ascending order), a useful pattern if we parallelize (pdf-extract listing).
- HTML: html2text (HTML → plain text via html5ever, active, MIT, 0.16.x in 2026) or scraper for DOM traversal. Both deterministic. (Note: there is a same-named html2text Rust NIF for Elixir; the relevant crate is jugglerchris/rust-html2text.)
- Spreadsheets: calamine reads XLS/XLSX/ODS and yields cell values as text; it is deterministic and MIT/Apache. (binoc already has a native Excel model plugin, so this is more relevant for spreadsheet-shaped formats outside that plugin.)
- XML: quick-xml is fast, deterministic, and MIT.
- DOCX/Office: docx-rs is writer-centric but can read; dotext reads DOCX/ODT/… but is stale (~2017) and not recommended. DOCX/PPTX/XLSX are ZIP-of-XML, so quick-xml + the zip crate is a viable deterministic path we fully control.
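To make the "ZIP-of-XML" path concrete, here is a rough sketch of the text-recovery step for a WordprocessingML fragment (the kind of XML found in a DOCX container's word/document.xml). It is deliberately naive and dependency-free: a real implementation would unzip the container and use a proper XML parser such as quick-xml rather than this hand-rolled scanner. The function names `text_from_wordml` and `decode_entities` are hypothetical, and the scanner assumes no `>` appears inside attribute values.

```rust
/// Decode the five predefined XML entities (`&amp;` last, so an escaped
/// ampersand is not decoded twice).
fn decode_entities(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&apos;", "'")
        .replace("&amp;", "&")
}

/// Collect the contents of `<w:t>` runs in document order, emitting a
/// newline at the end of every paragraph (`</w:p>`).
fn text_from_wordml(xml: &str) -> String {
    let mut out = String::new();
    let mut in_text = false;
    let mut rest = xml;
    while let Some(open) = rest.find('<') {
        // Character data before the tag only counts inside a <w:t> run.
        if in_text {
            out.push_str(&decode_entities(&rest[..open]));
        }
        let after = &rest[open + 1..];
        let close = match after.find('>') {
            Some(i) => i,
            None => break, // unterminated tag: stop scanning
        };
        let tag = &after[..close];
        if tag == "w:t" || tag.starts_with("w:t ") {
            in_text = true;
        } else if tag == "/w:t" {
            in_text = false;
        } else if tag == "/w:p" {
            out.push('\n');
        }
        rest = &after[close + 1..];
    }
    out
}
```

Because the scan is a single left-to-right pass with no maps or threads, the same input always yields the same text, which is the property this tier needs.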
rust-strings: the strings(1) tier
rust-strings is a small, active, MIT Rust
library that extracts printable strings from arbitrary binary data, with configurable
minimum length, buffer size, and ASCII / UTF-16LE / UTF-16BE encodings — i.e. exactly the
strings(1)-equivalent for the unreadable-bytes tier (lib.rs).
It is fully deterministic (pure byte scan), has near-zero build/deploy cost, and is
license-clean. This is the obvious, low-risk choice for tier 2.
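For intuition about what this tier does, the core of a strings(1)-style scan fits in a few lines of std-only Rust. This is a hand-rolled sketch of the technique, not the rust-strings API: the function name `printable_runs` is hypothetical, only ASCII is handled (no UTF-16 variants), and treating tab as printable is a choice made here.

```rust
/// Return every run of at least `min_len` printable ASCII bytes,
/// in the order the runs appear in the input.
fn printable_runs(bytes: &[u8], min_len: usize) -> Vec<String> {
    let mut runs = Vec::new();
    let mut current = String::new();
    for &b in bytes {
        // Printable ASCII is space (0x20) through tilde (0x7e); tab is
        // also accepted here as an assumption of this sketch.
        if (0x20..=0x7e).contains(&b) || b == b'\t' {
            current.push(b as char);
        } else {
            // A non-printable byte ends the current run.
            if current.len() >= min_len {
                runs.push(std::mem::take(&mut current));
            } else {
                current.clear();
            }
        }
    }
    // Flush a run that reaches the end of the input.
    if current.len() >= min_len {
        runs.push(current);
    }
    runs
}
```

The scan is a pure function of the input bytes and `min_len`, so pinning the minimum length is the only configuration that affects the projection.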
5. Determinism: the key risk
Because binoc diffs the extractor's output, determinism is not a nice-to-have; a nondeterministic extractor actively manufactures false changes. Three distinct failure modes:
- Run-to-run nondeterminism (same binary, same input). Parallel block extraction without a reorder buffer, hash-map iteration order, thread races, or ML sampling can reorder or alter output between runs. This is the worst case: binoc would report diffs where the bytes are identical. Risk: high in unstructured, Docling, and any OCR/VLM path; mitigated in native Rust crates that emit in document order.
- Version-drift nondeterminism (different tool/model version, same input). Even if a single version is deterministic, an upgraded layout model, OCR engine, or Tika parser can change the text. binoc must therefore pin the extractor version and treat an extractor upgrade as a deliberate, reviewable event (potentially re-baselining snapshots). Risk: severe for model-download tools (Docling, unstructured), where the "version" includes downloaded weights that can change out from under a pinned package.
- Heuristic nondeterminism (reading-order / layout reconstruction). Tools that infer reading order from geometry (multi-column PDFs, tables) can flip block order on near-ties. This is deterministic for a fixed input+version but brittle: a one-pixel layout change in the source can cascade into a large text reordering, inflating the diff. Native text-stream extraction (e.g. pdf-extract reading the content stream) tends to be more stable than vision-based layout inference.
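The reorder-buffer mitigation mentioned for run-to-run nondeterminism can be sketched as follows: workers may finish pages in any order, but output is released strictly in ascending page order. This is a minimal illustration under assumed names (`ReorderBuffer`, `push`); it is not taken from any of the crates surveyed, and it assumes pages are numbered from 0 with no gaps.

```rust
use std::collections::BTreeMap;

/// Holds finished pages that arrived early and releases them in page order.
struct ReorderBuffer {
    /// Index of the next page that is allowed to be emitted.
    next: usize,
    /// Pages that finished before their predecessors, keyed by page index.
    pending: BTreeMap<usize, String>,
}

impl ReorderBuffer {
    fn new() -> Self {
        ReorderBuffer {
            next: 0,
            pending: BTreeMap::new(),
        }
    }

    /// Accept one finished page and return the pages that are now ready,
    /// in ascending page order (possibly none).
    fn push(&mut self, page: usize, text: String) -> Vec<String> {
        self.pending.insert(page, text);
        let mut ready = Vec::new();
        // Drain consecutive pages starting at `next`; stop at the first gap.
        while let Some(t) = self.pending.remove(&self.next) {
            ready.push(t);
            self.next += 1;
        }
        ready
    }
}
```

Whatever the thread scheduling, the concatenation of the returned batches is the same, so parallel extraction does not leak arrival order into the diffed text.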
Implication. The determinism ranking lines up almost perfectly against the coverage
ranking: the broadest tools (Docling, unstructured, OCR-heavy paths) are the least
deterministic, and the most deterministic tools (native Rust per-format crates,
rust-strings) are the narrowest. Since extraction is only an additive projection over a
byte-hash oracle, binoc can tolerate imperfect extraction — but it should not
tolerate nondeterministic extraction, because that defeats the projection's purpose.
Concretely: keep OCR off (it is the largest single source of nondeterminism), pin
versions hard, and prefer text-stream extraction over layout/model inference.
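One cheap way to enforce the "do not tolerate nondeterministic extraction" rule is a repeat-run check in tests or CI. The sketch below is an assumption about how such a check could look, with a hypothetical `check_deterministic` helper; it only catches run-to-run nondeterminism within one process and says nothing about version drift.

```rust
/// Run `extract` on the same input `runs` times. Returns the text if every
/// run produced byte-identical output, or the index of the first run that
/// disagreed with run 0.
fn check_deterministic<F>(input: &[u8], runs: usize, mut extract: F) -> Result<String, usize>
where
    F: FnMut(&[u8]) -> String,
{
    let first = extract(input);
    for i in 1..runs {
        if extract(input) != first {
            return Err(i);
        }
    }
    Ok(first)
}
```

A passing check is evidence, not proof: thread races may only show up under load, so the stronger guarantee still comes from choosing extractors that are deterministic by construction.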
6. Recommendation (phased, as input to a later decision)
A staged path that front-loads determinism and low friction, deferring broad coverage until it is justified:
Phase 1 — strings(1) tier now (lowest risk, highest leverage).
Adopt rust-strings for the unreadable-bytes
fallback. It is pure-Rust, fully deterministic, MIT, and trivially pinnable. This
immediately makes every file produce some readable projection and establishes the
"additive projection" plumbing in binoc-core/binoc-stdlib.
Phase 2 — native Rust text extraction for the highest-value formats.
Add deterministic, dependency-light extractors for the formats most worth reading:
- PDF via pdf-extract / lopdf (watch page-ordering
determinism; emit in page order).
- HTML via html2text / scraper.
- XML via quick-xml; spreadsheet-shaped data via
calamine; Office (DOCX/PPTX/XLSX) via
quick-xml + zip where a dedicated reader isn't trustworthy (avoid stale dotext).
These are MIT/Apache, deterministic, and add no system deps — the best fit for binoc's
constraints. Keep OCR out of scope.
Phase 3: broad coverage if and when the long tail justifies it. If demand for many exotic formats appears, evaluate a Tika-class engine:
- Extractous is the leading candidate on license grounds (Apache-2.0) and gives Tika coverage with no runtime JVM. The price is the GraalVM build-time AOT step (a real reproducible-build / CI cost) and a slower release cadence to monitor. Run it with OCR disabled and pin the exact Extractous + bundled-Tika version.
- Kreuzberg is technically the most capable Rust-core option but is currently Elastic License 2.0, which is not MIT/Apache-compatible. Do not take it as a linked/vendored dependency under that license. Re-check on each release: its MIT lineage means the license could move back, and if it does it becomes the strongest candidate.
- A Python-plugin route (Docling / unstructured via the PyO3 layer) is possible but fights nearly every constraint (model downloads, AGPL/LGPL transitive deps in unstructured, and inference non-determinism), so it should be a last resort, scoped to formats nothing else covers, and never on the determinism-critical path.
Cross-cutting guardrails (any phase). Pin every extractor version; treat extractor upgrades as reviewable, potentially re-baselining events; keep OCR disabled by default; prefer text-stream over layout/model inference; and remember the byte-hash oracle remains the source of truth — text extraction is only a human-readable lens, so when in doubt, favor the more deterministic tool over the more complete one.
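The "treat extractor upgrades as reviewable, potentially re-baselining events" guardrail could be supported by stamping each stored projection with the extractor that produced it. The sketch below is one possible shape, not a binoc design: `ExtractorStamp`, `Projection`, `Comparable`, and `comparable` are hypothetical names, and the name and version strings used with them are placeholders.

```rust
/// Identity of the extractor build that produced a projection.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ExtractorStamp {
    name: String,
    /// Exact pinned version, never a range.
    version: String,
}

/// A stored projection together with its provenance.
struct Projection {
    stamp: ExtractorStamp,
    text: String,
}

#[derive(Debug, PartialEq)]
enum Comparable {
    /// Same extractor and version: text differences reflect the data.
    Yes,
    /// Different extractor or version: text differences may be tool drift,
    /// so the old snapshot should be re-projected before diffing.
    RebaselineNeeded,
}

/// Decide whether two projections can be diffed directly.
fn comparable(old: &Projection, new: &Projection) -> Comparable {
    if old.stamp == new.stamp {
        Comparable::Yes
    } else {
        Comparable::RebaselineNeeded
    }
}
```

With a stamp in place, version drift shows up as an explicit re-baseline signal instead of being mixed into the content diff.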
Sources
- Apache Tika — repo: https://github.com/apache/tika · formats: https://tika.apache.org/3.0.0/formats.html · releases: https://github.com/apache/tika/releases · roadmap: https://cwiki.apache.org/confluence/display/TIKA/Tika+Roadmap+--+2.x,+3.x+and+Beyond
- Extractous — repo: https://github.com/yobix-ai/extractous · core README: https://github.com/yobix-ai/extractous/blob/main/extractous-core/README.md · crate: https://crates.io/crates/extractous · lib.rs: https://lib.rs/crates/extractous
- Kreuzberg — repo: https://github.com/kreuzberg-dev/kreuzberg · formats: https://docs.kreuzberg.dev/reference/formats/ · features: https://docs.kreuzberg.dev/features/ · lib.rs: https://lib.rs/crates/kreuzberg · v4 announcement: https://dev.to/t_ivanova/announcing-kreuzberg-v4-55ia · LICENSE: https://github.com/kreuzberg-dev/kreuzberg/blob/main/LICENSE
- Elastic License context — https://socket.dev/blog/developers-burned-by-elasticsearchs-license-change-arent-going-back · https://www.elastic.co/blog/licensing-change
- unstructured.io — repo: https://github.com/Unstructured-IO/unstructured · dependency-license issue: https://github.com/Unstructured-IO/unstructured/issues/3894
- Docling (IBM) — project: https://github.com/docling-project · model card: https://huggingface.co/ibm-granite/granite-docling-258M · IBM Research blog: https://research.ibm.com/blog/docling-generative-AI
- textract — repo: https://github.com/deanmalmgren/textract · py3 fork: https://github.com/KyleKing/textract-py3
- 2025 Python extraction benchmark — https://dev.to/nhirschfeld/i-benchmarked-4-python-text-extraction-libraries-2025-4e7j
- pdf-extract — https://docs.rs/pdf-extract · https://lib.rs/crates/pdf-extract · lopdf: https://github.com/J-F-Liu/lopdf
- html2text — https://github.com/jugglerchris/rust-html2text · https://lib.rs/crates/html2text
- scraper — https://crates.io/crates/scraper
- calamine — https://crates.io/crates/calamine
- quick-xml — https://crates.io/crates/quick-xml
- docx-rs — https://crates.io/crates/docx-rs · dotext: https://github.com/anvie/dotext
- rust-strings — https://crates.io/crates/rust-strings · https://lib.rs/crates/rust-strings