Document Processing Changelog

Current release: 3.1.0 (February 2026)

3.1.0 — February 2026

Production hardening on the 3.x line.

Added
  • Best reference image per person. Cross-document ranking selects the single best-quality face image for each person, supporting the reference-face-database use case.
  • Deduplication. The same document appearing across multiple inputs is collapsed into a single record.
  • Resumable runs. Long batch runs can be safely interrupted and resumed — already-processed documents are skipped on the next run.
  • Provenance tracking. Each extracted field is traceable back to the document region it came from, supporting audit and explainability.
  • AWS RDS IAM authentication. Optional short-lived IAM tokens for password-less database connections.
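
The deduplication and resumable-run behaviour listed above can be illustrated with a minimal sketch. It assumes documents are identified by a SHA-256 hash of their bytes and that progress is kept in a small JSON state file; the changelog does not say how the service does either, and the names content_key, load_done, and run_batch are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_key(data: bytes) -> str:
    """Identify a document by the SHA-256 of its bytes (an assumed dedup key)."""
    return hashlib.sha256(data).hexdigest()


def load_done(state_file: Path) -> set[str]:
    """Read the keys recorded by earlier runs; empty if no state exists yet."""
    if not state_file.exists():
        return set()
    return set(json.loads(state_file.read_text()))


def run_batch(docs: dict[str, bytes], state_file: Path) -> list[str]:
    """Process each distinct, not-yet-seen document; return the names processed."""
    done = load_done(state_file)
    processed = []
    for name, data in docs.items():
        key = content_key(data)
        if key in done:
            continue  # duplicate input, or already handled by an earlier run
        # Hypothetical placeholder: real field extraction would happen here.
        done.add(key)
        processed.append(name)
        # Persist after every document so an interruption loses no progress.
        state_file.write_text(json.dumps(sorted(done)))
    return processed
```

Calling run_batch a second time with the same state file returns an empty list, which is the "skipped on the next run" behaviour in miniature. Writing the state after every document trades some I/O for a smaller window of lost work.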
Changed
  • More consistent normalisation of document types, country codes, and dates across the many formats seen in real archives.
  • More robust reading of the Machine-Readable Zone on low-quality scans, via a multi-stage fallback strategy.
  • More reliable recovery of the national identity number from document text, with check-digit validation.
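
As an illustration of check-digit validation, the sketch below implements the ICAO Doc 9303 check digit used in the Machine-Readable Zone. This is an assumption for the sake of example: the changelog does not say which scheme is applied to the national identity number, and national numbering schemes differ. The function names are hypothetical.

```python
def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value under ICAO Doc 9303."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10  # A=10 ... Z=35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, taken modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


def mrz_field_is_valid(field: str, check_char: str) -> bool:
    """Compare the computed check digit with the one read from the scan."""
    return check_char.isdigit() and mrz_check_digit(field) == int(check_char)


# Values from the ICAO 9303 specimen document.
print(mrz_check_digit("L898902C3"))  # 6
print(mrz_check_digit("740812"))     # 2
```

A failed check on an OCR result is a useful signal that a character was misread, for example a letter O read where the digit 0 was printed.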

3.0.0 — December 2025

A major rewrite centred on a new in-house document-detection model.

Added
  • In-house document detection. A new detection model reliably identifies the document type and the relevant regions (face, Machine-Readable Zone) on each page, significantly improving extraction accuracy on difficult scans.
  • Fully offline operation. All models are bundled in the container image — no runtime downloads, suitable for air-gapped deployments.
  • Structured JSON logging, ready to ingest into Datadog, Splunk, ELK, or CloudWatch.
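
For readers unfamiliar with structured logging, the following sketch shows one JSON object per log line using only the Python standard library. The field names (timestamp, level, logger, message) and the helper names are assumptions chosen for illustration; the service's actual log schema is not described in this changelog.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback as a string so it stays on one line.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers set earlier
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    build_logger("docproc").info("processed %s", "a.tif")
```

Because every line is a self-contained JSON object, log collectors can index the fields directly without custom parsing rules.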
Changed
  • Re-architected extraction engine, enabling broader document-type and field coverage.
  • Improved OCR accuracy and speed on cropped ID documents.

2.0.0 — 2025

A breaking change that moved the service from a file-only pipeline to database- and storage-backed processing.

Added
  • Database persistence. Extracted results can be written to a database, in addition to CSV.
  • Configurable sources and targets. Read documents from local disk or S3, and write results to a database, S3 CSV, or local CSV — all configured through environment variables.
  • Parallel, GPU-accelerated processing tuned for very large archives (hundreds of thousands of documents).
  • Improved record ranking and selection across a person's documents.
Changed
  • Breaking: results can now be written to a database or object storage, not only to local files, and the service is now configured through environment variables. Review the Configuration page when upgrading from 1.x.

1.0.0 — 2024

Initial release.

Added
  • Field extraction from scanned ID documents, combining document text and Machine-Readable Zone data.
  • Face image quality assessment and selection of the best reference image per document.
  • Support for scanned TIFF archives and identity-verification report PDFs.
  • Structured error reporting for documents that could not be processed.