Document Processing Changelog

Current release: 3.1.0 (February 2026)

3.1.0 — February 2026

Production hardening on the 3.x line.

Added

Best reference image per person. Cross-document ranking selects the single highest-quality face image for each person, which supports building a reference face database.
Deduplication. The same document appearing across multiple inputs is collapsed into a single record.
Resumable runs. Long batch runs can be safely interrupted and resumed — already-processed documents are skipped on the next run.
Provenance tracking. Each extracted field is traceable back to the document region it came from, supporting audit and explainability.
AWS RDS IAM authentication. Optional short-lived IAM tokens for password-less database connections.
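
As an illustration of how deduplication and resumable runs can fit together, the sketch below keys each document by a hash of its content and records finished keys in a ledger file. It is a minimal sketch under stated assumptions, not the service's actual mechanism: the names content_key and run_batch, the JSON ledger file, and the use of SHA-256 are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def content_key(data: bytes) -> str:
    """Return a stable key derived from the document bytes (assumed: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


def run_batch(documents, ledger_path, process):
    """Process each (name, bytes) pair at most once.

    Keys of finished documents are stored in a hypothetical JSON ledger, so a
    later run skips both duplicates and documents completed earlier.
    """
    ledger = Path(ledger_path)
    done = set(json.loads(ledger.read_text())) if ledger.exists() else set()
    processed = []
    for name, data in documents:
        key = content_key(data)
        if key in done:
            continue  # duplicate content, or already handled in a previous run
        process(name, data)
        done.add(key)
        processed.append(name)
        # Persist after every document so an interruption repeats at most one item.
        ledger.write_text(json.dumps(sorted(done)))
    return processed
```

Calling run_batch a second time with the same ledger path returns only the names that were not yet recorded, which is the skip-on-resume behaviour described above.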

Changed

More consistent normalisation of document types, country codes, and dates across the many formats seen in real archives.
More robust reading of the Machine-Readable Zone on low-quality scans, via a multi-stage fallback strategy.
More reliable recovery of the national identity number from document text, with check-digit validation.
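
To make check-digit validation concrete, the sketch below computes the check digit used in Machine-Readable Zone fields under ICAO Doc 9303 (weights 7, 3, 1 repeated, sum taken modulo 10). This changelog does not say which scheme applies to the national identity number, so treat this as an illustration of the general idea only. The function names are hypothetical.

```python
WEIGHTS = (7, 3, 1)  # repeating weights defined by ICAO Doc 9303


def mrz_char_value(ch: str) -> int:
    """Map an MRZ character to its numeric value: 0-9, A-Z as 10-35, '<' as 0."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Compute the check digit for an MRZ field."""
    total = sum(mrz_char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field))
    return total % 10


def field_is_valid(field: str, check: str) -> bool:
    """Return True when the recorded check character matches the computed digit."""
    return check in "0123456789" and len(check) == 1 and int(check) == mrz_check_digit(field)
```

A field that fails this comparison is a natural trigger for a fallback reading attempt, since a mismatch usually means at least one character was misread.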

A major rewrite centred on a new in-house document-detection model.

Added

In-house document detection. A new detection model reliably identifies the document type and the relevant regions (face, Machine-Readable Zone) on each page, significantly improving extraction accuracy on difficult scans.
Fully offline operation. All models are bundled in the container image — no runtime downloads, suitable for air-gapped deployments.
Structured JSON logging. Log output is emitted as JSON, ready for ingestion into Datadog, Splunk, ELK, or CloudWatch.
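
For readers unfamiliar with structured logging, the sketch below shows one common way to emit one JSON object per log line using Python's standard logging module. The field names, the document_id attribute, and the build_logger helper are assumptions made for illustration. The service's actual log schema is not described in this changelog.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical contextual field, passed via logging's `extra` argument.
        document_id = getattr(record, "document_id", None)
        if document_id is not None:
            payload["document_id"] = document_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str = "docproc", stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Because each line is a self-contained JSON object, log collectors can parse it without custom patterns.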

Changed

Re-architected extraction engine, enabling broader document-type and field coverage.
Improved OCR accuracy and speed on cropped ID documents.

A breaking change that moved the service from a file-only pipeline to database- and storage-backed processing.

Added

Database persistence. Extracted results can be written to a database, in addition to CSV.
Configurable sources and targets. Read documents from local disk or S3, and write results to a database, S3 CSV, or local CSV — all configured through environment variables.
Parallel, GPU-accelerated processing tuned for very large archives (hundreds of thousands of documents).
Improved record ranking and selection across a person's documents.
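
The environment-variable configuration of sources and targets described above could be read along the following lines. This is a sketch only: the variable names DOC_SOURCE and DOC_TARGET, the accepted values, and the defaults are hypothetical. The Configuration page remains the reference for the real settings.

```python
import os
from dataclasses import dataclass

# Hypothetical value sets mirroring the options named in this release.
SOURCES = {"local", "s3"}
TARGETS = {"database", "s3_csv", "local_csv"}


@dataclass(frozen=True)
class PipelineConfig:
    source: str
    target: str


def load_config(env=None) -> PipelineConfig:
    """Build a config from environment variables, rejecting unknown values early."""
    env = os.environ if env is None else env
    source = env.get("DOC_SOURCE", "local").lower()
    target = env.get("DOC_TARGET", "local_csv").lower()
    if source not in SOURCES:
        raise ValueError(f"unsupported source: {source!r}")
    if target not in TARGETS:
        raise ValueError(f"unsupported target: {target!r}")
    return PipelineConfig(source=source, target=target)
```

Validating at start-up means a mistyped value fails immediately rather than partway through a long batch run.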

Changed

Breaking: results are now written to a database or object storage rather than to local files only, and the service is configured through environment variables. Review the Configuration page when upgrading from 1.x.

Initial release.

Added

Field extraction from scanned ID documents, combining document text and Machine-Readable Zone data.
Face image quality assessment and selection of the best reference image per document.
Support for scanned TIFF archives and identity-verification report PDFs.
Structured error reporting for documents that could not be processed.