Document Processing Changelog
Current release: 3.1.0 (February 2026)
3.1.0 — February 2026
Production hardening on the 3.x line.
Added
- Best reference image per person. Cross-document ranking selects the single best-quality face image for each person, supporting the reference-face-database use case.
- Deduplication. The same document appearing across multiple inputs is collapsed into a single record.
- Resumable runs. Long batch runs can be safely interrupted and resumed — already-processed documents are skipped on the next run.
- Provenance tracking. Each extracted field is traceable back to the document region it came from, supporting audit and explainability.
- AWS RDS IAM authentication. Optional short-lived IAM tokens for password-less database connections.
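The deduplication and resumable-run entries above can be illustrated with a minimal sketch. It assumes documents are identified by a SHA-256 hash of their bytes and that progress is kept in a small JSON checkpoint file. Both choices, and the names `content_key`, `load_done`, and `run_batch`, are hypothetical and are not taken from the service itself.

```python
import hashlib
import json
from pathlib import Path


def content_key(data: bytes) -> str:
    """Identify a document by the SHA-256 of its bytes (an assumed dedup key)."""
    return hashlib.sha256(data).hexdigest()


def load_done(checkpoint: Path) -> set[str]:
    """Read the keys recorded by earlier runs; empty if no checkpoint exists."""
    if not checkpoint.exists():
        return set()
    return set(json.loads(checkpoint.read_text()))


def run_batch(inputs: dict[str, bytes], checkpoint: Path) -> list[str]:
    """Process each distinct document once and return the names processed now."""
    done = load_done(checkpoint)
    processed = []
    for name, data in inputs.items():
        key = content_key(data)
        if key in done:
            continue  # duplicate input, or already handled by an earlier run
        # ... extraction would happen here (omitted in this sketch) ...
        done.add(key)
        processed.append(name)
        # Persist after every document so an interruption loses at most one.
        checkpoint.write_text(json.dumps(sorted(done)))
    return processed
```

Calling `run_batch` a second time with the same inputs and checkpoint returns an empty list, which is the behaviour the resumable-runs entry describes. A byte-level hash only catches exact copies; re-scans of the same physical document would need a different key.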
Changed
- More consistent normalisation of document types, country codes, and dates across the many formats seen in real archives.
- More robust reading of the Machine-Readable Zone on low-quality scans, via a multi-stage fallback strategy.
- More reliable recovery of the national identity number from document text, with check-digit validation.
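As an illustration of check-digit validation, the sketch below implements the check-digit scheme that ICAO Doc 9303 defines for Machine-Readable Zone fields (weights 7, 3, 1, sum modulo 10). The changelog does not say which scheme is applied to the national identity number, and such schemes are often country-specific, so treat this as an example of the general technique only. The function names are hypothetical.

```python
def mrz_char_value(ch: str) -> int:
    """Map an MRZ character to its numeric value per ICAO Doc 9303."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10  # A=10 ... Z=35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Compute the check digit using the repeating weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


def mrz_field_is_valid(field: str, check: str) -> bool:
    """Return True when the printed check digit matches the computed one."""
    return len(check) == 1 and "0" <= check <= "9" and mrz_check_digit(field) == int(check)
```

For the document number `L898902C3` from the ICAO specimen passport, `mrz_check_digit` returns 6, which matches the digit printed after that field in the specimen MRZ.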
3.0.0 — December 2025
A major rewrite centred on a new in-house document-detection model.
Added
- In-house document detection. A new detection model reliably identifies the document type and the relevant regions (face, Machine-Readable Zone) on each page, significantly improving extraction accuracy on difficult scans.
- Fully offline operation. All models are bundled in the container image — no runtime downloads, suitable for air-gapped deployments.
- Structured JSON logging, ready to ingest into Datadog, Splunk, ELK, or CloudWatch.
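To show what one-JSON-object-per-line logging typically looks like, here is a minimal sketch built on Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`, `document_id`) are assumptions chosen for illustration; the service's actual log schema is not described in this changelog.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical extra field, supplied via logger.info(..., extra={...}).
        document_id = getattr(record, "document_id", None)
        if document_id is not None:
            payload["document_id"] = document_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)
```

To use it, attach the formatter to a handler, for example `handler = logging.StreamHandler(); handler.setFormatter(JsonFormatter())`, and pass per-document context with `logger.info("processed", extra={"document_id": "doc-1"})`. Line-delimited JSON of this kind is what log shippers for the listed platforms generally expect.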
Changed
- Re-architected extraction engine, enabling broader document-type and field coverage.
- Improved OCR accuracy and speed on cropped ID documents.
2.0.0 — 2025
A breaking change that moved the service from a file-only pipeline to database- and storage-backed processing.
Added
- Database persistence. Extracted results can be written to a database, in addition to CSV.
- Configurable sources and targets. Read documents from local disk or S3, and write results to a database, S3 CSV, or local CSV — all configured through environment variables.
- Parallel, GPU-accelerated processing tuned for very large archives (hundreds of thousands of documents).
- Improved record ranking and selection across a person's documents.
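Configuration through environment variables, as described in the entries above, might be read along the following lines. The variable names `DOC_SOURCE` and `DOC_TARGET`, their accepted values, and the defaults are all hypothetical; the real names are documented on the Configuration page, not here.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed option names mirroring the sources and targets listed in the changelog.
SOURCES = {"local", "s3"}
TARGETS = {"database", "s3_csv", "local_csv"}


@dataclass(frozen=True)
class PipelineConfig:
    source: str
    target: str


def load_config(env: Mapping[str, str] = os.environ) -> PipelineConfig:
    """Build a validated config from environment variables (hypothetical names)."""
    source = env.get("DOC_SOURCE", "local").lower()
    target = env.get("DOC_TARGET", "local_csv").lower()
    if source not in SOURCES:
        raise ValueError(f"DOC_SOURCE must be one of {sorted(SOURCES)}, got {source!r}")
    if target not in TARGETS:
        raise ValueError(f"DOC_TARGET must be one of {sorted(TARGETS)}, got {target!r}")
    return PipelineConfig(source=source, target=target)
```

Accepting the environment as a parameter keeps the function easy to test: `load_config({"DOC_SOURCE": "s3", "DOC_TARGET": "database"})` builds a config without touching the real process environment.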
Changed
- Breaking: output is no longer limited to local files. Results can now be written to a database or object storage, and the service is configured through environment variables. Review the Configuration page when upgrading from 1.x.
1.0.0 — 2024
Initial release.
Added
- Field extraction from scanned ID documents, combining document text and Machine-Readable Zone data.
- Face image quality assessment and selection of the best reference image per document.
- Support for scanned TIFF archives and identity-verification report PDFs.
- Structured error reporting for documents that could not be processed.