Document Processing
Mobai Document Processing is an on-premise batch pipeline that bootstraps a biometric reference face-image database from an institution's existing archive of KYC/AML identity documents. It is built for any organisation that holds KYC data — financial institutions, insurers, public-sector bodies, and other regulated institutions — and wants to adopt biometric authentication, clean up its identity-document archive, or both.
Rather than re-onboarding every customer to capture a face image, the service mines the passports, national ID cards, and residence permits already collected during onboarding, and selects the best-quality reference face image for each person — ready to seed face verification or biometric login.
In effect, institutions can passively enrol their existing customers into biometric authentication — building the reference database from records they already hold, with no action required from the customer.
Because it reads and validates every ID document along the way, the service also returns quality-control metadata about your KYC archive — surfacing which records are backed by low-quality scans, unreadable documents, or documents worth refreshing. This is useful even to institutions that aren't adopting biometrics yet but want a structured, validated view of an identity-document archive that has grown over years.
Customers for whom no usable reference image can be found are flagged, so the institution knows exactly who needs a KYC refresh and re-onboarding to reach full biometric coverage.
The service runs entirely inside the customer's own infrastructure as a Docker container. Source documents and extracted results never leave the customer's network.
What it does
Point the service at your document archive and, for every document it can read, it returns:
- a structured identity record — the key details from the document (name, date of birth, document and ID numbers, nationality, issuing country, dates, document type);
- the best available face image for that person, with a score indicating how suitable it is as a biometric reference; and
- a clear status for the record — including a flag when no usable face image could be obtained, so you know which customers need a KYC refresh and re-onboarding.
Archives are processed in bulk, and the service can be re-run as new documents arrive — only documents it hasn't already handled are picked up. Results are written to the destination you configure.
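The re-run behaviour can be pictured as a set difference between the files currently in the archive and the files already present in the results. The following is a minimal sketch under stated assumptions, not the service's actual implementation: the function name `select_unprocessed` and the file names are hypothetical, and it assumes a document is identified by its `file_name`.

```python
from typing import Iterable, List, Set


def select_unprocessed(archive_files: Iterable[str], processed: Set[str]) -> List[str]:
    """Return archive files that are not yet in `processed`.

    Duplicates are collapsed and the result is sorted so repeated runs
    see the pending files in a stable order.
    """
    return sorted(set(archive_files) - processed)


# Hypothetical file names, for illustration only.
archive = ["a_001.tif", "a_002.pdf", "a_003.png"]
already_done = {"a_001.tif"}

pending = select_unprocessed(archive, already_done)
print(pending)
```

Under this assumption, a second run over an unchanged archive would find nothing pending, which is what makes repeated runs safe.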
Sample output
For each document the service successfully processes, it produces a structured record that combines the extracted fields with the best face image found on the document. A representative record looks like:
{
  "file_name": "20240315_archive_001234.tif",
  "first_name": "Aasmund",
  "last_name": "Specimen",
  "birth_date": "1985-04-23",
  "gender": "M",
  "nationality": "NOR",
  "id_number": "23048512345",
  "document_type": "PASSPORT",
  "document_number": "CCC002251",
  "issuing_country": "NOR",
  "issue_date": "2020-01-01",
  "expiry_date": "2030-01-01",
  "face_image": "<base64-encoded image>",
  "face_image_quality_score": 0.84
}
Any field that could not be extracted from the source is returned as null, so partial results are still produced for documents where individual fields are unreadable.
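A consumer of these records will usually want to know which fields came back as null before relying on them. The sketch below shows one way to list them; it assumes the record arrives as JSON text with the field names from the sample, the helper name `missing_fields` is hypothetical, and the record values are illustrative.

```python
import json
from typing import Any, Dict, List

# Abbreviated record shaped like the sample; values are illustrative only.
raw = """{
  "file_name": "20240315_archive_001234.tif",
  "first_name": "Aasmund",
  "last_name": "Specimen",
  "birth_date": null,
  "document_type": "PASSPORT",
  "expiry_date": null
}"""


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Names of fields returned as null (None after JSON parsing)."""
    return [name for name, value in record.items() if value is None]


record = json.loads(raw)
print(missing_fields(record))
```

Checking for `None` rather than for falsy values keeps legitimately empty or zero-valued fields from being reported as missing.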
Best face per person
When the same person appears across multiple submissions in the archive — for example, an old passport scan followed by a later ID-card renewal — the service also surfaces a best-face view containing one entry per person. Each entry holds the single highest-quality face image found across all of that person's documents, along with a reference to the source document it came from:
{
  "id_number": "23048512345",
  "best_source_file": "20240315_archive_001234.tif",
  "document_type": "PASSPORT",
  "issuing_country": "NOR",
  "image_taken": "2020-01-01",
  "quality_score": 0.84,
  "face_image": "<base64-encoded image>"
}
This view is the primary output used by customers building a biometric reference database from their archive.
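The relationship between per-document records and the best-face view can be sketched as a group-and-maximise step. This is an illustration under assumptions, not the service's own logic: it assumes people are matched on `id_number`, that records without a face image or score are skipped, and that ties keep the first record seen. The function name `best_face_per_person` is hypothetical, and `image_taken` is left out because the source does not say how it is derived.

```python
from typing import Any, Dict, Iterable


def best_face_per_person(records: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep, per id_number, the record with the highest face quality score."""
    best: Dict[str, Dict[str, Any]] = {}
    for rec in records:
        person = rec.get("id_number")
        score = rec.get("face_image_quality_score")
        # Skip records that cannot contribute a reference image.
        if person is None or score is None or rec.get("face_image") is None:
            continue
        current = best.get(person)
        # Strict comparison: on a tie, the earlier record is kept.
        if current is None or score > current["quality_score"]:
            best[person] = {
                "id_number": person,
                "best_source_file": rec["file_name"],
                "document_type": rec.get("document_type"),
                "issuing_country": rec.get("issuing_country"),
                "quality_score": score,
                "face_image": rec["face_image"],
            }
    return best


# Illustrative records: an older ID-card scan, a passport, and an unreadable face.
records = [
    {"file_name": "old_id.tif", "id_number": "23048512345", "document_type": "ID_CARD",
     "issuing_country": "NOR", "face_image": "<base64>", "face_image_quality_score": 0.61},
    {"file_name": "20240315_archive_001234.tif", "id_number": "23048512345",
     "document_type": "PASSPORT", "issuing_country": "NOR",
     "face_image": "<base64>", "face_image_quality_score": 0.84},
    {"file_name": "blurred.png", "id_number": "23048512345", "document_type": "PASSPORT",
     "issuing_country": "NOR", "face_image": None, "face_image_quality_score": None},
]

view = best_face_per_person(records)
print(view["23048512345"]["best_source_file"])
```

In this sketch, a person whose records all lack a usable face image simply has no entry in the result, which corresponds to the customers flagged for a KYC refresh.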
Output formats
The same data is delivered in whichever format you configure: rows in a PostgreSQL database, a single CSV file on S3, or a CSV file on local disk. See Configuration for the available targets and how to choose between them.
Who it is for
Document Processing is designed for organisations that:
- Want to bootstrap a reference face-image database from existing archive material — for example, to enable biometric login or face verification for already-enrolled customers — without re-onboarding every user.
- Want quality-control insight into their KYC archive as a by-product, such as knowing which customers' documents are low-quality, unreadable, or otherwise in need of a KYC refresh, even if they aren't adopting biometrics yet.
- Require an air-gapped or on-premise deployment for regulatory, contractual, or data-sovereignty reasons.
Core capabilities
| Capability | What you get |
|---|---|
| Structured records | Every readable document parsed into a consistent, validated record of the key identity fields. |
| Best reference face image | The highest-quality face image per person, with a quality score indicating its suitability for biometric use. |
| KYC quality signals | A clear flag for customers without a usable reference image, plus indicators that help you spot records due for a refresh. |
| Broad coverage | Passports, national ID cards, residence permits and more, across roughly 190 countries. |
| Flexible output | Results delivered to a database or CSV, in whichever format suits your pipeline. |
| Built for large archives | Processes archives in bulk and can be safely re-run as new documents arrive. |
| Stays on your infrastructure | Runs fully on-premise; documents and results never leave your network. |
Delivery model
Document Processing is delivered as a Docker image that runs on the customer's own GPU-equipped hardware. There is no SaaS endpoint, no Mobai-hosted API call, and no outbound data transfer required for processing. The deployment topology is typically:
- Document Processing container (this service) — reads from the document archive, performs analysis, writes results.
- Database — a PostgreSQL instance (existing or new) for structured results.
- Object storage (optional) — an S3 bucket for source documents and for archived CSV outputs.
A reference docker-compose configuration is provided that brings up the service together with a self-contained PostgreSQL instance and pgAdmin for evaluation purposes.
Supported documents
Document Processing detects and reads the identity documents most commonly held in KYC/AML archives. The document detector classifies each page, and the pipeline applies document-specific business logic — tailored field extraction, validity rules, and face-image handling — to:
- Passports — standard passports, foreigner passports, and travel documents.
- National identity cards.
- Residence permit cards.
- Driver's licences — name and date extraction from the licence layout.
- Bank cards with an identity section — including the printed national identity number.
The pipeline is built to work across nationalities. The Machine-Readable Zone (MRZ) is read according to the ICAO 9303 standard for any issuing state, issuing country and nationality are recognised against a reference set of roughly 190 countries, and the OCR is multilingual.
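For readers unfamiliar with the Machine-Readable Zone, ICAO 9303 protects fields such as the document number and dates with a check digit: each character is mapped to a number (digits as is, `A` to `Z` as 10 to 35, the filler `<` as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. The sketch below implements that general rule; it is not taken from the service, and the function names are illustrative. The example value is the document number from the ICAO specimen passport, not from the sample record in this document.

```python
MRZ_WEIGHTS = (7, 3, 1)


def mrz_char_value(ch: str) -> int:
    """Numeric value of one MRZ character: 0-9 as is, A-Z as 10-35, '<' as 0."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum of the character values, modulo 10."""
    total = sum(mrz_char_value(ch) * MRZ_WEIGHTS[i % 3] for i, ch in enumerate(field))
    return total % 10


# Document number from the ICAO 9303 specimen passport.
print(mrz_check_digit("L898902C3"))  # -> 6
```

A mismatch between the computed digit and the digit printed in the MRZ is a typical sign of a misread character or a poor scan.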
We continually extend support for additional document types and market-specific fields. If you need coverage for a particular document type or market, contact us and we'll scope it for your deployment.
Supported input formats
Document Processing reads the common file formats used to store archived documents: multi-page PDF and TIFF files, as well as PNG, JPEG, WebP, BMP, and other common image formats. If your archive uses a less common format, contact us and we'll confirm whether it's supported.
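Before a first run, it can help to see which archive files fall outside the formats named here, so they can be raised for confirmation. The sketch below is a hypothetical pre-flight check, not part of the service: the extension list is an assumed mapping from the named formats to their usual file extensions, and a file in the second group is not necessarily unsupported.

```python
from pathlib import PurePath
from typing import Dict, Iterable, List

# Usual extensions for the formats named in this section (an assumption).
NAMED_FORMAT_EXTENSIONS = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def split_by_named_format(file_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group files into explicitly named formats and formats to confirm."""
    result: Dict[str, List[str]] = {"named": [], "to_confirm": []}
    for name in file_names:
        # Compare case-insensitively so ".TIF" and ".tif" are treated alike.
        suffix = PurePath(name).suffix.lower()
        key = "named" if suffix in NAMED_FORMAT_EXTENSIONS else "to_confirm"
        result[key].append(name)
    return result


# Hypothetical file names, for illustration only.
groups = split_by_named_format(["scan_01.TIF", "scan_02.heic", "scan_03.pdf"])
print(groups)
```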