
Frequently Asked Questions

Does any document data leave our network?

No. Document Processing runs entirely on your infrastructure as a Docker container. The service does not call any Mobai-hosted API at runtime and does not transmit document content, extracted fields, or face images outside your network. The only outbound traffic goes to the S3 bucket and PostgreSQL instance that you configure.

What document formats are supported?

The service supports the common file formats used to store archived documents: multi-page PDF and TIFF files (both embedded-text and scanned), plus PNG, JPEG, WEBP, BMP, and other common image formats. If your archive uses a less common format, contact us and we'll confirm whether it's supported.

Do we need a GPU?

Yes. The OCR, document detection, face detection, and face-quality models all run on the GPU. A CPU-only deployment is not supported — throughput would be many orders of magnitude lower. See Deployment → System requirements for sizing.

How fast is processing?

On a reference workstation (RTX 4080 Super, Ryzen 9 7900, 32 GB RAM) the pipeline processes ~1000 documents (mixed PDF + multi-page TIFF) in about 83 minutes with 4 parallel workers. Throughput scales near-linearly with workers up to the GPU's saturation point. See Deployment → Sizing and benchmarks.
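To turn the reference figures above into a rough planning number, the arithmetic can be sketched as follows. This assumes hardware and a document mix similar to the reference workstation; the 50,000-document archive is purely illustrative and `estimated_hours` is a hypothetical helper, not part of the product.

```python
# Figures quoted for the reference workstation (4 parallel workers).
DOCUMENTS = 1000
MINUTES = 83

docs_per_minute = DOCUMENTS / MINUTES
# Wall-clock seconds per document, with all workers combined.
seconds_per_document = MINUTES * 60 / DOCUMENTS


def estimated_hours(archive_size: int) -> float:
    """Rough wall-clock estimate at the reference rate (an assumption, not a benchmark)."""
    return archive_size / docs_per_minute / 60


print(f"{docs_per_minute:.1f} documents/minute")        # 12.0 documents/minute
print(f"{seconds_per_document:.2f} s per document")     # 4.98 s per document
print(f"{estimated_hours(50_000):.1f} h for 50,000 documents")  # 69.2 h
```

Treat the result as an order-of-magnitude guide; actual throughput depends on your GPU, worker count, and page counts per document.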

Can we run the service on a schedule rather than continuously?

Yes. The container exits cleanly when the work queue is drained, which makes it a natural fit for cron, Kubernetes CronJob, AWS Batch, or any job scheduler. Combined with SKIP_PROCESSED_FILES=true, scheduled runs will pick up new documents as they appear in the archive and ignore everything already processed.
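As an illustration of a scheduled run, a job wrapper might assemble the container invocation like this. This is a minimal sketch: the image name is hypothetical, the environment values are examples, and it assumes that a clean exit is reported as exit code 0. Only standard `docker run` flags are used.

```python
import shlex


def build_run_command(image: str, env: dict[str, str]) -> list[str]:
    """Assemble a `docker run` invocation for one scheduled batch run."""
    # --rm removes the container after it exits; --gpus all exposes the GPUs.
    cmd = ["docker", "run", "--rm", "--gpus", "all"]
    for key, value in sorted(env.items()):
        cmd += ["-e", f"{key}={value}"]
    cmd.append(image)
    return cmd


cmd = build_run_command(
    "registry.example.com/document-processing:latest",  # hypothetical image name
    {"SKIP_PROCESSED_FILES": "true", "DATA_TARGET": "db"},
)
print(shlex.join(cmd))
# A scheduler would then execute it, for example with
# subprocess.run(cmd, check=True), and let the container exit on its own.
```

The same command line can be placed directly in a crontab entry or a Kubernetes CronJob container spec; the Python wrapper is only one way to express it.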

What happens if a run is interrupted halfway through?

Nothing is lost. Records already written to the target are durable; the next run with SKIP_PROCESSED_FILES=true will skip everything that has already been processed and continue from where the previous run left off. See Operations → Resumable runs.
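Conceptually, resuming amounts to filtering the archive against what the target already holds. The sketch below illustrates that idea only; how the service actually tracks processed files is not described here, and keying by file path is an assumption.

```python
def pending_files(archive: list[str], processed: set[str]) -> list[str]:
    """Return archive entries with no record in the target yet, in archive order."""
    return [path for path in archive if path not in processed]


archive = ["a.pdf", "b.tiff", "c.pdf", "d.pdf"]
processed = {"a.pdf", "b.tiff"}  # records written before the interruption
print(pending_files(archive, processed))  # ['c.pdf', 'd.pdf']
```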

Can we run multiple workers in parallel?

Yes for DATA_TARGET=db. The PostgreSQL target supports concurrent worker writes safely.

No for DATA_TARGET=s3 or DATA_TARGET=local. CSV append is not concurrency-safe; use WORKERS=1 for these targets. If you need parallel processing, use the db target.
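The rule above can be checked before a run starts. The following is a minimal sketch of such a pre-flight check; `check_workers` is a hypothetical helper written for illustration, not something the service is documented to provide.

```python
CONCURRENT_TARGETS = {"db"}              # PostgreSQL: safe for parallel writes
SINGLE_WORKER_TARGETS = {"s3", "local"}  # CSV append: one writer only


def check_workers(data_target: str, workers: int) -> None:
    """Raise ValueError if the DATA_TARGET / WORKERS combination is unsafe."""
    if workers < 1:
        raise ValueError("WORKERS must be at least 1")
    if data_target not in CONCURRENT_TARGETS | SINGLE_WORKER_TARGETS:
        raise ValueError(f"unknown DATA_TARGET: {data_target!r}")
    if data_target in SINGLE_WORKER_TARGETS and workers != 1:
        raise ValueError(f"DATA_TARGET={data_target} requires WORKERS=1")


check_workers("db", 4)      # accepted
check_workers("local", 1)   # accepted
try:
    check_workers("s3", 4)  # rejected
except ValueError as err:
    print(err)              # DATA_TARGET=s3 requires WORKERS=1
```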

How is the "best face image" selected?

For each person (identified by their national ID number), the service scores every detected face with a face image quality metric that follows ISO/IEC 29794-5, then retains the highest-scoring image as the reference.
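The selection step described above is a "keep the maximum per key" operation. The sketch below shows that logic with made-up records; the `Face` type, its field names, and the score scale are assumptions for illustration, and the quality scores themselves would come from the face-quality model.

```python
from typing import NamedTuple


class Face(NamedTuple):
    national_id: str
    image_ref: str
    quality: float  # higher is better; the scale here is an assumption


def best_faces(faces: list[Face]) -> dict[str, Face]:
    """Keep the highest-quality face per national ID (first one wins on ties)."""
    best: dict[str, Face] = {}
    for face in faces:
        current = best.get(face.national_id)
        if current is None or face.quality > current.quality:
            best[face.national_id] = face
    return best


faces = [
    Face("ID-001", "passport_p1.png", 71.0),
    Face("ID-001", "id_card_front.png", 88.5),
    Face("ID-002", "licence.png", 64.2),
]
for national_id, face in best_faces(faces).items():
    print(national_id, face.image_ref, face.quality)
```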

Which document types are supported?

Document Processing reads the identity documents most commonly held in KYC/AML archives — passports, national ID cards, residence permit cards, driver's licences, and bank cards with an identity section. For each, the pipeline applies document-specific business logic for field extraction, validity checks, and face-image handling. See Supported documents for details.

Do you support documents from other nationalities?

Yes. The pipeline works across nationalities: the Machine-Readable Zone is read to the ICAO 9303 standard for any issuing state, issuing-country and nationality are recognised against a reference set of roughly 190 countries, and the OCR is multi-lingual.
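Part of what makes the Machine-Readable Zone portable across issuing states is its check-digit scheme, which ICAO 9303 defines independently of country: characters are weighted 7, 3, 1 repeating, with digits taken at face value, letters A to Z as 10 to 35, and the filler `<` as 0. The sketch below implements that published rule; it illustrates the standard and is not the product's own code. The sample fields are from the widely reproduced ICAO specimen passport.

```python
def mrz_char_value(ch: str) -> int:
    """Numeric value of one MRZ character under ICAO 9303."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum (weights 7, 3, 1 repeating) modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


print(mrz_check_digit("L898902C3"))  # document number -> 6
print(mrz_check_digit("740812"))     # date of birth   -> 2
print(mrz_check_digit("120415"))     # date of expiry  -> 9
```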

We continually extend support for additional document types and market-specific fields. If you need coverage for a particular document type or market, contact us — we'll scope it for your deployment.