Configuration
Document Processing is configured entirely through environment variables.
Source configuration
The source defines where input documents come from. Set DATA_SOURCE to either local or s3.
DATA_SOURCE=local
| Variable | Required | Description |
|---|---|---|
| DATA_SOURCE | yes | Set to local. |
| LOCAL_DIRECTORY_TO_PROCESS | yes | Absolute path inside the container to the root of the archive. The directory is scanned recursively for files with supported extensions (.pdf, .tif, .tiff). |
Mount the host archive into the container at the path you set in LOCAL_DIRECTORY_TO_PROCESS.
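The recursive scan described above can be pictured with a minimal sketch. The function name `find_documents` and the case-insensitive extension match are assumptions made for illustration, not the service's actual implementation.

```python
from pathlib import Path

# Extensions listed in the table above. Matching them case-insensitively
# is an assumption; the service may be stricter.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def find_documents(root: str) -> list[Path]:
    """Return every supported file under root, including subdirectories."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_documents(os.environ["LOCAL_DIRECTORY_TO_PROCESS"])` inside the container is a quick way to check that the mount exposes the files you expect.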
DATA_SOURCE=s3
| Variable | Required | Description |
|---|---|---|
| DATA_SOURCE | yes | Set to s3. |
| ACCESS_KEY_ID | yes | AWS access key with s3:ListBucket and s3:GetObject permissions on the source. |
| SECRET_ACCESS_KEY | yes | Matching AWS secret key. |
| REGION_NAME | yes | Bucket region (e.g. eu-north-1). |
| BUCKET_NAME | yes | Source bucket name. |
| S3_DIRECTORY_TO_PROCESS | yes | S3 prefix (folder) containing the archive. Trailing slash is added automatically if missing. |
Files are fetched via pre-signed URLs that are valid for 1 hour. If a download fails, the service generates a fresh URL when it retries.
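As a rough illustration of the prefix handling and the renew-on-retry behaviour, the sketch below requests a new URL before every attempt. The names `normalize_prefix`, `download_with_renewal`, `make_url`, and `fetch`, and the default of three attempts, are hypothetical; the service's real retry policy is not documented here.

```python
from typing import Callable


def normalize_prefix(prefix: str) -> str:
    """Append the trailing slash that S3_DIRECTORY_TO_PROCESS may omit."""
    return prefix if prefix.endswith("/") else prefix + "/"


def download_with_renewal(
    make_url: Callable[[], str],
    fetch: Callable[[str], bytes],
    attempts: int = 3,  # assumed value, for illustration only
) -> bytes:
    """Fetch a file, asking for a fresh pre-signed URL before every attempt."""
    last_error = None
    for _ in range(attempts):
        url = make_url()  # new URL each time, so an expired one is never reused
        try:
            return fetch(url)
        except OSError as error:  # urllib's URLError and HTTPError derive from OSError
            last_error = error
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

With boto3, `make_url` could wrap `s3_client.generate_presigned_url("get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=3600)`, where 3600 seconds matches the 1-hour validity mentioned above.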
Target configuration
The target defines where extracted results are written. Set DATA_TARGET to db, s3, or local.
DATA_TARGET=db
PostgreSQL target. Best choice for production: supports concurrent worker writes, enables fast querying, and is the only target that supports the full structured output.
| Variable | Required | Description |
|---|---|---|
| DATA_TARGET | yes | Set to db. |
| DB_HOST | yes | PostgreSQL host. |
| DB_PORT | no | Defaults to 5432. |
| DB_NAME | yes | Database name. |
| DB_USER | yes | Username for read/write access to the mobai schema. |
| DB_PASSWORD | conditional | Required for password authentication. Omit to use AWS RDS IAM authentication (see below). |
| DB_OWNER_USER | no | Owner role used to create/migrate the mobai schema. Defaults to DB_USER. |
| DB_READER_USER | no | Read-only role used when querying processed-file state. Defaults to DB_USER. |
| DB_REGION | no | AWS region for IAM authentication. Defaults to eu-north-1. |
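The defaults in this table can be summarised in a short sketch. `DbSettings` and `load_db_settings` are hypothetical names, and treating an empty DB_PASSWORD the same as an unset one is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DbSettings:
    host: str
    port: int
    name: str
    user: str
    password: Optional[str]
    owner_user: str
    reader_user: str
    region: str


def load_db_settings(env: Mapping[str, str] = os.environ) -> DbSettings:
    """Read the DB_* variables, applying the defaults from the table above."""
    user = env["DB_USER"]  # a KeyError means a required variable is missing
    return DbSettings(
        host=env["DB_HOST"],
        port=int(env.get("DB_PORT", "5432")),
        name=env["DB_NAME"],
        user=user,
        # None signals IAM authentication; mapping "" to None is an assumption.
        password=env.get("DB_PASSWORD") or None,
        owner_user=env.get("DB_OWNER_USER", user),
        reader_user=env.get("DB_READER_USER", user),
        region=env.get("DB_REGION", "eu-north-1"),
    )
```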
RDS IAM authentication
If DB_PASSWORD is unset, the service authenticates to RDS using short-lived IAM auth tokens generated via boto3's generate_db_auth_token.
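A minimal sketch of that fallback is shown below. The function name `resolve_db_password` and the optional `rds_client` parameter are assumptions for illustration; `generate_db_auth_token` is the boto3 RDS client call named above.

```python
from typing import Any, Mapping


def resolve_db_password(env: Mapping[str, str], rds_client: Any = None) -> str:
    """Return DB_PASSWORD if set, otherwise a short-lived RDS IAM auth token."""
    password = env.get("DB_PASSWORD")
    if password:
        return password
    region = env.get("DB_REGION", "eu-north-1")
    if rds_client is None:
        import boto3  # imported lazily so password auth works without boto3

        rds_client = boto3.client("rds", region_name=region)
    # The returned token is used in place of a password when connecting.
    return rds_client.generate_db_auth_token(
        DBHostname=env["DB_HOST"],
        Port=int(env.get("DB_PORT", "5432")),
        DBUsername=env["DB_USER"],
        Region=region,
    )
```

RDS IAM tokens expire after 15 minutes, so a long-running process would need to request a new token whenever it opens a new connection rather than caching one at startup.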
DATA_TARGET=s3
Appends extracted records to a single CSV object (records.csv) under the configured S3 prefix.
| Variable | Required | Description |
|---|---|---|
| DATA_TARGET | yes | Set to s3. |
| ACCESS_KEY_ID | yes | AWS access key with s3:GetObject and s3:PutObject on the target. |
| SECRET_ACCESS_KEY | yes | Matching secret key. |
| REGION_NAME | yes | Bucket region. |
| BUCKET_NAME | yes | Target bucket. May be the same as the source bucket. |
| S3_DIRECTORY_TO_SAVE | yes | S3 prefix (folder) where records.csv is written. |
Important: CSV append is not concurrency-safe. Use DATA_TARGET=s3 with WORKERS=1 only, or use DATA_TARGET=db if you need parallel processing.
DATA_TARGET=local
Writes results to a local CSV file. Useful for evaluation or for environments without an S3/PostgreSQL target.
| Variable | Required | Description |
|---|---|---|
| DATA_TARGET | yes | Set to local. |
| LOCAL_DIRECTORY_TO_SAVE | yes | Absolute path inside the container where records.csv is written. Mount a host volume here to persist results. |
The same concurrency caveat applies: use WORKERS=1 with local targets.
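A deployment script can guard against this misconfiguration before starting the service. The check below is a hypothetical helper, not something the service is documented to do itself.

```python
from typing import Mapping

# Targets that append to a single records.csv, per the sections above.
CSV_TARGETS = {"s3", "local"}


def check_worker_count(env: Mapping[str, str]) -> None:
    """Raise ValueError when a CSV target is combined with parallel workers."""
    target = env.get("DATA_TARGET", "")
    workers = int(env.get("WORKERS", "1"))
    if target in CSV_TARGETS and workers > 1:
        raise ValueError(
            f"DATA_TARGET={target} appends to one CSV file and is not "
            f"concurrency-safe; set WORKERS=1 or use DATA_TARGET=db "
            f"(got WORKERS={workers})"
        )
```

Running `check_worker_count(os.environ)` as a pre-flight step turns a possible silent loss of records into an immediate, readable error.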
Runtime behaviour
| Variable | Default | Description |
|---|---|---|
| WORKERS | 1 | Number of parallel worker processes. See Deployment → Sizing. |
| LOG_LEVEL | INFO | Logger level: DEBUG, INFO, WARNING, ERROR. INFO is recommended for production. |
| SKIP_PROCESSED_FILES | true | When true, the service consults the target for already-processed file names and removes them from the work list. Set to false for a full re-process. |
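The effect of SKIP_PROCESSED_FILES on the work list can be sketched as follows. `parse_bool` and `build_work_list` are hypothetical names, and the set of spellings accepted as true is an assumption; only `true` and `false` are documented above.

```python
from typing import Iterable, Mapping


def parse_bool(value: str) -> bool:
    """Interpret common truthy spellings; the accepted set is an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def build_work_list(
    candidates: Iterable[str],
    processed: Iterable[str],
    env: Mapping[str, str],
) -> list[str]:
    """Drop already-processed file names unless SKIP_PROCESSED_FILES is false."""
    if not parse_bool(env.get("SKIP_PROCESSED_FILES", "true")):
        return list(candidates)  # full re-process: keep everything
    done = set(processed)
    return [name for name in candidates if name not in done]
```

Because the comparison is by file name, re-running with the default setting only processes files that the target has not recorded yet.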
Example configurations
Evaluation — local source, local CSV target
DATA_SOURCE=local
LOCAL_DIRECTORY_TO_PROCESS=/app/tests/assets
DATA_TARGET=local
LOCAL_DIRECTORY_TO_SAVE=/app/output
WORKERS=1
LOG_LEVEL=INFO
SKIP_PROCESSED_FILES=false
Production — S3 archive, PostgreSQL results, parallel workers
DATA_SOURCE=s3
ACCESS_KEY_ID=<from-secret-manager>
SECRET_ACCESS_KEY=<from-secret-manager>
REGION_NAME=eu-north-1
BUCKET_NAME=customer-kyc-archive
S3_DIRECTORY_TO_PROCESS=2024-archive/
DATA_TARGET=db
DB_HOST=prod-kyc.cluster-xxx.eu-north-1.rds.amazonaws.com
DB_NAME=kyc_extracts
DB_USER=document_processor
DB_REGION=eu-north-1
# IAM auth — no DB_PASSWORD set
WORKERS=4
LOG_LEVEL=INFO
SKIP_PROCESSED_FILES=true
Hybrid — S3 archive, S3 CSV results, single worker
DATA_SOURCE=s3
ACCESS_KEY_ID=<...>
SECRET_ACCESS_KEY=<...>
REGION_NAME=eu-north-1
BUCKET_NAME=customer-kyc-archive
S3_DIRECTORY_TO_PROCESS=2024-archive/
DATA_TARGET=s3
S3_DIRECTORY_TO_SAVE=2024-results/
WORKERS=1
LOG_LEVEL=INFO
SKIP_PROCESSED_FILES=true