
Configuration

Document Processing is configured entirely through environment variables.

Source configuration

The source defines where input documents come from. Set DATA_SOURCE to either local or s3.

DATA_SOURCE=local

Variable | Required | Description
DATA_SOURCE | yes | Set to local.
LOCAL_DIRECTORY_TO_PROCESS | yes | Absolute path inside the container to the root of the archive. The directory is scanned recursively for files with supported extensions (.pdf, .tif, .tiff).

Mount the host archive into the container at the path you set in LOCAL_DIRECTORY_TO_PROCESS.
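The recursive scan described above can be pictured with a short Python sketch. This is an illustration only, not the service's own code: the function name find_documents is hypothetical, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Extensions listed for LOCAL_DIRECTORY_TO_PROCESS. Matching is done
# case-insensitively here, which is an assumption about the service.
SUPPORTED_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def find_documents(root: str) -> list[Path]:
    """Return every supported file under root, searched recursively."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(
            f"LOCAL_DIRECTORY_TO_PROCESS is not a directory: {root}"
        )
    return sorted(
        path
        for path in root_path.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Running a check like this against the mounted path before starting a job is a quick way to confirm that the volume mount exposes the files you expect.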

DATA_SOURCE=s3

Variable | Required | Description
DATA_SOURCE | yes | Set to s3.
ACCESS_KEY_ID | yes | AWS access key with s3:ListBucket and s3:GetObject permissions on the source.
SECRET_ACCESS_KEY | yes | Matching AWS secret key.
REGION_NAME | yes | Bucket region (e.g. eu-north-1).
BUCKET_NAME | yes | Source bucket name.
S3_DIRECTORY_TO_PROCESS | yes | S3 prefix (folder) containing the archive. Trailing slash is added automatically if missing.

Files are fetched via pre-signed URLs that are valid for 1 hour. If a download fails, the service generates a new URL when it retries.
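A minimal sketch of the fetch behaviour, assuming the service uses boto3's generate_presigned_url and requests a new URL for every attempt. The helper names (normalize_prefix, download_with_renewal), the attempt count, and the fetch hook are hypothetical. The S3 client is passed in, so the block itself does not import boto3.

```python
import urllib.request

PRESIGNED_URL_TTL_SECONDS = 3600  # 1 hour, as stated above


def normalize_prefix(prefix: str) -> str:
    """Add the trailing slash that S3_DIRECTORY_TO_PROCESS may omit."""
    if not prefix or prefix.endswith("/"):
        return prefix
    return prefix + "/"


def _fetch(url: str) -> bytes:
    """Default downloader: plain HTTPS GET of the pre-signed URL."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read()


def download_with_renewal(s3_client, bucket, key, max_attempts=3, fetch=_fetch):
    """Download one object, signing a fresh URL before each attempt.

    s3_client is expected to be a boto3 S3 client, for example
    boto3.client("s3", region_name=...). max_attempts is an assumed value.
    """
    last_error = None
    for _ in range(max_attempts):
        # A new URL per attempt means a retry never reuses an expired one.
        url = s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": bucket, "Key": key},
            ExpiresIn=PRESIGNED_URL_TTL_SECONDS,
        )
        try:
            return fetch(url)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            last_error = exc
    raise RuntimeError(
        f"download of {key} failed after {max_attempts} attempts"
    ) from last_error
```

Signing inside the retry loop keeps the logic simple: there is no need to track when a URL was issued, at the cost of one extra local signing call per retry.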

Target configuration

The target defines where extracted results are written. Set DATA_TARGET to db, s3, or local.

DATA_TARGET=db

PostgreSQL target. Best choice for production: supports concurrent worker writes, enables fast querying, and is the only target that supports the full structured output.

Variable | Required | Description
DATA_TARGET | yes | Set to db.
DB_HOST | yes | PostgreSQL host.
DB_PORT | no | Defaults to 5432.
DB_NAME | yes | Database name.
DB_USER | yes | Username for read/write access to the mobai schema.
DB_PASSWORD | conditional | Required for password authentication. Omit to use AWS RDS IAM authentication (see below).
DB_OWNER_USER | no | Owner role used to create/migrate the mobai schema. Defaults to DB_USER.
DB_READER_USER | no | Read-only role used when querying processed-file state. Defaults to DB_USER.
DB_REGION | no | AWS region for IAM authentication. Defaults to eu-north-1.
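The defaulting rules in this table can be summarised in a short sketch. The DbSettings class and load_db_settings function are hypothetical names used for illustration, and treating an empty value the same as an unset one is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DbSettings:
    host: str
    port: int
    name: str
    user: str
    password: Optional[str]
    owner_user: str
    reader_user: str
    region: str

    @property
    def uses_iam_auth(self) -> bool:
        # No DB_PASSWORD means RDS IAM authentication is used instead.
        return self.password is None


def load_db_settings(env: Mapping[str, str] = os.environ) -> DbSettings:
    """Resolve the DB_* variables, applying the documented defaults."""
    missing = [k for k in ("DB_HOST", "DB_NAME", "DB_USER") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required variables: {', '.join(missing)}")
    user = env["DB_USER"]
    return DbSettings(
        host=env["DB_HOST"],
        port=int(env.get("DB_PORT") or "5432"),
        name=env["DB_NAME"],
        user=user,
        password=env.get("DB_PASSWORD") or None,
        owner_user=env.get("DB_OWNER_USER") or user,    # falls back to DB_USER
        reader_user=env.get("DB_READER_USER") or user,  # falls back to DB_USER
        region=env.get("DB_REGION") or "eu-north-1",
    )
```

With only DB_HOST, DB_NAME, and DB_USER set, all three roles resolve to the same user and the connection uses IAM authentication on port 5432.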

RDS IAM authentication

If DB_PASSWORD is unset, the service authenticates to RDS using short-lived IAM auth tokens generated via boto3's generate_db_auth_token.
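The choice between the two authentication modes might look roughly like the sketch below. The function name resolve_db_password is hypothetical; generate_db_auth_token is the boto3 RDS client method named above. The client is passed in, so the block itself does not import boto3.

```python
def resolve_db_password(rds_client, host, port, user, region, password=None):
    """Return the static password if one is set, else a short-lived IAM token.

    rds_client is expected to be a boto3 RDS client, for example
    boto3.client("rds", region_name=region).
    """
    if password:
        return password
    # The token is passed to PostgreSQL in place of a password.
    return rds_client.generate_db_auth_token(
        DBHostname=host,
        Port=port,
        DBUsername=user,
        Region=region,
    )
```

RDS IAM tokens expire after 15 minutes, so a long-running worker would typically request a new token whenever it opens a new connection rather than caching one at start-up. The database user must also be enabled for IAM authentication on the RDS side.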

DATA_TARGET=s3

Appends extracted records to a single CSV object (records.csv) under the configured S3 prefix.

Variable | Required | Description
DATA_TARGET | yes | Set to s3.
ACCESS_KEY_ID | yes | AWS access key with s3:GetObject and s3:PutObject on the target.
SECRET_ACCESS_KEY | yes | Matching secret key.
REGION_NAME | yes | Bucket region.
BUCKET_NAME | yes | Target bucket. May be the same as the source bucket.
S3_DIRECTORY_TO_SAVE | yes | S3 prefix (folder) where records.csv is written.

Important: CSV append is not concurrency-safe. Use DATA_TARGET=s3 with WORKERS=1 only, or use DATA_TARGET=db if you need parallel processing.
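A deployment script can guard against this combination before the service starts. The check below is a hypothetical pre-flight helper, not part of the service; it also covers the local CSV target, which carries the same caveat. Whether the service enforces this rule itself is not stated here.

```python
# Targets that append to a single records.csv and therefore need one worker.
CSV_TARGETS = {"s3", "local"}


def validate_worker_count(data_target: str, workers: int) -> None:
    """Raise ValueError if WORKERS is unsafe for the chosen DATA_TARGET."""
    if workers < 1:
        raise ValueError("WORKERS must be at least 1")
    if data_target in CSV_TARGETS and workers > 1:
        raise ValueError(
            f"DATA_TARGET={data_target} appends to a single CSV and is not "
            f"concurrency-safe; use WORKERS=1 or DATA_TARGET=db "
            f"(got WORKERS={workers})"
        )
```

For example, validate_worker_count("db", 4) passes silently, while validate_worker_count("s3", 4) raises.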

DATA_TARGET=local

Writes results to a local CSV file. Useful for evaluation or for environments without an S3/PostgreSQL target.

Variable | Required | Description
DATA_TARGET | yes | Set to local.
LOCAL_DIRECTORY_TO_SAVE | yes | Absolute path inside the container where records.csv is written. Mount a host volume here to persist results.

The same concurrency caveat applies as for DATA_TARGET=s3: CSV append is not concurrency-safe, so use WORKERS=1 with the local target.

Runtime behaviour

Variable | Default | Description
WORKERS | 1 | Number of parallel worker processes. See Deployment → Sizing.
LOG_LEVEL | INFO | Logger level: DEBUG, INFO, WARNING, ERROR. INFO is recommended for production.
SKIP_PROCESSED_FILES | true | When true, the service consults the target for already-processed file names and removes them from the work list. Set to false for a full re-process.
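The effect of SKIP_PROCESSED_FILES can be sketched as a simple filter over the work list. The helper names are hypothetical, the set of accepted boolean spellings is an assumption, and whether names are compared as full paths or as base names is not specified here.

```python
def parse_bool(value, default):
    """Interpret an environment value as a boolean (assumed spellings)."""
    if value is None or value.strip() == "":
        return default
    return value.strip().lower() in {"true", "1", "yes"}


def build_work_list(candidates, processed_names, skip_processed=True):
    """Drop files the target already holds results for, keeping input order."""
    if not skip_processed:
        return list(candidates)  # full re-process
    processed = set(processed_names)
    return [name for name in candidates if name not in processed]
```

With skip_processed=True, an interrupted run can be restarted and only the remaining files are processed. With skip_processed=False, every file in the source is processed again.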

Example configurations

Evaluation — local source, local CSV target

DATA_SOURCE=local
LOCAL_DIRECTORY_TO_PROCESS=/app/tests/assets

DATA_TARGET=local
LOCAL_DIRECTORY_TO_SAVE=/app/output

WORKERS=1
LOG_LEVEL=INFO
SKIP_PROCESSED_FILES=false

Production — S3 archive, PostgreSQL results, parallel workers

DATA_SOURCE=s3
ACCESS_KEY_ID=<from-secret-manager>
SECRET_ACCESS_KEY=<from-secret-manager>
REGION_NAME=eu-north-1
BUCKET_NAME=customer-kyc-archive
S3_DIRECTORY_TO_PROCESS=2024-archive/

DATA_TARGET=db
DB_HOST=prod-kyc.cluster-xxx.eu-north-1.rds.amazonaws.com
DB_NAME=kyc_extracts
DB_USER=document_processor
DB_REGION=eu-north-1
# IAM auth — no DB_PASSWORD set

WORKERS=4
LOG_LEVEL=INFO
SKIP_PROCESSED_FILES=true

Hybrid — S3 archive, S3 CSV results, single worker

DATA_SOURCE=s3
ACCESS_KEY_ID=<...>
SECRET_ACCESS_KEY=<...>
REGION_NAME=eu-north-1
BUCKET_NAME=customer-kyc-archive
S3_DIRECTORY_TO_PROCESS=2024-archive/

DATA_TARGET=s3
S3_DIRECTORY_TO_SAVE=2024-results/

WORKERS=1
LOG_LEVEL=INFO
SKIP_PROCESSED_FILES=true
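Pre-flight check

Configurations like the examples above can be checked for missing variables before the container starts. The sketch below is a hypothetical helper built only from the required-variable tables on this page; it is not shipped with the service.

```python
S3_CREDENTIALS = ["ACCESS_KEY_ID", "SECRET_ACCESS_KEY", "REGION_NAME", "BUCKET_NAME"]

# Required variables per mode, taken from the tables on this page.
SOURCE_REQUIRED = {
    "local": ["LOCAL_DIRECTORY_TO_PROCESS"],
    "s3": S3_CREDENTIALS + ["S3_DIRECTORY_TO_PROCESS"],
}
TARGET_REQUIRED = {
    "db": ["DB_HOST", "DB_NAME", "DB_USER"],
    "s3": S3_CREDENTIALS + ["S3_DIRECTORY_TO_SAVE"],
    "local": ["LOCAL_DIRECTORY_TO_SAVE"],
}


def missing_variables(env):
    """Return a list of problems: unset required names or invalid selectors."""
    problems = []
    for selector, table in (
        ("DATA_SOURCE", SOURCE_REQUIRED),
        ("DATA_TARGET", TARGET_REQUIRED),
    ):
        mode = env.get(selector)
        if mode not in table:
            problems.append(f"{selector} must be one of {sorted(table)}")
            continue
        problems.extend(name for name in table[mode] if not env.get(name))
    return problems
```

Calling missing_variables(os.environ) and exiting when the list is non-empty gives a clear error at start-up instead of a failure partway through a run.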