Problem overview
There are thousands of invoices from a restaurant's suppliers, Gordon Foods and Sysco, that need to be analyzed. Each supplier uses its own layout. In many cases a single PDF document contains multiple invoices, and a single invoice can span multiple pages. Some invoices represent sales and some represent returns. Before they can be analyzed, all PDFs need to be parsed for seller and buyer information, date, invoice number, and all line items (SKU, quantity, price, etc.). The architecture needs to support parsing existing (archived) documents, reparsing them, and parsing newly arriving ones. Documents land and are stored in an S3 bucket.
Architecture
- S3 bucket with a put trigger to an SQS queue
- SQS queue that buffers pointers to the documents to be processed
- Lambda function that is triggered from the SQS queue, parses a single document, and writes the results to a DynamoDB table under the PK supplier-name:invoice-number (to be unique)
- Lambda function that extracts a set of parsed documents to S3 as a multi-object JSON file
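
As a rough illustration of the flow above, the sketch below shows how the parsing Lambda might turn the SQS messages it receives into S3 object pointers, and how the supplier-name:invoice-number key could be built. It assumes each SQS message body is a standard S3 event notification in JSON. The names `pointersFromBody`, `pointersFromEvent`, and `invoiceKey` are hypothetical, and the lower-casing and hyphenation of the supplier name is an assumption, not a stated requirement.

```typescript
// Minimal local types for the parts of the events used here (no AWS SDK needed).
interface S3Pointer { bucket: string; key: string }
interface SqsRecord { body: string }
interface SqsEvent { Records: SqsRecord[] }

// Extract S3 object pointers from one SQS message body.
// Assumption: the body is the JSON S3 event notification sent by the bucket trigger.
function pointersFromBody(body: string): S3Pointer[] {
  const parsed = JSON.parse(body) as {
    Records?: { s3: { bucket: { name: string }; object: { key: string } } }[];
  };
  // Messages without Records (for example the S3 test event) yield no pointers.
  return (parsed.Records ?? []).map((r) => ({
    bucket: r.s3.bucket.name,
    // Object keys in S3 notifications are URL-encoded, with spaces sent as "+".
    key: decodeURIComponent(r.s3.object.key.replace(/\+/g, " ")),
  }));
}

// Collect the pointers from every message in one SQS batch.
function pointersFromEvent(event: SqsEvent): S3Pointer[] {
  const all: S3Pointer[] = [];
  for (const record of event.Records) {
    all.push(...pointersFromBody(record.body));
  }
  return all;
}

// Build the DynamoDB partition key supplier-name:invoice-number.
// Assumption: the supplier name is normalized so the same supplier always maps to the same prefix.
function invoiceKey(supplier: string, invoiceNumber: string): string {
  const normalized = supplier.trim().toLowerCase().replace(/\s+/g, "-");
  return `${normalized}:${invoiceNumber.trim()}`;
}
```

Because the key is derived only from the parsed content, reparsing the same document would overwrite the existing item rather than create a duplicate, which is one way to support the reparse requirement.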
Structure of parsing Lambda function:
- written in Node.js (TypeScript), because parsing PDFs in Node requires only very lightweight dependencies
- built as a config-driven (per supplier) list of field definitions: where each field can be located and how it is formatted
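
One possible shape for such a per-supplier config is sketched below. It assumes the PDF text has already been extracted to a plain string, and it locates each field with a regular expression. The `FieldDef` and `SupplierConfig` types, the `parseFields` function, and the patterns in `exampleConfig` are all hypothetical and are not taken from real Gordon Foods or Sysco layouts.

```typescript
type FieldType = "string" | "number" | "date";

// One field definition: where to find the value and how it is formatted.
interface FieldDef {
  name: string;
  pattern: RegExp; // the first capture group holds the raw value
  type: FieldType;
}

interface SupplierConfig {
  supplier: string;
  fields: FieldDef[];
}

// Illustrative patterns only; a real config would be written against actual invoices.
const exampleConfig: SupplierConfig = {
  supplier: "example-supplier",
  fields: [
    { name: "invoiceNumber", pattern: /Invoice\s*(?:No\.?|#)\s*:?\s*(\w+)/i, type: "string" },
    { name: "invoiceDate", pattern: /Date\s*:?\s*(\d{2}\/\d{2}\/\d{4})/i, type: "date" },
    { name: "total", pattern: /Total\s*:?\s*\$?(-?[\d,]+\.\d{2})/i, type: "number" },
  ],
};

// Convert the raw captured text according to the declared field type.
function convert(raw: string, type: FieldType): string | number {
  switch (type) {
    case "number":
      return Number(raw.replace(/,/g, "")); // drop thousands separators
    case "date": {
      const [mm, dd, yyyy] = raw.split("/"); // assumes MM/DD/YYYY input
      return `${yyyy}-${mm}-${dd}`;
    }
    default:
      return raw.trim();
  }
}

// Apply every field definition to the text; fields that are not found become null.
function parseFields(text: string, config: SupplierConfig): Record<string, string | number | null> {
  const out: Record<string, string | number | null> = {};
  for (const field of config.fields) {
    const match = field.pattern.exec(text);
    out[field.name] = match ? convert(match[1], field.type) : null;
  }
  return out;
}
```

With this layout, supporting another supplier or adjusting to a changed invoice template would mean editing a config object rather than the parsing code.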
