Problem overview

A restaurant has thousands of invoices from its suppliers, Gordon Foods and Sysco, that need to be analyzed. Each supplier has its own layout. In many cases multiple invoices are in one PDF document, and one invoice can span multiple pages. Some invoices represent sales and some represent returns. To be analyzed, every PDF needs to be parsed for seller and buyer information, date, invoice number, and all line items (SKU, quantity, price, etc.). The architecture needs to support parsing existing (archived) documents, reparsing them, and parsing newly arriving ones. Documents land and are stored in an S3 bucket.
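
Because one PDF can hold several invoices and one invoice can span several pages, the parser first has to decide which pages belong together. A minimal sketch of one possible approach, assuming the text of each page has already been extracted and that an invoice number can be found on at least the first page of every invoice. The names `groupPagesByInvoice` and `InvoicePages` and the sample pattern are hypothetical, not taken from either supplier's layout.

```typescript
// Pages of one PDF that belong to the same invoice (hypothetical shape)
interface InvoicePages {
  invoiceNumber: string;
  pages: number[]; // zero-based page indexes within the PDF
}

// Groups consecutive pages by the invoice number found on each page.
// A page with no invoice number is treated as a continuation of the previous invoice.
// The pattern needs one capture group and must not use the "g" flag.
function groupPagesByInvoice(pageTexts: string[], invoiceNumberPattern: RegExp): InvoicePages[] {
  const groups: InvoicePages[] = [];
  pageTexts.forEach((text, index) => {
    const found = invoiceNumberPattern.exec(text)?.[1];
    const current = groups[groups.length - 1];
    if (found !== undefined && found !== current?.invoiceNumber) {
      // A new invoice number starts a new group
      groups.push({ invoiceNumber: found, pages: [index] });
    } else if (current !== undefined) {
      // Same number, or no number at all: continuation page
      current.pages.push(index);
    }
    // Pages before the first recognizable invoice are skipped
  });
  return groups;
}
```

Each resulting group can then be parsed as a single invoice, regardless of how many PDF pages it covers.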

Architecture

S3 bucket with a put trigger to an SQS queue
SQS queue that buffers pointers to the documents to be processed
Lambda function that is triggered from the SQS queue, parses a single document, and writes the results to a DynamoDB table under the PK supplier-name:invoice-number (to be unique)
Lambda function that extracts a set of parsed documents to S3 as a multi-object JSON file
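
The first steps of the parsing Lambda can be sketched as follows: unwrap the S3 notification carried in each SQS message body and build the partition key. This is a sketch under assumptions, not the actual implementation. The event shapes are reduced to the fields used here, the helper names `extractPointers` and `invoicePk` are hypothetical, and the lower-casing and hyphenation of the supplier name is an assumed normalization.

```typescript
// Minimal local shapes for the parts of the events used here
interface SqsEvent {
  Records: { body: string }[];
}

interface S3Pointer {
  bucket: string;
  key: string;
}

// Each SQS message body carries the S3 notification as a JSON string.
function extractPointers(event: SqsEvent): S3Pointer[] {
  const pointers: S3Pointer[] = [];
  for (const message of event.Records) {
    const notification = JSON.parse(message.body) as {
      Records?: { s3: { bucket: { name: string }; object: { key: string } } }[];
    };
    // S3 sends a test message without Records when the notification is set up
    for (const record of notification.Records ?? []) {
      pointers.push({
        bucket: record.s3.bucket.name,
        // Object keys arrive URL-encoded, with spaces encoded as "+"
        key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return pointers;
}

// Partition key in the supplier-name:invoice-number form.
// Normalizing the supplier name is an assumption made for this sketch.
function invoicePk(supplier: string, invoiceNumber: string): string {
  const name = supplier.trim().toLowerCase().replace(/\s+/g, "-");
  return `${name}:${invoiceNumber.trim()}`;
}
```

Using the same key on every run means that reparsing a document overwrites the earlier item for that invoice instead of creating a duplicate.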

Structure of parsing Lambda function:

written in Node.js (TypeScript) because parsing PDFs in Node requires only very lightweight dependencies
built as a config-driven (per supplier) list of field definitions: where each field can be located and how it is formatted
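
A config-driven field list could look like the sketch below, assuming fields are located by regular expressions over the extracted page text. The names (`FieldDef`, `SupplierConfig`, `extractFields`), the patterns, and the MM/DD/YYYY date format are all hypothetical and do not reflect the real Gordon Foods or Sysco layouts.

```typescript
type FieldType = "string" | "number" | "date";

interface FieldDef {
  name: string;
  pattern: RegExp; // where the field is located: a regex with one capture group
  type: FieldType; // how the captured text is formatted
}

interface SupplierConfig {
  supplier: string;
  fields: FieldDef[];
}

// Hypothetical config for an imaginary supplier layout
const exampleConfig: SupplierConfig = {
  supplier: "example-supplier",
  fields: [
    { name: "invoiceNumber", pattern: /Invoice\s*#?\s*:?\s*(\d+)/i, type: "string" },
    { name: "invoiceDate", pattern: /\bDate\s*:?\s*(\d{2}\/\d{2}\/\d{4})/i, type: "date" },
    { name: "total", pattern: /\bTotal\s*:?\s*\$?(-?[\d,]+\.\d{2})/i, type: "number" },
  ],
};

// Converts captured text according to the declared field type
function convert(raw: string, type: FieldType): string | number {
  switch (type) {
    case "number":
      return Number(raw.replace(/,/g, ""));
    case "date": {
      // Assumes MM/DD/YYYY input and produces an ISO date string
      const [mm, dd, yyyy] = raw.split("/");
      return `${yyyy}-${mm}-${dd}`;
    }
    default:
      return raw.trim();
  }
}

// Applies every field definition to the text; fields that are not found become null
function extractFields(text: string, config: SupplierConfig): Record<string, string | number | null> {
  const out: Record<string, string | number | null> = {};
  for (const field of config.fields) {
    const raw = field.pattern.exec(text)?.[1];
    out[field.name] = raw === undefined ? null : convert(raw, field.type);
  }
  return out;
}
```

With this shape, supporting another supplier or a changed layout means adding or editing a config entry rather than changing the parsing code.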

Pipeline for PDF invoices parsing
