Problem Overview
The objective was to analyze thousands of restaurant invoices from major suppliers like Gordon Food Service (GFS) and Sysco. The project faced several structural challenges:
- Fragmented PDF Layouts: Multiple invoices often existed within a single PDF file, while single invoices frequently spanned multiple pages.
- Transaction Variety: The system had to distinguish between standard sales invoices and credit returns.
- Granular Extraction: The parser had to accurately extract seller/buyer metadata, dates, invoice numbers, and line-item details (SKU, quantity, unit price, and description).
- Lifecycle Management: The architecture needed to support three distinct modes: processing new arrivals, re-parsing existing data, and bulk-processing archived documents.
Architecture & Workflow
- Ingestion Layer: An S3 bucket serves as the landing zone. A put event triggers a message to an SQS queue, ensuring a decoupled and resilient ingestion flow.
- Orchestration Layer: SQS buffers document pointers, protecting the system from spikes in document volume and allowing for easy retries.
- Processing Layer (Parser Lambda):
  - Triggered by SQS to process individual documents.
  - Outputs parsed data to DynamoDB using a composite key strategy (PK: supplier-name:invoice-number) to ensure uniqueness and prevent duplicates.
- Extraction Layer (Export Lambda): A secondary function that aggregates parsed records from DynamoDB and exports them back to S3 as multi-object JSON files for downstream analytics.
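The hand-off from S3 to the Parser Lambda and the composite key can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (extractPointers, buildInvoiceKey), the trimmed-down event interfaces, and the key normalization (lowercasing, hyphenating spaces) are all assumptions. It also assumes the S3 notification is delivered directly to SQS, so each SQS message body is the S3 event JSON.

```typescript
// Hypothetical, trimmed-down shapes: only the fields used below are declared.
interface SqsRecord { body: string }
interface SqsEvent { Records: SqsRecord[] }
interface DocumentPointer { bucket: string; key: string }

interface S3Notification {
  Records?: { s3: { bucket: { name: string }; object: { key: string } } }[];
}

// Turn SQS messages that wrap S3 event notifications into document pointers.
function extractPointers(event: SqsEvent): DocumentPointer[] {
  const pointers: DocumentPointer[] = [];
  for (const record of event.Records) {
    const notification = JSON.parse(record.body) as S3Notification;
    // Messages without a Records array (such as S3 test events) are skipped.
    for (const entry of notification.Records ?? []) {
      pointers.push({
        bucket: entry.s3.bucket.name,
        // Object keys arrive URL-encoded, with spaces sent as "+".
        key: decodeURIComponent(entry.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return pointers;
}

// Build the composite partition key "supplier-name:invoice-number".
// The normalization here is an assumption made for this sketch.
function buildInvoiceKey(supplierName: string, invoiceNumber: string): string {
  const supplier = supplierName.trim().toLowerCase().replace(/\s+/g, "-");
  return `${supplier}:${invoiceNumber.trim()}`;
}
```

Because a DynamoDB put with an existing primary key replaces the stored item, writing under a deterministic key like this means a re-parsed invoice overwrites its earlier record instead of creating a second one.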
Technical Implementation of the Parser
- Runtime: Built with Node.js (TypeScript) to take advantage of lightweight PDF parsing dependencies, which helps keep cold-start latency low.
- Configuration-Driven Design: Implemented a per-supplier configuration schema that defines field locations, regex patterns, and formatting rules, so new suppliers can be added without changing the core logic.
- State Logic: The parser is designed to handle the "stitching" of multi-page invoices and the "splitting" of bulk PDF files by identifying header and footer markers dynamically.
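The configuration-driven approach and the stitching/splitting logic can be illustrated with a small sketch. Everything named here is hypothetical: the SupplierConfig fields, the groupPages function, and the example regex patterns are invented for illustration and do not reflect real GFS or Sysco layouts. The sketch also simplifies by keying only on the invoice number in the page header, and it assumes page text has already been extracted from the PDF.

```typescript
// Hypothetical per-supplier configuration; field names are assumptions.
interface SupplierConfig {
  supplier: string;
  // Capture group 1 must hold the invoice number. Do not use the "g" flag,
  // since global regexes keep state between calls.
  invoiceNumberPattern: RegExp;
  // Text that marks a page as belonging to a credit return.
  creditPattern: RegExp;
}

interface InvoiceDocument {
  invoiceNumber: string;
  type: "invoice" | "credit";
  pages: string[];
}

// Made-up patterns for an imaginary supplier.
const exampleConfig: SupplierConfig = {
  supplier: "example-supplier",
  invoiceNumberPattern: /Invoice\s*(?:No\.?|#)\s*:?\s*(\d+)/i,
  creditPattern: /CREDIT MEMO/i,
};

// Walk the pages of one PDF in order. A page whose header shows a new invoice
// number starts a new document (splitting); a page that repeats the current
// number, or shows none, is appended to the current document (stitching).
function groupPages(pages: string[], config: SupplierConfig): InvoiceDocument[] {
  const documents: InvoiceDocument[] = [];
  let current: InvoiceDocument | undefined;
  for (const page of pages) {
    const invoiceNumber = config.invoiceNumberPattern.exec(page)?.[1];
    if (invoiceNumber !== undefined && invoiceNumber !== current?.invoiceNumber) {
      current = { invoiceNumber, type: "invoice", pages: [] };
      documents.push(current);
    }
    if (current === undefined) continue; // e.g. a cover sheet before any invoice
    current.pages.push(page);
    if (config.creditPattern.test(page)) current.type = "credit";
  }
  return documents;
}
```

With this shape, supporting another supplier would mean adding another configuration object rather than touching groupPages, which is the property the configuration-driven design aims for.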
Technologies Used
- Cloud: AWS (S3, SQS, Lambda, DynamoDB).
- Runtime: Node.js, TypeScript.
- Data Strategy: Single Table Design, Config-driven ETL.
Role
Lead Architect & Backend Developer
