Tomasz Dubiel

Problem Overview

The objective was to analyze thousands of restaurant invoices from major suppliers like Gordon Food Service (GFS) and Sysco. The project faced several structural challenges:

Fragmented PDF Layouts: Multiple invoices often existed within a single PDF file, while single invoices frequently spanned multiple pages.
Transaction Variety: The system had to distinguish between standard sales invoices and credit returns.
Granular Extraction: The parser had to extract seller and buyer metadata, dates, invoice numbers, and line-item details (SKU, quantity, unit price, and description) with high fidelity.
Lifecycle Management: The architecture needed to support three distinct modes: processing new arrivals, re-parsing existing data, and bulk-processing archived documents.

Architecture & Workflow

Ingestion Layer: An S3 bucket serves as the landing zone. Each object-created (PUT) event sends a message to an SQS queue, which decouples ingestion from parsing and makes the flow resilient to downstream failures.
Orchestration Layer: SQS buffers document pointers, protecting the system from spikes in document volume and allowing for easy retries.
Processing Layer (Parser Lambda):
- Triggered by SQS to process individual documents.
- Outputs parsed data to DynamoDB using a Composite Key strategy (PK: supplier-name:invoice-number) to ensure uniqueness and prevent duplicates.
Extraction Layer (Export Lambda): A secondary function that aggregates parsed records from DynamoDB and exports them back to S3 as JSON files, each holding multiple records, for downstream analytics.
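
The hand-off from SQS to the parser and the composite-key write can be sketched as follows. This is a minimal illustration, not the project's actual code: the interfaces are trimmed-down versions of the SQS and S3 notification payloads, the function names (extractPointers, slug, invoiceKey, buildPutParams) are hypothetical, and the supplier-name normalization is an assumption. The returned object follows the plain-object form accepted by DynamoDB document-style clients; no SDK call is made here.

```typescript
// Only the fields this sketch needs from the SQS and S3 notification payloads.
interface SqsEvent { Records: { messageId: string; body: string }[] }
interface S3Notification {
  Records?: { s3: { bucket: { name: string }; object: { key: string } } }[];
}
interface DocumentPointer { bucket: string; key: string }

// Each SQS message body carries an S3 event notification as JSON.
// S3 object keys arrive URL-encoded, with spaces written as "+".
function extractPointers(event: SqsEvent): DocumentPointer[] {
  const pointers: DocumentPointer[] = [];
  for (const message of event.Records) {
    const notification = JSON.parse(message.body) as S3Notification;
    // Messages without Records (for example S3 test events) are skipped.
    for (const record of notification.Records ?? []) {
      pointers.push({
        bucket: record.s3.bucket.name,
        key: decodeURIComponent(record.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return pointers;
}

// Assumed normalization: lower-case, runs of other characters become "-".
function slug(value: string): string {
  return value.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Composite key in the supplier-name:invoice-number form described above.
function invoiceKey(supplier: string, invoiceNumber: string): string {
  return `${slug(supplier)}:${invoiceNumber.trim()}`;
}

// Parameters for a conditional put: the write is rejected if an item with
// the same PK already exists, so a redelivered message cannot create a duplicate.
function buildPutParams(
  tableName: string,
  supplier: string,
  invoiceNumber: string,
  attributes: Record<string, unknown>,
) {
  return {
    TableName: tableName,
    Item: { ...attributes, PK: invoiceKey(supplier, invoiceNumber) },
    ConditionExpression: "attribute_not_exists(PK)",
  };
}
```

A re-parse mode would presumably drop the condition so that existing records can be overwritten; that choice is an assumption and not something the description above specifies.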

Technical Implementation of the Parser

Runtime: Built with Node.js (TypeScript) to take advantage of lightweight PDF parsing dependencies, which helps keep cold-start latency low.
Configuration-Driven Design: Implemented a per-supplier configuration schema. This defines field locations, regex patterns, and formatting rules, allowing the engine to scale to new suppliers without changing the core logic.
State Logic: The parser is designed to handle the "stitching" of multi-page invoices and the "splitting" of bulk PDF files by identifying header and footer markers dynamically.
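
The configuration-driven splitting and stitching described above might look roughly like the sketch below. It assumes the PDF text has already been extracted page by page (the PDF library is out of scope here) and that each supplier's configuration supplies regular expressions for the header and footer markers. The SupplierConfig fields, the groupPages function, and the marker patterns in exampleConfig are all hypothetical; they do not reflect the real GFS or Sysco layouts.

```typescript
// Hypothetical per-supplier configuration. Patterns must not use the "g"
// flag, because exec/test would then carry state between pages.
interface SupplierConfig {
  supplier: string;
  invoiceNumber: RegExp; // header marker; capture group 1 is the invoice number
  lastPage: RegExp;      // footer marker that appears on an invoice's final page
  creditMarker: RegExp;  // text that identifies a credit return
}

interface InvoiceDraft {
  supplier: string;
  invoiceNumber: string;
  kind: "invoice" | "credit";
  pages: string[];
}

// Walks the pages of one PDF in order and groups them into invoices:
// a new invoice number opens a draft (splitting), pages that repeat the
// number or carry no header join the open draft (stitching), and the
// footer marker closes it.
function groupPages(pages: string[], config: SupplierConfig): InvoiceDraft[] {
  const drafts: InvoiceDraft[] = [];
  let open: InvoiceDraft | undefined = undefined;
  for (const page of pages) {
    const number = config.invoiceNumber.exec(page)?.[1];
    if (number !== undefined && number !== open?.invoiceNumber) {
      open = { supplier: config.supplier, invoiceNumber: number, kind: "invoice", pages: [] };
      drafts.push(open);
    }
    if (open === undefined) continue; // cover sheet or stray page: nothing to attach it to
    open.pages.push(page);
    if (config.creditMarker.test(page)) open.kind = "credit";
    if (config.lastPage.test(page)) open = undefined;
  }
  return drafts;
}

// Illustrative configuration only; the markers are invented for this example.
const exampleConfig: SupplierConfig = {
  supplier: "gfs",
  invoiceNumber: /INVOICE NO:\s*(\d+)/,
  lastPage: /INVOICE TOTAL/,
  creditMarker: /CREDIT MEMO/,
};
```

Under this design, supporting a new supplier would mean adding another configuration object rather than changing groupPages, which matches the stated goal of leaving the core logic untouched.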

Technologies Used

Cloud: AWS (S3, SQS, Lambda, DynamoDB).
Runtime: Node.js, TypeScript.
Data Strategy: Single Table Design, Config-driven ETL.

Role

Lead Architect & Backend Developer

Pipeline for Parsing PDF Invoices
