PDF Invoice Parsing Pipeline

Problem Overview

The objective was to extract structured data from thousands of restaurant invoices issued by major suppliers such as Gordon Food Service (GFS) and Sysco. The project faced several structural challenges:

  • Fragmented PDF Layouts: A single PDF file often contained multiple invoices, and a single invoice frequently spanned multiple pages.
  • Transaction Variety: The system had to distinguish between standard sales invoices and credit returns.
  • Granular Extraction: The parser had to accurately extract seller and buyer metadata, dates, invoice numbers, and line-item details (SKU, quantity, unit price, and description).
  • Lifecycle Management: The architecture needed to support three distinct modes: processing new arrivals, re-parsing existing data, and bulk-processing archived documents.

Architecture & Workflow

  • Ingestion Layer: An S3 bucket serves as the landing zone. Each object-created (put) event sends a message to an SQS queue, which decouples ingestion from processing and makes the flow resilient to downstream failures.
  • Orchestration Layer: SQS buffers document pointers, protecting the system from spikes in document volume and allowing for easy retries.
  • Processing Layer (Parser Lambda):
    • Triggered by SQS to process individual documents.
    • Outputs parsed data to DynamoDB using a Composite Key strategy (PK: supplier-name:invoice-number) to ensure uniqueness and prevent duplicates.
  • Extraction Layer (Export Lambda): A secondary function that aggregates parsed records from DynamoDB and exports them back to S3 as multi-object JSON files for downstream analytics.
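
The hand-off from SQS to the Parser Lambda and the composite key can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each SQS message body carries a standard S3 event notification as JSON, and the names extractPointers, buildPartitionKey, and DocumentPointer, along with the supplier-name normalization, are hypothetical.

```typescript
// Minimal shapes for the parts of the events this sketch reads.
interface SqsRecord { body: string }
interface SqsEvent { Records: SqsRecord[] }
interface S3Notification {
  // Absent on the test message S3 sends when notifications are first configured.
  Records?: { s3: { bucket: { name: string }; object: { key: string } } }[];
}

// Hypothetical pointer type: where the Parser Lambda should fetch the PDF from.
interface DocumentPointer { bucket: string; key: string }

// Assumption: each SQS message body is one S3 event notification serialized as JSON.
function extractPointers(event: SqsEvent): DocumentPointer[] {
  const pointers: DocumentPointer[] = [];
  for (const record of event.Records) {
    const notification = JSON.parse(record.body) as S3Notification;
    for (const entry of notification.Records ?? []) {
      pointers.push({
        bucket: entry.s3.bucket.name,
        // S3 URL-encodes object keys in notifications and sends spaces as "+".
        key: decodeURIComponent(entry.s3.object.key.replace(/\+/g, " ")),
      });
    }
  }
  return pointers;
}

// Builds the composite key in the supplier-name:invoice-number form.
// The lowercasing and hyphenation are assumptions made for this sketch.
function buildPartitionKey(supplierName: string, invoiceNumber: string): string {
  const supplier = supplierName.trim().toLowerCase().replace(/\s+/g, "-");
  return `${supplier}:${invoiceNumber.trim()}`;
}
```

Because the key is derived only from the supplier and the invoice number, parsing the same invoice a second time yields the same key, which is what lets the table reject or overwrite duplicates instead of accumulating them.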

Technical Implementation of the Parser

  • Runtime: Built with Node.js (TypeScript) and lightweight PDF-parsing dependencies to keep cold-start latency low.
  • Configuration-Driven Design: Implemented a per-supplier configuration schema that defines field locations, regex patterns, and formatting rules, so new suppliers can be added without changing the core logic.
  • State Logic: The parser "stitches" multi-page invoices together and "splits" bulk PDF files into individual invoices by detecting header and footer markers dynamically.
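
To make the configuration-driven design and the stitching/splitting logic concrete, here is a minimal sketch. It assumes the PDF text has already been extracted into one string per page. The SupplierConfig fields, the groupPages function, and the sample patterns are all hypothetical and do not reflect any real supplier's layout or the project's actual schema.

```typescript
// Hypothetical per-supplier configuration. Patterns must not use the "g" flag,
// because global regexes keep state between calls.
interface SupplierConfig {
  supplier: string;
  invoiceNumber: RegExp; // header marker; capture group 1 holds the invoice number
  footerMarker: RegExp;  // appears on the last page of an invoice
  creditMarker: RegExp;  // appears on credit returns
}

interface InvoiceDocument {
  supplier: string;
  invoiceNumber: string;
  type: "invoice" | "credit";
  pages: string[];
}

// Walks the pages in order. A new invoice number in the header starts a new
// invoice (splitting); pages repeating the current number are appended to it
// (stitching); a footer marker closes the current invoice.
function groupPages(pages: string[], config: SupplierConfig): InvoiceDocument[] {
  const invoices: InvoiceDocument[] = [];
  let current: InvoiceDocument | undefined;

  for (const page of pages) {
    const invoiceNumber = config.invoiceNumber.exec(page)?.[1];
    if (invoiceNumber !== undefined && invoiceNumber !== current?.invoiceNumber) {
      current = { supplier: config.supplier, invoiceNumber, type: "invoice", pages: [] };
      invoices.push(current);
    }
    if (!current) continue; // e.g. a cover page before the first header
    current.pages.push(page);
    if (config.creditMarker.test(page)) current.type = "credit";
    if (config.footerMarker.test(page)) current = undefined;
  }
  return invoices;
}

// Illustrative configuration for an invented supplier.
const exampleConfig: SupplierConfig = {
  supplier: "example-supplier",
  invoiceNumber: /Invoice No:\s*(\d+)/,
  footerMarker: /Invoice Total/,
  creditMarker: /CREDIT MEMO/,
};

const grouped = groupPages(
  [
    "Invoice No: 1001\nPage 1 of 2\nSKU 123  2 x 10.00",
    "Invoice No: 1001\nPage 2 of 2\nInvoice Total 20.00",
    "Invoice No: 1002\nCREDIT MEMO\nInvoice Total -12.00",
  ],
  exampleConfig,
);
```

In this example, grouped holds two entries: invoice 1001 with both of its pages stitched together, and invoice 1002 split off as a single-page credit. Supporting another supplier would mean supplying a different SupplierConfig while groupPages stays unchanged.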

Technologies Used

  • Cloud: AWS (S3, SQS, Lambda, DynamoDB).
  • Runtime: Node.js, TypeScript.
  • Data Strategy: Single Table Design, Config-driven ETL.

Role

Lead Architect & Backend Developer