Tomasz Dubiel

Problem Overview

The legacy system managed over 700 PySpark jobs with complex execution and data-source dependencies. Orchestration was handled by AWS Step Functions, which managed the lifecycle of EMR clusters — provisioning nodes, monitoring job submission, and terminating clusters upon completion. This heavyweight architecture led to significant operational inefficiencies.

Challenges

High Operational Cost: Monthly execution costs exceeded $350,000.
SLA Instability: Prolonged execution times and frequent delays, particularly during recovery from failures.
Reliability Issues: High failure rates at the EMR cluster level, individual job level, and due to data dependency gaps.
Maintenance Overhead: Difficulties in root-cause analysis led to extended "Mean Time to Repair" (MTTR).

Optimization Strategy

Modular Architecture: Re-engineered jobs into single-responsibility components to follow the "do one thing well" principle.
Config-Driven Design: Documented and refactored code to support a metadata-driven architecture for easier scaling.
Right-Sizing Compute: Identified that the majority of jobs were simple SQL transformations involving small datasets (MBs to low GBs), making EMR’s distributed overhead unnecessary.
"Peel the Onion" Methodology: Systematically migrated simple PySpark jobs to DuckDB running on AWS Lambda for high-performance, low-cost processing.
Advanced Orchestration: Refactored Step Functions to execute Lambdas with high concurrency while maintaining strict order and dependency logic (the "loop-in-loop" method).
Hybrid Migration: Transitioned larger-scale jobs that exceeded Lambda limits to AWS Glue to further phase out EMR dependency.

Results

94% Cost Reduction: Slashed monthly execution costs from $350K to $20K.
Performance Gains: Reduced processing times from hours to minutes (and seconds for many individual jobs).
Improved Reliability: Dramatically lowered failure rates by moving to serverless, managed compute.
Enhanced Observability: Streamlined monitoring and simplified the architecture, significantly reducing maintenance costs and resolution times.

Technologies used

AWS Services: AWS Lambda, Step Function, SQS, Glue, Athena, EMR, S3, DynamoDB, SNS.
Frameworks, Libraries: Python, Polars, DuckDB, PySpark, SQL, Serverless

Role