Problem Overview
The legacy system managed over 700 PySpark jobs with complex execution and data-source dependencies. Orchestration was handled by AWS Step Functions, which managed the lifecycle of EMR clusters: provisioning nodes, monitoring job submission, and terminating clusters upon completion. This heavyweight architecture led to significant operational inefficiencies.
Challenges
- High Operational Cost: Monthly execution costs exceeded $350,000.
- SLA Instability: Prolonged execution times and frequent delays, particularly during recovery from failures.
- Reliability Issues: High failure rates at the EMR cluster and individual job levels, plus failures caused by data dependency gaps.
- Maintenance Overhead: Difficulties in root-cause analysis led to extended "Mean Time to Repair" (MTTR).
Optimization Strategy
- Modular Architecture: Re-engineered jobs into single-responsibility components to follow the "do one thing well" principle.
- Config-Driven Design: Documented and refactored code to support a metadata-driven architecture for easier scaling.
- Right-Sizing Compute: Identified that the majority of jobs were simple SQL transformations involving small datasets (MBs to low GBs), making EMR’s distributed overhead unnecessary.
- "Peel the Onion" Methodology: Systematically migrated simple PySpark jobs to DuckDB running on AWS Lambda for high-performance, low-cost processing.
- Advanced Orchestration: Refactored Step Functions to execute Lambdas with high concurrency while maintaining strict order and dependency logic (the "loop-in-loop" method).
- Hybrid Migration: Transitioned larger-scale jobs that exceeded Lambda limits to AWS Glue to further phase out EMR dependency.
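
The ordering logic behind the config-driven design and the high-concurrency orchestration can be sketched in a few lines. This is a minimal illustration, not the production implementation: the job names, the `engine` and `depends_on` fields, and the `plan_waves` helper are all hypothetical, and reading "loop-in-loop" as an outer sequential loop over dependency waves with an inner concurrent loop over the jobs in each wave is an assumption. In Step Functions, the two loops would plausibly correspond to nested Map states.

```python
from graphlib import TopologicalSorter

# Hypothetical job metadata: each entry names its compute target and upstream jobs.
JOBS = {
    "load_orders": {"engine": "lambda", "depends_on": []},
    "load_customers": {"engine": "lambda", "depends_on": []},
    "join_orders_customers": {
        "engine": "lambda",
        "depends_on": ["load_orders", "load_customers"],
    },
    "daily_revenue": {"engine": "glue", "depends_on": ["join_orders_customers"]},
}


def plan_waves(jobs):
    """Group jobs into ordered waves; jobs in the same wave do not depend on each other."""
    sorter = TopologicalSorter(
        {name: spec["depends_on"] for name, spec in jobs.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every job whose upstreams are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = plan_waves(JOBS)
for index, wave in enumerate(waves):  # outer loop: waves run strictly in order
    for name in wave:  # inner loop: jobs in a wave could run concurrently
        print(f"wave {index}: {name} -> {JOBS[name]['engine']}")
```

Because the plan is derived entirely from metadata, adding a job or changing its compute target in this sketch means editing one config entry rather than the orchestration code.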
Results
- 94% Cost Reduction: Slashed monthly execution costs from over $350K to $20K.
- Performance Gains: Reduced processing times from hours to minutes (and seconds for many individual jobs).
- Improved Reliability: Dramatically lowered failure rates by moving to serverless, managed compute.
- Enhanced Observability: Streamlined monitoring and simplified the architecture, significantly reducing maintenance costs and resolution times.
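
As a quick check of the headline figure, the reduction can be recomputed from the two monthly costs quoted above. The `percent_reduction` helper is a hypothetical name used only for this illustration.

```python
def percent_reduction(before, after):
    """Return the relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Monthly cost figures taken from the text: $350K before, $20K after.
reduction = percent_reduction(350_000, 20_000)
print(f"{reduction:.1f}%")  # prints 94.3%
```

Since the legacy cost is described as exceeding $350,000, the 94% figure is a lower bound on the actual reduction.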
Technologies Used
- AWS Services: AWS Lambda, Step Functions, SQS, Glue, Athena, EMR, S3, DynamoDB, SNS.
- Languages, Frameworks, Libraries: Python, SQL, PySpark, Polars, DuckDB, Serverless.
Involvement
Led a team of three as the main architect and developer.
