Optimization of a Large Set of ETL Pipelines

Problem Overview

The legacy system managed over 700 PySpark jobs with complex execution and data-source dependencies. Orchestration was handled by AWS Step Functions, which managed the lifecycle of EMR clusters: provisioning nodes, monitoring job submission, and terminating clusters upon completion. This heavyweight architecture led to significant operational inefficiencies.

Challenges

  • High Operational Cost: Monthly execution costs exceeded $350,000.
  • SLA Instability: Prolonged execution times and frequent delays, particularly during recovery from failures.
  • Reliability Issues: Frequent failures at the EMR cluster level, at the individual job level, and from gaps in data dependencies.
  • Maintenance Overhead: Difficult root-cause analysis led to a long mean time to repair (MTTR).

Optimization Strategy

  • Modular Architecture: Re-engineered jobs into single-responsibility components to follow the "do one thing well" principle.
  • Config-Driven Design: Documented and refactored code to support a metadata-driven architecture for easier scaling.
  • Right-Sizing Compute: Identified that the majority of jobs were simple SQL transformations involving small datasets (MBs to low GBs), making EMR’s distributed overhead unnecessary.
  • "Peel the Onion" Methodology: Systematically migrated simple PySpark jobs to DuckDB running on AWS Lambda for high-performance, low-cost processing.
  • Advanced Orchestration: Refactored Step Functions to execute Lambdas with high concurrency while maintaining strict order and dependency logic (the "loop-in-loop" method).
  • Hybrid Migration: Transitioned larger-scale jobs that exceeded Lambda limits to AWS Glue to further phase out EMR dependency.
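
The config-driven design, compute right-sizing, and ordered high-concurrency orchestration described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the JobSpec fields, the 2,000 MB Lambda threshold, the engine labels, and the example jobs are all assumptions made for the sketch. It groups jobs into dependency "waves" (an outer loop over waves, an inner concurrent loop over the jobs within a wave), which is one plausible reading of the "loop-in-loop" method.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

# Hypothetical cut-off: inputs larger than this go to Glue instead of Lambda.
LAMBDA_MAX_INPUT_MB = 2_000


@dataclass(frozen=True)
class JobSpec:
    """Metadata describing one single-responsibility job (illustrative fields)."""
    name: str
    sql: str
    input_mb: int
    depends_on: tuple[str, ...] = ()


def pick_engine(job: JobSpec) -> str:
    """Right-size compute: small inputs run on Lambda + DuckDB, the rest on Glue."""
    return "lambda-duckdb" if job.input_mb <= LAMBDA_MAX_INPUT_MB else "glue"


def plan_waves(jobs: list[JobSpec]) -> list[list[dict]]:
    """Group jobs into waves: each wave depends only on earlier waves.

    Jobs inside one wave are independent and may run concurrently;
    the waves themselves must run in order.
    """
    by_name = {job.name: job for job in jobs}
    for job in jobs:
        for dep in job.depends_on:
            if dep not in by_name:
                raise ValueError(f"{job.name} depends on unknown job {dep}")

    sorter = TopologicalSorter({job.name: set(job.depends_on) for job in jobs})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic plan
        waves.append(
            [{"job": name, "engine": pick_engine(by_name[name])} for name in ready]
        )
        sorter.done(*ready)
    return waves


# Hypothetical job metadata; in practice this could live in a config store.
EXAMPLE_JOBS = [
    JobSpec("load_orders", "SELECT * FROM raw_orders", 120),
    JobSpec("load_customers", "SELECT * FROM raw_customers", 40),
    JobSpec(
        "daily_revenue",
        "SELECT order_date, SUM(amount) FROM orders GROUP BY order_date",
        160,
        depends_on=("load_orders", "load_customers"),
    ),
    JobSpec(
        "full_history_rebuild",
        "SELECT * FROM revenue_history",
        50_000,
        depends_on=("daily_revenue",),
    ),
]

if __name__ == "__main__":
    for number, wave in enumerate(plan_waves(EXAMPLE_JOBS), start=1):
        print(number, wave)
```

In Step Functions, a plan like this could be executed by running the waves sequentially and handing each wave to a Map state with a concurrency limit, so independent jobs run in parallel while dependency order is preserved.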

Results

  • 94% Cost Reduction: Slashed monthly execution costs from $350K to $20K.
  • Performance Gains: Reduced processing times from hours to minutes (and seconds for many individual jobs).
  • Improved Reliability: Dramatically lowered failure rates by moving to serverless, managed compute.
  • Enhanced Observability: Streamlined monitoring and simplified the architecture, significantly reducing maintenance costs and resolution times.

Technologies Used

  • AWS Services: Lambda, Step Functions, SQS, Glue, Athena, EMR, S3, DynamoDB, SNS.
  • Languages, Frameworks, and Libraries: Python, SQL, PySpark, DuckDB, Polars, Serverless.

Involvement

Led a team of three as the main architect and developer.