Saurabh Chhajed

Lead Data & AI Engineer

Hyderabad, Telangana, India

Professional Summary

  • Seasoned Big Data Analytics and Machine Learning Engineer with 15+ years of experience building, and leading teams that build, cloud-based and on-premise big data platforms and ML solutions, specializing in distributed processing on AWS, GCP, and other cloud platforms.
  • Good knowledge of application development in the Big Data ecosystem (Hadoop, MapReduce, Spark, Trino, Hive, Airflow) and cloud-native computing technologies (Docker, Kubernetes).
  • Hands-on experience developing web and large-scale data processing applications in Java, Scala, and Python.
  • Good knowledge of end-to-end machine learning model training, optimization, monitoring, and deployment, covering both offline and online models for recommender systems.
  • Good knowledge of algorithms, data structures, data processing frameworks, and design patterns.

Professional Experience

Lead Data/AI Engineer

Thoughtworks India Pvt Ltd, Hyderabad, India

Oct 2024 - Present

Client: Leading U.S. Home Improvement Retailer

  • Spearheaded architecture and delivery of a scalable Promotion Analytics Engine, driving campaign insights and analytics based on 20+ financial parameters.
  • Developed robust, production-grade ETL pipelines based on DBT models, combining PySpark and BigQuery for batch and near-real-time transformations.
  • Integrated LightGBM regression models via BigQuery ML to enable accurate volume forecasting.
  • Led a team of 10 engineers, overseeing design, architecture, and code reviews to ensure quality and scalability.
  • Collaborated with stakeholders across Product, Data Science, and Architecture teams.

Lead Data Engineer (LMTS)

Salesforce India Pvt Ltd, Hyderabad, India

June 2022 - Sept 2024

Salesforce's Unified Intelligence Platform (UIP) is an enterprise-scale internal data lake and analytics ecosystem, facilitating petabyte-scale data ingestion, exploration, transformation, and visualization.

AWS EMR | S3 | Airflow | Spark | Scala | Python | Kubernetes | Docker | Trino | Iceberg | Jupyter Notebooks

  • Led the architecture and development of a metadata-driven ingestion pipeline processing petabytes of data, integrating Kafka, Spark, Scala, Trino, and Airflow for scalable batch and streaming ingestion.
  • Designed GDPR-compliant data leak scanners with sampling techniques, cutting scanning costs by 40%.
  • Built a high-throughput Leak Management Pipeline that cleans leaked PII across 3,000+ record types and several petabytes of data.
  • Engineered an advanced tokenization service securing sensitive identifiers while enabling efficient analytics.
  • Developed Airflow Operators and frameworks to streamline ingestion and exploration workflows.
  • Implemented system monitoring and alerting with Grafana and PagerDuty for real-time system visibility.
  • Created a workload analytics dashboard using Apache Superset, optimizing Spark cluster resource utilization by ~20%.
  • Led a team of 5 engineers, driving design reviews, code quality initiatives, and operational improvements.

Lead Data/ML Engineer

American Express (via Impetus Technologies Inc.), Phoenix, Arizona, USA

Dec 2014 - June 2022

  • Architected and built a merchant recommender system using an ensemble of CatBoost, collaborative filtering, and Word2Vec models on Spark, enhancing Amex marketing personalization.
  • Developed end-to-end ML pipelines: feature extraction, model training, hyperparameter tuning (distributed grid search), monitoring, and deployment to online scoring systems.
  • Reduced hyperparameter tuning time by 40% through distributed Spark-based optimization.
  • Designed and deployed microservices-based model serving architecture for real-time, geo-personalized merchant recommendations.
  • Built Model/Feature Monitoring solutions to track GINI, PSI, and accuracy metrics, ensuring model health.
  • Spearheaded development of an online offer personalization engine with Hadoop, Hive, MapRDB, and Elasticsearch, improving campaign launch speed by 50%.
  • Collaborated cross-functionally with Product, Data Science, and Marketing teams using Agile/SAFe methodologies.

Application Developer

JP Morgan Chase & Co., India

Aug 2012 - Dec 2014

  • Developed a multi-cluster distributed data management platform for high-availability, low-latency processing.
  • Built real-time Search and Analytics solutions using ELK Stack (Elasticsearch, Logstash, Kibana).
  • Designed real-time order update systems using distributed caching (GemFire).
  • Evangelized code quality tools (Sonar, Jira, Crucible), improving team code health.
  • Conducted a Hadoop PoC for analyzing cross-application usage patterns.
  • Worked on messaging products, providing end-to-end integration between business-critical applications exchanging approximately 2-3 million messages daily.
  • Gained exposure to multithreading and Java performance tuning, including GC algorithms and GC tuning.

Systems Engineer

General Electric (GE) Company (TCS), India

Nov 2009 - July 2012

  • Led re-architecture of a large-scale ASP/IIS application to a Java/Spring microservices framework.
  • Designed and developed RESTful APIs consumed by multiple clients.
  • Reduced the execution time of a critical business process by 50%, saving $30,000.
  • Optimized database queries and developed complex PL/SQL procedures.

Technical Skills

Big Data Platforms

  • Spark Core/ML/Streaming
  • Trino
  • Hive
  • Hadoop
  • Apache Iceberg
  • Airflow

Cloud Platforms

  • AWS (EMR, EC2, S3, IAM, etc.)
  • GCP (BigQuery, Dataproc, Cloud Composer, Cloud Storage, etc.)

Programming

  • Java
  • Scala
  • Python
  • SQL
  • PL/SQL
  • Shell Scripting

Distributed Systems

  • Kubernetes
  • Docker
  • Microservices Architecture

Data Security

  • Tokenization
  • GDPR Compliance
  • Leak Management Pipelines

Additional Technologies

  • Elasticsearch
  • HBase
  • MapRDB
  • Pivotal GemFire
  • Apache Superset
  • DBT

DevOps & CI/CD

  • Jenkins
  • Maven
  • Gradle
  • Terraform
  • Git
  • Jira
  • HashiCorp Vault

Education

Institute of Engineering and Technology, Indore, MP

B.E. in Computer Science Engineering

2009

Top 1% of batch

Certifications & Publications

  • GCP Certified Professional Data Engineer
  • Cloudera Certified Hadoop Developer
  • MapR Certified Spark Developer
  • Authored a book on the ELK Stack (Elasticsearch, Logstash, and Kibana) for PacktPub
  • Served as technical reviewer for multiple Big Data and ML books for PacktPub

Awards & Accolades

  • Outstanding Employee of the Year among 1,500+ employees in the Big Data group
  • DZone Most Valuable Blogger (Top 5%) for contributions on web tech, big data, and open source
  • Top Performer in TCS Initial Learning Program (400+ trainees)
  • 5× TCS Gems "On the Spot" awardee for exceptional performance and client recognition