Unlocking Data Lakes with Apache Iceberg: The Open Table Format Revolutionizing Analytics

Why Apache Iceberg is the Game-Changer Your Data Lake Needs

Still relying on traditional data lakes with unreliable reads, no ACID guarantees, and painful schema evolution?

Meet Apache Iceberg, an open table format purpose-built for modern, petabyte-scale analytics. With support in engines such as Apache Spark, Trino, Presto, Hive, and Flink, Iceberg addresses these long-standing data lake problems.

In this post, we’ll break down:

What Apache Iceberg is and why it matters
How its architecture works (with a visual!)
Key features that make it future-ready
Real-world use cases
A call to action for your next data platform decision

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or object storage? The choice affects performance, scalability, fault tolerance, and cost. In this post, we’ll compare the two storage systems in technical depth and explore how each fits modern big data processing patterns in the cloud.

How to Train and Score a CatBoost Model on Spark

A practical, developer-focused guide to distributed training and inference with CatBoost on Apache Spark, including code examples and best practices.

Git: How to Split a Sub-Directory into a Separate Repository

Git: Splitting a Sub-Directory into Its Own Repository

If you regret putting a sub-directory inside a Git repository and are thinking about moving it out of the current repository into one of its own, you have come to the right place!

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

What Does a Skipped Stage Mean in the Spark Web UI?

Skipped Stages in Spark UI

You have probably seen a DAG like the one below, where a few stages are greyed out with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing. It means that the stage in the lineage DAG does not need to be re-evaluated because its output has already been computed and is still available, for example from a cached DataFrame or from shuffle output written by an earlier job. This saves the computation time for that stage. If you want to see which DataFrame was stored in the cache, check the Storage tab in the Spark UI.
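The reuse idea can be illustrated with a small toy model in plain Python. This is only a sketch of the concept and not Spark's actual scheduler or API: the `ToyScheduler` class, its `materialized` dictionary, and the stage names are all hypothetical, and the toy identifies a stage by name only.

```python
from typing import Callable, Dict, List, Tuple

Stage = Tuple[str, Callable[[List[int]], List[int]]]


class ToyScheduler:
    """Toy model of stage reuse; hypothetical, not Spark's real scheduler."""

    def __init__(self) -> None:
        # Outputs of stages that already ran (stands in for cached data
        # or shuffle files left behind by an earlier job).
        self.materialized: Dict[str, List[int]] = {}

    def run_job(self, stages: List[Stage], source: List[int]) -> Tuple[List[int], List[str]]:
        """Run stages in order and return the result plus a status per stage."""
        report: List[str] = []
        data = source
        for name, fn in stages:
            if name in self.materialized:
                # Output is already available, so the stage is not recomputed.
                data = self.materialized[name]
                report.append(f"{name} (skipped)")
            else:
                data = fn(data)
                self.materialized[name] = data
                report.append(f"{name} (ran)")
        return data, report


def double(xs: List[int]) -> List[int]:
    return [x * 2 for x in xs]


scheduler = ToyScheduler()

# First job: both stages have to run.
_, first_report = scheduler.run_job(
    [("stage0", double), ("stage1_sum", lambda xs: [sum(xs)])], [1, 2, 3]
)
print(first_report)   # ['stage0 (ran)', 'stage1_sum (ran)']

# Second job shares stage0 with the first one, so only the new stage runs.
_, second_report = scheduler.run_job(
    [("stage0", double), ("stage1_max", lambda xs: [max(xs)])], [1, 2, 3]
)
print(second_report)  # ['stage0 (skipped)', 'stage1_max (ran)']
```

In this toy, the second job reports `stage0` as skipped for the same reason the Spark UI greys a stage out: its output already exists, so recomputing it would be wasted work.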

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper to run map logic on multiple threads within a single map task and improve throughput when the map operation is not CPU-bound.

How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse

A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.

My Book on the ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, from Packt Publishing. The book is aimed at individuals and technologists who want to implement their own log and data analytics solutions using the open source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.