HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or object storage? The choice affects performance, scalability, fault tolerance, and cost. In this post, we’ll analyze both storage systems in technical depth and explore how they align with modern big data processing patterns in the cloud.

4 min read

How to Train and Score CatBoost Model on Spark

A practical, developer-focused guide to distributed training and inference with CatBoost on Apache Spark, including code examples and best practices.

2 min read

Git: How to Split a Sub-Directory into a Separate Repository

Git: Splitting Sub-Directories into Repositories

If you regret putting a sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

What Does a Skipped Stage Mean in the Spark Web UI?

Skipped Stages in Spark UI

You may have come across a DAG in the Spark UI in which a few stages are greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It’s actually a good thing. It means that this stage in the lineage DAG doesn’t need to be re-evaluated because its output has already been computed and cached, which saves the computation time for that stage. To see which DataFrame was stored in the cache for that stage, check the Storage tab in the Spark UI.

~1 min read

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, from PacktPub. The book is aimed at individuals and technologists who want to implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

What is RDD in Spark? And Why Do We Need It?

A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.

2 min read