HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud
With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns in the cloud.
How to Train and Score CatBoost Model on Spark
A practical, developer-focused guide to distributed training and inference with CatBoost on Apache Spark, including code examples and best practices.
Git: How to Split a Sub-Directory into a Separate Repository
Git: Split Sub-Directory to Repositories
If you regret putting a sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!
Spark – How to Run Spark Applications on Windows
A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
What Does Skipped Stage Mean in Spark Web UI?
Skipped Stages in Spark UI
You have probably seen a DAG in the Spark UI where a few stages are greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? Well, it’s actually a good thing. It means that the stage in the lineage DAG does not need to be re-evaluated because its output has already been computed and is still available, typically because the data was cached or because shuffle output from an earlier job can be reused. This saves the computation time for that stage. If the data was cached, you can see which DataFrame is stored in the Storage tab of the Spark UI.
DataFrame Operations in Spark using Scala
A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.
How to Use MultiThreadedMapper in MapReduce
A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
My Book on ELK Stack: Learning ELK Stack
Learning ELK Stack
I am writing this post to announce the general availability of my book on the ELK Stack, titled “Learning ELK Stack”, published by Packt. The book is aimed at individuals and technologists who want to implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK Stack.
What is RDD in Spark? And Why Do We Need It?
A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.