Big Data

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
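One common stumbling block on Windows is that Spark expects a Hadoop home directory containing winutils.exe. As a hedged illustration (not code from the post; the path is a placeholder), you can point the JVM at such a directory before creating the session:

```java
import org.apache.spark.sql.SparkSession;

public class SparkOnWindowsSketch {
    public static void main(String[] args) {
        // Illustrative path: a folder containing bin\winutils.exe is assumed to exist here.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        SparkSession spark = SparkSession.builder()
                .appName("windows-smoke-test")
                .master("local[*]")
                .getOrCreate();

        // Tiny sanity check that the local session works.
        System.out.println(spark.range(5).count());

        spark.stop();
    }
}
```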

1 min read

What does a Skipped Stage mean in the Spark Web UI?

Skipped Stages in Spark UI

You must have come across scenarios where the DAG shows a few stages greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing: it means that particular stage in the lineage DAG does not need to be re-evaluated, because its result has already been computed and cached, which saves computation time for that stage. If you want to see which DataFrame was cached for that stage, check the Storage tab in the Spark UI.
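As a minimal sketch of how skipped stages arise (illustrative code, not taken from the post; the input path and column are hypothetical), caching a DataFrame and running a second action lets Spark reuse the cached result, and the stages behind it show up as (skipped):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SkippedStagesDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("skipped-stages-demo")
                .getOrCreate();

        // Hypothetical input; any dataset with a shuffle works for the illustration.
        Dataset<Row> df = spark.read().json("events.json")
                .groupBy("userId")
                .count();

        df.cache();   // mark the result for caching
        df.count();   // first action: all stages run and the result is cached
        df.count();   // second action: stages backed by the cache appear as (skipped)

        spark.stop();
    }
}
```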

~1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

How to Use MultithreadedMapper in MapReduce

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.
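As a rough sketch of the idea (assuming Hadoop's new MapReduce API; the thread count is illustrative and MyMapper is just a placeholder), the job runs MultithreadedMapper and delegates each record to your mapper inside a thread pool:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJobSetup {

    // Minimal placeholder mapper; a real one would do CPU-heavy work per record.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new LongWritable(1L));
        }
    }

    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "multithreaded-mapper-demo");

        // MultithreadedMapper is the mapper Hadoop actually runs...
        job.setMapperClass(MultithreadedMapper.class);
        // ...and it delegates records to MyMapper inside its thread pool.
        MultithreadedMapper.setMapperClass(job, MyMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // threads per map task (illustrative)

        return job;
    }
}
```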

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

What is RDD in Spark? And why do we need it?

What is RDD in Spark? And Why Do We Need It?

A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.
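For a flavour of the API (a tiny illustrative example, not from the post), here is an RDD created and transformed through Spark's Java API in local mode:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD is an immutable, partitioned collection; transformations are lazy,
            // and the lineage lets Spark recompute lost partitions for fault tolerance.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            List<Integer> doubled = numbers.map(x -> x * 2).collect(); // action triggers execution
            System.out.println(doubled);
        }
    }
}
```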

2 min read

What is Apache HCatalog?

What is Apache HCatalog?

A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.

2 min read

Unix Job Control Commands – bg, fg, Ctrl+Z, jobs

Unix Job Control Commands: bg, fg, Ctrl+Z, jobs

A practical guide for developers and data engineers to manage long-running jobs in Unix, especially useful when working with Hadoop or other big data tools.

2 min read

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.
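A heavily condensed sketch of the overall flow (credentials, the tracked term, and the topic name are placeholders; error handling and graceful shutdown are omitted), assuming the hbc-core and kafka-clients libraries are on the classpath:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.BasicClient;
import com.twitter.hbc.httpclient.auth.OAuth1;

public class TwitterKafkaProducerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Queue that HBC fills with raw tweet JSON strings.
        BlockingQueue<String> msgQueue = new LinkedBlockingQueue<>(1000);

        StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
        endpoint.trackTerms(Collections.singletonList("bigdata")); // illustrative term

        // Placeholder credentials from the Twitter developer console.
        OAuth1 auth = new OAuth1("consumerKey", "consumerSecret", "token", "tokenSecret");

        BasicClient hbcClient = new ClientBuilder()
                .hosts(Constants.STREAM_HOST)
                .endpoint(endpoint)
                .authentication(auth)
                .processor(new StringDelimitedProcessor(msgQueue))
                .build();
        hbcClient.connect();

        // Plain Kafka producer publishing each tweet to a (hypothetical) "tweets" topic.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (!hbcClient.isDone()) {
                String tweet = msgQueue.take();
                producer.send(new ProducerRecord<>("tweets", tweet));
            }
        } finally {
            hbcClient.stop();
        }
    }
}
```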

3 min read

How-To: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar in spirit to the MapReduce framework, distributing tasks across the cluster.
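As a tiny illustrative sketch (assuming the HBase 1.x observer API; the class name, column family, and qualifier are made up), a region observer can hook into writes on the region server, for example stamping every Put before it is applied:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Runs inside the region server, so the extra cell is added right next to the data.
public class AuditStampObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Hypothetical audit column: stamp every incoming Put with a server-side timestamp.
        put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("ingested_at"),
                      Bytes.toBytes(System.currentTimeMillis()));
    }
}
```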

2 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

Hive Strict Mode

What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as scanning a partitioned table without a partition filter or running ORDER BY without a LIMIT.

~1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
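As a minimal sketch of such a JDBC client (host, port, database, and credentials are placeholders), assuming the Hive JDBC driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClientSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; connection details below are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```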

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read

How-To: Set Up Real-Time Analytics over Logs with the ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

— Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open-source tools Elasticsearch, Logstash, and Kibana. Together they provide a fully working real-time data analytics platform for extracting valuable insights from your data.

3 min read

Hadoop: Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.

3 min read

Top 10 Hadoop Shell Commands to manage HDFS

So you already know what Hadoop is, what it is used for, and what problems you can solve with it, and now you want to know how to work with files on HDFS? Don’t worry, you are in the right place.

2 min read
Back to Top ↑

Technology

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

How to Use MultithreadedMapper in MapReduce

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

Unix Job Control Commands – bg, fg, Ctrl+Z, jobs

Unix Job Control Commands: bg, fg, Ctrl+Z, jobs

A practical guide for developers and data engineers to manage long-running jobs in Unix, especially useful when working with Hadoop or other big data tools.

2 min read

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.

3 min read

How-To: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar in spirit to the MapReduce framework, distributing tasks across the cluster.

2 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read

Java: What does finalize do and how?

Understanding the finalize Method in Java

The finalize method in the Object class is often a point of discussion regarding whether it should be used at all; this post collects some important pointers on how it behaves.
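To make the discussion concrete, here is a minimal, self-contained demo of overriding finalize (illustrative only; the JVM gives no guarantee about when, or whether, the finalizer runs, and the method is deprecated in newer Java versions):

```java
public class FinalizeDemo {

    @Override
    protected void finalize() throws Throwable {
        try {
            // Called at most once by the GC before the object is reclaimed;
            // there is no guarantee about when, or whether, this ever runs.
            System.out.println("finalize() invoked for " + this);
        } finally {
            super.finalize(); // always chain to the superclass finalizer
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new FinalizeDemo();  // make an object unreachable immediately
        System.gc();         // request (not force) a garbage collection
        Thread.sleep(500);   // give the finalizer thread a chance to run
    }
}
```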

2 min read

How-To: Generate RESTful API Documentation with Swagger

Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters, and models is tightly integrated into the server code, allowing APIs to always stay in sync.
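As a small hedged sketch of that integration (annotation packages vary by Swagger version, and the resource shown here is made up), a JAX-RS endpoint annotated for Swagger might look like this:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import io.swagger.annotations.Api;
import io.swagger.annotations.ApiOperation;
import io.swagger.annotations.ApiParam;

// Swagger reads these annotations and serves a JSON description of the API,
// which Swagger UI renders as interactive documentation.
@Api(value = "users") // illustrative resource
@Path("/users")
public class UserResource {

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    @ApiOperation(value = "Find a user by id", response = String.class)
    public String getUser(@ApiParam(value = "id of the user", required = true)
                          @PathParam("id") String id) {
        return "{\"id\": \"" + id + "\"}"; // placeholder payload
    }
}
```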

2 min read

How-To: Set Up Real-Time Analytics over Logs with the ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

— Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open-source tools Elasticsearch, Logstash, and Kibana. Together they provide a fully working real-time data analytics platform for extracting valuable insights from your data.

3 min read

Hadoop: Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.

3 min read

Top 10 Hadoop Shell Commands to manage HDFS

So you already know what Hadoop is, what it is used for, and what problems you can solve with it, and now you want to know how to work with files on HDFS? Don’t worry, you are in the right place.

2 min read
Back to Top ↑

Java

How to Use MultithreadedMapper in MapReduce

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

What is Apache HCatalog?

What is Apache HCatalog?

A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.

2 min read

HashMap Performance Improvements in Java 8

HashMap Performance Improvements in Java 8

A developer-focused look at how Java 8 improved the performance of HashMap under high-collision scenarios, with code examples and practical explanations.
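As an illustrative (hypothetical) micro-benchmark of the behaviour in question: a key type whose hashCode always collides highlights the difference, because Java 8 treeifies overflowing buckets when keys are Comparable:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {

    // Deliberately terrible hashCode: every key lands in the same bucket.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) { return Integer.compare(id, other.id); }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), i);
        }
        // On Java 7 this degrades to a long linked-list scan per operation; on Java 8+
        // the overflowing bucket becomes a red-black tree, keeping it O(log n) even
        // though every key collides.
        System.out.println(map.get(new BadKey(99_999)) + " in "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```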

2 min read

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.

3 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

Hive Strict Mode

What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as scanning a partitioned table without a partition filter or running ORDER BY without a LIMIT.

~1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read

Java: What does finalize do and how?

Understanding the finalize Method in Java

The finalize method in the Object class is often a point of discussion regarding whether it should be used at all; this post collects some important pointers on how it behaves.

2 min read

How-To: Generate RESTful API Documentation with Swagger

Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters, and models is tightly integrated into the server code, allowing APIs to always stay in sync.

2 min read

How-To: Set Up Real-Time Analytics over Logs with the ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

— Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open-source tools Elasticsearch, Logstash, and Kibana. Together they provide a fully working real-time data analytics platform for extracting valuable insights from your data.

3 min read

Hadoop: Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.

3 min read
Back to Top ↑

Hive

What is Apache HCatalog?

What is Apache HCatalog?

A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.

2 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

Hive Strict Mode

What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as scanning a partitioned table without a partition filter or running ORDER BY without a LIMIT.

~1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read
Back to Top ↑

Spark

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

What does a Skipped Stage mean in the Spark Web UI?

Skipped Stages in Spark UI

You must have come across scenarios where the DAG shows a few stages greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing: it means that particular stage in the lineage DAG does not need to be re-evaluated, because its result has already been computed and cached, which saves computation time for that stage. If you want to see which DataFrame was cached for that stage, check the Storage tab in the Spark UI.

~1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

What is RDD in Spark? And why do we need it?

What is RDD in Spark? And Why Do We Need It?

A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.

2 min read
Back to Top ↑

Scala

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

What does a Skipped Stage mean in the Spark Web UI?

Skipped Stages in Spark UI

You must have come across scenarios where the DAG shows a few stages greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing: it means that particular stage in the lineage DAG does not need to be re-evaluated, because its result has already been computed and cached, which saves computation time for that stage. If you want to see which DataFrame was cached for that stage, check the Storage tab in the Spark UI.

~1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read
Back to Top ↑

Kafka

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.

3 min read
Back to Top ↑

Security

Back to Top ↑

HBase

How-To: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar in spirit to the MapReduce framework, distributing tasks across the cluster.

2 min read
Back to Top ↑

Git

Git: How to Split a Sub-Directory into a Separate Repository

Git: Split a Sub-Directory into a Separate Repository

If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read
Back to Top ↑

Repository

Git: How to Split a Sub-Directory into a Separate Repository

Git: Split a Sub-Directory into a Separate Repository

If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read
Back to Top ↑

Tools

Git: How to Split a Sub-Directory into a Separate Repository

Git: Split a Sub-Directory into a Separate Repository

If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read
Back to Top ↑

big-data

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.

4 min read
Back to Top ↑

cloud

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.

4 min read
Back to Top ↑

storage

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.

4 min read
Back to Top ↑