Big Data

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
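One common stumbling block on Windows is that Spark expects a Hadoop home directory containing winutils.exe. As a hedged illustration (not code from the post; the path is a placeholder), you can point the JVM at such a directory before creating the session:

```java
import org.apache.spark.sql.SparkSession;

public class SparkOnWindowsSketch {
    public static void main(String[] args) {
        // Illustrative path: a folder containing bin\winutils.exe is assumed to exist here.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");

        SparkSession spark = SparkSession.builder()
                .appName("windows-smoke-test")
                .master("local[*]")
                .getOrCreate();

        // Tiny sanity check that the local session works.
        System.out.println(spark.range(5).count());

        spark.stop();
    }
}
```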

1 min read

What does a Skipped Stage mean in the Spark Web UI?

Skipped Stages in Spark UI

You must have come across scenarios where the DAG shows a few stages greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing: it means that particular stage in the lineage DAG does not need to be re-evaluated, because its result has already been computed and cached, which saves computation time for that stage. If you want to see which DataFrame was cached for that stage, check the Storage tab in the Spark UI.
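As a minimal sketch of how skipped stages arise (illustrative code, not taken from the post; the input path and column are hypothetical), caching a DataFrame and running a second action lets Spark reuse the cached result, and the stages behind it show up as (skipped):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SkippedStagesDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("skipped-stages-demo")
                .getOrCreate();

        // Hypothetical input; any dataset with a shuffle works for the illustration.
        Dataset<Row> df = spark.read().json("events.json")
                .groupBy("userId")
                .count();

        df.cache();   // mark the result for caching
        df.count();   // first action: all stages run and the result is cached
        df.count();   // second action: stages backed by the cache appear as (skipped)

        spark.stop();
    }
}
```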

~1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

How to Use MultithreadedMapper in MapReduce

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.
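As a rough sketch of the idea (assuming Hadoop's new MapReduce API; the thread count is illustrative and MyMapper is just a placeholder), the job runs MultithreadedMapper and delegates each record to your mapper inside a thread pool:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedJobSetup {

    // Minimal placeholder mapper; a real one would do CPU-heavy work per record.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, new LongWritable(1L));
        }
    }

    public static Job configure(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "multithreaded-mapper-demo");

        // MultithreadedMapper is the mapper Hadoop actually runs...
        job.setMapperClass(MultithreadedMapper.class);
        // ...and it delegates records to MyMapper inside its thread pool.
        MultithreadedMapper.setMapperClass(job, MyMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 8); // threads per map task (illustrative)

        return job;
    }
}
```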

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

What is RDD in Spark? And why do we need it?

What is RDD in Spark? And Why Do We Need It?

A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.
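For a flavour of the API (a tiny illustrative example, not from the post), here is an RDD created and transformed through Spark's Java API in local mode:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD is an immutable, partitioned collection; transformations are lazy,
            // and the lineage lets Spark recompute lost partitions for fault tolerance.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
            List<Integer> doubled = numbers.map(x -> x * 2).collect(); // action triggers execution
            System.out.println(doubled);
        }
    }
}
```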

2 min read

What is Apache HCatalog?

What is Apache HCatalog?

A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.

2 min read

Unix Job Control Commands – bg, fg, Ctrl+Z, jobs

Unix Job Control Commands: bg, fg, Ctrl+Z, jobs

A practical guide for developers and data engineers to manage long-running jobs in Unix, especially useful when working with Hadoop or other big data tools.

2 min read

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.
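A heavily condensed sketch of the overall flow (credentials, the tracked term, and the topic name are placeholders; error handling and graceful shutdown are omitted), assuming the hbc-core and kafka-clients libraries are on the classpath:

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import com.twitter.hbc.ClientBuilder;
import com.twitter.hbc.core.Constants;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;
import com.twitter.hbc.core.processor.StringDelimitedProcessor;
import com.twitter.hbc.httpclient.BasicClient;
import com.twitter.hbc.httpclient.auth.OAuth1;

public class TwitterKafkaProducerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Queue that HBC fills with raw tweet JSON strings.
        BlockingQueue<String> msgQueue = new LinkedBlockingQueue<>(1000);

        StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
        endpoint.trackTerms(Collections.singletonList("bigdata")); // illustrative term

        // Placeholder credentials from the Twitter developer console.
        OAuth1 auth = new OAuth1("consumerKey", "consumerSecret", "token", "tokenSecret");

        BasicClient hbcClient = new ClientBuilder()
                .hosts(Constants.STREAM_HOST)
                .endpoint(endpoint)
                .authentication(auth)
                .processor(new StringDelimitedProcessor(msgQueue))
                .build();
        hbcClient.connect();

        // Plain Kafka producer publishing each tweet to a (hypothetical) "tweets" topic.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (!hbcClient.isDone()) {
                String tweet = msgQueue.take();
                producer.send(new ProducerRecord<>("tweets", tweet));
            }
        } finally {
            hbcClient.stop();
        }
    }
}
```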

3 min read

How-To: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar in spirit to the MapReduce framework, distributing tasks across the cluster.
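As a tiny illustrative sketch (assuming the HBase 1.x observer API; the class name, column family, and qualifier are made up), a region observer can hook into writes on the region server, for example stamping every Put before it is applied:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Runs inside the region server, so the extra cell is added right next to the data.
public class AuditStampObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // Hypothetical audit column: stamp every incoming Put with a server-side timestamp.
        put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("ingested_at"),
                      Bytes.toBytes(System.currentTimeMillis()));
    }
}
```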

2 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

Hive Strict Mode

What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as scanning a partitioned table without a partition filter or running ORDER BY without a LIMIT.

~1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
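As a minimal sketch of such a JDBC client (host, port, database, and credentials are placeholders), assuming the Hive JDBC driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClientSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; connection details below are illustrative.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```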

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read

How-To: Set Up Real-Time Analytics over Logs with the ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

— Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open-source tools Elasticsearch, Logstash, and Kibana. Together they provide a fully working real-time data analytics platform for extracting valuable insights from your data.

3 min read

Hadoop: Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.

3 min read

Top 10 Hadoop Shell Commands to manage HDFS

So you already know what Hadoop is, what it is used for, and what problems you can solve with it, and now you want to know how to work with files on HDFS? Don’t worry, you are in the right place.

2 min read
Back to Top ↑

Technology

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

How to Use MultithreadedMapper in MapReduce

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

Unix Job Control Commands – bg, fg, Ctrl+Z, jobs

Unix Job Control Commands: bg, fg, Ctrl+Z, jobs

A practical guide for developers and data engineers to manage long-running jobs in Unix, especially useful when working with Hadoop or other big data tools.

2 min read

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.

3 min read

How-To: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar in spirit to the MapReduce framework, distributing tasks across the cluster.

2 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read

Java: What does finalize do and how?

Understanding the finalize Method in Java

The finalize method in the Object class is often a point of discussion regarding whether it should be used at all; this post collects some important pointers on how it behaves.
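To make the discussion concrete, here is a minimal, self-contained demo of overriding finalize (illustrative only; the JVM gives no guarantee about when, or whether, the finalizer runs, and the method is deprecated in newer Java versions):

```java
public class FinalizeDemo {

    @Override
    protected void finalize() throws Throwable {
        try {
            // Called at most once by the GC before the object is reclaimed;
            // there is no guarantee about when, or whether, this ever runs.
            System.out.println("finalize() invoked for " + this);
        } finally {
            super.finalize(); // always chain to the superclass finalizer
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new FinalizeDemo();  // make an object unreachable immediately
        System.gc();         // request (not force) a garbage collection
        Thread.sleep(500);   // give the finalizer thread a chance to run
    }
}
```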

2 min read

How-To: Generate RESTful API Documentation with Swagger

Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters, and models is tightly integrated into the server code, allowing APIs to always stay in sync.
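As a small hedged sketch of that integration (annotation packages vary by Swagger version, and the resource shown here is made up), a JAX-RS endpoint annotated for Swagger might look like this:

```java
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import io.swagger.annotations.Api;
import io.swagger.annotations.ApiOperation;
import io.swagger.annotations.ApiParam;

// Swagger reads these annotations and serves a JSON description of the API,
// which Swagger UI renders as interactive documentation.
@Api(value = "users") // illustrative resource
@Path("/users")
public class UserResource {

    @GET
    @Path("/{id}")
    @Produces(MediaType.APPLICATION_JSON)
    @ApiOperation(value = "Find a user by id", response = String.class)
    public String getUser(@ApiParam(value = "id of the user", required = true)
                          @PathParam("id") String id) {
        return "{\"id\": \"" + id + "\"}"; // placeholder payload
    }
}
```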

2 min read

How-To: Set Up Real-Time Analytics over Logs with the ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

— Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open-source tools Elasticsearch, Logstash, and Kibana. Together they provide a fully working real-time data analytics platform for extracting valuable insights from your data.

3 min read

Hadoop: Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.

3 min read

Top 10 Hadoop Shell Commands to manage HDFS

So you already know what Hadoop is, what it is used for, and what problems you can solve with it, and now you want to know how to work with files on HDFS? Don’t worry, you are in the right place.

2 min read
Back to Top ↑

Java

How to Use MultithreadedMapper in MapReduce

How to Use MultithreadedMapper in MapReduce

A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.

2 min read

My Book on ELK Stack: Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

1 min read

What is Apache HCatalog?

What is Apache HCatalog?

A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.

2 min read

HashMap Performance Improvements in Java 8

HashMap Performance Improvements in Java 8

A developer-focused look at how Java 8 improved the performance of HashMap under high-collision scenarios, with code examples and practical explanations.
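As an illustrative (hypothetical) micro-benchmark of the behaviour in question: a key type whose hashCode always collides highlights the difference, because Java 8 treeifies overflowing buckets when keys are Comparable:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {

    // Deliberately terrible hashCode: every key lands in the same bucket.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) { return Integer.compare(id, other.id); }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), i);
        }
        // On Java 7 this degrades to a long linked-list scan per operation; on Java 8+
        // the overflowing bucket becomes a red-black tree, keeping it O(log n) even
        // though every key collides.
        System.out.println(map.get(new BadKey(99_999)) + " in "
                + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```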

2 min read

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.

3 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

Hive Strict Mode

What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as scanning a partitioned table without a partition filter or running ORDER BY without a LIMIT.

~1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read

Java: What does finalize do and how?

Understanding the finalize Method in Java

The finalize method in the Object class is often a point of discussion regarding whether it should be used at all; this post collects some important pointers on how it behaves.

2 min read

How-To: Generate RESTful API Documentation with Swagger

Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters, and models is tightly integrated into the server code, allowing APIs to always stay in sync.

2 min read

How-To: Set Up Real-Time Analytics over Logs with the ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

— Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open-source tools Elasticsearch, Logstash, and Kibana. Together they provide a fully working real-time data analytics platform for extracting valuable insights from your data.

3 min read

Hadoop: Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.

3 min read
Back to Top ↑

Hive

What is Apache HCatalog?

What is Apache HCatalog?

A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.

2 min read

Hive: SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY, and CLUSTER BY behave differently in Hive. Let's get started.

3 min read

How-To: Use HCatalog with Pig

Using HCatalog with Pig

This post is a step-by-step guide to running HCatalog and using it with Apache Pig.

1 min read

Hive Strict Mode

What is Hive Strict Mode?

Hive Strict Mode (hive.mapred.mode=strict) lets Hive restrict certain performance-intensive operations, such as scanning a partitioned table without a partition filter or running ORDER BY without a LIMIT.

~1 min read

How-To: Connect HiveServer2 Service with a JDBC Client

HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.

3 min read

How-To: Configure MySQL Metastore for Hive

How to Configure MySQL Metastore for Hive

Hive ships with Derby as its default metastore storage, which is suitable only for testing; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide to configuring a MySQL metastore for Hive in place of the default Derby metastore.

2 min read
Back to Top ↑

Spark

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

What does a Skipped Stage mean in the Spark Web UI?

Skipped Stages in Spark UI

You must have come across scenarios where the DAG shows a few stages greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing: it means that particular stage in the lineage DAG does not need to be re-evaluated, because its result has already been computed and cached, which saves computation time for that stage. If you want to see which DataFrame was cached for that stage, check the Storage tab in the Spark UI.

~1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read

What is RDD in Spark? And why do we need it?

What is RDD in Spark? And Why Do We Need It?

A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.

2 min read
Back to Top ↑

Scala

Spark – How to Run Spark Applications on Windows

Spark – How to Run Spark Applications on Windows

A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.

1 min read

What does a Skipped Stage mean in the Spark Web UI?

Skipped Stages in Spark UI

You must have come across scenarios where the DAG shows a few stages greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages because of an error, or is something else going on? It is actually a good thing: it means that particular stage in the lineage DAG does not need to be re-evaluated, because its result has already been computed and cached, which saves computation time for that stage. If you want to see which DataFrame was cached for that stage, check the Storage tab in the Spark UI.

~1 min read

DataFrame Operations in Spark using Scala

DataFrame Operations in Spark using Scala

A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.

2 min read
Back to Top ↑

Kafka

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.

3 min read
Back to Top ↑

Security

Back to Top ↑

HBase

How-To: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar in spirit to the MapReduce framework, distributing tasks across the cluster.

2 min read
Back to Top ↑

Git

Git: How to Split a Sub-Directory into a Separate Repository

Git: Split a Sub-Directory into a Separate Repository

If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read
Back to Top ↑

Repository

Git: How to Split a Sub-Directory into a Separate Repository

Git: Split a Sub-Directory into a Separate Repository

If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read
Back to Top ↑

Tools

Git: How to Split a Sub-Directory into a Separate Repository

Git: Split a Sub-Directory into a Separate Repository

If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!

2 min read
Back to Top ↑

big-data

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.

4 min read
Back to Top ↑

cloud

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.

4 min read
Back to Top ↑

storage

HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud

With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.

4 min read
Back to Top ↑