- Big Data 20
- Technology 20
- Java 19
- Hive 6
- Spark 5
- Scala 4
- Kafka 2
- Security 1
- HBase 1
- Git 1
- Repository 1
- Tools 1
- big-data 1
- cloud 1
- storage 1
Big Data
Spark – How to Run Spark Applications on Windows
Spark – How to Run Spark Applications on Windows
A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
What does Skipped Stage mean in Spark WebUI ?
Skipped Stages in Spark UI
You must have come across scenarios in the Spark UI where a few stages in the DAG show as greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages due to an error, or is something else going on? Well, it’s actually a good thing. It means that particular stage in the lineage DAG doesn’t need to be re-evaluated, because it has already been evaluated and cached. This saves computation time for that stage. If you want to see which DataFrame for that stage was stored in the cache, check the Storage tab in the Spark UI.
DataFrame Operations in Spark using Scala
DataFrame Operations in Spark using Scala
A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.
How to Use MultiThreadedMapper in MapReduce
How to Use MultiThreadedMapper in MapReduce
A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
My Book on ELK Stack : Learning ELK Stack
Learning ELK Stack
I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.
What is RDD in Spark ? and Why do we need it ?
What is RDD in Spark? And Why Do We Need It?
A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.
What is Apache HCatalog ?
What is Apache HCatalog?
A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.
Unix Job Control Commands – bg, fg, Ctrl+Z, jobs
Unix Job Control Commands: bg, fg, Ctrl+Z, jobs
A practical guide for developers and data engineers to manage long-running jobs in Unix, especially useful when working with Hadoop or other big data tools.
How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)
Integrate Kafka with HDFS using Camus (Twitter Stream Example)
A step-by-step guide to building a Kafka-to-HDFS data pipeline using Camus and a Twitter stream. This guide is aimed at developers looking for a practical, detailed walkthrough.
How-To : Write a Kafka Producer using Twitter Stream ( Twitter HBC Client)
How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)
A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.
How-to : Write a CoProcessor in HBase
What is a Coprocessor in HBase?
A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar to the MapReduce framework in the way it distributes tasks across the cluster.
Hive : SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY
In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive. Let’s get started.
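As a quick preview of the difference, here are illustrative HQL queries (the employees table and its columns are hypothetical):

```sql
-- ORDER BY: total ordering across all output, funneled through a single reducer (slow on large data)
SELECT * FROM employees ORDER BY salary DESC;

-- SORT BY: sorts within each reducer only; global order is not guaranteed
SELECT * FROM employees SORT BY salary DESC;

-- DISTRIBUTE BY: routes rows with the same dept_id to the same reducer, without sorting them
SELECT * FROM employees DISTRIBUTE BY dept_id;

-- CLUSTER BY: shorthand for DISTRIBUTE BY + SORT BY on the same column (ascending only)
SELECT * FROM employees CLUSTER BY dept_id;
```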
How-To : Use HCatalog with Pig
Using HCatalog with Pig
This post is a step-by-step guide on running HCatalog and using HCatalog with Apache Pig:
Hive Strict Mode
![Sort By vs Order By vs Group By vs Cluster By in Hive](/assets/uploads/2015/01/images.jpg)
What is Hive Strict Mode?
Hive Strict Mode (hive.mapred.mode=strict) enables Hive to restrict certain performance-intensive operations, such as:
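For illustration, here is a sketch of the kinds of statements strict mode rejects (the table names are hypothetical; this property applies to older Hive versions):

```sql
-- Configuration fragment: enable strict mode
SET hive.mapred.mode=strict;

-- Rejected: scanning a partitioned table without a partition filter
SELECT * FROM sales;

-- Rejected: ORDER BY without LIMIT (forces a single reducer over all data)
SELECT * FROM sales WHERE dt = '2015-01-01' ORDER BY amount;

-- Rejected: cartesian product (JOIN without an ON condition)
SELECT * FROM sales JOIN customers;
```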
How-To : Connect HiveServer2 service with JDBC Client ?
HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
How-To : Configure MySQL Metastore for Hive ?
How to Configure MySQL Metastore for Hive
Hive ships with Derby as its default metastore storage, which is suited only for testing purposes; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.
How-To : Setup Realtime Analytics over Logs with ELK Stack : Elasticsearch, Logstash, Kibana?
Once we know something, we find it hard to imagine what it was like not to know it.
— Chip & Dan Heath, Authors of Made to Stick, Switch
What is the ELK stack?
The ELK stack consists of the open source tools Elasticsearch, Logstash, and Kibana. Together, they provide a fully working real-time data analytics platform for extracting meaningful insights from your data.
Hadoop : Getting Started with Pig
What is Apache Pig?
Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.
Top 10 Hadoop Shell Commands to manage HDFS
So you already know what Hadoop is, what it is used for, and what problems you can solve with it, and now you want to know how to deal with files on HDFS? Don’t worry, you are in the right place.
Technology
Spark – How to Run Spark Applications on Windows
Spark – How to Run Spark Applications on Windows
A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
DataFrame Operations in Spark using Scala
DataFrame Operations in Spark using Scala
A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.
How to Use MultiThreadedMapper in MapReduce
How to Use MultiThreadedMapper in MapReduce
A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
My Book on ELK Stack : Learning ELK Stack
Learning ELK Stack
I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.
Unix Job Control Commands – bg, fg, Ctrl+Z, jobs
Unix Job Control Commands: bg, fg, Ctrl+Z, jobs
A practical guide for developers and data engineers to manage long-running jobs in Unix, especially useful when working with Hadoop or other big data tools.
How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)
Integrate Kafka with HDFS using Camus (Twitter Stream Example)
A step-by-step guide to building a Kafka-to-HDFS data pipeline using Camus and a Twitter stream. This guide is aimed at developers looking for a practical, detailed walkthrough.
How-To : Write a Kafka Producer using Twitter Stream ( Twitter HBC Client)
How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)
A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.
How-to : Write a CoProcessor in HBase
What is a Coprocessor in HBase?
A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar to the MapReduce framework in the way it distributes tasks across the cluster.
Hive : SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY
In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive. Let’s get started.
How-To : Use HCatalog with Pig
Using HCatalog with Pig
This post is a step-by-step guide on running HCatalog and using HCatalog with Apache Pig:
How-To : Connect HiveServer2 service with JDBC Client ?
HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
How-To : Configure MySQL Metastore for Hive ?
How to Configure MySQL Metastore for Hive
Hive ships with Derby as its default metastore storage, which is suited only for testing purposes; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.
Java : What does finalize do and How?
Understanding the finalize Method in Java
The finalize method in the Object class is often a point of discussion regarding whether it should be used at all. Below are some important pointers on the finalize method:
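As a minimal, illustrative sketch (the class name is hypothetical, and note that finalize has been deprecated since Java 9):

```java
// Overriding finalize: the JVM gives no guarantee when, or even if, it will run.
public class FinalizeDemo {
    static volatile boolean finalized = false;

    @Override
    protected void finalize() throws Throwable {
        try {
            finalized = true;   // cleanup hook; timing is entirely up to the garbage collector
        } finally {
            super.finalize();   // always chain to the superclass finalizer
        }
    }

    public static void main(String[] args) throws InterruptedException {
        new FinalizeDemo();     // the object becomes unreachable immediately
        System.gc();            // a request, not a command: collection is not guaranteed
        Thread.sleep(100);      // give a potential finalizer thread a moment to run
        System.out.println("finalized: " + finalized);  // may print true or false
    }
}
```

Because the finalizer may never run, resource cleanup is better handled with try-with-resources and AutoCloseable.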
How-To : Generate Restful API Documentation with Swagger ?
Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters, and models is tightly integrated into the server code, allowing APIs to always stay in sync.
How-To : Setup Realtime Analytics over Logs with ELK Stack : Elasticsearch, Logstash, Kibana?
Once we know something, we find it hard to imagine what it was like not to know it.
— Chip & Dan Heath, Authors of Made to Stick, Switch
What is the ELK stack?
The ELK stack consists of the open source tools Elasticsearch, Logstash, and Kibana. Together, they provide a fully working real-time data analytics platform for extracting meaningful insights from your data.
Hadoop : Getting Started with Pig
What is Apache Pig?
Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.
String Interning – What, Why and When ?
What is String Interning?
String interning is a method of storing only one copy of each distinct string value, which must be immutable.
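As a quick sketch of the idea (class name and string values are illustrative):

```java
// String interning: equal contents map onto one pooled object.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("hadoop");      // explicitly allocates a new heap object
        String b = "hadoop";                  // string literal, served from the intern pool
        System.out.println(a == b);           // false: two distinct objects
        System.out.println(a.intern() == b);  // true: intern() returns the pooled copy
        System.out.println(a.equals(b));      // true: same character content either way
    }
}
```

Interning trades a pool lookup for memory savings and fast reference comparison, which is why the pooled values must be immutable.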
SOAP Webservices Using Apache CXF : Adding Custom Object as Header in Outgoing Requests
What is Apache CXF?
Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols such as SOAP, XML/HTTP, RESTful HTTP, or CORBA and work over a variety of transports such as HTTP, JMS etc.
Top 10 Hadoop Shell Commands to manage HDFS
So you already know what Hadoop is, what it is used for, and what problems you can solve with it, and now you want to know how to deal with files on HDFS? Don’t worry, you are in the right place.
Java
How to Use MultiThreadedMapper in MapReduce
How to Use MultiThreadedMapper in MapReduce
A practical, developer-focused guide to using Hadoop’s MultithreadedMapper for parallelizing map tasks and improving performance in CPU-bound jobs.
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
My Book on ELK Stack : Learning ELK Stack
Learning ELK Stack
I am writing this post to announce the general availability of my book on the ELK stack, titled “Learning ELK Stack”, published by Packt Publishing. The book aims to help individuals and technologists implement their own log and data analytics solutions using the open source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.
What is Apache HCatalog ?
What is Apache HCatalog?
A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.
HashMap Performance Improvements in Java 8
HashMap Performance Improvements in Java 8
A developer-focused look at how Java 8 improved the performance of HashMap under high-collision scenarios, with code examples and practical explanations.
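To make the high-collision scenario concrete, here is a small sketch using a hypothetical BadKey class whose hashCode always collides; on Java 8+, a bucket this degenerate is converted to a balanced tree internally, so lookups stay correct and reasonably fast:

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Hypothetical key whose hashCode always collides, forcing every entry into one bucket.
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }  // constant hash: worst case
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            // pre-Java-8: O(n) linked-list bucket; Java 8+: O(log n) red-black tree
            map.put(new BadKey(i), i);
        }
        System.out.println(map.get(new BadKey(500)));  // 500
        System.out.println(map.size());                // 1000
    }
}
```

The map behaves identically from the outside; only the per-bucket data structure changes once the collision chain grows past a threshold.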
How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)
Integrate Kafka with HDFS using Camus (Twitter Stream Example)
A step-by-step guide to building a Kafka-to-HDFS data pipeline using Camus and a Twitter stream. This guide is aimed at developers looking for a practical, detailed walkthrough.
How-To : Write a Kafka Producer using Twitter Stream ( Twitter HBC Client)
How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)
A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.
Hive : SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY
In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive. Let’s get started.
How-To : Use HCatalog with Pig
Using HCatalog with Pig
This post is a step-by-step guide on running HCatalog and using HCatalog with Apache Pig:
Hive Strict Mode
![Sort By vs Order By vs Group By vs Cluster By in Hive](/assets/uploads/2015/01/images.jpg)
What is Hive Strict Mode?
Hive Strict Mode (hive.mapred.mode=strict) enables Hive to restrict certain performance-intensive operations, such as:
How-To : Connect HiveServer2 service with JDBC Client ?
HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
How-To : Configure MySQL Metastore for Hive ?
How to Configure MySQL Metastore for Hive
Hive ships with Derby as its default metastore storage, which is suited only for testing purposes; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.
What is POODLE vulnerability and How does it affect you ?
What is POODLE?
It stands for “Padding Oracle On Downgraded Legacy Encryption”: a protocol downgrade attack that allows exploits against an outdated form of encryption. It was first explained in a Google Security Advisory.
Java : What does finalize do and How?
Understanding the finalize Method in Java
The finalize method in the Object class is often a point of discussion regarding whether it should be used at all. Below are some important pointers on the finalize method:
How-To : Generate Restful API Documentation with Swagger ?
Swagger is a specification and complete framework implementation for describing, producing, consuming, and visualizing RESTful web services. The goal of Swagger is to enable client and documentation systems to update at the same pace as the server. The documentation of methods, parameters, and models is tightly integrated into the server code, allowing APIs to always stay in sync.
How-To : Setup Realtime Analytics over Logs with ELK Stack : Elasticsearch, Logstash, Kibana?
Once we know something, we find it hard to imagine what it was like not to know it.
— Chip & Dan Heath, Authors of Made to Stick, Switch
What is the ELK stack?
The ELK stack consists of the open source tools Elasticsearch, Logstash, and Kibana. Together, they provide a fully working real-time data analytics platform for extracting meaningful insights from your data.
Hadoop : Getting Started with Pig
What is Apache Pig?
Apache Pig is a high-level scripting platform used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language, called Pig Latin, appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs that run on data stored in HDFS.
String Interning – What, Why and When ?
What is String Interning?
String interning is a method of storing only one copy of each distinct string value, which must be immutable.
SOAP Webservices Using Apache CXF : Adding Custom Object as Header in Outgoing Requests
What is Apache CXF?
Apache CXF is an open source services framework. CXF helps you build and develop services using frontend programming APIs, like JAX-WS and JAX-RS. These services can speak a variety of protocols such as SOAP, XML/HTTP, RESTful HTTP, or CORBA and work over a variety of transports such as HTTP, JMS etc.
Hive
What is Apache HCatalog ?
What is Apache HCatalog?
A developer-friendly introduction to Apache HCatalog, its architecture, features, and how it fits into the Hadoop ecosystem.
Hive : SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY
In Apache Hive (HQL), you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive. Let’s get started.
How-To : Use HCatalog with Pig
Using HCatalog with Pig
This post is a step-by-step guide on running HCatalog and using HCatalog with Apache Pig:
Hive Strict Mode
![Sort By vs Order By vs Group By vs Cluster By in Hive](/assets/uploads/2015/01/images.jpg)
What is Hive Strict Mode?
Hive Strict Mode (hive.mapred.mode=strict) enables Hive to restrict certain performance-intensive operations, such as:
How-To : Connect HiveServer2 service with JDBC Client ?
HiveServer2 (HS2) is a server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
How-To : Configure MySQL Metastore for Hive ?
How to Configure MySQL Metastore for Hive
Hive ships with Derby as its default metastore storage, which is suited only for testing purposes; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.
Spark
Spark – How to Run Spark Applications on Windows
Spark – How to Run Spark Applications on Windows
A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
What does Skipped Stage mean in Spark WebUI ?
Skipped Stages in Spark UI
You must have come across scenarios in the Spark UI where a few stages in the DAG show as greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages due to an error, or is something else going on? Well, it’s actually a good thing. It means that particular stage in the lineage DAG doesn’t need to be re-evaluated, because it has already been evaluated and cached. This saves computation time for that stage. If you want to see which DataFrame for that stage was stored in the cache, check the Storage tab in the Spark UI.
DataFrame Operations in Spark using Scala
DataFrame Operations in Spark using Scala
A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
What is RDD in Spark ? and Why do we need it ?
What is RDD in Spark? And Why Do We Need It?
A developer-friendly guide to understanding Resilient Distributed Datasets (RDDs) in Apache Spark, their properties, and why they are fundamental for fast, fault-tolerant distributed computing.
Scala
Spark – How to Run Spark Applications on Windows
Spark – How to Run Spark Applications on Windows
A step-by-step, developer-friendly guide to running Apache Spark applications on Windows, including configuration, environment setup, and troubleshooting tips.
What does Skipped Stage mean in Spark WebUI ?
Skipped Stages in Spark UI
You must have come across scenarios in the Spark UI where a few stages in the DAG show as greyed out, with the text (skipped) after the stage name. What does this mean? Did Spark ignore one of your stages due to an error, or is something else going on? Well, it’s actually a good thing. It means that particular stage in the lineage DAG doesn’t need to be re-evaluated, because it has already been evaluated and cached. This saves computation time for that stage. If you want to see which DataFrame for that stage was stored in the cache, check the Storage tab in the Spark UI.
DataFrame Operations in Spark using Scala
DataFrame Operations in Spark using Scala
A comprehensive, developer-friendly guide to common DataFrame operations in Apache Spark using Scala, with code examples and explanations for each join type.
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
Kafka
How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)
Integrate Kafka with HDFS using Camus (Twitter Stream Example)
A step-by-step guide to building a Kafka-to-HDFS data pipeline using Camus and a Twitter stream. This guide is aimed at developers looking for a practical, detailed walkthrough.
How-To : Write a Kafka Producer using Twitter Stream ( Twitter HBC Client)
How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)
A step-by-step guide to building a Kafka producer that streams live tweets using Twitter’s Hosebird Client (HBC) and publishes them to a Kafka topic. This is a practical, developer-focused walkthrough with code, configuration, and troubleshooting tips.
Security
What is POODLE vulnerability and How does it affect you ?
What is POODLE?
It stands for “Padding Oracle On Downgraded Legacy Encryption”: a protocol downgrade attack that allows exploits against an outdated form of encryption. It was first explained in a Google Security Advisory.
HBase
How-to : Write a CoProcessor in HBase
What is a Coprocessor in HBase?
A coprocessor is a mechanism that moves computation closer to the data in HBase. It is similar to the MapReduce framework in the way it distributes tasks across the cluster.
Git
Git : How to Split Sub-Directory to Separate Repository
Git: Split a Sub-Directory into Its Own Repository
If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!
Repository
Git : How to Split Sub-Directory to Separate Repository
Git: Split a Sub-Directory into Its Own Repository
If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!
Tools
Git : How to Split Sub-Directory to Separate Repository
Git: Split a Sub-Directory into Its Own Repository
If you regret putting that sub-directory inside a Git repository and are thinking about moving it out into its own repository, you have come to the right place!
big-data
HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud
With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.
cloud
HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud
With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.
storage
HDFS vs Object Storage: Deep Dive for Big Data Processing in the Cloud
With the rise of cloud-native data architectures and serverless computing, data engineers face a crucial decision when designing big data pipelines: HDFS or Object Storage? This decision impacts performance, scalability, fault tolerance, and cost. In this post, we’ll conduct a deep technical analysis of both storage systems and explore how they align with modern big data processing patterns on the cloud.