What is Apache HCatalog?

What is HCatalog?

Apache HCatalog is a storage management layer for Hadoop that helps users of different data processing tools in the Hadoop ecosystem, such as Hive, Pig and MapReduce, easily read and write data on the cluster. HCatalog provides a relational view of data stored on HDFS in formats such as RCFile, Parquet, ORC and SequenceFile. It also exposes a REST API that external systems can use to access the metadata. (more…)

HashMap Performance Improvements in Java 8

Problem Statement :

Until Java 7, java.util.HashMap always suffered from the problem of hash collisions: when the hashCode() values of multiple keys end up in the same bucket, the entries are placed in a linked list, which degrades HashMap lookups in that bucket from O(1) to O(n) in the worst case.

Solution :

Java 8 improves the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries in heavily populated buckets, which brings the worst case down from O(n) to O(log n). This improves collision performance for any key type that implements Comparable. (more…)
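To make the collision scenario concrete, here is a minimal sketch, not taken from the original post. The class names `CollidingKeyDemo` and `CollidingKey` are hypothetical; the key deliberately returns a constant hash code so that every entry falls into the same bucket, and it implements Comparable so that a Java 8+ HashMap can order the entries inside that bucket's tree.

```java
import java.util.HashMap;
import java.util.Map;

class CollidingKeyDemo {

    // Hypothetical key type: every instance reports the same hash code,
    // so all entries are forced into a single bucket.
    static final class CollidingKey implements Comparable<CollidingKey> {
        final int id;

        CollidingKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }

        @Override public boolean equals(Object o) {
            return o instanceof CollidingKey && ((CollidingKey) o).id == id;
        }

        // Comparable lets a treeified bucket order its entries by id.
        @Override public int compareTo(CollidingKey other) {
            return Integer.compare(id, other.id);
        }
    }

    // Builds a map with n entries whose keys all collide.
    static Map<CollidingKey, Integer> buildMap(int n) {
        Map<CollidingKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new CollidingKey(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        int n = 20_000;
        Map<CollidingKey, Integer> map = buildMap(n);

        // Time n lookups that all hit the same bucket.
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            if (map.get(new CollidingKey(i)) != i) {
                throw new IllegalStateException("unexpected value for key " + i);
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " lookups in one bucket took " + elapsedMs + " ms");
    }
}
```

Running the same program on Java 7 and on Java 8 or later should show the difference described above, although the exact timings depend on the machine and JVM.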

Unix Job Control Commands – bg, fg, Ctrl+Z, jobs

Since Hadoop jobs are often long running, it's difficult for newcomers to manage the processes in Unix unless they know a few useful Unix commands that can make them more efficient.

In this post, I will explain some commands that are very useful when executing long-running jobs. We will see how to run a job in the background, bring it back to the foreground, stop and resume its execution, and kill it. (more…)

How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)

Simple String Example for Setting up Camus for Kafka-HDFS Data Pipeline

I came across Camus while building a Lambda Architecture framework recently. I couldn’t find a good illustration of getting started with a Kafka-HDFS pipeline, so in this post we will see how to use Camus to build a Kafka-HDFS data pipeline using a Twitter stream produced by the Kafka producer described in the last post.

What is Camus?

Camus is LinkedIn’s Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. It includes the following features: (more…)

How-To : Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

Twitter open-sourced its Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s Streaming API. In this post, I am going to present a demo of how we can use hbc to create a Kafka Twitter stream producer, which tracks a few terms in Twitter statuses and produces a Kafka stream out of them. That stream can be used later for counting the terms, or for moving the data from Kafka to Storm (a Kafka-Storm pipeline) or to HDFS (as we will see in the next post, using the Camus API).

You can download and run the complete sample here (more…)

Hive : SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive, as in SQL, you can order or sort your data in different ways depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive.

Sort By vs Order By vs Group By vs Cluster By in Hive

SORT BY

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

Ordering : It orders data within each of the ‘N’ reducers, but the reducers can hold overlapping ranges of data.

Outcome : N or more sorted files with overlapping ranges. (more…)
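The overlapping-ranges outcome can be illustrated with a small simulation in plain Java. This is only a sketch of the idea, not Hive code: the class `SortBySimulation` and its `sortBy` method are hypothetical, and the round-robin assignment of rows to reducers is an assumption made for the example rather than how Hive actually distributes rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortBySimulation {

    // Splits the rows across n simulated reducers, then sorts each share
    // independently, the way SORT BY orders data per reducer.
    // Round-robin assignment is an assumption made for this illustration only.
    static List<List<Integer>> sortBy(List<Integer> rows, int n) {
        List<List<Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            reducers.add(new ArrayList<>());
        }
        for (int i = 0; i < rows.size(); i++) {
            reducers.get(i % n).add(rows.get(i));
        }
        for (List<Integer> share : reducers) {
            Collections.sort(share);
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(5, 1, 9, 2, 3, 8);
        // Each inner list is sorted, but the ranges of the two lists overlap.
        System.out.println(sortBy(rows, 2)); // prints [[3, 5, 9], [1, 2, 8]]
    }
}
```

Each simulated reducer produces a sorted list, yet concatenating the lists does not give a globally sorted result, which is the behaviour that distinguishes SORT BY from ORDER BY.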
