Category Archives: Java

My Book on ELK Stack : Learning ELK Stack

What is Apache HCatalog ?

What is HCatalog ?

Apache HCatalog is a Storage Management Layer for Hadoop that helps to users of different data processing tools in Hadoop ecosystem like Hive, Pig and MapReduce easily read and write data from the cluster.HCatalog enables with relational view of data  from RCFile format, Parquet, ORC files, Sequence files stored on HDFS. It also exposes REST API exposed … Continue Reading ››

Hashmap Performance Improvements in Java 8

Problem Statement :

Until Java 7, java.util.Hashmap implementations always suffered with the problem of Hash Collision, i.e. when multiple hashCode() values end up in the same bucket, values are placed in a Linked List implementation, which reduces Hashmap performance from O(1) to O(n).

Solution :

Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store … Continue Reading ››

How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)

{{unknown}}

How-To : Write a Kafka Producer using Twitter Stream ( Twitter HBC Client)

Twitter opensourced it's  Hosebird client (hbc) , a robust Java HTTP library for consuming Twitter’s Streaming API . In this post, I am going to present a demo of how we can use hbc to create a Kafka twitter stream producer , which tracks few terms on twitter statuses  and produces a kafka stream out of it, which can be … Continue Reading ››

Hive : SORT BY vs ORDER BY vs DISTRIBUTE BY vs CLUSTER BY

In Apache Hive, It's always a matter of confusion that how SORT BY,  ORDER BY, DISTRIBUTE BY and CLUSTER BY differs. I have compiled a set of differences between these based on attributes like  how will final output look like and ordering of data in output -

SORT BY

Sort By vs Order By … <a href=Continue Reading ››