What is HCatalog? Apache HCatalog is a storage management layer for Hadoop that helps users of different data processing tools in the Hadoop ecosystem, such as Hive, Pig and MapReduce, easily read and write data on the cluster. HCatalog provides a relational view of data stored on HDFS in formats such as RCFile, Parquet, ORC and SequenceFile. It also exposes a REST API …
Problem Statement: Until Java 7, java.util.HashMap always suffered from the problem of hash collisions: when the hashCode() values of multiple keys end up in the same bucket, the entries are stored in a linked list, which degrades HashMap lookup performance from O(1) to O(n) in the worst case.
Solution: Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store …
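To make the collision scenario concrete, here is a minimal sketch in Java. The names CollisionDemo, CollidingKey and buildMap are hypothetical and are not taken from the original post. Every key deliberately returns the same hashCode(), so all entries land in one bucket. On Java 8 and later, HashMap can turn such an overfull bucket into a balanced tree, and keys that implement Comparable let it order the tree efficiently. The timing printed at the end will vary by machine and JVM, so no expected value is shown.

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {

    // Hypothetical key type: every instance hashes to the same bucket.
    static final class CollidingKey implements Comparable<CollidingKey> {
        final int id;

        CollidingKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42; // constant on purpose, to force collisions
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof CollidingKey && ((CollidingKey) o).id == id;
        }

        // Comparable lets a tree-based bucket order keys that share a hash.
        @Override
        public int compareTo(CollidingKey other) {
            return Integer.compare(id, other.id);
        }
    }

    // Builds a map whose n entries all collide in a single bucket.
    static Map<CollidingKey, Integer> buildMap(int n) {
        Map<CollidingKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new CollidingKey(i), i);
        }
        return map;
    }

    public static void main(String[] args) {
        int n = 20_000;
        Map<CollidingKey, Integer> map = buildMap(n);

        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            // Each lookup has to search the one shared bucket.
            if (map.get(new CollidingKey(i)) != i) {
                throw new AssertionError("unexpected value for key " + i);
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " colliding lookups took " + elapsedMs + " ms");
    }
}
```

Running the same class on a Java 7 runtime and on a Java 8 or later runtime is one way to compare the linked-list and tree-based behavior for yourself.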
Twitter open-sourced its Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter's Streaming API. In this post, I present a demo of how to use hbc to build a Kafka Twitter stream producer that tracks a few terms in Twitter statuses and produces a Kafka stream from them, which can be …
In Apache Hive, there is often confusion about how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY differ. I have compiled a set of differences between them, based on attributes such as what the final output looks like and how the data in the output is ordered: