Category Archives: Big Data

How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)


How-To : Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

Twitter open-sourced its Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s Streaming API. In this post, I am going to present a demo of how we can use hbc to create a Kafka Twitter stream producer, which tracks a few terms in Twitter statuses and produces a Kafka stream out of them, which can be … Continue Reading ››


In Apache Hive, how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY differ is a frequent source of confusion. I have compiled a set of differences between them, based on attributes such as what the final output looks like and how the data in the output is ordered -


Sort By vs Order By … Continue Reading ››

How-to : Write a CoProcessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that helps move computation closer to the data in HBase. It is like a MapReduce framework for distributing tasks across the cluster. You can think of coprocessors as being like either Aspects in Java, which intercept code before and after some critical operations and … Continue Reading ››

How-To : Setup Development Environment for Hadoop MapReduce

This post is intended for folks who are looking for a quick start on developing a basic Hadoop MapReduce application. We will see how to set up a basic MR application for WordCount using Java, Maven and Eclipse, and run a basic MR program in local mode, which makes debugging easier at an early stage. Assuming JDK 1.6+ is … Continue Reading ››

How-To : Use HCatalog with Pig

Using HCatalog with Pig :-

This post is a step-by-step guide to running HCatalog and using it with Apache Pig :- Assumptions : Pig and Hive are installed and tested in basic modes. It requires the Hive Metastore and its database to be properly configured (Refer to Post). Versions Tested With :- HCatalog … Continue Reading ››