Hadoop jobs often run for a long time, so it's difficult for newcomers to manage the processes in Unix unless they know a few useful commands that make them more efficient.
In this post, I will explain some of the commands that are very useful when executing long-running jobs. We will see how to run a job in the background, bring it back to the foreground, stop its execution and start it again, and kill a job. (more…)
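In an interactive shell these steps are usually done with `&`, `jobs`, `fg`, `bg`, Ctrl+Z and `kill`. As a rough illustration of what happens underneath, here is a minimal Python sketch that walks a child process through the same lifecycle using signals. It assumes a POSIX system; the sleeping child is only a stand-in for a long-running job, and the function name `run_job_lifecycle` is made up for this example.

```python
import signal
import subprocess
import sys


def run_job_lifecycle():
    """Start, pause, resume and terminate a stand-in long-running job."""
    # Start the child without waiting for it, roughly what `command &` does.
    job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    states = [("started", job.poll() is None)]

    # Pause the job (the signal-level equivalent of suspending it).
    job.send_signal(signal.SIGSTOP)
    states.append(("stopped", job.poll() is None))

    # Let it continue running (what `bg` or `fg` does to a suspended job).
    job.send_signal(signal.SIGCONT)
    states.append(("resumed", job.poll() is None))

    # Ask it to exit, like a plain `kill <pid>` (SIGTERM).
    job.terminate()
    job.wait(timeout=10)
    states.append(("killed", job.returncode == -signal.SIGTERM))
    return states


if __name__ == "__main__":
    for name, ok in run_job_lifecycle():
        print(name, ok)
```

A stopped process is still alive, which is why `poll()` keeps returning `None` until the job is actually terminated.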
Simple String Example for Setting up Camus for Kafka-HDFS Data Pipeline
I came across Camus while building a Lambda Architecture framework recently. I couldn’t find a good illustration of getting started with a Kafka-HDFS pipeline, so in this post we will see how to use Camus to build a Kafka-HDFS data pipeline using a Twitter stream produced by the Kafka producer mentioned in the last post.
What is Camus?
Camus is LinkedIn’s Kafka->HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. It includes the following features: (more…)
Twitter open-sourced its Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s Streaming API. In this post, I am going to present a demo of how we can use hbc to create a Kafka Twitter stream producer, which tracks a few terms in Twitter statuses and produces a Kafka stream from them. That stream can be used later for counting the terms, or for moving the data from Kafka to Storm (a Kafka-Storm pipeline) or to HDFS (as we will see in the next post, which uses the Camus API).
In Apache Hive, as in SQL, you can order or sort your data differently depending on your ordering and distribution requirements. In this post we will look at how SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY behave differently in Hive.
Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column types: if the column is of a numeric type, the rows are sorted in numeric order; if the column is of a string type, they are sorted in lexicographical order.
Ordering: It orders data within each of the N reducers, but the reducers can have overlapping ranges of data.
Outcome: N or more sorted files with overlapping ranges. (more…)
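To make the SORT BY behavior concrete, here is a small Python simulation rather than Hive itself. It first shows how numeric and string columns sort differently, then sorts rows independently inside each "reducer". The round-robin assignment of rows to reducers is an assumption made purely for illustration, and the function name `sort_by` is hypothetical.

```python
def sort_by(rows, num_reducers):
    """Simulate SORT BY: each reducer sorts only the rows it receives."""
    partitions = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        # Assumption: rows are dealt out round-robin. The point is only
        # that rows are not routed to reducers by value range.
        partitions[i % num_reducers].append(row)
    return [sorted(p) for p in partitions]


# Column type decides the sort order.
numeric_order = sorted([10, 9, 2])        # numeric comparison
string_order = sorted(["10", "9", "2"])   # lexicographical comparison
print(numeric_order)  # [2, 9, 10]
print(string_order)   # ['10', '2', '9']

rows = [9, 1, 3, 7, 2, 8]
outputs = sort_by(rows, num_reducers=2)
print(outputs)  # [[2, 3, 9], [1, 7, 8]]

# Each output is sorted, but the ranges overlap (2..9 and 1..8),
# so concatenating the outputs does not give a total order.
concatenated = [value for part in outputs for value in part]
print(concatenated == sorted(rows))  # False
```

Each reducer's output is sorted on its own, which is cheap to produce in parallel but is not a globally ordered result.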
A coprocessor is a mechanism that helps move computation closer to the data in HBase. It is like a MapReduce framework for distributing tasks across the cluster.
You can think of coprocessors either as Aspects in Java, which intercept code before and after critical operations and execute user-supplied behavior, or as triggers or stored procedures in an RDBMS, which are executed at run time, close to the data.
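The trigger/aspect analogy can be sketched in a few lines of Python. This is not the HBase coprocessor API; `ToyRegion`, `AuditObserver`, `pre_put` and `post_put` are hypothetical names invented for this toy example, which only shows the idea of user-supplied hooks running before and after a write, next to where the data lives.

```python
class ToyRegion:
    """A toy in-memory store that calls observer hooks around each put."""

    def __init__(self):
        self.rows = {}
        self.observers = []

    def register(self, observer):
        self.observers.append(observer)

    def put(self, key, value):
        # "Before" hooks may inspect or rewrite the value.
        for observer in self.observers:
            value = observer.pre_put(key, value)
        self.rows[key] = value
        # "After" hooks run once the write has been applied.
        for observer in self.observers:
            observer.post_put(key, value)


class AuditObserver:
    """Hypothetical observer: trims whitespace and records every write."""

    def __init__(self):
        self.log = []

    def pre_put(self, key, value):
        return value.strip()

    def post_put(self, key, value):
        self.log.append((key, value))


region = ToyRegion()
audit = AuditObserver()
region.register(audit)
region.put("row1", "  hello  ")
print(region.rows)  # {'row1': 'hello'}
print(audit.log)    # [('row1', 'hello')]
```

The caller only issues a plain `put`; the extra behavior runs inside the store, which is the sense in which the logic sits close to the data.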