Tag: big data

How-To: Write a Kafka Producer using Twitter Stream (Twitter HBC Client)

Twitter open-sourced its Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s Streaming API. In this post, I present a demo of how we can use hbc to create a Kafka Twitter stream producer that tracks a few terms in Twitter statuses and produces a Kafka stream out of them. That stream can be used later for counting the terms, or for moving the data from Kafka to Storm (a Kafka-Storm pipeline) or HDFS (as we will see in the next post, using the Camus API).

You can download and run the complete sample here.

How-To: Set Up Realtime Analytics over Logs with ELK Stack: Elasticsearch, Logstash, Kibana

Once we know something, we find it hard to imagine what it was like not to know it.

– Chip & Dan Heath, Authors of Made to Stick, Switch


What is the ELK stack?

The ELK stack consists of the open source tools Elasticsearch, Logstash and Kibana. Together, these three provide a fully working real-time analytics tool for extracting the valuable information sitting in your data.


Elasticsearch, built on top of Apache Lucene, is a search engine focused on real-time analysis of data, and it exposes a RESTful API. It provides standard full-text search as well as powerful query-based search. Elasticsearch is document-oriented, and you can store anything you want as JSON, which makes it powerful, simple and flexible.


Logstash is a tool for managing events and logs. You can use it to collect logs, parse them, and store them for later use. In the ELK stack, Logstash plays an important role in shipping the logs and parsing them so that they can be indexed in Elasticsearch.


Kibana is a user-friendly way to view, search and visualize your log data. It presents the data that Logstash has stored in Elasticsearch in a highly customizable interface, with histograms and other panels that provide real-time analysis and search over the data you have parsed into Elasticsearch.

How do I get it?


How do they work together?

Logstash is essentially a pipelining tool. In a basic, centralized installation, a Logstash agent, known as the shipper, reads input from one or more input sources and outputs that text, wrapped in a JSON message, to a broker. The broker, typically Redis, caches the messages until another Logstash agent, known as the collector, picks them up and sends them to another output. In the common example this output is Elasticsearch, where the messages are indexed and stored for searching. The Elasticsearch store is accessed via the Kibana web application, which allows you to visualize and search through the logs. The entire system is scalable: many different shippers may be running on many different hosts, watching log files and shipping the messages off to a cluster of brokers, and many collectors can then read those messages and write them to an Elasticsearch cluster.
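To make the shipper, broker and collector roles concrete, here is a minimal Python sketch of the same flow. It is only an illustration under stated assumptions: an in-memory `queue.Queue` stands in for Redis, a plain list stands in for an Elasticsearch index, and the function names, host names and sample lines are hypothetical. It is not how Logstash itself is implemented.

```python
import json
import queue


def shipper(lines, broker, source):
    """Wrap each raw log line in a JSON message and push it to the broker."""
    for line in lines:
        message = {"source": source, "message": line.rstrip("\n")}
        broker.put(json.dumps(message))


def collector(broker, index):
    """Drain the broker and 'index' each message (here: append to a list)."""
    while True:
        try:
            raw = broker.get_nowait()
        except queue.Empty:
            break
        index.append(json.loads(raw))


broker = queue.Queue()  # assumption: stand-in for a Redis broker
index = []              # assumption: stand-in for an Elasticsearch index

# Two shippers on two (hypothetical) hosts feed the same broker.
shipper(["GET /index.html 200\n", "GET /missing 404\n"], broker, "web-01")
shipper(["POST /login 302\n"], broker, "web-02")

# One collector picks the messages up and stores them for searching.
collector(broker, index)

# A trivial "search" over the stored messages, the part Kibana would front.
hits = [doc for doc in index if "404" in doc["message"]]
print(len(index), "messages indexed;", len(hits), "matching '404'")
```

Because the shippers and the collector only share the broker, either side can be scaled out independently, which is the property the paragraph above describes.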

Realtime Analytics for logs using ELK Stack
(E)lasticSearch (L)ogstash  (K)ibana (The ELK Stack)

How do I fetch useful information out of logs?

Fetching useful information from logs is one of the most important parts of this stack. It is done in Logstash using grok filters together with a set of input, filter and output plugins. These plugins let Logstash take various kinds of input (file, TCP, UDP, GemFire, stdin, Unix, WebSockets, and even IRC and Twitter, among many others), filter them (using grok, grep, date filters, etc.) and finally write output to Elasticsearch, Redis, email, HTTP, MongoDB, GemFire, Jira, Google Cloud Storage, etc.

A bit more about Logstash

Realtime Analytics over Logs using ELK Stack


Logs can also be transformed as they go through the pipeline using filters, either on the shipper or on the collector, whichever suits your needs better. As an example, an Apache HTTP log entry can have each element (request, response code, response size, etc.) parsed out into individual fields so that they can be searched more easily. Information can be dropped if it isn’t important. Sensitive data can be masked. Messages can be tagged. The list goes on.



The above example takes input from an Apache log file, applies a grok filter with %{COMBINEDAPACHELOG}, which splits each Apache log line into named fields, and finally writes the result to the standard output console.
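As a rough analogue of what a pattern like %{COMBINEDAPACHELOG} does, the Python sketch below parses one line in Apache combined log format into named fields with a regular expression. The regex is a simplified approximation written for this illustration, not grok's actual pattern definition, and the field names are chosen to resemble grok's output rather than being guaranteed to match it.

```python
import re

# Simplified approximation of the Apache combined log format.
# Assumption: this is NOT grok's real COMBINEDAPACHELOG definition.
COMBINED_LOG = re.compile(
    r'(?P<clientip>\S+) (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>\S+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

fields = parse_line(sample)
print(fields["clientip"], fields["verb"], fields["request"], fields["response"])
```

Once each element lives in its own field, questions such as "how many requests returned 404" become simple field lookups instead of text searches.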

Writing Grok Filters

Writing grok filters to extract information is the one task that requires serious effort. Done properly, it gives you great insight into your data, such as the number of transactions performed over time or which types of products get the most hits.

The links below will help you a lot in writing grok filters and testing them with ease –

Grok Debugger

Grok Debugger is a wonderful tool for testing your grok patterns before using them in your Logstash filters.


Grok Patterns Lookup

You can look up grok patterns for various commonly used log formats here –


If you like this post, you will love my book on the ELK stack – https://www.packtpub.com/big-data-and-business-intelligence/learning-elk-stack . The book covers all the basics of Elasticsearch, Logstash and Kibana 4 to get you started on the ELK stack. Please find more details of the book here.


Related Articles :-

Hadoop Certification

Getting Started with Apache Pig

Hadoop Reading List


Hadoop : Getting Started with Pig

What is Apache Pig?

Apache Pig is a high-level scripting language that is used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs, which run on data stored in HDFS (refer to the diagram below).

Through the User Defined Functions (UDF) facility, Pig can invoke code in many languages, such as JRuby, Jython and Java. You can also embed Pig scripts in other languages. The result is that you can use it as a component to build larger and more complex applications that tackle real business problems.


Pig Architecture

How is it being used?

  • Rapid prototyping of algorithms for processing large data sets.
  • Data Processing for web search platforms.
  • Ad Hoc queries across large data sets.
  • Web log processing.

Pig Elements

It consists of three elements –

  • Pig Latin
    • High level scripting language
    • No Schema
    • Translated to MapReduce Jobs
  • Pig Grunt Shell
    • Interactive shell for executing pig commands.
  • PiggyBank
    • Shared repository for User defined functions (explained later).

Pig Latin Statements 

Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output (except for LOAD and STORE statements).

Pig Latin statements are generally organized as follows:

  • A LOAD statement to read data from the file system.
  • A series of “transformation” statements to process the data.
  • A DUMP statement to view results or a STORE statement to save the results.

Note that a DUMP or STORE statement is required to generate output.

  • If a script contains only LOAD and FOREACH statements, Pig will validate, but not execute, them.
  • If a script contains LOAD, FOREACH, and DUMP statements, Pig will validate and then execute all three.
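The deferred behaviour described above can be sketched in plain Python. This is a toy model only: the `load`, `foreach` and `dump` helpers and the sample rows are hypothetical names invented for the illustration, and the sketch models deferred execution alone, not Pig's validation step or its translation to MapReduce jobs.

```python
class Relation:
    """A lazily evaluated step in a toy pipeline."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable returning rows

    def rows(self):
        return self._compute()


executed = []  # records which steps actually ran, in order


def load(rows):
    """Describe a LOAD step; nothing is read until rows() is called."""
    def compute():
        executed.append("LOAD")
        return list(rows)
    return Relation(compute)


def foreach(relation, fn):
    """Describe a FOREACH step that applies fn to every input row."""
    def compute():
        source = relation.rows()
        executed.append("FOREACH")
        return [fn(row) for row in source]
    return Relation(compute)


def dump(relation):
    """Trigger execution of the whole chain and print the result."""
    result = relation.rows()
    executed.append("DUMP")
    for row in result:
        print(row)
    return result


students = [("John", 18, 4.0), ("Mary", 19, 3.8)]  # hypothetical sample data

A = load(students)
B = foreach(A, lambda row: (row[0],))
print(executed)   # [] : only LOAD and FOREACH so far, so nothing has run

names = dump(B)   # prints ('John',) and ('Mary',)
print(executed)   # ['LOAD', 'FOREACH', 'DUMP']
```

Building the plan first and running it only when output is requested is what lets an engine like Pig look at the whole chain of statements before any job is launched.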

Storing Intermediate Results

Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property.

Storing Final Results

Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).

Note: During the testing/debugging phase of your implementation, you can use DUMP to display results to your terminal screen. However, in a production environment you always want to use the STORE operator to save your results.

Debugging Pig Latin

Pig Latin provides operators that can help you debug your Pig Latin statements:

  • Use the DUMP operator to display results to your terminal screen.
  • Use the DESCRIBE operator to review the schema of a relation.
  • Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
  • Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.

What are Pig User Defined Functions (UDFs)?

Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig. UDFs are a very powerful way to perform complex operations on data. The Piggy Bank is a place for users to share their functions (UDFs).


I hope you now know some basic Pig concepts!

Happy learning!

 Related articles :

Top 20 Hadoop and Big Data Books

Big Data Books

Hadoop: The Definitive Guide


Hadoop: The Definitive Guide is the ideal guide for anyone who wants to know about Apache Hadoop and all that can be done with it. It is a good book on the basics of Hadoop (HDFS, MapReduce and other related technologies), and it provides all the details necessary to start working with Hadoop and programming with it.

“Now you have the opportunity to learn about Hadoop from a master-not only of the technology, but also of common sense and plain talk.” — Doug Cutting, Hadoop Founder, Yahoo!

The latest version, the 4th edition, is available here – Hadoop: The Definitive Guide 4e


How-To: Become a Hadoop Certified Developer

Hadoop Certified


Apache Hadoop is an open source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to gain insight from massive amounts of structured and unstructured data quickly.

Hadoop and Big Data are the hot trends of the industry these days. Most companies are either already implementing them or have at least started to show interest in order to remain competitive in the market. Big Data and analytics are certainly among the great concepts for the current and forthcoming IT generation, as most innovation is driven by the vast amount of data that is being generated at an exponential rate.

There are many vendors for Enterprise Hadoop in the industry; Cloudera, Hortonworks (spun out of Yahoo), MapR and IBM are some of the front runners. They all have their own Hadoop distributions, which differ from one another in features while keeping Hadoop at their core. They provide training on various Hadoop and Big Data technologies and, following the industry trend, are now offering certifications around these technologies too.

In this article I list the latest available Hadoop certifications from different vendors in the industry. Whether certifications are helpful to your career is a different debate altogether and out of the scope of this article. The list may be useful for those who think they have done enough reading and now want to test themselves, or for those who are looking to add value to their portfolios.


CCAH (Administrator) Exams

Cloudera Certified Administrator for Apache Hadoop (CCA-410)

There are currently three versions of this exam –

Exam Code: CCA-410
Number of Questions: 60 questions
Time Limit: 90 minutes
Passing Score: 70%
Language: English, Japanese
Price: USD $295

Exam Code: CCA-500
Number of Questions: 60 questions
Time Limit: 90 minutes
Passing Score: 70%
Language: English, Japanese (forthcoming)
Price: USD $295

CCAH CDH 5 Upgrade Exam
Exam Code: CCA-505
Number of Questions: 45 questions
Time Limit: 90 minutes
Passing Score: 70%
Language: English, Japanese (forthcoming)
Price: USD $125

CCAH Practice Test

CCAH Study Guide

CCDH (Developer) Exams

Cloudera Certified Developer for Apache Hadoop (CCD-410)

Exam Code: CCD-410
Number of Questions: 50 – 55 live questions
Time Limit: 90 minutes
Passing Score: 70%
Language: English, Japanese
Price: USD $295

Study Guide :- Available at Cloudera site.

Practice Tests :- Available at Cloudera site.


For 2.x Certifications

1) Hadoop 2.0 Java Developer Certification

This certification is intended for developers who design, develop and architect Hadoop-based solutions written in the Java programming language.

Time Limit: 90 minutes

Number of Questions: 50

Passing Score: 75%

Price: $150 USD

Practice tests can be taken by registering at certification site.

2) Hadoop 2.0 Developer Certification

The Certified Apache Hadoop 2.0 Developer certification is intended for developers who design, develop and architect Hadoop-based solutions, consultants who create Hadoop project proposals and Hadoop development instructors.

Time Limit: 90 minutes

Number of Questions: 50

Passing Score: 75%

Price: $150 USD

Practice tests can be taken by registering at certification site.

3) Hortonworks Certified Apache Hadoop 2.0 Administrator

This certification is intended for administrators who deploy and manage Apache Hadoop 2.0 clusters, and it covers how to install, configure, maintain and scale the Hadoop 2.0 environment.

Time Limit: 90 minutes

Number of Questions: 48

Passing Score: 75%

Price: $150 USD

For 1.x Certifications

1) Hadoop 1.0 Developer Certification

Time Limit: 90 minutes

Number of Questions: 53

Passing Score: 75%

Price: $150 USD

2) Hadoop 1.0 Administrator Certification

Time Limit: 60 minutes

Number of Questions: 41

Passing Score: 75%

Price: $150 USD


Related Articles :


Recommended Readings for Hadoop

I am writing this series to share some recommended reading for understanding Hadoop, its architecture, the minute details of cluster setup, etc.

Understanding Hadoop Cluster Setup and Network – Brad Hedlund, with his expertise in networks, provides minute details of the cluster setup and data exchange mechanisms of a typical Hadoop cluster.

MongoDB and Hadoop – Webinar by Mike O’Brien, Software Engineer at MongoDB, on how MongoDB and Hadoop can be used together, using core MapReduce as well as Pig and Hive.

Please post a comment if you have come across a great article or webinar that explains things in great detail with ease.
