HDFS is the distributed file system used in Hadoop; it is designed to store very large files on commodity hardware. While working on Hadoop, and Big Data in general, it is very important to understand the basic concepts of the underlying file system, i.e. HDFS in the case of Hadoop. When you are appearing in Big Data interviews, it is important to know these concepts. Let's look at some of the basic HDFS interview questions.
Simple String Example for Setting up Camus for Kafka-HDFS Data Pipeline
I came across Camus while building a Lambda Architecture framework recently. I couldn't find a good illustration of getting started with a Kafka-HDFS pipeline, so in this post we will see how we can use Camus to build a Kafka-HDFS data pipeline, using the Twitter stream produced by the Kafka producer from the last post.
What is Camus?
Camus is LinkedIn's Kafka-to-HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. It includes the following features:
What is a Coprocessor in HBase?
A coprocessor is a mechanism that helps move computation closer to the data in HBase, much as the MapReduce framework distributes tasks across the cluster.
You can think of coprocessors as similar to Aspects in Java, which intercept code before and after critical operations to execute user-supplied behavior, or to triggers and stored procedures in an RDBMS, which execute at run time, close to the data.
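As a sketch of the trigger-like flavor (written against the HBase 1.x-era observer API; the class name and log message are hypothetical), a region observer that intercepts every write might look like this:

```java
package com.example.observers; // hypothetical package

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Intercepts every Put before it is applied, analogous to a BEFORE trigger.
public class AuditObserver extends BaseRegionObserver {
    @Override
    public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                       Put put, WALEdit edit, Durability durability) throws IOException {
        // User-supplied behavior runs here, on the region server, next to the data.
        System.out.println("About to write a row in region "
            + ctx.getEnvironment().getRegion().getRegionInfo().getRegionNameAsString());
    }
}
```

The key point is where this runs: not in your client, but inside the region server hosting the data, which is what "moving computation closer to the data" means in practice.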
This post is intended for folks who are looking for a quick start on developing a basic Hadoop MapReduce application.
We will see how to set up a basic MR application for WordCount using Java, Maven, and Eclipse, and how to run a basic MR program in local mode, which makes debugging easier at an early stage.
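To give a flavor of what we will build, here is a sketch of the WordCount mapper against the `org.apache.hadoop.mapreduce` API (the class name is an illustrative choice; the reducer and driver follow the same pattern):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenizes each input line and emits a (word, 1) pair per token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```

In local mode the whole job runs inside a single JVM, so you can set breakpoints in `map()` directly from Eclipse.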
What is Apache Pig?
Apache Pig is a high-level scripting language that is used with Apache Hadoop. It enables data analysts to write complex data transformations without knowing Java. Its simple, SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL. Pig scripts are converted into MapReduce jobs which run on data stored in HDFS.
Through the User Defined Function (UDF) facility, Pig can invoke code in many languages like JRuby, Jython, and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.
How Is Pig Being Used?
- Rapid prototyping of algorithms for processing large data sets.
- Data Processing for web search platforms.
- Ad Hoc queries across large data sets.
- Web log processing.
Pig consists of three elements:
- Pig Latin
  - High-level scripting language
  - No schema
  - Translated to MapReduce jobs
- Pig Grunt Shell
  - Interactive shell for executing Pig commands
- Piggy Bank
  - Shared repository for user-defined functions (explained later)
Pig Latin Statements
Pig Latin statements are the basic constructs you use to process data with Pig. A Pig Latin statement is an operator that takes a relation as input and produces another relation as output (the exceptions are the LOAD and STORE statements).
Pig Latin statements are generally organized as follows:
- A LOAD statement to read data from the file system.
- A series of “transformation” statements to process the data.
- A DUMP statement to view results or a STORE statement to save the results.
Note that a DUMP or STORE statement is required to generate output.
- In this example Pig will validate, but not execute, the LOAD and FOREACH statements:

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
```

- In this example, Pig will validate and then execute the LOAD, FOREACH, and DUMP statements, producing the output shown:

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
(Bill)
(Joe)
```
Storing Intermediate Results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property.
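For example (the HDFS path here is an illustrative choice), the property can be passed on the command line or set in pig.properties:

```shell
# Point Pig's intermediate data at a custom HDFS location
pig -Dpig.temp.dir=/user/me/pig_tmp myscript.pig
```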
Storing Final Results
Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).
Note: During the testing/debugging phase of your implementation, you can use DUMP to display results to your terminal screen. However, in a production environment you always want to use the STORE operator to save your results.
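Continuing the student example from above (the output path is illustrative), the two styles look like this:

```pig
-- During testing/debugging: print the relation to the terminal
DUMP B;
-- In production: persist the relation to the file system instead
STORE B INTO 'output/student_names' USING PigStorage();
```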
Debugging Pig Latin
Pig Latin provides operators that can help you debug your Pig Latin statements:
- Use the DUMP operator to display results to your terminal screen.
- Use the DESCRIBE operator to review the schema of a relation.
- Use the EXPLAIN operator to view the logical, physical, or map reduce execution plans to compute a relation.
- Use the ILLUSTRATE operator to view the step-by-step execution of a series of statements.
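Using the student relation from the earlier example, a debugging session might look like this (the output shown in the comments is indicative):

```pig
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
DESCRIBE A;    -- prints the schema, e.g. A: {name: chararray, age: int, gpa: float}
EXPLAIN A;     -- shows the logical, physical, and MapReduce plans for computing A
ILLUSTRATE A;  -- shows sample data flowing through each step of the script
```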
What are Pig User Defined Functions (UDFs)?
Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing. Functions can be a part of almost every operator in Pig, and UDFs are a very powerful way to perform complex operations on data. The Piggy Bank is a place for users to share their functions (UDFs).
```pig
A = LOAD 'employee_data' AS (name: chararray, age: int, designation: chararray);
B = FOREACH A GENERATE saurzcodeUDF.UPPER(name);
```
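The `saurzcodeUDF.UPPER` function used above would be backed by a Java class along these lines (a sketch against Pig's `EvalFunc` API; the package name is taken from the script, the rest is illustrative):

```java
package saurzcodeUDF;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A simple eval UDF: upper-cases its first (chararray) argument.
public class UPPER extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // Pig treats this as a null result for the row
        }
        return ((String) input.get(0)).toUpperCase();
    }
}
```

After packaging this class into a jar, you would REGISTER the jar in the script before calling the function.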
I hope you now know some basic Pig concepts!
Happy Learning !!
Oh Yes! It’s Free !!!
With the rising popularity of, increasing demand for, and lack of experts in Big Data and Hadoop technologies, various paid training courses and certifications are available from enterprise Hadoop providers like Cloudera, Hortonworks, IBM, and MapR. But if you don't want to shell out money and want to learn at your own comfort and pace, you should definitely take a look at these free online courses and explore the world of Hadoop and Big Data. Happy Hadooping 🙂
- Intro to Hadoop and MapReduce by Udacity – This course, built by Cloudera, provides a nice explanation of the core concepts and internal workings of Hadoop components, with quizzes around each concept and some good hands-on exercises. They also provide a VM for training purposes, which can be used to run the examples and to solve the quizzes and exams for the course. In this course you will:
- How Hadoop fits into the world (recognize the problems it solves)
- Understand the concepts of HDFS and MapReduce (find out how it solves the problems)
- Write MapReduce programs (see how we solve the problems)
- Practice solving problems on your own
All you need is some basic programming knowledge and a good interest in learning 🙂
2. Introduction to MapReduce Programming by Big Data University –
This is a good course on understanding the basics of Map and Reduce and how MapReduce applications work.
3. Moving Data into Hadoop
4. Introduction to YARN and MapReduce 2 – An excellent webinar covering how YARN changes the way distributed processing works.
Related Articles :
- Recommended Readings for Hadoop
- How to Become a Hadoop Certified Developer?
- Reading List : Hadoop and Big Data Books