Tag: bigdata

Top 15 HDFS Interview Questions

HDFS is the distributed file system used in Hadoop; it stores very large files on commodity hardware. When working with Hadoop, and Big Data in general, it is important to understand the basic concepts of the underlying file system, which in Hadoop’s case is HDFS. These concepts also come up in Big Data interviews. Let’s look at some basic HDFS interview questions.

What is RDD in Spark, and Why Do We Need It?

Resilient Distributed Datasets (RDDs) in Spark

Apache Spark has already taken over from Hadoop (MapReduce) for many workloads because of the benefits it provides, notably faster execution of iterative algorithms such as those used in machine learning.

In this post, we will try to understand what makes Spark RDDs so useful in batch analytics.

Why RDD?

Iterative distributed computing processes data over multiple jobs, as in logistic regression, k-means clustering and PageRank. In such computations it is common to reuse or share the same data among multiple jobs, or to run multiple ad-hoc queries over a shared data set. This makes a good data-sharing architecture essential for fast computation.
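To make the cost of poor data sharing concrete, here is a minimal plain-Python sketch. It is not Spark code: `CountingSource`, `refine`, `run_reloading` and `run_sharing` are hypothetical names invented for this illustration, and the "storage" is just an in-memory list that counts how often it is read in full.

```python
class CountingSource:
    """Hypothetical stand-in for slow shared storage; counts full reads."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def refine(estimate, data):
    """One toy iteration: move the estimate halfway toward the data mean."""
    return 0.5 * estimate + 0.5 * (sum(data) / len(data))


def run_reloading(source, iterations):
    """Each iteration behaves like a separate job that re-reads its input."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, source.load())
    return estimate


def run_sharing(source, iterations):
    """The input is read once and the in-memory copy is reused."""
    data = source.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = refine(estimate, data)
    return estimate


reloading_source = CountingSource([1.0, 2.0, 3.0, 4.0])
sharing_source = CountingSource([1.0, 2.0, 3.0, 4.0])
result_reloading = run_reloading(reloading_source, iterations=5)
result_sharing = run_sharing(sharing_source, iterations=5)

# Same answer either way, but very different numbers of storage reads.
print(reloading_source.reads, sharing_source.reads)  # prints "5 1"
```

Both versions compute the same result; the only difference is how many times the input is fetched. On a real cluster each of those fetches would be a pass over distributed storage, which is why sharing data between iterations matters so much.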


What is Apache HCatalog?

What is HCatalog?

Apache HCatalog is a storage management layer for Hadoop that helps users of different data processing tools in the Hadoop ecosystem, such as Hive, Pig and MapReduce, easily read and write data on the cluster. HCatalog provides a relational view of data stored on HDFS in formats such as RCFile, Parquet, ORC and SequenceFile. It also exposes a REST API that lets external systems access the metadata.

Hive Strict Mode

Sort By vs Order By vs Group By vs Cluster By in Hive

What is Hive Strict Mode?

Hive strict mode (hive.mapred.mode=strict) makes Hive reject certain performance-intensive operations, such as:

  • It rejects queries on partitioned tables that have no partition filter in the WHERE clause.

  • It rejects ORDER BY without a LIMIT clause (since ORDER BY uses a single reducer, which can choke your processing if not handled properly).

Also, for dynamic partitions, there is a separate strict setting (hive.exec.dynamic.partition.mode=strict).

This is the default setting; it prevents all partitions from being dynamic and requires at least one static partition.
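The two query restrictions listed above can be sketched as a small check. This is a toy model in plain Python, not how Hive implements strict mode: `strict_mode_violations` is a hypothetical helper, the `logs` table with partition column `dt` is made up, and the regex matching is deliberately naive. The first check looks for a partition column in the WHERE clause.

```python
import re


def strict_mode_violations(query, partition_columns):
    """Return the reasons a query would be rejected under the two toy rules."""
    q = query.lower()
    violations = []

    # Rule 1: a partitioned table must be filtered on a partition column.
    where = re.search(r"\bwhere\b(.*)", q, flags=re.DOTALL)
    where_text = where.group(1) if where else ""
    has_partition_filter = any(
        re.search(rf"\b{re.escape(col.lower())}\b", where_text)
        for col in partition_columns
    )
    if partition_columns and not has_partition_filter:
        violations.append("no partition filter")

    # Rule 2: ORDER BY must come with a LIMIT.
    if re.search(r"\border\s+by\b", q) and not re.search(r"\blimit\s+\d+", q):
        violations.append("ORDER BY without LIMIT")

    return violations


# Hypothetical table `logs`, partitioned by `dt`.
full_scan = strict_mode_violations("SELECT * FROM logs", {"dt"})
unbounded_sort = strict_mode_violations(
    "SELECT * FROM logs WHERE dt = '2015-01-01' ORDER BY ts", {"dt"}
)
accepted = strict_mode_violations(
    "SELECT * FROM logs WHERE dt = '2015-01-01' ORDER BY ts LIMIT 10", {"dt"}
)
print(full_scan, unbounded_sort, accepted)
```

The first query is flagged for scanning every partition, the second for an unbounded global sort, and the third passes both checks.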


How-To: Configure MySQL Metastore for Hive

Hive by default comes with Derby as its metastore storage, which is suited only for testing purposes; in most production scenarios it is recommended to use MySQL as the metastore. This is a step-by-step guide on how to configure a MySQL metastore for Hive in place of the default Derby metastore.

Assumptions: Basic knowledge of Unix is assumed. It is also assumed that Hadoop and Hive are already configured, and that Hive with the default Derby metastore is working and has been tested.

  1. Install  MySQL –

Note:  You will be prompted to set a password for root.


Getting Started with Hadoop: Free Online Hadoop Trainings

Hadoop Trainings

Oh Yes! It’s Free !!!

With the rising popularity of Big Data and Hadoop technologies, the growing demand for them and the shortage of experts, various paid training courses and certifications are available from enterprise Hadoop providers such as Cloudera, Hortonworks, IBM and MapR. But if you don’t want to shell out money and would rather learn at your own comfort and pace, you should definitely take a look at these free online courses and explore the world of Hadoop and Big Data. Happy Hadooping 🙂

  1. Intro to Hadoop and MapReduce by Udacity – This course by Cloudera provides a nice explanation of the core concepts and internal workings of Hadoop components, with quizzes on each concept and some good hands-on exercises. They also provide a VM for training purposes, which can be used to run the examples and to solve the quizzes and exams for the course.

Goals –

  • How Hadoop fits into the world (recognize the problems it solves)
  • Understand the concepts of HDFS and MapReduce (find out how it solves the problems)
  • Write MapReduce programs (see how we solve the problems)
  • Practice solving problems on your own

Prerequisites –

Some basic programming knowledge and a good interest in learning 🙂

2. Introduction to MapReduce Programming by BigDataUniversity –

This is a good course for understanding the basics of Map and Reduce and how MapReduce applications work.

3. Moving Data into Hadoop

4. Introduction to YARN and MapReduce 2 – An excellent webinar covering how YARN can change the way distributed processing works.

