Tag: hadoop

Top 20 Hadoop MapReduce Interview Questions


Big Data and data analytics jobs are among the most sought-after jobs today. It is important to understand the basics before you appear for an interview. In this post, I cover a few basic Hadoop MapReduce interview questions.



What is RDD in Spark, and Why Do We Need It?

Resilient Distributed Datasets (RDDs) in Spark

Apache Spark has overtaken Hadoop MapReduce in many workloads, largely because it executes iterative algorithms, such as those used in machine learning, much faster.

In this post, we will try to understand what makes Spark RDDs so useful in batch analytics.

Why RDD?

Iterative distributed computing means processing data over multiple jobs, as in logistic regression, k-means clustering, and PageRank. In such computations it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared data set. This makes a good data-sharing architecture essential for fast computation.


What is Apache HCatalog?

What is HCatalog?

Apache HCatalog is a storage management layer for Hadoop that helps users of different data processing tools in the Hadoop ecosystem, such as Hive, Pig, and MapReduce, easily read and write data on the cluster. HCatalog provides a relational view of data stored on HDFS in formats such as RCFile, Parquet, ORC, and SequenceFile. It also exposes a REST API that lets external systems access the metadata.

Unix Job Control Commands – bg, fg, Ctrl+Z, jobs

Since Hadoop jobs are often long-running, it's difficult for newcomers to manage the processes in Unix unless they know a few useful commands for doing so. Knowing them makes you noticeably more efficient.

In this post, I will explain some commands that are very useful when executing long-running jobs. We will see how to run a job in the background, bring it back to the foreground, suspend and resume it, and kill it.
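In an interactive shell these operations are typed directly: `command &` starts a job in the background, `jobs` lists it, Ctrl+Z suspends the foreground job, `bg` and `fg` resume it, and `kill` ends it. Those built-ins only work at a terminal, so the sketch below illustrates the signals behind them from Python instead. It is an illustration under assumptions, not the post's own commands: the helper names are hypothetical, a sleeping Python child stands in for a long-running Hadoop job, and `SIGSTOP` is used in place of the `SIGTSTP` that Ctrl+Z sends. It runs on Unix-like systems only.

```python
import signal
import subprocess
import sys


def start_background(cmd):
    """Start cmd without waiting for it, roughly like `cmd &` in a shell."""
    return subprocess.Popen(cmd)


def suspend(proc):
    """Pause the process.

    Ctrl+Z sends SIGTSTP to the foreground job; SIGSTOP is used here
    (an assumption for the sketch) because a process cannot ignore it.
    """
    proc.send_signal(signal.SIGSTOP)


def resume(proc):
    """Let a suspended process continue; `bg` and `fg` both rely on SIGCONT."""
    proc.send_signal(signal.SIGCONT)


def kill_job(proc):
    """Ask the process to exit with SIGTERM, the default signal of `kill`.

    Returns the exit code; a negative value is the number of the signal
    that ended the process.
    """
    proc.terminate()
    return proc.wait(timeout=10)


if __name__ == "__main__":
    # Hypothetical stand-in for a long-running job.
    long_job = [sys.executable, "-c", "import time; time.sleep(300)"]

    job = start_background(long_job)
    print("started job with pid", job.pid)

    suspend(job)   # the job is now paused
    resume(job)    # the job is running again

    print("job ended with return code", kill_job(job))
```

The job is resumed before it is killed because a stopped process does not act on `SIGTERM` until it continues.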

How-to: Write a Coprocessor in HBase

What is a Coprocessor in HBase?

A coprocessor is a mechanism that moves computation closer to the data in HBase. It resembles the MapReduce framework in that it distributes tasks across the cluster.

You can think of coprocessors as similar to aspects in Java, which intercept code before and after critical operations and execute user-supplied behavior, or to triggers and stored procedures in an RDBMS, which execute at run time, close to the data.

