How to Configure Spark Application ( Scala and Java 8 Version with Maven ) in Eclipse.

saurzcode-spark-eclipseApache Spark is becoming very popular among organization looking to leverage its fast, in-memory computing capability for big-data processing. This article is for beginners to get started with Spark Setup on Eclipse/Scala IDE and  getting familiar with Spark terminologies in general –

Hope you have read previous article on RDD basics , to get a basic understanding of Spark RDD.

Tools Used :

  • Scala IDE for Eclipse – Download latest version of Scala IDE from here .Here, I used Scala IDE 4.7.0 Release, which support both Scala and Java
  • Scala Version – 2.11 ( make sure scala compiler is set to this version as well)
  • Spark Version 2.2 ( provided in maven dependency)
  • Java Version 1.8
  • Maven Version 3.3.9 ( Embedded in Eclipse)
  • winutils.exe

(more…)

Why Log analysis is important for you ?

Need for Log Analysis

Logs provide us with necessary information on how our system is behaving. However, the content and format of logs varies among different services or say, among different components of the same system. For example a scanner may log error messages related to communication with other devices, on the other hand a web server logs information on all incoming requests, outgoing response, time taken for response etc. Similarly application logs for an e-commerce website will log business specific logs.

As the logs vary by their content, so are their uses. For example, the logs from the scanner might be used for troubleshooting or for simple status check or reporting, while the Web-server log is used to analyze the traffic patterns across multiple products. Analysis of logs from an ecommerce site can help to figure out if packages from a specific location are returned repeatedly and probable reasons for the same.

(more…)

Top 15 HDFS Interview Questions

HDFS is the distributed file system used in Hadoop and helps to achieve the purpose of storing very larger files on a commodity Hardware. While working on Hadoop and BigData in general it is very important to understand the basic concepts of they underlying file system, i.e. HDFS in case of Hadoop. When you are appearing in BigData Interviews , it is important to know these concepts. Let’s see some of the basic HDFS interview questions –saurzcode_hadoop (more…)

My Book on ELK Stack : Learning ELK Stack

Learning ELK Stack

I am writing this post to announce the general availability of my book on ELK stack titled ” Learning ELK Stack ” with PacktPub publications.

Book aims to provide individuals/technologists who seek to implement their own log and data analytics solutions using opensource stack of Elasticsearch, Logstash and Kibana popularly known as ELK stack.

This is the first book ever published which covers ELK stack.

ELK
Learning ELK Stack by Saurabh Chhajed

(more…)

What is RDD in Spark ? and Why do we need it ?

Resilient Distributed Datasets -RDDs in Spark

Apcahe Spark has already taken over Hadoop (MapReduce)  because of plenty of benefits it provides in terms of faster execution in iterative processing algorithms such as Machine learning.

In this post, we will try to understand what makes Spark RDDs so useful in batch analytics .

Why RDD ?

When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as  Logistic Regression, K-means clustering, Page rank algorithms, it is fairly common to reuse or share the data among multiple jobs or it may involve multiple ad-hoc queries over a shared data set.This makes it very important to have a very good data sharing architecture so that we can perform fast computations.

(more…)

%d bloggers like this: