Category: Java

How to Configure a Spark Application (Scala and Java 8 with Maven) in Eclipse

Apache Spark is becoming very popular among organizations looking to leverage its fast, in-memory computing capability for big-data processing. This article helps beginners get started with Spark setup on Eclipse/Scala IDE and get familiar with Spark terminology in general –

Hope you have read the previous article on RDD basics to get a basic understanding of Spark RDDs.

Tools Used:

  • Scala IDE for Eclipse – download the latest version of Scala IDE from here. I used Scala IDE 4.7.0 Release, which supports both Scala and Java
  • Scala Version – 2.11 (make sure the Scala compiler is set to this version as well)
  • Spark Version – 2.2 (provided in the Maven dependency)
  • Java Version – 1.8
  • Maven Version – 3.3.9 (embedded in Eclipse)
  • winutils.exe

For running in a Windows environment, you need the Hadoop binaries in Windows format. winutils provides these, and we need to set the hadoop.home.dir system property to the directory whose bin folder contains winutils.exe. You can download winutils.exe here and place it at a path like this – c:/hadoop/bin/winutils.exe
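In code, this property can be set before creating the SparkContext. A minimal sketch – the c:/hadoop path is an assumption, adjust it to wherever you placed winutils.exe:

```java
public class WinutilsSetup {
    public static void main(String[] args) {
        // hadoop.home.dir must point to the directory that CONTAINS bin/winutils.exe
        // (assumed location here: c:/hadoop/bin/winutils.exe)
        System.setProperty("hadoop.home.dir", "c:/hadoop");
        System.out.println(System.getProperty("hadoop.home.dir"));
    }
}
```

Call this before initializing the SparkContext, otherwise Spark's Hadoop layer will complain that it cannot locate winutils.exe.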

Creating a Sample Application in Eclipse –

In Scala IDE, create a new Maven Project –



Replace the pom.xml as below –
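A minimal pom.xml for this setup could look like the following sketch – the groupId and artifactId are placeholders, and the dependency versions match the tools listed above (Spark 2.2 built against Scala 2.11, Java 1.8):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <!-- placeholder coordinates; use your own -->
  <groupId>com.example.spark</groupId>
  <artifactId>spark-wordcount</artifactId>
  <version>1.0-SNAPSHOT</version>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <!-- Spark core, compiled against Scala 2.11 -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.2.0</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.11.8</version>
    </dependency>
  </dependencies>
</project>
```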


For creating a Java WordCount program, create a new Java Class and copy the code below –

Java Code for WordCount
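A minimal Spark 2.2 Java WordCount could look like the sketch below – the input path input.txt and the winutils location are placeholder values:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JavaWordCount {
    public static void main(String[] args) {
        // On Windows, point hadoop.home.dir at the folder containing bin/winutils.exe
        System.setProperty("hadoop.home.dir", "c:/hadoop");

        SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // input.txt is a placeholder path
        JavaRDD<String> lines = sc.textFile("input.txt");

        // Split each line into words, pair each word with 1, then sum the counts
        JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.foreach(t -> System.out.println(t._1() + " : " + t._2()));
        sc.close();
    }
}
```

Note that in Spark 2.x the Java flatMap lambda must return an Iterator, not an Iterable as in Spark 1.x.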

Scala Version

For running the Scala version of the WordCount program, create a new Scala Object and use the code below –
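A minimal Scala sketch of the same WordCount – again, the input path and the winutils location are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ScalaWordCount {
  def main(args: Array[String]): Unit = {
    // On Windows, point hadoop.home.dir at the folder containing bin/winutils.exe
    System.setProperty("hadoop.home.dir", "c:/hadoop")

    val conf = new SparkConf().setAppName("ScalaWordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // input.txt is a placeholder path
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))   // split lines into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // sum the counts per word

    counts.collect().foreach { case (word, n) => println(s"$word : $n") }
    sc.stop()
  }
}
```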

You may need to set the project as a Scala project to run this, and make sure the Scala compiler version matches the Scala version in your Spark dependency by setting it in the build path.

So, your final setup will look like this –


Running the Code in Eclipse

You can run the above code with a simple Run As > Scala Application or Run As > Java Application in Eclipse to see the output.


Now you should be able to see the word count output, along with log lines generated using the default Spark log4j properties.


In the next post, I will explain how you can open the Spark Web UI and look at the various stages and tasks of Spark code execution internally.

You may also be interested in some other Big Data posts –

My Book on ELK Stack : Learning ELK Stack


I am writing this post to announce the general availability of my book on the ELK stack, titled "Learning ELK Stack", with Packt Publishing.

The book aims to help individuals/technologists who seek to implement their own log and data analytics solutions using the open-source stack of Elasticsearch, Logstash, and Kibana, popularly known as the ELK stack.

This is the first book ever published that covers the ELK stack.

Learning ELK Stack by Saurabh Chhajed


What is Apache HCatalog ?

What is HCatalog ?

Apache HCatalog is a storage management layer for Hadoop that helps users of different data processing tools in the Hadoop ecosystem, such as Hive, Pig, and MapReduce, easily read and write data from the cluster. HCatalog provides a relational view of data stored on HDFS in formats such as RCFile, Parquet, ORC, and Sequence files. It also exposes a REST API so that external systems can access the metadata. (more…)

Hashmap Performance Improvements in Java 8

Problem Statement :

Until Java 7, the java.util.HashMap implementation always suffered from the problem of hash collisions: when multiple hashCode() values end up in the same bucket, values are placed in a linked list, which degrades HashMap performance from O(1) to O(n).

Solution :

Java 8 improves the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. This improves collision performance for any key type that implements Comparable. (more…)
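To see why Comparable matters, here is a small illustrative sketch: a hypothetical key type whose hashCode() always returns the same value, forcing every entry into a single bucket. On Java 8 that bucket is converted to a balanced tree (because the key implements Comparable), so lookups stay O(log n) instead of degrading to O(n):

```java
import java.util.HashMap;
import java.util.Map;

public class CollidingKeyDemo {
    // Every instance hashes to the same bucket; Comparable lets Java 8 treeify it.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }  // deliberate total collision
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) { return Integer.compare(id, other.id); }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), i);  // all 10,000 entries land in one bucket
        }
        // Lookups remain fast on Java 8 despite the collisions
        System.out.println(map.get(new BadKey(1234)));  // prints 1234
    }
}
```

Running the same code on Java 7 still produces correct results, but each lookup walks a 10,000-element linked list instead of a tree.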

How-To : Integrate Kafka with HDFS using Camus (Twitter Stream Example)

Simple String Example for Setting up Camus for Kafka-HDFS Data Pipeline

I came across Camus while building a Lambda Architecture framework recently. I couldn't find a good illustration of getting started with a Kafka-HDFS pipeline, so in this post we will see how we can use Camus to build a Kafka-HDFS data pipeline using a Twitter stream produced by a Kafka producer, as mentioned in the last post.

What is Camus?

Camus is LinkedIn’s Kafka-to-HDFS pipeline. It is a MapReduce job that does distributed data loads out of Kafka. It includes the following features: (more…)

How-To : Write a Kafka Producer using Twitter Stream ( Twitter HBC Client)

Twitter open-sourced its Hosebird Client (hbc), a robust Java HTTP library for consuming Twitter’s Streaming API. In this post, I am going to present a demo of how we can use hbc to create a Kafka Twitter stream producer, which tracks a few terms in Twitter statuses and produces a Kafka stream out of them. That stream can later be utilized for counting the terms, or for moving the data from Kafka to Storm (a Kafka-Storm pipeline) or to HDFS (as we will see in the next post, using the Camus API).

You can download and run the complete sample here. (more…)
