
How to Configure a Spark Application (Scala and Java 8 Versions with Maven) in Eclipse


Apache Spark is becoming very popular among organizations looking to leverage its fast, in-memory computing capability for big-data processing. This article helps beginners get started with Spark setup in Eclipse/Scala IDE and become familiar with Spark terminology in general.

I hope you have read the previous article on RDD basics to get a basic understanding of Spark RDDs.

Tools Used :

  • Scala IDE for Eclipse – download the latest version of Scala IDE. Here, I used the Scala IDE 4.7.0 release, which supports both Scala and Java
  • Scala version 2.11 (make sure the Scala compiler is set to this version as well)
  • Spark version 2.2 (provided as a Maven dependency)
  • Java version 1.8
  • Maven version 3.3.9 (embedded in Eclipse)
  • winutils.exe

To run in a Windows environment, you need the Hadoop binaries in Windows format. winutils provides them, and you need to set the hadoop.home.dir system property to the directory that contains the bin folder holding winutils.exe. You can download winutils.exe and place it at a path like c:/hadoop/bin/winutils.exe; in that case hadoop.home.dir is c:/hadoop, because Hadoop looks for bin/winutils.exe under that directory.
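As a minimal sketch of the winutils setup, the property can be set from code before any Spark context is created. Hadoop resolves bin/winutils.exe relative to hadoop.home.dir, so for the example path above the value would be c:/hadoop. The class and method names here (WinutilsSetup, configureHadoopHome, winutilsPath) are hypothetical helpers for illustration, not part of Spark or Hadoop.

```java
import java.io.File;

public class WinutilsSetup {

    // Builds the location where Hadoop is expected to look for winutils.exe:
    // <hadoopHome>/bin/winutils.exe
    static File winutilsPath(String hadoopHome) {
        return new File(new File(hadoopHome, "bin"), "winutils.exe");
    }

    // Sets the hadoop.home.dir system property and reports whether
    // winutils.exe actually exists under <hadoopHome>/bin.
    static boolean configureHadoopHome(String hadoopHome) {
        System.setProperty("hadoop.home.dir", hadoopHome);
        return winutilsPath(hadoopHome).isFile();
    }

    public static void main(String[] args) {
        // Assumption: winutils.exe was placed at c:/hadoop/bin/winutils.exe
        String hadoopHome = args.length > 0 ? args[0] : "c:/hadoop";
        boolean found = configureHadoopHome(hadoopHome);
        System.out.println("hadoop.home.dir = " + System.getProperty("hadoop.home.dir"));
        System.out.println("winutils.exe found: " + found);
    }
}
```

In a real application, the System.setProperty call would go at the top of main, ahead of the code that creates the Spark context.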

Creating a Sample Application in Eclipse –

In Scala IDE, create a new Maven Project –

[Screenshots: creating a new Maven project in Scala IDE]

Replace the contents of pom.xml as below –

POM.XML
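If you are writing the POM by hand, the tool list above suggests roughly the following fragment: a Spark core dependency built for Scala 2.11 and Java 8 compiler settings. This is only a sketch under assumptions. The exact patch version (2.2.0) and the use of compiler properties are my choices for illustration, and your project's groupId, artifactId, and any build plugins still need to be filled in.

```xml
<!-- Fragment to place inside <project> in pom.xml. -->
<!-- Versions are assumptions derived from the tool list (Spark 2.2, Scala 2.11, Java 1.8). -->
<properties>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <!-- The _2.11 suffix is the Scala binary version the artifact was built for -->
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
</dependencies>
```

The suffix on the artifactId is what needs to match the Scala compiler version configured in the IDE.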

For creating a Java WordCount program, create a new Java Class and copy the code below –

Java Code for WordCount

Scala Version

To run the Scala version of the WordCount program, create a new Scala object and use the code below –

You may need to set the project as a Scala project to run this. Also make sure the Scala compiler version matches the Scala version of your Spark dependency by setting it in the build path.

So, your final setup will look like this –

[Screenshot: final project setup in Eclipse]

Running the Code in Eclipse

You can run the above code in Eclipse using Run As > Scala Application or Run As > Java Application to see the output.

Output

Now you should be able to see the word count output, along with the log lines generated using the default Spark log4j properties.

[Screenshot: word count output in the Eclipse console]

In the next post, I will explain how you can open the Spark Web UI and look at the various stages and tasks of Spark code execution.


What is RDD in Spark, and Why Do We Need It?

Resilient Distributed Datasets (RDDs) in Spark

Apache Spark has already taken over from Hadoop (MapReduce) for many workloads because of the benefits it provides, notably faster execution of iterative processing algorithms such as machine learning.

In this post, we will try to understand what makes Spark RDDs so useful in batch analytics.

Why RDD ?

When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as logistic regression, k-means clustering, and PageRank, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared data set. This makes a good data-sharing architecture essential for fast computation.
