How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse
A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.
Introduction
Apache Spark is a popular big data processing engine known for its fast, in-memory computation. This guide will help you set up a Spark project in Eclipse (Scala IDE), configure Maven, and run sample Java and Scala Spark applications.
Tools and Prerequisites
- Scala IDE for Eclipse (e.g., Scala IDE 4.7.0, which supports both Scala and Java)
- Scala Version: 2.11 (ensure your compiler matches this)
- Spark Version: 2.2 (set in Maven dependency)
- Java Version: 1.8
- Maven Version: 3.3.9 (embedded in Eclipse)
- winutils.exe (for Windows only)
Windows Note: winutils.exe
If you run Spark on Windows, you need Hadoop's native binaries in Windows format, which winutils.exe provides. Set the hadoop.home.dir system property to the Hadoop home directory whose bin folder contains winutils.exe (the sample code below does this with System.setProperty; an alternative using a VM argument is sketched after the list below).
- Download winutils.exe
- Place it at: C:/hadoop/bin/winutils.exe
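If you prefer not to hard-code the Hadoop path in your source (the examples below use System.setProperty), you can pass the same property as a VM argument in the Eclipse run configuration instead, for example:

    -Dhadoop.home.dir=C:/hadoop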
Creating a Sample Spark Application in Eclipse
Maven Project Setup
- In Scala IDE, create a new Maven Project.
- Replace the generated pom.xml with the following:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.saurzcode.spark</groupId>
  <artifactId>spark-app</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <dependency> <!-- Spark core dependency -->
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.2.0</version>
      <scope>provided</scope>
    </dependency>
  </dependencies>
</project>
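Because the examples below use Java 8 lambdas, the Maven compiler level should be 1.8 as well. A minimal sketch of a build section you could add inside the project element (the plugin version shown is only an example, adjust as needed):

  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.8.1</version>
        <configuration>
          <!-- Compile against Java 8 so the lambda syntax in the examples works -->
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>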
Java WordCount Example
Create a new Java class (e.g., JavaWordCount) and use the following code:
package com.saurzcode.spark;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JavaWordCount {

    public static void main(String[] args) throws Exception {
        String inputFile = "src/main/resources/input.txt";

        // Point hadoop.home.dir at the Hadoop binaries (Windows only)
        System.setProperty("hadoop.home.dir", "c://hadoop//");

        // Initialize the Spark context with a local master using 4 threads
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("wordCount").setMaster("local[4]"));

        // Load data from the input file
        JavaRDD<String> input = sc.textFile(inputFile);

        // Split each line into words and count the occurrences
        JavaPairRDD<String, Integer> counts = input
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        System.out.println(counts.collect());

        sc.stop();
        sc.close();
    }
}
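collect() brings all results back to the driver, which is fine for this small demo. If you would rather persist the counts, you could replace the println line with Spark's saveAsTextFile (the output path here is just an example; the directory must not already exist):

        // Write one text file per partition under the given directory (hypothetical path)
        counts.saveAsTextFile("target/wordcount-output");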
Scala WordCount Example
Create a new Scala object (e.g., ScalaWordCount) and use the following code:
package com.saurzcode.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ScalaWordCount {

  def main(args: Array[String]): Unit = {
    // Point hadoop.home.dir at the Hadoop binaries (Windows only)
    System.setProperty("hadoop.home.dir", "c://hadoop//")

    // Create the Spark context with a local master using 4 threads
    val sc = new SparkContext(new SparkConf().setAppName("Spark WordCount").setMaster("local[4]"))

    // Load the input file
    val inputFile = sc.textFile("src/main/resources/input.txt")

    // Split each line into words and count the occurrences
    val counts = inputFile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.foreach(println)

    sc.stop()
  }
}
Tip: Make sure your project is set up as a Scala project and that the Scala compiler version matches the Scala version of your Spark dependency (2.11 here). You can set this in the project's build path / Scala compiler settings.
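One way to keep the Scala versions aligned on the Maven side is to declare the Scala library explicitly in the pom. A sketch, assuming the 2.11 line (the exact patch version is an assumption):

    <dependency>
      <!-- Pin the Scala runtime to the 2.11 line used by spark-core_2.11 -->
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.11.8</version>
    </dependency>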
Running the Code in Eclipse
- Run the Java or Scala class as a standard Java or Scala application in Eclipse (Run As > Java Application / Scala Application).
- You should see the word count output along with Spark's log lines in the console; a sketch for trimming the log noise follows this list.
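Spark logs at INFO level by default, which produces a lot of console output. If you want less noise, one option (assuming Spark 2.x, which uses log4j 1.x) is to drop a log4j.properties file under src/main/resources, for example:

    # Raise the root log level so only warnings and errors reach the console
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n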
Output
You should see output similar to:
(hello, 3)
(world, 2)
(example, 1)
...