How to Configure Spark Application (Scala and Java 8 Version with Maven) in Eclipse

A step-by-step, developer-friendly guide to setting up Apache Spark applications in Eclipse/Scala IDE using Maven, with both Java and Scala examples.

Table of Contents

  • Introduction
  • Tools and Prerequisites
  • Creating a Sample Spark Application in Eclipse
  • Running the Code in Eclipse
  • Output

Introduction

Apache Spark is a popular big data processing engine known for its fast, in-memory computation. This guide will help you set up a Spark project in Eclipse (Scala IDE), configure Maven, and run sample Java and Scala Spark applications.


Tools and Prerequisites

  • Scala IDE for Eclipse (e.g., Scala IDE 4.7.0, which supports both Scala and Java)
  • Scala Version: 2.11 (ensure your compiler matches this)
  • Spark Version: 2.2 (set in Maven dependency)
  • Java Version: 1.8
  • Maven Version: 3.3.9 (embedded in Eclipse)
  • winutils.exe (for Windows only)

Windows Note: winutils.exe

If running on Windows, Spark's Hadoop layer needs Windows-native Hadoop binaries, which winutils.exe provides. Set the hadoop.home.dir system property to the Hadoop home directory whose bin folder contains winutils.exe (not to the bin folder itself).
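
For example, if winutils.exe is placed at C:\hadoop\bin\winutils.exe (an example location; adjust the path to your machine), set the property like this:

	// Example path only: hadoop.home.dir names the home directory, not the bin folder itself
	System.setProperty("hadoop.home.dir", "C:/hadoop/");

Both sample programs below include this line.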


Creating a Sample Spark Application in Eclipse

Maven Project Setup

  1. In Scala IDE, create a new Maven Project.
  2. Replace the generated pom.xml with the following:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>com.saurzcode.spark</groupId>
	<artifactId>spark-app</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<dependencies>
		<dependency> <!-- Spark dependency -->
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.11</artifactId>
			<version>2.2.0</version>
			<scope>provided</scope>
		</dependency>
	</dependencies>
</project>
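
If the generated project defaults to an older Java source level, you can pin the compiler to Java 8 inside the same <project> element; a minimal sketch (the plugin version is illustrative):

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.7.0</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
				</configuration>
			</plugin>
		</plugins>
	</build>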

Java WordCount Example

Create a new Java class (e.g., JavaWordCount) and use the following code:

package com.saurzcode.spark;

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class JavaWordCount {
	public static void main(String[] args) throws Exception {
		String inputFile = "src/main/resources/input.txt";
		// Windows only: hadoop.home.dir must point at the directory whose bin folder contains winutils.exe
		System.setProperty("hadoop.home.dir", "C:/hadoop/");
		// Initialize Spark Context
		JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("wordCount").setMaster("local[4]"));
		// Load data from Input File
		JavaRDD<String> input = sc.textFile(inputFile);
		// Split up into words and count
		JavaPairRDD<String, Integer> counts = input
			.flatMap(line -> Arrays.asList(line.split(" ")).iterator())
			.mapToPair(word -> new Tuple2<>(word, 1))
			.reduceByKey((a, b) -> a + b);
		System.out.println(counts.collect());
		// Shut down the context (a separate close() call would be redundant; it simply calls stop())
		sc.stop();
	}
}
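
Both examples read src/main/resources/input.txt, so create that file with a few lines of text. Any content works; as an illustration, these three lines would produce the counts shown in the Output section:

	hello world
	hello example world
	hello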

Scala WordCount Example

Create a new Scala object (e.g., ScalaWordCount) and use the following code:

package com.saurzcode.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object ScalaWordCount {
	def main(args: Array[String]): Unit = {
		// Windows only: hadoop.home.dir must point at the directory whose bin folder contains winutils.exe
		System.setProperty("hadoop.home.dir", "C:/hadoop/")
		// Create Spark context
		val sc = new SparkContext(new SparkConf().setAppName("Spark WordCount").setMaster("local[4]"))
		// Load input file
		val inputFile = sc.textFile("src/main/resources/input.txt")
		val counts = inputFile
			.flatMap(line => line.split(" "))
			.map(word => (word, 1))
			.reduceByKey(_ + _)
		// Collect to the driver before printing (a bare foreach prints on the executors in cluster mode)
		counts.collect().foreach(println)
		sc.stop()
	}
}

Tip: Make sure your project is set up as a Scala project and that the project's Scala compiler version matches the Scala version of your Spark dependency (2.11 here). In Scala IDE you can adjust this via the project's build path or the Scala Compiler settings in the project properties.
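
One way to keep those versions aligned on the Maven side is to declare the Scala binary version once as a property and reference it from the dependency; an illustrative sketch, not required by the steps above:

	<properties>
		<scala.binary.version>2.11</scala.binary.version>
	</properties>
	<!-- and in the Spark dependency: -->
	<artifactId>spark-core_${scala.binary.version}</artifactId>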


Running the Code in Eclipse

  • Run JavaWordCount via Run As > Java Application and ScalaWordCount via Run As > Scala Application in Eclipse.
  • You should see the word count output and Spark log lines in the console.

Output

You should see output similar to:

(hello, 3)
(world, 2)
(example, 1)
...