Apache Spark is becoming increasingly popular among organizations looking to leverage its fast, in-memory computing capabilities for big-data processing. This article helps beginners set up Spark in Eclipse/Scala IDE and get familiar with general Spark terminology –
If you have not already, read the previous article on RDD basics to get a basic understanding of Spark RDDs.
Tools Used:
- Scala IDE for Eclipse – download the latest version of Scala IDE from here. Here, I used Scala IDE 4.7.0 Release, which supports both Scala and Java
- Scala version – 2.11 (make sure the Scala compiler is set to this version as well)
- Spark version – 2.2 (provided as a Maven dependency)
- Java version – 1.8
- Maven version – 3.3.9 (embedded in Eclipse)
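Assuming a standard Maven project, the versions listed above translate into a dependency entry like the following sketch (the `_2.11` suffix in the artifact ID encodes the Scala version, so it must match the compiler setting above; this is only the dependency fragment, not a complete pom.xml):

```xml
<!-- Spark core compiled for Scala 2.11, matching the setup above -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.2.0</version>
</dependency>
```

If the Scala compiler version and the artifact suffix disagree (for example, compiling with Scala 2.12 against `spark-core_2.11`), you will typically see binary-incompatibility errors at runtime.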
Resilient Distributed Datasets (RDDs) in Spark
Apache Spark has largely overtaken Hadoop MapReduce because of the many benefits it provides, chiefly faster execution of iterative processing algorithms such as those used in machine learning.
In this post, we will try to understand what makes Spark RDDs so useful in batch analytics.
Why RDD?
When it comes to iterative distributed computing, i.e. processing data across multiple jobs in computations such as logistic regression, K-means clustering, or PageRank, it is fairly common to reuse or share data among multiple jobs, or to run multiple ad-hoc queries over a shared dataset. This makes a good data-sharing architecture essential for fast computation.
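To sketch why data sharing matters, consider a minimal Spark example in Scala that runs two separate jobs over the same dataset. Calling cache() keeps the parsed RDD in memory after the first job, so the second job does not re-read and re-parse the input file (the file name points.txt is a placeholder, and local[*] is used only so the sketch runs without a cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddReuseExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddReuse").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Parse the input once and keep the result in memory for reuse.
    val points = sc.textFile("points.txt")            // hypothetical input file
      .map(_.split(",").map(_.toDouble))
      .cache()

    // Two separate actions trigger two separate jobs over the same RDD:
    val count = points.count()            // job 1: computes and populates the cache
    val total = points.map(_.sum).sum()   // job 2: served from memory, not disk

    println(s"count=$count, total=$total")
    sc.stop()
  }
}
```

Without the cache() call, each action would recompute the full lineage from the source file; with it, iterative algorithms like K-means can make many passes over the same in-memory data.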