Resilient Distributed Datasets -RDDs in Spark

Apcahe Spark has already taken over Hadoop (MapReduce)  because of plenty of benefits it provides in terms of faster execution in iterative processing algorithms such as Machine learning.

In this post, we will try to understand what makes Spark RDDs so useful in batch analytics .

Why RDD ?

When it comes to iterative distributed computing, i.e. processing data over multiple jobs in computations such as  Logistic Regression, K-means clustering, Page rank algorithms, it is fairly common to reuse or share the data among multiple jobs or it may involve multiple ad-hoc queries over a shared data set.This makes it very important to have a very good data sharing architecture so that we can perform fast computations.