Spark is a cluster computing framework for building parallel and distributed processing on massive amount of data. It somehow replaces MapReduce, but yet is not as simple.
A better Hadoop
Hadoop MapReduce is effective on processing huge amount of data. It divides data into many tiny parts, processes them locally in parallel on a cluster, and produces an output. It is very straightforward yet not completely satisfying:
- performance issue : intermediary and final outputs are saved on disk,
- interface/api issue : developers are forced to adapt there code to fit in a map/reduce
Spark is meant to address these concerns as a improvement version. Yet, the latter might not be as easy to grasp.
Spark core concepts
In order to Undertstand spark let’s make an overview on its core concepts.
The driver program is the cockpit of a spark application. It runs the user main function, triggers operations on the cluster, and receives the final output.
The Spark Context is the main entry point of a spark application. It is created once in the beginning of a the driver program.
The cluster manager provides the driver program with resources of the cluster. An external cluster manager can be used (Mesos or Yarn), as well as the built-in one.
An executor is a process executing tasks on a worker node. It is acquired by the driver program through the cluster manager. Each spark application has its own executors.
Dataset (previously Resilient Distributed Dataset)
A dataset is a partitioned, read-only collection of items. It can be loaded from a HDFS file and cached for future computations.
Direct Acyclic Graph (DAG)
The direct acyclic graph is a graph that represents a set of operations to be performed on the data in several stages.
A transformation (like map, filter, reduceByKey) create a Dataset from an existing one. It is only evaluated lazily when a action is called.
An action (like reduce, collect, count) return a value to the driver program. An action trigger the previous transformations.
The execution of a distributed job
A spark job is executed as follows:
- The driver program starts and creates several stages divided into tasks
- The driver program contacts the cluster manager to start executors on the cluster
- Executors execute tasks assigned and return the result to the driver program
- The driver program releases resources to the cluster manager and exits
Spark specific features
Caching is a key feature allowing to store a DataSet and reuse after in several different actions. Caching can be done in memory or/and on disk.
In the normal case, each parameter passed into a function (such as map or reduce) are copied on each machine as a read-only variables.
Broadcast variables are read-only cached on every node. They are useful when multiple stages need the same data.
Accumulator are variables that can be only added. They support parallel operations because addition is commutative and associative.
Certain operations need data to be redistributed across the cluster. By example in reduceByKey, the keys are not necessarily on the same machine.
Example in Scala explained
We can use Spark shell in order for quickly creating a spark application. The shell can started by running the following command in spark directory:
The following program create a DataSet from a text file, then apply a transformation (filter) and an action (collect). Note that no operation on data is done before the action is called.