Understanding Spark

Spark is a cluster computing framework for parallel and distributed processing of massive amounts of data. It largely replaces MapReduce, yet it is not as simple.

A better Hadoop

Hadoop MapReduce is effective at processing huge amounts of data. It divides the data into many small parts, processes them locally in parallel on a cluster, and produces an output. It is very straightforward, yet not completely satisfying:

  • performance issue: intermediary and final outputs are saved to disk,
  • interface/API issue: developers are forced to adapt their code to fit the map/reduce model.

Spark is meant to address these concerns as an improved version of Hadoop. Yet the newer framework might not be as easy to grasp.

Spark core concepts

In order to understand Spark, let's take an overview of its core concepts.

Driver Program

The driver program is the cockpit of a Spark application. It runs the user's main function, triggers operations on the cluster, and receives the final output.

Spark Context

The Spark Context is the main entry point of a Spark application. It is created once at the beginning of the driver program.
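As a sketch, a driver program typically creates its context like this (the app name and master URL below are illustrative, not taken from a real deployment):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("my-spark-app")  // illustrative name, shown in the cluster UI
      .setMaster("local[*]")       // run locally on all cores; use a cluster URL in production
    val sc = new SparkContext(conf)
    // ... transformations and actions go here ...
    sc.stop()  // release resources when the application is done
  }
}
```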

Cluster Manager

The cluster manager provides the driver program with resources on the cluster. An external cluster manager (Mesos or YARN) can be used, as well as the built-in standalone one.


Executor

An executor is a process executing tasks on a worker node. It is acquired by the driver program through the cluster manager. Each Spark application has its own executors.

Dataset (previously Resilient Distributed Dataset)

A dataset is a partitioned, read-only collection of items. It can be loaded from an HDFS file and cached for future computations.
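For instance, a minimal sketch (with a hypothetical HDFS path), assuming a SparkContext `sc` is available:

```scala
// Load a dataset from HDFS: one item per line, split into partitions
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
lines.cache() // keep the partitions around for future computations
```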

Directed Acyclic Graph (DAG)

The directed acyclic graph is a graph that represents the set of operations to be performed on the data, organized in several stages.


Transformation

A transformation (like map, filter, reduceByKey) creates a Dataset from an existing one. It is evaluated lazily, only when an action is called.
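A minimal sketch, assuming a SparkContext `sc` is available:

```scala
val numbers = sc.parallelize(1 to 100)   // distribute a local collection
val evens   = numbers.filter(_ % 2 == 0) // transformation: nothing is computed yet
val doubled = evens.map(_ * 2)           // still nothing: Spark only records the DAG
```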


Action

An action (like reduce, collect, count) returns a value to the driver program. An action triggers the evaluation of the previous transformations.
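A minimal sketch, assuming a SparkContext `sc` is available:

```scala
val numbers = sc.parallelize(1 to 100)
val evens   = numbers.filter(_ % 2 == 0) // lazy transformation
val total   = evens.reduce(_ + _)        // action: triggers the computation (2550)
val howMany = evens.count()              // action: runs the pipeline again (50)
```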

The execution of a distributed job

A Spark job is executed as follows:

  • The driver program starts and creates several stages divided into tasks
  • The driver program contacts the cluster manager to start executors on the cluster
  • Executors execute their assigned tasks and return the results to the driver program
  • The driver program releases its resources back to the cluster manager and exits


Spark specific features


Caching

Caching is a key feature that allows a Dataset to be stored and reused later in several different actions. Caching can be done in memory and/or on disk.
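A minimal sketch, with a hypothetical file name and assuming a SparkContext `sc`:

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("input.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK) // in memory, spilling to disk if needed
// (words.cache() is shorthand for persist at the default MEMORY_ONLY level)
val total   = words.count()            // first action computes and caches the Dataset
val uniques = words.distinct().count() // later actions reuse the cached partitions
```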

Normal variables

In the normal case, each variable used inside a function passed to Spark (such as map or reduce) is copied to each machine as a read-only variable.

Broadcast Variables

Broadcast variables are read-only variables cached on every node. They are useful when multiple stages need the same data.
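A minimal sketch, assuming a SparkContext `sc` is available:

```scala
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // shipped once to every node
val data   = sc.parallelize(Seq("a", "b", "a", "c"))
val coded  = data.map(w => lookup.value.getOrElse(w, 0)) // tasks read the local cached copy
```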


Accumulators

Accumulators are variables that can only be added to. They support parallel operations because addition is commutative and associative.
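A minimal sketch, assuming a SparkContext `sc` is available:

```scala
val errors = sc.longAccumulator("errors") // tasks add to it, only the driver reads it
sc.parallelize(Seq(1, -2, 3, -4)).foreach { x =>
  if (x < 0) errors.add(1) // safe in parallel: additions commute
}
println(errors.value) // the driver sees the total accumulated across all tasks
```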


Shuffle operations

Certain operations need data to be redistributed across the cluster. For example, in reduceByKey, all the values for a given key are not necessarily on the same machine.
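A minimal sketch, assuming a SparkContext `sc` is available:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val sums  = pairs.reduceByKey(_ + _) // shuffle: all values of a key are brought together
// sums contains ("a", 4) and ("b", 2)
```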

Example in Scala explained

We can use the Spark shell in order to quickly create a Spark application. The shell can be started by running the following command in the Spark directory:
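On a standard Spark distribution, this is:

```shell
# from the root of the Spark installation directory
./bin/spark-shell
```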


The following program creates a Dataset from a text file, then applies a transformation (filter) and an action (collect) on the Dataset. Note that no operation on the data is performed before the action is called.
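A minimal sketch of such a program (the file name is illustrative), as typed in the Spark shell where the context `sc` is already created:

```scala
val lines      = sc.textFile("README.md")          // defines the Dataset, reads nothing yet
val sparkLines = lines.filter(_.contains("Spark")) // transformation: still lazy
val result     = sparkLines.collect()              // action: now the file is read and filtered
```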

