Today we are going to talk about a famous BigData framework called Hadoop MapReduce. In this article, after presenting the framework, we will make a small example using Java and MapReduce Hadoop on Linux Raspbian (yes I am testing Hadoop on a raspberry pi!).
What is MapReduce?
MapReduce is a framework to process very large amount of data easily and efficiently. Typically it gives a solution to this problem: we have terabytes of data, and we need to apply some computation on it to produce a result.
Ok, but how? In very short, MapReduce will use the memory, cpu and storage capacity of a cluster of nodes. The developer using this framework will just need to:
- install and configure Hadoop on each node
- make a program that implements two methods: map and reduce
And it is magic? It is near magic, yes! The key concept of understanding MapReduce is actually its file system: the Hadoop Distributed File System (HDSF). It splits huge data into small chunks (typically 68MB or 128MB) and stores them in different nodes of a cluster. During computation, nodes will execute in parallel on the data that is locally stored. Therefore, we are able to:
- process extremely huge data by splitting them into smaller ones
- optimize CPU/RAM by executing in different nodes concurrently
- reduce network bandwidth because the data is located on the node where the computation happens
What does Map and Reduce Do?
Map takes key/value pairs and produce intermediary key/value pairs. The is known as the Map Phase and is executed in parallel by MapperTasks where each MapperTask process a different chunk of the input data.
(input) <k1, v1> -> Map -> <k2, v2> (intermediary ouput)
The framework then takes over in a process know as Shuffle and Sort. This happens before the ReducerTasks entering the Reduce phase.
Each intermediary key/value pairs is attributed to a specific ReducerTask (Shuffle): the only one responsible of a given key. The pairs are then grouped by key (Sort) producing pairs of key/list of values. According to the documentation, the shuffle and sort phases occur simultaneously.
Finally, ReducerTasks produces the final output for each key it is processing.
(int. ouput) <k2, v2> -> Shuffle&Sort -> <k2, listOf v2> -> Reduce -> <k3, v3> (ouput)
The complete process is:
<k1, v1> -> Map -> <k2, v2> -> Shuffle&Sort -> <k2, list of v2> -> Reduce -> <k3, v3>
Example with WordSearch
We need to install Hadoop first. I have choosen to install it on Raspbian. For the example, we are going to set up a single node system for the purpose of simplicity.
The input consists of two files located on a folder in the Linux machine.
hello world bye hadoop
hello world hi harry home holiday
Nb: for simplicity, those files and folders are created manually without using HDFS commands.
Our Hadoop program is a Java program that looks like this. In the Map method we take only the word that starts with a specific string. In the Reduce method, we simply count the occurrences of each word filtered by the Map Phase.
Running the MapReduce program
First, we need to compile and package our program into a .jar file. Second we can run the program with Hadoop by choosing the following parameters:
- the path of the java program
- the name of the java class contaning Map/Reduce
- the path of the input folder
- the path of the output folder
- the “startWith” string parameter
For the sake of simplicity let’s say that our jar, input and output are all located in Hadoop folder installation.
The command is to run MapReduce is:
bin/hadoop jar wordsearch.jar WordSearch input ouput ha
Looking at the file created in the output folder, we have a file named “part-r-00000” with all words starting with “ha” and the number of times they appear:
Which is correct!
Hadoop MapReduce is a cleanly thought distributed framework. In order to be use, one need a good comprehension of how things work.