Last time we introduced Hadoop MapReduce and saw that it relies on its own file system: the Hadoop Distributed File System (HDFS). Today we are going to take a closer look at it.
What is HDFS?
The goal of HDFS is to store large amounts of data in a distributed and fault-tolerant way. A large file loaded into HDFS is divided into blocks, and each block is replicated several times across different nodes of the cluster. The block size and the replication factor are configurable per file. This layout allows MapReduce to run computation locally, on the nodes where the data actually lives. HDFS is said to be append-only: once a file is created in HDFS, its content can only be appended to, never modified in place.
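To make the splitting and replication idea concrete, here is a small toy sketch in Python. It is not real HDFS code: the block size, replication factor, node names, and the round-robin placement policy are all simplified assumptions for illustration (real HDFS uses 128 MB blocks, a default replication factor of 3, and a rack-aware placement policy).

```python
# Toy illustration only, NOT real HDFS code: split a file's bytes into
# fixed-size blocks and assign each block to several hypothetical nodes.
from itertools import cycle

BLOCK_SIZE = 4                      # tiny block size, for demonstration
REPLICATION = 3                     # number of copies of each block
DATA_NODES = ["node1", "node2", "node3", "node4"]

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the content into fixed-size chunks (the last one may be smaller)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> list[list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin style."""
    placement = []
    starts = cycle(range(len(nodes)))
    for _ in range(num_blocks):
        start = next(starts)
        placement.append([nodes[(start + r) % len(nodes)] for r in range(replication)])
    return placement

data = b"hello hdfs world!"
blocks = split_into_blocks(data, BLOCK_SIZE)
placement = place_replicas(len(blocks), DATA_NODES, REPLICATION)
for block, nodes in zip(blocks, placement):
    print(block, "->", nodes)
```

Reassembling the blocks in order gives back the original file, and every block exists on several nodes, so losing one node loses no data.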
NameNode and DataNodes
HDFS is based on a master/slave architecture. The NameNode is the single master responsible for managing access to the data by clients. The DataNodes are the nodes responsible for storing the data itself. The NameNode stores the metadata of all the data kept in HDFS and maintains a traditional hierarchical file organisation: the File System Namespace.
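This split between metadata and data shows up directly in the configuration. The property names below are the standard HDFS keys from hdfs-site.xml; the paths and values are made-up examples, not recommendations.

```xml
<!-- Excerpt from a hypothetical hdfs-site.xml; property names are the
     standard HDFS configuration keys, the values are examples only. -->
<configuration>
  <property>
    <!-- where the NameNode persists the File System Namespace metadata -->
    <name>dfs.namenode.name.dir</name>
    <value>/var/hadoop/name</value>
  </property>
  <property>
    <!-- where each DataNode stores the actual file blocks -->
    <name>dfs.datanode.data.dir</name>
    <value>/var/hadoop/data</value>
  </property>
  <property>
    <!-- default replication factor for newly created files -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```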
Some HDFS commands
These commands are rather simple. To try them out, you can simply install Hadoop on one machine using the single-node setup.
List contents: bin/hadoop fs -ls
Load a file (from the local file system into HDFS): bin/hadoop fs -put localfile targetdir
Delete a directory recursively: bin/hadoop fs -rm -r dir
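Putting these together, a short session on a single-node setup might look like the following. The file and directory names are made up for illustration; the commands and flags themselves are standard HDFS shell usage.

```shell
$ bin/hadoop fs -mkdir -p /user/alice/input      # create a directory in HDFS
$ bin/hadoop fs -put data.txt /user/alice/input  # copy a local file into HDFS
$ bin/hadoop fs -ls /user/alice/input            # list the directory contents
$ bin/hadoop fs -cat /user/alice/input/data.txt  # print the file's content
$ bin/hadoop fs -rm -r /user/alice/input         # delete the directory recursively
```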