Distributed programming

Kafka core concepts

Kafka is a messaging framework for building real-time streaming applications. It lets you build distributed publish-subscribe systems. In this article we will present the core concepts of the framework.

APIs

One way to start with Kafka is to understand its APIs. There are four of them:

  • The Producer API lets an application publish records to a named category of messages called a topic.
  • The Streams API lets an application transform the records of input topics into records of output topics.
  • The Consumer API lets an application subscribe to topics and process the published records.
  • The Connect API lets you connect topics to external systems such as a database.

Clients that expose those APIs are available in several languages such as Java, C#, C++ and Python.

Architecture

Kafka is based on the concept of a topic. A topic is a category of messages. It is divided into several parts called partitions, and the whole topic represents a log. A partition is an append-only, ordered sequence of records.

[Figure: a topic divided into partitions]

Each record in a partition is uniquely identified by an id called the offset. A record consists of a key, a timestamp, and a value. Records are saved on disk and have an automatic retention policy: they are erased after the configured period. Partitions can also be replicated.

The producer chooses which partition it writes to, for example in a round-robin fashion or based on a hash function applied to the record key.
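
As an illustration, here is a minimal Java producer sketch using the official client; the broker address, topic name and key below are placeholder values:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // With a non-null key, the default partitioner hashes the key to choose the partition;
            // with a null key, records are spread across the partitions.
            producer.send(new ProducerRecord<>("my-topic", "user-42", "hello kafka"));
        }
    }
}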

[Figure: Kafka architecture with producers, brokers and consumer groups]

A consumer group represents a set of consumer instances. The consumption of the partitions is fairly divided among the consumer instances, and each consumer instance is the exclusive consumer of its partitions.
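
A matching Java consumer sketch (again with placeholder broker address, topic and group id); every instance started with the same group.id shares the partitions of the topic:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "my-consumer-group");        // instances sharing this id split the partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) { // stop with Ctrl+C; a real application would handle shutdown
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}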

Key elements to understand

Kafka can deal with large amounts of data with little overhead thanks to a few key features:

  • Consumption state is managed by the consumers themselves, which means less overhead on the server side
  • Partitions consumed by different consumer instances provide parallelism and a way to scale
  • Automatic retention-based deletion is the fastest way to remove old records

The framework also gives some guarantees:

  • Partition replication provides fault tolerance
  • Persistent messages allow a consumer to replay its computation
  • A consumer has exclusivity (within its group) over the partitions it consumes
  • Writes in a partition are append-only and ordered

Kafka is built on simple yet rather innovative choices. Those choices make it robust, fast, and scalable.

NoSQL

Things to know when switching from an RDBMS to MongoDB

Before switching from an RDBMS such as Oracle or SQL Server to MongoDB, one should be familiar with some key concepts of the NoSQL database.

Translation needed

Before starting, we should translate some traditional concepts from the RDBMS world:

  • A collection is like a table
  • A document is like a row
  • A field is like a column

That said we can take a look at the key concepts of MongoDB.

Key Concepts

Dynamic schema

The fields of a document can be changed at any time. Documents of a collection may have different fields, or even different data types for the same field.
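
A small sketch of this flexibility using the MongoDB Java driver (mongodb-driver-sync); the server address, database and collection names are illustrative:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DynamicSchemaExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> persons =
                    client.getDatabase("testdb").getCollection("persons");

            // Two documents in the same collection with different fields.
            persons.insertOne(new Document("person_id", "1")
                    .append("surname", "Doe"));
            persons.insertOne(new Document("person_id", "2")
                    .append("surname", "Smith")
                    .append("age", 42)); // an extra field, of a different type
        }
    }
}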

Atomicity of write operations

Write operations are atomic for a single document. A document will never be partially updated, even if it contains multiple sub-documents. In contrast, write operations spanning multiple documents are not atomic.

Unique Index

A unique index enforces the uniqueness of a field across a collection. For example, we can create a unique index on the field person_id in the persons collection.
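
A sketch of creating such an index with the Java driver (same illustrative names as above); once it exists, inserting a second document with the same person_id fails with a duplicate key error:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;

public class UniqueIndexExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Create a unique ascending index on person_id in the persons collection.
            client.getDatabase("testdb")
                  .getCollection("persons")
                  .createIndex(Indexes.ascending("person_id"), new IndexOptions().unique(true));
        }
    }
}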

Read uncommitted

Readers can see the result of writes before the writes are said to be durable. A durable write is a write that persists after a shutdown or restart of one or more servers.

Replica

The same data set can be replicated in order to provide high availability (for read performance) or redundancy (for data security).

Sharding

A collection can be partitioned among multiple servers in order to offer horizontal scaling. Data partitioning is based on a key called the shard key.

MongoDB is easy to start with but can be difficult to master. These are the key concepts to know before diving deeper into the NoSQL database.

Distributed programming

A word on HDFS (Hadoop Distributed File System)

Last time we gave an introduction to Hadoop MapReduce and saw that it relies on its own file system: the Hadoop Distributed File System (HDFS). Today we are going to take a closer look at it.

What is HDFS?

The goal of HDFS is to store large amounts of data in a distributed and fault-tolerant way. A large file loaded into HDFS is divided into small blocks and replicated several times on different nodes of the cluster. The block size and the replication factor are actually configurable per file. This allows MapReduce to apply computation to the data locally. HDFS is said to be append-only: once a file is created in HDFS, data can only be appended to it.

NameNode and DataNodes

HDFS is based on a master/slave architecture. The NameNode is the single node responsible for managing client access to the data. The DataNodes are the nodes responsible for storing the data. The NameNode actually stores the metadata of all the data stored in HDFS, and maintains a traditional hierarchical file organisation: the File System Namespace.

Some HDFS commands

These commands are rather simple. To try them, you can simply install Hadoop on a machine using a single-node setup.

List contents: bin/hadoop fs -ls

Load a file: bin/hadoop fs -put sourcedir targetdir

Delete a directory recursively: bin/hadoop fs -rm -r dir
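
The same operations can also be performed programmatically. Here is a minimal sketch using the Java FileSystem API to list a directory, assuming the Hadoop configuration files are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsList {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS from core-site.xml found on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of "bin/hadoop fs -ls <dir>" for the directory passed as argument.
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}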

Distributed programming

An introduction to Hadoop MapReduce

Today we are going to talk about a famous Big Data framework called Hadoop MapReduce. In this article, after presenting the framework, we will build a small example using Java and Hadoop MapReduce on Raspbian Linux (yes, I am testing Hadoop on a Raspberry Pi!).

What is MapReduce?

MapReduce is a framework for processing very large amounts of data easily and efficiently. Typically it gives a solution to this problem: we have terabytes of data, and we need to apply some computation on it to produce a result.

Ok, but how? In very short, MapReduce will use the memory, CPU and storage capacity of a cluster of nodes. The developer using this framework just needs to:

  • install and configure Hadoop on each node
  • make a program that implements two methods: map and reduce

And it is magic? It is near magic, yes! The key to understanding MapReduce is actually its file system: the Hadoop Distributed File System (HDFS). It splits huge data into small chunks (typically 64MB or 128MB) and stores them on different nodes of a cluster. During computation, nodes execute in parallel on the data that is locally stored. Therefore, we are able to:

  • process extremely large datasets by splitting them into smaller chunks
  • optimize CPU/RAM usage by executing on different nodes concurrently
  • reduce network bandwidth because the data is located on the node where the computation happens

What do Map and Reduce do?

Map Phase

Map takes key/value pairs and produces intermediary key/value pairs. This is known as the Map phase and is executed in parallel by MapperTasks, where each MapperTask processes a different chunk of the input data.

(input) <k1, v1> -> Map -> <k2, v2> (intermediary output)

Reduce Phase

The framework then takes over in a process known as Shuffle and Sort. This happens before the ReducerTasks enter the Reduce phase.

Each intermediary key/value pair is assigned to a specific ReducerTask (Shuffle): the only one responsible for a given key. The pairs are then grouped by key (Sort), producing pairs of key/list of values. According to the documentation, the shuffle and sort phases occur simultaneously.

Finally, each ReducerTask produces the final output for the keys it is processing.

(int. output) <k2, v2> -> Shuffle&Sort -> <k2, list of v2> -> Reduce -> <k3, v3> (output)

The complete process is:

<k1, v1> -> Map -> <k2, v2> -> Shuffle&Sort -> <k2, list of v2> -> Reduce -> <k3,  v3>

Example with WordSearch

Installation

We need to install Hadoop first. I have chosen to install it on Raspbian. For this example, we are going to set up a single-node system for the sake of simplicity.

Input data

The input consists of two files located in a folder on the Linux machine.

File01 contains: hello world bye hadoop

File02 contains: hello world hi harry home holiday

Nb: for simplicity, those files and folders are created manually without using HDFS commands.

Source File

Our Hadoop program is a simple Java class. In the Map method we keep only the words that start with a specific string; in the Reduce method, we simply count the occurrences of each word kept by the Map phase.
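
A minimal sketch of what such a WordSearch job can look like, assuming the prefix is passed to the tasks through the job configuration (the property name startWith is an illustrative choice):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordSearch {

  // Emits (word, 1) for every word that starts with the configured prefix.
  public static class SearchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String prefix = context.getConfiguration().get("startWith", "");
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        if (token.startsWith(prefix)) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Sums the counts of each filtered word.
  public static class SearchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("startWith", args[2]); // the "startWith" string parameter

    Job job = Job.getInstance(conf, "word search");
    job.setJarByClass(WordSearch.class);
    job.setMapperClass(SearchMapper.class);
    job.setReducerClass(SearchReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}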

Running the MapReduce program

First, we need to compile and package our program into a .jar file. Second, we can run the program with Hadoop, passing the following parameters:

  • the path of the jar file
  • the name of the Java class containing Map/Reduce
  • the path of the input folder
  • the path of the output folder
  • the “startWith” string parameter

For the sake of simplicity, let’s say that our jar, input and output folders are all located in the Hadoop installation folder.

The command to run MapReduce is:

bin/hadoop jar wordsearch.jar WordSearch input output ha

Output

Looking at the file created in the output folder, we have a file named “part-r-00000” with all words starting with “ha” and the number of times they appear:

hadoop 1
harry 1

Which is correct!

Hadoop MapReduce is a cleanly designed distributed framework. To use it well, one needs a good understanding of how things work.

NoSQL

A brief look at MongoDB

In this article, we are going to look at a NoSQL database named MongoDB. MongoDB is a document-oriented database that targets high performance and high volume. In this article, I am going to install a MongoDB server on Windows 10 64-bit and use the C# client, but other operating systems for the server, as well as other languages for the client, are available.

From server installation to database creation (Windows 10 64-bit)

The first thing we need to do is download and install the .msi of the community version of MongoDB for Windows 64-bit (I have personally downloaded version 3.4.3).

Then we create a folder on disk where the database files will be stored. I have chosen to create a folder MongoDB in “C:\Data\”, but it could be anywhere else.

Third, we need to start the server using mongod.exe:

> "C:\Program Files\MongoDB\Server\3.4\bin\mongod.exe" --dbpath "C:\Data\MongoDB"

We can now connect to the server using mongo.exe (in a new command window),

> "C:\Program Files\MongoDB\Server\3.4\bin\mongo.exe"

… and create a new database named testdb.

> use testdb

We have now a new database ready to be used!

Using the database with the C# client

Installing the C# client via NuGet

The C# client can be downloaded from NuGet by installing the MongoDB.Driver package. I have personally installed it alongside a Console Application targeting .NET 4.6.2.

Creating documents

Remember that in NoSQL, we do not need to define the structure of a collection before inserting the data. Here I am going to create two documents in the collection “persons”.

Finding a document

And now we are going to find one of them to see if the insertion went well. For that we are going to use a filter on the profile_id of a person.

Which works fine!

{ "_id" : ObjectId("58dfc1e6a515863c90cc3976"), "profile" : { "profile_id" : "1", "surname" : "Khakurel", "firstname" : "Pradip" } }

MongoDB is very intuitive and easy to use!!!

Distributed programming

Introducing ZeroMQ with C# and Java native port

ZeroMQ is a messaging library written in C++ for building distributed applications. The library can be used when performance and stability both matter. In addition to the original library, there are bindings in languages such as C# or Java that wrap the C++ DLL. Moreover, 100% native ports exist in these same languages. In this post, I will use JeroMQ (Java) and NetMQ (.NET) to make a client and a server communicate in a very simple request-reply pattern.

The server in C#

We need to download the latest version of NetMQ via NuGet. The server binds to an address so that clients can connect to it, receives a message (the request) and sends a message (the reply).

The client in Java

We need to include the JeroMQ dependency in the Maven pom.xml file.

Then we can create the client application. The client connects to the server, sends a message (the request), and receives a message (the reply).
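
A minimal sketch of such a JeroMQ client, assuming the NetMQ server binds to tcp://*:5555 (the endpoint is a placeholder and must match the server’s address):

import org.zeromq.ZMQ;

public class ReqClient {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket socket = context.socket(ZMQ.REQ);

        // Connect to the address the NetMQ server binds to.
        socket.connect("tcp://localhost:5555");

        socket.send("Hello");            // the request
        String reply = socket.recvStr(); // the reply
        System.out.println("Received: " + reply);

        socket.close();
        context.term();
    }
}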

ZeroMQ and its related ports can also be used in other patterns like publish-subscribe, push-pull, exclusive pair, etc. This might be the topic of another post…

Concurrent Programming

A quick review of the Concurrency Visualizer in Visual Studio 2015

The Concurrency Visualizer is a free extension for Visual Studio 2015 that can be used to analyse the performance of a concurrent application. I am going to give a quick overview of this extension.

The tested code

This is the code I am going to profile with the extension. It basically creates four non-blocking tasks that compute 1 x 1 infinitely. The main thread then waits for user input.

Concurrency Visualizer in action

The CPU Utilization view shows that the four cores are used at nearly 100%.

[Figure: CPU Utilization view]

The profiling report shows that the four CLR worker threads (9664, 6352, 5168, and 9724) are executing most of the time, while the main thread waits for I/O most of the time.

[Figure: Threads view]

Clicking on a particular worker thread at a particular time shows the call stack.

[Figure: Threads view with the call stack of a worker thread]

The Cores view shows that threads switch cores while running. More specifically, more than one third of our four worker threads’ context switches cross cores.

[Figure: Cores view]

The Concurrency Visualizer helps identify problems like serialized execution (threads executing one at a time), poor work distribution, oversubscription (too many active threads), overuse of I/O, and inefficient synchronization (lock convoys). It seems like a great tool!