Distributed programming

Kafka core concepts

Kafka is a messaging framework for building real time streaming applications. It allows to build distributed publish-subscribe systems. In this article we will present the core concepts of the framework.

APIs

One way to start with Kafka is to understand its APIs. There are four of them :

  • The Producer API allows to publish records to a particular category of message called topic.
  • The Streams API allows to transform messages from a topic into messages of a output topic
  • The Consumer API allows to subscribe to messages published.
  • The Connect API allows to send messages to other applications such as a Database

Clients that expose those APIs are available in several languages such as Java, C#, C++ and Python.

Architecture

Kafka is based on the concept of topic. A topic is a category of messages. It is divided into several parts called partitions. The whole topic represents a log. A partition is an append only, ordered sequence of records.

kafka_partition

Each record in a partition is uniquely identified by an id called the offset. The record consists in a key, a time stamp, and a value. Records are saved on disk and have an automatic retention policy: they are erased after the configured period. Partitions can also be replicated.

The producer chooses to which partition he will write into. For example it could be in a round robin fashion or based on a hash function.

kafka_architecture

 

A consumer group represents a set of consumer instances. The consumption of the partitions is fairly divided among the consumer instances.  Each consumer instance is the exclusive consumer of “his” partitions.

Key elements to understand

Kafka can deal with large amount of data with little overhead thanks to key features:

  • Consumption managed by consumers means less overhead on the server side
  • Partitions having different consumers provide parallelism and a way to scale
  • Automatic deletion gives the fastest way to delete old record

The framework also give some guarantees:

  • Partitions replication provides fault tolerance
  • Persistent messages allows to replay computation for a consumer
  • A consumer has an exclusivity (within its group) over the partitions he consumes
  • Writes in a partition are append only and ordered

Kafka is made with simple choices yet rather innovative. Those choices makes it robust, fast and scalable.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s