This article is a beginner’s guide to the basic architecture, components and concepts of Apache Kafka. We will try to understand what Kafka is, what its use cases are, and what the basic APIs and components of the Kafka ecosystem are.
What is Kafka?
Kafka is a distributed messaging system based on the publish-subscribe (pub-sub) model. It lets us publish and subscribe to streams of records, and it can also store those streams. It is fast, highly scalable and fault-tolerant. It can be used as an alternative to a traditional message broker, and it also lets us process/transform streams of records. In short, Kafka can be used as a messaging system, but at a much larger scale than a conventional one.
Use Cases of Kafka
Kafka has many use cases. Some of the most popular ones are listed below –
- Kafka can be used as a robust publish-subscribe system.
- Kafka can work as a message broker solution as well.
- One use case of Kafka is user interaction and website activity tracking. Which pages a user visits, what they search for, which links they click and so on are published as events for back-end analytics and user interaction pattern discovery. Each type of event generally gets its own topic.
- One very compelling use case of Kafka is application log and metrics collection. A distributed environment has a large number of servers, each generating its own log files, and checking the logs server by server quickly becomes impractical. One solution is a centralized logging system: Kafka receives the data/events from individual servers and applications, processes them with the Streams API and publishes them to the appropriate topics.
- Kafka is an excellent choice for building a processing pipeline in which data streams go through different stages of processing, transformation etc. One such use is a recommendation system in an e-commerce website or mobile application. User activities, clicks etc. can be tracked and published to a topic like “useractivities”; the events in this topic may then be de-duplicated and enriched to build a user interest graph, which can be used to recommend products of interest to the user.
Kafka APIs and High-level Components in Kafka Ecosystem
As illustrated in the diagram above, Kafka has the following core APIs and ecosystem components –
- Producer API – This API enables the source or sender system to send data to the topics in Kafka cluster(s).
- Consumer API – This API enables the receiving or consuming application to consume the data from Kafka cluster(s).
- Streams API – This API enables transformation of incoming data; transformation may be simple mapping, filtering, aggregation etc.
- Kafka Connect Source API – When many similar source systems each write near-identical code to get their data into a Kafka cluster, that effort is duplicated. This API avoids the duplication by letting developers reuse and configure existing connectors instead of writing the same code repeatedly.
- Kafka Connect Sink API – Similarly, this API sinks the data from Kafka into the target system.
- AdminClient API – This API provides the capability to manage and inspect Kafka objects such as topics and brokers.
- Mirror Maker – Kafka’s MirrorMaker tool mirrors data between Kafka clusters. A common use case for this feature is creating a backup/replica of the dataset in another datacenter.
- Zookeeper – Zookeeper has traditionally been required for running Kafka and is used to co-ordinate/manage its brokers and operations. Newer Kafka releases can run without it in KRaft mode, and Kafka 4.0 removes the Zookeeper dependency entirely.
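To make the Streams API bullet a little more concrete, here is a minimal sketch of the three kinds of transformation it mentions (filtering, mapping, aggregation). It is plain Python over an in-memory list, not Kafka Streams itself (which is a Java library); the event fields and values are made up for illustration.

```python
from collections import Counter

# Hypothetical website-activity events, as they might arrive on a topic.
events = [
    {"user": "u1", "type": "click", "page": "/home"},
    {"user": "u2", "type": "search", "query": "kafka"},
    {"user": "u1", "type": "click", "page": "/pricing"},
]

# Filtering: keep only click events.
clicks = (e for e in events if e["type"] == "click")

# Mapping: reduce each event to the (user, page) pair we care about.
pages_by_user = ((e["user"], e["page"]) for e in clicks)

# Aggregation: count clicks per user.
clicks_per_user = Counter(user for user, _ in pages_by_user)
```

A real stream processor applies the same steps continuously to records as they arrive, rather than to a finished list.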
Kafka Fundamental Concepts
In this section we’ll go over some fundamental concepts of Kafka. It’s imperative to have a clear understanding of these concepts, as they will be useful when we start working through the tutorials.
In a pub-sub model, a “topic” is a logical channel to which producers publish messages and from which consumers receive messages; the same applies to Kafka.
- In Kafka, a topic defines a stream of a particular type/classification of data.
- Messages are organized into topics. In other words, messages of a particular type are published to a particular topic.
- A producer writes its messages to topics and consumers read those messages from topics.
- A topic is identified by its name, which must be unique within a Kafka cluster.
- There is no hard limit on the number of topics, although every topic and partition adds some overhead to the cluster.
- How long published data is kept in Kafka is configurable. By default, it’s kept for 1 week.
- Once data is published to a topic, it can’t be changed/updated. The data at a given offset (refer to partitions below to understand the offset concept) remains immutable.
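The last two points (retention and immutability) can be sketched with a toy in-memory log. This is only a model under simplifying assumptions, not Kafka code: `ToyLog`, `Record` and `expire` are hypothetical names, the log stands for a single partition, and real Kafka removes old data in whole log segments rather than record by record. The one-week figure corresponds to the `retention.ms` default of 604800000 ms.

```python
from dataclasses import dataclass, field

SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # 604_800_000 ms, the one-week default


@dataclass(frozen=True)  # frozen: a stored record cannot be modified afterwards
class Record:
    offset: int
    timestamp_ms: int
    value: str


@dataclass
class ToyLog:
    """Hypothetical append-only log with time-based retention."""
    retention_ms: int = SEVEN_DAYS_MS
    records: list = field(default_factory=list)
    next_offset: int = 0

    def append(self, value, now_ms):
        """Append-only write; returns the offset assigned to the new record."""
        record = Record(self.next_offset, now_ms, value)
        self.records.append(record)
        self.next_offset += 1
        return record.offset

    def expire(self, now_ms):
        """Drop records older than the retention window; offsets are never reused."""
        self.records = [
            r for r in self.records if now_ms - r.timestamp_ms <= self.retention_ms
        ]


log = ToyLog()
log.append("page_view:/home", now_ms=0)
log.append("search:kafka", now_ms=SEVEN_DAYS_MS)
log.expire(now_ms=SEVEN_DAYS_MS + 1)  # the first record is now older than one week
remaining_offsets = [r.offset for r in log.records]
```

Note that there is no update method at all: the only ways data changes are by appending new records or by old records ageing out.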
Topics are split into partitions, and the partitions are replicated across brokers in a Kafka cluster.
- In general, there is no guarantee as to which partition a published message will be written to.
- However, a key can be added to a message. If a producer publishes messages with a key, all messages with the same key end up in the same partition (as long as the number of partitions does not change). This is how Kafka provides a message sequencing guarantee.
- Without a key, the producer spreads messages across the partitions itself (round-robin or in batches, depending on the client version), so the target partition is not predictable.
- Messages within one partition are stored in sequence.
- Each message in a partition is assigned an incremental id, also called the offset.
- Offsets are meaningful only within a partition. They have no meaning across the partitions of a topic.
- In Kafka, data ordering/sequencing is guaranteed only within a single partition. Sequencing is NOT guaranteed across partitions in a topic.
- There is no hard limit on the number of partitions in a topic.
The following diagram illustrates the relationship between a topic and its partitions –
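The partitioning rules above can be sketched as a small in-memory model. This is not the Kafka client: `ToyTopic` is a hypothetical class, keyless messages are simply spread round-robin, and `zlib.crc32` stands in for the hash function (Kafka’s Java client hashes the serialized key with murmur2).

```python
import zlib
from itertools import count


class ToyTopic:
    """Hypothetical model of a topic as a list of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._round_robin = count()

    def choose_partition(self, key):
        if key is None:
            # No key: spread messages over the partitions in turn.
            return next(self._round_robin) % len(self.partitions)
        # Same key -> same hash -> same partition (while the partition count is fixed).
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, value, key=None):
        """Append the message and return (partition, offset within that partition)."""
        partition = self.choose_partition(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1


topic = ToyTopic(num_partitions=3)
placements = [topic.publish(f"click-{i}", key="user-42") for i in range(3)]
partitions_used = {partition for partition, _ in placements}  # a single partition
offsets = [offset for _, offset in placements]                # 0, 1, 2 in publish order
```

Because every message for `user-42` lands in one partition with increasing offsets, a consumer of that partition sees them in the order they were published. Nothing comparable holds across partitions.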
A Kafka server is also called a Kafka broker. A Kafka cluster consists of multiple brokers.
- Each broker has an integer identification number.
- Each broker contains some topic partitions.
- Each broker can contain multiple partitions of the same topic.
- A producer or consumer can connect to any broker and in turn gets connected to the entire cluster.
Topic Replication Factor
It’s always a wise decision to factor in topic replication while designing a Kafka system. This way, if a broker goes down, replicas of its partitions on other brokers can take over. Let’s take a look at the example below. Here, we have 3 brokers and 3 topics. Broker1 holds Partition 0 of Topic 1 and its replica is in Broker2, and so on. The replication factor here is 2, which means each partition has one additional copy besides the primary one.
A couple of notes –
- Replication happens at the partition level.
- At any time, only one broker is the leader for a given partition; the other brokers hold replicas of it. Replicas that are caught up with the leader are known as in-sync replicas (ISR).
- The replication factor cannot be greater than the number of available brokers.
- Under normal circumstances, the leader of a partition receives data from producers, and consumers read from the leader as well.
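The notes above can be illustrated with a simplified placement and failover sketch. The functions `assign_replicas` and `elect_leader` are hypothetical and deliberately naive: real Kafka also takes racks into account when placing replicas, and it elects leaders from the in-sync replica set rather than from any live broker.

```python
def assign_replicas(num_partitions, broker_ids, replication_factor):
    """Place each partition's replicas on consecutive brokers; the first one is the leader."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        partition: [
            broker_ids[(partition + i) % len(broker_ids)]
            for i in range(replication_factor)
        ]
        for partition in range(num_partitions)
    }


def elect_leader(replicas, live_brokers):
    """Return the first replica whose broker is still alive, or None if all are down."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


# 3 partitions of one topic, 3 brokers, replication factor 2.
assignment = assign_replicas(num_partitions=3, broker_ids=[1, 2, 3], replication_factor=2)

leader_before = elect_leader(assignment[0], live_brokers={1, 2, 3})
leader_after = elect_leader(assignment[0], live_brokers={2, 3})  # broker 1 has gone down
```

Partition 0 lives on brokers 1 and 2 here, so when broker 1 disappears the copy on broker 2 can take over as leader.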
Producers write data to topics.
- Producers need to specify the topic name and one broker to connect to; Kafka automatically takes care of sending the data to the right partition on the right broker, and of balancing the load among multiple partitions across multiple brokers.
- Producers can choose to receive acknowledgement of the data they write. The following acknowledgement settings are possible –
  - acks=0 [The producer does not wait for any acknowledgement: it writes the message to the topic and moves on. This is the fastest way of publishing a message, but a lost message goes unnoticed.]
  - acks=1 [The producer waits for the leader’s acknowledgement only. This guarantees that at least one broker has the message; however, there is no guarantee that the data has made it to the replicas.]
  - acks=all [The leader acknowledges only after all in-sync replicas have the message; this is the safest setting and has the highest latency of the three.]
- As mentioned earlier, if a producer sends a key along with the message, Kafka guarantees that messages with the same key will appear in the same partition.
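As a rough way to compare the three acknowledgement settings, the sketch below computes how many replicas must hold a record before the producer treats the send as complete. `replicas_required` is a hypothetical helper, not part of any Kafka client. The configuration keys `bootstrap.servers` and `acks` are real producer settings, while the broker address is a placeholder.

```python
def replicas_required(acks, in_sync_replicas):
    """Replicas that must hold a record before the send counts as complete (toy model)."""
    if acks == "0":
        return 0                 # fire and forget
    if acks == "1":
        return 1                 # the partition leader only
    if acks == "all":
        return in_sync_replicas  # the leader plus every in-sync follower
    raise ValueError(f"unknown acks setting: {acks!r}")


# Example producer configuration; the address is an assumption for illustration.
producer_config = {
    "bootstrap.servers": "localhost:9092",
    "acks": "all",
}

required = replicas_required(producer_config["acks"], in_sync_replicas=3)
```

The trade-off is the usual one: the more replicas that must confirm a write, the safer the data and the longer each send takes.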
As in any pub-sub model, consumers read data from topics.
- As explained earlier, topics are divided into multiple partitions, so consumers effectively read data from the partitions of a topic.
- However, these details are abstracted from the consumers. Consumers need to specify the topic name and one broker while connecting. Kafka ensures that a consumer connected to a single broker is connected to the entire cluster.
- Consumers read data from a partition in order.
- However, the order is not guaranteed across partitions. Kafka consumers read from multiple partitions in parallel.
Consumers are organized into consumer groups; a consumer group can have multiple consumer processes/instances running.
- Each consumer group has a unique group id.
- Within one consumer group, each partition is read by exactly one consumer instance.
- However, if there is more than one consumer group, one instance from each of these groups can read from the same partition.
- If the number of consumers exceeds the number of partitions, some consumers will be inactive. For example, if there are 6 partitions in total and 8 consumers in a single consumer group, 2 consumers will be inactive.
- Kafka stores, for each consumer group, the offset up to which it has read each partition; once a consumer has processed the data at an offset, it commits its position. Hence, if the consumer process is paused or killed, it can resume (when it comes back up) from the message right after the last one it processed.
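The consumer group rules and the offset bookkeeping can be sketched together. Both pieces are toy models with hypothetical names (`assign_partitions`, `ToyGroupOffsets`): Kafka’s real assignors and its internal offsets storage are more involved. The sketch follows Kafka’s convention of storing the offset of the next message to read, and it assumes a group with no committed offset starts from 0.

```python
def assign_partitions(partitions, consumers):
    """Spread the partitions over the consumers of ONE group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


class ToyGroupOffsets:
    """Hypothetical store of committed positions per (group id, partition)."""

    def __init__(self):
        self.committed = {}

    def commit(self, group_id, partition, last_processed_offset):
        # Store the position of the NEXT message to read.
        self.committed[(group_id, partition)] = last_processed_offset + 1

    def resume_position(self, group_id, partition):
        # Assumption: a group without a committed offset starts from the beginning.
        return self.committed.get((group_id, partition), 0)


# 6 partitions and 8 consumers in one group: two consumers get nothing to read.
assignment = assign_partitions(list(range(6)), [f"c{i}" for i in range(8)])
idle_consumers = [consumer for consumer, owned in assignment.items() if not owned]

offsets = ToyGroupOffsets()
offsets.commit("analytics", partition=0, last_processed_offset=41)
resume_at = offsets.resume_position("analytics", partition=0)
other_group_resume_at = offsets.resume_position("billing", partition=0)
```

Because positions are tracked per group, the `billing` group in this example is unaffected by what `analytics` has already read from the same partition.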
Zookeeper has traditionally been a required component of the Kafka ecosystem (clusters running in KRaft mode do not use it). In a Zookeeper-based cluster:
- Zookeeper helps in managing Kafka brokers.
- It also helps in leader election for partitions.
- It helps in maintaining the cluster membership. For example, when a broker is added or removed, when a topic is added or deleted, or when a broker goes down or comes up, Zookeeper tracks the change and informs Kafka.
- It also manages topic configuration, such as the number of partitions a topic has and the leader of each partition.
- It stores ACLs, i.e. which consumer is allowed to read and which producer is allowed to write.
- Zookeeper is used for quotas too, i.e. limiting a client to a predefined data consumption/publishing quota.