This article is a beginner’s guide to the basic architecture, components, and concepts of Apache Kafka. We will look at what Kafka is, what it is used for, and which core APIs and components make up the Kafka ecosystem.
What is Kafka?
Kafka is a distributed messaging system (distributed meaning it runs on a collection of separate servers/data stores in a connected network) based on the pub-sub (publish-subscribe) model. It lets us publish and subscribe to streams of records, and it can also store those streams. It is fast, highly scalable, and fault-tolerant. Kafka can be used like a traditional messaging system, but at a much larger scale.
Use Cases of Kafka
Kafka has many use cases. Some of the most popular ones are listed below –
- Kafka can be used as a robust publish-subscribe system.
- Kafka can work as a message broker solution as well.
- One use case of Kafka is user interaction/website activity tracking. Which pages a user visits, what they search for, which links they click, and so on are published as events for back-end analytics and user interaction pattern discovery. Each type of event generally has its own topic.
- One of the most compelling use cases of Kafka is application log collection and metrics. In distributed environments, there is a huge number of servers and each generates log files. It’s nearly impossible to go through each server to check the logs. One solution is to design a centralized logging system. Kafka can receive the data/events from individual servers and applications, process them with the Streams API, and publish them to the appropriate topics.
- Kafka is an excellent choice for building a processing pipeline in which data streams go through different stages of processing, transformation, etc. One example is a recommendation system in an e-commerce website or mobile application. User activities, clicks, etc. can be tracked and published to a topic like “useractivities”; the events in this topic may then be de-duplicated and enriched, used to build a user interest graph, and finally used to recommend products of interest to the user.
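The de-duplication, enrichment, and interest-graph stages of such a pipeline can be pictured with a small sketch. This is plain Python with no Kafka client involved; the event fields (`event_id`, `user`, `product`), the `CATALOG` lookup table, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical product catalog used for enrichment (not part of Kafka).
CATALOG = {"p1": "books", "p2": "games"}


def deduplicate(events):
    """Yield each event once, identified by its event_id."""
    seen = set()
    for event in events:
        if event["event_id"] not in seen:
            seen.add(event["event_id"])
            yield event


def enrich(events):
    """Attach the product category to every event."""
    for event in events:
        yield {**event, "category": CATALOG.get(event["product"], "unknown")}


def interest_graph(events):
    """Count how often each user interacted with each category."""
    interests = defaultdict(lambda: defaultdict(int))
    for event in events:
        interests[event["user"]][event["category"]] += 1
    return {user: dict(categories) for user, categories in interests.items()}


# Events as they might arrive on a topic such as "useractivities".
raw_events = [
    {"event_id": 1, "user": "alice", "product": "p1"},
    {"event_id": 1, "user": "alice", "product": "p1"},  # duplicate delivery
    {"event_id": 2, "user": "alice", "product": "p2"},
    {"event_id": 3, "user": "bob", "product": "p1"},
]

graph = interest_graph(enrich(deduplicate(raw_events)))
print(graph)  # {'alice': {'books': 1, 'games': 1}, 'bob': {'books': 1}}
```

In a real deployment each stage would typically read from one topic and write to the next, so the stages can be scaled and restarted independently.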
Apache Kafka Basic Architecture, Components, and concepts
As illustrated in the above diagram, Kafka has the following core APIs and supporting components –
- Producer API – This API enables the source or sender system to send data to the topics in Kafka cluster(s).
- Consumer API – This API enables the receiving or consuming application to consume the data from Kafka cluster(s).
- Streams API – This API enables the transformation of incoming data; transformation may be simple mapping, filtering, aggregation, etc.
- Kafka Connect Source API – When many similar source systems each need to feed data into Kafka, every team would otherwise write much the same integration code. This API lets developers reuse and configure connectors that pull data from a source system into Kafka instead of writing that repetitive code.
- Kafka Connect Sink API – Similarly, this API sinks the data from Kafka into the target system.
- AdminClient API – This API provides the capability to manage various Kafka objects.
- MirrorMaker – Kafka’s mirroring tool copies data between Kafka clusters. A common use case for this feature is creating a backup/replica of the dataset in another data center.
- Zookeeper – Kafka has traditionally required Zookeeper to coordinate and manage its brokers and operations. Newer Kafka releases can instead run in KRaft mode, which removes this dependency.
Kafka Fundamental Concepts
In this section, we’ll go over some fundamental concepts of Kafka. It’s imperative to have a clear understanding of these concepts. The concepts will be useful when we start working on tutorials.
Topics
In a pub-sub model, a “topic” is a logical channel to which producers publish messages and from which consumers receive messages; the same applies to Kafka.
- In Kafka, a topic defines the stream of a particular type/classification of data.
- Messages are structured or organized into topics. In other words, a particular type of message is published on a particular topic.
- A producer writes its messages to the topics and consumers read those messages from topics.
- A topic is identified by its name, which must be unique within a Kafka cluster.
- Kafka imposes no fixed limit on the number of topics, although a very large number of topics and partitions adds overhead to the cluster.
- How long published data is kept in Kafka is configurable. By default, it is retained for one week.
- Once data is published to a topic, it can’t be changed or updated. The record stored at a given offset (see the discussion of partitions below for the offset concept) is immutable.
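The append-only and retention behaviour described in the points above can be modelled with a toy log. This is a minimal sketch in plain Python, not Kafka’s storage engine; the `TopicLog` class and its methods are hypothetical names, and the one-week constant simply mirrors the default mentioned above.

```python
import time

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60  # mirrors the one-week default retention


class TopicLog:
    """Toy append-only log: records can be added and expired, never edited."""

    def __init__(self, retention_seconds=ONE_WEEK_SECONDS):
        self.retention_seconds = retention_seconds
        self._records = []  # list of (timestamp, value) tuples

    def append(self, value, timestamp=None):
        """Add a record at the end of the log."""
        ts = time.time() if timestamp is None else timestamp
        self._records.append((ts, value))

    def expire(self, now=None):
        """Drop records older than the retention period."""
        now = time.time() if now is None else now
        cutoff = now - self.retention_seconds
        self._records = [(ts, v) for ts, v in self._records if ts >= cutoff]

    def read(self):
        """Return the values that are still retained, oldest first."""
        return [value for _, value in self._records]


log = TopicLog()
now = 1_000_000_000  # fixed "current time" so the example is repeatable
log.append("old event", timestamp=now - 8 * 24 * 60 * 60)  # 8 days old
log.append("recent event", timestamp=now - 60)             # 1 minute old
log.expire(now=now)
print(log.read())  # ['recent event']
```

Note that the class offers no way to update a stored record; the only operations are appending new records and expiring old ones.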
Partitions
Topics are split into partitions, and these partitions are replicated across brokers in a Kafka cluster.
- In general, there is no guarantee about which partition a published message will be written to.
- However, a key can be added to a message. If a producer publishes messages with a key, all messages with the same key end up in the same partition (as long as the number of partitions does not change). This is how Kafka provides a message sequencing guarantee.
- Without a key, the producer spreads messages across the partitions itself, so the target partition of an individual message is not predictable.
- Messages are stored in a sequenced fashion in one partition.
- Each message in a partition is assigned an incremental id, called the offset.
- An offset is meaningful only within its partition. It has no meaning across the partitions of a topic.
- In Kafka, data ordering/sequencing is guaranteed only within a single partition. Sequencing is NOT guaranteed across partitions in a topic.
- Kafka imposes no fixed limit on the number of partitions in a topic, though every partition adds some overhead on the brokers.
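To make keys, partitions, and offsets concrete, here is a minimal sketch of a topic that routes messages to partitions. It is a toy model under stated assumptions: the `Topic` class is hypothetical, `zlib.crc32` is only a stand-in for the key hash that real Kafka clients use, and keyless messages are simply distributed round-robin.

```python
import zlib


class Topic:
    """Toy topic: a list of partitions, each an ordered list of messages."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = 0  # round-robin cursor for messages without a key

    def publish(self, value, key=None):
        """Append a message and return its (partition, offset)."""
        if key is None:
            partition = self._next % len(self.partitions)
            self._next += 1
        else:
            # Same key -> same hash -> same partition (stand-in hash function).
            partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1  # offsets start at 0
        return partition, offset


topic = Topic("useractivities", num_partitions=3)
first = topic.publish("login", key="user-42")
second = topic.publish("click", key="user-42")
third = topic.publish("logout", key="user-42")

# All three land in one partition, with offsets 0, 1, 2 in publish order.
print(first, second, third)
```

Because the partition is derived from the key modulo the partition count, changing the number of partitions would change where a given key lands, which is why ordering per key only holds while the partition count stays fixed.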
The following diagram illustrates the relationship between topic and partitions –
Brokers
A Kafka server is also called a Kafka broker. A Kafka cluster consists of one or more brokers.
- Each broker has an integer identification number.
- Each broker contains some topic partitions.
- Each broker can contain multiple partitions of the same topic.
- A producer or consumer can connect to any broker and, through it, gets connected to the entire cluster.
Topic Replication Factor
It’s wise to factor in topic replication when designing a Kafka system. That way, if a broker goes down, replicas of its partitions on other brokers can take over. Let’s take a look at the example below. Here, we have 3 brokers and 3 topics. Broker1 holds Partition 0 of Topic 1, its replica is on Broker2, and so on. The replication factor is 2, which means each partition has one additional copy besides the primary one.
A couple of notes –
- Replication happens at the partition level.
- At any time, only one broker is the leader for a given partition; the other brokers holding that partition are followers, and the followers that are caught up with the leader are known as in-sync replicas (ISR).
- The replication factor can’t exceed the number of available brokers.
- Under normal circumstances, the leader of a partition receives data from producers, and consumers read from that leader as well.
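The replication notes above can be illustrated with a small sketch that places replicas on brokers and picks a new leader when a broker fails. This is a simplification, not Kafka’s actual placement or election algorithm: the functions `assign_replicas` and `elect_leader` are hypothetical, placement is plain round-robin, and the first listed replica is treated as the leader.

```python
def assign_replicas(num_partitions, broker_ids, replication_factor):
    """Return {partition: [leader, follower, ...]} using round-robin placement."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        assignment[partition] = [
            broker_ids[(partition + i) % len(broker_ids)]
            for i in range(replication_factor)
        ]
    return assignment


def elect_leader(replicas, live_brokers):
    """Pick the first replica whose broker is still alive, or None if none is."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    return None


assignment = assign_replicas(num_partitions=3, broker_ids=[1, 2, 3], replication_factor=2)
print(assignment)  # {0: [1, 2], 1: [2, 3], 2: [3, 1]}

# Broker 1 goes down: partition 0 fails over to its replica on broker 2.
print(elect_leader(assignment[0], live_brokers={2, 3}))  # 2
```

With a replication factor of 2, every partition survives the loss of any single broker, which matches the example in the text.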
Producers
- Producers write data to topics.
- Producers need to specify the topic name and at least one broker to connect to; Kafka automatically takes care of routing the data to the right partition on the right broker and of balancing the load among partitions across brokers.
- A producer can choose to receive an acknowledgment for the data it writes. The following acknowledgment settings are available –
- acks=0 [The producer does not wait for any acknowledgment. It sends the message and moves on. This is the fastest way of publishing, but a lost message goes unnoticed.]
- acks=1 [The producer waits only for the leader’s acknowledgment. This guarantees that at least one broker has received the message; however, there is no guarantee that the data has made it to the replicas.]
- acks=all [The leader and all in-sync replicas must acknowledge the message. This is the slowest of the three settings, but it gives the strongest durability guarantee.]
- As mentioned earlier, if a producer sends a key along with the message, Kafka guarantees that messages with the same key will appear in the same partition.
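The three acknowledgment settings differ only in how many brokers must confirm a write before the producer treats it as sent. The sketch below captures that rule in plain Python; `required_acks` is a hypothetical helper, not part of any Kafka client, and `in_sync_replicas` is assumed to count the leader plus the followers that are currently in sync.

```python
def required_acks(acks, in_sync_replicas):
    """Number of broker acknowledgments the producer waits for.

    in_sync_replicas counts the leader plus the followers currently in sync.
    """
    if acks == "0":
        return 0  # fire and forget
    if acks == "1":
        return 1  # leader only
    if acks == "all":
        return in_sync_replicas  # leader and every in-sync follower
    raise ValueError(f"unsupported acks value: {acks!r}")


for setting in ("0", "1", "all"):
    print(setting, required_acks(setting, in_sync_replicas=3))
# 0 0
# 1 1
# all 3
```

The more acknowledgments a producer waits for, the longer each write takes, which is the latency versus durability trade-off described above.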
Consumers
- As in any pub-sub model, consumers read data from topics.
- As explained earlier, topics are divided into multiple partitions, so consumers effectively read data from the partitions of a topic.
- However, these details are abstracted from the consumers. A consumer only needs to specify the topic name and one broker when connecting; once it connects to a single broker, Kafka ensures it is connected to the entire cluster.
- Consumers read data from a partition in order.
- However, the order is not guaranteed across partitions. Kafka consumers read from across partitions in parallel.
Consumer Groups
- Consumers are organized into consumer groups; a consumer group can have multiple consumer processes/instances running.
- Each consumer group has a unique group id.
- Within a consumer group, each partition is read by exactly one consumer instance.
- However, if there is more than one consumer group, one instance from each of these groups can read from the same partition.
- If the number of consumers in a group exceeds the number of partitions, some consumers will be idle. For example, if there are 6 partitions in total and 8 consumers in a single consumer group, 2 consumers will be idle.
- Kafka stores, per consumer group, the position up to which each partition has been read; as a consumer processes data, it commits its position. Hence, if the consumer process is paused or killed, it can resume (when it comes back up) from the position it last committed.
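The consumer group rules above can be sketched in two small pieces: assigning partitions to the consumers of one group, and resuming from a committed position after a restart. This is a toy model, not Kafka’s group coordinator; `assign_partitions` and `GroupOffsets` are hypothetical names, assignment is plain round-robin, and the committed value is assumed to be the next offset to read.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions over consumers; returns {consumer: [partitions]}."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


class GroupOffsets:
    """Tracks, per partition, the next offset a consumer group should read."""

    def __init__(self):
        self._committed = {}

    def commit(self, partition, next_offset):
        self._committed[partition] = next_offset

    def resume_position(self, partition):
        # A group that never committed starts at the beginning in this sketch.
        return self._committed.get(partition, 0)


# 6 partitions, 8 consumers in one group: two consumers receive nothing.
partitions = list(range(6))
consumers = [f"consumer-{n}" for n in range(8)]
assignment = assign_partitions(partitions, consumers)
idle = [c for c, owned in assignment.items() if not owned]
print(idle)  # ['consumer-6', 'consumer-7']

# A consumer processes two messages from partition 0, commits, then crashes.
partition_log = ["m0", "m1", "m2", "m3"]
offsets = GroupOffsets()
for offset in range(2):
    offsets.commit(0, offset + 1)  # position = last processed offset + 1

# After the restart it continues with the messages it has not processed yet.
remaining = partition_log[offsets.resume_position(0):]
print(remaining)  # ['m2', 'm3']
```

Each partition appears in exactly one consumer’s list, which is the "one partition, one consumer per group" rule; a second group would get its own independent assignment and its own committed positions.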
Zookeeper
- Zookeeper is a required component in Zookeeper-based Kafka deployments (newer Kafka releases can run without it in KRaft mode).
- Zookeeper helps in managing Kafka brokers.
- It also helps in the leader election of partitions.
- It helps in maintaining the cluster membership. For example, when a broker is added or removed, a topic is added or deleted, or a broker goes down or comes up, Zookeeper tracks the change and notifies Kafka.
- It also manages topic configurations, such as the number of partitions a topic has and the leader of each partition.
- It stores access control lists (ACLs), i.e., which consumer is allowed to read and which producer is allowed to write.
- Zookeeper is used for quotas too, i.e., limiting a client to a predefined amount of data consumption/publishing.