
Kafka producer internals

In this section, we will walk through the different Kafka producer components and, at a high level, cover how messages get transferred from a Kafka producer application to Kafka topics. When writing producer applications, you generally use the Producer APIs, which expose methods at a fairly abstract level. These APIs perform a number of steps before sending any data, so it is important to understand these internal steps in order to gain complete knowledge of Kafka producers. We will cover them in this section. First, we need to understand the responsibilities of Kafka producers apart from publishing messages. Let's look at them one by one:

  • Bootstrapping Kafka broker URLs: The producer connects to at least one broker to fetch metadata about the Kafka cluster. It may happen that the first broker the producer tries to connect to is down. To ensure failover, the producer implementation takes a list of more than one broker URL to bootstrap from, and iterates through that list until it finds a broker it can connect to for fetching cluster metadata.
  • Data serialization: Kafka uses a binary protocol to send and receive data over TCP. This means that when writing data to Kafka, producers need to send an ordered byte sequence to the defined Kafka broker's network port. Subsequently, they read the response byte sequence from the Kafka broker in the same ordered fashion. A Kafka producer serializes every message data object into a byte array before sending any record to the respective broker over the wire. Similarly, it converts any byte sequence received from the broker as a response back into a message object.
  • Determining topic partition: It is the responsibility of the Kafka producer to determine to which topic partition the data needs to be sent. If the partition is specified by the caller program, the Producer APIs do not determine the topic partition and send the data directly to it. However, if no partition is specified, the producer chooses a partition for the message, generally based on the key of the message data object. You can also write a custom partitioner in case you want data to be partitioned as per specific business logic for your enterprise.
  • Determining the leader of the partition: Producers send data directly to the leader of the partition. It is the producer's responsibility to determine the leader of the partition to which it will write messages. To do so, producers ask for metadata from any of the Kafka brokers, which answer with metadata about the active servers and the leaders of the topic's partitions at that point in time.
  • Failure handling/retry capability: Handling failure responses and the number of retries needs to be controlled through the producer application. You can configure the number of retries through Producer API configuration, and this has to be decided as per your enterprise standards. Exception handling should be done through the producer application component; depending on the type of exception, you can determine different data flows.
  • Batching: For efficient message transfers, batching is a very useful mechanism. Through Producer API configurations, you can control whether you use the producer in asynchronous mode and how records are batched. Batching ensures reduced I/O and optimum utilization of producer memory. While deciding on the number of messages in a batch, you have to keep end-to-end latency in mind: it increases with the number of messages in a batch. The configuration sketch after this list shows how these settings fit together.
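
To make these points concrete, here is a minimal configuration sketch using the Java Producer API. The broker addresses and tuning values are illustrative assumptions, not recommendations:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;

    public class ProducerConfigSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // More than one broker URL, so bootstrapping can fail over
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");
            // Serializers that turn message keys and values into byte arrays
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Bounded number of retries for failed sends
            props.put("retries", 3);
            // Batching: up to 16 KB per partition batch, waiting at most 5 ms
            props.put("batch.size", 16384);
            props.put("linger.ms", 5);

            Producer<String, String> producer = new KafkaProducer<>(props);
            producer.close();
        }
    }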

Hopefully, the preceding paragraphs have given you an idea about the prime responsibilities of Kafka producers. Now, we will discuss Kafka producer data flows. This will give you a clear understanding about the steps involved in producing Kafka messages.

The internal implementation or sequence of steps in Producer APIs may differ for the respective programming languages. Some of the steps can be done in parallel using threads or callbacks.

The following image shows the high-level steps involved in producing messages to the Kafka cluster:

Kafka producer high-level flow

Publishing messages to a Kafka topic starts with calling the Producer APIs with appropriate details such as the message in string format, the topic, partitions (optional), and other configuration details such as broker URLs. The Producer API uses the passed-on information to form a data object in the form of a nested key-value pair. Once the data object is formed, the producer serializes it into a byte array. You can either use an inbuilt serializer or develop your own custom serializer. Avro is one of the commonly used data serializers.

Serialization ensures compliance with the Kafka binary protocol and efficient network transfer.
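
As an example of a custom serializer, the following Java sketch implements Kafka's Serializer interface for a hypothetical Order object; the class, fields, and text encoding here are invented for illustration, and a schema-based format such as Avro would be the more typical production choice:

    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import org.apache.kafka.common.serialization.Serializer;

    // Hypothetical domain object used only for this illustration
    class Order {
        final String id;
        final double amount;
        Order(String id, double amount) { this.id = id; this.amount = amount; }
    }

    public class OrderSerializer implements Serializer<Order> {
        @Override
        public void configure(Map<String, ?> configs, boolean isKey) {
            // No configuration needed for this sketch
        }

        @Override
        public byte[] serialize(String topic, Order order) {
            if (order == null) {
                return null;
            }
            // Naive text encoding, purely for illustration
            String encoded = order.id + "," + order.amount;
            return encoded.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public void close() {
            // Nothing to release
        }
    }

The fully qualified class name would then be supplied as the value.serializer property in place of the built-in StringSerializer.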

Next, the partition to which the data needs to be sent is determined. If partition information is passed in the API calls, the producer uses that partition directly. However, if partition information is not passed, the producer determines the partition to which the data should be sent, generally based on the keys defined in the data objects. Once the record partition is decided, the producer determines which broker to connect to in order to send the message. This is generally done using the cluster metadata fetched during the bootstrap process: based on that metadata, the producer determines the leader broker for the partition.
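
To illustrate the custom partitioning mentioned earlier, here is a Java sketch of a partitioner implementing a made-up business rule (routing a hypothetical "priority" key to a dedicated partition); such a class would be registered through the partitioner.class producer property:

    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.utils.Utils;

    public class PriorityPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null || numPartitions == 1) {
                return 0;
            }
            // Hypothetical business rule: "priority" keys go to partition 0
            if ("priority".equals(key)) {
                return 0;
            }
            // Hash all other keys across the remaining partitions
            return 1 + Utils.toPositive(Utils.murmur2(keyBytes)) % (numPartitions - 1);
        }

        @Override
        public void configure(Map<String, ?> configs) {
            // No configuration needed for this sketch
        }

        @Override
        public void close() {
            // Nothing to release
        }
    }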

Producers also need to determine the API versions supported by a Kafka broker. This is accomplished using the API versions exposed by the Kafka cluster: a producer may support several versions of the producer APIs, and while communicating with a given leader broker, it should use the highest API version supported by both the producer and the broker.

Producers send the API version they use in their write requests. Brokers can reject a write request if a compatible API version is not reflected in it. This kind of setup ensures incremental API evolution while supporting older versions of the APIs.
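
If you want to inspect this negotiation yourself, Kafka ships a command-line tool that prints the API versions each broker supports; the broker address below is a placeholder:

    bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092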

Once the serialized data object is sent to the selected broker, the producer receives a response from that broker. If the response contains metadata about the respective partition along with the new message offset, it is considered successful. However, if error codes are received in the response, the producer can either throw an exception or retry, as per the received configuration.
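
As a sketch of this response handling with the Java client, the following asynchronous send inspects the returned metadata or exception in a callback; the topic name is illustrative, and the producer instance is the one configured in the earlier sketch:

    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    ProducerRecord<String, String> record =
            new ProducerRecord<>("demo-topic", "key-1", "hello kafka");

    producer.send(record, new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                // An error came back after the client's own retries were
                // exhausted; application-level handling goes here
                exception.printStackTrace();
            } else {
                // A successful response carries the partition and new offset
                System.out.printf("Written to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
            }
        }
    });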

As we move further into the chapter, we will dive deeper into the technical side of the Kafka Producer APIs and write producers using Java and Scala programs.
