
Message partitions

Suppose that we have a purchase table and we want to read records for items that belong to a certain category, say, electronics. In the normal course of events, we would simply filter out the other records, but what if we partitioned the table in such a way that we could read the records of our choice quickly?

This is exactly what happens when topics are broken into partitions, which are Kafka's unit of parallelism. In general, the greater the number of partitions, the higher the potential throughput. This does not mean that we should choose a huge number of partitions; we will talk about the pros and cons of increasing the number of partitions further on.

While creating a topic, you can always specify the number of partitions that you require for it. Each message is appended to a partition and assigned a sequential number called an offset. Kafka makes sure that messages with the same key always go to the same partition: it calculates a hash of the message key and uses that hash to choose the partition to which the message is appended. Time ordering of messages is not guaranteed across a topic, but within a partition it is always guaranteed: a message that arrives later is always appended to the end of the partition.
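The keyed-partitioning behavior described above can be sketched in a few lines. This is a minimal simulation, not the real client: Kafka's default partitioner hashes the key bytes with murmur2, whereas the sketch below uses `md5` purely so the mapping is deterministic across Python runs.

```python
# Sketch of Kafka's keyed partitioning and per-partition offsets.
# md5 stands in for Kafka's murmur2 hash; the behavior illustrated
# (same key -> same partition, offsets grow per partition) is the same.
import hashlib

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}  # partition -> message log

def partition_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

def append(key: str, value: str) -> tuple[int, int]:
    p = partition_for(key)
    partitions[p].append((key, value))
    offset = len(partitions[p]) - 1  # offsets are monotonic within a partition
    return p, offset

p1, o1 = append("electronics", "tv")
p2, o2 = append("electronics", "laptop")

assert p1 == p2      # same key always lands in the same partition
assert o2 == o1 + 1  # the later message gets the next offset
```

Note that ordering is only per partition: a message appended to partition 0 says nothing about its order relative to messages in partition 1.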

Partitions are fault-tolerant; they are replicated across the Kafka brokers. Each partition has a leader that serves messages to consumers that want to read from the partition. If the leader fails, a new leader is elected and continues to serve messages to the consumers. This helps in achieving high throughput and low latency.

Let's understand the pros and cons of a large number of partitions:

  • High throughput: Partitions are the way to achieve parallelism in Kafka. Write operations on different partitions happen in parallel, so time-consuming operations run concurrently and make the most of the hardware resources. On the consumer side, one partition is assigned to at most one consumer within a consumer group. Consumers in different groups can read from the same partition, but two consumers from the same consumer group cannot.

So, the degree of parallelism within a single consumer group depends on the number of partitions it is reading from; a larger number of partitions results in higher throughput.
Choosing the number of partitions depends on how much throughput you want to achieve; we will talk about it in detail later. Throughput on the producer side also depends on many other factors, such as batch size, compression type, replication factor, acknowledgment type, and some other configurations, which we will see in detail in Chapter 3, Deep Dive into Kafka Producers.
However, we should be very careful about changing the number of partitions: the mapping of messages to partitions depends entirely on the hash generated from the message key, which guarantees that messages with the same key are always written to the same partition and are therefore delivered to the consumer in the order in which they were stored. If we change the number of partitions, the distribution of messages changes, and this ordering is no longer guaranteed for consumers that relied on the previous mapping. Throughput for the producer and the consumer can be increased or decreased through different configurations, which we will discuss in detail in upcoming chapters.
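The ordering hazard of resizing can be made concrete. The sketch below (again using `md5` as a deterministic stand-in for Kafka's key hash, with illustrative key names) shows that once the partition count changes, most keys map to a different partition than before:

```python
# Why changing the partition count reshuffles keys: the partition is
# hash(key) % num_partitions, so the modulus change moves most keys.
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

keys = [f"item-{i}" for i in range(100)]
before = {k: partition_for(k, 4) for k in keys}  # topic with 4 partitions
after = {k: partition_for(k, 6) for k in keys}   # resized to 6 partitions

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Messages for a moved key written after the resize land in a new partition, while older messages for that key sit in the old one, so a consumer can no longer rely on reading them in write order.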

  • Increased producer memory: You may wonder how increasing the number of partitions forces us to increase producer memory. The producer performs some internal buffering before flushing data to the broker to be stored in a partition: it buffers incoming messages per partition, and once the configured size bound or time limit is reached, it sends the buffered messages to the broker and removes them from the buffer.

If we increase the number of partitions, the memory allocated for buffering may be exhausted within a very short interval, and the producer will block further sends until the buffered data is flushed to the broker. This may result in lower throughput. To overcome this, we need to configure more buffer memory on the producer side.
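The per-partition buffering described above can be sketched with a toy producer. This is a simplified model, not the Kafka client: the batch bound of three messages is an arbitrary assumption standing in for a size-based limit, and the time-based flush is omitted for brevity.

```python
# Toy model of per-partition producer buffering: each partition gets its
# own batch, flushed when it reaches an upper bound. More partitions mean
# more concurrent batches, hence more producer memory.
from collections import defaultdict

BATCH_SIZE = 3  # messages per batch before a flush (assumption for the sketch)

class BufferingProducer:
    def __init__(self):
        self.buffers = defaultdict(list)  # partition -> pending messages
        self.flushed = []                 # (partition, batch) pairs "sent" to broker

    def send(self, partition: int, message: str) -> None:
        buf = self.buffers[partition]
        buf.append(message)
        if len(buf) >= BATCH_SIZE:        # upper bound reached: flush this batch
            self.flushed.append((partition, list(buf)))
            buf.clear()

producer = BufferingProducer()
for i in range(7):
    producer.send(i % 2, f"msg-{i}")  # interleave messages across two partitions

# Partition 0 received msg-0, msg-2, msg-4, msg-6: one full batch flushed,
# one message still buffered. Partition 1 flushed its full batch of three.
```

With two partitions, at most two batches sit in memory at once; with two thousand partitions, up to two thousand batches can be pending, which is why more partitions demand more producer memory.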

  • High availability issues: Kafka is known as a highly available, high-throughput, distributed messaging system. Brokers in Kafka store thousands of partitions of different topics. Reading from and writing to a partition happens through that partition's leader. Generally, if a leader fails, electing a new leader takes only a few milliseconds. Failure detection is done by the controller, which is simply one of the brokers. The new leader then serves requests from producers and consumers; before serving requests, it reads the partition's metadata from ZooKeeper. For normal, expected failures, this window is very small and takes only a few milliseconds. In the case of an unexpected failure, such as killing a broker unintentionally, the delay may be a few seconds, depending on the number of partitions. The general formula is:

Delay time = (number of partitions / replication factor) * (time to read metadata for a single partition)

The other possibility is that the failed broker is the controller. Controller replacement time depends on the number of partitions: the new controller reads the metadata of every partition, so the time to start the new controller grows with the partition count.
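The delay formula above is easy to evaluate with concrete numbers. The figures below are illustrative assumptions, not measured values, but they show how a large partition count turns a "few milliseconds" failover into a multi-second outage:

```python
# Worked example of the leader-election delay estimate from the text.
# All numbers are illustrative assumptions.
num_partitions = 10_000   # partitions hosted on the failed broker
replication_factor = 2
metadata_read_ms = 2      # assumed time to read metadata for one partition

delay_ms = (num_partitions / replication_factor) * metadata_read_ms
print(f"Estimated failover delay: {delay_ms / 1000:.0f} s")
```

Under these assumptions the failover window is ten seconds, during which the affected partitions are unavailable for reads and writes.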

Figure: Kafka partitions (Ref: https://kafka.apache.org/documentation/)

We should take care while choosing the number of partitions; in upcoming chapters, we will talk about this in more detail, and about how to make the best use of Kafka's capabilities.
