Troubleshooting Common Kafka Conundrums


This is the third blog in our series on Kafka, where we continue to explore the nuances of deploying Kafka for scale. In our previous blogs, Essential Metrics for Kafka Performance Monitoring and Auto-Instrumenting OpenTelemetry for Kafka, we laid the foundation for understanding Kafka’s performance and monitoring aspects. Now, as we explore further into the Kafka ecosystem, we’re here to tackle the common challenges that can arise during deployment and scaling.

Kafka has become the backbone of many data-intensive applications, enabling real-time data processing and event-driven architectures. However, with great power comes the potential for complex challenges. Whether it’s ensuring data replication, managing high CPU loads, or dealing with intricate data transformations, Kafka can present a unique set of conundrums that demand expert troubleshooting.

Our goal in this blog is to equip you with the knowledge and strategies needed to tackle these issues head-on, building upon the foundation we’ve established in our previous Kafka-related discussions. So, let’s dive into the world of Kafka troubleshooting and navigate the common conundrums that may arise on your journey toward scalable data streaming.

Un-replicated and Under-replicated Topics

Un-replicated topics are topics that have a replication factor of one. This means that there is only one copy of the data for each partition in the topic. If the broker hosting a partition fails, that partition becomes unavailable, and if the failure is permanent, its data is lost.

Under-replicated topics are topics that have a replication factor greater than one but do not currently have enough in-sync replicas to meet that replication factor. For example, if a topic has a replication factor of three but one of the replicas is down, the topic is considered under-replicated.

Un-replicated and under-replicated topics represent potential data vulnerabilities that can lead to substantial losses if not promptly addressed. By utilizing observability, you can gain visibility into the state of your Kafka topics, ensuring that even subtle replication issues are detected early on. Lumigo offers a holistic view of your Kafka environments, helping you proactively pinpoint under-replicated topics, assess their impact, and take swift corrective action. The ability to set up automated alerts and explore enriched trace data allows you to maintain system and data integrity, ultimately bolstering the reliability and resilience of your Kafka deployment.

How to fix un-replicated and under-replicated topics

There are a few different ways to fix un-replicated and under-replicated topics in Kafka:

  • Increase the replication factor for the topic. Note that kafka-topics.sh cannot change the replication factor of an existing topic; use it with --describe to identify affected partitions, then apply an expanded replica assignment with the kafka-reassign-partitions.sh tool.
  • Add new brokers to the cluster. This will increase the number of replicas that are available for each topic.
  • Restart the failed brokers. If a broker hosting a replica is down, you can try restarting it so the replica can rejoin the in-sync replica set and catch up. This may not always be possible, for example if the broker’s host or disk has failed.

Because un-replicated and under-replicated topics can lead to data loss, monitor for them and fix them as soon as possible. You can use the kafka-reassign-partitions.sh tool to manually reassign partitions to different brokers; this is useful both for fixing under-replicated topics and for balancing load across the brokers in the cluster.
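As a rough sketch of that workflow (the topic name, broker IDs, and bootstrap address below are placeholders, and older Kafka versions use --zookeeper instead of --bootstrap-server):

# Find partitions whose in-sync replica count is below the replication factor
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# increase-replication.json – give each partition an expanded replica list, e.g.
# {"version":1,"partitions":[
#   {"topic":"orders","partition":0,"replicas":[1,2,3]},
#   {"topic":"orders","partition":1,"replicas":[2,3,1]}
# ]}

# Apply the new assignment, then confirm it has completed
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file increase-replication.json --execute
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file increase-replication.json --verify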

Kafka Liveness Checks

Kafka liveness check problems can occur when the monitoring tool is unable to reach the Kafka broker or when the broker is not responding to requests. This can be caused by a number of factors, including:

  • Network problems
  • Broker crashes
  • Broker restarts
  • Broker configuration problems

A failing liveness check can cause an orchestrator such as Kubernetes to restart the broker, which can disrupt service and, for un-replicated topics, cause data loss.

To configure a monitoring tool to look for Kafka liveness check problems, you will need to specify the following:

  1. The IP address or hostname of the Kafka broker
  2. The port that the Kafka broker is listening on
  3. The type of liveness check to perform

The most common type of liveness check is a TCP port check. This involves the monitoring tool attempting to open a TCP connection to the Kafka broker on the specified port. If the connection succeeds, the liveness check passes; if the broker does not respond, the check fails.
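For example, a minimal port check from a shell (the hostname, port, and timeout below are placeholders) might look like this:

# Exit 0 if the broker accepts TCP connections on its listener port, non-zero otherwise
nc -z -w 5 kafka-broker-1.internal 9092 || echo "Kafka liveness check failed"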

Another type of liveness check is an HTTP GET request. This applies when an HTTP endpoint is exposed alongside the broker, such as a REST proxy or a metrics exporter, since the broker itself does not serve HTTP by default. The monitoring tool sends an HTTP GET request to a specific URL; if the endpoint responds successfully, the liveness check passes, otherwise it fails.

The script shown below creates a Kafka consumer and subscribes to a topic called liveness-check-topic. It then sets a timeout: if the consumer has not received a message within the last 5 seconds, the script prints an error message and exits with a non-zero status. To use this script, you will need to install the kafka-node package (npm install kafka-node).

// Liveness check: consume from liveness-check-topic and report an error
// if no message arrives within 5 seconds.
const kafka = require('kafka-node');

const kafkaHost = 'localhost:9092';

// The client manages the connection to the broker.
const client = new kafka.KafkaClient({ kafkaHost });

// Subscribe to the liveness-check topic as part of a dedicated consumer group.
const consumer = new kafka.Consumer(
  client,
  [{ topic: 'liveness-check-topic' }],
  { groupId: 'liveness-check-group', autoCommit: true }
);

let messageReceived = false;

client.on('ready', () => {
  console.log('Kafka consumer is ready');
});

consumer.on('error', (err) => {
  console.error(err);
});

consumer.on('message', () => {
  messageReceived = true;
  console.log('Received message from Kafka');
});

// Check whether the consumer has received a message in the last 5 seconds.
setTimeout(() => {
  if (!messageReceived) {
    console.error('Kafka consumer has not received a message in the last 5 seconds');
    process.exit(1);
  }
  process.exit(0);
}, 5000);
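Assuming the script is saved as liveness-check.js (the filename is only an example), it can be run on a schedule or wired into a Kubernetes liveness probe with node liveness-check.js; a non-zero exit status means no message arrived within the timeout.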

The Problem of Constantly Adding More Brokers

Adding more brokers to a Kafka instance can have some negative impacts on performance, such as:

  • Increased resource usage: Each broker requires some resources, such as memory and CPU. By adding more brokers, you will increase the overall resource usage of the Kafka cluster.
  • Increased complexity: Managing a large number of brokers can be complex. This can be especially true if the brokers are located on different servers or in different data centers.
  • Increased risk of rebalancing: When a broker is added or removed from a Kafka cluster, the cluster needs to rebalance. This can be a time-consuming process, and it can disrupt service.

There are many steps that can be taken to prevent the negative impacts of adding more brokers. The foremost is to plan carefully. Before adding new brokers, take some time to draw out an optimal design. The following factors may be taken into consideration:

  • What is the current load on the Kafka cluster?
  • What kind of traffic is the Kafka cluster handling?
  • What are the performance requirements for the Kafka cluster?

Beyond careful planning, two further measures help:

  • Use a broker placement strategy: A broker placement strategy can be used to distribute brokers across servers and data centers in a way that maximizes performance and minimizes risk.
  • Use a monitoring tool: A monitoring tool can be used to monitor the performance and health of your Kafka cluster. This can help you to identify and resolve any problems that occur.

Fine-tuning individual broker configurations is often the best way to manage the performance issues that arise from adding more brokers.

One example of fine-tuning a broker configuration in Kafka is to adjust the log.segment.bytes parameter. This parameter controls the maximum size of a Kafka log segment. By default, the value of this parameter is 1GB. If your Kafka cluster is handling large messages, you may need to increase this value to improve performance.

Another example of fine-tuning a broker configuration in Kafka is to adjust the num.io.threads parameter. This parameter controls the number of threads that a Kafka broker uses to handle I/O requests. By default, the value of this parameter is 8. If your Kafka cluster is handling a high volume of traffic, you may need to increase this value to improve performance.

Here are some specific examples of how to fine-tune Kafka broker configurations:

  • If you are handling large messages, you may need to increase the log.segment.bytes parameter. For example, you could raise it to 2GB, which is the maximum for this setting since it is a 32-bit integer value.
  • If your Kafka cluster is handling a high volume of traffic, you may need to increase the num.io.threads parameter. For example, you could increase the value of this parameter to 16 or 32.
  • If you are running a multi-broker cluster, you may need to adjust the default.replication.factor parameter, which controls the replication factor of automatically created topics. The Apache Kafka default is 1; a value of 3 is the usual production choice when you need higher availability or durability.
  • If you are using a Kafka cluster to store persistent data, you may need to adjust the retention.ms topic parameter (or the broker-level log.retention.hours). These control how long Kafka keeps messages before deleting them; the default is 7 days. If you need to store data for a longer period of time, you can increase this value.

It is important to note that there is no one-size-fits-all answer to the question of how to fine-tune Kafka broker configurations. The best configuration for your Kafka cluster will depend on your specific needs and requirements, so monitor the performance of your cluster and adjust the configuration as needed.
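As an illustrative sketch of adjusting such settings with kafka-configs.sh (the broker ID, topic name, and values are assumptions, and on some Kafka versions these settings can only be changed in server.properties followed by a rolling restart):

# Raise the I/O thread count on broker 1 (num.io.threads can be changed dynamically on recent versions)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 1 \
  --alter --add-config num.io.threads=16

# Raise the segment size for one topic (segment.bytes is the topic-level counterpart of log.segment.bytes)
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config segment.bytes=2147483647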

Managing Storage – Cost and Count

One of the most common issues with scaling Kafka instances is the volume and cost of storage, both of which rise steeply as more data is managed by the tool. There are a number of ways to manage storage issues when working with Kafka. Here are a few tips:

  • Set retention policies. Kafka allows you to set retention policies for each topic, specifying how long Kafka should keep messages before deleting them. By setting appropriate retention policies, you can reduce the amount of storage space that Kafka is using (see the example after this list).
  • Use a tiered storage architecture. A tiered storage architecture uses different types of storage media, such as SSDs and HDDs, to store data. SSDs are faster and more expensive than HDDs, and can handle more I/O; HDDs are slower and less expensive, but offer more capacity per dollar. You can store frequently accessed data on SSDs and less frequently accessed data on HDDs, which helps reduce the cost of storing your Kafka data.
  • Use a cloud-based storage provider. There are a number of cloud-based storage providers that offer managed Kafka services. These services can automatically scale your Kafka cluster up or down as needed, and they can also provide tiered storage options. Using a cloud-based storage provider can help you to reduce the cost and complexity of managing your Kafka storage.
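As a sketch, retention can be capped per topic by time, by size, or both (the topic name and limits below are arbitrary examples):

# Keep at most 3 days of data and roughly 50GB per partition for the topic "orders"
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config retention.ms=259200000,retention.bytes=53687091200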

Improper Data Retention Policies

On a somewhat related note, it is always a good idea to purge a Kafka instance of data that has already been consumed and is no longer needed. Improper data retention settings in Kafka can lead to a number of problems. One specific example of an improper data retention setting would be retaining all messages in a topic for an indefinite period of time, which can be done by setting the topic’s retention.ms parameter to -1. This causes Kafka to keep all messages in the topic until they are manually deleted.

This would be an improper data retention setting for a number of reasons. First, it would waste storage space. Second, it could impact the performance of the Kafka cluster by increasing the amount of data that Kafka has to manage. Third, it could put the organization at risk of compliance violations if the organization is required to comply with regulations that specify how long data must be retained. Fourth, it could increase the risk of data breaches if sensitive data is retained for longer than necessary.

To prevent these problems, it is important to set appropriate retention policies for each Kafka topic. The retention policy for a topic should specify how long Kafka should keep messages in the topic before deleting them, and should be set based on the needs of the organization and the specific type of data being stored in the topic. Here are some methods to mitigate the problems caused by improper data retention settings in Kafka:

  • Set appropriate retention policies: Kafka allows you to set retention policies for each topic, specifying how long Kafka should keep messages in each topic before deleting them. Setting appropriate retention policies reduces the amount of storage space Kafka uses and improves the performance of the cluster (see the example after this list).
  • Monitor storage usage: It is important to monitor the storage usage of your Kafka cluster to identify any topics that are retaining too much data. You can use a monitoring tool to help you do this.
  • Archive old data: If you have old data that you no longer need to access frequently, you can archive it to a less expensive storage medium, such as a cloud-based object storage service.
  • Delete unused data: If you have data that you are no longer using, you should delete it. This will free up storage space and reduce costs.
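To spot and correct the unlimited-retention case described above, you can inspect a topic’s configuration overrides and either set a finite value or delete the override so the broker-level default applies again (the topic name below is a placeholder):

# Show any configuration overrides set for the topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name audit-events --describe

# Remove a retention.ms=-1 override so the broker default applies again
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name audit-events \
  --alter --delete-config retention.ms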

Partition Leadership Balancing and Reassignment

Partition leadership balancing in Kafka is the process of evenly distributing the leadership of partitions across the Kafka brokers in a cluster. This is important for performance and reliability, as leaders are responsible for handling read and write requests for their partitions.

The kafka-reassign-partitions.sh script is a tool that can be used to manually rebalance partition leadership and replica placement in a Kafka cluster. This can be useful for correcting an unbalanced leadership distribution or for migrating partitions to different brokers.

Partition leadership balancing and kafka-reassign-partitions.sh are used in the following situations:

  • To improve performance and reliability: When leaders are evenly distributed across the brokers, the Kafka cluster is better able to handle load spikes and failures.
  • To correct an unbalanced leadership distribution: Over time, the leadership distribution in a Kafka cluster can become unbalanced. This can be due to factors such as broker failures, topic additions and removals, and changes in the traffic patterns of the cluster. Running kafka-reassign-partitions.sh to manually rebalance the leadership of partitions can correct this imbalance.
  • To migrate partitions to different brokers: If you need to move partitions to different brokers, you can use kafka-reassign-partitions.sh to do so. This can be useful for maintenance tasks such as upgrading brokers or decommissioning brokers.

Partition leadership balancing is a dynamic process that is continuously happening in the background. However, there may be cases where it is necessary to manually rebalance the leadership of partitions. kafka-reassign-partitions.sh is a powerful tool, but it should be used with caution: if used incorrectly, it can disrupt the operation of the Kafka cluster. Before using it, make sure you understand the impact it will have on the cluster, and have a rollback plan in case something goes wrong.
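When leadership has simply drifted away from the preferred (first-listed) replicas, a full reassignment is often unnecessary; newer Kafka versions ship a dedicated election tool instead. A sketch, with a placeholder cluster address (older releases used kafka-preferred-replica-election.sh):

# Move leadership back to each partition's preferred replica across the whole cluster
kafka-leader-election.sh --bootstrap-server localhost:9092 \
  --election-type preferred --all-topic-partitions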

Unsupported Queues

There are some queueing paradigms that Apache Kafka does not support natively, such as per-message acknowledgement and requeueing, message priorities, and delayed delivery. Users who have historically used RabbitMQ or ActiveMQ often run into this problem when they try to migrate their old queuing architecture to their new Kafka instances.

When implementing unsupported queues in Kafka, use a separate Kafka cluster for point-to-point queues and request/reply queues. This helps isolate the performance of these messaging patterns from the performance of other Kafka workloads. It is also good to use a smaller number of partitions for point-to-point and request/reply queues in order to reduce latency. Finally, having a rollback plan will help you minimize the disruption to your service if you encounter problems with your point-to-point or request/reply queues.

Data Transformation problems with Kafka

Transforming data while it is in transit is not an ideal use case for Kafka; the broker itself is designed to deliver messages, not to modify them.

A complex data transformation can cause the Kafka cluster to become overloaded and unresponsive. A data transformation error can corrupt the data and/or disrupt the Kafka cluster. An incompatible data transformation can cause consumers to reject messages, which can lead to data loss. And a data transformation that is not properly tested can cause unexpected problems in production. There are a number of lightweight data transformation libraries available that are specifically designed for Kafka; these can help reduce the performance overhead of data transformation.

Inconsistent Topic Offsets

Topic offsets in Kafka are a way of tracking the progress of a consumer group in consuming messages from a topic. Each partition of a topic has its own set of offsets, which indicate the last message that was successfully processed by the consumer group for that partition.

Topic offsets are important because they allow consumers to resume consuming from where they left off, even if the consumer crashes or the Kafka cluster is restarted. Topic offsets also allow consumers to process messages in parallel, which can improve performance.

If topic offsets are not replicated in Kafka, the following problems can occur:

  • Lost progress: If a consumer crashes before it can commit its offsets, the work it has done since the last commit is lost and those messages will have to be processed again.
  • Duplication: If a consumer’s committed offsets are lost, a restarted consumer may begin consuming from an earlier position, or from the beginning of the topic, depending on its auto.offset.reset setting. This can lead to duplicate messages being processed.
  • Inconsistent state: If consumers are not able to see the same offsets, they will have an inconsistent view of the data in the topic. This can lead to errors and unexpected behavior.

To avoid these problems, it is important that topic offsets are replicated in Kafka. Kafka stores consumer group offsets in an internal topic, __consumer_offsets, which is itself replicated across brokers (controlled by offsets.topic.replication.factor, default 3). There are also proven ways to manage this problem when it occurs: consumer groups allow consumers to coordinate their consumption of messages from a topic, which helps prevent data loss and duplication, and consumers should commit their offsets regularly so that no progress is lost.
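To see what a consumer group has committed, and how far it lags behind the end of each partition, you can describe the group; the group name and address below are placeholders:

# CURRENT-OFFSET is the last committed offset; LAG = LOG-END-OFFSET minus CURRENT-OFFSET
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group liveness-check-group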

Kafka Instances Generating CPU Loads

At times, Kafka instances can generate high CPU loads. The amount of CPU load that a Kafka instance generates will depend on a number of factors, such as the volume of traffic, the complexity of the data, and the configuration of the Kafka instance.

Here are some ways to address the problem of high CPU loads from Kafka instances:

  • Monitor the CPU usage of your Kafka instances. This will help you to identify any instances that are experiencing high CPU loads.
  • Identify the root cause of the high CPU load. Once you have identified a Kafka instance that is experiencing a high CPU load, you need to identify the root cause of the problem. This may involve checking the Kafka logs, monitoring the Kafka metrics, and analyzing the performance of your Kafka applications.
  • Take steps to address the root cause of the high CPU load. Once you have identified the root cause of the high CPU load, you can take steps to address it. For example, if the high CPU load is caused by a specific application, you can optimize the application or move it to a different server.
  • Scale up your Kafka cluster. If you are experiencing high CPU loads on all of your Kafka instances, you may need to scale up your Kafka cluster. This means adding more Kafka brokers to the cluster.

Kafka Monitoring is Key

Amidst the ever-shifting landscape of data streaming, the indispensability of robust monitoring tools stands as an undeniable truth when it comes to understanding Kafka performance. Throughout this exploration of Kafka’s common conundrums, one recurring theme has emerged: the critical role of vigilant monitoring. Kafka’s capacity for real-time data processing and event-driven architectures is a powerful tool, but harnessing its potential requires proactive observability. 

Enter Lumigo, your solution for Kafka monitoring excellence. With Lumigo’s intuitive OTel operator, achieving end-to-end observability for your applications and tools is effortless. It seamlessly integrates into your existing environment with just a 1-click OpenTelemetry deployment. To get started enhancing Kafka performance and reliability, sign up for Lumigo today. We’re here to help you navigate the intricacies of troubleshooting and ensure your Kafka data streaming experiences are nothing short of exceptional.