20 Best Practices for Working With Apache Kafka at Scale - DZone Big Data (2022)


In this post, a software engineer shares 20 ways developers and data scientists can get the most out of Apache Kafka.

by Tony Mancill



Aug. 15, 18 · Big Data Zone · Opinion


Apache Kafka is a widely popular distributed streaming platform that thousands of companies like New Relic, Uber, and Square use to build scalable, high-throughput, and reliable real-time streaming systems. For example, the production Kafka cluster at New Relic processes more than 15 million messages per second for an aggregate data rate approaching 1 Tbps.

Kafka has gained popularity with application developers and data management experts because it greatly simplifies working with data streams. But Kafka can get complex at scale. A high-throughput publish-subscribe (pub/sub) pattern with automated data retention limits doesn't do you much good if your consumers are unable to keep up with your data stream and messages disappear before they're ever seen. Likewise, you won't get much sleep if the systems hosting the data stream can't scale to meet demand or are otherwise unreliable.

In hopes of reducing that complexity, I'd like to share 20 of New Relic's best practices for operating scalable, high-throughput Kafka clusters. We've divided these tips into four categories for working with:

1. Partitions

2. Consumers

3. Producers


4. Brokers

But First, a Quick Rundown of Kafka and Its Architecture

Kafka is an efficient distributed messaging system providing built-in data redundancy and resiliency while retaining both high-throughput and scalability. It includes automatic data retention limits, making it well suited for applications that treat data as a stream, and it also supports "compacted" streams that model a map of key-value pairs.

To understand these best practices, you'll need to be familiar with some key terms:

  • Message: A record or unit of data within Kafka. Each message has a key and a value, and optionally headers.

  • Producer: Producers publish messages to Kafka topics. Producers decide which topic partition to publish to, either randomly (round-robin) or using a partitioning algorithm based on a message's key.

  • Broker: Kafka runs in a distributed system or cluster. Each node in the cluster is called a broker.

  • Topic: A topic is a category to which data records — or messages — are published. Consumers subscribe to topics in order to read the data written to them.

  • Topic partition: Topics are divided into partitions, and each message is given an offset. Each partition is typically replicated at least once or twice. Each partition has a leader and one or more replicas (copies of the data) that live on followers, providing protection against a broker failure. Every broker in the cluster acts as a leader for some partitions and a follower for others, but holds at most one replica of any given topic partition. The leader handles all reads and writes.

  • Offset: Each message within a partition is assigned an offset, a monotonically increasing integer that serves as a unique identifier for the message within the partition.

  • Consumer: Consumers read messages from Kafka topics by subscribing to topic partitions. The consuming application then processes the message to accomplish whatever work is desired.

  • Consumer group: Consumers can be organized into logical consumer groups. Topic partitions are assigned across the consumers in a group so that the load is balanced; in other words, each message is seen by exactly one consumer in the group. If a consumer goes away, its partitions are assigned to another consumer in the group; this is referred to as a rebalance. If there are more consumers in a group than partitions, some consumers will be idle. If there are fewer consumers than partitions, some consumers will consume from more than one partition.


  • Lag: A consumer is lagging when it's unable to read from a partition as fast as messages are produced to it. Lag is expressed as the number of offsets behind the head of the partition. The time required to recover from lag (to "catch up") depends on how many more messages per second the consumer can consume than are produced:

time = messages / (consume rate per second - produce rate per second)
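The formula above can be computed directly. This is a minimal sketch in Python; the function name and the numbers in the example are illustrative, not from the article:

```python
def catch_up_seconds(lag_messages: int, consume_rate: float, produce_rate: float) -> float:
    """Estimate the seconds a lagging consumer needs to reach the head of a
    partition. Returns infinity if the consumer cannot keep up at all."""
    net_rate = consume_rate - produce_rate  # messages/sec gained on the backlog
    if net_rate <= 0:
        return float("inf")  # lag never shrinks
    return lag_messages / net_rate

# A consumer 600,000 offsets behind, consuming 12,000 msg/s while
# 10,000 msg/s are still being produced, needs 300 seconds to catch up.
print(catch_up_seconds(600_000, 12_000, 10_000))  # → 300.0
```

Note that if the consume rate only barely exceeds the produce rate, recovery time grows without bound, which is why sustained headroom matters more than peak throughput.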

Best Practices for Working With Partitions

  • Understand the data rate of your partitions to ensure you have the correct retention space. The data rate of a partition is the rate at which data is produced to it; in other words, it's the average message size times the number of messages per second. The data rate dictates how much retention space, in bytes, is needed to guarantee retention for a given amount of time. If you don't know the data rate, you can't correctly calculate the retention space needed to meet a time-based retention goal. The data rate also specifies the minimum performance a single consumer needs to support without lagging.
  • Unless you have architectural needs that require you to do otherwise, use random partitioning when writing to topics. When you're operating at scale, uneven data rates among partitions can be difficult to manage. There are three main reasons for this:
    • First, consumers of the "hot" (higher throughput) partitions will have to process more messages than other consumers in the consumer group, potentially leading to processing and networking bottlenecks.
    • Second, topic retention must be sized for the partition with the highest data rate, which can result in increased disk usage across other partitions in the topic.
    • Third, attaining an optimum balance in terms of partition leadership is more complex than simply spreading the leadership across all brokers. A "hot" partition might carry 10 times the weight of another partition in the same topic.
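To put the first tip into numbers: retention space is simply the partition's data rate multiplied by the retention window. A hedged sketch, with illustrative figures:

```python
def retention_bytes(avg_msg_bytes: int, msgs_per_sec: float, retention_hours: float) -> int:
    """Minimum bytes of retention a partition needs to hold
    `retention_hours` worth of data at the given data rate."""
    data_rate = avg_msg_bytes * msgs_per_sec  # bytes/sec produced to the partition
    return int(data_rate * retention_hours * 3600)

# 1 KB messages at 2,000 msg/s retained for 24 hours needs about 172.8 GB
# of disk per partition.
print(retention_bytes(1_000, 2_000, 24))  # → 172800000000
```

Because retention must be sized for the hottest partition, uneven data rates inflate this figure for every partition in the topic.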

For a closer look at working with topic partitions, see Effective Strategies for Kafka Topic Partitioning.

Best Practices for Working With Consumers

  • If your consumers are running versions of Kafka older than 0.10, upgrade them. In version 0.8.x, consumers use Apache ZooKeeper for consumer group coordination, and a number of known bugs can result in long-running rebalances or even failures of the rebalance algorithm (we refer to these as "rebalance storms"). During a rebalance, one or more partitions are assigned to each consumer in the consumer group. In a rebalance storm, partition ownership is continually shuffled among the consumers, preventing any consumer from making real progress on consumption.
  • Tune your consumer socket buffers for high-speed ingest. In Kafka 0.10.x, the parameter is receive.buffer.bytes, which defaults to 64 kB. In Kafka 0.8.x, the parameter is socket.receive.buffer.bytes, which defaults to 100 kB. Both of these default values are too small for high-throughput environments, particularly if the network's bandwidth-delay product between the broker and the consumer is larger than that of a local area network (LAN). For high-bandwidth networks (10 Gbps or higher) with latencies of 1 millisecond or more, consider setting the socket buffers to 8 or 16 MB. If memory is scarce, consider 1 MB. You can also use a value of -1, which lets the underlying operating system tune the buffer size based on network conditions. However, the automatic tuning might not occur fast enough for consumers that need to start "hot."
  • Design high-throughput consumers to implement back-pressure when warranted. It is better to consume only what you can process efficiently than it is to consume so much that your process grinds to a halt and then drops out of the consumer group. Consumers should consume into fixed-sized buffers (see the Disruptor pattern), preferably off-heap if running in a Java Virtual Machine (JVM). A fixed-size buffer will prevent a consumer from pulling so much data onto the heap that the JVM spends all of its time performing garbage collection instead of the work you want to achieve — which is processing messages.
  • When running consumers on a JVM, be wary of the impact that garbage collection can have on your consumers. For example, long garbage collection pauses can result in dropped ZooKeeper sessions or consumer-group rebalances. The same is true for brokers, which risk dropping out of the cluster if garbage collection pauses are too long.
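A consumer configuration reflecting the socket-buffer and back-pressure tips above might look like the following. This is a sketch using 0.10.x-style property names expressed as a Python dict; the exact property set depends on your client library, and the broker addresses and group name are placeholders:

```python
# Hypothetical consumer settings illustrating the advice above.
consumer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # placeholder addresses
    "group.id": "my-consumer-group",                   # placeholder group name
    # 16 MB socket buffer for high-bandwidth links with >= 1 ms latency;
    # set to -1 instead to let the OS auto-tune.
    "receive.buffer.bytes": 16 * 1024 * 1024,
    # Bound how much data each fetch pulls onto the heap per partition,
    # a simple form of back-pressure.
    "max.partition.fetch.bytes": 1 * 1024 * 1024,
}
print(consumer_config["receive.buffer.bytes"])  # → 16777216
```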

Best Practices for Working With Producers

  • Configure your producer to wait for acknowledgments. This is how the producer knows that the message has actually made it to the partition on the broker. In Kafka 0.10.x, the setting is acks; in 0.8.x, it's request.required.acks. Kafka provides fault-tolerance via replication, so the failure of a single node or a change in partition leadership does not affect availability. If you configure your producers without acks (otherwise known as "fire and forget"), messages can be silently lost.
  • Configure retries on your producers. The default value is 3, which is often too low. The right value will depend on your application; for applications where data-loss cannot be tolerated, consider Integer.MAX_VALUE (effectively, infinity). This guards against situations where the broker leading the partition isn't able to respond to a produce request right away.
  • For high-throughput producers, tune buffer sizes, particularly buffer.memory and batch.size (which is counted in bytes). Because batch.size is a per-partition setting, producer performance and memory usage can be correlated with the number of partitions in the topic. The values here depend on several factors: producer data rate (both the size and number of messages), the number of partitions you are producing to, and the amount of memory you have available. Keep in mind that larger buffers are not always better because if the producer stalls for some reason (say, one leader is slower to respond with acknowledgments), having more data buffered on-heap could result in more garbage collection.
  • Instrument your application to track metrics such as number of produced messages, average produced message size, and number of consumed messages.
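The producer tips above can be summarized in one configuration sketch. As with the consumer example, this uses 0.10.x-style property names in a Python dict; the broker addresses are placeholders and the buffer values are illustrative starting points, not recommendations:

```python
# Hypothetical producer settings illustrating the tips above.
producer_config = {
    "bootstrap.servers": "broker1:9092,broker2:9092",  # placeholder addresses
    "acks": "all",            # wait for acknowledgments; never fire-and-forget
    "retries": 2147483647,    # Integer.MAX_VALUE for applications that cannot lose data
    "buffer.memory": 64 * 1024 * 1024,  # total memory for records awaiting send
    "batch.size": 64 * 1024,            # per-partition batch size, in bytes
}
print(producer_config["batch.size"])  # → 65536
```

Remember that batch.size applies per partition, so total buffering scales with the number of partitions you produce to.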

Best Practices for Working With Brokers

  • Compacted topics require memory and CPU resources on your brokers. Log compaction needs both heap (memory) and CPU cycles on the brokers to complete successfully, and failed log compaction puts brokers at risk from a partition that grows unbounded. You can tune log.cleaner.dedupe.buffer.size and log.cleaner.threads on your brokers, but keep in mind that these values affect heap usage on the brokers. If a broker throws an OutOfMemoryError exception, it will shut down and potentially lose data. The buffer size and thread count will depend on both the number of topic partitions to be cleaned and the data rate and key size of the messages in those partitions. Monitoring the log-cleaner log file for ERROR entries is the surest way to detect issues with log cleaner threads.
  • Monitor your brokers for network throughput. Make sure to do this for both transmit (TX) and receive (RX), as well as disk I/O, disk space, and CPU usage. Capacity planning is a key part of maintaining cluster performance.
  • Distribute partition leadership among brokers in the cluster. Leadership requires a lot of network I/O resources. For example, when running with replication factor 3, a leader must receive the partition data, transmit two copies to replicas, plus transmit to however many consumers want to consume that data. So, in this example, being a leader is at least four times as expensive as being a follower in terms of network I/O used. Leaders may also have to read from disk; followers only write.
  • Don't neglect to monitor your brokers for in-sync replica (ISR) shrinks, under-replicated partitions, and unpreferred leaders. These are signs of potential problems in your cluster. For example, frequent ISR shrinks for a single partition can indicate that the data rate for that partition exceeds the leader's ability to service the consumer and replica threads.
  • Modify the Apache Log4j properties as needed. Kafka broker logging can use an excessive amount of disk space. However, don't forgo logging completely — broker logs can be the best, and sometimes only, way to reconstruct the sequence of events after an incident.
  • Either disable automatic topic creation or establish a clear policy regarding the cleanup of unused topics. For example, if no messages are seen for x days, consider the topic defunct and remove it from the cluster. This will avoid the creation of additional metadata within the cluster that you'll have to manage.
  • For sustained, high-throughput brokers, provision sufficient memory to avoid reading from the disk subsystem. Partition data should be served directly from the operating system's file system cache whenever possible. However, this means you'll have to ensure your consumers can keep up; a lagging consumer will force the broker to read from disk.
  • For a large cluster with high-throughput service level objectives (SLOs), consider isolating topics to a subset of brokers. How you determine which topics to isolate will depend on the needs of your business. For example, if you have multiple online transaction processing (OLTP) systems using the same cluster, isolating the topics for each system to distinct subsets of brokers can help to limit the potential blast radius of an incident.
  • Using older clients with newer topic message formats, and vice versa, places extra load on the brokers as they convert the formats on behalf of the client. Avoid this whenever possible.
  • Don't assume that testing a broker on a local desktop machine is representative of the performance you'll see in production. Testing over a loopback interface to a partition using replication factor 1 is a very different topology from most production environments. The network latency is negligible via the loopback and the time required to receive leader acknowledgements can vary greatly when there is no replication involved.
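The leadership cost arithmetic from the partition-leadership tip above can be made explicit. This is a back-of-the-envelope sketch, not a Kafka API; the function and its unit model are illustrative:

```python
def leader_network_units(replication_factor: int, consumer_count: int) -> int:
    """Relative network I/O units for a partition leader: one unit to receive
    the produced data, plus one transmit per follower replica and per consumer.
    A follower, by comparison, spends one unit (receive and write to disk)."""
    followers = replication_factor - 1
    return 1 + followers + consumer_count

# With replication factor 3 and a single consumer, a leader does 4 units of
# network I/O, i.e. at least four times the cost of being a follower.
print(leader_network_units(3, 1))  # → 4
```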

Additional Resources

Hopefully, these tips will get you thinking about how to use Kafka more effectively. If you're looking to increase your Kafka expertise, review the operations section of the Kafka documentation, which contains useful information about manipulating a cluster, and draws on experience from LinkedIn, where Kafka was developed. Additionally, Confluent regularly conducts and publishes online talks that can be quite helpful in learning more about Kafka.

This article was originally posted on the New Relic blog.


Published at DZone with permission of Tony Mancill, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.


What is the maximum size of data for Kafka? ›

Kafka has a default limit of 1 MB per message in a topic. There are approaches for handling larger messages, such as raising the broker's maximum message size (covered below).

Can Kafka handle Big Data? ›

Yes. Apache Kafka works as a pub/sub system and plays a significant role in the Big Data ecosystem. Kafka can store high-volume data on commodity hardware and is designed as a multi-subscription system.

How do you handle a large message in Kafka? ›

Kafka Broker Configuration

An optional configuration property, message.max.bytes, can be used to allow all topics on a broker to accept messages greater than 1 MB in size. It holds the value of the largest record batch size allowed by Kafka after compression (if compression is enabled).

How do you achieve scalability in Kafka? ›

To scale the Kafka connector side you have to increase the number of tasks, ensuring that there are sufficient partitions. In theory, you can set the number of partitions to a large number initially, but in practice, this is a bad idea.

How do you check message size in Kafka? ›

To know the amount of bytes received by a topic, you can measure the kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec metric on the server side, or check the outgoing-byte-rate metric on the producer side.

What is Kafka not good for? ›

It's best to avoid using Kafka as the processing engine for ETL jobs, especially where real-time processing is needed. That said, there are third-party tools you can use that work with Kafka to give you additional robust capabilities – for example, to optimize tables for real-time analytics.

How Kafka is used in big data? ›

Kafka is used for real-time streams of data, to collect big data, or to do real time analysis (or both). Kafka is used with in-memory microservices to provide durability and it can be used to feed events to CEP (complex event streaming systems) and IoT/IFTTT-style automation systems.

What is the purpose of Apache Kafka? ›

Apache Kafka is a distributed data store optimized for ingesting and processing streaming data in real time. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records simultaneously.

What makes Kafka so fast? ›

Why is Kafka fast? Kafka achieves low-latency message delivery through sequential I/O and the zero-copy principle. The same techniques are commonly used in many other messaging/streaming platforms. Zero copy avoids redundant copies of the data between application context and kernel context.

How do you scale a topic in Kafka? ›

The main way we scale data consumption from a Kafka topic is by adding more consumers to a consumer group. It is common for Kafka consumers to do high-latency operations such as write to a database or a time-consuming computation on the data.

Why is Kafka horizontally scalable? ›

Partitions: They are a way by which Kafka provides redundancy and scalability. Each partition can be hosted on a different server, which means that a single topic can be scaled horizontally across multiple servers to provide performance far beyond the ability of a single server.

How do I ensure a message order in Kafka? ›

Short Answer
  1. Initialize the project.
  2. Get Confluent Platform.
  3. Create the Kafka topic.
  4. Describe the topic.
  5. Configure the project application.
  6. Set the application properties.
  7. Create the Kafka Producer application.
  8. Create data to produce to Kafka.

Can we send file through Kafka? ›

Sending large files directly via Kafka is possible and sometimes easier to implement. The architecture is much simpler and more cost-effective.

How do I push a message in Kafka? ›

How to Produce a Message into a Kafka Topic using the CLI?
  1. Find your Kafka hostname and port e.g., localhost:9092.
  2. If Kafka v2.5+, use the --bootstrap-server option.
  3. If older version of Kafka, use the --broker-list option.
  4. Provide the mandatory parameters: topic name.
  5. Use the kafka-console-producer.sh CLI as outlined below.

How do I upload files to Kafka? ›

  1. Start or stop services.
  2. Connect to Apache Kafka from a different machine.
  3. Create a Kafka multi-broker cluster.
  4. Upload files using SFTP.
  5. Run a Kafka producer and consumer.

What is Kafka message format? ›

A message in Kafka is a key-value pair with a small amount of associated metadata. A message set is just a sequence of messages with offset and size information. This format happens to be used both for on-disk storage on the broker and for the on-the-wire format.

Why do we need partitions in Kafka? ›

Partitioning takes the single topic log and breaks it into multiple logs, each of which can live on a separate node in the Kafka cluster. This way, the work of storing messages, writing new messages, and processing existing messages can be split among many nodes in the cluster.

What is batch size in Kafka? ›

batch.size is the maximum number of bytes that will be included in a batch. The default is 16 KB. Increasing the batch size to 32 KB or 64 KB can help increase the compression, throughput, and efficiency of requests. Any message that is bigger than the batch size will not be batched.

How many connections can Kafka handle? ›

A maximum of 16384 GiB of storage per broker. A cluster that uses IAM access control can have up to 3000 TCP connections per broker at any given time. To increase this limit, you can adjust the listener.

Can I use Kafka as database? ›

February 8, 2022. Apache Kafka is more than just a better message broker. The framework implementation has features that give it database capabilities. It's now replacing the relational databases as the definitive record for events in businesses.

What are the advantages of Kafka? ›

Apache Kafka is massively scalable because it allows data to be distributed across multiple servers, and it's extremely fast because it decouples data streams, which results in low latency. It can also distribute and replicate partitions across many servers, which protects against server failure.

How much storage does Kafka need? ›

Furthermore, Kafka uses heap space very carefully and does not require setting heap sizes more than 6 GB. This will result in a file system cache of up to 28-30 GB on a 32 GB machine. You need sufficient memory to buffer active readers and writers.


How many Kafka nodes do I need? ›

Even a lightly used Kafka cluster deployed for production purposes requires three to six brokers and three to five ZooKeeper nodes. The components should be spread across multiple availability zones for redundancy. Note: ZooKeeper will eventually be replaced, but its role will still have to be performed by the cluster.

Does Kafka store data in memory? ›

Kafka's designers made this simple by relying on the Linux file system, which caches up to the limits of available memory. Inside a Kafka broker, all stream data gets instantly written onto a persistent log on the filesystem, where it is cached before writing it to disk.

Where are Kafka messages stored? ›

Kafka stores all the messages with the same key into a single partition. Each new message in the partition gets an Id which is one more than the previous Id number. This Id number is also called the Offset. So, the first message is at 'offset' 0, the second message is at offset 1 and so on.

What is Kafka used for? ›

Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.


How can I make Kafka producer faster? ›

  1. Provision your Kafka cluster.
  2. Initialize the project.
  3. Write the cluster information into a local file.
  4. Download and setup the Confluent CLI.
  5. Create a topic.
  6. Run a baseline producer performance test.
  7. Run a producer performance test with optimized throughput.
  8. Teardown Confluent Cloud resources.

What is buffer memory in Kafka? ›

buffer.memory represents the total bytes of memory that the producer can use to buffer records waiting to be sent to the server. The default buffer.memory is 32 MB. If the producer sends records faster than they can be delivered to the server, the buffer.memory will be exhausted.

Is Kafka push or pull? ›

Since Kafka is pull-based, it implements aggressive batching of data. Kafka like many pull based systems implements a long poll (SQS, Kafka both do). A long poll keeps a connection open after a request for a period and waits for a response.

How can I run Kafka without zookeeper? ›

To start with Kafka without Zookeeper, you should run Kafka with Kafka Raft metadata mode i.e. KRaft. The KRaft controllers collectively form a Kraft quorum, which stores all the metadata information regarding Kafka clusters.

What is Kafka architecture? ›

Kafka Streams partitions data for processing it. In both cases, this partitioning is what enables data locality, elasticity, scalability, high performance, and fault tolerance. Kafka Streams uses the concepts of partitions and tasks as logical units of its parallelism model based on Kafka topic partitions.

How many messages per second can Kafka handle? ›

How many messages can Apache Kafka® process per second? At Honeycomb, it's easily over one million messages.

How many topics can be created in Kafka? ›

Is There a Limit on the Number of Topics in a Kafka Instance?
(Table omitted: flavor, number of brokers, and maximum partitions per broker.)

Can Kafka have multiple clusters? ›

A single Kafka cluster is enough for local development, but it is often beneficial to have multiple clusters. Among the reasons that best describe the advantages of multiple clusters: isolation of types of data.


