Announcing Kafka-on-Pulsar: bring native Kafka protocol support to Apache Pulsar
March 24, 2020
We are excited to announce that StreamNative and OVHcloud are open-sourcing “Kafka on Pulsar” (KoP). KoP brings native Apache Kafka protocol support to Apache Pulsar by introducing a Kafka protocol handler on Pulsar brokers. By adding the KoP protocol handler to your existing Pulsar cluster, you can migrate your existing Kafka applications and services to Pulsar without modifying code. This enables Kafka applications to leverage Pulsar’s powerful features, such as:
- Streamlined operations with enterprise-grade multi-tenancy
- Simplified operations with a rebalance-free architecture
- Infinite event stream retention with Apache BookKeeper and tiered storage
- Serverless event processing with Pulsar Functions
What is Apache Pulsar?
Apache Pulsar is an event streaming platform designed from the ground up to be cloud-native, deploying a multi-layer, segment-centric architecture. The architecture separates serving and storage into different layers, making the system container-friendly. This cloud-native architecture provides scalability, availability, and resiliency, and enables companies to expand their offerings with real-time, data-enabled solutions. Pulsar has gained wide adoption since it was open-sourced in 2016 and was designated an Apache Top-Level Project in 2018.
The Need for KoP
Pulsar provides a unified messaging model for both queueing and streaming workloads. Pulsar implemented its own protobuf-based binary protocol to provide high performance and low latency. This choice of protobuf makes it convenient to implement Pulsar clients, and the project already supports Java, Go, Python, and C++, alongside third-party clients provided by the community. However, existing applications written against other messaging protocols had to be rewritten to adopt Pulsar’s unified messaging protocol.
To address this, the Pulsar community developed applications to facilitate the migration to Pulsar from other messaging systems. For example, Pulsar provides a Kafka wrapper on top of the Kafka Java API, which allows existing applications that already use the Kafka Java client to switch from Kafka to Pulsar without code changes. Pulsar also has a rich connector ecosystem, connecting Pulsar with other data systems. Yet, there was still a strong demand from those looking to migrate their Kafka applications to Pulsar.
StreamNative and OVHcloud's collaboration
StreamNative was receiving a lot of inbound requests for help migrating from other messaging systems to Pulsar and recognized the need to support other messaging protocols (such as AMQP and Kafka) natively on Pulsar. StreamNative began working on introducing a general protocol handler framework in Pulsar that would allow developers using other messaging protocols to use Pulsar.
Internally, OVHcloud had been running Apache Kafka for years. Despite their experience operating multiple clusters handling millions of messages per second on Kafka, they faced painful operational challenges. For example, putting thousands of topics from thousands of users into a single cluster was difficult without multi-tenancy.
As a result, OVHcloud decided to shift and build the foundation of their topic-as-a-service product, called ioStream, on Pulsar instead of Kafka. Pulsar’s multi-tenancy and the overall architecture with Apache BookKeeper simplified operations compared to Kafka.
After deploying the first region, OVHcloud decided to build a proof-of-concept proxy capable of translating the Kafka protocol to Pulsar on the fly. They encountered some issues, mainly around how to simulate and manipulate offsets and consumer groups, as they did not have access to low-level storage details. During this process, OVHcloud discovered that StreamNative was working on bringing the Kafka protocol natively to Pulsar, and they joined forces to develop KoP.
KoP was developed to provide a streamlined and comprehensive solution leveraging Pulsar and BookKeeper’s event stream storage infrastructure and Pulsar’s pluggable protocol handler framework. KoP is implemented as a protocol handler plugin with protocol name "kafka". It can be installed and configured to run as part of Pulsar brokers.
Both Pulsar and Kafka share a very similar log-based data model for both pub/sub messaging and event streaming. For example, both are built on top of a distributed log. A key difference between the two systems is how they implement the distributed log. Kafka implements the distributed log in a partition-based architecture, where a distributed log (a partition in Kafka) is stored on a set of brokers, while Pulsar deploys a segment-based architecture, implementing its distributed log by leveraging Apache BookKeeper as its scale-out segment storage layer. Pulsar’s segment-based architecture provides benefits such as rebalance-free operations, instant scalability, and infinite event stream storage. You can learn more about the key differences between Pulsar and Kafka in this Splunk blog and in this blog from the BookKeeper project.
Since both of the systems are built on a similar data model, a distributed log, it is very simple to implement a Kafka-compatible protocol handler by leveraging Pulsar’s distributed log storage and its pluggable protocol handler framework (introduced in the 2.5.0 release).
The implementation is done by comparing the protocols between Pulsar and Kafka. We found that there are a lot of similarities between these two protocols. Both protocols are comprised of the following operations:
- Topic Lookup: All the clients connect to any broker to look up the metadata (i.e. the owner broker) of the topics. After fetching the metadata, the clients establish persistent TCP connections to the owner brokers.
- Produce: The clients talk to the owner broker of a topic partition to append messages to a distributed log.
- Consume: The clients talk to the owner broker of a topic partition to read the messages from a distributed log.
- Offset: The messages produced to a topic partition are assigned with an offset. The offset in Pulsar is called MessageId. Consumers can use offsets to seek to a given position within the log to read messages.
- Consumption State: Both systems maintain the consumption state for consumers within a subscription (or a consumer group in Kafka). In Kafka, the consumption state is stored in the __consumer_offsets topic, while in Pulsar it is stored in cursors.
As you can see, these are all the primitive operations provided by a scale-out distributed log storage such as Apache BookKeeper. The core capabilities of Pulsar are implemented on top of Apache BookKeeper. Thus it is pretty easy and straightforward to implement the Kafka concepts by using the existing components that Pulsar has developed on BookKeeper.
The following figure illustrates how we add Kafka protocol support within Pulsar. We introduce a new protocol handler which implements the Kafka wire protocol by leveraging the existing components that Pulsar already has, such as topic discovery, the distributed log library (ManagedLedger), and cursors.
In Kafka, all topics are stored in one flat namespace. In Pulsar, however, topics are organized in hierarchical multi-tenant namespaces. We introduce a kafkaNamespace setting in the broker configuration that allows the administrator to map Kafka topics to Pulsar topics.
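Enabling the handler and the namespace mapping looks roughly like the following broker.conf fragment. The option names are taken from KoP’s documentation at the time of writing and may differ across versions, so treat this as a sketch rather than a definitive configuration:

```properties
# Load the Kafka protocol handler plugin on the broker
messagingProtocols=kafka
protocolHandlerDirectory=./protocols

# Address the broker advertises to Kafka clients
kafkaListeners=PLAINTEXT://127.0.0.1:9092

# Pulsar namespace that Kafka topics are mapped into
kafkaNamespace=default
```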
In order to let Kafka users leverage the multi-tenancy feature of Apache Pulsar, a Kafka user can specify a Pulsar tenant and namespace as the SASL username when using the SASL authentication mechanism to authenticate a Kafka client.
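On the client side, this could look like the following Kafka client properties, where the SASL username carries the Pulsar tenant and namespace. The tenant/namespace values and the token placeholder here are illustrative:

```properties
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
# username = Pulsar tenant/namespace; password = Pulsar auth credential
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="public/default" \
  password="token:<your-pulsar-token>";
```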
Message ID and offset
In Kafka, each message is assigned an offset once it is successfully produced to a topic partition. In Pulsar, each message is assigned a MessageID. The message ID consists of 3 components: ledger-id, entry-id, and batch-index. We use the same approach as the Pulsar-Kafka wrapper to convert a Pulsar MessageID to an offset and vice versa.
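The conversion can be thought of as packing the MessageID components into a single 64-bit Kafka offset. A minimal sketch of that idea follows; the bit widths chosen here are illustrative, not the exact split used by KoP:

```python
# Illustrative split of a 64-bit offset into MessageID components.
LEDGER_BITS, ENTRY_BITS, BATCH_BITS = 28, 24, 12

def message_id_to_offset(ledger_id: int, entry_id: int, batch_index: int) -> int:
    """Pack (ledger-id, entry-id, batch-index) into a single Kafka-style offset."""
    return (ledger_id << (ENTRY_BITS + BATCH_BITS)) | (entry_id << BATCH_BITS) | batch_index

def offset_to_message_id(offset: int) -> tuple:
    """Unpack a Kafka-style offset back into MessageID components."""
    batch_index = offset & ((1 << BATCH_BITS) - 1)
    entry_id = (offset >> BATCH_BITS) & ((1 << ENTRY_BITS) - 1)
    ledger_id = offset >> (ENTRY_BITS + BATCH_BITS)
    return (ledger_id, entry_id, batch_index)
```

Because the mapping is a bijection, seeking by offset from a Kafka client translates directly into seeking by MessageID on the Pulsar side.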
Both a Kafka message and a Pulsar message have key, value, timestamp, and headers (note: this is called ‘properties’ in Pulsar). We convert these fields automatically between Kafka messages and Pulsar messages.
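A minimal sketch of that field mapping, with plain dicts standing in for the actual Kafka and Pulsar message classes (the function and field names here are illustrative):

```python
def kafka_to_pulsar(record: dict) -> dict:
    """Map a Kafka record to a Pulsar message: headers become 'properties'."""
    return {
        "key": record.get("key"),
        "value": record.get("value"),
        "event_timestamp": record.get("timestamp"),
        "properties": dict(record.get("headers") or []),
    }

def pulsar_to_kafka(msg: dict) -> dict:
    """Map a Pulsar message back to a Kafka record: 'properties' become headers."""
    return {
        "key": msg.get("key"),
        "value": msg.get("value"),
        "timestamp": msg.get("event_timestamp"),
        "headers": list(msg.get("properties", {}).items()),
    }
```

The mapping is lossless in both directions, which is what allows Kafka producers and Pulsar consumers (and vice versa) to interoperate on the same topic.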
We use the same topic lookup approach for the Kafka request handler as for the Pulsar request handler. The request handler does topic discovery to look up the ownerships of the requested topic partitions and responds with the ownership information, as part of the Kafka TopicMetadata response, back to Kafka clients.
When the Kafka request handler receives produced messages from a Kafka client, it converts Kafka messages to Pulsar messages by mapping the fields (i.e. key, value, timestamp and headers) one by one, and uses the ManagedLedger append API to append those converted Pulsar messages to BookKeeper. Converting Kafka messages to Pulsar messages allows existing Pulsar applications to consume messages produced by Kafka clients.
When the Kafka request handler receives a consumer request from a Kafka client, it opens a non-durable cursor to read the entries starting from the requested offset. The Kafka request handler converts the Pulsar messages back to Kafka messages to allow existing Kafka applications to consume the messages produced by Pulsar clients.
Group coordinator & offsets management
The most challenging part is implementing the group coordinator and offset management, because Pulsar doesn’t have a centralized group coordinator for assigning partitions to the consumers of a consumer group or for managing offsets for each consumer group. In Pulsar, partition assignment is managed by the broker on a per-partition basis, and offset management is done by the owner broker of a partition storing the acknowledgements in cursors.
It is difficult to align the Pulsar model with the Kafka model. Hence, for the sake of providing full compatibility with Kafka clients, we implemented the Kafka group coordinator by storing the coordinator group changes and offsets in a system topic called public/kafka/__offsets in Pulsar. This allows us to bridge the gap between Pulsar and Kafka and allows people to use existing Pulsar tools and policies to manage subscriptions and monitor Kafka consumers. We add a background thread in the implemented group coordinator to periodically sync offset updates from the system topic to Pulsar cursors. Hence, a Kafka consumer group is effectively treated as a Pulsar subscription. All the existing Pulsar tooling can be used for managing Kafka consumer groups as well.
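The periodic sync performed by that background thread boils down to replaying offset-commit records and advancing each group's cursor to the latest committed offset. A simplified, in-memory sketch of the idea (the function name and data shapes are hypothetical, not KoP's actual internals):

```python
def sync_offsets_to_cursors(offset_log, cursors):
    """Replay offset-commit records, as the background thread would read them
    from the system topic, and advance each (group, partition) cursor to the
    latest committed offset. Cursors never move backwards."""
    for group, partition, offset in offset_log:
        key = (group, partition)
        if offset > cursors.get(key, -1):
            cursors[key] = offset
    return cursors
```

Because the resulting cursors are ordinary Pulsar cursors, a Kafka consumer group shows up as a Pulsar subscription and can be inspected with standard Pulsar tooling.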
Bridge two popular messaging ecosystems
At both companies, we value customer success. We believe that providing a native Kafka protocol on Apache Pulsar will reduce the barriers for people adopting Pulsar to achieve their business success. By integrating two popular event streaming ecosystems, KoP unlocks new use cases. Customers can leverage advantages from each ecosystem and build a truly unified event streaming platform with Apache Pulsar to accelerate the development of real-time applications and services.
With KoP, a log collector can continue collecting log data from its sources and producing messages to Apache Pulsar using existing Kafka integrations. The downstream applications can use Pulsar Functions to process the events arriving in the system to do serverless event streaming.
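As an illustration, the downstream processing step in such a pipeline could be a Pulsar Function as small as the following sketch. The class here is hypothetical and duck-typed; in a real deployment it would implement the pulsar.Function interface and be submitted to the cluster with pulsar-admin:

```python
class ExtractErrors:
    """A toy Pulsar-Function-style processor: keep only ERROR log lines
    collected into Pulsar via a Kafka integration, drop everything else."""

    def process(self, input, context=None):
        # Returning None drops the message; returning a value forwards it
        # to the function's output topic.
        return input if "ERROR" in input else None
```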
Try it out
KoP is open-sourced under Apache License V2 at https://github.com/streamnative/kop. It is available as part of the StreamNative Platform. You can download the StreamNative Platform to try out all the features of KoP. If you already have a Pulsar cluster running and would like to enable Kafka protocol support on it, you can follow the instructions to install the KoP protocol handler on your existing Pulsar cluster.
StreamNative and OVHcloud are also hosting a webinar about KoP on March 31. If you are interested in learning more details about KoP, please sign up. Looking forward to meeting you online.
The KoP project was originally initiated by StreamNative, and the OVHcloud team joined to collaborate on its development. Many thanks to Pierre Zemb and Steven Le Roux from OVHcloud for their contributions to this project!
Have something to say about this article? Share it with us on Twitter or contact us.