Apache Kafka is an awesome way to stream data between applications. It’s often used to connect up components in a distributed system, so it’s especially useful if you’re using microservices. But it does have some rather confusing terminology. So what is the difference between Kafka, Kafka Streams and Kafka Connect? Let’s find out.
Apache Kafka is a set of tools designed for event streaming.
People tend to get pretty excited about Kafka, because it’s fault-tolerant (this means that Kafka will keep running even when some of its components fail), and it can be run at huge scale. 📏⛰
Kafka, Kafka Streams and Kafka Connect are all components in the Kafka project. These three components seem similar, but there are some key things that set them apart.
But first, for all you busy TL;DR types, here’s the executive summary:
Apache Kafka is a back-end application that provides a way to share streams of events between applications.
An application publishes a stream of events or messages to a topic on a Kafka broker. The stream can then be consumed independently by other applications, and messages in the topic can even be replayed if needed.
Kafka Streams is an API for writing client applications that transform data in Apache Kafka. You usually do this by publishing the transformed data onto a new topic. The data processing itself happens within your client application, not on a Kafka broker.
Kafka Connect is an API for moving data into and out of Kafka. It provides a pluggable way to integrate other applications with Kafka, by letting you use and share connectors to move data to or from popular applications, like databases.
P.S. Want the no-tech beginner’s guide to Kafka? Check out the fantastic Gently Down The Stream, which is an illustrated guide to Kafka by Mitch Seymour.
At first glance, Apache Kafka just seems like a message broker, which can ship a message from a producer to a consumer. But actually Kafka is a lot more than that.
Kafka is primarily a distributed event log. It’s usually run as a cluster of several brokers, which are joined together.
If you’re just getting started with Apache Kafka, you will probably want to learn the basics first. It’s worth investing in a course, as Kafka can become very complicated, very quickly!
This Kafka beginners’ course from Stephane Maarek is well-paced, and explains all of the complex terminology. It contains more than 7 hours of video lectures, sprinkled with examples that help you to understand the technical concepts.
And, most importantly, it includes a real-world project that you can follow, so you can start using Kafka for real.
Now let’s learn a bit more about Apache Kafka.
How does Apache Kafka work?
With Kafka, messages are published onto topics. These topics are like never-ending log files. Producers put their messages onto a topic. Consumers drop in at any time, receive messages from the topic, and can even rewind and replay old messages.
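If it helps to see the idea in code, here's a toy sketch in plain Java of a topic as an append-only log with a rewindable consumer offset. (The `ToyTopic` and `ToyConsumer` classes are made up for illustration; they are not Kafka's actual API.)

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only "topic": producers append, consumers track their own
// position (offset), and nothing is deleted on read.
class ToyTopic {
    private final List<String> log = new ArrayList<>();

    void publish(String message) { log.add(message); }

    String read(int offset) { return log.get(offset); }

    int endOffset() { return log.size(); }
}

class ToyConsumer {
    private int offset = 0;

    // Returns the next message, or null if we've reached the end of the log.
    String poll(ToyTopic topic) {
        return offset < topic.endOffset() ? topic.read(offset++) : null;
    }

    // Rewinding is just moving the offset back; the messages are still there.
    void seek(int newOffset) { offset = newOffset; }
}

public class LogDemo {
    public static void main(String[] args) {
        ToyTopic clicks = new ToyTopic();
        clicks.publish("page-1");
        clicks.publish("page-2");

        ToyConsumer consumer = new ToyConsumer();
        System.out.println(consumer.poll(clicks)); // page-1
        System.out.println(consumer.poll(clicks)); // page-2

        consumer.seek(0);                          // replay from the start
        System.out.println(consumer.poll(clicks)); // page-1 again
    }
}
```

Notice that reading never removes anything: a second consumer could poll the same topic from offset 0 and see every message, which is exactly what sets Kafka apart from a classic queue.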
Messages are only deleted from a topic when you want them to be deleted. You can even run a Kafka broker that keeps every message ever (set log retention to “forever”), and Kafka will never delete anything.
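Retention is just configuration. For example (the 7-day value is Kafka's broker default, and a value of -1 means "keep forever"):

```properties
# Broker-wide default: keep messages for 7 days
log.retention.hours=168

# Per-topic override: keep this topic's messages forever
# (set at topic creation, or later with the kafka-configs tool)
retention.ms=-1
```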
One of the best things about Kafka is that it can replicate messages between brokers in the cluster. So consumers can keep receiving messages, even if a broker crashes. Life-changing!
This makes Kafka very capable of handling all sorts of scenarios, from simple point-to-point messaging, to stock price feeds, to processing massive streams of website clicks, and even using Kafka like a database (yes, some people are doing that).
If you’re learning Kafka and you want a Kafka cluster to play around with, try the free “Developer Duck” plan from Cloudkarafka.
Kafka vs traditional message brokers
One of the major things that sets Kafka apart from “traditional” message brokers like RabbitMQ or ActiveMQ, is that a topic in Kafka doesn’t know or care about its consumers. It’s simply a log that consumers can dive into and access data from, at any time.
On the other hand, with a traditional message broker, messages are typically delivered to consumers only once and then removed. And with traditional topics, consumers who subscribe can only receive messages from that point forward; they can't rewind.
Why would you use it?
Kafka is massively scalable. Think of the biggest thing you can think of, and then double it. Kafka can handle that.
People talk about Kafka being scalable because it can handle a very large number of messages and consumers, due to the way that it spreads the load across a cluster of brokers. It spreads your messages across these brokers by splitting each topic into chunks known as partitions.
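To illustrate the idea, here's a tiny sketch of key-based partition assignment. (Kafka's real default partitioner uses murmur2 hashing; this sketch uses Java's `hashCode()` purely to show the principle that the same key always maps to the same partition, which is how Kafka preserves per-key ordering.)

```java
public class PartitionDemo {
    // Assign a record key to one of N partitions. This mimics the idea of
    // Kafka's default partitioner, but NOT its actual hash function:
    // Kafka uses murmur2, while this illustration uses String.hashCode().
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is always non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always lands on the same partition.
        int p1 = partitionFor("user-42", 6);
        int p2 = partitionFor("user-42", 6);
        System.out.println(p1 == p2); // prints "true"
    }
}
```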
Kafka is pretty damn performant. Even a fairly small Kafka cluster can support very high throughput of messages.
When would you use it?
Kafka is suited very well to these types of use cases:
Collecting metrics. Instead of writing metrics to a log file or database, you can write data to a Kafka “topic”, which other consumers might also be interested in reading.
Collecting high-volume events (e.g. website clicks)
Sharing database change events (called “change data capture”). This method is sometimes used for sharing data between microservices.
Last-value queues – where you might publish a bunch of information to a topic, but you want people to be able to access the “last value” quickly, e.g. stock prices.
Simple messaging (similar to RabbitMQ or ActiveMQ). Kafka can do simple messaging too.
What are the alternatives?
Amazon Web Services have their own data streaming product called Kinesis, which offers a similar partitioned, replayable stream model to Kafka.
If you only need traditional point-to-point messaging, you could use a classic message broker like RabbitMQ or ActiveMQ, or AWS Simple Queue Service (SQS).
How do you use it?
You can run your own cluster of Kafka brokers, or pay for a managed Kafka service from one of the cloud providers. Your applications can then connect to the cluster of brokers, to publish and consume events.
Kafka also includes a Producer and Consumer API, so that you can send and receive messages from your applications. But these APIs have a few limitations, as Stephane Maarek writes.
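For a flavour of the Producer API, here's a hedged sketch. It assumes the kafka-clients library on the classpath, a broker at localhost:9092, and a topic called `website-clicks` (all assumptions for illustration, not from the article):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal producer sketch. Needs the kafka-clients library and a
// running broker; the broker address and topic name are made up.
public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using "user-42" as the key keeps all of this user's events
            // on the same partition, preserving their order.
            producer.send(new ProducerRecord<>("website-clicks", "user-42", "/pricing"));
        }
    }
}
```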
So Kafka is often paired with Kafka Streams and Kafka Connect, which are simpler APIs that make it easier to do the two really common things that people want to do with Kafka: process data in Kafka topics, and connect Kafka to external systems.
So let’s check out each of these projects in turn.
If you’re interested in running Apache Kafka on Kubernetes, then make sure you take a look at the Strimzi open source project. Or, if you’re using OpenShift, then Red Hat offers its own version of Kafka, which is called the Red Hat AMQ Streams component.
Kafka Streams is another project from the Apache Kafka community. It’s basically a Java API for processing and transforming data inside Kafka topics.
Kafka Streams, or the Streams API, makes it easier to transform or filter data from one Kafka topic and publish it to another Kafka topic. You can also use Streams to send events to external systems if you wish. (But for that, you might find it easier to use Kafka Connect, which we’ll look at shortly.)
You can think of Kafka Streams as a Java-based toolkit that lets you change and modify messages in Kafka in real time, before the messages reach your external consumers.
If you’re looking to get started with Streams, you should grab a copy of Mastering Kafka Streams and ksqlDB by Mitch Seymour. It covers many aspects of data processing with Kafka Streams, including the lower-level Processor API.
Let’s dive a bit more into Kafka Streams.
How do you use it?
To use Kafka Streams, you first import it into your Java application as a library (JAR file). The library gives you the Kafka Streams Java API.
With the API, you can write code to process or transform individual messages, one-by-one, and then publish those modified messages to a new Kafka topic, or to an external system.
With Kafka Streams, all your stream processing takes place inside your app, not on the brokers. You can even run multiple instances of your Kafka Streams-based application if you’ve got a firehose of messages and you need to handle high volumes.
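As a sketch of what that looks like, here's a minimal Streams topology that upper-cases each message from one topic and publishes the result to another. The topic names, application ID and broker address are all assumptions, and it needs the kafka-streams library and a running broker:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// Sketch only: reads from "raw-events", transforms each value, and
// publishes the result to "clean-events".
public class UppercaseApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-events");
        raw.mapValues(value -> value.toUpperCase())   // transform one-by-one
           .to("clean-events");                       // publish to a new topic

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                  Serdes.String().getClass());

        // The processing happens here, inside your application,
        // not on the brokers.
        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because the application ID acts as a consumer group, running a second instance of this same app would automatically split the input partitions between the two instances.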
Are there any alternatives?
Kafka Streams isn’t the only way to process data within Kafka. You could also use another open source project like Apache Samza or Apache Storm. But Kafka Streams allows you to do your stream processing using Kafka-specific tools.
And it’s pretty popular.
I love @kafkastreams, however I love burning shit more. https://t.co/6Em09O9Qwi
And so what if you want to bring data in or out of Kafka from other systems? Then you might want to look at Kafka Connect.
The final tool in this rundown of the Kafka projects is Kafka Connect.
Kafka Connect is a tool for connecting different input and output systems to Kafka. Think of it like an engine that can run a number of different components, which can stream Kafka messages into databases, Lambda functions, S3 buckets, or apps like Elasticsearch or Snowflake.
So it makes it much easier to connect Kafka to the other systems in your big ball of mud architecture, without having to write all the glue code yourself. (And let’s be honest, it’s often much better to use someone else’s tried and tested code than write your own.)
How do you use it?
To use Kafka Connect, you download the Connect distribution, set the configuration files how you want them, and then start a Kafka Connect instance.
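In standalone mode, that boils down to something like this (the worker properties file ships with the Kafka distribution; the connector config file name is a hypothetical example):

```shell
# Start a standalone Connect worker with one connector
bin/connect-standalone.sh config/connect-standalone.properties \
                          config/my-connector.properties
```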
To use additional connectors, you can find them on places like Confluent Hub or community projects on GitHub. Then you unzip the download into your target environment, and tell Kafka Connect where to look for connectors.
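A connector's configuration is just a small properties (or JSON) file. As a minimal example, the FileStream source connector that ships with Kafka could be configured like this (the file path and topic name are made up):

```properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/app.log
topic=log-lines
```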
You can also run Kafka Connect in containers. Some projects that use Kafka Connect offer their own pre-built Docker images; Debezium, for example, has a ready-made Connect image that you can pull and run.
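As a sketch, running Debezium's image looks something like this (the storage topic names and the broker address are example values, and it assumes a reachable Kafka cluster):

```shell
# Run Kafka Connect using Debezium's pre-built image
docker run -it --rm --name connect -p 8083:8083 \
  -e BOOTSTRAP_SERVERS=kafka:9092 \
  -e GROUP_ID=1 \
  -e CONFIG_STORAGE_TOPIC=connect_configs \
  -e OFFSET_STORAGE_TOPIC=connect_offsets \
  -e STATUS_STORAGE_TOPIC=connect_statuses \
  debezium/connect:latest
```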
The idea of Kafka Connect is to minimise the amount of code you need to write to get data flowing between Kafka and your other systems.
What are the alternatives?
You don’t have to use Kafka Connect to integrate Kafka with your other apps and databases. You can write your own code using the Producer and Consumer API, or use the Streams API.
Or you could even use an integration framework that supports Kafka, like Apache Camel or Spring Integration.
Some integration frameworks, such as Apache Camel, also have support for Kafka Connect. This lets you integrate with Kafka using a toolkit you might already be familiar with.
If you want to know more about Apache Kafka, Streams and Connect, the official Apache Kafka documentation and Confluent’s developer tutorials are good places to go next.