Kafka Series: 3. Creating 3 Node Kafka cluster on Virtual Box (2022)

Read more to learn how to deploy Kafka cluster on Ubuntu 18.04 using Virtual Box.

Kafka Series: 3. Creating 3 Node Kafka cluster on Virtual Box (1)Source: https://kafka.apache.org, https://design.ubuntu.com, and www.virtualbox.org

Our previous blogs detailed the steps to install Ubuntu 18.04 LTS Server on VirtualBox and a tutorial that helped set up Ubuntu servers using Virtualbox 6 on a Windows 10 operating system.

What is Kafka?

Kafka is a distributed messaging system that provides fast, highly scalable, and redundant messaging through a public subscribe distributed messaging system. Kafka was developed by LinkedIn and open-sourced in 2011. Kafka is written in Scala.

Kafka is a “public subscribe distributed messaging system” rather than a “queue system” since the message is received from the producer and broadcasted to a group of consumers rather than a single consumer.

A remarkable feature of Kafka is that it is highly available, immune from node failures, and has automatic recovery. All the above makes Apache Kafka an ideal tool for communication and integration between different parts of a large-scale data system in real-world data systems.

Kafka was meant to operate in a cluster, and we should be creating a cluster if you’re using Kafka.

The architecture of Kafka:

Kafka Series: 3. Creating 3 Node Kafka cluster on Virtual Box (2)Kafka Architecture Diagram

Topic:

Messages are published to a ‘Topic,’ and a partition is associated with each ‘Topic’. Kafka stores and organizes messages across its system, and essentially, a collection of messages are called Topics.

Brokers:

Broker in Kafka holds the messages that have been written by the producer before being consumed by the ‘consumer’.

Kafka cluster contains multiple brokers. A broker has a partition, and as already communicated, each partition is associated with a topic. The brokers receive the messages, and they are stored in the “brokers” for ’n’ number of days (which can be configured). After the ’n’ of days has expired, the messages are discarded. It is important to state here again that Kafka does not check whether each consumer or consumer group has read the messages.

Producer:

Different producers like Apps, DBMS, NoSQL write data to the Kafka cluster and publish messages to a Kafka topic.

Consumer:

A consumer or consumer group is/are subscribed to different topics, and in turn, they read from the partition for the topics to which they are subscribed. After the “producers” produced the message and sent it to the Kafka brokers, the consumers then read it. If a broker goes down, other brokers support the system and make sure everything runs smoothly.

ZooKeeper:

The Zookeeper’s primary responsibility is to coordinate with the different components of the Kafka cluster. The producer has the job to give the message to the broker leader/active controller, which in turn writes the message onto itself and replicates it to other brokers.

Zookeeper helps maintain consensus in the cluster, which means all the brokers know each other and know which one is the controller.

Requirements:

Please go through the installation of the Ubuntu server 18.04 LTS on Virtual Box as per the requirement for creating the multi-node Kafka cluster. Links:

  1. Steps to install Ubuntu 18.04 LTS server on VirtualBox.

  2. https://clairvoyant.atlassian.net/wiki/spaces/~898085804/blog/2020/05/09/1205698932/Setting+Up+Static+IP+and+internet+access+for+Ubuntu+Server+18.04+on+Virtualbox+6

Softwares:

1. VirtualBox

You can download the VirtualBox from the below link as per your operating system.

https://www.virtualbox.org/wiki/Downloads

2. Ubuntu 18.04 LTS ISO image.

Here is the link to download the image. https://releases.ubuntu.com/18.04.4/ubuntu-18.04.4-live-server-amd64.iso

3. Kafka packages 2.4.1

https://kafka.apache.org/downloads

Pre-requisites:

1. Install open JDK on VM’s

Use the following command to install the latest Java Developer Kit (JDK):

 sudo apt install -y default-jdk

Use the following command to verify that Java has been installed:

 java -version

2. Disable Swap usage on VM’s

Use the following command to disable RAM swap:

 swapoff -a

Use the following command to comment out swap in the /etc/fstab file:

 sudo sed -i '/ swap / s/^/#/' /etc/fstab 

Now, lets set up the Kafka Cluster:

A. Download the Kafka Packages

Download the Kafka package:

 wget https://www.apache.org/dyn/closer.cgi? path=/kafka/2.4.1/kafka_2.13-2.4.1.tgz

Extract the tar file and move it into the /tmp directory:

tar -xvf kafka_2.13-2.4.1.tgzsudo mv kafka_2.13-2.4.1 /tmp/kafkacd /tmp/kafka

B. Create a data directory to store Kafka messages and Zookeeper data.

Step- 1. Create a New Directory for Kafka and Zookeeper

sudo mkdir -p /kafkasudo mkdir -p /zookeeper 

Step- 2. Change ownership of those directories now:

sudo chown -R ubuntu:ubuntu /kafkasudo chown -R ubuntu:ubuntu /zookeeper 

C. Create a Zookeeper ID on each VM.

Create a file in /zookeeper first VM named "myid" with the ID as "1":echo "1" > /zookeeper/myid
Create a file in /zookeeper Second VM named "myid" with the ID as"2":echo "2" > /zookeeper/myid
Create a file in /zookeeper Third VM named "myid" with the ID as"3":echo "3" > /zookeeper/myid

D. Modify the Kafka and Zookeeper Configuration Files on all VM’s

Step- 1. Open the server.properties file:

cd /tmp/kafkavi config/server.properties

Step- 2. Update broker.id and advertised.listners into server.properties configuration as shown below:

Note: Add the below configuration on all VM’s. Run each command in the parallel console.

# change this for each brokerbroker.id=[broker_number]# change this to the hostname of each brokeradvertised.listeners=PLAINTEXT://[hostname]:9092# The ability to delete topicsdelete.topic.enable=true# Where logs are storedlog.dirs=/kafka# default number of partitionsnum.partitions=8# default replica count based on the number of brokersdefault.replication.factor=3# to protect yourself against broker failuremin.insync.replicas=2# logs will be deleted after how many hourslog.retention.hours=168# size of the log files log.segment.bytes=1073741824# check to see if any data needs to be deletedlog.retention.check.interval.ms=300000# location of all zookeeper instances and kafka directoryzookeeper.connect=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/kafka# timeout for connecting with zookeeperzookeeper.connection.timeout.ms=6000

Step- 3. open the zookeeper.properties file on all VM’s.

 vi config/zookeeper.properties

Step- 4. Paste the following configuration into the zookeeper.properties :

# the directory where the snapshot is stored.dataDir=/zookeeper# the port at which the clients will connectclientPort=2181# setting number of connections to unlimitedmaxClientCnxns=0# keeps a heartbeat of zookeeper in millisecondstickTime=2000# time for initial synchronizationinitLimit=10# how many ticks can pass before timeoutsyncLimit=5# define servers ip and internal ports to zookeeperserver.1=zookeeper1:2888:3888server.2=zookeeper2:2888:3888server.3=zookeeper3:2888:3888

E. Create the init.d scripts to start and stop for Kafka and Zookeeper service

Kafka:

Step- 1. Open file /etc/init.d/kafka on each VM on virtual box and paste in the following:

sudo vim /etc/init.d/kafka
#!/bin/bash#/etc/init.d/kafkaKAFKA_PATH=/tmp/kafka/binSERVICE_NAME=kafka
PATH=$PATH:$KAFKA_PATH
case "$1" in start) # Start daemon. pid=`ps ax | grep -i 'kafka.Kafka' | grep -v grep | awk '{print $1}'` if [ -n "$pid" ] then echo "Kafka is already running" else echo "Starting $SERVICE_NAME" $KAFKA_PATH/kafka-server-start.sh -daemon /tmp/kafka /config/server.properties fi ;; stop) echo "Shutting down $SERVICE_NAME" $KAFKA_PATH/kafka-server-stop.sh ;; restart) $0 stop sleep 2 $0 start ;; status) pid=`ps ax | grep -i 'kafka.Kafka' | grep -v grep | awk '{print $1}'` if [ -n "$pid" ] then echo "Kafka is Running as PID: $pid" else echo "Kafka is not Running" fi ;; *) echo "Usage: $0 {start|stop|restart|status}" exit 1esac
exit 0

Save and close the file.

Step- 2. Make the file /etc/init.d/kafka executable. Also, change the ownership and start the service:

sudo chmod +x /etc/init.d/kafkasudo chown root:root /etc/init.d/kafkasudo update-rc.d kafka defaultssudo service kafka startsudo service kafka status

Zookeeper :

Step- 1. Open file /etc/init.d/zookeeper on all VM’s on virtual box and paste in the following:

sudo vim /etc/init.d/zookeeper#!/bin/bash#/etc/init.d/zookeeperKAFKA_PATH=/tmp/kafka/binSERVICE_NAME=zookeeper
PATH=$PATH:$KAFKA_PATH
case "$1" in start) # Start daemon. pid=`ps ax | grep -i 'org.apache.zookeeper' | grep -v grep |
awk '{print $1}'` if [ -n "$pid" ] then echo "Zookeeper is already running"; else echo "Starting $SERVICE_NAME"; $KAFKA_PATH/zookeeper-server-start.sh -daemon
/tmp/kafka/config/zookeeper.properties fi ;; stop) echo "Shutting down $SERVICE_NAME"; $KAFKA_PATH/zookeeper-server-stop.sh ;; restart) $0 stop sleep 2 $0 start ;; status) pid=`ps ax | grep -i 'org.apache.zookeeper' | grep -v grep |
awk '{print $1}'` if [ -n "$pid" ] then echo "Zookeeper is Running as PID: $pid" else echo "Zookeeper is not Running" fi ;; *) echo "Usage: $0 {start|stop|restart|status}" exit 1esac
exit 0

Save and close the file.

Step- 2. Make the file /etc/init.d/zookeeper executable. Also, change the ownership and start the service:

sudo chmod +x /etc/init.d/zookeepersudo chown root:root /etc/init.d/zookeepersudo update-rc.d zookeeper defaultssudo service zookeeper startsudo service zookeeper status 

F. Create a Topic

Create a topic named test:

 ./bin/kafka-topics.sh --zookeeper zookeeper1:2181/kafka --create --topic test --replication-factor 1 --partitions 3

Describe the topic:

 ./bin/kafka-topics.sh --zookeeper zookeeper1:2181/kafka --topic test --describe

For the best data engineering solutions for your business, reach out to us at Clairvoyant.

Top Articles

Latest Posts

Article information

Author: Rueben Jacobs

Last Updated: 01/20/2023

Views: 6280

Rating: 4.7 / 5 (77 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Rueben Jacobs

Birthday: 1999-03-14

Address: 951 Caterina Walk, Schambergerside, CA 67667-0896

Phone: +6881806848632

Job: Internal Education Planner

Hobby: Candle making, Cabaret, Poi, Gambling, Rock climbing, Wood carving, Computer programming

Introduction: My name is Rueben Jacobs, I am a cooperative, beautiful, kind, comfortable, glamorous, open, magnificent person who loves writing and wants to share my knowledge and understanding with you.