Prerequisites

You have completed authorization for your Alibaba Cloud account. For more information, see Role authorization.

Background information

In practical applications, you often need to consume Kafka data. In EMR, you can run a Spark
Streaming job to consume Kafka data.

Step 1: Create a Hadoop cluster and Kafka cluster

We recommend that you specify the same security group for the Hadoop cluster and the
Kafka cluster when you create them. If the clusters belong to different security
groups, they cannot access each other by default. In that case, you must modify the
settings of both security groups to allow mutual access.

Step 2: Upload the JAR file of the Spark Streaming job to the Hadoop cluster

On the Cluster Management tab, click the ID of the target Hadoop cluster to go to the cluster details page.

In the left-side navigation pane, select Instances and view the IP address of the emr-header-1 instance in the Hadoop cluster.

Log on to the emr-header-1 instance by using SSH.

Upload the JAR file to a directory of the emr-header-1 instance.

Note In this example, the JAR file is uploaded to the /home/hadoop directory. After you upload
the JAR file, we recommend that you keep the logon window open for later use.
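The logon and upload steps above can be sketched as follows. The JAR file name examples-1.0.jar is a hypothetical placeholder, and the IP address placeholder must be replaced with the address found on the Instances page.

```shell
# Log on to the emr-header-1 instance of the Hadoop cluster over SSH.
# Replace <ip-of-emr-header-1> with the IP address shown on the Instances page.
ssh root@<ip-of-emr-header-1>

# From your local machine, upload the job JAR file to the /home/hadoop directory.
# examples-1.0.jar is a hypothetical name for the Spark Streaming job JAR file.
scp examples-1.0.jar root@<ip-of-emr-header-1>:/home/hadoop/
```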

Step 3: Create a topic on the Kafka cluster

You can create a topic in the EMR console. For more information, see Manage Kafka metadata. You can also log on to the emr-header-1 instance of the Kafka cluster and create a topic by using the CLI. In this example, a topic
named test is created with 10 partitions and 2 replicas.
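Creating the topic from the CLI can be sketched as follows. The ZooKeeper address emr-header-1:2181 is an assumption; depending on your Kafka version, you may need to use --bootstrap-server with a broker address instead of --zookeeper.

```shell
# Create a topic named test with 10 partitions and 2 replicas.
# emr-header-1:2181 is an assumed ZooKeeper address for the Kafka cluster.
kafka-topics.sh --create \
  --zookeeper emr-header-1:2181 \
  --topic test \
  --partitions 10 \
  --replication-factor 2

# Verify that the topic was created.
kafka-topics.sh --list --zookeeper emr-header-1:2181
```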

Step 4: Run the Spark Streaming job

Submit the Spark Streaming job on the Hadoop cluster. The parameters that follow the
name of the JAR file in the submit command are described as follows:

192.168.xxx.xxx: indicates the internal or public IP address of a broker in the Kafka cluster. Figure 1 shows an example.

test: indicates the name of the topic.

5: indicates the time interval, in seconds.

Figure 1. List of components in the Kafka cluster
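A submit command that matches the parameter list above might look like the following sketch. The main class name and the JAR file name examples-1.0.jar are hypothetical placeholders; 192.168.xxx.xxx stands for the broker IP address described above, and a broker port such as 9092 may also be required.

```shell
# Submit the Spark Streaming job on the emr-header-1 instance of the Hadoop cluster.
# The class name and JAR file name are hypothetical; replace them with your own.
# The arguments after the JAR file are: broker IP address, topic name, time interval.
spark-submit --class com.example.SparkKafkaConsumer \
  /home/hadoop/examples-1.0.jar \
  192.168.xxx.xxx \
  test \
  5
```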

Step 5: Use Kafka to publish messages

When you perform this step, ensure that the Spark Streaming job is running. Start a
Kafka producer and enter words into a shell on a client instance of the Kafka cluster.
The word count is then displayed in a shell on a client instance of the Hadoop cluster
and is updated in real time as you enter more words.
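Starting the producer can be sketched as follows. The broker address 192.168.xxx.xxx:9092 is an assumption; newer Kafka versions use --bootstrap-server instead of --broker-list.

```shell
# On a client instance of the Kafka cluster, start a console producer for the
# test topic. 192.168.xxx.xxx:9092 is an assumed broker address and port.
kafka-console-producer.sh --broker-list 192.168.xxx.xxx:9092 --topic test

# Each line you type is published to the test topic as a message. The running
# Spark Streaming job on the Hadoop cluster prints the updated word count.
```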