I have created a Kafka Topic and pushed a message to it.
So
bin/kafka-console-consumer --bootstrap-server abc.xyz.com:9092 --topic myTopic --from-beginning --property print.key=true --property key.separator="-"
prints
key1-customer1
on the command line.
I want to create a Kafka Streams application from this topic and print this key1-customer1 to the console.
I wrote the following for it:
final Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "app-id");
streamsConfiguration.put(StreamsConfig.CLIENT_ID_CONFIG, "client-id");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "abc.xyz.com:9092");
streamsConfiguration.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
// Records should be flushed every 10 seconds. This is less than the default
// in order to keep this example interactive.
streamsConfiguration.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 10 * 1000);
// For illustrative purposes we disable record caches
streamsConfiguration.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, String> customerStream = builder.stream("myTopic");
customerStream.foreach(new ForeachAction<String, String>() {
    @Override
    public void apply(String key, String value) {
        System.out.println(key + ": " + value);
    }
});
final KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
This does not fail, but it also does not print anything to the console, even though this answer suggests it should.
I am new to Kafka. So any suggestions to make this work would help me a lot.
TL;DR Use Printed.
import org.apache.kafka.streams.kstream.Printed
val sysout = Printed
.toSysOut[String, String]
.withLabel("customerStream")
customerStream.print(sysout)
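That snippet is Scala; since the question's code is in Java, the equivalent there (a sketch using the same Printed API, available since Kafka 1.0) would be:
import org.apache.kafka.streams.kstream.Printed;

customerStream.print(Printed.<String, String>toSysOut().withLabel("customerStream"));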
I would try unsetting CLIENT_ID_CONFIG and leaving only APPLICATION_ID_CONFIG. Kafka Streams uses the application ID to set the client ID.
I would also verify the offsets for the consumer group ID that your Kafka Streams application is using (this consumer group ID is also based on your application ID). Use the kafka-consumer-groups.sh tool. It could be that your Streams application is already past all the records you've produced to that topic, possibly because auto offset reset is set to latest, or possibly for some other reason not easily discernible from your question.
I am new to Kafka and I have a question that I'm not able to resolve.
I have installed Kafka and ZooKeeper on my own computer on Windows (not Linux) and I have created a broker with a topic with several partitions (I've tried between 6 and 12 partitions).
When I create consumers, they work perfectly and read at a good speed. As for the producer, I have created the simple producer one can see on many websites. The producer is inside a loop and sends many short messages (about 2000 very short messages).
I can see that the consumers read the 2000 messages very quickly, but the producer sends messages to the broker at roughly 140 or 150 messages per second. As I said before, I'm working on my own laptop (only 1 disk), but when I read about millions of messages per second, I think I must be forgetting something, because I'm light-years away from that.
If I use more producers, the result is worse.
Is it a matter of running more brokers on the same node, or something like that? This problem was given to me at my job, and I don't have the option of a better computer.
The code for creating the producer is
public class Producer {
    public void publica(String topic, String strKey, String strValue) {
        Properties configProperties = new Properties();
        configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(configProperties);
        ProducerRecord<String, String> rec = new ProducerRecord<String, String>(topic, strValue);
        producer.send(rec);
    }
}
and the code for sending messages is (partial):
Producer prod = new Producer();
for (int i = 0; i < 2000; i++) {
    key = String.valueOf(i);
    prod.publica("TopicName", key, texto + " - " + key);
    // System.out.println(i + " - " + System.currentTimeMillis());
}
You may create your Kafka producer once and use it every time you need to send a message:
public class Producer {
    private final KafkaProducer<String, String> producer; // initialize in constructor

    public void publica(String topic, String strKey, String strValue) {
        ProducerRecord<String, String> rec = new ProducerRecord<String, String>(topic, strValue);
        producer.send(rec);
    }
}
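A fleshed-out version of that idea might look like the sketch below; the no-arg constructor, the close() method, the String key serializer, and actually sending strKey are illustrative additions, not code from the answer:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class Producer {
    private final KafkaProducer<String, String> producer;

    public Producer() {
        Properties configProperties = new Properties();
        configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Created once; the producer is thread-safe and batches records internally.
        producer = new KafkaProducer<String, String>(configProperties);
    }

    public void publica(String topic, String strKey, String strValue) {
        // Also sends the key (the original code ignored strKey).
        ProducerRecord<String, String> rec = new ProducerRecord<String, String>(topic, strKey, strValue);
        producer.send(rec);
    }

    // Call once when you are done sending; flushes any buffered records.
    public void close() {
        producer.close();
    }
}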
Also take a look at the producer and broker configurations available here. There are several options you can tune to fit your application's needs.
I am creating an app in Flink to:
Read Messages from a topic
Do some simple process on it
Write Result to a different topic
My code does work; however, it does not run in parallel.
How do I do that?
It seems my code runs on only one thread/block.
On the Flink Web Dashboard:
App goes to running status
But, there is only one block shown in the overview subtasks
And Bytes Received / Sent and Records Received / Sent are always zero (no updates)
Here is my code. Please assist me in learning how to split my app so it is able to run in parallel, and tell me whether I am writing the app correctly:
public class SimpleApp {
    public static void main(String[] args) throws Exception {
        // create execution environment INPUT
        StreamExecutionEnvironment env_in =
            StreamExecutionEnvironment.getExecutionEnvironment();

        // event time characteristic
        env_in.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // production Ready (Does NOT Work if greater than 1)
        env_in.setParallelism(Integer.parseInt(args[0].toString()));

        // configure kafka consumer
        Properties properties = new Properties();
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("auto.offset.reset", "earliest");

        // create a kafka consumer
        final DataStream<String> consumer = env_in
            .addSource(new FlinkKafkaConsumer09<>("test", new SimpleStringSchema(), properties));

        // filter data
        SingleOutputStreamOperator<String> result = consumer.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String s) throws Exception {
                return s.substring(0, 2).contentEquals("PS");
            }
        });

        // Process Data
        // Transform String Records to JSON Objects
        SingleOutputStreamOperator<JSONObject> data = result.map(new MapFunction<String, JSONObject>() {
            @Override
            public JSONObject map(String value) throws Exception {
                JSONObject jsnobj = new JSONObject();
                if (value.substring(0, 2).contentEquals("PS")) {
                    // 1. Raw Data
                    jsnobj.put("Raw_Data", value.substring(0, value.length() - 6));

                    // 2. Comment
                    int first_index_comment = value.indexOf("$");
                    int last_index_comment = value.lastIndexOf("$") + 1;
                    // - set comment
                    String comment = value.substring(first_index_comment, last_index_comment);
                    comment = comment.substring(0, comment.length() - 6);
                    jsnobj.put("Comment", comment);
                } else {
                    jsnobj.put("INVALID", value);
                }
                return jsnobj;
            }
        });

        // Write JSON to Kafka Topic
        data.addSink(new FlinkKafkaProducer09<JSONObject>("localhost:9092",
                "FilteredData",
                new SimpleJsonSchema()));

        env_in.execute();
    }
}
My code does work, but it seems to run on only a single thread (one block shown in the web interface), with no data passing through, hence the bytes/records sent and received are never updated.
How do I make it run in parallel?
To run your job in parallel you can do two things:
Increase the parallelism of your job at the env level - i.e. do something like
StreamExecutionEnvironment env_in =
    StreamExecutionEnvironment.getExecutionEnvironment();
env_in.setParallelism(4);
But this only increases parallelism on the Flink side after the data is read, so if the source is producing data faster, the parallelism might not be fully utilized.
To fully parallelize your job, set up multiple partitions for your Kafka topic, ideally matching the parallelism you want for your Flink job. So you might want to do something like the following when creating your Kafka topic:
bin/kafka-topics.sh --create --zookeeper localhost:2181
--replication-factor 3 --partitions 4 --topic test
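As an aside (an addition, not from the original answer), Flink also lets you set the parallelism of an individual operator, which overrides the environment default for that stage only; a sketch against the question's code, where the value 4 is arbitrary:
// Give just the filter operator 4 parallel subtasks.
SingleOutputStreamOperator<String> result = consumer
    .filter(new FilterFunction<String>() {
        @Override
        public boolean filter(String s) throws Exception {
            return s.startsWith("PS");
        }
    })
    .setParallelism(4);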
I'm stuck on this problem and I can't figure out what's going on. I am trying to use Kafka Streams to write a log to a topic. On the other end, I have Kafka Connect writing each entry into MySQL. So, basically, what I need is a Kafka Streams program which reads a topic as strings, parses each record into Avro format, and then writes it into a different topic.
Here's the code I wrote:
//Define schema
String userSchema = "{"
+ "\"type\":\"record\","
+ "\"name\":\"myrecord\","
+ "\"fields\":["
+ " { \"name\":\"ID\", \"type\":\"int\" },"
+ " { \"name\":\"COL_NAME_1\", \"type\":\"string\" },"
+ " { \"name\":\"COL_NAME_2\", \"type\":\"string\" }"
+ "]}";
String key = "key1";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);
//Settings
System.out.println("Kafka Streams Demonstration");
//Settings
Properties settings = new Properties();
// Set a few key parameters
settings.put(StreamsConfig.APPLICATION_ID_CONFIG, APP_ID);
// Kafka bootstrap server (broker to talk to); ubuntu is the host name for my VM running Kafka, port 9092 is where the (single) broker listens
settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Apache ZooKeeper instance keeping watch over the Kafka cluster; ubuntu is the host name for my VM running Kafka, port 2181 is where the ZooKeeper listens
settings.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
// default serdes for serialzing and deserializing key and value from and to streams in case no specific Serde is specified
settings.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.STATE_DIR_CONFIG ,"/tmp");
// to work around exception Exception in thread "StreamThread-1" java.lang.IllegalArgumentException: Invalid timestamp -1
// at org.apache.kafka.clients.producer.ProducerRecord.<init>(ProducerRecord.java:60)
// see: https://groups.google.com/forum/#!topic/confluent-platform/5oT0GRztPBo
// Create an instance of StreamsConfig from the Properties instance
StreamsConfig config = new StreamsConfig(settings);
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
final Serde<byte[]> byteArraySerde = Serdes.ByteArray();
// building Kafka Streams Model
KStreamBuilder kStreamBuilder = new KStreamBuilder();
// the source of the streaming analysis is the topic with country messages
KStream<byte[], String> instream =
kStreamBuilder.stream(byteArraySerde, stringSerde, "sqlin");
final KStream<byte[], GenericRecord> outstream = instream.mapValues(new ValueMapper<String, GenericRecord>() {
    @Override
    public GenericRecord apply(final String record) {
        System.out.println(record);
        GenericRecord avroRecord = new GenericData.Record(schema);
        String[] array = record.split(" ", -1);
        for (int i = 0; i < array.length; i = i + 1) {
            if (i == 0)
                avroRecord.put("ID", Integer.parseInt(array[0]));
            if (i == 1)
                avroRecord.put("COL_NAME_1", array[1]);
            if (i == 2)
                avroRecord.put("COL_NAME_2", array[2]);
        }
        System.out.println(avroRecord);
        return avroRecord;
    }
});
outstream.to("sqlout");
Here's the output, which ends in a NullPointerException:
java -cp streams-examples-3.2.1-standalone.jar io.confluent.examples.streams.sql
Kafka Streams Demonstration
Start
Now started CountriesStreams Example
5 this is
{"ID": 5, "COL_NAME_1": "this", "COL_NAME_2": "is"}
Exception in thread "StreamThread-1" java.lang.NullPointerException
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:81)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.kstream.internals.KStreamMapValues$KStreamMapProcessor.process(KStreamMapValues.java:42)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:48)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:134)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:83)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:70)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:197)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:627)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
The topic sqlin contains a few messages, each consisting of a digit followed by two words. Note the two print lines: the function gets one message and successfully parses it before hitting the null pointer exception. The problem is that I am new to Java, Kafka, and Avro, so I'm not sure where I'm going wrong. Did I set up the Avro schema right? Or am I using KStream wrong? Any help here would be greatly appreciated.
I think the problem is the following line:
outstream.to("sqlout");
Your application is configured to, by default, use a String serde for record keys and record values:
settings.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
settings.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
Since outstream has type KStream<byte[], GenericRecord>, you must provide explicit serdes when calling to():
// sth like
outstream.to(Serdes.ByteArray(), yourGenericAvroSerde, "sqlout");
FYI: The next version of Confluent Platform (ETA: this month = June 2017) will ship with a ready-to-use generic + specific Avro serde that integrates with Confluent Schema Registry. This should make your life easier.
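Once such a serde is available, the call might look roughly like the sketch below. The class name GenericAvroSerde, its package, and the registry URL are assumptions based on the announcement above, not code from the original answer:
import java.util.Collections;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.common.serialization.Serde;
// assumed package for the upcoming Confluent Avro serde:
import io.confluent.kafka.streams.serdes.avro.GenericAvroSerde;

final Serde<GenericRecord> valueAvroSerde = new GenericAvroSerde();
// "schema.registry.url" points the serde at the Confluent Schema Registry;
// `false` configures it for record values rather than keys.
valueAvroSerde.configure(
        Collections.singletonMap("schema.registry.url", "http://localhost:8081"),
        false);
outstream.to(Serdes.ByteArray(), valueAvroSerde, "sqlout");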
See my answer at https://stackoverflow.com/a/44433098/1743580 for further details.
I installed Kafka on DC/OS (Mesos) cluster on AWS. Enabled three brokers and created a topic called "topic1".
dcos kafka topic create topic1 --partitions 3 --replication 3
Then I wrote a Producer class to send messages and a Consumer class to receive them.
public class Producer {
    public static void sendMessage(String msg) throws InterruptedException, ExecutionException {
        Map<String, Object> producerConfig = new HashMap<>();
        System.out.println("setting ProducerConfig.");
        producerConfig.put("bootstrap.servers",
                "172.16.20.207:9946,172.16.20.234:9125,172.16.20.36:9636");
        ByteArraySerializer serializer = new ByteArraySerializer();
        System.out.println("Creating KafkaProducer");
        KafkaProducer<byte[], byte[]> kafkaProducer = new KafkaProducer<>(producerConfig, serializer, serializer);
        for (int i = 0; i < 100; i++) {
            String msgstr = msg + i;
            byte[] message = msgstr.getBytes();
            ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("topic1", message);
            System.out.println("Sent:" + msgstr);
            kafkaProducer.send(record);
        }
        kafkaProducer.close();
    }

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        sendMessage("Kafka test message 2/27 3:32");
    }
}
public class Consumer {
    public static String getMessage() {
        Map<String, Object> consumerConfig = new HashMap<>();
        consumerConfig.put("bootstrap.servers",
                "172.16.20.207:9946,172.16.20.234:9125,172.16.20.36:9636");
        consumerConfig.put("group.id", "dj-group");
        consumerConfig.put("enable.auto.commit", "true");
        consumerConfig.put("auto.offset.reset", "earliest");
        ByteArrayDeserializer deserializer = new ByteArrayDeserializer();
        KafkaConsumer<byte[], byte[]> kafkaConsumer = new KafkaConsumer<>(consumerConfig, deserializer, deserializer);
        kafkaConsumer.subscribe(Arrays.asList("topic1"));
        while (true) {
            ConsumerRecords<byte[], byte[]> records = kafkaConsumer.poll(100);
            System.out.println(records.count() + " records received.");
            for (ConsumerRecord<byte[], byte[]> record : records) {
                System.out.println(Arrays.toString(record.value()));
            }
        }
    }

    public static void main(String[] args) {
        getMessage();
    }
}
First I ran Producer on the cluster to send messages to topic1. However, when I ran Consumer, it couldn't receive anything; it just hung.
The producer is working, since I was able to get all the messages by running the shell script that came with the Kafka install:
./bin/kafka-console-consumer.sh --zookeeper master.mesos:2181/dcos-service-kafka --topic topic1 --from-beginning
But why can't I receive anything with Consumer? This post suggests that a group.id with an old offset might be a possible cause. I only set group.id in the consumer, not the producer. How do I configure the offset for this group?
As it turns out, kafkaConsumer.subscribe(Arrays.asList("topic1")); is causing poll() to hang. According to Kafka Consumer does not receive messages, there are two ways to connect to a topic: assign and subscribe. After I replaced subscribe with the lines below, it started working.
TopicPartition tp = new TopicPartition("topic1", 0);
List<TopicPartition> tps = Arrays.asList(tp);
kafkaConsumer.assign(tps);
However, the output shows arrays of numbers, which is not expected (the producer sent Strings). But I guess this is a separate issue.
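For that separate issue: the consumer is configured with a ByteArrayDeserializer, so record.value() is a raw byte[]. Decoding it (or switching to a StringDeserializer) recovers the text; a minimal sketch:
// record.value() is a byte[]; the producer used msgstr.getBytes(),
// so decoding with the default charset recovers the original String.
System.out.println(new String(record.value()));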
Make sure you gracefully shut down your consumer:
consumer.close()
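For the poll loop in the question, one common shutdown sketch (an addition, not from the original answer; KafkaConsumer is not thread-safe, and wakeup() is its only thread-safe method) looks like this:
import org.apache.kafka.common.errors.WakeupException;

final Thread mainThread = Thread.currentThread();
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    kafkaConsumer.wakeup(); // thread-safe; makes a blocking poll() throw WakeupException
    try {
        mainThread.join(); // wait for the poll loop to finish closing the consumer
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
}));

try {
    while (true) {
        ConsumerRecords<byte[], byte[]> records = kafkaConsumer.poll(100);
        // ... process records ...
    }
} catch (WakeupException e) {
    // expected during shutdown
} finally {
    kafkaConsumer.close(); // leaves the consumer group cleanly so partitions are reassigned promptly
}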
TL;DR
When you have two consumers running with the same group id, Kafka won't assign the same partition of your topic to both.
If you repeatedly run an app that spins up a consumer with the same group id and you don't shut it down gracefully, Kafka will take a while to consider the consumer from the earlier run dead and to reassign its partition to a new one.
If new messages come to that partition and it's never assigned to your new consumer, the consumer will never see the messages.
To debug:
How many partitions your topic has:
./kafka-topics --zookeeper <host-port> --describe <topic>
How far your group has consumed from each partition:
./kafka-consumer-groups --bootstrap-server <host-port> --describe --group <group-id>
If you already have your partitions stuck on stale consumers, either wipe the state of your Kafka or use a new group id.
I am a student researching and playing around with Kafka. After following the examples on the Apache documentation, I'm playing around with the examples portion in the trunk of their current Github repo.
As of right now, the example implements an 'older' version of their Consumer and does not employ the new KafkaConsumer. Following the documentation, I have written my own version using the new KafkaConsumer, thinking that it would be faster.
This is a vague question, but on each run-through I produce 5000 simple messages such as "Message_CurrentMessageNumber" to a topic "test" and then use my consumer to fetch these messages and print them to stdout. When I run the example code, replacing the provided consumer with the newer KafkaConsumer (v0.8.2 and up), it works pretty quickly, comparably to the example, on its first run-through, but slows down considerably any time after that.
I notice that my Kafka Server outputs
Rebalancing group group1 generation 3 (kafka.coordinator.ConsumerCoordinator)
or similar messages often, which leads me to believe that Kafka has to do some sort of load balancing that slows things down, but I was wondering if anyone else had insight into what I am doing wrong.
public class AlternateConsumer extends Thread {
    private final KafkaConsumer<Integer, String> consumer;
    private final String topic;
    private final Boolean isAsync = false;

    public AlternateConsumer(String topic) {
        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("group.id", "newestGroup");
        properties.put("partition.assignment.strategy", "roundrobin");
        properties.put("enable.auto.commit", "true");
        properties.put("auto.commit.interval.ms", "1000");
        properties.put("session.timeout.ms", "30000");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.IntegerDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumer = new KafkaConsumer<Integer, String>(properties);
        consumer.subscribe(topic);
        this.topic = topic;
    }

    public void run() {
        while (true) {
            ConsumerRecords<Integer, String> records = consumer.poll(100);
            for (ConsumerRecord<Integer, String> record : records) {
                System.out.println("We received message: " + record.value() + " from topic: " + record.topic());
            }
        }
        // ConsumerRecords<Integer, String> records = consumer.poll(0);
        // for (ConsumerRecord<Integer, String> record : records) {
        //     System.out.println("We received message: " + record.value() + " from topic: " + record.topic());
        // }
        // consumer.close();
    }
}
To start:
package kafka.examples;
public class KafkaConsumerProducerDemo implements KafkaProperties {
    public static void main(String[] args) {
        final boolean isAsync = args.length > 0 ? !args[0].trim().toLowerCase().equals("sync") : true;
        Producer producerThread = new Producer("test", isAsync);
        producerThread.start();

        AlternateConsumer consumerThread = new AlternateConsumer("test");
        consumerThread.start();
    }
}
The producer is the default producer located here: https://github.com/apache/kafka/blob/trunk/examples/src/main/java/kafka/examples/Producer.java
This should not be the case. If the setup is similar between your two consumers, you should expect better results with the new consumer, unless there is an issue in the client/consumer implementation, which seems to be the case here.
Can you share your benchmark results, the frequency of the reported rebalancing, and any pattern you are observing (e.g. sluggish at startup, after a fixed number of messages consumed, after the queue is drained, etc.)? Also, please share some details about your consumer implementation.