I have many objects of a class, say Test, that I want to write to Kafka and process with a Spark Streaming application. I want to use Kryo serialization.
My application is written in Java.
JavaDStream<Test> testData = KafkaUtils
.createDirectStream(context , keyClass,valueClass ,keyDecoderClass ,valueDecoderClass , props,topics);
My question is: what should I pass for keyClass, valueClass, keyDecoderClass and valueDecoderClass?
Say your key is a String and your value is a Test. You would first need to create TestEncoder and TestDecoder classes by implementing kafka.serializer.Encoder and kafka.serializer.Decoder. Then, in your createDirectStream call, you can have:
JavaPairInputDStream<String, Test> testData = KafkaUtils
.createDirectStream(context, String.class,Test.class ,StringDecoder.class,TestDecoder.class,props,topics);
You can refer to the KafkaKryoEncoder at https://www.tomsdev.com/blog/2015/storm-kafka-complex-types/
In your Kafka producer you would need to register your custom Encoder class, like this:
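To make the shape of such an encoder/decoder pair concrete, here is a minimal, library-free sketch. It is only an illustration under several assumptions: the Encoder and Decoder interfaces below are local stand-ins that mirror the toBytes/fromBytes methods of the Kafka interfaces, the fields of Test are hypothetical (the real class is not shown in the question), and java.io data streams are used in place of Kryo so that the example runs without extra dependencies. With the old Scala client, real implementations are also typically expected to expose a constructor taking kafka.utils.VerifiableProperties.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class CodecSketch {
    // Local stand-ins mirroring the single-method shape of
    // kafka.serializer.Encoder / Decoder; these are NOT the Kafka types.
    interface Encoder<T> { byte[] toBytes(T value); }
    interface Decoder<T> { T fromBytes(byte[] bytes); }

    // Hypothetical payload: the fields are assumptions made for this sketch.
    static final class Test {
        final int id;
        final String name;
        Test(int id, String name) { this.id = id; this.name = name; }
    }

    // Writes the fields in a fixed order (java.io stands in for Kryo here).
    static final class TestEncoder implements Encoder<Test> {
        @Override
        public byte[] toBytes(Test value) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                out.writeInt(value.id);
                out.writeUTF(value.name);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return buffer.toByteArray();
        }
    }

    // Reads the fields back in the same order the encoder wrote them.
    static final class TestDecoder implements Decoder<Test> {
        @Override
        public Test fromBytes(byte[] bytes) {
            try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
                return new Test(in.readInt(), in.readUTF());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Encodes and immediately decodes a value, as producer and consumer would.
    static Test roundTrip(Test value) {
        return new TestDecoder().fromBytes(new TestEncoder().toBytes(value));
    }

    public static void main(String[] args) {
        Test copy = roundTrip(new Test(7, "example"));
        System.out.println(copy.id + " " + copy.name); // prints "7 example"
    }
}
```

The point of the round trip is that the decoder must read exactly what the encoder wrote; the same holds when the body of toBytes/fromBytes is replaced with Kryo calls.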
Properties properties = new Properties();
properties.put("metadata.broker.list", brokerList);
properties.put("serializer.class", "com.my.TestEncoder");
Producer<String, Test> producer = new Producer<String, Test>(new ProducerConfig(properties));
Test test = new Test();
KeyedMessage<String, Test> data = new KeyedMessage<String, Test>("myTopic", test);
producer.send(data);
Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of its Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
pipeline
.apply("read streams", dataIO.readStreams(topic))
.apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
@Override
public Boolean apply(KV<String, Person> input) {
return input.getValue().get("address").equals(true);
}
}));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
.withTopic(topic)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(PersonAvroDeserializer.class)
.withConsumerConfigUpdates(consumer)
.withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a BigQuery sink connector:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
'topics' = 'FILTERED_STREAM',
…
);
You can also do this by creating a new TableSchema yourself that contains only the required fields. Later, when you write to BigQuery, you pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();
TableFieldSchema id =
new TableFieldSchema()
.setName("id")
.setType("STRING")
.setMode("NULLABLE");
tableFields.add(id);
schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an AVRO record to BigQuery's TableRow at some point, you may need to implement some checks there too.
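As a rough illustration of the projection step itself: Filter keeps or drops whole elements (hence the Boolean return type), whereas keeping only some fields is a per-record mapping. The sketch below shows that mapping with plain java.util types only. Records are modelled as Map<String, Object>, which is an assumption made for this example. The Person, Avro and TableRow types from the question are deliberately not used, and the sample values are invented.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldProjection {
    // The only fields that should reach the output row.
    static final List<String> WANTED = Arrays.asList("id", "name", "address");

    // Copies the wanted fields into a new row; fields that are absent or null
    // in the input are skipped (one possible form of the "checks" mentioned above).
    static Map<String, Object> project(Map<String, Object> record) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (String field : WANTED) {
            Object value = record.get(field);
            if (value != null) {
                row.put(field, value);
            }
        }
        return row;
    }

    // Invented sample record carrying more fields than we want to keep.
    static Map<String, Object> sampleRecord() {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", "42");
        record.put("name", "Ada");
        record.put("address", "1 Example Street");
        record.put("contact", "555-0100");
        record.put("dateOfBirth", "1990-01-01");
        return record;
    }

    public static void main(String[] args) {
        // prints {id=42, name=Ada, address=1 Example Street}
        System.out.println(project(sampleRecord()));
    }
}
```

In a Beam pipeline the same logic would sit inside a mapping transform rather than inside Filter, with the reduced schema describing the resulting rows.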
I have a problem running a DRPC topology that contains a single bolt and querying it through a local cluster. Debugging with IntelliJ shows that the bolt is indeed executed, but afterwards the JCQueue is stuck in an infinite loop until a timeout is sent to the server.
Here is the code used to build the topology builder:
public static LinearDRPCTopologyBuilder createBuilder()
{
var bolt = new MRedisLookupBolt(createRedisConfiguration(), new RedisTurnoverMapper());
var builder = new LinearDRPCTopologyBuilder("sales");
builder.addBolt(bolt, 1).localOrShuffleGrouping();
return builder;
}
The MRedisLookupBolt is just a very simple implementation of IBasicBolt executing a hget command against Jedis. The execute method of the MRedisLookupBolt is just emitting an instance of Values containing the value for two fields that are declared like this:
declarer.declare(new Fields("id", "Value"));
The topology is built and queried in a unit test like this:
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(1);
try(LocalDRPC drpc = new LocalDRPC())
{
LocalCluster cluster = new LocalCluster();
var builder = BasicRedisRPCTopology.createBuilder();
LocalCluster.LocalTopology topo = cluster.submitTopology(
"Sales-fetch", conf, builder.createLocalTopology(drpc));
var result = drpc.execute("sales", "XXXXX");
System.out.println("################ Result: " + result);
}
catch (Exception e)
{
e.printStackTrace();
}
When reading the logs, I can see that the data is read correctly by the bolt and that everything is emitted.
But at the end, I get this stack trace printed out by my test method. Of course, no value is assigned to the result variable and the process never reaches the last print instruction:
There is something I am missing here. As I understand it, the JCQueue used by BoltExecutor to retrieve the id of the bolt to execute never finishes, although only one parameter is sent to the local DRPC and only one bolt is declared in the topology. I have already tried adding more bolts to the topology and changing the builder implementation used to create it, but with no success.
I found a solution suitable for my use case using Apache Storm 2.1.0.
It seems that invoking the submitTopology method of the local cluster as proposed by the documentation does not end the executor correctly with version 2.1.0 using the LinearDRPCTopologyBuilder to build the topology.
Looking more closely at the source code, it was possible to understand how to apply the LinearDRPCTopologyBuilder logic to the TopologyBuilder directly.
Here is the change applied to the createBuilder method:
public static TopologyBuilder createBuilder(ILocalDRPC localDRPC)
{
var spout = Optional.ofNullable(localDRPC)
.map(drpc -> new DRPCSpout("sales", drpc))
.orElse(new DRPCSpout("sales"));
var bolt = new MRedisLookupBolt(createRedisConfiguration(), new RedisTurnoverMapper());
var builder = new TopologyBuilder();
builder.setSpout("drpc", spout);
builder.setBolt("redisLookup", bolt, 1)
.shuffleGrouping("drpc");
builder.setBolt("return", new ReturnResults())
.shuffleGrouping("redisLookup");
return builder;
}
And here is an example of execution:
Config conf = new Config();
conf.setDebug(true);
conf.setNumWorkers(1);
try(LocalDRPC drpc = new LocalDRPC())
{
LocalCluster cluster = new LocalCluster();
var builder = BasicRedisRPCTopology.createBuilder(drpc);
cluster.submitTopology("Sales-fetch", conf, builder.createTopology());
var result = drpc.execute("sales", "XXXXX");
System.out.println("################ Result: " + result);
}
catch (Exception e)
{
e.printStackTrace();
}
Unfortunately, this solution does not allow the use of all the embedded tools of the LinearDRPCTopologyBuilder and means building the whole topology flow 'by hand'. It is also necessary to change the mapper behavior, as the fields are not exposed in the same order as before.
While trying to configure a newly created Kafka topic using the Java Kafka AdminClient, values are overwritten.
I have tried setting the same topic configuration using console commands and it works. Unfortunately, when I try it through Java code, some values collide and are overwritten.
ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, topicName);
Map<ConfigResource, Config> updateConfig = new HashMap<>();
// update retention Bytes for this topic
ConfigEntry retentionBytesEntry = new ConfigEntry(TopicConfig.RETENTION_BYTES_CONFIG, String.valueOf(retentionBytes));
updateConfig.put(resource, new Config(Collections.singleton(retentionBytesEntry)));
// update retention ms for this topic
ConfigEntry retentionMsEntry = new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, String.valueOf(retentionMs));
updateConfig.put(resource, new Config(Collections.singleton(retentionMsEntry)));
// update segment Bytes for this topic
ConfigEntry segmentBytesEntry = new ConfigEntry(TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(segmentbytes));
updateConfig.put(resource, new Config(Collections.singleton(segmentBytesEntry)));
// update segment ms for this topic
ConfigEntry segmentMsEntry = new ConfigEntry(TopicConfig.SEGMENT_MS_CONFIG, String.valueOf(segmentMs));
updateConfig.put(resource, new Config(Collections.singleton(segmentMsEntry)));
// Update the configuration
client.alterConfigs(updateConfig);
I expect the topic to have all given configuration values correctly.
Your logic does not work because you call Map.put() several times with the same key, so only the last entry is kept.
The correct way to specify multiple topic configurations is to add all the ConfigEntry objects to a single Config object, and only then add that Config to the Map.
For example:
// Your Topic Resource
ConfigResource cr = new ConfigResource(Type.TOPIC, "mytopic");
// Create all your configurations
Collection<ConfigEntry> entries = new ArrayList<>();
entries.add(new ConfigEntry(TopicConfig.SEGMENT_BYTES_CONFIG, String.valueOf(segmentbytes)));
entries.add(new ConfigEntry(TopicConfig.RETENTION_BYTES_CONFIG, String.valueOf(retentionBytes)));
...
// Create the Map
Config config = new Config(entries);
Map<ConfigResource, Config> configs = new HashMap<>();
configs.put(cr, config);
// Call alterConfigs()
admin.alterConfigs(configs);
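The overwriting behaviour described above can be reproduced without a broker. The following sketch uses only java.util collections. The topic name and the settings are plain strings invented for this illustration, standing in for ConfigResource and ConfigEntry, so it shows the Map semantics rather than the AdminClient API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PutOverwriteDemo {

    // Mirrors the question: one put() per setting, always under the same key.
    // Each put() replaces the value stored by the previous one.
    static Map<String, List<String>> onePutPerSetting() {
        Map<String, List<String>> update = new HashMap<>();
        update.put("mytopic", Collections.singletonList("retention.bytes=1024"));
        update.put("mytopic", Collections.singletonList("retention.ms=60000"));
        update.put("mytopic", Collections.singletonList("segment.bytes=512"));
        return update;
    }

    // Mirrors the answer: gather every setting first, then put() exactly once.
    static Map<String, List<String>> singlePut() {
        List<String> entries = new ArrayList<>();
        entries.add("retention.bytes=1024");
        entries.add("retention.ms=60000");
        entries.add("segment.bytes=512");
        Map<String, List<String>> update = new HashMap<>();
        update.put("mytopic", entries);
        return update;
    }

    public static void main(String[] args) {
        // prints {mytopic=[segment.bytes=512]}
        System.out.println(onePutPerSetting());
        // prints {mytopic=[retention.bytes=1024, retention.ms=60000, segment.bytes=512]}
        System.out.println(singlePut());
    }
}
```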
I have a simple program that receives data from Kafka. When I start a Kafka producer and send data, for example "Hello", I get this when I print the message: (null, Hello). I don't know why this null appears. Is there any way to avoid it? I think it comes from the first element of the Tuple2<String, String>, but I only want to print the second element. Another thing: the statement System.out.println("inside map "+ message); does not print anything. Does someone know why? Thanks.
public static void main(String[] args){
SparkConf sparkConf = new SparkConf().setAppName("org.kakfa.spark.ConsumerData").setMaster("local[4]");
// Substitute 127.0.0.1 with the actual address of your Spark Master (or use "local" to run in local mode)
sparkConf.set("spark.cassandra.connection.host", "127.0.0.1");
// Create the context with 2 seconds batch size
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new Duration(2000));
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = KafkaProperties.TOPIC.split(",");
for (String topic: topics) {
topicMap.put(topic, KafkaProperties.NUM_THREADS);
}
/* connection to cassandra */
CassandraConnector connector = CassandraConnector.apply(sparkConf);
System.out.println("+++++++++++ cassandra connector created ++++++++++++++++++++++++++++");
/* Receive kafka inputs */
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, KafkaProperties.ZOOKEEPER, KafkaProperties.GROUP_CONSUMER, topicMap);
System.out.println("+++++++++++++ streaming-kafka connection done +++++++++++++++++++++++++++");
JavaDStream<String> lines = messages.map(
new Function<Tuple2<String, String>, String>() {
public String call(Tuple2<String, String> message) {
System.out.println("inside map "+ message);
return message._2();
}
}
);
messages.print();
jssc.start();
jssc.awaitTermination();
}
Q1) Null values:
Messages in Kafka are keyed, meaning they all have a (key, value) structure.
You see (null, Hello) because the producer published a (null, "Hello") record to the topic.
If you want to omit the key in your process, map the original DStream to remove the key: kafkaDStream.map(new Function<Tuple2<String, String>, String>() {...})
Q2) System.out.println("inside map "+ message); does not print. A couple of classical reasons:
Transformations are applied in the executors, so when running in a cluster, that output will appear in the executors and not on the master.
Operations are lazy and DStreams need to be materialized for operations to be applied.
In this specific case, the JavaDStream<String> lines is never materialized i.e. not used for an output operation. Therefore the map is never executed.
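The laziness point can be illustrated without Spark. The sketch below is only an analogy: it uses java.util.stream, whose intermediate operations (such as map) are likewise deferred until a terminal operation runs. It is not Spark code, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class LazyMapDemo {

    // Builds a pipeline with a map() step but never consumes it.
    // Returns how many times the mapping function was invoked.
    static int callsWithoutTerminalOp() {
        List<String> calls = new ArrayList<>();
        Stream<String> lines = Stream.of("Hello", "World")
                .map(s -> { calls.add(s); return s.toUpperCase(); });
        // No terminal operation on 'lines', so nothing has been executed.
        return calls.size();
    }

    // Same pipeline, but consumed by a terminal operation (forEach).
    static int callsWithTerminalOp() {
        List<String> calls = new ArrayList<>();
        Stream.of("Hello", "World")
                .map(s -> { calls.add(s); return s.toUpperCase(); })
                .forEach(s -> { /* consuming the stream triggers map() */ });
        return calls.size();
    }

    public static void main(String[] args) {
        System.out.println(callsWithoutTerminalOp()); // prints 0
        System.out.println(callsWithTerminalOp());    // prints 2
    }
}
```

In the question's program, the equivalent change would be to call an output operation such as print() on lines instead of on messages.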
I am using Kafka 0.8 and am very new to it.
I want to get the list of topics created on a Kafka server along with their metadata.
Is there any API available for this?
Basically, I need to write a Java consumer that auto-discovers any topic on the Kafka server. There is an API to fetch TopicMetadata, but it needs the topic name as an input parameter. I need information for all topics present on the server.
With Kafka 0.9.0 you can list the topics on the server with the consumer method listTopics(), e.g.:
Map<String, List<PartitionInfo> > topics;
Properties props = new Properties();
props.put("bootstrap.servers", "1.2.3.4:9092");
props.put("group.id", "test-consumer-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
topics = consumer.listTopics();
consumer.close();
I think this is the best way:
ZkClient zkClient = new ZkClient("zkHost:zkPort");
List<String> topics = JavaConversions.asJavaList(ZkUtils.getAllTopics(zkClient));
A good place to start would be the sample shell scripts shipped with Kafka.
In the /bin directory of the distribution there are some shell scripts you can use, one of which is ./kafka-list-topic.sh
If you run that without specifying a topic, it will return all topics with their metadata.
See:
https://github.com/apache/kafka/blob/0.8/bin/kafka-list-topic.sh
That shell script in turn runs:
https://github.com/apache/kafka/blob/0.8/core/src/main/scala/kafka/admin/ListTopicCommand.scala
The above are both references to the 0.8 Kafka version, so if you're using a different version (even a point difference), be sure to use the appropriate branch/tag on github
Using Scala:
import java.util.{Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
object KafkaTest {
def main(args: Array[String]): Unit = {
val brokers = args(0)
val props = new Properties();
props.put("bootstrap.servers", brokers);
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
val consumer = new KafkaConsumer[String, String](props);
val topics = consumer.listTopics().keySet();
println(topics)
}
}
If you want to pull broker or other Kafka information from ZooKeeper, then kafka.utils.ZkUtils provides a nice interface. Here is the code I use to list all brokers registered in ZooKeeper (there are a ton of other methods there):
List<Broker> listBrokers() {
final ZkConnection zkConnection = new ZkConnection(connectionString);
final int sessionTimeoutMs = 10 * 1000;
final int connectionTimeoutMs = 20 * 1000;
final ZkClient zkClient = new ZkClient(connectionString,
sessionTimeoutMs,
connectionTimeoutMs,
ZKStringSerializer$.MODULE$);
final ZkUtils zkUtils = new ZkUtils(zkClient, zkConnection, false);
return scala.collection.JavaConversions.seqAsJavaList(zkUtils.getAllBrokersInCluster());
}
You can use the ZooKeeper API to get the list of brokers, as shown below:
ZooKeeper zk = new ZooKeeper("zookeeperhost", 10000, null);
List<String> ids = zk.getChildren("/brokers/ids", false);
List<Map> brokerList = new ArrayList<>();
ObjectMapper objectMapper = new ObjectMapper();
for (String id : ids) {
Map map = objectMapper.readValue(zk.getData("/brokers/ids/" + id, false, null), Map.class);
brokerList.add(map);
}
Use this broker list to get all the topics, as described at the following link:
https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader