I'm new to Kafka and Avro, and I have been trying to get a Producer/Consumer running. So far I have been able to produce and consume simple bytes and strings, using the following:
Configuration for the Producer:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(USER_SCHEMA);
Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);
KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
for (int i = 0; i < 1000; i++) {
    GenericData.Record avroRecord = new GenericData.Record(schema);
    avroRecord.put("str1", "Str 1-" + i);
    avroRecord.put("str2", "Str 2-" + i);
    avroRecord.put("int1", i);
    byte[] bytes = recordInjection.apply(avroRecord);
    ProducerRecord<String, byte[]> record = new ProducerRecord<>("mytopic", bytes);
    producer.send(record);
    Thread.sleep(250);
}
producer.close();
}
Now this is all well and good, but the problem comes when I try to serialize a POJO.
I was able to get the Avro schema from the POJO using the utility provided with Avro.
I hardcoded the schema and then tried to create a GenericRecord to send through the KafkaProducer.
The producer is now set up as:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.KafkaAvroSerializer");
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(USER_SCHEMA); // this is the Generated AvroSchema
KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props);
This is where the problem is: the moment I use the KafkaAvroSerializer, the producer doesn't come up, due to:
missing mandatory parameter : schema.registry.url
I read up on why this is required: it's so that my consumer is able to decipher whatever the producer is sending to it.
But isn't the schema already embedded in the Avro message?
It would be really great if someone could share a working example of using KafkaProducer with the KafkaAvroSerializer without having to specify schema.registry.url.
I would also really appreciate any insights/resources on the utility of the schema registry.
Thanks!
Note first: KafkaAvroSerializer is not provided in vanilla Apache Kafka - it is provided by Confluent Platform (https://www.confluent.io/), as part of its open source components (http://docs.confluent.io/current/platform.html#confluent-schema-registry).
Quick answer: no, if you use KafkaAvroSerializer, you will need a Schema Registry. See some samples here:
http://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
The basic idea with the Schema Registry is that each topic will refer to an Avro schema (i.e., you will only be able to send data that is coherent with each other; but a schema can have multiple versions, so you still need to identify the schema for each record).
We don't want to write the schema with every piece of data, as you imply - often the schema is bigger than your data! It would be a waste of time to parse it on every read, and a waste of resources (network, disk, CPU).
Instead, a Schema Registry instance maintains a binding avro schema <-> int schemaId, and the serializer writes only this id in front of the data, after getting it from the registry (and caching it for later use).
So inside Kafka, your record will be [<id> <bytesavro>] (plus a magic byte for technical reasons), which is an overhead of only 5 bytes (compared to the size of your schema).
And when reading, your consumer will look up the schema corresponding to that id and deserialize the Avro bytes against it. You can find much more in the Confluent docs.
If you really have a use case where you want to write the schema with every record, you will need another serializer (writing your own would be easy: just reuse https://github.com/confluentinc/schema-registry/blob/master/avro-serializer/src/main/java/io/confluent/kafka/serializers/AbstractKafkaAvroSerializer.java and replace the schema-registry part with the schema itself, and do the same for reading). But if you use Avro, I would really discourage this - sooner or later, you will need to implement something like a schema registry to manage versioning.
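For reference, a minimal sketch of the producer side once a registry is available might look like this (the localhost addresses, topic name and USER_SCHEMA are assumptions carried over from the question):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
// this is the parameter the serializer was complaining about
props.put("schema.registry.url", "http://localhost:8081");

Schema schema = new Schema.Parser().parse(USER_SCHEMA);
try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
    GenericRecord avroRecord = new GenericData.Record(schema);
    avroRecord.put("str1", "Str 1");
    avroRecord.put("str2", "Str 2");
    avroRecord.put("int1", 1);
    // the serializer fetches/registers the schema id and prepends the 5-byte header described above
    producer.send(new ProducerRecord<>("mytopic", avroRecord));
}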
While the accepted answer is correct, it should also be mentioned that automatic schema registration can be disabled.
Simply set auto.register.schemas to false.
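For example, the relevant producer properties would be something like this (URLs are placeholders; note that the registry URL is still needed so the serializer can look up the schema id):
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081"); // still required for schema id lookups
props.put("auto.register.schemas", false); // don't auto-register new schemas from this producer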
You can create your own custom Avro serializer; then, even without a Schema Registry, you would be able to produce records to topics. Check the article below.
https://codenotfound.com/spring-kafka-apache-avro-serializer-deserializer-example.html
There they use KafkaTemplate. I have tried using
KafkaProducer<String, User> UserKafkaProducer
and it works fine.
But if you want to use the KafkaAvroSerializer, you need to give a Schema Registry URL.
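Not the exact code from that article, but a bare-bones custom serializer along those lines might look like the sketch below; User is assumed to be an Avro-generated SpecificRecord class, and no registry is involved because both sides know the schema at compile time:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

public class AvroUserSerializer implements Serializer<User> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // nothing to configure
    }

    @Override
    public byte[] serialize(String topic, User user) {
        if (user == null) {
            return null;
        }
        try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            // plain Avro binary encoding - no magic byte, no schema id
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            DatumWriter<User> writer = new SpecificDatumWriter<>(User.class);
            writer.write(user, encoder);
            encoder.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new SerializationException("Failed to serialize User", e);
        }
    }

    @Override
    public void close() {
        // nothing to close
    }
}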
As others have pointed out, KafkaAvroSerializer requires the Schema Registry, which is part of the Confluent Platform, and its usage is subject to Confluent's licensing.
The main advantage of using the Schema Registry is that your bytes on the wire will be smaller, as opposed to writing a binary payload with the full schema for every message.
I wrote a blog post detailing the advantages
You can always make your value classes implement Serializer<T> and Deserializer<T> (and Serde<T> for Kafka Streams) manually. Java classes are usually generated from Avro files, so editing them directly isn't a good idea, but wrapping them is a perhaps verbose yet workable way.
Another way is to tune the Avro generator templates that are used for Java class generation and generate implementations of all those interfaces automatically. Both the Avro Maven and Gradle plugins support custom templates, so it should be easy to configure.
I've created https://github.com/artemyarulin/avro-kafka-deserializable, which has changed template files and a simple CLI tool that you can use for file generation.
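For illustration, wrapping a generated class into a Serde can then be as small as this (AvroUserSerializer and AvroUserDeserializer are hypothetical hand-written classes along the lines of the custom serializer sketched in another answer here):
// usable in Kafka Streams, e.g. Consumed.with(Serdes.String(), userSerde)
Serde<User> userSerde = Serdes.serdeFrom(new AvroUserSerializer(), new AvroUserDeserializer());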
Using the Kafka AdminClient from Java, I'm trying to find out how I can get the retention.bytes and retention.ms settings from a topic.
The only thing I found in the API is this:
adminClient.describeConfigs(Collection<ConfigResource>???).values().get(ConfigResource???).get().get(String)
In case this is the way, I'm not sure what is supposed to be passed where.
The last String, I guess, would be the config name retention.bytes.
Read through the Javadoc. ConfigResource has a Type enum.
describeConfigs accepts a collection and returns a map (via values()) keyed by the items in that collection, so pull the ConfigResource out into its own variable for reuse.
I also suggest using the TopicConfig constants rather than raw strings.
Map<String, Object> config = new HashMap<>();
config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
AdminClient client = AdminClient.create(config);

ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, "example");
// values() returns a Map<ConfigResource, KafkaFuture<Config>>, so block on the future
Config topicConfig = client.describeConfigs(Collections.singletonList(resource))
        .values().get(resource).get();
String retentionBytes = topicConfig.get(TopicConfig.RETENTION_BYTES_CONFIG).value();
String retentionMs = topicConfig.get(TopicConfig.RETENTION_MS_CONFIG).value();
As stated in the Avro Getting Started guide, about deserialization without code generation: "The data will be read using the writer's schema included in the file, and the reader's schema provided to the GenericDatumReader". Here is how the GenericDatumReader is created in the example:
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
But when you look at this GenericDatumReader constructor's Javadoc, it states "Construct where the writer's and reader's schemas are the same." (and the actual code corresponds to this).
So the writer's schema isn't taken from the serialized file but from a constructor parameter? If so, how can data be read using the writer's schema, as described on that page?
I've received an answer on the Avro mailing list:
...writer schema can be adjusted after creation. This is what the DataFileReader does.
So after the DataFileReader is initialised, the underlying GenericDatumReader uses the schema stored in the file as the writer schema (to understand the data), and the schema you provided as the reader schema (to give the data to you via dataFileReader.next(user)).
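For illustration, a sketch of that flow, reusing the users.avro file and schema from the Getting Started example:
// the schema passed here acts as the reader schema; the writer schema is replaced
// by the one stored in the file header once the DataFileReader is initialised
DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(new File("users.avro"), datumReader)) {
    GenericRecord user = null;
    while (dataFileReader.hasNext()) {
        user = dataFileReader.next(user); // decoded with the writer schema, resolved to the reader schema
        System.out.println(user);
    }
}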
Can someone provide me with a simple example of setting up a flowfile in a custom NiFi processor so the payload can be sent out through the PublishKafka processor?
I have a legacy messaging protocol that I wrote a custom processor for. It has a pretty simple structure: just a MessageID (String) and the MessageBody (byte[]). My custom processor handles the input fine, with the messages being received. I'm now attempting to put this data into a flowfile so it can be sent on to the PublishKafka processor, but I've had trouble finding any resources online on how to do this. Here's my current code snippet of the relevant portion:
try {
    this.getLogger().info("[INFO - ListenMW] - Message Received: " +
            data.getMsgID().toString() + " Size: " +
            data.getMsgData().length);
    this.currentSession.adjustCounter("MW Counter", 1, true);
    // Setup the flowfile to transfer
    FlowFile flowfile = this.currentSession.create();
    flowfile = this.currentSession.putAttribute(flowfile, "key", data.getMsgID().toString());
    flowfile = this.currentSession.putAttribute(flowfile, "value", new String(data.getMsgData(), StandardCharsets.UTF_8));
    this.currentSession.transfer(flowfile, SUCCESS);
} catch (Exception e) {
    this.getLogger().error("[INFO - ListenMW] - " + e.getMessage());
    this.currentSession.adjustCounter("MW Failure", 1, true);
}
I've been unable to determine which attribute(s) to use for the msgID and msgData, so I created my own for now. I saw one post where someone recommended building your own JSON structure and sending that through as your payload, but again, which attribute would you send that through so it gets mapped properly to the Kafka message? I'm pretty new to Kafka and have only experimented with rudimentary test cases to this point, so forgive my ignorance for any wrong assumptions.
Thanks for any guidance! I'm using Kafka 2.0.1 and the PublishKafka_2.0 processor.
Based on what you've shared, it looks like the main reason you're not getting anything published to Kafka is that you're not actually writing anything to the flowfile content. For a reference point, here is a copy of the Javadocs for NiFi (also, here are the processor docs). What you should be doing is something like this:
flowFile = session.write(flowFile, outStream -> {
    outStream.write("some string here".getBytes());
});
I use PublishKafkaRecord, but the PublishKafka processors are pretty similar conceptually. You can set the key for the message the way you're doing it there, but you need to set the value by writing it to the flowfile body.
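Putting that together with your snippet, the relevant part could look something like this (the kafka.key attribute name is an assumption; check which attribute or property your PublishKafka version reads the message key from):
FlowFile flowfile = this.currentSession.create();
// the key can stay an attribute if PublishKafka is configured to read it from one
flowfile = this.currentSession.putAttribute(flowfile, "kafka.key", data.getMsgID().toString());
// the message value must go into the flowfile content, not an attribute
final byte[] body = data.getMsgData();
flowfile = this.currentSession.write(flowfile, outStream -> outStream.write(body));
this.currentSession.transfer(flowfile, SUCCESS);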
Without knowing your broader use case here, it looks like you can do what you need to do with ExecuteScript. See this as a starting point for ExecuteScript with multiple scripting language references.
If you need further help, we have multiple options here for you.
The Avro website has an example:
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();
What is the purpose of DatumWriter<User>? I mean, what does it provide? It provides a write method, but instead of using it we use DataFileWriter. Can someone explain the design purpose of it?
The DatumWriter class is responsible for translating a given data object into an Avro record with a given schema (which in your case is extracted from the User class).
Given such a record, the DataFileWriter is responsible for writing it to a file.
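To see that separation, here is a sketch of using the same DatumWriter on its own, without a DataFileWriter, to produce raw Avro bytes in memory:
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<>(User.class);
ByteArrayOutputStream out = new ByteArrayOutputStream();
BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
// the DatumWriter only knows how to encode a single datum according to its schema
userDatumWriter.write(user1, encoder);
encoder.flush();
byte[] rawAvro = out.toByteArray(); // no file header, no embedded schema, no sync markers
DataFileWriter adds the container-file concerns on top: it writes the schema into the file header, handles blocking, compression and sync markers, and delegates the encoding of each appended record to the DatumWriter it was given.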
I need to validate an XML document against a local XSD, and I do not have an internet connection on the target machine (on which this process runs). The code looks like this:
SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
File schemaLocation = new File(xsd);
Schema schema = factory.newSchema(schemaLocation);
Validator validator = schema.newValidator();
Source source = new StreamSource(new BufferedInputStream(new FileInputStream(new File(xml))));
validator.validate(source);
I always get a java.net.ConnectException when validate() is called.
Can you please let me know what is not being done correctly?
Many Thanks.
Abhishek
Agreed with Mads' comment - there are likely many references here that will attempt outgoing connections to the Internet, and you will need to download local copies for them. However, I'd advise against changing references within the XML or schema files, etc. - but instead, provide an EntityResolver to return the contents of your local copies instead of connecting out to the Internet. (I previously wrote a little bit about this at http://blogger.ziesemer.com/2009/01/xml-and-xslt-tips-and-tricks-for-java.html#InputValidation.)
However, in your case, since you're using a Validator (which takes an LSResourceResolver rather than an EntityResolver), use Validator.setResourceResolver(...) and pass in an LSResourceResolver before calling validate.
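A rough sketch of that approach, assuming the remote schemas referenced by your XSD have been saved locally under a schemas/ directory on the classpath (the mapping from systemId to local file name is an assumption you will need to adapt; exception handling omitted):
DOMImplementationLS domLs = (DOMImplementationLS)
        DOMImplementationRegistry.newInstance().getDOMImplementation("LS");

validator.setResourceResolver((type, namespaceURI, publicId, systemId, baseURI) -> {
    if (systemId == null) {
        return null; // nothing to resolve, fall back to default behaviour
    }
    // serve a local copy instead of letting the parser connect to the Internet
    String fileName = systemId.substring(systemId.lastIndexOf('/') + 1);
    InputStream local = Thread.currentThread().getContextClassLoader()
            .getResourceAsStream("schemas/" + fileName);
    if (local == null) {
        return null;
    }
    LSInput input = domLs.createLSInput();
    input.setPublicId(publicId);
    input.setSystemId(systemId);
    input.setByteStream(local);
    return input;
});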