Kafka Protobuf: C++ serialization to Java

I've developed a couple of C++ apps that produce and consume Kafka messages (using cppkafka) embedding Protobuf3 messages. Both work fine. The producer's relevant code is:
std::string kafkaString;
cppkafka::MessageBuilder *builder;
...
solidList->SerializeToString(&kafkaString);
builder->payload(kafkaString);
Protobuf objects are serialized to a string and inserted as the Kafka payload. Everything works fine up to this point. Now, I'm trying to develop a consumer for that in Java. The relevant code should be:
KafkaConsumer<Long, String> consumer=new KafkaConsumer<Long, String>(properties);
....
ConsumerRecords<Long, String> records = consumer.poll(100);
for (ConsumerRecord<Long, String> record : records) {
SolidList solidList = SolidList.parseFrom(record.value());
...
but that fails at compile time: parseFrom complains: The method parseFrom(ByteBuffer) in the type Solidlist.SolidList is not applicable for the arguments (String). So, I try using a ByteBuffer:
KafkaConsumer<Long, ByteBuffer> consumer=new KafkaConsumer<Long, ByteBuffer>(properties);
....
ConsumerRecords<Long, ByteBuffer> records = consumer.poll(100);
for (ConsumerRecord<Long, ByteBuffer> record : records) {
SolidList solidList = SolidList.parseFrom(record.value());
...
Now, the error is at execution time, still on parseFrom(): Exception in thread "main" java.lang.ClassCastException: java.lang.String cannot be cast to java.nio.ByteBuffer. I know it is a java.lang.String!!! So, I go back to the original and try using it as a byte array:
SolidList solidList = SolidList.parseFrom(record.value().getBytes());
Now, the error is at execution time: Exception in thread "main" com.google.protobuf.InvalidProtocolBufferException$InvalidWireTypeException: Protocol message tag had invalid wire type.
The protobuf documentation states for the C++ serialization: bool SerializeToString(string* output) const; serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
TL;DR: So, how should I interpret the protobuf C++ "binary bytes" in Java?
This seems related (it is the opposite) but doesn't help: Protobuf Java To C++ Serialization [Binary]
Thanks in advance.

Try implementing a Deserializer and passing it to the KafkaConsumer constructor as the value deserializer. It could look like this:
class SolidListDeserializer implements Deserializer<SolidList> {
    @Override
    public SolidList deserialize(final String topic, final byte[] data) {
        try {
            return SolidList.parseFrom(data);
        } catch (InvalidProtocolBufferException e) {
            // parseFrom throws a checked exception; rethrow it as Kafka's SerializationException
            throw new SerializationException("Failed to parse SolidList", e);
        }
    }
    ...
}
...
KafkaConsumer<Long, SolidList> consumer = new KafkaConsumer<>(props, new LongDeserializer(), new SolidListDeserializer());
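If you'd rather not write a custom deserializer, here is a minimal sketch of the same idea (my own illustration, not part of the answer above): consume the value as a raw byte[] with Kafka's ByteArrayDeserializer and hand the bytes straight to parseFrom, so the binary payload never passes through a String. The broker address, group id and topic name below are placeholders:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.LongDeserializer;

// SolidList is the protobuf-generated class from the question
public class SolidListByteConsumer {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        properties.put(ConsumerConfig.GROUP_ID_CONFIG, "solidlist-consumer");      // placeholder group id
        properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, LongDeserializer.class.getName());
        properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<Long, byte[]> consumer = new KafkaConsumer<>(properties)) {
            consumer.subscribe(Collections.singletonList("solids")); // placeholder topic
            while (true) {
                // poll(long) matches the client version used in the question
                ConsumerRecords<Long, byte[]> records = consumer.poll(100);
                for (ConsumerRecord<Long, byte[]> record : records) {
                    // record.value() holds exactly the bytes SerializeToString produced on the C++ side
                    SolidList solidList = SolidList.parseFrom(record.value());
                    // ... process solidList
                }
            }
        }
    }
}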

You can read Kafka as ConsumerRecords<Long, String> and then call SolidList.parseFrom(ByteBuffer.wrap(record.value().getBytes("UTF-8")));

Related

Java: Protobuf byte to Json string to Pojo fast

I am receiving messages in protobuf format. I need to convert them to JSON fast, as all my business logic is written to handle JSON-based POJO objects.
byte[] request = ..; // msg received
// convert to intermediate POJO
AdxOpenRtb.BidRequest bidRequestProto = AdxOpenRtb.BidRequest.parseFrom(request, reg);
// convert intermediate POJO to json string.
// THIS STEP IS VERY SLOW
Printer printer = JsonFormat.printer().printingEnumsAsInts().omittingInsignificantWhitespace();
String jsonBody = printer.print(bidRequestProto);
// convert json string to final POJO format
BidRequest bidRequest = super.parse(jsonBody.getBytes());
The proto-object-to-JSON conversion step is very slow. Is there a faster approach for it?
Can I reuse the Printer object? Is it thread-safe?
Note: these POJO classes (AdxOpenRtb.BidRequest and BidRequest) are very complex, with many levels of hierarchy and many fields, but they hold similar data with slightly different field names and data types.
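A note on the Printer reuse question: JsonFormat.Printer is configured through calls that each return a new Printer, so a fully configured instance can typically be built once and cached instead of being recreated per message (worth verifying against the protobuf-java version you use). A minimal sketch, keeping the rest of the conversion as in the question:
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.protobuf.util.JsonFormat;

public final class ProtoJsonSupport {
    // Built once and reused; each configuration call returns a new Printer instance
    private static final JsonFormat.Printer PRINTER =
            JsonFormat.printer().printingEnumsAsInts().omittingInsignificantWhitespace();

    // AdxOpenRtb.BidRequest is the generated proto class from the question
    public static String toJson(AdxOpenRtb.BidRequest bidRequestProto) throws InvalidProtocolBufferException {
        return PRINTER.print(bidRequestProto);
    }
}
Caching the Printer only removes per-call setup cost; the printing itself still walks the message descriptors, so the bigger wins come from the approaches in the answers below.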
I ran into some performance issues as well and ended up writing the QuickBuffers library. It generates dedicated JSON serialization methods (i.e. no reflection) and should give you a 10-30x speedup. It can be used side-by-side with Google's implementation. The code should look something like this:
// Initialization (objects can be reused if desired)
AdxOpenRtb.BidRequest bidRequestProto = AdxOpenRtb.BidRequest.newInstance();
ProtoSource protoSource = ProtoSource.newArraySource();
JsonSink jsonSink = JsonSink.newInstance().setWriteEnumsAsInts(true);
// Convert Protobuf to JSON
bidRequestProto.clearQuick() // or ::parseFrom if you want a new object
.mergeFrom(protoSource.setInput(request))
.writeTo(jsonSink.clear());
// Use the raw json bytes
RepeatedByte jsonBytes = jsonSink.getBytes();
JsonSinkBenchmark has some sample code for replacing the built-in JSON encoder with more battle-tested Gson/Jackson backends.
Edit: if you're doing this within a single process and are worried about performance, you're better off writing or generating code to convert the Java objects directly. JSON is not a very efficient format to go through.
I ended up using MapStruct as suggested by some of you (@M.Deinum).
New code:
byte[] request = ..; // msg received
// convert to intermediate POJO
AdxOpenRtb.BidRequest bidRequestProto = AdxOpenRtb.BidRequest.parseFrom(request, reg);
// direct conversion from protobuf Pojo to my custom Pojo
BidRequest bidRequest = BidRequestMapper.INSTANCE.adxOpenRtbToBidRequest(bidRequestProto);
Code snippet of BidRequestMapper:
@Mapper(
collectionMappingStrategy = CollectionMappingStrategy.ADDER_PREFERRED, nullValueCheckStrategy = NullValueCheckStrategy.ALWAYS,
unmappedSourcePolicy = ReportingPolicy.WARN, unmappedTargetPolicy = ReportingPolicy.WARN)
@DecoratedWith(BidRequestMapperDecorator.class)
public abstract class BidRequestMapper {
public static final BidRequestMapper INSTANCE = Mappers.getMapper(BidRequestMapper.class);
@Mapping(source = "impList", target = "imp")
@Mapping(target = "impOverride", ignore = true)
@Mapping(target = "ext", ignore = true)
public abstract BidRequest adxOpenRtbToBidRequest(AdxOpenRtb.BidRequest adxOpenRtb);
...
...
}
// manage proto extensions
abstract class BidRequestMapperDecorator extends BidRequestMapper {
private final BidRequestMapper delegate;
BidRequestMapperDecorator(BidRequestMapper delegate) {
this.delegate = delegate;
}
@Override
public BidRequest adxOpenRtbToBidRequest(AdxOpenRtb.BidRequest bidRequestProto) {
// Convert protobuf msg to basic bid request object
BidRequest bidRequest = delegate.adxOpenRtbToBidRequest(bidRequestProto);
...
...
}
}
The new approach is 20-50x faster in my local test environment.
It's worth mentioning that MapStruct is an annotation processor, which makes it much faster than similar libraries that use reflection, and it also has very good support for customization.

Get only a subset of fields from a Kafka topic using Apache Beam

Is there a way to read only specific fields of a Kafka topic?
I have a topic, say person with a schema personSchema. The schema contains many fields such as id, name, address, contact, dateOfBirth.
I want to get only id, name and address. How can I do that?
Currently I'm reading streams using Apache Beam and intend to write the data to BigQuery afterwards. I am trying to use Filter but cannot get it to work because of its Boolean return type.
Here's my code:
Pipeline pipeline = Pipeline.create();
PCollection<KV<String, Person>> kafkaStreams =
pipeline
.apply("read streams", dataIO.readStreams(topic))
.apply(Filter.by(new SerializableFunction<KV<String, Person>, Boolean>() {
@Override
public Boolean apply(KV<String, Person> input) {
return input.getValue().get("address").equals(true);
}
}));
where dataIO.readStreams is returning this:
return KafkaIO.<String, Person>read()
.withTopic(topic)
.withKeyDeserializer(StringDeserializer.class)
.withValueDeserializer(PersonAvroDeserializer.class)
.withConsumerConfigUpdates(consumer)
.withoutMetadata();
I would appreciate suggestions for a possible solution.
You can do this with ksqlDB, which also works directly with Kafka Connect, for which there is a sink connector for BigQuery:
CREATE STREAM MY_SOURCE WITH (KAFKA_TOPIC='person', VALUE_FORMAT='AVRO');
CREATE STREAM FILTERED_STREAM AS SELECT id, name, address FROM MY_SOURCE;
CREATE SINK CONNECTOR SINK_BQ_01 WITH (
'connector.class' = 'com.wepay.kafka.connect.bigquery.BigQuerySinkConnector',
'topics' = 'FILTERED_STREAM',
…
);
You can also do this by creating a new TableSchema by yourself with only the required fields. Later when you write to BigQuery, you can pass the newly created schema as an argument instead of the old one.
TableSchema schema = new TableSchema();
List<TableFieldSchema> tableFields = new ArrayList<TableFieldSchema>();
TableFieldSchema id =
new TableFieldSchema()
.setName("id")
.setType("STRING")
.setMode("NULLABLE");
tableFields.add(id);
schema.setFields(tableFields);
return schema;
I should also mention that if you are converting an AVRO record to BigQuery's TableRow at some point, you may need to implement some checks there too.
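For the projection itself, here is a sketch of one way to do it (my own illustration; the Person getter names are assumed and need to be adjusted to the actual generated class): map each record to a TableRow carrying only id, name and address before handing the collection to BigQueryIO:
PCollection<TableRow> rows = kafkaStreams.apply("project fields",
    ParDo.of(new DoFn<KV<String, Person>, TableRow>() {
        @ProcessElement
        public void processElement(ProcessContext ctx) {
            Person person = ctx.element().getValue();
            // String.valueOf() guards against Avro returning CharSequence/Utf8 instead of String
            ctx.output(new TableRow()
                .set("id", String.valueOf(person.getId()))
                .set("name", String.valueOf(person.getName()))
                .set("address", String.valueOf(person.getAddress())));
        }
    }));
The resulting rows can then be written with BigQueryIO using the reduced TableSchema shown above.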

Side input in global window as slowly changing cache questions

Context:
We have some schema files in Cloud Storage. In our Dataflow job, we need to refer to these schema files to transform our data. These schema files change on a daily/weekly basis. Our data source is PubSub, and we window the PubSub messages into a fixed window of 1 minute. The schema files we need fit well into memory; they are about 90 MB.
What I have tried:
Referring to this doc from Apache Beam, we created a side input that writes into a global window with a GenerateSequence like so:
// Creates a side input that refreshes the schema every minute
PCollectionView<Map<String, byte[]>> dataBlobView =
pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
.apply(Window.<Long>into(new GlobalWindows()).triggering(
Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(ParDo.of(new DoFn<Long, Map<String, byte[]>>() {
@ProcessElement
public void processElement(ProcessContext ctx) throws Exception {
byte[] avroSchemaBlob = getAvroSchema();
byte[] fileDescriptorSetBlob = getFileDescriptorSet();
byte[] depsBlob = getFileDescriptorDeps();
Map<String, byte[]> dataBlobs = ImmutableMap.of(
"version", Longs.toByteArray(ctx.element().byteValue()),
"avroSchemaBlob", avroSchemaBlob,
"fileDescriptorSetBlob", fileDescriptorSetBlob,
"depsBlob", depsBlob);
ctx.output(dataBlobs);
}
}))
.apply(View.asSingleton());
"getAvroSchema", "getFileDescriptorSet" and "getFileDescriptorDeps" read files as byte[] from Cloud Storage.
However, this approach failed from the exception:
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.
I then tried writing my own Combine Globally function like so:
static class GetLatestVersion implements SerializableFunction<Iterable<Map<String, byte[]>>, Map<String, byte[]>> {
@Override
public Map<String, byte[]> apply(Iterable<Map<String, byte[]>> versions) {
Map<String, byte[]> result = Maps.newHashMap();
Long maxVersion = Long.MIN_VALUE;
for (Map<String, byte[]> version: versions){
Long currentVersion = Longs.fromByteArray(version.get("version"));
logger.info("Side input version: " + currentVersion);
if (currentVersion > maxVersion) {
result = version;
maxVersion = currentVersion;
}
}
return result;
}
}
But it still triggers the same exception.
I then came across this and this in the Beam email archives, and it seems like what's suggested in the Beam doc does not work and that I have to use a MultiMap to avoid the exception I ran into above. With a MultiMap, I will also have to iterate through the values and apply my own logic to pick the desired value (the latest).
My questions:
Why do I still get the exception "PCollection with more than one element accessed as a singleton view" even after I globally combine everything into 1 result?
If I go with the MultiMap approach, wouldn't the job eventually run out of memory? Because every day we are basically increasing the MultiMap by 90 MB (the size of our data blob), unless Dataflow has some smart MultiMap implementation behind the scenes.
What is the recommended way to do this?
Thanks
Use .apply(View.asMap()) instead of .apply(View.asSingleton());
This is the full example:
PCollectionView<Map<String, byte[]>> dataBlobView =
pipeline.apply(GenerateSequence.from(0).withRate(1, Duration.standardDays(1L)))
.apply(Window.<Long>into(new GlobalWindows()).triggering(
Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(ParDo.of(new DoFn<Long, KV<String, byte[]>>() {
@ProcessElement
public void processElement(ProcessContext ctx) throws Exception {
byte[] avroSchemaBlob = getAvroSchema();
byte[] fileDescriptorSetBlob = getFileDescriptorSet();
byte[] depsBlob = getFileDescriptorDeps();
ctx.output(KV.of("version", Longs.toByteArray(ctx.element().byteValue())));
ctx.output(KV.of("avroSchemaBlob", avroSchemaBlob));
ctx.output(KV.of("fileDescriptorSetBlob", fileDescriptorSetBlob));
ctx.output(KV.of("depsBlob", depsBlob));
}
}))
.apply(View.asMap());
You can use the map from the side input as described in the documentation.
Apache Beam version 2.34.0
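For completeness, here is a minimal sketch of consuming that side input from a downstream ParDo (the main-input collection and its element types are placeholders for whatever the pipeline actually processes):
PCollection<String> output = mainInput.apply("use schema side input",
    ParDo.of(new DoFn<PubsubMessage, String>() {
        @ProcessElement
        public void processElement(ProcessContext ctx) {
            // Latest schema blobs published by the GenerateSequence branch above
            Map<String, byte[]> dataBlobs = ctx.sideInput(dataBlobView);
            byte[] avroSchemaBlob = dataBlobs.get("avroSchemaBlob");
            // ... decode the schema and transform ctx.element() accordingly
        }
    }).withSideInputs(dataBlobView));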

gRPC metadata read from the server response are wrongly formatted

I have a Java application running on Spring that defines multiple gRPC endpoints. These endpoints are meant to be queried from multiple clients, one of which is written in PHP, so I used the PHP gRPC library. Now I wonder how to properly get the metadata from the server in case of an invalid request; this metadata contains mostly constraint violations built by the Java validator and transformed into a collection of gRPC FieldViolation objects. In this example, the server is supposed to return one single field violation as metadata, with the key "violationKey" and the description "violationDescription":
try {
// doStuff
} catch (ConstraintViolationException e) {
Metadata trailers = new Metadata();
trailers.put(ProtoUtils.keyForProto(BadRequest.getDefaultInstance()), BadRequest
.newBuilder()
.addFieldViolations(FieldViolation
.newBuilder()
.setField("violationKey")
.setDescription("violationDescription")
.build()
)
.build()
);
responseObserver.onError(Status.INVALID_ARGUMENT.asRuntimeException(trailers));
}
On the PHP side, this is the implementation to retrieve the metadata:
class Client extends \Grpc\BaseStub
{
public function callService()
{
$call = $this->_simpleRequest(
'MyService/MyAction',
$argument,
['MyActionResponse', 'decode'],
$metadata, $options
);
list($response, $status) = $call->wait();
var_dump($status->metadata); // A
var_dump($call->getMetadata()); // B
}
}
Result: "A" outputs an empty array, "B" outputs the proper metadata, formatted as follows:
array(1) {
["google.rpc.badrequest-bin"]=>
array(1) {
[0]=>
string(75) "
I
testALicense plate number is not in a valid format for country code FR"
}
}
Why is the metadata in the status empty, and why is the metadata retrieved by $call->getMetadata() formatted that way ("I" followed by the violation key, then "A" and finally the violation description)? How can I avoid making potentially tedious transformations of this metadata client-side?
Can you please log an issue on our grpc/grpc GitHub repo so that we can better follow up there? Thanks.
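As background on the formatting question: the trailer key ends in "-bin", which in gRPC marks binary metadata; here the value is the serialized google.rpc.BadRequest message itself, and the stray characters ("I", "A", the leading line break) are protobuf wire-format tag and length bytes rendered as text. Rather than transforming that string by hand, the bytes can be parsed back with the generated BadRequest class; a minimal sketch on the Java side:
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.rpc.BadRequest;

public class BadRequestTrailerDecoder {
    // trailerBytes: the raw value of the "google.rpc.badrequest-bin" trailer
    public static void printViolations(byte[] trailerBytes) throws InvalidProtocolBufferException {
        BadRequest badRequest = BadRequest.parseFrom(trailerBytes);
        for (BadRequest.FieldViolation violation : badRequest.getFieldViolationsList()) {
            System.out.println(violation.getField() + ": " + violation.getDescription());
        }
    }
}
The generated PHP class for google.rpc.BadRequest should allow the same decoding on the client side.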

How to access Kafka headers while consuming a message?

Below is my configuration:
<int-kafka:inbound-channel-adapter id="kafkaInboundChannelAdapter"
kafka-consumer-context-ref="consumerContext"
auto-startup="true"
channel="inputFromKafka">
<int:poller fixed-delay="1" time-unit="MILLISECONDS" />
</int-kafka:inbound-channel-adapter>
inputFromKafka goes through the transformation below:
public Message<?> transform(final Message<?> message) {
System.out.println( "KAFKA Message Headers " + message.getHeaders());
final Map<String, Map<Integer, List<Object>>> origData = (Map<String, Map<Integer, List<Object>>>) message.getPayload();
// some code to figure-out the nonPartitionedData
return MessageBuilder.withPayload(nonPartitionedData).build();
}
The print statement above only ever prints these two headers, regardless of what was sent:
KAFKA Message Headers {id=9c8f09e6-4b28-5aa1-c74c-ebfa53c01ae4, timestamp=1437066957272}
While sending the Kafka message, some headers were passed, including KafkaHeaders.MESSAGE_KEY, but I am not getting that back either. Is there a way to accomplish this?
Unfortunately it doesn't work that way...
The Producer part (KafkaProducerMessageHandler) looks like this:
this.kafkaProducerContext.send(topic, partitionId, messageKey, message.getPayload());
As you can see, we don't send any message headers to the Kafka topic: only the payload, and exactly under that messageKey, as specified by the Kafka protocol.
On the other side, the consumer (KafkaHighLevelConsumerMessageSource) does this:
if (!payloadMap.containsKey(messageAndMetadata.partition())) {
final List<Object> payload = new ArrayList<Object>();
payload.add(messageAndMetadata.message());
payloadMap.put(messageAndMetadata.partition(), payload);
}
As you can see, we don't care about the messageKey here.
The KafkaMessageDrivenChannelAdapter (<int-kafka:message-driven-channel-adapter>) is for you! It does this before sending the message to the channel:
KafkaMessageHeaders kafkaMessageHeaders = new KafkaMessageHeaders(generateMessageId, generateTimestamp);
Map<String, Object> rawHeaders = kafkaMessageHeaders.getRawHeaders();
rawHeaders.put(KafkaHeaders.MESSAGE_KEY, key);
rawHeaders.put(KafkaHeaders.TOPIC, metadata.getPartition().getTopic());
rawHeaders.put(KafkaHeaders.PARTITION_ID, metadata.getPartition().getId());
rawHeaders.put(KafkaHeaders.OFFSET, metadata.getOffset());
rawHeaders.put(KafkaHeaders.NEXT_OFFSET, metadata.getNextOffset());
if (!this.autoCommitOffset) {
rawHeaders.put(KafkaHeaders.ACKNOWLEDGMENT, acknowledgment);
}
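With the message-driven adapter in place, the transformer from the question can then read those headers directly; a minimal sketch (the KafkaHeaders constants are the ones populated above; the import path is the Spring Integration Kafka 1.x location and may differ in your version):
import org.springframework.integration.kafka.support.KafkaHeaders;
import org.springframework.messaging.Message;

public class KafkaHeaderAwareTransformer {
    public Message<?> transform(final Message<?> message) {
        // Populated by KafkaMessageDrivenChannelAdapter, as shown above
        Object messageKey = message.getHeaders().get(KafkaHeaders.MESSAGE_KEY);
        Object topic = message.getHeaders().get(KafkaHeaders.TOPIC);
        Object partitionId = message.getHeaders().get(KafkaHeaders.PARTITION_ID);
        Object offset = message.getHeaders().get(KafkaHeaders.OFFSET);
        System.out.println("key=" + messageKey + ", topic=" + topic
                + ", partition=" + partitionId + ", offset=" + offset);
        return message;
    }
}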
As stated before, there's no concept of message headers in Kafka. Because I struggled with that same problem in the past, I've compiled a small library that helps tackle this issue. It may come in handy to you.
