I have a UTF8-mb4 character in MongoDB and a Java extractor which extracts data from MongoDB and puts it into Kafka. When the data goes to Kafka, the special character is replaced with a \u... escape.
Sample text:- "\uDBFF\uDC15COMPANY"
I have another Java program which reads from one Kafka topic and, after some operation, puts the data into another Kafka topic. When the data is read from the source topic, the \u... escape is decoded to the actual special character, and when the data is pushed to the target topic it appears as a junk character. How do I write the data to the target topic as \u... again?
The same message in the target topic is like,
"COMPANY"
Note:-
The message contains a lot of data (JSON), and a special character could appear in any JSON value.
For the consumer reading from the source topic:
key.deserializer = "org.apache.kafka.common.serialization.ByteArrayDeserializer"
value.deserializer = "org.apache.kafka.common.serialization.ByteArrayDeserializer"
For the producer writing to the target topic:
key.serializer = "org.apache.kafka.common.serialization.ByteArraySerializer"
value.serializer = "org.apache.kafka.common.serialization.ByteArraySerializer"
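(For reference, those properties correspond to a consumer and producer wired up roughly as in the sketch below; the bootstrap server and group id are placeholders, not taken from the question.)
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;

Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");   // placeholder
consumerProps.put("group.id", "bridge-app");                // placeholder
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");   // placeholder
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps);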
Since you're using ByteArraySerializer to preserve the data exactly as written (rather than, say, StringSerializer, which defaults to UTF-8 encoding), you may well get characters from the database that cannot be displayed as ASCII, so you end up with a junk/replacement character instead.
The same message in the target topic is like,
"COMPANY"
It's unclear what you're using to view this data, but perhaps the issue lies in the encoding of that program, not in Kafka itself or your choice of serializer.
You can try other serializers.
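To illustrate, here is a small sketch of the round trip the bridging app performs, using an explicit UTF-8 charset everywhere (the sample value is the one from the question; the rest is illustrative):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// the sample value from the question, as the extractor would have produced it
String original = "\uDBFF\uDC15COMPANY";
byte[] fromKafka = original.getBytes(StandardCharsets.UTF_8);   // what ByteArrayDeserializer hands you

// if the bridging app needs a String, decode and re-encode with an explicit charset,
// never the platform default
String decoded = new String(fromKafka, StandardCharsets.UTF_8);
byte[] toKafka = decoded.getBytes(StandardCharsets.UTF_8);      // what ByteArraySerializer will write

// the bytes are identical, so the character itself survives the round trip;
// whether it renders as "junk" depends on the encoding of whatever displays it
System.out.println(Arrays.equals(fromKafka, toKafka));          // true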
I have a Spring Boot app configured with a RabbitMQ listener. It has to receive JSON data in the format below (sample shown):
{ "name" :"abc",
"key" : "somekey",
"value" : {"data": {"notes": "**foo \u0026 bar"}}**
}
This data represents some info which should be used only for read-only processing, and the receiving Spring app should receive it as it is (in raw form).
What I mean is: if I assert the value node in the Spring app against the input that was published on the queue, they should be equal.
This is simply not happening.
I always get the value in the Spring app as
foo & bar, but I wanted it in raw form, without conversion of the \u codes.
I tried several approaches:
a Jackson2JsonMessageConverter,
passing the byte[] from Message.getBody() to mapper.readValue() in the Rabbit handler,
using the JSON-simple and Gson libraries.
Why is it so tricky to get the data as it is, without any conversion or translation?
Do I need to follow an alternative approach?
Any help is appreciated
Have you tried explicitly enabling the escaping of non-ASCII characters on your ObjectMapper?
mapper.getFactory().configure(JsonGenerator.Feature.ESCAPE_NON_ASCII, true);
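As a minimal sketch of what that setting does (the JSON value below is illustrative; the feature applies to characters above 0x7F):
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
// write every character above 0x7F back out as a \uXXXX escape
mapper.getFactory().configure(JsonGenerator.Feature.ESCAPE_NON_ASCII, true);

// the parser always decodes the escape in memory ...
JsonNode node = mapper.readTree("{\"notes\": \"caf\\u00e9\"}");
System.out.println(node.get("notes").asText());       // café
// ... but with the feature enabled it is escaped again when written out
System.out.println(mapper.writeValueAsString(node));  // {"notes":"caf\u00e9"}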
I have a Flink app which reads MySQL CDC JSON messages from Kafka. JSON CDC strings for 5 tables are read and processed, and I use an overridden AbstractKafkaDeserializationSchema to turn the Kafka byte[] into my customized bean object. But I found that for 2 of the 5 tables, the Kafka input byte[] takes much longer to convert to a String than for the other 3 tables; in the worst case it just gets stuck there for minutes, seemingly forever, and there is backpressure on the Source subtask in the Flink web UI. The conversion is just String strValue = new String(valueBytes). I also tried new String(valueBytes, "UTF-8") and new String(valueBytes, StandardCharsets.US_ASCII); it makes no difference. The overridden method is just:
deserialize(byte[] keyBytes, byte[] valueBytes, String topic, int partition, long offset) throws IOException
This problem has stopped me from releasing the app for a week. Since the conversion is so simple, I can't find any alternative way to do it; I searched on Stack Overflow and found some similar complaints, but no working solution for me.
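For reference, a minimal sketch of the kind of override described above, with the charset made explicit (MyBean and the enclosing class are placeholders, not the actual app code):
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class CdcDeserializer {

    public static class MyBean {          // placeholder for the real CDC bean
        public String rawJson;
    }

    public MyBean deserialize(byte[] keyBytes, byte[] valueBytes,
                              String topic, int partition, long offset) throws IOException {
        // decode with an explicit charset so the result does not depend on the JVM default;
        // the copy itself is linear in the message size
        String strValue = new String(valueBytes, StandardCharsets.UTF_8);
        MyBean bean = new MyBean();
        bean.rawJson = strValue;          // real code would parse the CDC JSON here
        return bean;
    }
}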
I have to process messages in a Dead Letter Queue (DLQ) using the JMS API. The goal is to read the body of the original messages and their user properties. I realize that such an approach to DLQ processing might be considered bad design, but I have to deal with it anyway.
Once read with JMS, the body of a DLQ message contains the body of the original one, prepended with a DLH (dead-letter header) and a structure very similar to the RFH2 header of the original message (and therefore containing all the needed user properties).
The question is: how do I parse these two structures in Java?
So far I have only found a doc about how a DLH can be constructed from raw data (https://www.ibm.com/support/knowledgecenter/SS8JB4/com.ibm.wbpm.main.doc/topics/esbprog_bindings_wmq5.html). But while the DLH seems to be a fixed-length structure, the RFH2 is definitely not, so that is the trickiest part of the parsing.
Any idea would be appreciated.
UPDATE
Here is what I have found:
1) The DLH was parsed from the raw byte array without any problem, as simply as:
MQDLH rfh = new MQDLH(new DataInputStream(new ByteArrayInputStream(bytes)));
Once constructed, all the properties are available.
2) The MQRFH2 could be created in a similar manner if the MQLONG values were written there as usual, in big-endian order. But for some reason, completely unclear to me, in this case all the MQLONGs are little-endian.
So, to create an MQRFH2 from raw bytes I have to reverse the bytes of all the MQLONGs. Not a problem for the fixed part (as described in https://www.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.dev.doc/q032000_.htm), but a bit more complicated for the variable part.
I haven't seen any confirmation in the docs, but it seems that each folder in the variable part is prepended with an MQLONG (just a 4-byte integer) containing the folder length. Once these values were converted from LE to BE as well, the MQRFH2 seems to work correctly.
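For illustration, a small sketch of reading one of those little-endian 4-byte lengths with java.nio (the offset handling is hypothetical and depends on where the fixed part of the RFH2 ends in your buffer):
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Reads the 4-byte folder length that precedes a variable-part folder,
// interpreting it as little-endian, as observed in the DLQ messages above.
static int readFolderLength(byte[] rfh2Bytes, int offset) {
    return ByteBuffer.wrap(rfh2Bytes, offset, 4)
                     .order(ByteOrder.LITTLE_ENDIAN)
                     .getInt();
}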
I wouldn't process the DLQ with a JMS application. It will be so, so tricky and you will spend days or weeks trying to get it right. I would write a regular Java application to do it, far simpler.
i.e.
MQMessage rcvMsg = new MQMessage();
// (rcvMsg is assumed to be populated first, e.g. by MQQueue.get(rcvMsg) on the DLQ)
MQDLH dlh = new MQDLH(rcvMsg);                      // parses the dead-letter header
MQRFH2 rfh2 = new MQRFH2(rcvMsg);                   // parses the RFH2 that follows it
byte[] bData = new byte[rcvMsg.getDataLength()];    // remaining length is the original payload
rcvMsg.readFully(bData);
Updated on March 4, 2020.
I am normally not into banging my head against the wall but if you want to then here is the code that I would try:
ByteArrayInputStream bais = new ByteArrayInputStream(bytes);
DataInput di = new DataInputStream(bais);
MQDLH dlh = new MQDLH(di);
MQRFH2 rfh2 = new MQRFH2(di);
// Get all folders
String[] folderStrings = rfh2.getFolderStrings();
// or you can get individual name/values using
// get***FieldValue() methods of the MQRFH2 class.
/*
 * At this point, the cursor for "di" is pointing
 * to the beginning of the message payload, so I
 * would normally just read the remaining bytes:
 */
byte[] bData = new byte[bais.available()];
di.readFully(bData);
I have a Camel route that needs to receive an XML file from FTP as a stream, validate it and split it.
Everything works fine all the way to the validation, but then the split doesn't work as expected. When debugging, I found the split process doesn't find any processor when the original message is a stream. It looks very much like a bug to me.
from("direct:start")
.pollEnrich("ftp://user#host:21?fileName=file.xml&streamDownload=true&password=xxxx&fastExistsCheck=true&soTimeout=300000&disconnect=true")
.to("validator:myXsd.xsd")
.split().tokenizeXML("myTag")
.to(to)
.end();
In this case I can see the Exchange getting into the splitter, but no processor is found and the split does nothing. The behavior is different if I remove the validation:
from("direct:start")
.pollEnrich("ftp://user#host:21?fileName=file.xml&streamDownload=true&password=xxxx&fastExistsCheck=true&soTimeout=300000&disconnect=true")
.split().tokenizeXML("myTag")
.to(to)
.end();
In this case, the splitter works fine.
Also, if the XML file doesn't come from a stream, then everything is fine.
from("file:file.xml")
.to("validator:myXsd.xsd")
.split().tokenizeXML("myTag")
.to(to)
.end();
I updated my Camel version to 2.15.2 but still get the same error.
I don't know how the validator works, but if it is changing the message body, try to store the body in a header or property, for example .setHeader("headerName", simple("${body}")), and after the validator restore it with .setBody(simple("${header.headerName}")).
The problem was that I was trying to pass a body that was a stream (streamDownload=true). The validator reads the stream and validates the content; no problem there.
But the problem comes when the split runs: the stream has already been read and closed, so the split can't do anything with it.
I already worked around the problem by not using a stream, but I guess working with stream caching would also work if a stream is necessary.
See http://camel.apache.org/why-is-my-message-body-empty.html
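If a stream really is needed, a sketch of the stream-caching variant hinted at above might look like this (same endpoint options as in the question; alternatively, stream caching can be enabled globally with camelContext.setStreamCaching(true)):
from("direct:start")
    .streamCaching()   // cache the stream so both the validator and the splitter can read it
    .pollEnrich("ftp://user@host:21?fileName=file.xml&streamDownload=true&password=xxxx&fastExistsCheck=true&soTimeout=300000&disconnect=true")
    .to("validator:myXsd.xsd")
    .split().tokenizeXML("myTag")
        .to(to)
    .end();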
I'm trying to read a file which has multiple delimited messages in it (in the thousands); how can I do this properly using Google protobufs?
This is how I'm writing the delimited messages:
MyMessage myMessage = MyMessage.parseFrom(msgBytes);   // msgBytes is the raw byte[]
myMessage.writeDelimitedTo(fileOutputStream);          // an open FileOutputStream
And this is how I'm reading the delimited file:
CodedInputStream is = CodedInputStream.newInstance(new FileInputStream("/location/to/file"));
while (!is.isAtEnd()) {
    int size = is.readRawVarint32();
    MyMessage msg = MyMessage.parseFrom(is.readRawBytes(size));
    // do stuff with your messages
}
I'm kind of confused because the accepted answer in this question says to use .parseDelimitedFrom() to read the delimited bytes: Google Protocol Buffers - Storing messages into file
However, when using .parseDelimitedFrom(), it only reads the first message. (I don't know how to read the whole file using parseDelimitedFrom()).
This comment says to write the delimited messages using CodedOutputStream: Google Protocol Buffers - Storing messages into file (i.e. writer.writeRawVarint32()). I'm currently using the implementation from this comment to read the whole file. Does writeDelimitedTo() basically do the same thing as
writer.writeRawVarint32(bytes.length);
and
writer.writeRawBytes(bytes);
Also, if my way isn't the proper way of reading a whole file consisting of delimited messages, can you please show me what is?
thank you.
Yes, writeDelimitedTo() simply writes the length as a varint followed by the bytes. There's no need to use CodedOutputStream directly if you're working in Java.
parseDelimitedFrom() parses one message, but you may call it repeatedly to parse all the messages in the InputStream. The method will return null when you reach the end of the stream.
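For example, a minimal read loop along those lines, assuming the same generated MyMessage type as in the question:
import java.io.FileInputStream;
import java.io.InputStream;

try (InputStream in = new FileInputStream("/location/to/file")) {
    MyMessage msg;
    // parseDelimitedFrom() returns null once the end of the stream is reached
    while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
        // do stuff with your messages
    }
}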