How to fully read a file with delimited messages in Google Protobufs? - java

I'm trying to read a file, which has multiple delimited messages in it (in the thousands), how can I do this properly using Google protobufs?
This is how I'm writing the delimited:
MyMessage myMessage = MyMessage.parseFrom(msg); // msg is a byte[]
myMessage.writeDelimitedTo(fileOutputStream);   // fileOutputStream is a FileOutputStream
and this is how I'm reading the delimited file:
CodedInputStream is = CodedInputStream.newInstance(new FileInputStream("/location/to/file"));
while (!is.isAtEnd()) {
    int size = is.readRawVarint32();
    MyMessage msg = MyMessage.parseFrom(is.readRawBytes(size));
    // do stuff with your messages
}
I'm kind of confused because the accepted answer to this question says to use .parseDelimitedFrom() to read the delimited bytes: Google Protocol Buffers - Storing messages into file
However, when using .parseDelimitedFrom(), it only reads the first message. (I don't know how to read the whole file using parseDelimitedFrom()).
This comment says to write the delimited messages using CodedOutputStream: Google Protocol Buffers - Storing messages into file (i.e. writer.writeRawVarint32()). I'm currently using the approach from that comment to read the whole file. Does writeDelimitedTo() basically do the same thing as
writer.writeRawVarint32(bytes.length);
and
writer.writeRawBytes(bytes);
Also, if my way isn't the proper way of reading a whole file consisting of delimited messages, can you please show me what is?
Thank you.

Yes, writeDelimitedTo() simply writes the length as a varint followed by the bytes. There's no need to use CodedOutputStream directly if you're working in Java.
parseDelimitedFrom() parses one message, but you may call it repeatedly to parse all the messages in the InputStream. The method will return null when you reach the end of the stream.
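For example, a minimal sketch of that loop (assuming a generated message class named MyMessage; the file path is a placeholder):

import java.io.FileInputStream;
import java.io.InputStream;

public class DelimitedReader {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("/location/to/file")) {
            MyMessage msg;
            // parseDelimitedFrom() returns null once the end of the stream is reached
            while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
                // do stuff with each message
                System.out.println(msg);
            }
        }
    }
}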

Related

UTF8-mb4 in Kafka

I have a UTF8-mb4 char in Mongo and a Java extractor which extracts data from Mongo and puts it into Kafka. When the data goes to Kafka, the special char is replaced with \u...
Sample text:- "\uDBFF\uDC15COMPANY"
I have another Java program which reads from one Kafka topic and puts the data into another Kafka topic after some operation. When the data is read from the source topic, the \u... is decoded to an actual special char, and when the data is pushed to the target topic, it ends up as some junk char. How do I put the data back into the target topic as \u...?
The same message in the target topic looks like:
"􏰕COMPANY"
Note:-
The message has lots of data (JSON data) and there could be a special char in any JSON value.
For the consumer consuming from the source topic:
key.deserializer = "org.apache.kafka.common.serialization.ByteArrayDeserializer"
value.deserializer = "org.apache.kafka.common.serialization.ByteArrayDeserializer"
For the producer producing to the target topic:
key.serializer = "org.apache.kafka.common.serialization.ByteArraySerializer"
value.serializer = "org.apache.kafka.common.serialization.ByteArraySerializer"
Since you're using ByteArraySerializer, which preserves the data exactly as written (rather than, say, StringSerializer, which defaults to UTF-8 encoding), you're potentially going to get those database control characters, which cannot be displayed as ASCII, so you end up with 􏰕 instead.
The same message in the target topic looks like:
"􏰕COMPANY"
It's unclear what you're using to view this data, but perhaps the issue lies in the encoding of that program, not in Kafka itself or your choice of serializer.
You can try other serializers.
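As a rough sketch of that idea, assuming you keep ByteArrayDeserializer on the consumer side but decode the bytes as UTF-8 and produce with StringSerializer (the broker address, topic name, and helper method here are placeholders, not from the original post):

import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class Republisher {
    // consumedValue is the byte[] obtained from the ByteArrayDeserializer consumer
    static void republish(byte[] consumedValue) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Decode the raw bytes as UTF-8 so the 4-byte (surrogate-pair) character stays intact
        String text = new String(consumedValue, StandardCharsets.UTF_8);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("target-topic", text)); // topic name is a placeholder
        }
    }
}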

Write and read via protobuf CodedOutputStream and CodedInputStream from a socket in java

So I have a socket, and I send some data on it using protobuf CodedOutputStream like this:
int size = myMessage.getSerializedSize();
out.writeRawVarint32(size); // out = CodedOutputStream created from a java.io.OutputStream
myMessage.writeTo(out);
out.flush();
Client code compiles and writes. How do I read this correctly on the server side? If I use DataInputStream readByte() to read the size, I get a negative value... If I use CodedInputStream to read the size via readRawVarint32(), I get a large value, 10x the size of the serialized message I sent.
How do I read a protobuf message from an InputStream in Java?
CodedInputStream?
DataInputStream?
I read the docs and cannot find this documented anywhere. Do I need to drop down to the protocol level and start debugging bytes?
Be careful what you type in your IDE; I made this little blunder in my OutputStream definition:
OutputStream out = new ObjectOutputStream( client.getOutputStream() );
I was figuring out how to serialize a protobuf message. I guess at first I figured I should use ObjectOutputStream, added it, and forgot about it. Then I was left wondering why I could not read the data on the receiving end. When I fixed it and used a plain OutputStream with the protobuf methods to read/write data, everything started working.
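For reference, a minimal sketch of a matching write/read pair over the raw socket streams, assuming a generated message class named MyMessage and using writeDelimitedTo()/parseDelimitedFrom() instead of hand-rolling the varint with CodedOutputStream:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class ProtobufSocketIo {
    // Client side: write the message length-delimited, with no ObjectOutputStream wrapper
    static void send(Socket socket, MyMessage message) throws Exception {
        OutputStream out = socket.getOutputStream();
        message.writeDelimitedTo(out); // writes a varint length prefix, then the message bytes
        out.flush();
    }

    // Server side: the matching read
    static MyMessage receive(Socket socket) throws Exception {
        InputStream in = socket.getInputStream();
        return MyMessage.parseDelimitedFrom(in); // returns null if the stream has ended
    }
}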

How to read output file for collecting stats (post) processing

Summary
I need to build a set of statistics during a Camel server in-modify-out process and emit those statistics as one object (a single JSON log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other, file-section specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics gathering time)
output file metrics (same as input file metrics, and will be different numbers, output file being changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and it's not written to disk until 'process'ing finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it doesn't suit the requirement here, as it would output multi-line JSON and embed the JSON in a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense - which is one reason we need to iterate stats and measures before we start to tune or review)
Current State
Input file and even processing time stats are working OK, and I'm using the following technique to get them:
Inside the 'process' override method of "MyProcessor" I create a new instance of my JsonLogWriter class. (shortened pseudo code with ellipsis)
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...

@Component
public class MyProcessor implements Processor {
    ...
    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // input file is processed / changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
}
From this you can see that JsonLogWriter has
properties for file paths (input file, output file, log output),
a set of methods to populate the data,
a method to emit the data to a file (once ready)
Once I have populated all the JSON objects in the class, I call the write() method; the class pulls all the JSON objects together and the stats all arrive in a log file (in a single line of JSON) - OK.
Error - no output file (yet)
If I use the metricsOutputFile method however:
InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}
... the JsonLogWriter fails as the file doesn't exist yet.
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging, I can't see any part of the exchange or result objects which I might pipe into a file-read/statistics-gathering process.
Will this require more camel routes to solve? What might be an alternative approach where I can get all the stats from input and output files and keep them in one object / line of json?
(very happy to receive constructive criticism - as in why is your Java so heavy-handed - and yes it may well be, I am prototyping solutions at this stage, so this isn't production code, nor do I profess deep understanding of Java internals - I can usually get stuff working though)
Use one route and two processors: one for writing the file and the next for reading the file, so the first finishes writing before the second starts reading.
Alternatively, you can use two routes: one that writes the file (to:file) and another that listens for and reads the file (from:file); a sketch of this option is below.
You can check the common EIP patterns, which will solve most of these questions, here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/
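A rough sketch of the two-route option in Camel's Java DSL; the endpoint URIs and the OutputMetricsProcessor class are hypothetical placeholders, not from the original post:

import org.apache.camel.builder.RouteBuilder;

public class StatsRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Route 1: transform the input and write the result to disk
        from("file:aroute/input")
            .process(new MyProcessor())   // gathers input-file metrics and transforms the content
            .to("file:aroute/output");

        // Route 2: consumes the file written by route 1, so by the time this
        // processor runs the output file exists and its metrics can be gathered
        from("file:aroute/output?noop=true")
            .process(new OutputMetricsProcessor()); // hypothetical: calls metricsOutputFile() and write()
    }
}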

Apache Beam - Reading JSON and Stream

I am writing Apache Beam code, where I have to read a JSON file placed in the project folder, read the data, and stream it.
This is the sample code to read JSON. Is this the correct way of doing it?
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(SparkRunner.class);
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply("ReadMyFile", TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));
System.out.println("lines: " + lines);
or should I use:
p.apply(FileIO.match().filepattern("/Users/xyz/eclipse-workspace/beam-prototype/test.json"))
I just need to read the JSON file below, read the complete testdata from it, and then stream it.
{
  "testdata": {
    "siteOwner": "xxx",
    "siteInfo": {
      "siteID": "id_member",
      "siteplatform": "web",
      "siteType": "soap",
      "siteURL": "www"
    }
  }
}
The above code is not reading the JSON file; it is printing something like
lines: ReadMyFile/Read.out [PCollection]
Could you please guide me with a sample reference?
This is the sample code to read JSON. Is this the correct way of doing it?
To quickly answer your question, yes. Your sample code is the correct way to read a file containing JSON, where each line of the file contains a single JSON element. The TextIO input transform reads a file line by line, so if a single JSON element spans multiple lines, then it will not be parseable.
The second code sample has the same effect.
The above code is not reading the JSON file; it is printing something like
The printed result is expected. The variable lines does not actually contain the JSON strings in the file. lines is a PCollection of Strings; it simply represents the state of the pipeline after a transform is applied. Accessing elements in the pipeline can be done by applying subsequent transforms. The actual JSON string can be accessed in the implementation of a transform, for example:
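A minimal sketch, assuming the Beam Java SDK and reusing the question's file path (runner configuration is omitted); the DoFn simply prints each element, standing in for real JSON parsing:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReadJsonLines {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        PCollection<String> lines = p.apply("ReadMyFile",
                TextIO.read().from("/Users/xyz/eclipse-workspace/beam-prototype/test.json"));

        // The file contents are only visible inside a transform that runs per element
        lines.apply("PrintLines", ParDo.of(new DoFn<String, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                System.out.println("line: " + c.element()); // parse the JSON here instead of printing
            }
        }));

        p.run().waitUntilFinish();
    }
}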

How to change endianness when unmarshalling a CDR stream containing valuetype objects in Java

I've got marshaled CDR data all by itself in the form of a file (i.e., not packed in a GIOP message) which I need to unmarshal and display on the screen. I know in advance what type the data is, and I have working code that does this successfully:
ValueFactory myFactory = (ValueFactory)myConstructor.newInstance( objParam );
StreamableValue myObject = myFactory.init();
myObject._read( myCDRInputStream );
where init() calls the constructor of myObjectImpl() and _read is the org.omg.CORBA.portable.Streamable _read(InputStream) method.
This works as long as the marshaled data has the same endianness as the computer running my reader program, but I need to handle cases where the endianness of the data differs from the endianness of the computer running the reader. I know that endianness is indicated in GIOP messages, which I don't have. Assuming I figure out that I need to change the endianness, how can I tell this to the stream reader?
Thanks!
If you can get access to the underlying ByteBuffer of your input stream, then you can set the endianness on it. For example, I use this myself to open MATLAB files:
File file = new File("swiss_roll_data.matlab5");
FileChannel channel = new FileInputStream(file).getChannel();
ByteBuffer scan = channel.map(MapMode.READ_ONLY,0,channel.size());
scan.order(ByteOrder.BIG_ENDIAN);
However, I don't know if your CORBA framework is happy to read from a ByteBuffer (CORBA is so '90s), so maybe that does not work for you.
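If the CORBA stream can't be told about the byte order directly, one fallback is to read the raw CDR primitives yourself from a ByteBuffer whose order you set explicitly. A minimal sketch, where the file name and field layout are hypothetical:

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;

public class LittleEndianCdrPeek {
    public static void main(String[] args) throws Exception {
        try (FileChannel channel = new FileInputStream("marshaled.cdr").getChannel()) {
            ByteBuffer buf = channel.map(MapMode.READ_ONLY, 0, channel.size());
            buf.order(ByteOrder.LITTLE_ENDIAN); // flip this if the data was written big-endian

            // Hypothetical layout: a CDR long (4 bytes) followed by a CDR short (2 bytes).
            // Note that CDR aligns primitives to their size, so padding may need skipping too.
            int firstField = buf.getInt();
            short secondField = buf.getShort();
            System.out.println(firstField + " / " + secondField);
        }
    }
}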
