Exceptions when reading protobuf messages in Java

I have been using protobuf for some weeks now, but I still keep getting exceptions when parsing protobuf messages in Java.
I use C++ to create my protobuf messages and send them over Boost sockets to a server socket where the Java client is listening. The C++ code for transmitting the message is this:
boost::asio::streambuf b;
std::ostream os(&b);
ZeroCopyOutputStream *raw_output = new OstreamOutputStream(&os);
CodedOutputStream *coded_output = new CodedOutputStream(raw_output);
// Prefix the serialized message with its length as a varint32.
coded_output->WriteVarint32(agentMessage.ByteSize());
agentMessage.SerializeToCodedStream(coded_output);
// Deleting the streams flushes the encoded bytes into the streambuf.
delete coded_output;
delete raw_output;
boost::system::error_code ignored_error;
boost::asio::async_write(socket, b.data(), boost::bind(
        &MessageService::handle_write, this,
        boost::asio::placeholders::error));
As you can see, I write the length of the message with WriteVarint32, so the Java side should know how far to read by using parseDelimitedFrom:
AgentMessage agentMessage = AgentMessageProtos.AgentMessage
.parseDelimitedFrom(socket.getInputStream());
But it's no help, I keep getting these kind of Exceptions:
Protocol message contained an invalid tag (zero).
Message missing required fields: ...
Protocol message tag had invalid wire type.
Protocol message end-group tag did not match expected tag.
While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
It is important to know that these exceptions are not thrown on every message. Only a fraction of the messages I receive trigger them; most work out just fine. Still, I would like to fix this since I do not want to drop those messages.
I would be really grateful if someone could help me out or share their ideas.
Another interesting fact is the number of messages I receive. A total of 1,000 messages in 2 seconds is normal for my program; in 20 seconds it's about 100,000, and so on. I reduced the number of messages sent artificially, and when only 6-8 messages are transmitted, there are no errors at all. So might this be a buffering problem on the Java client socket side?
Out of, say, 60,000 messages, about 5 are corrupted on average.

[I'm not really a TCP expert, this may be way off]
The problem is that [Java] TCP Socket's read(byte[] buffer) will return after reading to the end of the TCP frame. If that happens to be mid-message (I mean, mid-protobuf-message), the parser will choke and throw an InvalidProtocolBufferException.
Any protobuf parsing call uses CodedInputStream internally, which, when the source is an InputStream, relies on read() -- and, consequently, is subject to the TCP socket issue.
So, when you stuff big amounts of data through your socket, some messages are bound to be split across two frames -- and that's where they get corrupted.
I'm guessing that when you lower the message transfer rate (as you said, to 6-8 messages), each frame gets sent before the next piece of data is put into the stream, so each message always gets its very own TCP frame, i.e. none get split and you see no errors. (Or maybe it's just that the errors are rare and the low rate means you need more time to see them.)
As for the solution, your best bet would be to handle the buffer yourself, i.e. read a byte[] from the socket (probably using readFully() instead of read(), because the former blocks until either there's enough data to fill the buffer or an EOF is encountered, so it's resistant to the mid-message frame end thing), ensure it's got enough data to be parsed into a whole message, and then feed the buffer to the parser.
Also, there's some good read on the subject in this Google Groups topic -- that's where I got the readFully() part.
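For illustration, here is a minimal sketch of that approach, assuming the C++ side prefixes each message with a varint32 length as in the question; AgentMessage stands in for the generated class, and readRawVarint32 is the same helper parseDelimitedFrom uses internally:
import com.google.protobuf.CodedInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Sketch only: read one varint32-length-prefixed message from the stream.
static AgentMessage readOneMessage(DataInputStream in) throws IOException {
    int firstByte = in.read();
    if (firstByte == -1) {
        return null; // clean end of stream
    }
    // Decode the varint32 length prefix written by WriteVarint32 on the C++ side.
    int length = CodedInputStream.readRawVarint32(firstByte, in);
    byte[] body = new byte[length];
    // readFully blocks until the whole body has arrived, so a message
    // split across TCP frames is reassembled before parsing.
    in.readFully(body);
    return AgentMessage.parseFrom(body);
}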

I am not familiar with the Java API, but I wonder how Java deals with a uint32 value denoting the message length, because Java only has signed 32-bit integers. A quick look at the Java API reference told me an unsigned 32-bit value is stored within a signed 32-bit variable. So how is the case handled where an unsigned 32-bit value denotes the message length? Also, there seems to be support for varint signed integers in the Java implementation; they are called ZigZag32/64. AFAIK, the C++ version doesn't know about such encodings. So maybe the cause of your problem is related to one of these things?

Related

Serialize multiple protobuf messages in Java and deserialize them in Python

I want to store a bunch of protobuf messages in a file, and read them later.
In Java, I can just use writeDelimitedTo and parseDelimitedFrom to write and read multiple messages to and from a file. However, I want to read them in Python, which only seems to have a ParseFromString method.
Some SO questions are very similar, such as Parsing Protocol Buffers, written in Java and read in Python, but that one only covers a single message, not multiple.
The proto guide says you need to keep track of your message sizes yourself:
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)
https://developers.google.com/protocol-buffers/docs/techniques
A simple solution would be to serialize each proto as base64, one message per line in your file. That way it is pretty easy to parse and use them in Python.
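As a rough illustration of the Java writing side of that idea (messages.txt and MyMessage are placeholders for your own file and generated class):
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Base64;

// Sketch only: one base64-encoded serialized message per line.
static void writeMessages(Iterable<MyMessage> messages) throws IOException {
    try (PrintWriter out = new PrintWriter("messages.txt")) {
        for (MyMessage msg : messages) {
            out.println(Base64.getEncoder().encodeToString(msg.toByteArray()));
        }
    }
}
On the Python side, each line can then be base64-decoded and handed to ParseFromString.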

Java Protocol Buffers - Message sizes

So, for the past few weeks I've been learning very simple network programming and protocol buffers. Right now, I have a Java client and a C# server that communicate back and forth using protocol buffers. It's all working fine, but to make it work on the client (Java) side I had to create my byte array with the exact size of the incoming message, or else the parser would throw a "Protocol message contained an invalid tag (zero)" error.
After doing some research, I found out that the array I had created (1024 bytes) for my DatagramPacket had tons of trailing zeros (since my incoming data from the server was 27 bytes long), and that's why, as previously mentioned, I now have to create the array with the exact size of the incoming data.
As for the question: is there any way to find out the size of my proto messages from my .proto files? If there isn't some sort of static getSize(), is there a way I can calculate it just from the types of fields I have within the message?
The message I'm using right now contains 3 doubles. Now that I'm thinking about it (though I'd like a definite answer from someone who knows what's going on): is it 27 bytes because of 8 bytes per double plus 1 byte for the tag on each message field?
The root object in protobuf data is not self-terminating; it is designed to be appendable (with append===merge), so normally the library simply reads until it runs out of data. If you have spare zeros, it will fail to parse the next field header. There are two ways of addressing this:
if you only want to send one message, simply close the outbound socket at the end of your message; the client should detect the end of the socket and compensate accordingly (note, you still don't want to use an oversized buffer unless you are using a length-limited stream wrapper)
use some kind of "framing" protocol, the simplest of which is simply to prefix each message with the number of bytes in that message (note that in the general case this size is not fixed, but in the case of 3 doubles, each with a field-header of a field-number no greater than 16, then yes: it will be 27 bytes); you would then either create the buffer the right size (noting that repeated array allocations can be expensive), or more typically: use a length-limited stream wrapper, or a memory-backed in-memory stream
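If you do go the framing route over stream-based sockets, note that protobuf's Java API already ships helpers for exactly this varint-length-prefix scheme (a sketch, assuming a generated MyMessage class; your UDP setup would need the manual prefix instead):
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch only: the delimited helpers write/read a varint length + body.
static void send(MyMessage msg, OutputStream out) throws IOException {
    msg.writeDelimitedTo(out);
    out.flush();
}

static MyMessage receive(InputStream in) throws IOException {
    return MyMessage.parseDelimitedFrom(in); // returns null at end of stream
}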
I believe your problem lies in your socket receive code. Having an array with trailing zeroes is not a problem, but when receiving you should check the number of bytes received (it is the return value of a receive call) and only consider the bytes of the buffer array from the beginning up to "bytes received".
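A sketch of that fix for a datagram-based receive (MyMessage again stands in for the generated class):
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.Arrays;

// Sketch only: oversized receive buffer, but parse only the bytes received.
static MyMessage receiveOne(DatagramSocket socket) throws IOException {
    byte[] buffer = new byte[1024];
    DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
    socket.receive(packet);
    // Trailing zeros beyond getLength() are never handed to the parser.
    return MyMessage.parseFrom(Arrays.copyOfRange(buffer, 0, packet.getLength()));
}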

How do i start reading byte through input stream from a specific Location in the stream?

I am using the URL class in Java and I want to read bytes through an InputStream from a specific byte location in the stream, instead of using the skip() function, which takes a lot of time to get to that specific location.
I suppose it is not possible, and here is why: when you send a GET request, the remote server does not know that you are interested in bytes 100 till 200 -- it sends you the full document/file. So you need to read those bytes, but you don't need to handle them -- that is why skip() is slow.
But I am sure that you can tell the server (some of them support it, some don't) that you want the file starting from byte 100.
Also: see this to get in-depth knowledge about skip mechanics: How does the skip() method in InputStream work?
The nature of streams means you will need to read through all the data to get to the specific place you want to start from. You will not get faster than skip(), unfortunately.
The simple answer is that you can't.
If you perform a GET that requests the entire file, you will have to use skip() to get to the part that you want. (And in fact, the slowness is most likely because the server has to send all of the data that is being skipped to the client. That is how TCP/IP works ...)
However, there is a possible alternative. The HTTP 1.1 specification supports partial fetching of documents using the Range header. If your server supports this, you can ask the server to send you just the range of the document that you are interested in. However, you may need to deal with the case where the server ignores the Range header and sends the entire document anyway.
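A minimal sketch of that approach with HttpURLConnection (the URL and byte range are placeholders; note the fallback for servers that ignore the header):
import java.io.DataInputStream;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch only: request bytes 100..199, falling back to skipping if ignored.
static byte[] fetchRange() throws IOException {
    URL url = new URL("http://example.com/file.bin"); // placeholder URL
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestProperty("Range", "bytes=100-199");
    try (DataInputStream in = new DataInputStream(conn.getInputStream())) {
        byte[] part = new byte[100];
        if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) { // not 206
            in.readFully(new byte[100]); // server ignored Range: discard bytes 0..99
        }
        in.readFully(part);
        return part;
    }
}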

Unknown AMF type & other errors when transfering files from flash to java

I'm using a Flash tool to transfer data to Java. I'm having problems when it comes to sending multiple objects at once. The objects being sent are just generic Object objects, so it's not a case of needing to register a class alias or anything.
Sending one object works fine. Once I start sending multiple objects (putting the same Objects in an Array and sending that), it starts getting weird. Up to three objects in an Array seems to work fine; with more than that, I get different errors in the readObject() function, such as:
Unknown AMF type '47'
Unknown AMF type '40'
Unknown AMF type '20'
OutOfBoundsExceptions index 23, size 0
NullPointerException
etc.
Sending 3 objects will work, sending 4 gives me the error. If I delete one of the previous 3 that worked (while keeping the fourth that was added), it'll work again. Anyone know what's going on?
Some more info:
Communication goes through a Socket class on the Flash side. This is pure AS3, no flex.
Messages are compressed before being sent and decompressed on the server, so I'm pretty sure it's not a buffer size issue (unless I'm missing something)
BlazeDS version seems to be 4.0.0.14931 on the jar
Flash version is 10.1 (it's an AIR app)
Update with rough code
Examples of the objects being sent:
var o:Object = { };
o._key = this._key.toString();
o.someParam = someString;
o.someParam2 = someInt;
o.someParam3 = [someString1, someString2, someString3];
...
It's added to our event object (which we use to determine the event to call, the data to pass, etc.). The event object has been registered as a class alias.
That object is sent to the server through a Socket like so:
myByteArray.writeObject( eventObj );
myByteArray.compress();
mySocket.writeBytes( myByteArray );
mySocket.flush();
On the server side, we receive the bytes and decompress them. We create an Amf3Input object, set the input stream, then read it:
Amf3Input amf3Input = new Amf3Input( mySerializationContext );
amf3Input.setInputStream( new ByteArrayInputStream( buffer ) ); // buffer is a byte[]
MyEventObj eventObj = (MyEventObj)amf3Input.readObject(); // MyEventObj is the server version of the client event object
If it's going to crash with an "Unknown AMF type" error, it does so immediately, i.e. when we try to read the object, and not when it's trying to read a sub-object.
In stepping through the read code, it seems that when I pass an array of objects, if the length is <= 4, it reads the length right. If the length is bigger than that, it reads its length as 4.
If you're getting AMF deserialization errors, there are several possible issues that could be contributing to the problem. Here are several techniques for doing further diagnostics:
Use a network traffic sniffer to make sure that what you are sending matches what you are receiving. On the Mac I'll use CocoaPacketAnalyzer, or you can try Charles, which can actually decode AMF packets that it notices.
Feed the data to a different AMF library, like PyAMF or RocketAMF to see if it's a problem with BlazeDS or with how you're calling it. It's also possible that you may get a different error message that will give you a better idea of where it's failing.
Check the format of the AMF packet. AMF server calls have some additional wrapping around them that would throw off a deserializer if it's not expecting to encounter that wrapping, and vice versa for purely serialized objects. Server call packets always start off with a 0x00, followed by the AMF version (0x00, 0x03, or in rare cases 0x02).
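As a rough illustration of that last check (a sketch only, assuming you already have the raw packet bytes in hand):
// Sketch only: does the packet start like an AMF server call?
static boolean looksLikeAmfServerCall(byte[] packet) {
    // Server-call packets start with 0x00 followed by the AMF version
    // (0x00, 0x03, or in rare cases 0x02).
    return packet.length >= 2
            && packet[0] == 0x00
            && (packet[1] == 0x00 || packet[1] == 0x03 || packet[1] == 0x02);
}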
Ok, I figured out the problem. Basically, messages are compressed before being sent and decompressed on the server. What I didn't see was that the byte[] buffer the message was being decompressed into was always 1024 bytes long, which was fine for small arrays of objects. Once that size was exceeded, however, it would overwrite the buffer (I'm not quite sure what happens in Java when you try to write more bytes than there is room for -- whether it loops back around, or shifts the data).
When it came to reading the AMF object, the first thing it does is read an int and use it to determine what type of object it's trying to decode. As this int was gibberish (47, 110, -10), it was failing.
Time to start prepending message lengths I think :)
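A minimal sketch of that length-prefixing on the Java receiving side (assuming the Flash side writes a 4-byte length before each compressed payload):
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch only: read one length-prefixed frame, sized exactly to fit.
static byte[] readFrame(InputStream rawIn) throws IOException {
    DataInputStream in = new DataInputStream(rawIn);
    int length = in.readInt();          // 4-byte length prefix
    byte[] payload = new byte[length];  // no fixed 1024-byte buffer to overrun
    in.readFully(payload);              // blocks until the whole frame arrives
    return payload;                     // decompress and feed to Amf3Input from here
}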
Thanks for the help.

How to verify that an object transfer is complete

I want to send a message (a serializable object) from a java instance to another instance over a network. I would like to verify that the whole object has been sent correctly.
I suppose my first step is to calculate the checksum of the object. Then I include that checksum in the object OR build a container object for the message and its checksum.
Then my second step should be to verify the checksum against the object on the other side.
My third step would be for the receiver to send a confirmation message saying that the object in question was received and that the checksum has passed (or not). If I receive a failed checksum warning, I try to resend it a few times.
After a little while, if I never received a confirmation, I try to resend it a few times as well.
Questions :
Does my protocol sound right for verifying that an object was transferred correctly?
I would also like to know how I am supposed to implement this in Java. Do I use the CRC32 class to generate the checksum?
Bonus question: if I were to compress each message, do I generate the checksum before or after the compression, and how do I compress an object in Java?
If you have a reasonably reliable network with a low error rate, you shouldn't need to add an additional checksum. I would implement your protocol without a checksum first and add one if you are sure you need it.
You can compress the data with DeflaterOutputStream and InflaterInputStream. If the compressed data is corrupted, the Object is highly likely to throw an exception when unpacking; i.e. it is very unlikely to produce a subtle error.
However, unless your objects are large, they may not compress very well and could end up being larger with compression.
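To make that concrete, here is a minimal sketch of serializing and compressing an object, then checksumming the compressed bytes (checksumming what actually goes over the wire is one reasonable answer to the bonus question):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.CRC32;
import java.util.zip.DeflaterOutputStream;

// Sketch only: serialize and compress an object into a byte[].
static byte[] compress(Serializable obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out =
            new ObjectOutputStream(new DeflaterOutputStream(bytes))) {
        out.writeObject(obj);
    } // closing finishes the deflater and flushes everything into bytes
    return bytes.toByteArray();
}

static long checksum(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data); // CRC32 over the compressed bytes being sent
    return crc.getValue();
}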
For compression, I would like to recommend the Apache Zip Utilities:
http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/package-summary.html
If you are compressing then you can skip the checksum. Compression algorithms are quite sensitive to data damage. If the object fails to decompress on the other end, then you lost some information during transmission.
I agree with your steps, with one addition (this is what I do, over ObjectIO streams):
[1] Read the incoming stuff as an Object and find its class with instanceof. If it is not the expected class, it's time to debug what is coming in.
In some other situations, I also send out Strings that contain info about the contents of the object to arrive next. Parse this String, read the object, typecast it, make sure it has what the info in the String said, and write out the confirmation :)
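A sketch of that check over object streams (ExpectedMessage is a placeholder for whatever class you actually expect):
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Sketch only: verify the incoming class, then confirm back to the sender.
static void receiveAndConfirm(ObjectInputStream in, ObjectOutputStream out)
        throws IOException, ClassNotFoundException {
    Object incoming = in.readObject();
    if (incoming instanceof ExpectedMessage) {
        out.writeObject("OK"); // content/checksum checks would go here
    } else {
        out.writeObject("FAIL: got " + incoming.getClass().getName());
    }
    out.flush();
}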
