Java Protocol Buffers - Message sizes

So, for the past few weeks I've been learning some simple network programming and Protocol Buffers. Right now I have a Java client and a C# server communicating back and forth using protocol buffers. It all works, but to make it work on the Java client side I had to create my byte array with the exact size of the incoming message, or else the parser would throw the error "Protocol message contained an invalid tag (zero)".
After doing some research, I found that the 1024-byte array I had created for my DatagramPacket had tons of trailing zeros (my incoming data from the server was 27 bytes long), which is why, as mentioned, I now have to create the array with the exact size of the incoming data.
As for the question: is there any way to find out the size of all of my proto messages from my .proto files? If there isn't some sort of static getSize(), is there a way I can calculate it just from the types of the fields within the message?
The message I'm using right now contains 3 doubles. Now that I'm thinking about it (though I'd like a definite answer from someone who knows what's going on): is it 27 bytes because of 8 bytes per double plus 1 byte for the tag on each field?

The root object in protobuf data is not self-terminating; it is designed to be appendable (with append === merge), so normally the library simply reads until it runs out of data. If there are spare zeros, it will fail to parse the next field header. There are two ways of addressing this:
if you only want to send one message, simply close the outbound socket at the end of your message; the client should detect the end of the socket and compensate accordingly (note: you still don't want to use an oversized buffer unless you are using a length-limited stream wrapper)
use some kind of "framing" protocol; the simplest is to prefix each message with the number of bytes in that message (note that in the general case this size is not fixed, but in the case of 3 doubles, each with a one-byte field header for a field number no greater than 15, then yes: it will be 27 bytes); you would then either create the buffer at the right size (noting that repeated array allocations can be expensive), or, more typically, use a length-limited stream wrapper or an in-memory stream
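The framing option can be sketched in plain Java. This is a minimal illustration, not the library's API: the Java protobuf library already ships writeDelimitedTo/parseDelimitedFrom, which do varint-prefixed framing for you. The class and method names below are hypothetical, and a fixed 4-byte length prefix is used for clarity:

```java
import java.io.*;

public class Framing {
    // Write one frame: a 4-byte big-endian length, then the payload bytes.
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read one frame: read the length header, then exactly that many bytes.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] payload = new byte[len];
        in.readFully(payload);          // blocks until the whole frame has arrived
        return payload;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeFrame(out, "first".getBytes("UTF-8"));
        writeFrame(out, "second".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(readFrame(in), "UTF-8"));
        System.out.println(new String(readFrame(in), "UTF-8"));
    }
}
```

readFully() is what makes the oversized-buffer problem disappear: the reader consumes exactly len bytes per message, so trailing garbage is never handed to the parser.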

I believe the problem lies in your socket receive code. An array with trailing zeroes is not a problem in itself, but when receiving you should check the number of bytes actually received (the return value of the receive call) and only consider the bytes of the buffer from the beginning up to that count.
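For the UDP case in the question, that return value is exposed as DatagramPacket.getLength(). A minimal sketch (the class name is hypothetical, and the loopback demo in main is just for illustration):

```java
import java.net.*;
import java.util.Arrays;

public class UdpTrim {
    // Copy only the bytes the datagram actually contained.
    static byte[] exactBytes(DatagramPacket packet) {
        return Arrays.copyOfRange(packet.getData(),
                packet.getOffset(), packet.getOffset() + packet.getLength());
    }

    public static void main(String[] args) throws Exception {
        try (DatagramSocket receiver = new DatagramSocket(0);
             DatagramSocket sender = new DatagramSocket()) {
            byte[] payload = "27 bytes from the server...".getBytes("UTF-8");
            sender.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getLoopbackAddress(), receiver.getLocalPort()));

            byte[] buffer = new byte[1024];                 // oversized is fine
            DatagramPacket packet = new DatagramPacket(buffer, buffer.length);
            receiver.receive(packet);                       // getLength() is now set

            byte[] exact = exactBytes(packet);              // payload-sized, not 1024
            System.out.println(exact.length);
        }
    }
}
```

With protobuf you can also skip the copy and parse in place via the parseFrom(byte[], int, int) overload on the generated message class, passing packet.getOffset() and packet.getLength().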

Related

Serialize multiple protobuf messages in Java and deserialize them in Python

I want to store a bunch of protobuf messages in a file, and read them later.
In Java, I can just use writeDelimitedTo and parseDelimitedFrom to write and read a file. However, I want to read it in Python, which only seems to have a ParseFromString method.
Some SO questions are very similar, such as Parsing Protocol Buffers, written in Java and read in Python, but that one covers only a single message, not multiple.
The proto guide says that you need to keep track of message sizes yourself:
Streaming Multiple Messages
If you want to write multiple messages to a single file or stream, it
is up to you to keep track of where one message ends and the next
begins. The Protocol Buffer wire format is not self-delimiting, so
protocol buffer parsers cannot determine where a message ends on their
own. The easiest way to solve this problem is to write the size of
each message before you write the message itself. When you read the
messages back in, you read the size, then read the bytes into a
separate buffer, then parse from that buffer. (If you want to avoid
copying bytes to a separate buffer, check out the CodedInputStream
class (in both C++ and Java) which can be told to limit reads to a
certain number of bytes.)
https://developers.google.com/protocol-buffers/docs/techniques
A simple solution would be to serialize each proto as base64, on its own line in your file.
That way it would be pretty easy to parse and use them in Python.
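A sketch of the base64-lines idea on the Java side (helper names are hypothetical; in real use each byte[] would come from msg.toByteArray(), and the Python reader would base64-decode each line and call ParseFromString on the result):

```java
import java.io.*;
import java.util.Base64;

public class Base64Lines {
    // Write each serialized message as one base64 line.
    static void writeMessages(Writer w, java.util.List<byte[]> messages) throws IOException {
        for (byte[] m : messages) {
            w.write(Base64.getEncoder().encodeToString(m));
            w.write('\n');
        }
    }

    // Read them back: one base64-decoded byte[] per line.
    static java.util.List<byte[]> readMessages(BufferedReader r) throws IOException {
        java.util.List<byte[]> out = new java.util.ArrayList<>();
        for (String line; (line = r.readLine()) != null; ) {
            out.add(Base64.getDecoder().decode(line));
        }
        return out;
    }
}
```

Base64 never emits newlines (with the basic encoder), so the line boundaries double as message boundaries, at the cost of ~33% size overhead.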

Read from TcpClient.GetStream() without knowing the length

I'm working on a TCP-based communication protocol. As far as I know,
there are many ways to determine when to stop reading.
Closing the connection at the end of the message
Putting the length of the message before the data itself
Using a separator; some value which will never occur in the normal data (or would always be escaped somehow)
Typically I'm trying to send a file over a WiFi network (which may be unstable and slow).
Because of the RSA and AES handshake, I don't want to close the connection each time (can't use 1).
It's a large file whose length I can't predict, so I can't act as in that method (can't use 2).
Checking for something special when reading, and escaping it when writing, takes a lot of processing (can't use 3).
This method should be compatible with both C# and Java.
What do you suggest?
More general problems :
How to identify end of InputStream in java
C# - TcpClient - Detecting end of stream?
More information
I'm coding a TCP client-server communication.
First the server generates and sends an RSA public key to the client.
Then the client generates an AES (key, IV) pair and sends it back using RSA encryption.
Up to here everything is fine.
But I want to send a file over this connection. Here is my current packet: EncryptUsingAES(new AES.IV (16 bytes) + file.content (any size))
On the server I can't capture all the data sent by the client, so I need to know how much data to read with TcpClient.GetStream().Read(buffer, 0, bufferSize).
Current code:
List<byte> message = new List<byte>();
int bytes = -1;
do
{
    byte[] buffer = new byte[bufferSize];
    bytes = stream.Read(buffer, 0, bufferSize);
    if (bytes > 0)
    {
        byte[] tmp = new byte[bytes];
        Array.Copy(buffer, tmp, bytes);
        message.AddRange(tmp);
    }
} while (bytes == bufferSize);
Your second method is the best one. Prefixing each packet with the packet's length creates a reliable message-framing protocol which will, if done correctly, ensure that all your data is received exactly as you sent it (that is, no partial data and no messages lumped together).
Recommended packet structure:
[Data length (4 bytes)][Header (1 byte)][Data (?? bytes)]
- The header in question is a single byte you can use to indicate what kind of packet this is, so that the endpoint will know what to do with it.
Sending files
The sender of a file is in 90% of cases aware of the amount of data it is about to send (after all, it usually has the file stored locally), which means there is no problem knowing how much of the file has been sent.
The method I use and recommend is to start by sending an "info packet", which tells the endpoint that it is about to receive a file and how many bytes that file consists of. After that you start sending the actual data - preferably in chunks, since it's inefficient to process the entire file at once (at least if it's a large file).
Always keep track of how many bytes of the file you've received so far. By doing so the receiver can automatically tell when it has received the whole file.
Send a file a few kilobytes at a time (I use 8192 bytes = 8 kB as a file buffer). That way you don't have to read the entire file into memory nor encrypt it all at the same time.
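The chunked-send advice can be sketched in Java as a plain stream copy (names are hypothetical; encryption is omitted here):

```java
import java.io.*;

public class ChunkedCopy {
    // Stream data in 8 kB chunks so the whole file never sits in memory.
    static long copyInChunks(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);   // write only the bytes actually read
            total += read;
        }
        return total;
    }
}
```

In the encrypted case each 8 kB chunk would be encrypted and length-prefixed before writing, but the loop structure stays the same.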
Encrypting the data
Dealing with encryption will not be a problem. If you use length-prefixing just encrypt the data itself and leave the data length header untouched. The data length header must then be generated by the size of the encrypted data, like so:
Encrypt the data.
Get the length of the encrypted data.
Produce the following packet:
[Encrypted data length][Encrypted data]
(Insert a header byte in there if you need to)
Receiving an encrypted file
Receiving an encrypted file and knowing when everything has been received is in fact not very hard. Assuming you're using the above-described method for sending the file, you would just have to:
Receive the encrypted packet → decrypt it.
Get the length of the decrypted data.
Increment a variable keeping track of the amount of file-bytes received.
If the received amount is equal to the expected amount: close the file.
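The bookkeeping in steps 2-4 can be sketched as a small Java helper, one instance per incoming file (the names are hypothetical):

```java
// Tracks how many file bytes have arrived; signals when the file is complete.
public class FileReceiveTracker {
    private final long expected;   // byte count announced by the "info packet"
    private long received;

    public FileReceiveTracker(long expectedBytes) {
        this.expected = expectedBytes;
    }

    // Call with the decrypted length of each data packet.
    // Returns true once the whole file has been received.
    public boolean onChunk(int decryptedLength) {
        received += decryptedLength;
        return received >= expected;
    }
}
```

The caller feeds it the decrypted length of each data packet and closes the file when it returns true.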
Additional resources/references
You can refer to two of my previous answers that I wrote about TCP length-prefixed message framing:
C# Deserializing a struct after receiving it through TCP
TCP Client to Server communication
The easiest way would be to use your #2. If you cannot predict message length, buffer up to a certain amount of bytes (like 1 KiB or something along those lines), and insert a length header for every one of those chunks instead of prefixing the whole message once.

Large byte arrays over RMI

I'm currently trying to find out whether it's a good idea to transfer rather large byte arrays (up to ~50 MB) over RMI.
I read that it is slow and that the data needs to be held in memory on both the client and the server. This could become a problem when there are multiple calls.
Are there any (simple) alternatives to this?
RMIIO lets you stream objects in chunked fashion.
EDIT: you can also use Kryo to serialize and compress the object sent across the wire.
RMI is intended to transfer objects. If you have a byte array object on the server and want it on the client, you must have it in both places until it has been delivered successfully (then you can let the original go away).
A more reasonable approach might be repeated calls against a remote object, transferring only a small chunk at a time. This in turn requires multiple round trips, making it slower.
What is the actual (non-technical) problem you want to solve?
Consider Java streams, which support compression, for sending and receiving large amounts of data.
For instance, GZIPOutputStream to send data and GZIPInputStream to receive it.
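A self-contained sketch of that compression round trip with java.util.zip (the class name is hypothetical):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {
    // Compress a byte array in memory with gzip.
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(data);                 // closing the stream flushes the gzip trailer
        }
        return buf.toByteArray();
    }

    // Decompress back to the original bytes.
    static byte[] decompress(byte[] gzipped) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            for (int n; (n = gz.read(chunk)) != -1; ) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        }
    }
}
```

In the RMI setting the same streams would wrap the socket or chunked transfer rather than in-memory buffers, so the full array never needs to exist compressed and uncompressed at once.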
It's a very bad idea. The byte array has to be formed in memory before calling the remote method; then it has to be transmitted in the call; then it has to be read by the server; then it has to exist in the server; then it can be processed by the server. You never want to deal with data items this large in a single chunk. It wastes both time and space. Consider a streaming API where you can use moderate sized buffers at both ends; send the data in chunks that are convenient to the sender; and receive it in chunks that are convenient to the receiver.

Unknown AMF type & other errors when transferring files from Flash to Java

I'm using a Flash tool to transfer data to Java. I'm having problems when it comes to sending multiple objects at once. The objects being sent are just generic Object objects, so it's not a case of needing to register a class alias or anything.
Sending one object works fine. Once I start sending multiple objects (putting the same Objects in an Array and sending that), it starts getting weird. Up to three objects in an Array seems to work fine. More than that I get different errors in the readObject() function, such as:
Unknown AMF type '47'
Unknown AMF type '40'
Unknown AMF type '20'
OutOfBoundsExceptions index 23, size 0
NullPointerException
etc.
Sending 3 objects will work, sending 4 gives me the error. If I delete one of the previous 3 that worked (while keeping the fourth that was added), it'll work again. Anyone know what's going on?
Some more info:
Communication goes through a Socket class on the Flash side. This is pure AS3, no flex.
Messages are compressed before being sent and decompressed on the server, so I'm pretty sure it's not a buffer size issue (unless I'm missing something)
BlazeDS version seems to be 4.0.0.14931 on the jar
Flash version is 10.1 (it's an AIR app)
Update with rough code
Examples of the objects being sent:
var o:Object = { };
o._key = this._key.toString();
o.someParam = someString;
o.someParam2 = someInt;
o.someParam3 = [someString1, someString2, someString3];
...
It's added to our event object (which we use to determine the event to call, the data to pass, etc.). The event object has been registered as a class alias.
That object is sent to the server through a Socket like so:
myByteArray.writeObject( eventObj );
myByteArray.compress();
mySocket.writeBytes( myByteArray );
mySocket.flush();
On the server side, we receive the bytes and decompress them. We create an Amf3Input object, set the input stream, then read it:
Amf3Input amf3Input = new Amf3Input( mySerializationContext );
amf3Input.setInputStream( new ByteArrayInputStream( buffer ) ); // buffer is a byte[]
MyEventObj eventObj = (MyEventObj)amf3Input.readObject(); // MyEventObj is the server version of the client event object
If it's going to crash with an "unknown AMF type" error, it does so immediately, i.e. when we try to read the object, not when it's trying to read a sub-object.
Stepping through the read code, it seems that when I pass an array of objects, if the length is <= 4 it reads the length correctly. If the length is bigger than that, it reads the length as 4.
If you're getting AMF deserialization errors, there are several possible issues that could be contributing to the problem. Here are several techniques for doing further diagnostics:
Use a network traffic sniffer to make sure that what you are sending matches what you are receiving. On the Mac I'll use CocoaPacketAnalyzer, or you can try Charles, which can actually decode AMF packets that it notices.
Feed the data to a different AMF library, like PyAMF or RocketAMF to see if it's a problem with BlazeDS or with how you're calling it. It's also possible that you may get a different error message that will give you a better idea of where it's failing.
Check the format of the AMF packet. AMF server calls have some additional wrapping around them that would throw off a deserializer if it's not expecting to encounter that wrapping, and vice versa for purely serialized objects. Server call packets always start off with a 0x00, followed by the AMF version (0x00, 0x03, or in rare cases 0x02).
Ok, I figured out the problem. Basically, messages are compressed before being sent and decompressed on the server. What I didn't see was that the byte[] buffer the message was being decompressed into was always 1024 bytes long, which was fine for small arrays of objects. Once that size was exceeded, however, it would overwrite the buffer (I'm not quite sure what happens in Java when you try to write more bytes than there is room for - whether it wraps around or shifts the data).
When it came to reading the AMF object, the first thing it does is read an int and use that to determine what type of object it's trying to decode. As this int was gibberish (47, 110, -10), it was failing.
Time to start prepending message lengths I think :)
Thanks for the help.

Exceptions when reading protobuf messages in Java

I have been using protobuf for some weeks now, but I still keep getting exceptions when parsing protobuf messages in Java.
I use C++ to create my protobuf messages and send them with Boost sockets to a server socket where the Java client is listening. The C++ code for transmitting a message is this:
boost::asio::streambuf b;
std::ostream os(&b);
ZeroCopyOutputStream *raw_output = new OstreamOutputStream(&os);
CodedOutputStream *coded_output = new CodedOutputStream(raw_output);
coded_output->WriteVarint32(agentMessage.ByteSize());
agentMessage.SerializeToCodedStream(coded_output);
delete coded_output;
delete raw_output;
boost::system::error_code ignored_error;
boost::asio::async_write(socket, b.data(), boost::bind(
    &MessageService::handle_write, this,
    boost::asio::placeholders::error));
As you can see I write with WriteVarint32 the length of the message, thus the Java side should know by using parseDelimitedFrom how far it should read:
AgentMessage agentMessage = AgentMessageProtos.AgentMessage
.parseDelimitedFrom(socket.getInputStream());
But it's no help, I keep getting these kind of Exceptions:
Protocol message contained an invalid tag (zero).
Message missing required fields: ...
Protocol message tag had invalid wire type.
Protocol message end-group tag did not match expected tag.
While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length.
It is important to know that these exceptions are not thrown on every message; only a fraction of the messages fail, and most work out just fine. Still, I would like to fix this since I do not want to drop messages.
I would be really grateful if someone could help me out or share their ideas.
Another interesting fact is the number of messages I receive. A total of 1,000 messages in 2 seconds is normal for my program; in 20 seconds, about 100,000, and so on. When I artificially reduced the rate so that only 6-8 messages were transmitted, there were no errors at all. So might this be a buffering problem on the Java client socket side?
Out of, say, 60,000 messages, about 5 are corrupted on average.
[I'm not really a TCP expert, this may be way off]
Problem is, [Java] TCP Socket's read(byte[] buffer) will return after reading to the end of the TCP frame. If that happens to be mid-message (I mean, protobuf message), parser will choke and throw an InvalidProtocolBufferException.
Any protobuf parsing call uses CodedInputStream internally (src here), which, in case the source is an InputStream, relies on read() -- and, consequently, is subject to the TCP socket issue.
So, when you stuff big amounts of data through your socket, some messages are bound to be split in two frames -- and that's where they get corrupted.
I'm guessing, when you lower message transfer rate (as you said to 6-8 messages per second), each frame gets sent before the next data piece is put into the stream, so each message always gets its very own TCP frame, i.e. none get split and don't get errors. (Or maybe it's just that the errors are rare and low rate just means you need more time to see them)
As for the solution, your best bet would be to handle the buffering yourself, i.e. read a byte[] from the socket (probably using readFully() instead of read(), because the former blocks until either there's enough data to fill the buffer or an EOF is encountered, so it's resistant to the mid-message frame-end problem), ensure you've got enough data to parse a whole message, and then feed the buffer to the parser.
Also, there's some good reading on the subject in this Google Groups topic - that's where I got the readFully() part.
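A sketch of that approach, assuming the varint length prefix the C++ side writes with WriteVarint32 (the helper names are hypothetical; CodedInputStream can also decode the varint for you):

```java
import java.io.*;

public class VarintFraming {
    // Read a base-128 varint length, as written by WriteVarint32 on the C++ side.
    static int readVarint32(InputStream in) throws IOException {
        int result = 0, shift = 0, b;
        do {
            b = in.read();
            if (b == -1) throw new EOFException("stream ended inside varint");
            result |= (b & 0x7f) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }

    // Read one whole message: varint length, then exactly that many bytes.
    static byte[] readMessage(DataInputStream in) throws IOException {
        int len = readVarint32(in);
        byte[] payload = new byte[len];
        in.readFully(payload);   // blocks until the full message is buffered
        return payload;
    }
}
```

Because readFully() waits for all len bytes, a message split across two TCP reads is reassembled before parsing, which is exactly the failure mode described above.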
I am not familiar with the Java API, but I wonder how Java deals with a uint32 value denoting the message length, since Java only has signed 32-bit integers. A quick look at the Java API reference told me an unsigned 32-bit value is stored in a signed 32-bit variable. So how is the case handled where an unsigned 32-bit value denotes the message length? Also, there seems to be support for zigzag-encoded signed varints in the Java implementation, called ZigZag32/64. AFAIK, the C++ version doesn't know about such encodings. So maybe the cause of your problem is related to these things?
