Google ProtoBuf serialization / deserialization - java

I am reading about Google Protocol Buffers. I want to know: can I serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields?
More generally, I want to be able to send objects from any language to a Java server and deserialize them there.
Assume the following is my .proto file:
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
I ran protoc on this and created a C++ object.
Basically, I now want to send the serialized stream to the Java server.
On the Java side, can I deserialize the stream so that I can find out that there are 3 fields in it, and each field's name, type, and value?

On the Java side, can I deserialize the stream so that I can find out that there are 3 fields in it, and each field's name, type, and value?
You will need to know the schema in advance. Firstly, protobuf does not transmit names; all it uses as identifiers is the numeric key (1, 2 and 3 in your example) of each field. Secondly, it does not explicitly specify the type; there are only a very few wire types in protobuf (varint, 32-bit, 64-bit, length-prefix, group), actual data types are mapped onto those, and you cannot unambiguously decode data without the schema:
- varint is "some form of integer", but it could be signed, unsigned or "zigzag" (which allows negative numbers of small magnitude to be cheaply encoded), and could be intended to represent any width of data (64-bit, 32-bit, etc.); see the sketch after this list
- 32-bit could be an integer (signed or unsigned), or it could be a 32-bit floating-point number
- 64-bit could be an integer (signed or unsigned), or it could be a 64-bit floating-point number
- length-prefix could be a UTF-8 string, a sequence of raw bytes (without any particular meaning), a "packed" set of repeated values of some primitive type (integer, floating point, etc.), or a structured sub-message in protobuf format
- groups - hoorah! this one is always unambiguous! it can only mean one thing; but that one thing is largely deprecated by Google :(
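To make the varint ambiguity concrete, here is a minimal sketch (plain Java, no protobuf dependency; the zigzag decode formula is the one documented for protobuf's sint types):

public class VarintAmbiguity {
    public static void main(String[] args) {
        // The same varint payload (a single byte, 0x02) decodes to
        // different values depending on the type the schema declares.
        long raw = 0x02;                          // varint read off the wire
        long asInt32 = raw;                       // schema says int32: value 2
        long asSint32 = (raw >>> 1) ^ -(raw & 1); // schema says sint32 (zigzag): value 1
        System.out.println(asInt32 + " vs " + asSint32); // prints: 2 vs 1
    }
}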
So fundamentally: you need the schema. The encoded data does not include what you want. It does this to avoid unnecessary space - if the protocol assumes that the encoder and decoder both know what the message is meant to look like, then a lot less information needs to be sent.
Note, however, that the information that is included is enough to safely round-trip a message even if there are fields that are not expected; it is not necessary to know the name or type if you only need to re-encode it to pass it along / back.
What you can do is use the parser API to scan over the data to reveal that there are three fields, field 1 is a varint, field 2 is length-prefixed, field 3 is length-prefixed. You could make educated guesses about the data beyond that (for example, you could see whether a UTF-8 decode produces something that looks roughly text-like, and verify that UTF-8 encoding that gives you back the original bytes; if it does, it is possible it is a string)
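With the Java library, that scan can be done with CodedInputStream; a minimal sketch (the bytes parameter is assumed to hold one encoded message):

import com.google.protobuf.CodedInputStream;
import com.google.protobuf.WireFormat;

public class WireScanner {
    public static void scan(byte[] bytes) throws java.io.IOException {
        CodedInputStream in = CodedInputStream.newInstance(bytes);
        int tag;
        while ((tag = in.readTag()) != 0) {
            // Only the field number and wire type are on the wire;
            // no names, no exact data types.
            System.out.println("field " + WireFormat.getTagFieldNumber(tag)
                    + ", wire type " + WireFormat.getTagWireType(tag));
            in.skipField(tag); // skip the payload; we only inspect the envelope
        }
    }
}

For your Person message this would report field 1 with wire type 0 (varint) and fields 2 and 3 with wire type 2 (length-delimited), which is exactly as much as the stream alone can tell you.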

Can I serialize a C++ object, send it on the wire to a Java server, deserialize it there in Java, and introspect the fields?
Yes, that is the very goal of protobuf.
Serialize data in an application developed in any supported language, and deserialize it in an application developed in any supported language. The serialization and deserialization languages can be the same, or they can differ.
Keep in mind that protocol buffers are not self-describing, so both sides of your application need to have serializers/deserializers generated from the .proto file.
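For instance, once protoc has generated the Person class for Java from the same .proto, the receiving side is roughly this (a sketch; the socket variable and error handling are assumed):

InputStream in = socket.getInputStream();
Person person = Person.parseFrom(in); // generated by protoc from the shared .proto
System.out.println(person.getId() + ": " + person.getName());
if (person.hasEmail()) {
    System.out.println(person.getEmail());
}

Note that parseFrom(InputStream) reads to the end of the stream; if you send several messages over one connection, use writeDelimitedTo()/parseDelimitedFrom() so each message carries its own length prefix.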

In short: yes you can.
You will need to create .proto files which define the data structures that you want to share. By using the Google Protocol Buffers compiler you can then generate interfaces and (de)serialization code for your structures for both Java and C++ (and almost any other language you can think of).
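For the Person example above, generating both sides could look something like this (file and directory names are placeholders):

protoc --cpp_out=./cpp_gen --java_out=./java_gen person.proto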
To transfer your data over the wire you can use, for instance, ZeroMQ, which is an extremely versatile communications framework that also sports a slew of different language APIs, among them Java and C++.
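A receiving sketch with the JeroMQ binding (class names as in jeromq 0.5.x; the endpoint and the PULL socket type are illustrative choices):

import org.zeromq.SocketType;
import org.zeromq.ZContext;
import org.zeromq.ZMQ;

public class ProtoReceiver {
    public static void main(String[] args) throws Exception {
        try (ZContext context = new ZContext()) {
            ZMQ.Socket socket = context.createSocket(SocketType.PULL);
            socket.bind("tcp://*:5555");
            byte[] frame = socket.recv(0);       // one ZeroMQ frame
            Person person = Person.parseFrom(frame);
            System.out.println(person.getName());
        }
    }
}

A convenience here: ZeroMQ delivers whole frames, so you don't need your own length-prefixing to know where one protobuf message ends and the next begins.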
See this question for more details.

Related

XSD-based binary serialisation

Hello, I am looking for a binary serialisation for Java that:
- uses XSD for the schema
- outputs very small byte streams
- the byte stream should not contain field names or data types
- POJOs should be generated, like it is possible for JAXB
- nice to have: an implementation in JavaScript
Does anyone know a solution for this?
If you're using an XSD schema, the conventional expectation is that you'd be serialising to/from XML. That's not a very small byte stream; plain text is quite inefficient for representing binary data (ints, floats, etc).
However, there is an option. XSD schemas and ASN.1 schemas are more or less interchangeable. There's even an official mapping between the two, defined by the ITU, and there are tools that translate one into the other.
Why is this relevant? Well, with ASN.1 you have access to a variety of different wire formats. There's a bunch of binary ones, as well as text ones (including, yes, XML and JSON). The important thing is that one of the binary ones is uPER (unaligned Packed Encoding Rules), which will use the bare minimum of bits to represent the data being sent.
For example, suppose that you'd got a class with an integer field, and you'd constrained its value to be between 0 and 7. uPER would use only 3 bits for that field.
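The bit count is just the ceiling of log2 of the number of allowed values; as a quick illustration in Java:

// An integer constrained to [0, 7] has 8 possible values,
// so uPER needs ceil(log2(8)) = 3 bits for it.
int low = 0, high = 7;
int values = high - low + 1;                              // 8
int bits = 32 - Integer.numberOfLeadingZeros(values - 1); // 3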
What you can have is an XSD schema being i) translated to ASN.1 and compiled by an ASN.1 compiler (OSS), or ii) compiled directly by an ASN.1 compiler (Obj-Sys), producing Java POJO classes that can be serialised to/from ASN.1's uPER wire format (and all the other binary formats, and XML and JSON too for that matter, depending on the ASN.1 compiler one is using). It's a similar way of working as with jaxb/xjc.
The tooling I've suggested in the previous paragraph requires, AFAIK, the ASN.1 proprietary compilers and tools from either Objective Systems (obj-sys.com) or OSS Nokalva (www.oss.com), and they're not free (n.b. I've been a customer of both, not otherwise associated with them). I think that there's a free online converter for XSD<-->ASN1 schema, and there are some free ASN1 compilers (though they commonly target C, not Java).
Links: OSS's XSD translator, Objective System's Compiler reference for XSD translation, OSS Java support, Obj-Sys's Java support
Having wittered on about XSD, ASN.1, etc., there are other options that might be usable, but they probably mean dropping the XSD schema and using something else.
Google Protocol Buffers
There are Java bindings (and a load of others) for Google Protocol Buffers, and the GPB wire format is binary. It's not as good as ASN.1's uPER for data size, but certainly smaller than XML text. See here. It has its own schema language, and as far as I know there's no translator between XSD and GPB's.
Cap'n Proto
Another interesting option (see this project), again a binary format. It won't quite beat uPER for size, but it is fast to serialise / deserialise (or at least it is in C/C++). Again, I know of no translation between its schema language and XSD.

How to use ANTLR4 with binary data?

From the homepage:
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for [...] or binary files.
I have read through the docs for some hours now and think that I have a basic understanding of ANTLR, but I have a hard time finding any references to processing binary files. And I'm not the only one, it seems.
I need to create a parser for some binary data and would like to decide whether ANTLR is of any help or not.
Binary data structure
That binary data is structured in logical fields like field1, which is followed by field2, which is followed by field3, etc., and all those fields have a special purpose. The lengths of those fields may differ AND may not be known at the time the parser is generated. For example, I know that field1 is always 4 bytes, field2 might simply be 1 byte, and field3 might be 1 to 10 bytes, possibly followed by additional field3s of n bytes, depending on the actual value of the data. That is the second problem: I know the fields are there, and e.g. with field1 I know it's 4 bytes, but I don't know the actual value, and that is what I'm interested in. The same goes for the other fields; I need the values of all of them.
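For what it's worth, that layout reads naturally with a plain binary reader; a hand-rolled sketch in Java (the field semantics and the length rule for field3 are hypothetical stand-ins for the ones in the standard):

import java.nio.ByteBuffer;

public class DatagramReader {
    public static void read(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int field1 = buf.getInt();   // field1: always 4 bytes
        byte field2 = buf.get();     // field2: 1 byte
        // field3: the first byte decides how many further bytes belong to it
        // (this rule is made up; substitute the one from the standard)
        byte head = buf.get();
        int extra = head & 0x0F;     // hypothetical length rule
        byte[] field3 = new byte[extra];
        buf.get(field3);
        System.out.printf("field1=%d field2=%d field3=%d extra bytes%n",
                field1, field2, field3.length);
    }
}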
What I need in ANTLR
This sounds to me like a common structure and use case for arbitrary binary data, but I don't see any special handling of such data in ANTLR. All the examples use some kind of text, and I don't see any value-extraction callbacks or the like. Additionally, I think I would need callbacks that influence the parsing process itself, so that e.g. one callback is called on the first byte of field3; I check that, decide that one to N additional bytes need to be consumed and that those are logically part of field3, and tell the parser so, enabling it to proceed "somehow".
In the end, I would get some higher-level "field" objects, and ANTLR would provide the underlying parse logic, with callbacks, listener infrastructure, walking abilities, etc.
Did anyone ever do something like that and can provide some hints to examples or the concrete documentation I seem to have missed? Thanks!
EN 13757-3:2012
It probably doesn't make my question easier to understand, but the binary data I'm referring to is defined in the standard EN 13757-3:2012:
Communication systems for meters and remote reading of meters - Part 3: Dedicated application layer
The standard is not freely available on the net (anymore?), but the following PDF might give you an overview of what example data looks like, on page 4. Note especially that the bytes of the mentioned fields are not constant; only the overall structure of the datagram is defined.
http://fastforward.ag/downloads/docu/FAST_EnergyCam-Protocol-wirelessMBUS.pdf
The tokens for the grammar would be the fields, implemented by differing numbers of bytes, but each with a value etc. Given ANTLR's self-description, I would have expected such things to work somehow...
Alternative: Kaitai.io
Whoever is in a position comparable to mine: have a look at Kaitai.io, which looks very promising:
https://stackoverflow.com/a/40527106/2055163

As400Text class vs MQC.MQGMO_CONVERT

Why do some people prefer to use the AS400Text class to handle EBCDIC/ASCII conversion (in Java code with the IBM MQ jars) if we already have the MQC.MQGMO_CONVERT option to handle this?
My requirement is to convert ASCII->EBCDIC during the PUT operation, which I am doing by setting the character set to 37 and the write format to "STRING", and to convert EBCDIC->ASCII automatically during the GET operation using the MQC.MQGMO_CONVERT option.
Is there any downside to using the convert option? Could anyone please let me know if it is not a 100 percent safe option?
Best practice is to write the MQ message in your local code page (where the CCSID and Encoding will normally be filled in automatically with the correct values) and to set the Format field. The getter should then use MQGMO_CONVERT to request the message in the CCSID and Encoding it needs.
Get with convert is safe, and will be correct so long as you provide the correct CCSID and Encoding that describe the message when you put it.
In your question's description of what you are doing, you convert from ASCII->EBCDIC before putting the message, and then the getter converts from EBCDIC->ASCII on the MQGET. This means you have paid for two data conversion operations when you could have done none (or, if two different ASCII code pages are involved, only one).
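In code, that pattern looks roughly like this (a sketch, not tested; queue is assumed to be an open MQQueue, and exception handling is omitted):

// Putter: write in the local code page; just set the Format field.
MQMessage put = new MQMessage();
put.format = MQC.MQFMT_STRING; // tells MQ the payload is text
put.writeString("hello");
queue.put(put, new MQPutMessageOptions());

// Getter: request conversion to the CCSID it wants (e.g. 37 = EBCDIC).
MQMessage got = new MQMessage();
got.characterSet = 37;
MQGetMessageOptions gmo = new MQGetMessageOptions();
gmo.options = MQC.MQGMO_CONVERT;
queue.get(got, gmo);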

Default number system and character set in Java

This is a fundamental question about how Java works, so I don't have any code to support it.
I am new to Java development and want to know how the different number systems and character sets, like UTF-8 and Unicode, come together in Java.
Let's say a user creates a new string and an int with the same value.
int i = 100;
String s = "100";
The hardware of a computer understands only zeros and ones, so these have to be converted to binary? (Correct me if I'm wrong.) This conversion should be done by the JVM? (Correct me if I'm wrong.) And to represent characters of different languages with the characters that can be typed on an (English) keyboard, UTF-8 and similar encodings are used? (Correction needed.)
Now, how does this whole flow fit into the bigger picture of running a Java web application?
How does a string/int get converted to binary for the machine's hardware to understand?
How does it get converted to UTF-8 for a browser to understand?
And what are the default number format and character set in Java? If I'm reading the contents of a file, will they be read as binary or as UTF-8?
All computers run in binary. The conversion is done by the JVM and the computer that you have. You shouldn't worry about converting the code into the corresponding 1's and 0's. The browser has its own conversion hard-coded to change the universal 1's and 0's (used by all programs and computer software) into however it decides to display the given information. All languages are just a translation guide for the user to "speak" with the computer, and vice versa. Hope this helps, though I don't think I really answered anything.
How Java represents any data type in memory is the choice of the actual JVM. In practice, the JVM will choose the format native to the processor (e.g. choose between little/big endian for int), simply because it offers the best performance on that platform.
Basically, the JLS makes certain guarantees (like that a byte has 8 bits and its values range from -128 to 127), and the VM just maps that to the platform as it deems suitable (the JLS was specified to match common computing technology closely, so there is usually no magic needed to guess how primitive types map to the platform).
You should never care how the VM represents data in memory; Java does not offer any legal way to access the data in a manner where you would need to know (bypassing most of the VM's logic by using sun.misc.Unsafe is not considered legal).
If you care for educational purposes, learn what binary representations the underlying platform (e.g. x86) uses and take a look at the VM. It has little to do with Java really; it's all VM- and platform-specific.
For java.lang.String, it's the implementation of the class that defines how the string is stored internally (it has gone through quite some changes across major Java versions), but what a String exposes is quite narrowly defined (see the JDK javadoc for String.length(), String.charAt()).
As for how user input is translated to Java's standard types, that's actually platform specific. The JVM selects the default encoding (e.g. String.getBytes() can return quite different results for the same string, depending on the platform; that's why it's recommended to explicitly specify the desired encoding). The same goes for many other things (time zone, number format etc.).
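A quick illustration of why the explicit charset matters (the sample string is arbitrary):

import java.nio.charset.StandardCharsets;

String s = "Größe";
byte[] platformDefault = s.getBytes();                   // varies by platform
byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);        // 7 bytes, everywhere
byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1); // 5 bytes, everywhere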
CharSets and Formats are the building blocks a program wires up to translate data from the outside world (file, HTTP or user input) into Java's representation of the data (or vice versa). For example, a web application will use the encoding from an HTTP header to determine which CharSet to use when interpreting the contents (the HTTP header itself is defined by the spec to be US-ASCII).
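This also answers the file question above: a file is always read as bytes, and a charset is what turns those bytes into text. A sketch (the file name is a placeholder, exception handling omitted):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

byte[] raw = Files.readAllBytes(Path.of("data.txt"));  // always raw bytes
String text = new String(raw, StandardCharsets.UTF_8); // decoding is explicit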

Query regarding Google Protocol Buffer

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}
The above is a snippet from addrbook.proto file mentioned in Google Protocol Buffer tutorials.
The requirement is that the application being developed will need to decode binary data received from a socket. For example, name, id and e-mail represented as binary data.
Now, id can be read as an integer. But I am really not sure how to read name and email, considering that these two can be of variable length. (And unfortunately, I do not get lengths prefixed before these two fields.)
The application is expected to read this kind of data from a variety of sources. Our objective is to make a decoder/adapter for the different types of data originating from these sources. There could also be different types of messages from the same source.
Thanks in advance
But I am really not sure how to read name and email, considering that these two can be of variable length.
The entire point of a serializer such as protobuf is that you don't need to worry about that. Specifically, in the case of protobuf strings are always prefixed by their length in bytes (using UTF-8 encoding, and varint encoding of the length).
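To make that concrete, here is the wire framing for a name field holding "Bob", using the message definition from the question:

// required string name = 1, value "Bob", on the wire:
// 0x0A            tag byte: (field number 1 << 3) | wire type 2 (length-delimited)
// 0x03            length prefix: 3 payload bytes follow
// 0x42 0x6F 0x62  the UTF-8 bytes of "Bob"
byte[] nameField = { 0x0A, 0x03, 0x42, 0x6F, 0x62 };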
And unfortunately, I do not get lengths prefixed before these two fields
Then you aren't processing protobuf data. Protobuf is a specific data format, in the same way that xml or json is a data format. Any conversation involving "protocol buffers" only makes sense if you are actually discussing the protobuf format, or data serialized using that format.
Protocol buffers is not an arbitrary data handling API. It will not allow you to process data in any format other than protobuf.
It sounds like you might be trying to re-implement Protocol Buffers by hand; you don't need to do that (though I'm sure it would be fun). Google provides C++, Java, and Python implementations to serialize and de-serialize content in protobuf format as part of the Protocol Buffers project.
