Hello, I am looking for a binary serialisation format for Java that:
- uses XSD for the schema
- outputs very small byte streams
- does not include the field names and data types in the byte stream
- can generate POJOs, like it is possible with JAXB
- nice to have: an implementation in JavaScript
Does anyone know a solution for this?
If you're using an XSD schema, the conventional expectation is that you'd be serialising to/from XML. That's not a very small byte stream; plain text is quite inefficient for representing binary data (ints, floats, etc).
However, there is an option. XSD and ASN.1 schemas are interchangeable, more or less; there's even an official translation between the two defined by the ITU, and there are tools that implement it.
Why is this relevant? Well, with ASN.1 you have access to a variety of different wire formats. There's a bunch of binary ones, as well as text ones (including, yes, XML and JSON). The important thing is that one of the binary ones is uPER (unaligned Packed Encoding Rules), which will use the bare minimum of bits to represent the data being sent.
For example, suppose that you'd got a class with an integer field, and you'd constrained its value to be between 0 and 7. uPER would use only 3 bits for that field.
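To make that concrete, here's a minimal sketch of the bit-packing idea in Java (not a real uPER codec; class and method names are made up for illustration):

```java
import java.util.BitSet;

public class UperSketch {
    // Pack a value constrained to [0, 7] into exactly 3 bits at the given bit offset.
    static void packConstrained(BitSet bits, int bitOffset, int value) {
        for (int i = 0; i < 3; i++) {
            if (((value >> (2 - i)) & 1) == 1) {
                bits.set(bitOffset + i);
            }
        }
    }

    // Read the 3 bits back out as an integer in [0, 7].
    static int unpackConstrained(BitSet bits, int bitOffset) {
        int value = 0;
        for (int i = 0; i < 3; i++) {
            value = (value << 1) | (bits.get(bitOffset + i) ? 1 : 0);
        }
        return value;
    }

    public static void main(String[] args) {
        BitSet buffer = new BitSet();
        packConstrained(buffer, 0, 5);  // first constrained field
        packConstrained(buffer, 3, 2);  // second field, packed directly adjacent, no padding
        System.out.println(unpackConstrained(buffer, 0)); // 5
        System.out.println(unpackConstrained(buffer, 3)); // 2
    }
}
```

Two fields fit in 6 bits total; a real uPER encoder applies the same principle across the whole message, driven by the constraints in the schema.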
What you can have is an XSD schema being either i) translated to ASN.1 and compiled by an ASN.1 compiler (OSS), or ii) compiled directly by an ASN.1 compiler (Obj-Sys), producing Java POJO classes that can be serialised to/from ASN.1's uPER wire format (and all the other binary formats, and XML and JSON too for that matter, depending on the ASN.1 compiler you're using). It's a similar way of working to jaxb / xjc.
The tooling I've suggested in the previous paragraph requires, AFAIK, the ASN.1 proprietary compilers and tools from either Objective Systems (obj-sys.com) or OSS Nokalva (www.oss.com), and they're not free (n.b. I've been a customer of both, not otherwise associated with them). I think that there's a free online converter for XSD<-->ASN1 schema, and there are some free ASN1 compilers (though they commonly target C, not Java).
Links: OSS's XSD translator, Objective System's Compiler reference for XSD translation, OSS Java support, Obj-Sys's Java support
Having wittered on about XSD, ASN.1, etc., there are other options that might be usable, but they probably mean dropping the XSD schema and using something else.
Google Protocol Buffers
There are Java bindings (and bindings for a load of other languages) for Google Protocol Buffers, and GPB's wire format is binary. It's not as good as ASN.1's uPER for data size, but certainly smaller than XML text. See here. It has its own schema language, and as far as I know there's no translator between XSD and GPB's.
Cap'n Proto
Another interesting option (see this project), again a binary format. It won't quite beat uPER for size, but it is fast to serialise / deserialise (or at least it is in C/C++). Again, I know of no translation between its schema language and XSD.
From the homepage:
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for [...] or binary files.
I have read through the docs for some hours now and think that I have a basic understanding of ANTLR, but I have a hard time finding any references to processing binary files. And it seems I'm not the only one.
I need to create a parser for some binary data and would like to decide if ANTLR is of any help or not.
Binary data structure
The binary data is structured into logical fields: field1 is followed by field2, which is followed by field3, and so on, and each of those fields has a special purpose. The lengths of the fields may differ AND may not be known at the time the parser is generated. For example, I do know that field1 is always 4 bytes, field2 might be just 1 byte, and field3 might be 1 to 10 bytes, possibly followed by additional field3s of n bytes, depending on the actual value of the data. That is the second problem: I know the fields are there, and for field1 I know it's 4 bytes, but I don't know its actual value, and that value is exactly what I'm interested in. The same goes for the other fields; I need the values of all of them.
What I need in ANTLR
This sounds to me like a common structure and use case for arbitrary binary data, but I don't see any special handling of such data in ANTLR. All the examples use some kind of text, and I don't see any value-extraction callbacks or the like. Additionally, I think I would need callbacks that influence the parsing process itself, so that e.g. one callback is called on the first byte of field3; I check that byte, decide that one to N additional bytes need to be consumed and that those are logically part of field3, and tell the parser, so it's able to proceed "somehow".
In the end, I would get some higher level "field" objects and ANTLR would provide the underlying parse logic with callbacks and listener infrastructure, walking abilities etc.
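For illustration, here is roughly how such a structure could be read by hand without ANTLR (the field sizes and the rule that field3's first byte is a count are made up for this sketch):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class FieldReader {
    // Returns {field1, field2, field3 length}; field3's first byte is
    // (hypothetically) a count of the value bytes that follow it.
    static int[] parse(byte[] datagram) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(datagram));
        int field1 = in.readInt();             // always 4 bytes
        int field2 = in.readUnsignedByte();    // always 1 byte
        int field3Len = in.readUnsignedByte(); // first byte decides what follows
        byte[] field3 = new byte[field3Len];
        in.readFully(field3);                  // variable part of field3
        return new int[] {field1, field2, field3Len};
    }

    public static void main(String[] args) throws IOException {
        byte[] datagram = {0x01, 0x02, 0x03, 0x04, // field1
                           0x2A,                   // field2
                           0x02, 0x10, 0x20};      // field3: count 2, then 2 bytes
        int[] fields = parse(datagram);
        System.out.printf("field1=0x%08X field2=%d field3len=%d%n",
                          fields[0], fields[1], fields[2]);
    }
}
```

What I'm hoping for from ANTLR is to not have to hand-roll this kind of logic, and to get the listener/walker infrastructure on top.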
Did anyone ever do something like that and can provide some hints to examples or the concrete documentation I seem to have missed? Thanks!
EN 13757-3:2012
I don't think it makes understanding my question really easier, but the binary data I'm referring to is defined in the standard EN 13757-3:2012:
Communication systems for meters and remote reading of meters - Part 3: Dedicated application layer
The standard is not freely available on the net (anymore?), but the following PDF might give you an overview of what example data looks like; see page 4. Note especially that the bytes of the mentioned fields are not constant; only the overall structure of the datagram is defined.
http://fastforward.ag/downloads/docu/FAST_EnergyCam-Protocol-wirelessMBUS.pdf
The tokens for the grammar would be the fields, each implemented by a different number of bytes but carrying a value, etc. Given ANTLR's self-description, I would have expected such things to work somehow...
Alternative: Kaitai.io
Whoever is in a position comparable to mine: have a look at Kaitai.io, which looks very promising:
https://stackoverflow.com/a/40527106/2055163
The SolrJ library offers different parsers for Solr's responses.
Namely:
BinaryResponseParser
StreamingBinaryResponseParser
NoOpResponseParser
XMLResponseParser
Sadly the documentation doesn't say much about them, other than:
SolrJ uses a binary format, rather than XML, as its default format.
Users of earlier Solr releases who wish to continue working with XML
must explicitly set the parser to the XMLResponseParser, like so:
server.setParser(new XMLResponseParser());
So it looks like the XMLResponseParser is there mainly for legacy purposes.
What are the differences between the other parsers?
Can I expect performance improvements by using another parser over the XMLResponseParser?
The binary stream parsers are meant to work directly with the Java Object Format (the binary POJO format) to make the creation of data objects as smooth as possible on the client side.
The XML parser was designed to work with the old response format, from when there weren't any real alternatives (as there was no binary response format in Solr). It's a lot more work to consider all the options of an XML format than to use the binary format directly.
The StreamingBinaryResponseParser does the same work as the BinaryResponseParser, but has been designed to make streaming documents possible (i.e. not creating a list of documents and returning that list, but instead returning each document by itself, without having to hold them all in memory at the same time). See SOLR-2112 for a description of the feature and why it was added.
Lastly, yes, if you're using SolrJ, use the binary response format, unless you have a very good reason for using the XML based one. If you have to ask the question, you're probably better off with the binary format.
I am reading about Google Protocol Buffers. I want to know: can I serialize a C++ object, send it over the wire to a Java server, deserialize it there in Java, and introspect the fields?
More generally, I want to send objects from any language to a Java server and deserialize them there.
Assume following is my .proto file
message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}
I ran protoc on this and generated the C++ classes.
Now I basically want to send the serialized stream to a Java server.
On the Java side, can I deserialize the stream so that I can find out there are 3 fields in it, and each field's name, type, and value?
On the Java side, can I deserialize the stream so that I can find out there are 3 fields in it, and each field's name, type, and value?
You will need to know the schema in advance. Firstly, protobuf does not transmit names; all it uses as identifiers is the numeric key (1, 2 and 3 in your example) of each field. Secondly, it does not explicitly specify the type; there are only a very few wire types in protobuf (varint, 32-bit, 64-bit, length-prefix, group); actual data types are mapped onto those, but you cannot unambiguously decode data without the schema:
varint is "some form of integer", but could be signed, unsigned or "zigzag" (which allows negative numbers of small magnitude to be cheaply encoded), and could be intended to represent any width of data (64 bit, 32 bit, etc)
32-bit could be an integer, but could be signed or unsigned - or it could be a 32-bit floating-point number
64-bit could be an integer, but could be signed or unsigned - or it could be a 64-bit floating-point number
length-prefix could be a UTF-8 string, a sequence of raw bytes (without any particular meaning), a "packed" set of repeated values of some primitive type (integer, floating point, etc), or could be a structured sub-message in protobuf format
groups - hoorah! this is always unambiguous! it can only mean one thing; but that one thing is largely deprecated by Google :(
So fundamentally: you need the schema. The encoded data does not include what you want. It does this to avoid unnecessary space - if the protocol assumes that the encoder and decoder both know what the message is meant to look like, then a lot less information needs to be sent.
Note, however, that the information that is included is enough to safely round-trip a message even if there are fields that are not expected; it is not necessary to know the name or type if you only need to re-encode it to pass it along / back.
What you can do is use the parser API to scan over the data to reveal that there are three fields, field 1 is a varint, field 2 is length-prefixed, field 3 is length-prefixed. You could make educated guesses about the data beyond that (for example, you could see whether a UTF-8 decode produces something that looks roughly text-like, and verify that UTF-8 encoding that gives you back the original bytes; if it does, it is possible it is a string)
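To illustrate what such a scan sees, here's a hand-rolled sketch in plain Java (not the official parser API; the bytes hand-encode the example Person message):

```java
import java.nio.charset.StandardCharsets;

public class WireScan {
    // Read a base-128 varint starting at pos[0]; advances pos[0] past it.
    static long readVarint(byte[] buf, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
        }
    }

    public static void main(String[] args) {
        // Hand-encoded Person { id = 150, name = "test", email = "a@b" }:
        //   0x08 = field 1, wire type 0 (varint); 150 encodes as 0x96 0x01
        //   0x12 = field 2, wire type 2 (length-prefixed)
        //   0x1A = field 3, wire type 2 (length-prefixed)
        byte[] msg = {0x08, (byte) 0x96, 0x01,
                      0x12, 0x04, 't', 'e', 's', 't',
                      0x1A, 0x03, 'a', '@', 'b'};
        int[] pos = {0};
        while (pos[0] < msg.length) {
            long tag = readVarint(msg, pos);
            long fieldNo = tag >>> 3;       // field number is the upper bits of the tag
            long wireType = tag & 0x7;      // wire type is the low 3 bits
            if (wireType == 0) {            // varint: value, but signedness/width unknown
                long v = readVarint(msg, pos);
                System.out.println("field " + fieldNo + ": varint " + v);
            } else if (wireType == 2) {     // length-prefixed: bytes, meaning unknown
                int len = (int) readVarint(msg, pos);
                String guess = new String(msg, pos[0], len, StandardCharsets.UTF_8);
                pos[0] += len;
                System.out.println("field " + fieldNo + ": " + len
                                   + " bytes (maybe \"" + guess + "\")");
            }
        }
    }
}
```

Note how the scan recovers field numbers and wire types, but only *guesses* at the meaning of the bytes, which is exactly the limitation described above.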
Can I Serialize C++ object and send it on the wire to Java server and Deserialize there in java and introspect the fields.
Yes, it is the very goal of protobuf.
Serialize data in an application developed in any supported language, and deserialize data in an application developed in any supported language. Serialization and deserialization languages can be the same, or be different.
Keep in mind that protocol buffers are not self-describing, so both sides of your application need to have serializers/deserializers generated from the .proto file.
In short: yes you can.
You will need to create .proto files which define the data structures that you want to share. By using the Google Protocol Buffers compiler you can then generate interfaces and (de)serialization code for your structures for both Java and C++ (and almost any other language you can think of).
To transfer your data over the wire you can use for instance ZeroMQ which is an extremely versatile communications framework which also sports a slew of different language API's, among them Java and C++.
See this question for more details.
I have binary ASN.1 data objects I need to parse into my Java project. I just want the ASN.1 structure and data as it is parsed for example by the BER viewer:
The ASN.1 parser of BouncyCastle is not able to parse this structure (only returns application specific binary data type).
What ASN.1 library can I use to get such a result? Does anybody have sample code that demonstrates how to parse an ASN.1 object?
BTW: I also tried several free ASN.1 Java compilers, but none is able to generate working Java code given my ASN.1 specification.
I have to correct myself: it is possible to read out the data using the ASN.1 parser included in BouncyCastle; however, the process is not that simple.
If you only want to print the data contained in an ASN.1 structure, I recommend using the class org.bouncycastle.asn1.util.ASN1Dump. It can be used with the following simple code snippet:
ASN1InputStream bIn = new ASN1InputStream(new ByteArrayInputStream(data));
ASN1Primitive obj = bIn.readObject();
System.out.println(ASN1Dump.dumpAsString(obj));
It prints the structure but not the data; however, by copying ASN1Dump into a class of your own and modifying it to print out, for example, OCTET STRINGs, this can be done easily.
Additionally, the code in ASN1Dump demonstrates how to parse ASN.1 structures. For example, the data used in my question can be parsed one level deeper using the following code:
DERApplicationSpecific app = (DERApplicationSpecific) obj;
ASN1Sequence seq = (ASN1Sequence) app.getObject(BERTags.SEQUENCE);
Enumeration secEnum = seq.getObjects();
while (secEnum.hasMoreElements()) {
ASN1Primitive seqObj = (ASN1Primitive) secEnum.nextElement();
System.out.println(seqObj);
}
Just pass "true" to also print the values:
ASN1InputStream ais = new ASN1InputStream(
new FileInputStream(new File("d:/myfile.cdr")));
while (ais.available() > 0) {
ASN1Primitive obj = ais.readObject();
System.out.println(ASN1Dump.dumpAsString(obj, true));
}
ais.close();
It is not clear from your question whether or not you have the ASN.1 specification for the BER you are trying to parse. Please note that without the ASN.1 specification, you can only make partial sense of the data if EXPLICIT TAGS were used in the ASN.1 specification from which it was generated. Some tools, such as the one from OSS Nokalva have a library (jar file) called JIAAPI which allows you to traverse and manipulate BER encodings without prior knowledge of the ASN.1 specification.
If you do have the ASN.1 specification, any ASN.1 Java compiler should be able to handle this.
You can download a free trial of the OSS ASN.1 Tools for Java from http://www.oss.com/asn1/products/asn1-download.html to see if it works better for you than the others you tried unsuccessfully.
I need to be able to parse any kind of ASN.1 data in krypt. Although krypt is a Ruby project, you may want to have a look at the JRuby extension - the code for handling ASN.1 parsing/encoding is written entirely in Java and modular enough for easy extraction.
I also made a Java-only version, but it is missing some of the higher-level functionality of the former. But since it's concise, maybe it's a good opportunity to get you started.
If you just want to decode the BER-encoded data, there are numerous parsers out there. Have you tried any? There are even two in the Sun JDK - com.sun.jmx.snmp.BerDecoder and com.sun.jndi.ldap.BerDecoder.
I want to come up with a binary format for passing data between application instances in the form of POFs (Plain Old Files ;)).
Prerequisites:
should be cross-platform
information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores its names in a String[])
only sequential access is required
should be a way to check data consistency
should be small and fast
should prevent an average user with archiver + notepad from modifying the data
Currently I'm using DeflaterOutputStream + OutputStreamWriter together with InflaterInputStream + InputStreamReader to save/restore objects serialized with XStream, one object per file. Readers/Writers use UTF8.
Now, I need to extend this to support what is described above.
My idea of format:
{serialized to XML object}
{delimiter}
{String file name}{delimiter}{byte[] file data}
{delimiter}
{another String file name}{delimiter}{another byte[] file data}
...
{delimiter}
{delimiter}
{MD5 hash for the entire file}
Does this look sane?
What would you use for a delimiter and how would you determine it?
The right way to calculate MD5 in this case?
What would you suggest to read on the subject?
TIA.
It looks INsane.
Why invent a new file format?
Why try to prevent only stupid users from changing the file?
Why use a binary format (hard to compress)?
Why use a format that cannot be parsed while being received? (The receiver has to receive the entire file before being able to act on it.)
XML is already a serialization format that is compressable. So you are serializing a serialized format.
Would serialization of the model (if you are into MVC) not be another way? I'd prefer to use things in the language (or standard libraries) rather than roll my own if possible. The only issue I can see with that is that the file size may be larger than you want.
1) Does this look sane?
It looks fairly sane. However, if you are going to invent your own format rather than just using Java serialization then you should have a good reason. Do you have any good reasons (they do exist in some cases)? One of the standard reasons for using XStream is to make the result human readable, which a binary format immediately loses. Do you have a good reason for a binary format rather than a human readable one? See this question for why human readable is good (and bad).
Wouldn't it be easier just to put everything in a signed jar? There are already standard Java libraries and tools to do this, and compression and verification are provided for you.
2) What would you use for a delimiter and how would you determine it?
Rather than a delimiter, I'd explicitly store the length of each block before the block. It's just as easy, and prevents you from having to escape the delimiter if it appears in the data.
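A sketch of that idea with the JDK's DataOutputStream / DataInputStream (class and helper names here are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Blocks {
    // Write a length prefix, then the raw block; no delimiter, no escaping.
    static void writeBlock(DataOutputStream out, byte[] block) throws IOException {
        out.writeInt(block.length);
        out.write(block);
    }

    // Read the length, then exactly that many bytes.
    static byte[] readBlock(DataInputStream in) throws IOException {
        byte[] block = new byte[in.readInt()];
        in.readFully(block);
        return block;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeBlock(out, "<pojo>...</pojo>".getBytes(StandardCharsets.UTF_8));
        writeBlock(out, new byte[] {1, 2, 3}); // arbitrary binary payload works too

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(new String(readBlock(in), StandardCharsets.UTF_8));
        System.out.println(readBlock(in).length); // 3
    }
}
```

Because each block carries its own length, the reader never has to scan for a sentinel value that might occur inside the data.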
3) The right way to calculate MD5 in this case?
There is example code here which looks sensible.
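For reference, a minimal sketch of writing and verifying an MD5 trailer with the JDK's MessageDigest and DigestOutputStream (class and method names are made up):

```java
import java.io.ByteArrayOutputStream;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.util.Arrays;

public class Md5Trailer {
    // Write the payload through a DigestOutputStream, then append the
    // 16-byte MD5 digest as a trailer (the digest itself is not hashed).
    static byte[] withTrailer(byte[] payload) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        DigestOutputStream out = new DigestOutputStream(file, md5);
        out.write(payload);
        out.flush();
        file.write(md5.digest());
        return file.toByteArray();
    }

    // Re-hash everything except the last 16 bytes and compare with the trailer.
    static boolean verify(byte[] all) throws Exception {
        byte[] body = Arrays.copyOf(all, all.length - 16);
        byte[] stored = Arrays.copyOfRange(all, all.length - 16, all.length);
        return MessageDigest.isEqual(
                MessageDigest.getInstance("MD5").digest(body), stored);
    }

    public static void main(String[] args) throws Exception {
        byte[] all = withTrailer("some payload".getBytes("UTF-8"));
        System.out.println(verify(all)); // true
    }
}
```

Streaming the payload through DigestOutputStream means the whole file never needs to be buffered just to compute the hash.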
4) What would you suggest to read on the subject?
On the subject of serialization? I'd read about the Java serialization, JSON, and XStream serialization so I understood the pros and cons of each, especially the benefits of human readable files. I'd also look at a classic file format, for example from Microsoft, to understand possible design decisions from back in the days that every byte mattered, and how these have been extended. For example: The WAV file format.
Let's see this should be pretty straightforward.
Prerequisites:
0. should be cross-platform
1. information to be persisted includes a single POJO & arbitrary byte[]s (files actually, the POJO stores its names in a String[])
2. only sequential access is required
3. should be a way to check data consistency
4. should be small and fast
5. should prevent an average user with archiver + notepad from modifying the data
Well, guess what: you pretty much have it already; it's built into the platform: Object Serialization.
If you need to reduce the amount of data sent over the wire and provide custom serialization (for instance, sending only the values 1, 2, 3 for a given object, without the attribute names or anything similar, and reading them back in the same sequence), you can use this somewhat "hidden feature".
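One way to write only the raw values, in a fixed order and with no field names, is java.io.Externalizable; here's a minimal sketch (whether this is exactly the linked "hidden feature" is my assumption; the class name is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Writes only the raw field values, in a fixed order, with no field names.
public class Compact implements Externalizable {
    private int a;
    private String b;

    public Compact() {}                       // no-arg constructor required by Externalizable
    public Compact(int a, String b) { this.a = a; this.b = b; }

    @Override public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(a);                      // values only, same order on both sides
        out.writeUTF(b);
    }

    @Override public void readExternal(ObjectInput in) throws IOException {
        a = in.readInt();                     // must read in the exact write order
        b = in.readUTF();
    }

    @Override public String toString() { return a + "/" + b; }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(new Compact(1, "two"));
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            System.out.println(ois.readObject()); // 1/two
        }
    }
}
```

The stream still carries the class name once, but the per-field overhead of default serialization is gone.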
If you really need it in plain text, you can also encode it; it takes almost the same number of bytes.
For instance this bean:
import java.io.*;

public class SimpleBean implements Serializable {
    private String website = "http://stackoverflow.com";

    public String toString() {
        return website;
    }
}
Could be represented like this:
rO0ABXNyAApTaW1wbGVCZWFuPB4W2ZRCqRICAAFMAAd3ZWJzaXRldAASTGphdmEvbGFuZy9TdHJpbmc7eHB0ABhodHRwOi8vc3RhY2tvdmVyZmxvdy5jb20=
See this answer
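The round trip can be sketched with the JDK's object serialization plus java.util.Base64 (Java 8+; class and method names here are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

public class Encode {
    // Serialize any Serializable object and Base64-encode the resulting bytes.
    static String toBase64(Serializable obj) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
            oos.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    // Decode the text and deserialize the object back.
    static Object fromBase64(String text) throws Exception {
        byte[] raw = Base64.getDecoder().decode(text);
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(raw))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = toBase64("http://stackoverflow.com");
        System.out.println(text);             // Base64 of the serialized form
        System.out.println(fromBase64(text)); // http://stackoverflow.com
    }
}
```

The exact Base64 string depends on the serialized class (e.g. its serialVersionUID), so don't compare it byte-for-byte across class versions.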
Additionally, if you need a sound protocol, you can also check out Protobuf, Google's data interchange format.
You could use a zip (rar / 7z / tar.gz / ...) library. Many exist, most are well tested, and it'll likely save you some time.
Possibly not as much fun though.
I agree in that it doesn't really sound like you need a new format, or a binary one.
If you truly want a binary format, why not consider one of these first:
Binary XML (fast infoset, Bnux)
Hessian
Google Protocol Buffers
But besides that, many textual formats should work just fine (or perhaps better) too; they are easier to debug and have extensive tool support, and they compress to about the same size as binary (binary compresses poorly, and information theory suggests that for the same effective information the same compression rate is achieved; this has been true in my testing).
So perhaps also consider:
JSON works well; binary support via Base64 (with, say, http://jackson.codehaus.org/)
XML not too bad either; efficient streaming parsers, some with base64 support (http://woodstox.codehaus.org/, "typed access API" under 'org.codehaus.stax2.typed.TypedXMLStreamReader').
So it kind of sounds like you just want to build something of your own. Nothing wrong with that, as a hobby, but if so you need to consider it as such.
It likely is not a requirement for the system you are building.
Perhaps you could explain how this is better than using an existing file format such as JAR.
Most standard file formats of this type just use a CRC, as it's faster to calculate. MD5 is more appropriate if you want to prevent deliberate modification.
Bencode could be the way to go.
Here's an excellent implementation by Daniel Spiewak.
Unfortunately, the bencode spec doesn't support UTF-8, which is a showstopper for me.
I might come back to this later, but currently XML seems like the better choice (with blobs serialized as a Map).