I'm working on a proprietary TCP protocol. This protocol sends and receives messages with a specific sequence of bytes.
I have to be compliant with this protocol, and I can't change it.
So my input/output looks something like this:
\x01\x08\x00\x01\x00\x00\x01\xFF
\x01 - Message type
\x01 - Message type
\x00\x01 - Length
\x00\x00\x01 - Transaction
\xFF - Body
The order of the fields is important, and I want only the values of the fields in my serialization, nothing about the structure of the class.
I'm working on a Java controller that uses this protocol, and I thought I could define the message structures in specific classes and serialize/deserialize them, but I was naive.
First of all I tried ObjectOutputStream, but it outputs the entire structure of the object, whereas I need only the values in a specific order.
Someone already faced this problem:
Java - Object to Fixed Byte Array
and solved it with a dedicated Marshaller.
But I was searching for a more flexible solution.
For text serialization and deserialization I've found:
http://jeyben.github.io/fixedformat4j/
which defines the schema of the line with annotations. But it outputs a String, not a byte[]. So 1 is output as "1", which is represented differently depending on the encoding, and often with more bytes.
What I was searching for is something that, given the order of my class properties, will convert each property into a run of bytes (based on its internal representation) and append them to a byte[].
Do you know some library used for that purpose?
Or a simple way to do that, without coding a serialization algorithm for each of my entities?
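To make the target layout concrete, here is a minimal hand-written sketch using java.nio.ByteBuffer. It is not a library recommendation. The class and field names are hypothetical, the meaning of the second byte is not given above so it is carried as an opaque value, and the sketch assumes that Length is the body length in big-endian order (which matches the sample, but is a guess).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class FixedLayoutMessage {
    final byte type;
    final byte second;      // hypothetical name: the meaning of the second byte is not stated
    final int transaction;  // only the low 24 bits are written
    final byte[] body;

    FixedLayoutMessage(byte type, byte second, int transaction, byte[] body) {
        if (transaction < 0 || transaction > 0xFFFFFF) {
            throw new IllegalArgumentException("transaction must fit in 3 bytes");
        }
        if (body.length > 0xFFFF) {
            throw new IllegalArgumentException("body must fit a 2-byte length");
        }
        this.type = type;
        this.second = second;
        this.transaction = transaction;
        this.body = body.clone();
    }

    /** Writes the field values, and nothing else, in wire order. */
    byte[] toBytes() {
        ByteBuffer buf = ByteBuffer.allocate(1 + 1 + 2 + 3 + body.length)
                .order(ByteOrder.BIG_ENDIAN);
        buf.put(type);
        buf.put(second);
        buf.putShort((short) body.length);      // assumption: Length = body length
        buf.put((byte) (transaction >>> 16));   // 3-byte transaction, high byte first
        buf.put((byte) (transaction >>> 8));
        buf.put((byte) transaction);
        buf.put(body);
        return buf.array();
    }

    /** Reads the same layout back. */
    static FixedLayoutMessage fromBytes(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire).order(ByteOrder.BIG_ENDIAN);
        byte type = buf.get();
        byte second = buf.get();
        int length = buf.getShort() & 0xFFFF;
        int transaction = ((buf.get() & 0xFF) << 16)
                | ((buf.get() & 0xFF) << 8)
                | (buf.get() & 0xFF);
        byte[] body = new byte[length];
        buf.get(body);
        return new FixedLayoutMessage(type, second, transaction, body);
    }
}
```

Under those assumptions, new FixedLayoutMessage((byte) 1, (byte) 8, 1, new byte[] {(byte) 0xFF}).toBytes() yields the eight bytes of the sample message.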
Serialization just isn't easy; it sounds from your question like you feel you can just invoke something and out rolls compact, simple, versionable, universal data you can then put on the wire. What you need to fix is to scratch the word 'just' from that sentence. You're going to have to invest some time and care.
As you figured out already, java's baked in serialization has a ton of downsides. Don't use that.
There are various serializers. The popular ones are things like GSON or Jackson, which let you serialize Java objects into JSON. This isn't particularly efficient, and it is string based. These sound like crucial downsides but they really aren't; see below.
You can also spend a little more time specifying the exact format and use protobuf, which lets you write a quite lean and simple data protocol (and protobuf is available for many languages, in case you eventually want to write a participant in this protocol in a non-Java language).
So, those are the good options: Go to JSON via Jackson or GSON, or, use protobuf.
But JSON is a string.
You can turn a string to bytes trivially using str.getBytes(StandardCharsets.UTF_8). This cannot fail due to charset encoding differences, as long as you also 'decode' in the same fashion: turn the bytes into a string with new String(theBytes, StandardCharsets.UTF_8). UTF-8 is guaranteed to be available on all JVMs; if it is not there, your JVM is as broken as a JVM that is missing the String class - not something to worry about.
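A minimal sketch of that round trip, with a hypothetical class name and sample string:

```java
import java.nio.charset.StandardCharsets;

final class Utf8RoundTrip {
    /** Encodes text with an explicit charset, never the platform default. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes with the same charset that was used to encode. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"Zo\u00eb\"}";   // sample payload with one non-ASCII character
        byte[] wire = encode(json);
        System.out.println(decode(wire).equals(json));
    }
}
```

The point is only that both sides name the charset explicitly; the no-argument getBytes() depends on the platform default and is the usual source of mismatches.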
But JSON is inefficient.
Zip it up, of course. You can trivially wrap an InputStream and an OutputStream so that gzip compression is applied, which is simple, available on just about every platform, and fast (it's not the most efficient cutting-edge compression algorithm, but usually squeezing the last few bytes out is not worth it) - and zipped-up JSON can often be more efficient than carefully handrolled protobuf, even.
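As a rough sketch of the wrapping, using only java.util.zip from the JDK (the class name is hypothetical, and in-memory streams stand in for the socket streams):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipJson {
    /** Compresses a JSON string; closing the gzip stream writes the trailer. */
    static byte[] compress(String json) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return sink.toByteArray();
    }

    /** Reverses compress(): gunzip, then decode as UTF-8. */
    static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

How much this saves depends entirely on the data; repetitive JSON compresses well, while tiny messages can come out larger because of the gzip header.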
The one downside is that it's 'slow', but on modern hardware, note that the overhead of encrypting and decrypting this data (which you should obviously be doing!!) is usually multiple orders of magnitude more involved. A modern CPU is simply very, very fast - creating JSON and zipping it up is going to take 1% of CPU or less even if you are shipping the collected works of shakespeare every second.
If an arduino running on batteries needs to process this data, go with uncompressed, unencrypted protobuf-based data. If you are facebook and writing the whatsapp protocol, the IAAS creds saved by not having to unzip and decode JSON is tiny and pales in comparison to the creds you spend just running the servers, but at that scale it's worth the development effort.
In just about every other case, just toss gzipped JSON on the line.
The scenario is that I need to output a very large JSON data set. This JSON data is consumed by mobile applications. In the Java service application, I print the JSON as shown below.
response.getWriter().println(mainjson);
getWriter is taking too much time to print all the data.
I heard about getOutputStream also. Which is faster in case of large json data?
Any help will be appreciated :-)
It depends on how you retrieve the data and whether your JSON serializer has a streaming api available.
At the moment you are probably operating in three separate steps:
Retrieving all your data
Serializing it to JSON string
Writing the JSON response.
If you are spending a substantial amount of time on the retrieval and serialization part by itself then you can potentially speed things up by using streams. However, this requires your data retrieval and json serializer to support streams.
When using streams, instead of doing everything in sequential steps you basically set up a pipeline that allows you to start writing the response a bit earlier. This is not guaranteed to be faster though; it depends on where your particular bottleneck occurs. If it's almost all an issue with the IO to the client then you are not going to see a substantial difference.
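To illustrate the pipeline idea, here is a small sketch that writes a JSON array one element at a time instead of building the whole string first. The names are hypothetical, and the rows are plain integers so that no escaping is needed. A serializer with a real streaming API would handle arbitrary values, and in a servlet the Writer would come from the response.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Iterator;

final class StreamingJsonArray {
    /**
     * Writes each row as soon as the iterator produces it, so the full
     * JSON document never has to exist in memory as one String.
     */
    static void writeArray(Iterator<Integer> rows, Writer out) throws IOException {
        out.write('[');
        boolean first = true;
        while (rows.hasNext()) {
            if (!first) {
                out.write(',');
            }
            out.write(Integer.toString(rows.next()));
            first = false;
        }
        out.write(']');
        out.flush();
    }
}
```

Whether this helps depends on the iterator being lazy as well; if the data source loads everything up front, only the serialization step is overlapped with the write.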
Also
Something else to look into is whether you are compressing your response to the user. Gzip can have a substantial impact on the size of text data and may reduce your payload sufficiently to make this a non-issue.
I know deserialization can be vulnerable when an object is serialized with the standard "Serializable" interface (refer to this). But is this vulnerability applied when an object is serialized to XML or JSON? And if it is, how does that happen?
I can't really see how that could happen, so I would appreciate some examples.
Thanks in advance.
That quite specifically depends on the serialization library that you use to deserialize objects and often the parameters used, so it's hard to provide a single answer.
As to "is it possible", yes, it's possible. Here's a sample exploit for XStream, for example:
http://blog.diniscruz.com/2013/12/xstream-remote-code-execution-exploit.html
A general chat around the topic follows:
A good defence against bad data is to use a serialisation technology that allows one to write a full specification. By full specification, I mean not only the structure / content of objects, but that every value field's valid range can be specified, and every list/array length specified.
There's not many that do this. ASN.1, XSD (XML), and AFAIK JSON schemas can all have value and size constraints. Interestingly there is a formally defined translation between ASN.1 and XSD schemas.
It's then down to whether or not the tools you use actually do anything with these. Most of the ASN.1 tools I've seen do this very well, and will also tell you if you're trying to serialise an object that doesn't conform to the schema. The idea is that bad data is rejected as it's read (so you never get an invalid object in memory) and you can never accidentally send / write bad data yourself, even if you wanted to.
Some XSD tools do constraints checking. I think xsd2code++ does. AFAIK xsd.exe from Microsoft does not.
I'm not so familiar with the land of JSON, but as far as I can tell one tends to read in whole objects and then compare them to the schema (which strikes me as being "too late"), rather than have some autogenerated code read the data and check it for you as it does so. When serialising objects it's up to the programmer to compare the result to the schema.
In contrast, technologies like Google Protocol Buffers don't let you do constraints checking at all. With GPB the best you can do is comment the .proto file and hope developers read it.
The code first approach, directly writing serialisable classes in C# / Java can do constraints checks, but only if you write the code yourself.
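As an illustration of what "write the code yourself" means in the code-first case, here is a minimal Java sketch. The class, its fields and its limits are all invented for the example. The value range and the list length are checked in the constructor, so an out-of-range object can never exist in memory, whichever deserializer built it.

```java
final class Reading {
    static final int MAX_CHANNEL = 7;    // hypothetical constraint: channel in 0..7
    static final int MAX_SAMPLES = 16;   // hypothetical constraint: at most 16 samples

    final int channel;
    final int[] samples;

    Reading(int channel, int[] samples) {
        // Value-range constraint, the hand-written equivalent of a schema range.
        if (channel < 0 || channel > MAX_CHANNEL) {
            throw new IllegalArgumentException("channel out of range: " + channel);
        }
        // Length constraint, the hand-written equivalent of a schema size limit.
        if (samples.length > MAX_SAMPLES) {
            throw new IllegalArgumentException("too many samples: " + samples.length);
        }
        this.channel = channel;
        this.samples = samples.clone();  // defensive copy so the check stays valid
    }
}
```

This only protects you if every deserialization path goes through the constructor, which is not true of all serializers.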
Useful Old Technology
Of all the serialisations I've ever used, by far the most rigorous has been ASN.1 (using decent ASN.1 tools). It's old and very telecommunications-ish (late 1980s, from the ITU; if you have trouble sleeping, go read one of their standards). However, despite its age it's still bang up to date, continually evolving.
For example, since its original days it has grown several surprisingly modern wire formats: XML and JSON. Yes, that's right; you can have an ASN.1 schema that gets compiled to code (C++, Java, C#) that will serialise to XML or JSON data formats (as well as ASN.1's more traditional binary formats like BER, uPER, etc).
The constraints rigour and the data format flexibility are surprisingly useful; you can receive some ultra-compact bit-encoded uPER message from a radio, have it constraints-checked as you read it, and then pass it on elsewhere as JSON/XML, all without having to write any code by hand.
When it comes to complex systems integration problems, I've not found anything to beat it.
I am just getting in to writing networked code using Sockets in Java. I'm just making some test programs. Originally I was going to send data as comma separated values, but I recently discovered ObjectOutputStream. Which method would be faster or more bandwidth efficient? For example, if I'm making a game where I have to send x and y coordinates very often, should I send it through PrintWriter separated by a comma, or make a Position class and send an instance over ObjectOutputStream. What if I change my code and need to send a lot more data?
What are the pros and cons of sending data as CSV over PrintWriter vs as fields in an object over ObjectOutputStream?
An ad-hoc binary format has a good chance of being more bandwidth-efficient than the default serialization format, which in turn should be more or less as bandwidth-efficient as a text-based format (but that's a wild guess, and it depends on the nature and amount of data: you should measure it if it matters).
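Measuring is cheap. The sketch below, with a hypothetical Position class taken from the question's example, counts the bytes produced by the three approaches for a single coordinate pair:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

final class PositionSizes {
    static final class Position implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Position(int x, int y) { this.x = x; this.y = y; }
    }

    /** Text format: "x,y" plus a newline. */
    static int csvSize(Position p) {
        return (p.x + "," + p.y + "\n").getBytes(StandardCharsets.US_ASCII).length;
    }

    /** Ad-hoc binary format: two 4-byte ints. */
    static int binarySize(Position p) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(p.x);
            out.writeInt(p.y);
        }
        return sink.size();
    }

    /** Default Java serialization, including stream header and class descriptor. */
    static int serializedSize(Position p) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(p);
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        Position p = new Position(10, 20);
        System.out.println("csv=" + csvSize(p)
                + " binary=" + binarySize(p)
                + " serialized=" + serializedSize(p));
    }
}
```

A single object exaggerates the serialization overhead, because the class descriptor is written once per stream; for a fair comparison, measure with a realistic number of messages.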
But bandwidth efficiency is not the only thing that matters.
Using serialization, the client and the server must be written in Java, and have the classes of the serialized objects in their classpath. If you intend to have clients written in any language, you shouldn't consider it.
If serialization is OK, it's of course a really easy way to transform almost any Java object into bytes, which allows you to avoid defining a format.
Note that there are alternatives that provide almost the same flexibility, but don't have the Java-only disadvantage of serialization. For example, JSON, XML, or protobuf.
I think CSV is smaller.
If you want to check the data size, try writing the output to a file.
I also don't recommend ObjectOutputStream, for another reason:
you have to keep the objects compatible between versions.
Have you researched serialization and serialVersionUID?
Please check java.io.Serializable.
I have a JAX-RS web service that calls a DB2 z/OS database and returns about 240 MB of data in a resultset. I am then creating an OutputStream to send this data to the client by looping through the resultset and adding a few XML tags for my output.
I am confused about which to use: PrintWriter, BufferedWriter or OutputStreamWriter. I am looking for the fastest way to deliver the data. I also don't want the JVM to hold onto this data any longer than it needs to, so I don't use up its memory.
Any help is appreciated.
You should use
BufferedWriter
Call .flush() frequently
Enable gzip for best compression
Start thinking about a different way of doing this. Can your data be paginated? Do you need all the data in one request?
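The first three points can be sketched together as follows. The class name, the row/rows tags and the flush interval are placeholders, values are written without XML escaping to keep the sketch short, and an Iterator stands in for the loop over the resultset.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.zip.GZIPOutputStream;

final class RowStreamer {
    /**
     * Streams rows as gzip-compressed XML, flushing every flushEvery rows
     * so that data leaves the JVM instead of piling up in buffers.
     */
    static void writeRows(Iterator<String> rows, OutputStream rawOut, int flushEvery)
            throws IOException {
        // syncFlush = true makes flush() push out the compressed data written so far.
        try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(rawOut, true), StandardCharsets.UTF_8))) {
            out.write("<rows>\n");
            int written = 0;
            while (rows.hasNext()) {
                out.write("<row>");
                out.write(rows.next());      // assumption: value needs no escaping
                out.write("</row>\n");
                if (++written % flushEvery == 0) {
                    out.flush();
                }
            }
            out.write("</rows>\n");
        }   // closing writes the gzip trailer and closes rawOut
    }
}
```

Flushing too often hurts the compression ratio, so the interval is a trade-off between memory held and bytes sent. In a real service the client must also be told the body is gzip-encoded, which this sketch does not cover.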
If you are sending large binary data, you probably don't want to use XML. When XML is used, binary data is usually represented using base64, which is larger than the original binary and uses quite a lot of CPU for the conversion.
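The size increase is easy to check with java.util.Base64: every 3 input bytes become 4 output characters. A small sketch, with a hypothetical class name:

```java
import java.util.Base64;

final class Base64Overhead {
    /** Returns the number of characters base64 needs for rawBytes bytes of input. */
    static int encodedLength(int rawBytes) {
        return Base64.getEncoder().encodeToString(new byte[rawBytes]).length();
    }

    public static void main(String[] args) {
        // 300 raw bytes encode to 400 characters, roughly a third larger.
        System.out.println(encodedLength(300));
    }
}
```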
If I were you, I'd send the binary separate from the xml. If you are using WebService, MTOM attachment could help. Otherwise you could send the reference to the binary data in the xml, and let the app. download the binary data separately.
As for the fastest way to send binary, if you are using WebLogic, just writing to the response's output stream would be OK. That output stream is most probably buffered, and whatever you do probably won't change the performance anyway.
Turning on gzip could also help depending on what you are sending (e.g. if you are sending jpeg (stuff that is already compressed) or something, it won't help a lot but if you are sending raw text then it can help a lot, etc.).
One solution (which might not work for you) is to spawn a job / thread that creates a file and then notifies the user when the file is ready to download, in this way you're not tied to the bandwidth of the client connection (and you can even compress the file properly, before the client downloads it)
Some Business Intelligence and data crunching applications do this, especially if the process takes some time to generate the data.
The maximum output speed will be limited by network bandwidth, and I am sure any Java OutputStream will be so much faster than the network that you will not notice a difference.
The choice depends on the data to send: if it is text (lines), PrintWriter is easy; if it is a byte array, use OutputStream.
To avoid holding too much data in the buffers, you should call flush() every x KB or so.
You should never use PrintWriter to output data over a network. First of all, it creates platform-dependent line breaks. Second, it silently catches all I/O exceptions, which makes it hard for you to deal with those exceptions.
And if you're sending 240 MB as XML, then you're definitely doing something wrong. Before you start worrying about which stream class to use, try to reduce the amount of data.
EDIT:
The advice about PrintWriter (and PrintStream) came from a book by Elliotte Rusty Harold. I can't remember which one, but it was a few years ago. I think that ServletResponse.getWriter() was added to the API after that book was written - so it looks like Sun didn't follow Rusty's advice. I still think it was good advice - for the reasons stated above, and because it can tempt implementation authors to violate the API contract in order to get predictable behavior.
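The exception-swallowing behaviour is easy to reproduce. In this sketch a hypothetical BrokenStream stands in for a dropped connection; println() returns normally, and the failure is only visible by polling checkError().

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;

final class SilentPrintWriter {
    /** A stream that always fails, standing in for a dropped network connection. */
    static final class BrokenStream extends OutputStream {
        @Override
        public void write(int b) throws IOException {
            throw new IOException("connection reset");
        }
    }

    /** Returns true if PrintWriter recorded an I/O error without throwing it. */
    static boolean writeAndCheck() {
        PrintWriter out = new PrintWriter(new BrokenStream());
        out.println("hello");       // no exception reaches the caller
        return out.checkError();    // flushes, then reports whether any write failed
    }

    public static void main(String[] args) {
        System.out.println(writeAndCheck());
    }
}
```

With a plain Writer or OutputStream the same failure would surface as an IOException at the point of the write or flush.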
I have a large data structure which I'm serializing. At certain times I need to edit values in the data structure. But just to change a small value I have to re-serialize all of it, instead of updating the changed value in the file. I've heard of Google Protocol Buffers. Will using them solve my problem of rewriting the file? Is it a better option for me to use Protocol Buffers instead of Java serialization?
Protocol buffers are themselves a serialization format, so they won't fundamentally change the picture (you'll still need to re-serialize after you change a value).
Google's docs claim that protocol buffers are more compact and faster to parse than XML (which seems plausible); don't know how they compare to native Java serialization.
Advantages of protocol buffers might be portability (if programs written in other languages need to read the file) and upgradability (you can add new fields to the data structure without breaking the file format).
A couple of points
There is an editor for Protocol Buffers binary format (http://code.google.com/p/protobufeditor/)
Protocol buffers has a text format that looks like:
# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "jdoe@example.com"
}
See:
Discussion: http://groups.google.com/group/protobuf/browse_thread/thread/04fc478088137bf3
Class: http://code.google.com/apis/protocolbuffers/docs/reference/java/com/google/protobuf/TextForm
Having said that, I would use a technology (JSon, Xml etc) that is already in use unless one of the following applies
You need the performance of protocol buffers
You already / plan to use protocol buffers
If you care about performance, don't use a text format for your data. If you want to modify the data without deserializing, you'll want to use a fixed record data format. You'll probably have to invent this manually. Then seek to the correct position in the file and rewrite just the changed field. You might look at DataOutputStream to get started or instead use a database such as HSQLDB to store and edit your data.
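A minimal sketch of the fixed-record idea, using java.io.RandomAccessFile. The record layout (a 4-byte id followed by an 8-byte value) and all names are invented for the example. Because every record has the same size, the position of any field can be computed and that field rewritten in place.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class FixedRecordFile {
    static final int RECORD_SIZE = 12;   // hypothetical layout: int id (4) + long value (8)
    static final int VALUE_OFFSET = 4;   // the value starts after the id

    /** Appends one record at the end of the file. */
    static void append(RandomAccessFile file, int id, long value) throws IOException {
        file.seek(file.length());
        file.writeInt(id);
        file.writeLong(value);
    }

    /** Overwrites only the value field of one record; nothing else is rewritten. */
    static void updateValue(RandomAccessFile file, int recordIndex, long newValue)
            throws IOException {
        file.seek((long) recordIndex * RECORD_SIZE + VALUE_OFFSET);
        file.writeLong(newValue);
    }

    /** Reads the value field of one record. */
    static long readValue(RandomAccessFile file, int recordIndex) throws IOException {
        file.seek((long) recordIndex * RECORD_SIZE + VALUE_OFFSET);
        return file.readLong();
    }
}
```

This only works while every field has a fixed width; variable-length strings or nested objects are where a database starts to look more attractive.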
Thinking about this more, unless your objects are very simple, I think a database would be a better way to go.
More info on DataOutputStream:
http://download.oracle.com/javase/tutorial/essential/io/datastreams.html
Java Databases:
http://java-source.net/open-source/database-engines
You need a serialization format that can be modified directly, for example XML or JSON. Google Protocol Buffers is a binary format, like Java serialization, and thus cannot be modified directly...