For some caching I'm planning to do in an upcoming project, I've been thinking about Java serialization. Namely, should it be used?
Now, I've previously written custom serialization and deserialization (Externalizable) for various reasons over the years. These days interoperability has become even more of an issue, and as I can foresee a need to interact with .NET applications, I've thought about using a platform-independent solution.
Has anyone had any experience with high-performance use of Google Protocol Buffers (GPB)? How does it compare in terms of speed and efficiency with Java's native serialization? Alternatively, are there any other schemes worth considering?
I haven't compared Protocol Buffers with Java's native serialization in terms of speed, but for interoperability Java's native serialization is a serious no-no. It's also not going to be as efficient in terms of space as Protocol Buffers in most cases. Of course, it's somewhat more flexible in terms of what it can store, and in terms of references etc. Protocol Buffers is very good at what it's intended for, and when it fits your need it's great - but there are obvious restrictions due to interoperability (and other things).
I've recently posted a Protocol Buffers benchmarking framework in Java and .NET. The Java version is in the main Google project (in the benchmarks directory), the .NET version is in my C# port project. If you want to compare PB speed with Java serialization speed you could write similar classes and benchmark them. If you're interested in interop though, I really wouldn't give native Java serialization (or .NET native binary serialization) a second thought.
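If you do go down that route, a minimal timing harness for the Java-serialization side might look something like this (just a sketch; buildSampleMessage() is a hypothetical stand-in for whatever payload you want to measure, and you'd want proper JIT warm-up before trusting the numbers):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class JavaSerializationBenchmark {
    public static void main(String[] args) throws IOException {
        Object message = buildSampleMessage();
        int iterations = 100000; // run the loop once first to warm up the JIT
        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(baos);
            oos.writeObject(message);
            oos.close();
            totalBytes += baos.size();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d iterations in %.2fs; %.2f MB/s%n",
                iterations, seconds, totalBytes / seconds / (1024 * 1024));
    }

    // Hypothetical payload; substitute something comparable to the PB test messages.
    private static Object buildSampleMessage() {
        return new java.util.ArrayList<String>(
                java.util.Arrays.asList("one", "two", "three"));
    }
}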
There are other options for interoperable serialization besides Protocol Buffers though - Thrift, JSON and YAML spring to mind, and there are doubtless others.
EDIT: Okay, with interop not being so important, it's worth trying to list the different qualities you want out of a serialization framework. One thing you should think about is versioning - this is another thing that PB is designed to handle well, both backwards and forwards (so new software can read old data and vice versa) - when you stick to the suggested rules, of course :)
Although I've tried to be cautious when talking about PB performance vs native Java serialization, I really wouldn't be surprised to find that PB was faster anyway. If you have the chance, use the server VM - my recent benchmarks showed the server VM to be over twice as fast at serializing and deserializing the sample data. I think the PB code suits the server VM's JIT very nicely :)
Just as sample performance figures, serializing and deserializing two messages (one 228 bytes, one 84750 bytes) I got these results on my laptop using the server VM:
Benchmarking benchmarks.GoogleSize$SizeMessage1 with file google_message1.dat
Serialize to byte string: 2581851 iterations in 30.16s; 18.613789MB/s
Serialize to byte array: 2583547 iterations in 29.842s; 18.824497MB/s
Serialize to memory stream: 2210320 iterations in 30.125s; 15.953759MB/s
Deserialize from byte string: 3356517 iterations in 30.088s; 24.256632MB/s
Deserialize from byte array: 3356517 iterations in 29.958s; 24.361889MB/s
Deserialize from memory stream: 2618821 iterations in 29.821s; 19.094952MB/s
Benchmarking benchmarks.GoogleSpeed$SpeedMessage1 with file google_message1.dat
Serialize to byte string: 17068518 iterations in 29.978s; 123.802124MB/s
Serialize to byte array: 17520066 iterations in 30.043s; 126.802376MB/s
Serialize to memory stream: 7736665 iterations in 30.076s; 55.93307MB/s
Deserialize from byte string: 16123669 iterations in 30.073s; 116.57947MB/s
Deserialize from byte array: 16082453 iterations in 30.109s; 116.14243MB/s
Deserialize from memory stream: 7496968 iterations in 30.03s; 54.283176MB/s
Benchmarking benchmarks.GoogleSize$SizeMessage2 with file google_message2.dat
Serialize to byte string: 6266 iterations in 30.034s; 16.826494MB/s
Serialize to byte array: 6246 iterations in 30.027s; 16.776697MB/s
Serialize to memory stream: 6042 iterations in 29.916s; 16.288969MB/s
Deserialize from byte string: 4675 iterations in 29.819s; 12.644595MB/s
Deserialize from byte array: 4694 iterations in 30.093s; 12.580387MB/s
Deserialize from memory stream: 4544 iterations in 29.579s; 12.389998MB/s
Benchmarking benchmarks.GoogleSpeed$SpeedMessage2 with file google_message2.dat
Serialize to byte string: 39562 iterations in 30.055s; 106.16416MB/s
Serialize to byte array: 39715 iterations in 30.178s; 106.14035MB/s
Serialize to memory stream: 34161 iterations in 30.032s; 91.74085MB/s
Deserialize from byte string: 36934 iterations in 29.794s; 99.98019MB/s
Deserialize from byte array: 37191 iterations in 29.915s; 100.26867MB/s
Deserialize from memory stream: 36237 iterations in 29.846s; 97.92251MB/s
The "speed" vs "size" is whether the generated code is optimised for speed or code size. (The serialized data is the same in both cases. The "size" version is provided for the case where you've got a lot of messages defined and don't want to take a lot of memory for the code.)
As you can see, for the smaller message it can be very fast - over 500 small messages serialized or deserialized per millisecond. Even with the 87K message it's taking less than a millisecond per message.
One more data point: this project:
http://code.google.com/p/thrift-protobuf-compare/
gives some idea of expected performance for small objects, covering Java serialization as well as PB.
Results vary a lot depending on your platform, but there are some general trends.
You might also have a look at FST, a drop-in replacement for built-in JDK serialization that should be faster and have smaller output.
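A minimal FST round trip looks something like this (a sketch against the FST 2.x API; the package name and exact API may differ between FST versions):

import org.nustiq.serialization.FSTConfiguration;

public class FstDemo {
    // FST recommends creating one configuration and reusing it; it's thread-safe.
    static FSTConfiguration conf = FSTConfiguration.createDefaultConfiguration();

    public static void main(String[] args) {
        java.util.List<String> list = new java.util.ArrayList<String>(
                java.util.Arrays.asList("one", "two", "three"));
        byte[] bytes = conf.asByteArray(list); // serialize
        Object back = conf.asObject(bytes);    // deserialize
        System.out.println(back);
    }
}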
Rough estimates based on the frequent benchmarking I have done in recent years:

- 100% = binary/struct-based approaches (e.g. SBE, fst-structs)
  - inconvenient
  - post-processing (building up "real" objects on the receiver side) may eat up the performance advantage, and is never included in benchmarks
- ~10%-35%: protobuf & derivatives
- ~10%-30%: fast serializers such as FST and KRYO
  - convenient; deserialized objects can most often be used directly without additional manual translation code
  - can be pimped for performance (annotations, class registering)
  - preserve links in the object graph (no object serialized twice)
  - can handle cyclic structures
  - generic solutions; FST is fully compatible with JDK serialization
- ~2%-15%: JDK serialization
- ~1%-15%: fast JSON (e.g. Jackson)
  - cannot handle arbitrary object graphs, only a small subset of Java data structures
  - no reference restoring
- ~0.001%-1%: full-graph JSON/XML (e.g. JSON.io)
These numbers are meant to give a very rough order-of-magnitude impression.
Note that performance depends A LOT on the data structures being serialized/benchmarked, so single simple-class benchmarks are mostly useless (though popular: e.g. ignoring Unicode, no collections, ...).

See also:
http://java-is-the-new-c.blogspot.de/2014/12/a-persistent-keyvalue-server-in-40.html
http://java-is-the-new-c.blogspot.de/2013/10/still-using-externalizable-to-get.html
What do you mean by high performance? If you want millisecond-level serialization, I suggest you use whichever serialization approach is simplest. If you want sub-millisecond, you are likely to need a binary format. If you want much below 10 microseconds, you are likely to need custom serialization.

I haven't seen many benchmarks for serialization/deserialization, but few solutions support less than 200 microseconds for a serialize/deserialize round trip.
Platform-independent formats come at a cost (in effort on your part, and in latency); you may have to decide whether you want performance or platform independence. However, there is no reason you cannot have both as a configuration option which you switch between as required.
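As a sketch of that switchable setup (the interface and class names here are made up for illustration; a protobuf-backed implementation would sit behind the same interface):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Callers depend only on this interface, so the wire format becomes a config choice.
interface Serializer {
    byte[] serialize(Object obj) throws IOException;
    Object deserialize(byte[] data) throws IOException, ClassNotFoundException;
}

// The platform-specific (but zero-effort) option: plain JDK serialization.
class JdkSerializer implements Serializer {
    public byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(obj);
        oos.close();
        return baos.toByteArray();
    }

    public Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data));
        return ois.readObject();
    }
}

Picking the implementation from a system property (say, -Dwire.format=jdk) is then a one-line factory decision, and swapping in a portable format later doesn't touch the call sites.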
If you are choosing between PB and native Java serialization based on speed and efficiency, just go for PB.

PB was designed to achieve exactly these factors. See http://code.google.com/apis/protocolbuffers/docs/overview.html

PB data is very small, while Java serialization tends to replicate the whole object, including its signature. Why do I always get my class name, field names, and so on serialized, even though I know them inside out at the receiver?

Think about cross-language development. It gets hard if one side uses Java and the other side uses C++...

Some developers suggest Thrift, but I would use Google PB because "I believe in Google" :-).. Anyway, it's worth a look:
http://stuartsierra.com/2008/07/10/thrift-vs-protocol-buffers
Here is the off-the-wall suggestion of the day :-) (you just tweaked something in my head that I now want to try)...

If you can go for a whole caching solution this way, it might work: Project Darkstar. It is designed as a very high performance game server, specifically so that reads are fast (which is good for a cache). It has Java and C APIs, so I believe (though it has been a long time since I looked at it, and I wasn't thinking of this then) that you could save objects with Java and read them back in C, and vice versa.
If nothing else it'll give you something to read up on today :-)
For wire-friendly serialisation, consider using the Externalizable interface. Used cleverly, you'll have intimate knowlege to decide how to optimally marshall and unmarshall specific fields. That said, you'll need to manage the versioning of each object correctly - easy to un-marshall, but re-marshalling a V2 object when your code supports V1 will either break, lose information, or worse corrupt data in a way your apps aren't able to correctly process. If you're looking for an optimal path, beware no library will solve your problem without some compromises. Generally libraries will fit most use-cases and will come with the added benefit that they'll adapt and enhance over time without your input, if you've opted for an active open source project. And they might add performance problems, introduce bugs, and even fix bugs that haven't affected you yet!
Related
I need to reconstruct an object on the client side from a byte[] which holds bytes coming from an InputStream (TCP/IP). The server is in C and structs are sent across as bytes. It is from this series of bytes that I have to reconstruct the object.

I can do this by reading chunks of bytes and converting them to the fields of the object I want to reconstruct, but this method is tedious and I was wondering if there is an easy way out?
But this method is tedious and I was wondering if there is an easy way out?
Not that I'm aware of. But if you find yourself writing the same code multiple times, you may well find that if you extract some helper methods it actually becomes pretty simple. Yes, you'll need to call a method to read each field value... but the code should end up being easy to read and understand, and not rely on anything magical.
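For example, a couple of helper methods along these lines (the field sizes, the little-endian byte order and the charset here are assumptions; match them to the actual C struct layout):

import java.io.DataInputStream;
import java.io.IOException;

public class StructReader {
    // C structs on x86 are usually little-endian, but DataInputStream reads
    // big-endian; Integer.reverseBytes flips the byte order after reading.
    static int readLittleEndianInt(DataInputStream in) throws IOException {
        return Integer.reverseBytes(in.readInt());
    }

    // Reads a fixed-length, NUL-padded C string field.
    static String readFixedString(DataInputStream in, int length) throws IOException {
        byte[] buffer = new byte[length];
        in.readFully(buffer);
        int end = 0;
        while (end < length && buffer[end] != 0) {
            end++;
        }
        return new String(buffer, 0, end, "US-ASCII");
    }
}

Reconstructing an object then becomes a sequence of calls like new Person(readLittleEndianInt(in), readFixedString(in, 32)) - where Person and the field order are, of course, whatever your protocol actually defines.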
You could do all of this with reflection, possibly using annotations to specify the order in which fields have been serialized etc. But that's likely to be a lot of code to write - unless you've got a lot of different types to deserialize, it will probably be more code - and more complicated code - than the "dumb but straightforward" approach.
I hope the format of the bytes from the C side of things is well-specified though: if it's basically just dumping the in-memory representation, that can end up being pretty fragile in the face of change.
Take a look at JNA. You'll have to dig around a bit. JNA is designed to map C shared libraries (.DLL, .so, etc.) into Java. But it has various helper classes and methods that can be used to map a C structure in memory to a Java object of similar structure. I am almost 100% certain you could read these structures off the wire, write the bytes into a ByteBuffer (direct or otherwise), and then map a Java object over them.
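A rough sketch of that idea (the struct fields here are invented, and exact JNA API details vary by version):

import com.sun.jna.Memory;
import com.sun.jna.Pointer;
import com.sun.jna.Structure;
import java.util.Arrays;
import java.util.List;

// Mirrors a hypothetical C struct { int32_t id; char name[32]; }.
public class WireRecord extends Structure {
    public int id;
    public byte[] name = new byte[32];

    public WireRecord(Pointer p) {
        super(p);
    }

    protected List<String> getFieldOrder() {
        return Arrays.asList("id", "name"); // must match C declaration order
    }

    public static WireRecord fromBytes(byte[] raw) {
        Memory mem = new Memory(raw.length);
        mem.write(0, raw, 0, raw.length);
        WireRecord record = new WireRecord(mem);
        record.read(); // copy the native memory into the Java fields
        return record;
    }
}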
I am doing an assignment in which I need to measure the time to write and read with object streams and with text streams. I was expecting the object streams to be faster than the text streams, but my results were exactly the opposite in both cases (read and write).

Can someone tell me which is normally faster?
Thanks.
Why did you think that Object streams would be faster? They have high overhead. Many people prefer other serialization mechanisms.
Object streams carry quite a bit of overhead since they need to serialize and deserialize class information. They can be reasonably efficient for large object graphs and arrays where the number of unique classes is small, but are notoriously bad for small messages. Object serialisation also has to do quite a bit of bookkeeping (e.g. to detect cycles in object graphs and to ensure each object is sent only once when there are multiple references to it).
Text streams on the other hand are very simple and carry little overhead. It's not surprising that they are faster in your tests.
Though it does depend a lot on how you encode your data into text: some naive text representations of object graphs would actually be much worse than regular Java object serialisation. Basically, it would be a bad idea to try to reinvent Java object serialisation in text form...
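A quick way to see the overhead for yourself (a sketch; the exact byte counts vary by JDK version) is to compare the bytes produced for the same small value by an object stream and a plain text writer:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class OverheadDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream objectBytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(objectBytes);
        oos.writeObject(Integer.valueOf(42)); // stream header + class descriptor + value
        oos.close();

        ByteArrayOutputStream textBytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(textBytes, "UTF-8");
        writer.write("42"); // just the characters
        writer.close();

        System.out.println("object stream: " + objectBytes.size() + " bytes");
        System.out.println("text stream:   " + textBytes.size() + " bytes");
    }
}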
If you are interested in fast and efficient serialisation of objects, you should also consider:
Advanced object serialization libraries like Google's Protocol Buffers or Kryo (see the Kryo sketch just below this list)
Efficient textual data representation formats like JSON or Clojure s-expressions (both of which have good library support and are proven in the field)
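For instance, a minimal Kryo round trip looks roughly like this (a sketch against the com.esotericsoftware Kryo API; details vary between Kryo versions):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.util.ArrayList;
import java.util.Arrays;

public class KryoDemo {
    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        // Registering classes up front yields smaller output and faster lookups.
        kryo.register(ArrayList.class);

        ArrayList<String> list = new ArrayList<String>(Arrays.asList("one", "two"));

        Output output = new Output(1024, -1); // growable buffer
        kryo.writeObject(output, list);
        byte[] bytes = output.toBytes();

        Input input = new Input(bytes);
        ArrayList<?> back = kryo.readObject(input, ArrayList.class);
        System.out.println(back);
    }
}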
I'm looking for some info on the best approach to serialize a graph of objects based on the following (Java):

- Two objects of the same class must serialize to binary-equal output (compared bit by bit) if their state is equal. (Must not depend on JVM field ordering.)
- Collections are modeled only with arrays (no Collections classes).
- All instances are immutable.
- The serialization format should be byte[]-based rather than text-based.
- I am in control of all the classes in the graph.
- I don't want to put an empty constructor in the classes just to support serialization.

I have looked at implementing a solution based on my own traversal and on Objenesis, but my problem does not seem that unique, so it's better to check for an existing/complete solution first.
Updated details:
First, thanks for your help!
Objects must serialize to exactly the same bits based on the object's state. This is important since the binary content will be digitally signed. Reconstruction of the serialized form will be based on the state of the object, not on the original bits being stored.

Interoperability between different technologies is important. I do see the software running on, for example, .NET in the future. No Java flavour in the serialized format.

A note on the comments about immutability: the values of the arrays are copied from the argument to the inner fields in the constructor. Less important.
You could write the data yourself, using reflection or hand-coded methods. I use methods that look hand-coded, except they are generated. (The performance of hand-coded methods, with the convenience of not having to rewrite the code when it changes.)

Often developers talk about the built-in Java serialization, but you can write custom serialization to do whatever you want, any way you want.

To give you a more detailed answer, it would depend on what you want to do exactly.
BTW: You can serialize your data into a byte[] and still make it human-readable/text-like/editable in a text editor. All you have to do is use a binary format which looks like text. ;)
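As a sketch of the hand-coded route for your signing requirement (the class and field names are invented; the point is that DataOutputStream writes are big-endian and the field order is explicit, so the byte output never depends on JVM field ordering):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public final class SignedRecord {
    private final String name;
    private final long[] values;

    public SignedRecord(String name, long[] values) {
        this.name = name;
        this.values = values.clone();
    }

    // Fields are written in a fixed, explicit order, so two state-equal
    // instances always produce identical bytes - suitable for signing.
    public byte[] toBytes() throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(baos);
        out.writeUTF(name);         // length-prefixed, deterministic encoding
        out.writeInt(values.length);
        for (long v : values) {
            out.writeLong(v);
        }
        out.close();
        return baos.toByteArray();
    }
}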
Maybe you want to familiarize yourself with the serialization frameworks available for Java. A good starting point for that is the thrift-protobuf-compare project, whose name is misleading: it compares the performance of more than 10 ways of serializing data using Java.

It seems that the hardest constraint you have is interoperability between different technologies. I know that Google's Protocol Buffers and Thrift deliver here. Avro might also fit.
The important thing to know about serialization is that it is not guaranteed to be consistent across multiple versions of Java. It's not meant as a way to store data on a disk or anywhere permanent.
It's used internally to send classes from one JVM to another during RMI or some other network protocol. These are the types of applications that you should use serialization for. If this describes your problem - short-term communication between two different JVMs - then you should try to get serialization going.
If you're looking for a way to store the data more permanently or you will need the data to survive in forward versions of Java, then you should find your own solution. Given your requirements, you should create some sort of method of converting each object into a byte stream yourself and reading it back into objects. You will then be responsible for making sure the format is forward compatible with future objects and features.
I highly recommend Chapter 11 of Effective Java by Joshua Bloch.
Is the Externalizable interface what you're looking for? You fully control the way your objects are persisted, and you do it OO-style, with methods that are inherited and all (unlike the private read-/write-Object methods used with Serializable). But still, you cannot get rid of the requirement for an accessible no-arg constructor.
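A minimal sketch of an Externalizable class (the class and field names are invented for illustration):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class Point implements Externalizable {
    private int x;
    private int y;

    // Externalizable requires a public no-arg constructor;
    // deserialization calls it before readExternal().
    public Point() {
    }

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        x = in.readInt();
        y = in.readInt();
    }
}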
The only way you would get this is:

A/ Use UTF-8 text, i.e. XML or JSON, with binary turned into Base64 (the HTTP/XML-safe variety).

B/ Enforce UTF-8 binary ordering of all data.

C/ Pack the contents, removing all unescaped whitespace.

D/ Hash the content and provide that hash in a positionally standard location in the file.
Does anybody know a faster way to do what java.nio.charset.Charset.decode(..)/encode(..) does?

It's currently one of the bottlenecks of a technology that I'm using.
[EDIT]
Specifically, in my application, I changed one segment from a Java solution to a JNI solution (because there was a C++ technology that was more suitable for my needs than the Java technology I was using).

This change brought about a significant decrease in speed (and a significant increase in CPU & memory usage).

Looking deeper into the JNI solution I used: the Java application communicates with the C++ application via byte[]. These byte[]s are produced by Charset.encode(..) on the Java side and passed to the C++ side. When the C++ side responds with a byte[], it gets decoded on the Java side via Charset.decode(..).

Running this against a profiler, I see that Charset.decode(..) and Charset.encode(..) both take a significantly long time compared to the whole execution time of the JNI solution. (I profiled only the JNI solution because it's something I could whip up quite quickly; I'll profile the whole application at a later date once I free up my schedule :-).)

Upon reading further about my problem, it seems this is a known issue with Charset.encode(..) and decode(..), and it's being addressed in Java 7. However, moving to Java 7 is not an option for me (for now) due to some constraints.

Which is why I'm asking here whether somebody knows a Java 5 solution/alternative to this (sorry, I should have mentioned that this was for Java 5 sooner)? :-)
The javadoc for encode() and decode() makes it clear that these are convenience methods. For example, for encode():
Convenience method that encodes Unicode characters into bytes in this charset.

An invocation of this method upon a charset cs returns the same result as the expression

    cs.newEncoder()
      .onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
      .encode(bb);

except that it is potentially more efficient because it can cache encoders between successive invocations.
The language is a bit vague there, but you might get a performance boost by not using these convenience methods. Create and configure the encoder once, and then re-use it:
CharsetEncoder encoder = cs.newEncoder()
.onMalformedInput(CodingErrorAction.REPLACE)
.onUnmappableCharacter(CodingErrorAction.REPLACE);
encoder.encode(...);
encoder.encode(...);
encoder.encode(...);
encoder.encode(...);
It always pays to read the javadoc, even if you think you already know the answer.
First part: it is generally a bad idea to pass arrays into JNI code. Because of GC, Java has to copy arrays. In the worst case the array will be copied twice - on the way into the JNI code and on the way back :)

Because of that, the Buffer class hierarchy was introduced. And of course the Java dev team created a nice way to encode/decode chars:

Charset#newDecoder returns a CharsetDecoder, which can be used to convert a ByteBuffer to a CharBuffer according to a Charset. There are two main method versions:
CoderResult decode(ByteBuffer in, CharBuffer out, boolean endOfInput)
CharBuffer decode(ByteBuffer in)
For maximum performance you need the first one. It has no hidden memory allocations inside.

Note that an encoder/decoder may maintain internal state, so be careful (for example, if you decode from a 2-byte encoding and the input buffer contains only one half of a char...). Also, encoders/decoders are not thread-safe.
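A sketch of the allocation-free decode loop (Java 5-compatible; the buffer size and charset are chosen arbitrarily - note the endOfInput flag and the final flush, which deal with that internal state):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

public class DecodeLoop {
    public static void main(String[] args) throws Exception {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        ByteBuffer in = ByteBuffer.wrap("héllo".getBytes("UTF-8"));
        CharBuffer out = CharBuffer.allocate(8); // deliberately small, reused buffer
        StringBuilder result = new StringBuilder();

        CoderResult cr;
        do {
            cr = decoder.decode(in, out, true); // true = this is all the input
            out.flip();
            result.append(out);
            out.clear();
        } while (cr.isOverflow()); // output buffer filled up: drain it and go again

        decoder.flush(out); // emit any remaining internal state
        out.flip();
        result.append(out);

        System.out.println(result);
    }
}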
There are very few reasons to "squeeze" a string into a byte array. I would recommend writing the C functions to take UTF-16 strings as parameters. That way there is no need for any conversion.
Lately I've been trying to learn more about, and generally test, Java's serialization for both work and personal projects, and I must say that the more I know about it, the less I like it. This may be caused by misinformation though, so that's why I'm asking these two things of you all:
1: On the byte level, how does serialization know how to match serialized values with some class?
One of my problems right here is that I made a small test with an ArrayList containing the values "one", "two", "three". After serialization the byte array took 78 bytes, which seems an awful lot for such a small amount of information (19+3+3+4 bytes). Granted, there's bound to be some overhead, but this leads to my second question:
2: Can serialization be considered a good method for persisting objects at all? Now obviously if I'd use some homemade XML format, the persisted data would be something like this:
<object class="java.util.ArrayList">
    <!-- The object array inside ArrayList is called elementData -->
    <field name="elementData">
        <value>One</value>
        <value>Two</value>
        <value>Three</value>
    </field>
</object>
which, like XML in general, is a bit bloated and takes 138 bytes (without whitespace, that is). The same in JSON could be
{
"java.util.ArrayList": {
"elementData": [
"one",
"two",
"three"
]
}
}
which is 75 bytes so already slightly smaller than Java's serialization. With these text-based formats it's of course obvious that there has to be a way to represent your basic data as text, numbers or any combination of both.
So to recap: how does serialization work on the byte/bit level, when should it be used and when shouldn't it be used, and what are the real benefits of serialization besides the fact that it comes standard in Java?
I would personally try to avoid Java's "built-in" serialization:

- It's not portable to other platforms
- It's not hugely efficient
- It's fragile - getting it to cope with multiple versions of a class is somewhat tricky. Even changing compilers can break serialization unless you're careful.
For details of what the actual bytes mean, see the Java Object Serialization Specification.
There are various alternatives, such as:

- XML and JSON, as you've shown (various XML flavours, of course)
- YAML
- Facebook's Thrift (RPC as well as serialization)
- Google Protocol Buffers
- Hessian (web services as well as serialization)
- Apache Avro
- Your own custom format
(Disclaimer: I work for Google, and I'm doing a port of Protocol Buffers to C# as my 20% project, so clearly I think that's a good bit of technology :)
Cross-platform formats are almost always more restrictive than platform-specific formats for obvious reasons - Protocol Buffers has a pretty limited set of native types, for example - but the interoperability can be incredibly useful. You also need to consider the impact of versioning, with backward and forward compatibility, etc. The text formats are generally hand-editable, but tend to be less efficient in both space and time.
Basically, you need to look at your requirements carefully.
The main advantage of serialization is that it is extremely easy to use, relatively fast, and preserves actual Java object graphs.
But you have to realize that it's not really meant to be used for storing data, but mainly as a way for different JVM instances to communicate over a network using the RMI protocol.
See the Java Object Serialization Stream Protocol for a description of the file format and grammar used for serialized objects.
Personally, I think the built-in serialization is acceptable for persisting short-lived data (e.g. storing the state of a session object between two HTTP requests) which is not relevant outside your application.

For data that has a longer lifetime or should be used outside your application, I'd persist either into a database or at least use a more commonly used format...
How does Java's built-in serialization work?

Whenever we want to serialize an object, we implement the java.io.Serializable interface. The interface does not have any methods to implement; we implement it merely to indicate something to the compiler or JVM (it is known as a marker interface). When the JVM sees that a class is Serializable, it applies its default serialization machinery to that class. A class can customize this by declaring the following two private methods, which the serialization machinery invokes reflectively:
private void writeObject(java.io.ObjectOutputStream stream) throws IOException {
    stream.writeObject(name);    // object property
    stream.writeObject(address); // object property
}

private void readObject(java.io.ObjectInputStream stream)
        throws IOException, ClassNotFoundException {
    name = (String) stream.readObject();    // object property
    address = (String) stream.readObject(); // object property
}
When should it be used instead of some other persistence technique?

Built-in serialization is useful when the sender and receiver are both Java. If you want to avoid the kinds of problems mentioned above, use XML or JSON with the help of frameworks.
I bumped into this dilemma about a month ago (see the question I asked).
The main lesson I learned from it is: use Java serialization only when necessary and if there's no other option. Like Jon said, it has its drawbacks, while other serialization techniques are easier, faster and more portable.
Serializing means that you write the structured data in your classes out as a flat sequence of bytes in order to save it.

You should generally use techniques other than the built-in Java method. It is just made to work out of the box, but if the contents or field order of your serialized classes change in the future, you will get into trouble because you will no longer be able to load them correctly.
The advantage of Java Object Serialization (JOS) is that it just works. There are also tools out there that do the same as JOS, but use an XML format instead of a binary format.
About the length: JOS writes some class information at the start, instead of as part of each instance - e.g. the full field names are recorded once, and an index into that list of names is used for instances of the class. This makes the output longer if you write only one instance of the class, but more efficient if you write several (different) instances of it. It's not clear to me whether your example actually uses a class, but this is the general reason why JOS output is longer than one would expect.
BTW: this is incidental, but I don't think JSON records class names (as you have in your example), and so it might not do what you need.
The reason the serialized form of a tiny amount of information is relatively large is that it stores information about the classes of the objects it is serialising. If you store a duplicate of your list, you'll see that the file hasn't grown by much. Store the same object twice and the difference is tiny.

The important pros are: relatively easy to use, quite fast, and can evolve (just like XML). However, the data is rather opaque, it is Java-only, it tightly couples data to classes, and untrusted data can easily cause DoS. You should think about the serialised form, rather than just slapping implements Serializable everywhere.
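You can see both effects with a few lines (a sketch; the exact byte counts vary by JDK version):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SizeDemo {
    public static void main(String[] args) throws IOException {
        List<String> list = new ArrayList<String>(Arrays.asList("one", "two", "three"));
        System.out.println("one copy:   " + sizeOf(list) + " bytes");
        // Writing the same object again only writes a back-reference handle.
        System.out.println("two copies: " + sizeOf(list, list) + " bytes");
    }

    private static int sizeOf(Object... objects) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        for (Object o : objects) {
            oos.writeObject(o);
        }
        oos.close();
        return baos.size();
    }
}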
If you don't have too much data, you can save objects into a java.util.Properties object. An example of a key/value pair would be user_1234_firstname = Peter. Using reflection to save and load objects can make things easier.
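A minimal sketch of that idea (the key-naming scheme and file path are invented for illustration):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class PropertiesStore {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("user_1234_firstname", "Peter");
        props.setProperty("user_1234_lastname", "Parker");

        // store() writes a plain-text, human-editable file.
        FileOutputStream out = new FileOutputStream("users.properties");
        props.store(out, "user cache");
        out.close();

        Properties loaded = new Properties();
        FileInputStream in = new FileInputStream("users.properties");
        loaded.load(in);
        in.close();
        System.out.println(loaded.getProperty("user_1234_firstname"));
    }
}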