We're looking for a high performance compact serialization solution for Java objects on GAE.
Native Java serialization doesn't perform all that well, and it's terrible at compatibility, i.e. it can't deserialize an old object if a field has been added to or removed from the class.
We tried Kryo, which performs well in other environments and supports backward compatibility when fields are added, but unfortunately the GAE SecurityManager slows it down terribly by adding a check to every method call in the recursion. I'm concerned that might be the issue with all serialization libraries.
Any ideas please? Thanks!
Beware, premature optimisation is the root of all evil.
You should first try one of the standard solutions and then decide if it fits your performance requirements. I did test several serialization solutions on GAE (java serialisation, JSON, JSON+ZIP) and they were an order of magnitude faster than datastore access.
So if serialising data takes 10ms and writing it to datastore takes 100ms, there is very little added benefit in trying to optimise the 10ms.
Btw, did you try Jackson?
Also, all API calls on GAE are implemented as RPC calls to other servers, where payload is serialised as protobuf.
Do you need cross-language ability? And regarding high performance, are you referring to speed only, or also to optimized memory management for less GC, or to serialized object size?
If you need cross-language support, I think Google's protobuf is a solution. However, it can hardly be called "high performance", because the UTF-8 strings created on the Java side cause constant GCs.
If the data you are supporting is mostly simple objects and you don't need composition, I would recommend writing your own serialization layer (not kidding):
Use an enum to index your fields so you can serialize only the fields that contain values (a sketch follows below).
Create maps for primitive types using trove4j collections.
Use cached ByteBuffer objects if you can predict the size of most of your objects to be under a certain value.
Use a string dictionary to reduce string object re-creation, and use a cached StringBuilder during deserialization.
That's what we did for our "high-performance" Java serialization layer. Essentially we could achieve almost object-less serialization/deserialization with reasonably good timing.
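To make the enum-indexing idea concrete, here is a rough sketch (the PriceUpdate class, its fields and the wire format are purely illustrative, not our actual code): each field gets a stable ordinal, optional fields are only written when they hold a value, and everything goes through a reusable ByteBuffer.

import java.nio.ByteBuffer;

public class PriceUpdate {

    // Ordinals double as wire-format field tags.
    enum Field { SYMBOL_ID, BID, ASK }

    int symbolId;
    long bid;
    long ask;
    boolean hasAsk; // example of an optional field that is only written when present

    // The caller supplies a cached/reused buffer to avoid per-call allocation.
    void writeTo(ByteBuffer buf) {
        buf.put((byte) Field.SYMBOL_ID.ordinal()).putInt(symbolId);
        buf.put((byte) Field.BID.ordinal()).putLong(bid);
        if (hasAsk) {
            buf.put((byte) Field.ASK.ordinal()).putLong(ask);
        }
    }

    void readFrom(ByteBuffer buf) {
        while (buf.hasRemaining()) {
            switch (Field.values()[buf.get()]) {
                case SYMBOL_ID: symbolId = buf.getInt(); break;
                case BID:       bid = buf.getLong();     break;
                case ASK:       ask = buf.getLong(); hasAsk = true; break;
            }
        }
    }
}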
I was wondering how the serialization of MicroStream works in detail.
Since it is described as "Super-Fast", it has to rely on code generation, right? Or is it based on reflection?
How would it perform in comparison to Protobuf serialization, which relies on generated code that reads directly from the Java fields and writes them into a ByteBuffer, and vice versa?
Using reflection would drastically decrease the performance when serializing objects on a huge scale, wouldn't it?
I'm looking for a fast way to transmit and persist objects for a multiplayer-game and every millisecond counts. :)
Thanks in advance!
PS: Since I don't have enough reputation, I can not create the "microstream"-tag. https://microstream.one/
I am the lead developer of MicroStream.
(This is not an alias account. I really just created it. I've been reading StackOverflow for 10 years or so but never had a reason to create an account. Until now.)
On every initialization, MicroStream analyzes the current runtime's versions of all required entity and value type classes and derives optimized metadata from them.
The same is done when encountering a class at runtime that was unknown so far.
The analysis is done via reflection, but since it is only done once for every handled class, the reflection performance cost is negligible.
The actual storing and loading or serialization and deserialization is done via optimized framework code based on the created metadata.
If a class layout changes, the type analysis creates a mapping from the field layout that the class' instances are stored in to that of the current class.
Automatically if possible (unambiguous changes or via some configurable heuristics), otherwise via a user-provided mapping. Performance stays the same, since the JVM does not care whether it (roughly speaking) copies a loaded value #3 to position #3 or to position #5. It's all in the metadata.
ByteBuffers are used, more precisely direct ByteBuffers, but only as an anchor for off-heap memory to work on via direct "Unsafe" low-level operations. If you are not familiar with "Unsafe" operations, a short and simple notion is: "It's as direct and fast as C++ code.". You can do anything you want very fast and close to memory, but you are also responsible for everything. For more details, google "sun.misc.Unsafe".
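For readers unfamiliar with it, here is a rough stand-alone sketch (not MicroStream code) of what such "Unsafe" off-heap access looks like:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeapDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the Unsafe singleton via reflection (it is not meant to be public API).
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory(8); // 8 bytes of raw off-heap memory
        try {
            unsafe.putLong(address, 42L);        // direct write, no bounds checks
            System.out.println(unsafe.getLong(address));
        } finally {
            unsafe.freeMemory(address);          // nothing frees this for you
        }
    }
}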
No code is generated. No byte code hacking, tacit replacement of instances by proxies or similar monkey business is used. On the technical level, it's just a Java library (including "Unsafe" usage), but with a lot of properly devised logic.
As a side note: reflection is not as slow as it is commonly considered to be. Not any more. It was, but it has been optimized considerably in past Java versions.
It's only slow if every operation has to do all the class analysis, field lookups, etc. anew (which an awful lot of frameworks seem to do because they are just badly written). If the fields are collected (set accessible, etc.) once and then cached, reflection is actually surprisingly fast.
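A simplified sketch of that caching idea (this is illustrative, not MicroStream's actual metadata code): the reflective analysis runs once per class, and every later access only pays for Field.get().

import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FieldCache {

    private static final Map<Class<?>, Field[]> CACHE = new ConcurrentHashMap<>();

    // Class analysis and setAccessible() happen once per class, not per instance.
    static Field[] fieldsOf(Class<?> type) {
        return CACHE.computeIfAbsent(type, t -> {
            Field[] fields = t.getDeclaredFields();
            for (Field field : fields) {
                field.setAccessible(true);
            }
            return fields;
        });
    }

    // Per-instance work is just the cheap Field.get() calls.
    static void dump(Object instance) throws IllegalAccessException {
        for (Field field : fieldsOf(instance.getClass())) {
            System.out.println(field.getName() + " = " + field.get(instance));
        }
    }
}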
Regarding the comparison to Protobuf-Serialization:
I can't say anything specific about it since I haven't used Protocol Buffers and I don't know how it works internally.
As usual with complex technologies, a truly meaningful comparison might be pretty difficult to do since different technologies have different optimization priorities and limitations.
Most serialization approaches give up referential consistency but only store "data" (i.e. if two objects reference a third, deserialization will create TWO instances of that third object.
Like this: A->C<-B ==serialization==> A->C1 B->C2.
This basically breaks/ruins/destroys object graphs and makes serialization of cyclic graphs impossible, since it creates an endlessly cascading replication. See JSON serialization, for example. Funny stuff.)
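A toy illustration of the A->C<-B example (not any particular library's code): a naive per-reference copy produces two C instances, while tracking already-handled objects by identity preserves the shared reference.

import java.util.IdentityHashMap;
import java.util.Map;

public class SharedReferenceDemo {

    static class C { }
    static class Holder { C c; }

    // Naive handling: every reference is processed independently, so the shared C is duplicated.
    static C naiveCopy(C c) {
        return new C();
    }

    // Identity-aware handling: the same instance always maps to the same result.
    static C trackedCopy(C c, Map<C, C> seen) {
        return seen.computeIfAbsent(c, k -> new C());
    }

    public static void main(String[] args) {
        C shared = new C();
        Holder a = new Holder();
        Holder b = new Holder();
        a.c = shared;
        b.c = shared;

        System.out.println(naiveCopy(a.c) == naiveCopy(b.c)); // false: A and B get C1 and C2

        Map<C, C> seen = new IdentityHashMap<>();
        System.out.println(trackedCopy(a.c, seen) == trackedCopy(b.c, seen)); // true: one shared C
    }
}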
Even Brian Goetz' draft for a Java "Serialization 2.0" includes that limitation (see "Limitations" at http://cr.openjdk.java.net/~briangoetz/amber/serialization.html) (and another one which breaks the separation of concerns).
MicroStream does not have that limitation. It handles arbitrary object graphs properly without ruining their references.
Keeping referential consistency intact is by far not "trying to do too much", as he writes. It is "doing it properly". One just has to know how to do it properly. And it even is rather trivial if done correctly.
So, depending on how many limitations Protobuf-Serialization has ("pacts with the devil"), it might be hardly or even not at all comparable to MicroStream in general.
Of course, you can always create some performance comparison tests for your particular requirements and see which technology suits you best. Just make sure you are aware of the limitations a certain technology imposes on you (ruined referential consistency, forbidden types, required annotations, required default constructor / getters / setters, etc.).
MicroStream has none*.
(*) within reason: Serializing/storing system-internals (e.g. Thread) or non-entities (like lambdas or proxy instances) is, while technically possible, intentionally excluded.
Hi, I am serializing an object using one VM (Oracle HotSpot, Java SE) and deserializing it with the Android VM (Dalvik). Will there be any problem?
Assuming that by "serialization" you mean Serializable, then yes. Serialization is not guaranteed to be the same across distinct VMs. Please use something else (e.g., XML, JSON).
UPDATE
Your first comment is so flawed that I cannot fit my response in 500 characters.
Of course yes. Without implementing Serializable we cannot serialize.
Talented programmers can. Talented programmers can serialize data to XML, JSON, Protocol Buffers, Thrift, ASN.1, YAML, and any number of other formats.
What I am actually doing is writing an object onto the network using ObjectOutputStream (Oracle HotSpot) and reading that object on Android using ObjectInputStream.
Talented programmers use platform-independent serialization approaches, such as any of the ones I listed above. That is because talented programmers realize that, in the future, there may be need to have clients or servers that are not based in Java.
So you mean to say that as of now this is fine, but in the future it is not guaranteed?
No. I wrote:
Serialization is not guaranteed to be the same across distinct VMs.
An object serialized using one VM (e.g., Oracle) should be able to be de-serialized using that VM. There is no guarantee that an object serialized with one VM can be de-serialized using another VM. In fact, developers have gotten in trouble trying to do precisely what you are trying to do. This is another example of why talented programmers use platform-independent serialization structures.
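As a minimal sketch of what that looks like (Gson is used here purely as one example of a platform-independent JSON library; the Message class is hypothetical):

import com.google.gson.Gson;

public class Message {
    String sender;
    long timestamp;

    public static void main(String[] args) {
        Message m = new Message();
        m.sender = "server";
        m.timestamp = System.currentTimeMillis();

        Gson gson = new Gson();
        String json = gson.toJson(m);                       // write this string over the network
        Message copy = gson.fromJson(json, Message.class);  // the same call works on Dalvik/ART
        System.out.println(json + " -> " + copy.sender);
    }
}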
I am working on a project where Java's native serialization is slow, so we want to move to implementing Externalize interface on the classes for superior performance.
However, these classes have lots of data members, and we have realized it's easy to make mistakes while writing these two methods. We are just reading/writing all of the members of the class in these functions, nothing fancy. Is there some way of generating the readExternal()/writeExternal() methods for Externalizable automatically, in an offline process or at compile time?
I had a look at http://projectlombok.org/, and something like that would have been ideal.
Similarly, we would like to keep these classes immutable, but immutable classes cannot usefully implement the Externalizable interface (it requires a public no-arg constructor and mutable fields), so we want to use the serialization proxy pattern from Effective Java; having that generated would be useful too.
I am working on a project where Java's native serialization is slow
How slow? Why? Making it faster with lots of hand coding is most unlikely to be either economically feasible or maintainable in the long run. Serialization overheads should really come down to time and space bounds in transmission. There's no particular reason why Java's default serialization should be startlingly slower than the result of all the hand coding you are planning. You would be better off investigating causes. You might find, for example, that a well-placed BufferedOutputStream would solve all your problems.
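As a sketch of what a "well-placed BufferedOutputStream" means here (the buffer size and the payload type are illustrative):

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.Serializable;

public class BufferedWriteDemo {
    // Buffering avoids hitting the underlying socket/file with many tiny writes.
    static void write(Serializable payload, OutputStream rawOut) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new BufferedOutputStream(rawOut, 64 * 1024))) {
            out.writeObject(payload);
        } // try-with-resources flushes and closes the stream chain
    }
}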
Regarding Project Lombok, that feature request has been rejected.
I'd consider leveraging alternative frameworks:
SBE - for financial transactions;
kryo project - for Java compatibility (a minimal usage sketch follows at the end of this answer);
FlatBuffers - for lazy zero copy parsing and ability to skip nested structures;
Protobuf - a bit more compact than FlatBuffers, but missing random-access parsing of nested regions (which FlatBuffers supports, e.g. on memory-mapped files).
Java serialisation is very inefficient in terms of throughput, size, portability, and schema migration.
MapStruct - can be a good option to map something mutable into something immutable, if required, with minimal custom code (and IDE support).
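As a rough example of the Kryo usage mentioned in the list above (a sketch assuming a Kryo 5.x-style API and a hypothetical Trade class):

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class KryoDemo {
    static class Trade { long id; double price; }

    public static void main(String[] args) {
        Kryo kryo = new Kryo();
        kryo.register(Trade.class); // registration keeps the output compact

        Trade t = new Trade();
        t.id = 1L;
        t.price = 99.5;

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Output output = new Output(bytes)) {
            kryo.writeObject(output, t);
        }
        try (Input input = new Input(new ByteArrayInputStream(bytes.toByteArray()))) {
            Trade copy = kryo.readObject(input, Trade.class);
            System.out.println(copy.id + " @ " + copy.price);
        }
    }
}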
I am working in a highly distributed environment. A lot of network access and a lot of db access.
I have some classes that are sent over and over the network, and are serialized and deserialized.
Most of the classes are quite simple in their nature, like :
class A implements Serializable {
    long a;
    long b;
}
And some are more complex (Compound - Collections).
There are some people in the company I work for who claim that all the classes should implement Externalizable rather than Serializable, and that this would have a major impact on the performance of the application.
The impact on performance is very difficult to measure, though, since the application is so big and so distributed and not fully ready; I can't really simulate a full load right now.
So maybe some of you know some interesting article that would reveal anything to me. Or maybe you can share some thoughts.
My basic intuition was that it would not make any difference serializing and deserializing simple classes (like the one above) over the network/DB, let's say when the I/O portion of the whole app is around 10%. (I mean 90% of the time the system is doing other stuff than I/O.)
My basic intuition was that it would not make any difference serializing and deserializing simple classes (like the one above) over the network/DB, let's say when the I/O portion of the whole app is around 10%. (I mean 90% of the time the system is doing other stuff than I/O.)
Your intuition sounds reasonable. But what exactly is taking 10% of the time? Is it just the serialization / deserialization? Or does the 10% include the real (clock) time to do the I/O?
EDIT
If you have actual profiling measurements to back up your "10% to 15%" clock time doing serialization + deserialization + I/O, then logic tells you that the maximum performance improvement you can get will be less than that. If you can separate the I/O from the serialization / deserialization, you can refine that upper bound. My guess is that the actual improvement will be less than 5%.
I suggest that you create a small benchmark to send and receive one of your data types using serialization and externalization and see what percentage difference it actually makes.
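A minimal sketch of such a benchmark (the Point class here is made up, and for trustworthy numbers a harness such as JMH is preferable): serialize the object in a loop, record time and size, then repeat with an Externalizable variant of the same class and compare.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializationBench {

    static class Point implements Serializable {
        long a, b;
    }

    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Point p = new Point();
        int iterations = 100_000;
        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            totalBytes += serialize(p).length;
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(iterations + " serializations in " + elapsedMs + " ms, "
                + (totalBytes / iterations) + " bytes each");
    }
}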
It must be said that there is a (relatively) significant overhead in generic serialization versus optimally implemented externalization. A lot of this is due to the general properties of serialization.
There is the overhead of marshaling / unmarshaling the type descriptors for each class used in the object being transmitted.
There is the overhead of adding each marshaled object to a hash table so that the serialization faithfully records cycles, etc.
However, serialization / deserialization is only a small part of the total I/O overheads, and these are only a small part of your application.
This is a pretty good website comparing many different Java serialization mechanisms.
http://github.com/eishay/jvm-serializers/wiki
I would ask them to come up with some measurements to support their claims. Then everybody will have a basis for a rational discussion. At present you don't have one. Note that it is those with the claims that should produce the supporting evidence: don't get sucked into being responsible for proving them wrong.
Java serialization is flexible and standard, but it's not designed to be fast, especially for simple objects. If you want speed I suggest you try Hessian or protobuf. These can be 5x faster for simple objects. Or you can write a custom serializer, which can be as much as 10x faster.
For us, custom serialization is the way to go. We let Java do what it does well, or at least good enough, for free, and provide custom support for what it does badly. This is a lot less code than full Externalizable support.
I can't imagine in what case custom serialization couldn't be done but using Externalizable could (referring to Roman's comment on Peter L's answer). Specifically, I'm referring to, e.g., implementation of writeObject/readObject.
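For reference, a hedged sketch of that writeObject/readObject style (the Quote class and its fields are hypothetical): the class stays Serializable, but the expensive fields are marked transient and encoded by hand.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class Quote implements Serializable {

    private transient long instrumentId; // transient: we write these ourselves
    private transient double price;

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();         // still handles any non-transient fields
        out.writeLong(instrumentId);
        out.writeDouble(price);
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        instrumentId = in.readLong();
        price = in.readDouble();
    }
}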
Generic serialization can be fast
see http://java-is-the-new-c.blogspot.de/2013/10/still-using-externalizable-to-get.html
It depends on your concrete system whether serialization performance is significant. I have seen systems which gained a lot of performance by speeding up serialization. It's not only about CPU but also about latencies. E.g. if a distributed system does a lot of blocking request/responses (requestor waiting for result), serialization adds to the overall request/response time, which can be significant, as it involves (1) encode request, (2) decode request, (3) encode response, (4) decode response. So four (de-)serializations happen per request/response.
I have a certain POJO which needs to be persisted on a database, current design specifies its field as a single string column, and adding additional fields to the table is not an option.
Meaning, the objects need to be serialized in some way. So just for the basic implementation I went and designed my own serialized form of the object, which meant concatenating all its fields into one nice string, separated by a delimiter I chose. But this is rather ugly, and can cause problems, say if one of the fields contains my delimiter.
So I tried basic Java serialization, but from a basic test I conducted, this somehow becomes a very costly operation (building a ByteArrayOutputStream, an ObjectOutputStream, and so on, same for the deserialization).
So what are my options? What is the preferred way for serializing objects to go on a database?
Edit: this is going to be a very common operation in my project, so overhead must be kept to a minimum, and performance is crucial. Also, third-party solutions are nice, but irrelevant (and usually generate overhead which I am trying to avoid)
Elliot Rusty Harold wrote up a nice argument against using Java Object serialization for the objects in his XOM library. The same principles apply to you. The built-in Java serialization is Java-specific, fragile, and slow, and so is best avoided.
You have roughly the right idea in using a String-based format. The problem, as you state, is that you're running into formatting/syntax problems with delimiters. The solution is to use a format that is already built to handle this. If this is a standardized format, then you can also potentially use other libraries/languages to manipulate it. Also, a string-based format means that you have a hope of understanding it just by eyeballing the data; binary formats remove that option.
XML and JSON are two great options here; they're standardized, text-based, flexible, readable, and have lots of library support. They'll also perform surprisingly well (sometimes even faster than Java serialization).
You might try Protocol Buffers; it is an open-source project from Google and is said to be fast (it generates a shorter serialized form than XML, and works faster). It also handles the addition of new fields gracefully (it inserts default values).
You need to consider versioning in your solution. Data incompatibility is a problem you will experience with any solution that involves the use of a binary serialization of the Object. How do you load an older row of data into a newer version of the object?
So, the solutions above which involve serializing to name/value pairs are the approach you probably want to use.
One solution is to include a version number as one of the field values. As fields are added, modified, or removed, the version can be incremented.
When deserializing the data, you can have different deserialization handlers for each version which can be used to convert data from one version to another.
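A small sketch of that idea (the field names, delimiter and version numbers are purely illustrative): the first token of the stored string is a format version, and decoding dispatches per version.

public class VersionedCodec {

    static final int CURRENT_VERSION = 2;

    static String encode(String foo, String bar) {
        return CURRENT_VERSION + "|" + foo + "|" + bar;
    }

    static String[] decode(String column) {
        String[] parts = column.split("\\|", -1);
        int version = Integer.parseInt(parts[0]);
        switch (version) {
            case 1: // version 1 had no "bar" field; supply its default
                return new String[] { parts[1], "defaultBar" };
            case 2:
                return new String[] { parts[1], parts[2] };
            default:
                throw new IllegalArgumentException("Unknown version " + version);
        }
    }
}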
XStream or YAML or OGNL come to mind as easy serialization techniques. XML has been the most common, but OGNL provides the most flexibility with the least amount of metadata.
Consider putting the data in a Properties object and using its load()/store() serialization. That's a text-based technique, so it's still readable in the database:
public String getFieldsAsString() throws IOException {
    Properties data = new Properties();
    data.setProperty( "foo", this.getFoo() );
    data.setProperty( "bar", this.getBar() );
    // ...
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    data.store( out, "" );
    return new String( out.toByteArray(), "ISO-8859-1" ); // store() always uses this encoding
}
To load from the string, do the reverse using a new Properties object and load() the data.
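A rough counterpart sketch (the setter names mirror the hypothetical getters above; it needs java.io.ByteArrayInputStream and java.io.IOException imports):

public void setFieldsFromString( String column ) throws IOException {
    Properties data = new Properties();
    data.load( new ByteArrayInputStream( column.getBytes( "ISO-8859-1" ) ) );
    this.setFoo( data.getProperty( "foo" ) );
    this.setBar( data.getProperty( "bar" ) );
    // ...
}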
This is better than Java serialization because it's very readable and compact.
If you need support for different data types (i.e. not just String), use BeanUtils to convert each field to and from a string representation.
I'd say your initial approach is not all that bad if your POJO consists of Strings and primitive types. You could enforce escaping of the delimiter to prevent corruption. Also, if you use Hibernate, you can encapsulate the serialization in a custom type.
If you do not mind another dependency, Hessian is supposedly a more efficient way of serializing Java objects.
How about the standard JavaBeans persistence mechanism:
java.beans.XMLEncoder
java.beans.XMLDecoder
These are able to persist Java POJOs to XML and recreate them from it. From memory, it looks (something) like...
<object class="java.util.HashMap">
  <void method="put">
    <string>Hello</string>
    <float>1</float>
  </void>
</object>
You may have to provide PersistenceDelegate classes so that it knows how to persist user-defined classes that don't follow JavaBean conventions. Assuming you don't remove any public methods, it is resilient to schema changes.
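For example, a minimal round trip of the HashMap shown above might look like this (a sketch; the surrounding I/O handling is up to you):

import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

public class XmlPersistenceDemo {
    public static void main(String[] args) {
        HashMap<String, Float> map = new HashMap<>();
        map.put("Hello", 1f);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(bytes)) {
            encoder.writeObject(map); // produces the <object .../> XML shown above
        }
        String xml = new String(bytes.toByteArray(), StandardCharsets.UTF_8); // could go in the column

        try (XMLDecoder decoder = new XMLDecoder(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))) {
            @SuppressWarnings("unchecked")
            HashMap<String, Float> copy = (HashMap<String, Float>) decoder.readObject();
            System.out.println(copy);
        }
    }
}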
You can optimize the serialization by externalizing your object. That will give you complete control over how it is serialized and improve the performance of the process. This is simple to do as long as your POJO is simple (i.e. doesn't have references to other objects); otherwise you can easily break serialization.
tutorial here
EDIT: Not implying that this is the preferred approach, but you are very limited in your options if it is performance critical and you can only use a string column in the table.
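A hedged sketch of what externalizing a simple POJO looks like (the class and field names are illustrative): note that Externalizable requires a public no-arg constructor, and the fields must be read back in exactly the order they were written.

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

public class SimplePojo implements Externalizable {

    private String foo;
    private long bar;

    public SimplePojo() { } // required by Externalizable

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(foo);
        out.writeLong(bar);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        foo = in.readUTF();
        bar = in.readLong();
    }
}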
If you are using a delimiter you could use a character which you know would never occur in your text such as \0, or special symbols http://unicode.org/charts/symbols.html
However, the time spent sending the data to the database and persisting it is likely to be much larger than the cost of serialization. So I would suggest starting with something simple and easy to read (like XStream), then looking at where your application is spending most of its time and optimising that.
I have a certain POJO which needs to be persisted on a database, current design specifies its field as a single string column, and adding additional fields to the table is not an option.
Could you create a new table and put a foreign key into that column!?!? :)
I suspect not, but let's cover all the bases!
Serialization:
We've recently had this discussion in the context of resurrecting our application in the same state as before if it crashes. We essentially dispatch a persistence event onto a queue, which then grabs the object, locks it, and serializes it. This seems pretty quick. How much data are you serializing? Can you make any variables transient (i.e. cached variables)? Can you consider splitting up your serialization?
Beware: what happens if your objects change (locking) or your classes change (different serialization ID)? You'll need to upgrade everything that's serialized to the latest classes. Perhaps you only need to store this overnight, so it doesn't matter?
XML:
You could use something like XStream to achieve this. Building something custom is doable (a nice interview question!), but I'd probably not do it myself. Why bother? Remember the cases where you have cyclic links or references to the same object more than once; rebuilding the objects isn't quite so trivial.
Database storage:
If you're using Oracle 10g to store blobs, upgrade to the latest version, since c/blob performance is massively increased. If we're talking large amounts of data, then perhaps zip the output stream?
Is this a realtime app, or will there be a second or two of pause where you can safely persist the actual object? If you've got time, then you could clone it and then persist the clone on another thread. What's the persistence for? Is it critical that it's done inside a transaction?
Consider changing your schema. Even if you find a quick way to serialize a POJO to a string, how do you handle different versions? How do you migrate the database from X->Y? Or worse, from A->D? I am seeing issues where we stored a serialized object into a BLOB field and have to migrate a customer across multiple versions.
Have you looked into JAXB? It is a mechanism by which you can define a suite of java objects that are created from an XML Schema. It allows you to marshal from an object hierarchy to XML or unmarshal the XML back into an object hierarchy.
I'll second the suggestion to use JAXB, or possibly XStream (the former is faster, the latter has more focus on the object serialization part).
Plus, I'll further suggest a decent JSON-based alternative, Jackson (http://jackson.codehaus.org/Tutorial), which can fully serialize/deserialize beans to JSON text to store in the column.
Oh, and I absolutely agree that you should not use Java binary serialization under any circumstances for long-term data storage. The same goes for Protocol Buffers; both are too fragile for this purpose (they are better suited for data transfer between tightly coupled systems).
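As a sketch of the Jackson approach mentioned above (shown with the current com.fasterxml Jackson 2.x package; the tutorial linked above covers the older org.codehaus.jackson 1.x API, and the Customer bean is hypothetical):

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonColumnDemo {

    public static class Customer { // a hypothetical bean with getters/setters
        private String name;
        private int age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        Customer c = new Customer();
        c.setName("Alice");
        c.setAge(42);

        String column = mapper.writeValueAsString(c); // store this string in the DB column
        Customer restored = mapper.readValue(column, Customer.class);
        System.out.println(column + " -> " + restored.getName());
    }
}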
You might try Preon. Preon aims to be to binary encoded data what Hibernate is to relational databases and JAXB to XML.