Java: Serialization performance for deep copy?

I need to deep-copy Java objects for one of my projects; and I was wondering if it's okay to use Serialization for this.
The objects are fairly small (< 1kb each), total #objects per copy would be < 500, and the copy routine will be invoked a few times a day. In this case, will it be okay to use Serialization for this in production, or is that still a really bad idea given the performance of Serialization?
If it's still a bad idea, I can think of using a copy constructor or a static copy method for each such class to improve performance. Are there any other ways to do it?

Maybe. Performance will not be an issue - dependencies will be.
The usual problem with serialization is that an innocent reference to some core class in your application can give you a copy of 90% of all the live instances, because they suddenly become reachable. So you must be really careful with transient fields and the references you use.
Worse, if you need serialization for deep copy and real state saving, you might end up with incompatible goals:
Deep copy needs to be fast
Deep copy can handle open resources (database connections)
State saving needs to handle API evolution (the state is stored on disk, it could be restored with a new version of the code).
State saving can benefit from using a readable form (so a human could fix mistakes)
Therefore, it's often better to use copy constructors than a "clever hack", even though the hack might save you some (or a lot of) time right now.
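For illustration, here is a minimal sketch of both approaches (the Point class and the helper names are made up for the example):

    import java.io.*;

    // Hypothetical entity used to illustrate both approaches.
    class Point implements Serializable {
        final int x, y;

        Point(int x, int y) { this.x = x; this.y = y; }

        // Copy constructor: explicit and free of serialization's
        // reachability surprises -- you copy exactly what you list here.
        Point(Point other) { this(other.x, other.y); }
    }

    final class DeepCopy {
        // Serialization round-trip: copies everything reachable from obj,
        // which is exactly the "90% of all live instances" trap above.
        @SuppressWarnings("unchecked")
        static <T extends Serializable> T viaSerialization(T obj)
                throws IOException, ClassNotFoundException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);
            }
            try (ObjectInputStream ois = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                return (T) ois.readObject();
            }
        }
    }

At your scale (< 500 objects of < 1 kB each, a few times a day) neither variant is a performance problem; the difference is maintainability.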

Related

How does MicroStream (de)serialization work?

I was wondering how the serialization of MicroStream works in detail.
Since it is described as "Super-Fast", it has to rely on code generation, right? Or is it based on reflection?
How would it perform in comparison to Protobuf serialization, which relies on code generation that reads directly from the Java fields and writes them into a ByteBuffer, and vice versa?
Using reflection would drastically decrease the performance when serializing objects on a huge scale, wouldn't it?
I'm looking for a fast way to transmit and persist objects for a multiplayer game, and every millisecond counts. :)
Thanks in advance!
PS: Since I don't have enough reputation, I can not create the "microstream"-tag. https://microstream.one/
I am the lead developer of MicroStream.
(This is not an alias account. I really just created it. I'm reading on StackOverflow for 10 years or so but never had a reason to create an account. Until now.)
On every initialization, MicroStream analyzes the current runtime's versions of all required entity and value type classes and derives optimized metadata from them.
The same is done when encountering a class at runtime that was unknown so far.
The analysis is done via reflection, but since it is only done once for every handled class, the reflection performance cost is negligible.
The actual storing and loading or serialization and deserialization is done via optimized framework code based on the created metadata.
If a class layout changes, the type analysis creates a mapping from the field layout that the class' instances are stored in to that of the current class.
Automatically if possible (unambiguous changes or via some configurable heuristics), otherwise via a user-provided mapping. Performance stays the same, since the JVM does not care whether it (simply speaking) copies a loaded value #3 to position #3 or to position #5. It's all in the metadata.
ByteBuffers are used, more precisely direct ByteBuffers, but only as an anchor for off-heap memory to work on via direct "Unsafe" low-level operations. If you are not familiar with "Unsafe" operations, a short and simple notion is: it's as direct and fast as C++ code. You can do anything you want very fast and close to memory, but you are also responsible for everything. For more details, google "sun.misc.Unsafe".
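MicroStream's internal Unsafe code is not shown here, but the general off-heap idea, stripped of the Unsafe part, looks roughly like this (a sketch, not MicroStream code):

    import java.nio.ByteBuffer;

    public class OffHeapSketch {
        public static void main(String[] args) {
            // A direct ByteBuffer lives outside the Java heap; the GC only
            // tracks the small buffer object that anchors the native memory.
            ByteBuffer buf = ByteBuffer.allocateDirect(64);

            buf.putLong(0, 42L);          // absolute write at byte offset 0
            long value = buf.getLong(0);  // absolute read

            System.out.println(value);    // 42
            // sun.misc.Unsafe performs the same kind of reads/writes on raw
            // addresses, without bounds checks -- faster, but at your own risk.
        }
    }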
No code is generated. No byte code hacking, tacit replacement of instances by proxies or similar monkey business is used. On the technical level, it's just a Java library (including "Unsafe" usage), but with a lot of properly devised logic.
As a side note: reflection is not as slow as it is commonly considered to be. Not any more. It was, but it has been optimized considerably in past Java versions.
It's only slow if every operation has to do all the class analysis, field lookups, etc. anew (which an awful lot of frameworks seem to do because they are just badly written). If the fields are collected (set accessible, etc.) once and then cached, reflection is actually surprisingly fast.
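The difference boils down to something like this (a sketch; the cache design is hypothetical, not how any particular framework does it):

    import java.lang.reflect.Field;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class FieldCache {
        // One-time analysis per class; afterwards, reflective reads are cheap.
        private static final Map<Class<?>, Field[]> CACHE = new ConcurrentHashMap<>();

        static Field[] fieldsOf(Class<?> type) {
            return CACHE.computeIfAbsent(type, t -> {
                Field[] fields = t.getDeclaredFields();
                for (Field f : fields) {
                    f.setAccessible(true); // pay the access check once, not per read
                }
                return fields;
            });
        }
    }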
Regarding the comparison to Protobuf-Serialization:
I can't say anything specific about it since I haven't used Protocol Buffers and I don't know how it works internally.
As usual with complex technologies, a truly meaningful comparison might be pretty difficult to do since different technologies have different optimization priorities and limitations.
Most serialization approaches give up referential consistency and only store "data" (i.e. if two objects reference a third one, deserialization will create TWO instances of that third object.
Like this: A->C<-B ==serialization==> A->C1 B->C2.
This basically breaks/ruins/destroys object graphs and makes serialization of cyclic graphs impossible, since it creates an endlessly cascading replication. See JSON serialization, for example. Funny stuff.)
Even Brian Goetz' draft for a Java "Serialization 2.0" includes that limitation (see "Limitations" at http://cr.openjdk.java.net/~briangoetz/amber/serialization.html) (and another one which breaks the separation of concerns).
MicroStream does not have that limitation. It handles arbitrary object graphs properly without ruining their references.
Keeping referential consistency intact is by far not "trying to do too much", as he writes. It is "doing it properly". One just has to know how to do it properly. And it even is rather trivial if done correctly.
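Trivial in the sense that an identity map does the heavy lifting: every instance gets an ID on first encounter, so A->C<-B stores C once and references it twice. A sketch of that bookkeeping (the general technique, not MicroStream's actual internals):

    import java.util.IdentityHashMap;
    import java.util.Map;

    final class ReferenceRegistry {
        // IdentityHashMap compares by ==, not equals(), which is exactly
        // what referential consistency requires.
        private final Map<Object, Integer> ids = new IdentityHashMap<>();
        private int nextId;

        /** Returns the existing ID, or -1 if the object still has to be written. */
        int idFor(Object o) {
            Integer id = ids.get(o);
            if (id != null) return id; // second sighting: write a reference, not a copy
            ids.put(o, nextId++);
            return -1;
        }
    }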
So, depending on how many limitations Protobuf-Serialization has ("pacts with the devil"), it might be hardly or even not at all comparable to MicroStream in general.
Of course, you can always create some performance comparison tests for your particular requirements and see which technology suits you best. Just make sure you are aware of the limitations a certain technology imposes on you (ruined referential consistency, forbidden types, required annotations, required default constructor / getters / setters, etc.).
MicroStream has none*.
(*) within reason: Serializing/storing system-internals (e.g. Thread) or non-entities (like lambdas or proxy instances) is, while technically possible, intentionally excluded.

Saving all results in Scala REPL

Is there an easy way to save all the values of variables in scala REPL?
There is the :save command in Scala, but it just saves the history of commands, so the next time everything needs to be recalculated from scratch.
I know that I can manually serialize/de-serialize everything I'm interested in, but there are two main difficulties (also applicable to java):
It is hard to manually write serialize/de-serialize code for every defined (Serializable) variable, and it is not extensible for later use.
It is only possible to save Serializable objects. I know that saving (hibernating) an arbitrary object may result in problems (especially for objects working with external resources), but whether there would be a problem or not depends on the state of the program. Sometimes the programmer is sure that in the current situation there would be no problem saving the variables. I think there should be a way for the programmer to take the responsibility of saving everything, even objects not explicitly declared Serializable.
I appreciate answers solving any of these problems.
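For reference, the manual approach the question wants to automate looks like this in plain Java (a sketch; the same calls work from the Scala REPL, since it runs on the JVM):

    import java.io.*;

    final class Snapshot {
        // Works only for Serializable values -- the second limitation above.
        static void save(String path, Serializable value) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(path))) {
                out.writeObject(value);
            }
        }

        static Object load(String path) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(new FileInputStream(path))) {
                return in.readObject();
            }
        }
    }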

A better-performing way of retrieving select attributes from Collections of large objects in Java

Is there a method where I can iterate a Collection and only retrieve a subset of attributes, without loading/unloading each full object to cache? 'Cos it seems like a waste to load/unload the WHOLE (possibly big) object when I need only some attribute(s), especially if the objects are big. It might cause unnecessary cache conflicts when loading such unnecessary data, right?
By 'load to cache' I mean 'process that object via the processor'. Say there are objects with 10 attributes, and in the iterating loop I only use one of them. In such a scenario, I think it's a waste to load all the other 9 attributes from memory into the processor. Isn't there a solution to only extract the attributes without loading the full object?
Also, does something like Google's Guava solve the problem internally?
THANK YOU!
It's not usually the first place to look, but it's certainly not impossible that you're running into cache sharing problems. If you're really convinced (from realistic profiling or analysis of hardware counters) that this is a bottleneck worth addressing, you might consider altering your data structures to use parallel arrays of primitives (akin to column-based storage in some DB architectures), e.g. one 'column' as a float[], another as a short[], a third as a String[], all indexed by the same identifier. This structure allows you to 'query' individual columns without loading into cache any columns that aren't currently needed.
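A minimal sketch of that layout (the entity and field names are invented for the example):

    // Column-oriented layout: scanning one attribute touches one tightly
    // packed array instead of pulling whole objects through the CPU cache.
    final class Persons {
        final float[] salary;
        final short[] age;
        final String[] name;  // all columns indexed by the same id

        Persons(int capacity) {
            salary = new float[capacity];
            age = new short[capacity];
            name = new String[capacity];
        }

        // Reads only the salary column; age and name stay untouched in memory.
        double averageSalary(int size) {
            double sum = 0;
            for (int i = 0; i < size; i++) sum += salary[i];
            return size == 0 ? 0 : sum / size;
        }
    }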
I have some low-level algorithmic code that would really benefit from C's struct. I ran some microbenchmarks on various alternatives and found that parallel arrays was the most effective option for my algorithms (that may or may not apply to your own).
Note that a parallel-array structure will be considerably more complex to maintain and mutate than using Objects in java.util collections. So I'll reiterate - I'd only take this approach after you've convinced yourself that the benefit will be worth the pain.
There is no way in Java to manage loading to processor caches, and there is no way to change how the JVM works with objects, so the answer is no.
Java is not a low-level language and hides such details from the programmer.
The JVM will decide how much of the object it loads. It might load the whole object as some kind of read-ahead optimization, or load only the fields you actually access, or analyze the code during JIT compilation and do a combination of both.
Also, how large are these objects you're worried about? I have rarely seen classes with more than a few fields, so I would not consider that big.

Simulating Destructors in Clojure

Problem Statement
I have two machines, A and B, both running Clojure.
B has some in memory data structure.
A holds an object A_P which is a reference/pointer to some object B_O in B's memory.
Now, as long as A_P is NOT GC-ed by A, I do not want B_O GC-ed by B.
However, once A_P has been GC-ed by A (and nothing else in A refers to B_O, and nothing else in B refers to B_O), then I want B_O to be eligible to be GC-ed.
Solution in Languages with Destructors
In C++, this is easy -- I use destructors. When A_P gets GC-ed, A sends B a message to decrement the number of external references to B_O; when that number is 0 and the number of internal references to B_O is also 0, B_O gets GC-ed.
Solution in Java/Clojure?
Now, I know that Java does not have destructors. However, I'm wondering if Clojure has a way around this problem.
Thanks!
No good solution exists, without a real distributed garbage collector. Even in C++, you cannot do this safely, because you implemented reference counting and pretended it was a real garbage collector; but if two objects point to each other across the machine divide, and are both unreferenced locally, they still both have a nonzero reference count and cannot be collected.
No, Clojure (based on the JVM/CLR) doesn't have "C++ style destructors", because of the automatic memory management model of the JVM. There are things like finalizers, but it is recommended not to use them. Instead, you should model your solution on a message passing mechanism rather than machine A holding a "pointer/reference" to data in B. I know this answer is very high level, because you haven't provided any specific problem details in your question. If you need more details about how to solve a particular problem, please provide the complete context and I am sure someone will be able to help you.
This is an inherently difficult problem: distributed garbage collection is really hard, if not impossible, to get right.
However you might just be able to make it work using Java finalisers and overriding the finalize() method. You can then implement a messaging technique similar to the one you describe for C++.
This will have issues in the more general case (it won't help you with circular references across machines as amalloy points out) and there are some other quirks to be aware of (mostly around your lack of control over exactly when the finaliser gets called) but you might be able to get it to work in your specific situation.
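A sketch of that technique (the messaging call is left abstract and is hypothetical; note that on Java 9+ java.lang.ref.Cleaner is the recommended replacement for finalize()):

    // A local handle on machine A that tells machine B to decrement the
    // external reference count when the handle becomes unreachable.
    class RemoteHandle {
        private final long remoteObjectId;

        RemoteHandle(long remoteObjectId) { this.remoteObjectId = remoteObjectId; }

        @Override
        protected void finalize() throws Throwable {
            try {
                sendDecrementRefCount(remoteObjectId); // hypothetical messaging call
            } finally {
                super.finalize();
            }
        }

        private static void sendDecrementRefCount(long id) {
            // ... send "external refs to B_O minus one" to machine B ...
        }
    }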
Assuming you're using a data structure like a ref or atom to hold structure A somewhere inside it, you can use listeners to monitor the state of that structure for removals of A, and those listeners can send the appropriate message to B. clojure.data/diff could be really useful for finding the structures that were removed.
The other option is to have the function responsible for removing the A structure send the message immediately after it does so. As part of this, though, make sure that that code was actually responsible for the removal of A, and not some other update.

How can I know how much memory my cached objects are using?

We are trying to cache the results of database selects (in a hash map), so we wouldn't have to execute them multiple times. Whenever the database changes, we refresh the lists so the app picks up the changes.
Now we have a large number of lists to fetch, so it is taking too much time to load the pick lists from the database.
So I have some question regarding this issue:
How can I find out how much memory a list is using? (I have used the method of forcing the garbage collector to run and taking the difference in free memory, but there are many lists, so it takes too much time.)
How can I optimize the list refresh?
Thanks for the help.
How can I find out how much memory the list is using?
JProfiler
VisualVM
How can I optimize the list refresh?
Make sure you're using the correct collection type for your data.
Have a look here.
Also have a look at the Guava collections.
One last thing, ignis is very right by advising you not to use System.gc() this might be the very reason you're having performance problems. This is why.
First, while not wanting to generalize when it comes to performance problems, the issues you're seeing are unlikely to be purely down to memory use, though if the lists are large this could come into play when they're refreshed and a large number of objects become eligible for collection.
To solve issues relating to garbage collection there are a few rules of thumb, but it always comes down to breaking out a profiler and tuning the garbage collector - there's more on that here.
But before that, any load from a database is going to involve iterating over a result set, so the biggest optimization you can make will be to reduce the size of the result sets. There are a couple of ways to do that:
if you're using a map, try to use keys that don't require loading, and do the load when you get a miss (see the sketch after this list).
once loaded, only refresh the rows that have changed since you last loaded the data, though this obviously doesn't solve the start-up problem.
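The first point, as a sketch in Java (the class and method names are invented; the idea is load-on-miss instead of up-front loading):

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class PickListCache {
        private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

        // Nothing is loaded at start-up; each list is fetched from the
        // database on its first miss and reused afterwards.
        List<String> get(String listName) {
            return cache.computeIfAbsent(listName, this::loadFromDatabase);
        }

        private List<String> loadFromDatabase(String listName) {
            // ... SELECT just this one pick list ...
            return Collections.emptyList(); // placeholder
        }
    }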
Now all that said, I would not recommend you write your own caching code in the first place. The reasons I say this are:
all modern RDBMSs cache, so provided your queries are performant, getting the actual result set should not be a bottleneck.
Hibernate provides not only ORM but a robust and well understood caching solution.
if you really need to cache massive datasets, use Coherence or similar - the cache can be started in a separate JVM and your application doesn't need to take the load hit.
You have two problems here: discovering how much memory is in use, and managing a cache. I'm not sure that the two are really closely related, although they may be.
Discovering how much memory an object uses isn't extremely difficult: one excellent article you can use for reference is "Sizeof for Java" from JavaWorld. It escapes the whole garbage collection fiasco, which has a ton of holes in it (it's slow, it doesn't count the object but the heap - meaning that other objects factor into your results that you may not want, etc.)
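The JDK also has a built-in (shallow) variant via java.lang.instrument, if you can run with an agent; a sketch:

    import java.lang.instrument.Instrumentation;

    // Registered via the Premain-Class manifest attribute and started with
    // java -javaagent:sizer.jar ... (shallow size only; a deep size means
    // walking the object graph yourself and summing as you go).
    public class ObjectSizer {
        private static volatile Instrumentation inst;

        public static void premain(String agentArgs, Instrumentation i) {
            inst = i;
        }

        public static long shallowSizeOf(Object o) {
            return inst.getObjectSize(o);
        }
    }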
Managing the time to initialize the cache is another problem. I work for a company that has a data grid as a product, and thus I'm biased; be aware.
One option is not using a cache at all, but using a data grid. I work for GigaSpaces Technologies, and I feel ours is the best; we can load data from a database on startup, and hold your data there as a distributed, transactional data store in memory (so your greatest cost is network access.) We have a community edition as well as full-featured platforms, depending on your need and budget. (The community edition is free.) We support various protocols, including JDBC, JPA, JMS, Memcached, a Map API (similar to JCache), and a native API.
Other similar options include Coherence, which is itself a data grid, and Terracotta DSO, which can distribute an object graph on a JVM heap.
You can also look at the cache projects themselves: Two include Ehcache and OSCache. (Again: bias. I was one of the people who started OpenSymphony, so I've a soft spot for OSCache.) In your case, what would happen is not a preload of cache - note that I don't know your application, so I'm guessing and might be wrong - but a cache on demand. When you acquire data, you'd check the cache for data first and fetch from the DB only if the data is not in cache, and load the cache on read.
Of course, you can also look at memcached, although I obviously prefer my employer's offering here.
Be aware that invoking
System.gc()
or
Runtime.getRuntime().gc()
is a bad idea unless you really need to do that. You should leave the VM the task of deciding when to free objects, unless after profiling you found that it's the only way to make the application go faster on your client's VM.
I tend to use YourKit for this sort of thing. It costs money but IMO is worth every penny (no connection other than as a customer).
