I'm considering porting an application to db4o. The data model consists of lots of small objects with many references between them. For example, a book points to an author and to its chapters; chapters have sections, and sections hold large blobs of text and images and reference the characters mentioned in them.
I think it should be possible to keep the meta structure in memory (everything except the text blobs), but I was wondering whether I could use some clever trick involving WeakReference so that db4o would keep only the part of the model in memory that I really need (i.e. the part I've been using recently).
The same goes for the text blobs (which should be around 1-10 KB each). Is it possible to get such a String without having to worry about the DB layer, without querying for the blob by an artificial ID inside the getter, and without holding a hard reference that keeps the whole text in memory all the time?
Turning off WeakReferences is mostly used for performance tuning. The downsides to this approach are not negligible - so be careful. I would not recommend it.
Controlling memory usage should be done using the activation features. Activation can help you keep only part of your model in memory, and weak references will help the GC reclaim objects that are no longer used. I think that's the way to go.
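For illustration, here is a minimal sketch of depth-based activation, written against the db4o 8.x embedded API; Book and Chapter are placeholders standing in for the question's model, and the file name is made up:

import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;
import com.db4o.config.EmbeddedConfiguration;

// Placeholder domain classes standing in for the question's model.
class Book { String title; java.util.List<Chapter> chapters; }
class Chapter { String text; }

public class ActivationExample {
    public static void main(String[] args) {
        EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
        // Only pull objects one level deep by default; referenced chapters,
        // sections etc. stay as unactivated stubs until you ask for them.
        config.common().activationDepth(1);
        // Per-class tuning: make sure a Book's immediate fields are always usable.
        config.common().objectClass(Book.class).minimumActivationDepth(1);

        ObjectContainer db = Db4oEmbedded.openFile(config, "library.db4o");
        try {
            ObjectSet<Book> books = db.query(Book.class);
            for (Book book : books) {
                db.activate(book, 2);    // pull book -> chapters into memory on demand
                // ... work with the book ...
                db.deactivate(book, 2);  // release the graph so the GC can reclaim it
            }
        } finally {
            db.close();
        }
    }
}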
Also - you can post your questions to the db4o forums to get help from the db4o community.
Goran
I've not used db4o, or any ORM/OODB product, recently; however, it strikes me that this kind of memory-management and graph-management feature should be part of the framework itself rather than something you build on top of it. If Versant's db4o doesn't offer this, it might be worth looking into another product that does. So, I realise this is not the answer you're looking for, but leveraging the framework would be my first port of call.
When I first started using the Java Preferences API, the one glaring omission from the API was a putObject() method. I've always wondered why they did not include it.
So, I did some googling and I found this article from IBM which shows you how to do it: http://www.ibm.com/developerworks/library/j-prefapi/
The method they're using seems a bit hackish to me, because you have to break the Object up into byte matrices, store them, and reassemble them later.
My question is, has anyone tried this approach? Can you testify that it is a good way to store/retrieve objects?
I'm also curious why the Java devs left putObject() out of the API. Does anyone have valuable insight?
I'm also curious why the Java devs left putObject() out of the API.
Does anyone have valuable insight?
From: http://docs.oracle.com/javase/7/docs/technotes/guides/preferences/designfaq.html
Why doesn't this API contain methods to read and write arbitrary serializable objects?
Serialized objects are somewhat fragile: if the version of the program that reads such a property differs from the version that wrote it, the object may not deserialize properly (or at all). It is not impossible to store serialized objects using this API, but we do not encourage it, and have not provided a convenience method.
The article describes a reliable way to do it. There are a couple of things I might do differently (for instance, I would store the count of the pieces as well as the pieces themselves, so that reassembling the object on retrieval is straightforward).
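For what it's worth, here is a rough sketch of that idea (not the article's exact code): serialize the object, split the bytes into chunks small enough for putByteArray(), and store the chunk count alongside the chunks. The key suffixes are my own convention.

import java.io.*;
import java.util.Arrays;
import java.util.prefs.Preferences;

public final class PrefObjects {
    // putByteArray() values are Base64-encoded internally, so the effective
    // limit per entry is three quarters of MAX_VALUE_LENGTH.
    private static final int PIECE_LENGTH = (Preferences.MAX_VALUE_LENGTH * 3) / 4;

    public static void putObject(Preferences prefs, String key, Serializable value)
            throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] raw = bos.toByteArray();
        int pieces = (raw.length + PIECE_LENGTH - 1) / PIECE_LENGTH;
        prefs.putInt(key + ".count", pieces);   // store the piece count, as suggested above
        for (int i = 0; i < pieces; i++) {
            int from = i * PIECE_LENGTH;
            int to = Math.min(raw.length, from + PIECE_LENGTH);
            prefs.putByteArray(key + "." + i, Arrays.copyOfRange(raw, from, to));
        }
    }

    public static Object getObject(Preferences prefs, String key)
            throws IOException, ClassNotFoundException {
        int pieces = prefs.getInt(key + ".count", 0);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        for (int i = 0; i < pieces; i++) {
            bos.write(prefs.getByteArray(key + "." + i, new byte[0]));
        }
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return ois.readObject();
        }
    }
}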
Your comment about serialization is wrong, though: the object you want to store has to be Serializable; that's how the ObjectOutputStream the article uses does its job.
So, yes, it looks like a reliable mechanism. You do need Serializable objects, and I imagine putObject and getObject were left out of the API for two reasons:
storing arbitrary objects is not something the native Windows registry (one of the backing stores) supports
it risks people putting huge amounts of data in the registry.
Storing serialized objects in the registry strikes me as being somewhat concerning because they can be so big. I would only use it for occasions when there is no way to reconstruct the Object from constructors, and the serialized version is relatively small.
I have written a math game in Java and have distributed copies to a few beta-testers. The problem is that the version I gave them saves the GameData via object serialization, which I have since found out is mainly for sending objects (in this case, ArrayLists of GameData) over a network. It is NOT persistence; that is what a relational database is for. Knowing this, would it be better to create a database on each beta-tester's machine (and rewrite the game), or to continue with the object-serialization version of the game and retrieve the objects when they are ready to send the data?
My guess would be to just move their data to a database that is created on their computer, and then give them the database version of the game. That way, the data can be persisted and be much easier to manipulate. What turns me away from that idea is the question of how am I going to write their database into mine (in the future)?
Although relatively rare, there are still lots of applications that use serialization for storage and retrieval of objects. It's not wrong to do this, just slightly unusual. If it's working for you, stick with it because DB's are a heavyweight solution. What you found out, about serialization, is only an opinion and an ill-formed one at that.
In terms of using an embedded database, two options to consider are SQLite and HyperSQL. However, serialization is also an option, and in my opinion it should be your default option if you've already implemented it. Some considerations:
With serialization you've generally got to retrieve the entire object, which is slow if you've got an object with several dozen fields and you only want to read one of them. If you're making queries like these, then use a database. I suspect that you're just reading in all of your serialized objects at startup and serializing them back out to disk at shutdown, in which case there's no reason to use a database instead of serialization (a minimal sketch of that pattern follows this list).
Java's default serialization mechanism is fairly slow. You may want to consider another serialization mechanism, such as Kryo or Jackson, but only if you're not happy with your program's serialization performance.
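For reference, a minimal sketch of the load-everything-at-startup / save-everything-at-shutdown pattern mentioned above, using plain Java serialization; the class and file names are up to you, and the stored objects must implement Serializable:

import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class SaveFile {
    // Read the whole list at startup; returns an empty list if no save exists yet.
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> List<T> load(File file)
            throws IOException, ClassNotFoundException {
        if (!file.exists()) {
            return new ArrayList<>();
        }
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (List<T>) in.readObject();
        }
    }

    // Write the whole list back out at shutdown.
    public static <T extends Serializable> void save(File file, List<T> data)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(new ArrayList<>(data));
        }
    }
}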
It is difficult to advise on the best choice of technology without knowing what you are persisting and why.
If the state is simply a snapshot of your game state (i.e. a save file) or a "best scores" table, then you don't need a database. Serializing using JSON, XML or ... Java Object serialization is sufficient.
If the state needs to be read or updated incrementally or shared with other applications ... or users on other machines ... then a database is more appropriate.
Serialization mechanisms are problematic if the requirements include incremental changes, etcetera. You end up building a database-like layer over the top of the serialization.
As to whether you should stick with Java serialization ... or switch to JSON or XML or something like that:
Object serialization is simple, but it can be fragile if you change the classes that you are serializing. This fragility can be mitigated, but doing so is messy and you lose the simplicity: you need to write custom readObject and writeObject methods that know how to read "old versions" of the serialized objects. (A sketch of this follows the list.)
JSON and XML are a bit more complicated, but still relatively simple if you use an object binding mechanism.
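As an illustration of the mitigation mentioned for Java serialization (the field names here are invented for the example): keep serialVersionUID fixed across releases, and patch up fields that older streams don't contain inside readObject.

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

public class GameData implements Serializable {
    // Keep this fixed across releases so old save files still deserialize.
    private static final long serialVersionUID = 1L;

    private String playerName;
    private int score;
    // Field added in a later version of the class.
    private int difficulty;

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        // Streams written by the old version leave 'difficulty' at 0;
        // map that onto a sensible default here.
        if (difficulty == 0) {
            difficulty = 1;
        }
    }
}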
It is worth noting that changes to the persisted object classes (or the database schemas) are potentially problematic no matter what you do. There is no easy universal solution to this problem.
UPDATE
Given the additional information that you provided in your first comment (below), it seems like you don't need a database in the game itself. All you need is something that can read and analyse the session-state save files that your beta testers provide for you. Indeed, it doesn't even seem like the actual app needs to be able to read the files. (But that's unclear, because you've not said what the real purpose of these files is ... or at least, not what the entire purpose is.)
It is also worth noting that you are probably saving the wrong information if your aim is to tune the sets of questions. What you really need to record, for each individual question, is how long the user took and whether they got the answer right or wrong. You probably also need to know what answer was actually given, so that you can spot cases where the user's answer was actually right and you "marked" it as wrong ... or vice versa.
"What turns me away from that idea is the question of how am I going to write their database into mine (in the future)?"
Exactly. If you hadn't prematurely "analysed" the data, you wouldn't have this problem.
But ignoring that, it seems like a simple state-saving mechanism is sufficient to meet your (still hypothetical / inferred) requirement of keeping a personal score board for the end user. Your "tuning" stuff would be better implemented using a custom log file. I cannot see any value in incorporating a database as part of the app itself.
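For example, a custom log of per-question results could be as simple as the sketch below; all names are illustrative, and a real version would want to escape commas in the answers:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class QuestionLog {
    private final PrintWriter out;

    public QuestionLog(String path) throws IOException {
        // Append so one file accumulates results across sessions.
        this.out = new PrintWriter(new FileWriter(path, true), true);
    }

    // One line per question: timestamp, question id, what the user answered,
    // whether it was marked correct, and how long they took.
    public void record(String questionId, String givenAnswer,
                       boolean markedCorrect, long millisTaken) {
        out.printf("%d,%s,%s,%b,%d%n",
                System.currentTimeMillis(), questionId, givenAnswer,
                markedCorrect, millisTaken);
    }
}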
I presume you are doing Java serialisation; if so, there is nothing wrong with it. Just be aware of its limitations - different versions of Java might not be able to read the file.
Also, if you change the class, previously saved data cannot be retrieved.
If you decide to change, you could look at XML, JSON, Protocol Buffers, Thrift, Avro etc., as well as a DB.
Note:
XML support is built into Java (see the sketch below)
Java DB (Derby) also ships with Java
Other serialisation schemes require a separate library.
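As an example of the built-in XML option, java.beans.XMLEncoder/XMLDecoder (in the JDK since 1.4) can persist JavaBean-style objects, i.e. classes with a public no-arg constructor and getters/setters, without any extra library:

import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.*;

public class XmlSave {
    // Writes the bean graph out as XML; the object must follow JavaBean conventions.
    public static void save(File file, Object bean) throws IOException {
        try (XMLEncoder enc =
                 new XMLEncoder(new BufferedOutputStream(new FileOutputStream(file)))) {
            enc.writeObject(bean);
        }
    }

    // Reads the bean graph back from the XML file.
    public static Object load(File file) throws IOException {
        try (XMLDecoder dec =
                 new XMLDecoder(new BufferedInputStream(new FileInputStream(file)))) {
            return dec.readObject();
        }
    }
}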
Is there a way to iterate over a Collection and retrieve only a subset of attributes, without loading/unloading each full object to cache? It seems like a waste to load/unload the WHOLE (possibly big) object when I need only some attribute(s), especially if the objects are big. Wouldn't loading all that unneeded data cause unnecessary cache conflicts?
By 'load to cache' I mean 'process' the object via the processor. Say there are objects with, for example, 10 attributes, and in the iterating loop I only use one of them. In such a scenario, I think it's a waste to load the other 9 attributes from memory into the processor. Isn't there a way to extract only the attributes I need without loading the full object?
Also, does something like Google's Guava solve the problem internally?
THANK YOU!
It's not usually the first place to look, but it's certainly not impossible that you're running into cache-sharing problems. If you're really convinced (from realistic profiling or analysis of hardware counters) that this is a bottleneck worth addressing, you might consider altering your data structures to use parallel arrays of primitives (akin to column-based storage in some database architectures), e.g. one 'column' as a float[], another as a short[], a third as a String[], all indexed by the same identifier. This structure allows you to 'query' individual columns without loading into cache any columns that aren't currently needed.
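A minimal sketch of what such a parallel-array ('column') layout might look like; the class and field names are invented for the example:

// One logical "record" per index i, spread across several primitive arrays,
// instead of a List of objects with ten fields each.
public class ParticleColumns {
    private final float[] x;
    private final float[] y;
    private final short[] type;
    private final String[] label;

    public ParticleColumns(int size) {
        this.x = new float[size];
        this.y = new float[size];
        this.type = new short[size];
        this.label = new String[size];
    }

    // Scanning a single column touches only that array, so the other
    // "fields" never have to be pulled into the CPU cache.
    public float sumX() {
        float total = 0f;
        for (float value : x) {
            total += value;
        }
        return total;
    }
}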
I have some low-level algorithmic code that would really benefit from C's struct. I ran some microbenchmarks on various alternatives and found that parallel arrays was the most effective option for my algorithms (that may or may not apply to your own).
Note that a parallel-array structure will be considerably more complex to maintain and mutate than using Objects in java.util collections. So I'll reiterate - I'd only take this approach after you've convinced yourself that the benefit will be worth the pain.
There is no way in Java to manage loading to processor caches, and there is no way to change how the JVM works with objects, so the answer is no.
Java is not a low-level language and hides such details from the programmer.
The JVM will decide how much of the object it loads. It might load the whole object as some kind of read-ahead optimization, or load only the fields you actually access, or analyze the code during JIT compilation and do a combination of both.
Also, how large do you expect your objects to be? I have rarely seen classes with more than a few fields, so I would not consider that big.
We are trying to cache the results of database selects (in a HashMap) so we don't have to execute them multiple times, and whenever the database changes we pick up the changes in the app through a "refresh list" function we added.
We now have a large number of lists to fetch, so loading the pick lists from the database is taking too much time.
So I have some questions regarding this issue:
How can I find out how much memory a list is using? (I have used the approach of running the garbage collector and taking the difference in free memory, but there are many lists, so it takes too much time.)
How can I optimize the refresh of the lists?
Thanks for the help.
How can I find out how much memory the list is using?
JProfiler
VisualVM
How can I optimize the refresh of the lists?
Make sure you're using the correct collection type for your data.
Have a look here.
Also have a look at the Guava collections.
One last thing: ignis is quite right in advising you not to use System.gc() - this might be the very reason you're having performance problems. This is why.
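If you do pick up Guava, its caching API (CacheBuilder/LoadingCache, available in newer Guava releases) covers a lot of this for you: bounded size, expiry, load-on-miss and per-key refresh. A sketch, with the database call left as a placeholder:

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class PickListCache {
    private final LoadingCache<String, List<String>> lists = CacheBuilder.newBuilder()
            .maximumSize(200)                          // bound memory use
            .expireAfterWrite(30, TimeUnit.MINUTES)    // stale entries reload lazily
            .build(new CacheLoader<String, List<String>>() {
                @Override
                public List<String> load(String listName) throws Exception {
                    return fetchFromDatabase(listName); // only runs on a cache miss
                }
            });

    public List<String> get(String listName) {
        return lists.getUnchecked(listName);
    }

    public void refresh(String listName) {
        lists.refresh(listName);   // reload one list instead of everything
    }

    private List<String> fetchFromDatabase(String listName) {
        // Run the SELECT for this pick list here (placeholder).
        throw new UnsupportedOperationException("wire up your JDBC/ORM call");
    }
}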
First, while not wanting to generalize when it comes to performance problems, the issues you're seeing are unlikely to be purely down to memory use, though if the lists are large this could come into play when they're refreshed and a large number of objects become eligible for collection.
To solve issues relating to garbage collection there are a few rules of thumb, but it always comes down to breaking out a profiler and tuning the garbage collector - there's more on that here.
But before that: any loading from a database is going to involve iterating over a result set, so the biggest optimization you can make is to reduce the size of the result sets. There are a couple of ways to do that:
if you're using a map, try to use keys that don't require loading the underlying data, and do the load when you get a miss.
once loaded, only refresh the rows that have changed since you last loaded the data, though this obviously doesn't solve the start-up problem (a sketch of this follows below).
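A rough sketch of that incremental refresh, assuming the table carries a last-modified timestamp column; the table and column names are made up, and deleted rows would need separate handling:

import java.sql.*;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IncrementalPickListLoader {
    private final Map<Integer, String> rows = new ConcurrentHashMap<>();
    private Timestamp lastRefresh = new Timestamp(0);

    // Only pull rows changed since the previous refresh instead of the whole table.
    public void refresh(Connection conn) throws SQLException {
        String sql = "SELECT id, label, last_modified FROM pick_list WHERE last_modified > ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastRefresh);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.put(rs.getInt("id"), rs.getString("label"));
                    Timestamp modified = rs.getTimestamp("last_modified");
                    if (modified.after(lastRefresh)) {
                        lastRefresh = modified;   // remember the newest change we've seen
                    }
                }
            }
        }
    }
}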
Now all that said, I would not recommend you write your own caching code in the first place. The reasons I say this are:
all modern RDBMSs cache, so provided your queries are performant, getting the actual result set should not be a bottleneck.
Hibernate provides not only ORM but a robust and well understood caching solution.
if you really need to cache massive datasets, use Coherence or similar - the cache can be started in a separate JVM and your application doesn't need to take the load hit.
You have two problems here: discovering how much memory is in use, and managing a cache. I'm not sure that the two are really closely related, although they may be.
Discovering how much memory an object uses isn't extremely difficult: one excellent article you can use for reference is "Sizeof for Java" from JavaWorld. It escapes the whole garbage collection fiasco, which has a ton of holes in it (it's slow, it doesn't count the object but the heap - meaning that other objects factor into your results that you may not want, etc.)
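As an alternative to the article's techniques, the JDK's own java.lang.instrument.Instrumentation.getObjectSize() gives a shallow size per object; the catch is that you have to package the class below as a -javaagent jar with a Premain-Class entry in its manifest:

import java.lang.instrument.Instrumentation;

// Packaged as a -javaagent; the JVM calls premain() and hands us an
// Instrumentation instance we can use for shallow size measurements.
public class SizeOfAgent {
    private static volatile Instrumentation instrumentation;

    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size of one object, not the whole reachable graph.
    public static long sizeOf(Object obj) {
        if (instrumentation == null) {
            throw new IllegalStateException("Run with -javaagent:sizeof-agent.jar");
        }
        return instrumentation.getObjectSize(obj);
    }
}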
Managing the time to initialize the cache is another problem. I work for a company that has a data grid as a product, and thus I'm biased; be aware.
One option is not using a cache at all, but using a data grid. I work for GigaSpaces Technologies, and I feel ours is the best; we can load data from a database on startup, and hold your data there as a distributed, transactional data store in memory (so your greatest cost is network access.) We have a community edition as well as full-featured platforms, depending on your need and budget. (The community edition is free.) We support various protocols, including JDBC, JPA, JMS, Memcached, a Map API (similar to JCache), and a native API.
Other similar options include Coherence, which is itself a data grid, and Terracotta DSO, which can distribute an object graph on a JVM heap.
You can also look at the cache projects themselves: Two include Ehcache and OSCache. (Again: bias. I was one of the people who started OpenSymphony, so I've a soft spot for OSCache.) In your case, what would happen is not a preload of cache - note that I don't know your application, so I'm guessing and might be wrong - but a cache on demand. When you acquire data, you'd check the cache for data first and fetch from the DB only if the data is not in cache, and load the cache on read.
Of course, you can also look at memcached, although I obviously prefer my employer's offering here.
Be aware that invoking System.gc() or Runtime.getRuntime().gc() is a bad idea unless you really need to do it. You should leave to the VM the task of deciding when to free objects, unless profiling shows that it's the only way to make the application go faster on your client's VM.
I tend to use YourKit for this sort of thing. It costs money but IMO is worth every penny (no connection other than as a customer).
Most people use some kind of IoC framework - Guice, Spring, you name it. Many of us need to scale our applications too, so we complicate our lives with Terracotta and Glassfish/JBoss/insert-your-favourite-here clusters.
But is it really the way to go? Are you using any of the above?
Here are some ideas we have currently implemented in a yet-to-be-open-sourced framework, and I'd like to see what you think of them - or maybe you'll say "it's a complete ripoff of XY!".
cluster-wide object replication - give an object a name, and whenever you do something to it (on any node) the change gets replicated, with different guarantees available
transparent soft load-balancing - simplest scenario: a RESTful web-service method call proxied to another node
view-only node injection - inject a proxy to a "named" object, and get your calls automatically proxied to a node
Would you use something like that? Is there a current, stable, enterprise-ready implementation out there?
FWIW I work at a company that does very large scale web applications and we tend not to use this form of object caching.
In fact we tend to make our lives easier by not storing anything in the session and not caching anything that is transactional and needs to be read in a current state. This way your application is quite simple, easy to reason about, and very easy to scale horizontally.
I guess the rationale behind using these object caches is primarily to reduce load on your persistence store and possibly to reduce latency. My suggestion is to work at scaling this backend independently from the relatively dumb webapp. Most large sites do this through the use of read replicas and data sharding. Have a look here: http://highscalability.com/livejournal-architecture. I remember looking at this a long time ago and it was pretty interesting. It also fits reasonably well into the kind of architecture that I see being used in high traffic websites.