db4o experiences? - java

I'm currently trying out db4o (the Java version) and I pretty much like what I see. But I cannot help wondering how it performs in a real, live (web) environment. Does anyone have any experiences (good or bad) to share about running db4o?

We run the DB4O .NET version in a large client/server project.
Our experience is that you can potentially get much better performance than with typical relational databases.
However, you really have to tweak your objects to get this kind of performance. For example, if you've got a list containing a lot of objects, DB4O activation of these lists is slow. There are a number of ways to get around this problem, for example, by inverting the relationship.
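To picture what "inverting the relationship" means, here is a minimal sketch in plain Java. The class names (CustomerWithList, Customer, Order, InvertedRelationship) are hypothetical and no DB4O API is used; the only point is the shape of the object graph, where the child points at the parent instead of the parent holding a big list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration only; no DB4O calls appear here.

// Original shape: the parent holds a big list, so loading (activating)
// the parent drags every list element along with it.
class CustomerWithList {
    final String name;
    final List<Integer> orderAmounts = new ArrayList<>();
    CustomerWithList(String name) { this.name = name; }
}

// Inverted shape: the parent stays small and each child refers to its parent.
class Customer {
    final String name;
    Customer(String name) { this.name = name; }
}

class Order {
    final Customer owner;
    final int amount;
    Order(Customer owner, int amount) {
        this.owner = owner;
        this.amount = amount;
    }
}

class InvertedRelationship {
    // Stand-in for a database query such as "all Orders whose owner is c".
    static List<Order> ordersOf(Customer c, List<Order> allOrders) {
        List<Order> result = new ArrayList<>();
        for (Order o : allOrders) {
            if (o.owner == c) {
                result.add(o);
            }
        }
        return result;
    }
}
```

With the inverted shape, the children are fetched by a query only when they are actually needed, rather than every time the parent is loaded.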
Another pain is activation. When you retrieve or delete an object from DB4O, by default it will activate the whole object tree. For example, loading a Foo will load Foo.Bar.Baz.Bat, etc until there's nothing left to load. While this is nice from a programming standpoint, performance will slow down the more nesting in your objects. To improve performance, you can tell DB4O how many levels deep to activate. This is time-consuming to do if you've got a lot of objects.
Another area of pain was text searching. DB4O's text searching is far, far slower than SQL full text indexing. (They'll tell you this outright on their site.) The good news is, it's easy to set up a text searching engine on top of DB4O. On our project, we've hooked up Lucene.NET to index the text fields we want.
Some APIs don't seem to work, such as the GetField APIs useful in applying database upgrades. (For example, if you've renamed a property and you want to upgrade your existing objects in the database, you need to use these "reflection" APIs to find objects in the database.) Other APIs, such as the [Index] attribute, don't work in the stable 6.4 version, and you must instead specify indexes using Configure().Index("someField"), which is not strongly typed.
We've witnessed performance degrade the larger your database. We have a 1GB database right now and things are still fast, but not nearly as fast as when we started with a tiny database.
We've found another issue where Db4O.GetByID will close the database if the ID doesn't exist anymore in the database.
We've found the Native Query syntax (the most natural, language-integrated syntax for queries) is far, far slower than the less-friendly SODA queries. So instead of typing:
// C# syntax for "Find all MyFoos with Bar == 23".
// (Note the Java syntax is more verbose using the Predicate class.)
IList<MyFoo> results = db4o.Query<MyFoo>(input => input.Bar == 23);
Instead of that nice query code, you have to write an ugly SODA query, which is string-based and not strongly typed.
For .NET folks, they've recently introduced a LINQ-to-DB4O provider, which provides for the best syntax yet. However, it's yet to be seen whether performance will be up-to-par with the ugly SODA queries.
DB4O support has been decent: we've talked to them on the phone a number of times and have received helpful info. Their user forums are next to worthless, however; almost all questions go unanswered. Their JIRA bug tracker receives a lot of attention, so if you've got a nagging bug, file it on JIRA and it will often get fixed. (We've had 2 bugs that have been fixed, and another one that got patched in a half-assed way.)
If all this hasn't scared you off, let me say that we're very happy with DB4O, despite the problems we've encountered. The performance we've got has blown away some O/RM frameworks we tried. I recommend it.
Update, July 2015: Keep in mind, this answer was written back in 2008. While I appreciate the upvotes, the world has changed since then, and this information may not be as reliable as it was when it was written.

Most native queries can be, and are, efficiently converted into SODA queries behind the scenes, so that should not make a difference. Using NQ is of course preferred, as you remain in the realm of a strongly typed language. If you have problems getting NQ to use indexes, please feel free to post your problem to the db4o forums and we'll try to help you out.
Goran

Main problem I've encountered with it is reporting. There just doesn't seem to be any way to run efficient reports against a db4o data source.

Judah, it sounds like you are not using transparent activation, which is a feature of the latest production version (7.4). Could you specify the version you are using? Other issues you describe may also be resolved in the latest version.

Related

Solution to provide shared entities between multiple Java processes

I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows, I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ON DUPLICATE KEY UPDATE using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither of these approaches yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to lookup/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system. Most of the applications I mentioned were built for distributed systems, but that's not the case here.
easy to learn/use (I want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source
You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.
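As a rough illustration of what such a coordinating process might hold, the sketch below keeps the flows in a concurrent map keyed by flow id. FlowRegistry and its methods are hypothetical names, parts are plain strings for simplicity, and both the wiring between processes (RMI or otherwise) and persistence between runs are left out. `computeIfAbsent` gives an atomic "create on first sight, otherwise update", which is the in-memory counterpart of the INSERT ... ON DUPLICATE KEY UPDATE attempt from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical registry owned by the coordinating process.
class FlowRegistry {
    private final ConcurrentHashMap<String, Queue<String>> flows =
            new ConcurrentHashMap<>();

    // Atomically creates the flow the first time its id is seen,
    // then appends the part; safe to call from several threads.
    void addPart(String flowId, String part) {
        flows.computeIfAbsent(flowId, id -> new ConcurrentLinkedQueue<>())
             .add(part);
    }

    // Returns a snapshot copy of the parts seen so far for this flow.
    List<String> partsOf(String flowId) {
        Queue<String> parts = flows.get(flowId);
        List<String> snapshot = new ArrayList<>();
        if (parts != null) {
            snapshot.addAll(parts);
        }
        return snapshot;
    }

    int flowCount() {
        return flows.size();
    }
}
```

Whether this reaches the >100k operations per second asked for depends mostly on the cost of the inter-process calls, which this sketch does not measure.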
If you only want to exchange objects between Java applications, you could also look into tuple spaces. There are specific implementations of spaces for Java, JavaSpaces, which should be able to do what you need. I'm not sure whether they can keep up with the performance, though. I'm also not sure how widely this technology is still being used, since it only supports Java and isn't as flexible as NoSQL stores are these days.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
The other option is to go with Redis, you have notifications there and it can for sure scale to the requirements you are looking for.
The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view I would say distributed caches are the replacement for that nowadays. In particular, take a look at Hazelcast and Infinispan.
From the performance viewpoint I am not happy with the performance of the "big" distributed caching solutions, when only a single in-memory cache is needed, see my writeup on the cache2k benchmarks page (hazelcast needs to be added here).
Anyway, please clarify your problem statement first, because your question falls into the XY problem category. You are not describing the actual problem, and your question just boils down to a "fast reliable distributed objects" solution. What kind of data comes in? What is the rate? How is it accessed? What consistency guarantees need to be met, considering that writing and reading happen in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.

Is there any existing Java library that allows you to do fast, in-memory lookups of zipcodes (bonus, state and city) from latitude/longitude?

I have seen many so-called "reverse geocoding" libraries in various languages; all depend on calling an external provider via REST or some similar method. However, you cannot call a REST provider if you must handle thousands of requests per second.
On the other hand, the problem should be simple to solve - CSV-based databases are available for free with this information. The issue is the time and cost of writing an efficient and well-tested in-memory search implementation, versus downloading or buying an existing one.
I can't find any after a lot of looking, but I can't believe there can't be one. Is there any pre-written library that does this?
This question:
Fastest way to find the location(zip, city, state) given latitude/longitude
came the closest, but essentially indicates how to write the solution, not that there is anything available off the shelf. But there must be some library everyone uses for this. A dozen people a day must have this problem.
Spatial databases (e.g. Postgresql with PostGis) use algorithms which are fast in looking up data for given latitude/longitude information. As you want to use a Java library and have it in memory you could look at the H2 Spatial database. I have never used it, so I can't comment on its performance.
Edit: Hm, looking closer at the link I've provided shows that this is a planned feature... Personally I'd simply use Postgresql/PostGis (with or without Java as server frontend) and be done with it. If your server has enough memory it will still fit the "in-memory" requirement. Naturally it does not fit the Java library requirement. There is however JSI, which could be used in memory and with Java.
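For reference, the simplest in-memory version of the lookup discussed in the question is a linear scan over zip-code centroids using the haversine distance. The sketch below is plain Java under stated assumptions: ZipLocator is a hypothetical class, entries would be loaded from one of the free CSV files, and "nearest centroid" is only an approximation of "zip code containing the point". It is a baseline to check a faster index (for example an R-tree such as JSI) against, not a tuned implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical brute-force locator; one Entry per row of a zip-code CSV.
class ZipLocator {
    static final class Entry {
        final String zip;
        final double lat;
        final double lon;
        Entry(String zip, double lat, double lon) {
            this.zip = zip;
            this.lat = lat;
            this.lon = lon;
        }
    }

    private static final double EARTH_RADIUS_KM = 6371.0;
    private final List<Entry> entries = new ArrayList<>();

    void add(String zip, double lat, double lon) {
        entries.add(new Entry(zip, lat, lon));
    }

    // Great-circle distance in kilometres between two points given in degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to guard against rounding slightly above 1.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // Returns the zip whose centroid is closest, or null if nothing is loaded.
    String nearestZip(double lat, double lon) {
        String best = null;
        double bestKm = Double.POSITIVE_INFINITY;
        for (Entry e : entries) {
            double km = haversineKm(lat, lon, e.lat, e.lon);
            if (km < bestKm) {
                bestKm = km;
                best = e.zip;
            }
        }
        return best;
    }
}
```

Each lookup costs one distance computation per entry, so a spatial index becomes worthwhile once the request rate or the number of entries grows.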

Is there an SQL alternative like Python's Shelve for Android/Java?

Shelve is an ultra-simple No-SQL persistence layer which allows you to trivially persist a mapping of objects. It's a commonly used package in Python because it allows you to trivially add persistence to any application.
Its simple nature means it's somewhat limited, but it's surprisingly useful. You can map any arbitrary hashable key onto any serializable object.
Does something like this exist for Android? I'm writing a very simple app, and I've noticed that I'm spending a lot of time faffing around with table structures, select & insert statements. That's the sort of thing I almost never do in Python since I'd usually have some kind of NoSQL alternative.
I'm not expecting it to work exactly the same way; clearly Python and Java are languages with very different characteristics. I just want something that is nearly as simple to use and requires less manual SQL faffing.
One more thing - this is a fairly trivial app. I'd prefer to introduce the bare minimum of additional project dependencies. Preference will be given to solutions which require nothing more than the Android APIs.
You said preference to Java API answers so I probably won't get preference, but Couchbase Mobile is the best Android No-SQL I have ever come across.
http://www.couchbase.org/get/couchbase-mobile-for-android/current
You can use SharedPreferences from DataStore. It's pretty much what you need. You don't need full SQLite power for this.
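If neither option fits, something close to Shelve can be sketched with nothing but java.io serialization, which is also available on Android. The Shelf class below is hypothetical: it keeps a String-keyed map of Serializable values and rewrites the whole file on every put, so it only suits small data sets. On Android, the File would presumably live under the app's private files directory.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical helper loosely modelled on Python's shelve.
class Shelf {
    private final File file;
    private HashMap<String, Serializable> data = new HashMap<>();

    // Loads existing contents if the file is already there.
    @SuppressWarnings("unchecked")
    Shelf(File file) {
        this.file = file;
        if (file.exists() && file.length() > 0) {
            try (ObjectInputStream in =
                         new ObjectInputStream(new FileInputStream(file))) {
                data = (HashMap<String, Serializable>) in.readObject();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException("Unknown class in shelf file", e);
            }
        }
    }

    // Returns the stored value, or null if the key is absent.
    Serializable get(String key) {
        return data.get(key);
    }

    // Stores the value and rewrites the whole map to disk.
    void put(String key, Serializable value) {
        data.put(key, value);
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Unlike Shelve, keys are restricted to strings here, and values written by one version of a class may not load after that class changes.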

Comparison of NoSQL Databases for Java

I want to find out more about NoSQL databases/data stores available for use from Java, and so far I have tried out Project Voldemort. Except for its awfully chosen name, it seems fine so far.
I'd like to find out more about other such database systems. The Wikipedia article has a list of some of them, and there is some documentation on their project pages.
However, instead of comparing technical specs and tutorials provided by authors, what I would like to know is:
What are your experiences with working with these libraries on real projects? Which one would you recommend for use based on that experience, which one you wouldn't and why?
I know that the only people able to answer this question are those who have actually used more than one such database, but I hope that someone has done so.
EDIT:
By "real project" I primarily mean a project in production (but in absence of these anything larger than a homework or finished tutorial applies).
I worked with a relational database that had an enormous amount of data in it, most of it concentrated in a single table, which was denormalized for performance anyway. But, because of the entire mess with constraints etc., creating a usable cluster produced horrible results in both stability and performance.
Now, I'm quite sure that almost any of these NoSQL systems would be a better choice than what I had at my disposal. But there has to be a difference between them, too, whether it is in documentation, stability between versions, community, ease of use, whatever... And there are many giants. Whose shoulders to choose? :D
We have been working with HBase for our projects. Our experience is -
The community is very dynamic and extremely helpful
The installation procedure for developers is quite easy in either pseudo distributed or standalone mode
We have been using it for unit tests that behave like integration tests
Installing a cluster is also easy, but compared with some other NoSQL stores it has more components to install.
Administration: we are still working on this, so I am not able to say much about it.
Do not use it for SQL like SELECT queries, for that we are using Apache Solr
To make development and testing easier we have come up with a simple object mapper - https://github.com/smart-it/smart-dao
The reason I chose HBase is that, like other NoSQL stores, it solves sharding and scaling by design, making things easier in the long run, and that seems to hold up well.
Maybe the most prominent of Java NoSQL solutions is Cassandra. It has some features beyond Voldemort (Order-Preserving Partitioner which allows range queries; BigTable style structure for values); and is missing others (no alternate storage backends or version clocks for versioning).
Its performance is optimized for fast writes, but its biggest strength is probably the ease with which it can be horizontally scaled by adding new nodes (something where Voldemort is a bit more static).
Compared to, say, MongoDB, its data model is quite simple and often there's no point in using much more than key/value abstraction (that is, handle data mapping on client side, store serialized objects).
It has full replication and distribution, unlike some k/v stores (couchdb, from what I understand).
It's pretty difficult to nail down a good choice without knowing exactly what your use case is. Much of it depends on what kind of data model you are comfortable with and which one fits your needs. You have key-value stores, document-oriented, column-oriented, etc. Another huge factor is the product's take on scaling and how it chooses to deal with availability/consistency trade-offs.
I like MongoDB. I like how it supports queries and I like the document oriented data models. It fits many problems that I seem to run into. There is a Great (with capital G) community as seen at the recent MongoSV event.
Your best bet is to pick 3 different products and evaluate them. I would also see if you can find some companies who have presented at conferences and told their stories of how they were successful. Videos from MongoSV will be available soon.

How to tune performance of hsqldb/hibernate app

I have an open source Java application that uses Hibernate and HSQLDB for persistence. In all my toy tests, things run fast and everything is good. I have a client who has been running the software for several months continuously and their database has grown significantly over that time, and performance has dropped gradually. It finally occurred to me that the database could be the problem. As far as I can tell from log statements, all of the computation in the server happens quickly, so this is consistent with the hypothesis that the DB might be at fault.
I know how to do normal profiling of a program to figure out where hot spots are and what is taking up significant amounts of time. But all the profilers I know of monitor execution time within the program and don't give you any help about calls to external resources. What tools do people use to profile programs that are using external db calls to find out where to optimize performance?
A little blind searching around has already found a few hot spots--I noticed a call where I was enumerating all the objects of a particular class in order to find out whether there were any. A one line change to the criterion [.setMaxResults(1)] changed that call from a half-second to virtually instantaneous. I also see places where I ask the same question from the db many times within a single transaction. I haven't figured out how to cache the answer yet, but what I really want is a tool to help me look for these kinds of things more systematically.
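For the "same question many times within a single transaction" case, one low-tech option is to memoize answers for the lifetime of the transaction. The sketch below is plain Java with a hypothetical TransactionCache class; the loader function stands in for whatever query is being repeated, and nothing here uses Hibernate's own caches. It is not thread-safe, on the assumption that a transaction is confined to one thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-transaction memo: create one when the transaction
// starts and throw it away at commit or rollback.
class TransactionCache<K, V> {
    private final Map<K, V> answers = new HashMap<>();
    private final Function<K, V> loader;
    private int loads = 0;

    TransactionCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Runs the loader only the first time a key is asked for;
    // containsKey is used so that null answers are remembered too.
    V get(K key) {
        if (answers.containsKey(key)) {
            return answers.get(key);
        }
        loads++;
        V value = loader.apply(key);
        answers.put(key, value);
        return value;
    }

    // Number of times the loader actually ran, useful for spotting repeats.
    int loadCount() {
        return loads;
    }
}
```

Comparing loadCount() with the number of get calls gives a quick measure of how many round trips the memo saved.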
Unfortunately, as far as I know, there is no tool for that.
But there are some things you might want to check:
Are you using eager loading instead of lazy loading? By the description of your problem, it really looks like you are not using lazy loading...
Have you turned on and properly configured your second-level caching? Including the Query Cache? Hibernate caching mechanism is extremely powerful and flexible.
Have you considered using Hibernate Search? Depending on your query, Hibernate Search's full-text index on top of Apache Lucene can speed up your queries (since its indexing system is so powerful).
How much data are you storing in HSQLDB? I don't think it performs well when managing large sets of data, since it's just storing everything in files...
There was once a tool called IronGrid/IronEye/IronTrackSql that did exactly what you are looking for. Unfortunately, they went out of business. They did open source their product at the last minute, but I have not been able to find source or a binary for quite some time.
I have been using YourKit for profiling lately, partly because you can have it profile SQL time to find your most called statements and longest running statements. It is not as detailed as IronGrid was, but it does give you valuable information. In my latest database/hibernate tuning session, the problem turned out to be hibernate and how and when it was doing eager vs. lazy loading, and adding some judicious overrides of the default when selecting large numbers of items.
Lots to report on here. I have some results, and am still looking for good answers.
I've found a couple of tools that help:
VisualVM (with BTrace, or the built in Trace) claims to help with tracing, but I haven't been able to find any tool that shows timing on method calls.
YourKit is reputed to be useful; I've asked for an open source license.
The most useful thing I found is Hibernate's built in statistics. If you set
hibernate.generate_statistics true in your properties, you can call sessionFactory.getStatistics() and see detailed statistics on what objects have been stored and retrieved and what effects the caches are having. I found one of the answers I wanted in the queryStatistics, which reports, for each compiled query, the cache hits and misses, the number of times the query has run, how many rows were returned, and the average, max and min execution times. These timings made it abundantly clear where the time was going.
I then did some reading on caching. Razenha's suggestion was right on. [I'll mark his answer as right for now.] I added hibernate.cache.use_query_cache true to my properties, and added query.setCacheable(true); to most of my queries. I also added <cache usage="read-write"/> to a few of my .hbm.xml files. Now most of my statistics are showing a vast predominance of cache hits, and the performance is vastly better.
I'd still like some tools to help me trace execution timing so I can attack the worst problems rather than the most obvious, but this is a big help. Maybe one of the tracing tools above will turn out to help.
In Terracotta 3.1, you can monitor all of those statistics in real time using the Terracotta Developer Console. You can see historical graphs for cache statistics, and see the Hibernate statistics or cache statistics cluster-wide or on a per-node basis.
Terracotta is open source. More details and download is at Terracotta for Hibernate.
