Choosing between Berkeley DB Core and Berkeley DB JE

I'm designing a Java based web-app and I need a key-value store. Berkeley DB seems fitting enough for me, but there appears to be TWO Berkeley DBs to choose from: Berkeley DB Core which is implemented in C, and Berkeley DB Java Edition which is implemented in pure Java.
The question is how to choose between them. For a web app, scalability and performance are quite important (who knows, maybe my idea will become the next YouTube), and I couldn't easily find any meaningful benchmarks comparing the two. I have yet to familiarize myself with Core's Java API, but I find it hard to believe that it could be much worse than Java Edition's, which seems quite nice.
If some other key-value store would be much better, feel free to recommend that too. I'm storing smallish binary blobs, and keys probably will be hashes of the data, or some other unique id.
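As a rough illustration of the "keys will be hashes of the data" idea, here is a minimal sketch that derives a fixed-size key from a blob with SHA-256. The class and method names (BlobKeys, keyFor, toHex) are made up for this example, and the choice of SHA-256 is an assumption; the question does not say which hash would be used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of any Berkeley DB API.
class BlobKeys {
    /** Returns the SHA-256 digest of the blob: a fixed-size 32-byte key. */
    static byte[] keyFor(byte[] blob) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(blob);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Lower-case hex form of a key, convenient for logs or URLs. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] blob = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(keyFor(blob)));
    }
}
```

Identical blobs map to the same key, so storing the same data twice simply overwrites one record with itself.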

I have quite a bit of experience using both BDB-JE and BDB-core with Java. Deciding which one to use is quite simple: If you want concurrency, use BDB-JE. If you want scalability, use BDB-core.
BDB-JE breaks down performance-wise with large databases due to its file format and its reliance on Java garbage collection to clean up evicted cache entries. Expect long garbage collection pauses or spend a lot of time tuning magic GC settings. The file format has issues too, because the background cleaner threads have to spend a lot of time cleaning up garbage created by early cache evictions. If your database fits in RAM, BDB-JE works quite well.
BDB-core relies on a page-locking strategy, and highly concurrent applications experience a lot of deadlocks. If you can randomly order operations it reduces the deadlock potential, but it never eliminates it. Because BDB-core stores data in a more traditional way, it scales to super large sizes with predictable and expected performance degradation. Because its cache is not managed by a garbage collector, it can be quite large and not cause any pauses.

If you derive a common interface to these, and have a suitable set of unit tests, you should be able to swap between the two trivially at a later date (perhaps when you really need to make a decision based on hard facts that are not available right now)
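A minimal sketch of what such a common interface could look like. BlobStore and InMemoryBlobStore are hypothetical names invented here; neither Berkeley DB edition ships this interface, and a BDB-JE or BDB-core backed class would have to be written against the respective API. The in-memory version only exists so the unit tests have something to run against.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class BlobStoreDemo {
    public static void main(String[] args) {
        // Application code depends only on the interface, so the backend can be swapped.
        try (BlobStore store = new InMemoryBlobStore()) {
            store.put("id-1", new byte[] {1, 2, 3});
            System.out.println(store.get("id-1").map(v -> Arrays.toString(v)).orElse("missing"));
        }
    }
}

/** Hypothetical storage abstraction (an assumption of this sketch). */
interface BlobStore extends AutoCloseable {
    void put(String key, byte[] value);
    Optional<byte[]> get(String key);
    boolean delete(String key);
    @Override void close();
}

/** In-memory stand-in; a Berkeley DB backed class would implement the same methods. */
class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> data = new ConcurrentHashMap<>();

    @Override public void put(String key, byte[] value) {
        data.put(key, value.clone()); // defensive copy so callers cannot mutate stored bytes
    }

    @Override public Optional<byte[]> get(String key) {
        byte[] value = data.get(key);
        return value == null ? Optional.empty() : Optional.of(value.clone());
    }

    @Override public boolean delete(String key) {
        return data.remove(key) != null;
    }

    @Override public void close() {
        data.clear();
    }
}
```

The same test suite can then be run against every implementation of the interface, which is what makes a later swap cheap.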

I faced the same problem and decided to go with the Java Edition, mainly because of its portability (I need something that will run even on mobile devices). There is also the Direct Persistence Layer (DPL) API, and the fact that the whole database is a single jar makes deployment fairly simple.
The recent version 4 brought high availability and performance improvements. There is also the fact that long-running Java applications can be optimized at runtime to the point where they surpass native C applications in some scenarios.
It's a natural fit for any Java application - desktop or web.

A while ago I had the same question. After doing some benchmarks I found that hash mode in the native edition is much faster and more storage-efficient than anything the Java edition has to offer, so I decided to go with the native implementation.
I suggest you do your own benchmarks for the storage capacities you expect and decide if the Java edition is fast enough.
If it is, or if performance is not a big issue for you (it's critical for me), just go with the Java edition. Otherwise go with the native one (assuming you see the same performance boost for your own use case).
By the way:
My benchmark tested the speed of querying random keys out of 20,000,000 records, where the key is a string and the value is an int (4 bytes).
I saw that inserts (populating the benchmark) were much faster with the native version, and queries were twice as fast.
(This is not due to a Java shortcoming but because the Java version is not at the same version as the native one - 4.0 vs 4.8 IIRC).
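For anyone who wants to run a similar test, here is a minimal sketch of a random-read timing harness. It is not the benchmark described above: the class name, the "key-N" key format and the record counts are assumptions, and a plain HashMap stands in for the real store. To measure Berkeley DB you would pass in a lookup function backed by the edition under test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

// Hypothetical harness; the lookup function is the only store-specific part.
class RandomReadBenchmark {
    /**
     * Performs `lookups` reads of randomly chosen keys "key-0" .. "key-(recordCount-1)".
     * Returns {elapsed nanoseconds, number of keys found}.
     */
    static long[] timeRandomReads(Function<String, Integer> lookup,
                                  int recordCount, int lookups, long seed) {
        Random random = new Random(seed);
        long hits = 0;
        long start = System.nanoTime();
        for (int i = 0; i < lookups; i++) {
            Integer value = lookup.apply("key-" + random.nextInt(recordCount));
            if (value != null) {
                hits++;
            }
        }
        long elapsed = System.nanoTime() - start;
        return new long[] {elapsed, hits};
    }

    public static void main(String[] args) {
        int recordCount = 100_000; // far smaller than the 20,000,000 records mentioned above
        Map<String, Integer> standIn = new HashMap<>();
        for (int i = 0; i < recordCount; i++) {
            standIn.put("key-" + i, i);
        }
        // Warm-up pass so the JIT has compiled the hot loop before the measured run.
        timeRandomReads(standIn::get, recordCount, 200_000, 1L);
        long[] result = timeRandomReads(standIn::get, recordCount, 200_000, 2L);
        System.out.printf("%d lookups, %d hits, %.1f ms%n",
                200_000, result[1], result[0] / 1_000_000.0);
    }
}
```

Timings from a loop like this are only indicative; use the record counts and value sizes you actually expect, and make the data set larger than the cache if that is what production will look like.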

I decided to go with the Java Edition, simply because it's possible to embed the database runtime within the same deployable. This was an important feature for my setup. I haven't benchmarked Core against JE, but I have seen great performance compared with the other key-value stores that I tested when first evaluating database stores.
If you're creating a web-application though, then concurrency might be very important to you in the long run.

Related

How to improve performance of code you don't own?

We have in-house a 3rd party java application on a ridiculously hefty Linux box that runs a scheduling algorithm. The application runs far too slow for the load we need. We do not have the code and the vendor won't be making any changes to the application due to monetary reasons, thus I can't improve the code. The application is single threaded and its design does not lend itself to parallelization (so I can't split the load between 2 boxes).
What can I, either software or hardware wise, do to improve performance of the application?

Get on the newest version of Java (newer versions tend to have performance improvements)
Give Java more memory to work with (benchmark to see if this makes any difference)
Measure what it's doing with top. Upgrade whatever it's having problems with (more memory, faster CPU, SSD). Some CPUs are better at single threaded work loads than others (read: don't run this on a Bulldozer; something with Turbo Boost might be helpful).
Play with other experimental JVM options (benchmark to see if this makes any difference)
Remove any other applications running on this machine (benchmark to see if there's any benefit -- no sense wasting money if it doesn't help)
Pay the vendor to make it faster or give you the code (ie: give them monetary reasons to fix this)
Find an alternative
Write your own alternative

1) You can improve the hardware that the application runs on. Do this by looking at what resources the application is using. Is it maxing out CPU, or using all the memory (or both)? If so, you can add more CPU power or RAM accordingly.
2) Is there a way you can cache the results from the application? Can you ever avoid using it?
Otherwise, there really isn't much more you can do. If it becomes a bigger problem, you might have to write your own scheduling algorithm, or better yet, find a better vendor.
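If the application can be called with the same input more than once, the caching idea in point 2 can be sketched as a small memoizing wrapper. Everything here is hypothetical: CachingWrapper is an invented name, and slowCall stands in for however you actually invoke the vendor application. It only helps when identical inputs always produce identical results.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical memoizing wrapper around an expensive call.
class CachingWrapper<K, V> implements Function<K, V> {
    private final Function<K, V> slowCall;
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    CachingWrapper(Function<K, V> slowCall) {
        this.slowCall = slowCall;
    }

    @Override
    public V apply(K input) {
        // Runs slowCall only on the first request for a given input.
        // A null result is not cached, so it would be recomputed each time.
        return cache.computeIfAbsent(input, slowCall);
    }

    public static void main(String[] args) {
        AtomicInteger invocations = new AtomicInteger();
        Function<String, Integer> cached = new CachingWrapper<>(jobList -> {
            invocations.incrementAndGet(); // stands in for the expensive scheduling run
            return jobList.length();
        });
        cached.apply("job-a,job-b");
        cached.apply("job-a,job-b");
        System.out.println("expensive runs: " + invocations.get());
    }
}
```

This cache is unbounded; for a long-running service you would want some eviction policy on top of it.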

Can you preprocess the input so the application has less work to do?
For example, perhaps the first thing the application does is sort the list of jobs to be scheduled using a merge sort. If you pre-sort the list, then the application's sort will have no work to do. You might be able to sort the list faster than the application can - use many cores, do it ahead of time, etc.

Run it on a faster computer. This is probably the cheapest solution of the lot.

high performance rtsp server

I want to implement a high-performance RTSP server to handle VOD requests. It only handles signaling requests; it does not need to stream the media files. I have completed a version written in Java, based on the MINA networking framework, and its performance does not seem very high.
As far as I know, high-performance SIP servers (e.g. VoIP servers) are written in C (e.g. OpenSIPS, Kamailio). Should I use C or C++ for my project to get a significant performance improvement?
BTW, I found an explanation by the author of OpenSER of why it is written in C:
"On the other hand, it is the garbage collector that can cause lots of troubles when developing SIP applications in Java. A heavily loaded server written in Java stops working when the garbage collector is cleaning the memory. The delay caused by the garbage collector can be even more than 10 seconds. Such delays are unacceptable"
Is that still true nowadays, meaning that I should use C too?

There are a huge number of variables here; language may not be the determining factor. Trustin Lee, the author of MINA, later created Netty, which offers very high performance indeed. Lee himself says that MINA has "relatively poor performance" as a result of the complexity of some of the features it offers being too tightly bound to the core. So you might look at Netty before completely rewriting everything.
If you're using Oracle's JVM, you're using an extremely optimized runtime system that identifies hotspots in the code (hence the name "HotSpot") and aggressively optimizes them at runtime. It's been a long time since you could say, ipso facto, that Java code would run more slowly than C code. Well-written, optimized C code probably out-performs equivalent Java code in certain select tasks, but a generalization from there is probably no longer appropriate, and of course your code has to take on several of the burdens that the JVM shoulders for you with Java. Also note that there are several things you can do to tune the JVM's garbage collector, for instance to prefer consistency and short pauses over footprint and long pauses.
Obviously C has several strengths (being close to the machine is sometimes exactly what you want), as does explicit memory management for certain tasks.

Have you compared your rtsp server with Wowza?
Wowza is also written in Java. If your RTSP server has lower performance than Wowza, I believe you could improve it without changing language. Otherwise, if Wowza performs about the same as your server, that suggests Java cannot satisfy your performance requirements, and you should consider using C/C++ instead.

I built my own RtspServer in C# and have no problem streaming to hundreds of clients.
http://net7mma.codeplex.com/
Code Project article: http://www.codeproject.com/Articles/507218/Managed-Media-Aggregation-using-Rtsp-and-Rtp
You are more than welcome to adopt / reference the design! (Apache 2 License)

Existing File based implementation of java.util.Map

I'm working on a project that uses a custom Map<String, Entry> implementation (where Entry is a pair of ints) based on a B-tree to store 10 to 100 million records. The code for this class is slow and dirty. I need an efficient Map implementation that uses a file for storage and only a small amount of memory.
I searched and found that Berkeley DB Java Edition has a java.util.Collection API (including Map), but it seems superfluous to use a fully fledged database for this purpose (it uses a directory with many files and has several additional management threads). Is there a simpler solution?

I had this very same problem recently and looked at everything under the sun, including NoSQL and caches. You want a disk/file based/backed hashmap.
Berkeley DB Java Edition is by far the best. It's fast, scalable, and complete, but you can't distribute it to clients without distributing your source code or buying the commercial version from Oracle.
The only other choice, besides reinventing the wheel, is JDBM2. It also has a hash map and a tree map. You are responsible for regularly flushing to disk to prevent OutOfMemoryError, and it isn't nearly as fast as Berkeley DB, but it is a very good second choice.

Take a look at Kyoto Cabinet, a disk-backed DBM implementation. I've used the previous version, Tokyo Cabinet - it was dead easy to use, basically just like a native Map, and very fast.

JDBM is a lightweight, pure Java B-Tree implementation.

Comparison of NoSQL Databases for Java

I want to find out more about the NoSQL databases/data stores available for use from Java, and so far I have tried out Project Voldemort. Except for the awfully chosen name, it seems fine so far.
I'd like to find out more about other such database systems. The Wikipedia article lists some of them, and there is some documentation on their project pages.
However, instead of comparing technical specs and tutorials provided by authors, what I would like to know is:
What are your experiences with working with these libraries on real projects? Which one would you recommend for use based on that experience, which one you wouldn't and why?
I know that the only people able to answer this question are those who have actually used more than one such database, but I hope that someone has.
EDIT:
By "real project" I primarily mean a project in production (but in absence of these anything larger than a homework or finished tutorial applies).
I worked with a relational database that had an enormous amount of data in it, most of it concentrated in a single table, which was denormalized for performance anyway. But because of the entire mess with constraints etc., creating a usable cluster gave horrible results in both stability and performance.
Now, I'm quite sure that almost any of these NoSQL systems would be a better choice than what I had at my disposal. But there has to be a difference between them, too, whether it is in documentation, stability between versions, community, ease of use, whatever... And there are many giants. Whose shoulders to choose? :D

We have been working with HBase for our projects. Our experience is -
The community is very dynamic and extremely helpful
The installation procedure for developers is quite easy in either pseudo distributed or standalone mode
We have been using it for integration tests that run like unit tests
Installing a cluster is also easy, but it has more components to install than some other NoSQL systems.
Administering it is still an ongoing effort for us, so I can't say much about that yet.
Do not use it for SQL-like SELECT queries; for that we are using Apache Solr
To make development and testing easier we have come up with a simple object mapper - https://github.com/smart-it/smart-dao
The reason I chose HBase is that, like other NoSQL systems, it solves sharding and scaling by design, which makes things easier in the long run, and that seems to hold up well.

Maybe the most prominent of Java NoSQL solutions is Cassandra. It has some features beyond Voldemort (Order-Preserving Partitioner which allows range queries; BigTable style structure for values); and is missing others (no alternate storage backends or version clocks for versioning).
Its performance is optimized for fast writes, but its biggest strength is probably the ease with which it can be horizontally scaled by adding new nodes (something where Voldemort is a bit more static).
Compared to, say, MongoDB, its data model is quite simple and often there's no point in using much more than key/value abstraction (that is, handle data mapping on client side, store serialized objects).
It has full replication and distribution, unlike some k/v stores (couchdb, from what I understand).

It's pretty difficult to nail down a good choice without knowing exactly what your use case is. Much of it depends on what kind of data model you are comfortable with and what fits your needs. You have key-value stores, document-oriented, column-oriented, etc. Another huge factor is each product's take on scaling and how it chooses to deal with availability/consistency trade-offs.
I like MongoDB. I like how it supports queries and I like the document oriented data models. It fits many problems that I seem to run into. There is a Great (with capital G) community as seen at the recent MongoSV event.
Your best bet is to pick 3 different products and evaluate them. I would also see if you can find some companies who have presented at conferences and told their stories of how they were successful. Videos from MongoSV will be available soon.

Alternative to Java

I need an alternative to Java, because I am working on a genetics-calculation project.
It takes a lot of memory and most of the CPU time, so it won't work when I deploy it on a server, because many people use the program at the same time.
Does anybody know another language that is not running in a virtual machine and is similar to Java (object-oriented, using exceptions and type-safety)?
Best regards,
Jonathan

To answer the direct question: there are dozens of languages that fit your explicit requirements. AmmoQ listed a few; Wikipedia has many more.
And I think that you'll be disappointed with every one of them.
Despite what Java haters want you to think, Java's performance is not much different than any other compiled language. Just changing languages won't improve performance much.
You'll probably do better by getting a profiler, and looking at the algorithms that you used.
Good luck!

If your app is consuming most of the CPU and memory on a single-user workstation, I'm skeptical that translating it into some non-VM language is going to help much. With Java, you're depending on the VM for things like memory management; you're going to have to re-implement their equivalents in your non-VM language. Also, Java's memory management is pretty good. Your application probably isn't real-time sensitive, so having it pause once in a while isn't a problem. Besides, you're going to be running this on a multi-user system anyway, right?
Memory usage will have more to do with your underlying data structures and algorithms than with something magical about the language. Unless you've got a really great memory allocator library for your chosen language, you may find it uses just as much memory (if not more) due to bugs in your program.
Since your app is compute-intensive, some other language is unlikely to make it less so, unless you insert some strategic sleep() calls throughout the code to deliberately make it yield the CPU more often. This will slow it down, but will be nicer to the other users.
Try running your app with Java's -server option. That will engage a VM designed for long-running programs and includes a JIT that will compile your Java into native code. It may make your program run a bit faster, but it will still be CPU and memory bound.

If you don't like C++, you might consider D, Objective-C or the new Go language from Google.

You may try C++, it satisfies all your requirements.

Use Python along with the numpy, scipy and matplotlib packages. numpy is a Python package which has all the number-crunching code implemented in C. Hence runtime performance (a concern because of the Python virtual machine) won't be an issue.
If you want compiled, statically typed language only, have a look at Haskell.

Can your algorithms be parallelised?
No matter what language you use you may come up against limitations at some point if you use a single process. Using something like Hadoop will mean you can retain Java and ease of use but you can run in parallel across many machines.

On the same theme as @Barry Brown's answer:
If your application is compute / memory intensive in Java, it will probably be compute / memory intensive in C++ or any other "more efficient" language. You might get some extra leeway ... but you'll soon run into the same performance wall.
IMO, you need to do the following things:
You need to profile your application and look for any major performance bottlenecks. You might find some real surprises.
In the light of the previous step, review the design and algorithms, paying attention to space and time complexity issues. Do some research to see if someone has discovered better algorithms for doing the computations that are problematic from a performance perspective.
If the previous steps don't get you ahead of the curve, see if you can upgrade your platform; get a bigger machine with more processors, more memory, etc.
If you are still stuck, your only other option is a scale-out design. Assuming that individual user requests are processed in a single thread, re-architect your system so that you can run "workers" across multiple servers, with a load balancer on the front. If you have a persistent back-end, look into how you can replicate that. And so on.
Figure out if the key algorithms can be parallelized / distributed so that the resource intensive parts of a user request execute in parallel on multiple processors / multiple servers; e.g. using a "map-reduce" framework.
OK, so there is no easy answer. But simply changing programming languages is NOT a good answer.

Regardless of language your program will need to share with others when running in multiple instances on a single machine. That is simply the way computers work.
The best way to allow your current program to scale to use the available hardware resources is to chop your work into small, independent pieces and make them implement the Callable interface. These can then be executed by a suitable Executor, chosen according to the available hardware. See the Executors class for many preconfigured versions. This is what I would recommend you do here.
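A minimal sketch of the Callable/Executor approach described above. The workload (a sum of squares split into ranges) is just a placeholder for one independent piece of the real calculation, and the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Placeholder workload; each Callable stands in for one independent unit of work.
class ChunkedSum {
    /** Computes 1^2 + 2^2 + ... + n^2 by splitting the range into about `chunks` tasks. */
    static long sumOfSquares(int n, int chunks) {
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<>();
            int chunkSize = Math.max(1, (n + chunks - 1) / chunks);
            for (int start = 1; start <= n; start += chunkSize) {
                final int from = start;
                final int to = Math.min(n, start + chunkSize - 1);
                tasks.add(() -> {
                    long sum = 0;
                    for (long i = from; i <= to; i++) {
                        sum += i * i;
                    }
                    return sum;
                });
            }
            long total = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000_000, 16));
    }
}
```

Sizing the pool from availableProcessors() is one reasonable default for CPU-bound work; on a shared server you may want a smaller pool so other users are not starved.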
If you want to switch languages, then Mac OS X 10.6 allows for programming in the way described above with C and Objective-C, and if you do it properly OS X can distribute the code over all available computing resources (both CPU and GPU and what have you).
If none of the above is interesting to you, then consider one of the Grid frameworks. Terracotta may be a good place to start.

F# or ruby, or python, they are very good for calculations, and many other things
NASA uses python

Well.. I think you are looking for C#.
C# is Object Oriented and has excellent support for Generics. You can use it do write both WinForm and server-side applications.
You can read more about C# generics here: http://msdn.microsoft.com/en-us/library/ms379564(VS.80).aspx
Edit:
My mistake, geneTIcs, not geneRIcs. It does not change the fact C# will do the job, and using generics will reduce load significantly.

You might find the computer language shootout here interesting.
For example, here's Java vs C++.
You might find Ocaml (from which F# is derived) worth a look; it meets your requirements for OO, exceptions, static types and it has a native compiler, however according to the shootout you may be trading less memory for lower speed.
