I want to find out more about NoSQL databases/data-stores available for use from Java, and so far I've tried out Project Voldemort. Except for the awfully chosen name, it seems fine so far.
I'd like to find out more about other such database systems. Now, the Wikipedia article lists some of them, and there is some documentation on their project pages.
However, instead of comparing technical specs and tutorials provided by authors, what I would like to know is:
What are your experiences with working with these libraries on real projects? Which one would you recommend based on that experience, which one wouldn't you, and why?
I know that the only people able to answer this question are those who have actually used more than one such database, but I hope someone has done so.
EDIT:
By "real project" I primarily mean a project in production (but in absence of these anything larger than a homework or finished tutorial applies).
I worked with a relational database that had an enormous amount of data in it, most of it concentrated in a single table, which was denormalized for performance anyway. But because of the whole mess with constraints etc., creating a usable cluster showed horrible results in both stability and performance.
Now, I'm quite sure that most likely any of these NoSQL systems would be a better choice than what I had at my disposal. But there has to be a difference between them too, whether it's in documentation, stability between versions, community, ease of use, whatever... And there are many giants. Whose shoulders should I choose? :D
We have been working with HBase for our projects. Our experience is:
The community is very dynamic and extremely helpful
The installation procedure for developers is quite easy in either pseudo-distributed or standalone mode
We have been using it for integration tests, much like unit tests
Installing a cluster is also easy, but compared to some other NoSQL stores it has more components to install
Administering is still ongoing, so I can't say much about it yet
Do not use it for SQL-like SELECT queries; for those we are using Apache Solr (see the lookup sketch after this list)
To make development and testing easier we have come up with a simple object mapper - https://github.com/smart-it/smart-dao
The reason I chose HBase is that, like other NoSQL stores, it solves sharding and scaling by design, making things easier in the long run, and that seems to hold up well
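Since HBase access is by row key rather than ad hoc SQL, a point lookup with the plain HBase Java client looks roughly like this. This is a minimal sketch (1.x-style client API); the table name "flows", column family "cf", and qualifier "payload" are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("flows"))) {
            Get get = new Get(Bytes.toBytes("row-42")); // point lookup by row key
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("payload"));
            System.out.println(Bytes.toString(value));
        }
    }
}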
Maybe the most prominent of the Java NoSQL solutions is Cassandra. It has some features beyond Voldemort (an order-preserving partitioner, which allows range queries; a BigTable-style structure for values) and is missing others (no alternate storage backends, no vector clocks for versioning).
Its performance is optimized for fast writes, but its biggest strength is probably the ease with which it can be horizontally scaled by adding new nodes (something where Voldemort is a bit more static).
Compared to, say, MongoDB, its data model is quite simple, and often there is no point in using much more than the key/value abstraction (that is, handle data mapping on the client side and store serialized objects).
It has full replication and distribution, unlike some k/v stores (CouchDB, from what I understand).
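To illustrate that key/value-plus-client-side-serialization style, here is a minimal sketch using the DataStax Java driver (just one common choice, not prescribed above); it assumes a keyspace app containing a table created as CREATE TABLE kv (key text PRIMARY KEY, value blob):

import java.nio.ByteBuffer;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraKv {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("app")) {
            byte[] serialized = "serialized-object-bytes".getBytes(); // stand-in for your own object serialization
            session.execute("INSERT INTO kv (key, value) VALUES (?, ?)",
                    "user:42", ByteBuffer.wrap(serialized));
            Row row = session.execute("SELECT value FROM kv WHERE key = ?", "user:42").one();
            ByteBuffer value = row.getBytes("value"); // deserialize on the client side
            System.out.println(value.remaining() + " bytes read back");
        }
    }
}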
It's pretty difficult to nail down a good choice without knowing exactly what your use case is. Much of it depends on what kind of data model you are comfortable with and what fits your needs. You have key-value stores, document-oriented, column-oriented, etc. Another huge factor is each product's take on scaling and how it chooses to deal with availability/consistency trade-offs.
I like MongoDB. I like how it supports queries, and I like the document-oriented data model. It fits many problems that I seem to run into. There is a Great (with a capital G) community, as seen at the recent MongoSV event.
Your best bet is to pick three different products and evaluate them. I would also see if you can find companies that have presented at conferences and told their stories of how they were successful. Videos from MongoSV will be available soon.
Related
I would like to build a distributed NoSQL database or key-value store using golang, to learn golang and practice the distributed-systems knowledge I've learned at school. The target use case I can think of is running MapReduce on top of it, and implementing an HDFS-compatible "filesystem" to expose the data to Hadoop, similar to running Hadoop on Ceph or Amazon S3.
My question is: what difficulties should I expect when integrating such a NoSQL database with Hadoop? Or when integrating with other languages (e.g., providing Ruby/Python/Node.js/C++ APIs) if I use golang to build the system?
Ok, I'm not much of a Hadoop user, so I'll give you some more general lessons learned about the issues you'll face:
Protocol. If you're going with REST, Go will be fine, but expect to find some gotchas in the default HTTP library's defaults (idle keepalive connections are not expired; you don't necessarily know when a reader has closed a stream). If you want something more compact, know that: (a) the Thrift implementation for Go, last I checked, was lacking and relatively slow; (b) Go has great support for RPC, but it might not play well with other languages. So you might want to check out protobuf, or work on top of the redis protocol or something like that.
GC. Go's GC is very simplistic (stop-the-world, not generational, etc.). If you plan on heavy in-memory caching on the order of multiple gigabytes, expect GC pauses all over the place. There are techniques to reduce GC pressure, but the straightforward Go idioms aren't usually optimized for that.
mmap'ing in Go is not straightforward, so it will be a bit of a struggle if you want to leverage that.
Besides slices, lists, and maps, you won't have a lot of built-in data structures to work with, like a Set type. There are tons of good implementations of them out there, but you'll have to do some digging.
Take the time to learn the concurrency patterns and interface patterns in Go. They're a bit different from other languages, and as a rule of thumb, if you find yourself struggling with a pattern from another language, you're probably doing it wrong. A good talk about Go concurrency, IMHO, is this one: http://www.youtube.com/watch?v=QDDwwePbDtw
A few projects you might want to have a look at:
Groupcache - a distributed key/value cache written in Go by Brad Fitzpatrick for Google's own use. It's a great implementation of a simple yet super robust distributed system in Go. https://github.com/golang/groupcache and check out Brad's presentation about it: http://talks.golang.org/2013/oscon-dl.slide
InfluxDB, which includes a Go-based implementation of the great Raft algorithm: https://github.com/influxdb/influxdb
My own humble (pretty dead) project, a redis compliant database that's based on a plugin architecture. My Go has improved since, but it has some nice parts, and it includes a pretty fast server for the redis protocol. https://bitbucket.org/dvirsky/boilerdb
I am trying to reconstruct a flow of information from multiple parts handled by different Java processes. Please note that I don't generate the flows; I just read some information about them.
I've tried using MySQL (MyISAM/InnoDB tables) with INSERT ... ON DUPLICATE KEY UPDATE, using an id for each flow. I've also tried storing all the pieces of information and running a query at the end to get the full information. Neither approach yielded the performance needed.
I'm looking for a solution that will allow me to have a set of shared objects between multiple Java processes. The objects should be persistent between runs and fast to lookup/update concurrently (>100k lookups/updates per second).
I've thought of a few solutions including:
NoSQL: something like MongoDB, HBase etc.
a caching solution like EhCache, Memcached etc.
The problem is I don't have any experience with any of these solutions. So, what would you recommend that fits the following criteria:
very fast on a single system. Most of the applications I mentioned were built for distributed systems, but that's not the case here.
easy to learn/use (I want to be able to prototype it in a day)
mature technology
free to use even for commercial purposes
preferably open-source
You could try a separate Java process that coordinates between the others. This process would hold the information to pass over to the main processes. You could wire them up with RMI.
If you only want to exchange objects within Java applications, you could also look into tuple spaces. There is a specific implementation of spaces for Java, JavaSpaces, which should be able to do what you need; I'm not sure it can keep up with the performance requirements, though. Also, I'm not sure how widely this technology is still used, since it only supports Java and isn't as flexible as NoSQL stores are these days.
Wikipedia has a more detailed description and list of different implementations, many of which are open source.
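For a taste of the tuple-space programming model, here is a minimal JavaSpaces sketch; the FlowEntry fields are hypothetical, and obtaining the space itself (via Jini lookup) is omitted:

import net.jini.core.entry.Entry;
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

// Entries must be public classes with public fields and a no-arg constructor.
public class FlowEntry implements Entry {
    public String flowId;
    public String fragment;
    public FlowEntry() {}
}

class SpaceExchange {
    static void exchange(JavaSpace space) throws Exception {
        FlowEntry piece = new FlowEntry(); // one process writes a piece of the flow...
        piece.flowId = "42";
        piece.fragment = "part-1";
        space.write(piece, null, Lease.FOREVER);

        FlowEntry template = new FlowEntry(); // ...another takes it by template matching
        template.flowId = "42"; // null fields act as wildcards
        FlowEntry match = (FlowEntry) space.take(template, null, 1000 /* ms timeout */);
        System.out.println(match.fragment);
    }
}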
The other option is to go with Redis; it has notifications, and it can certainly scale to the requirements you are looking for.
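For example, with Jedis (one common Java client; any Redis client would do), a per-flow hash shared between processes might look like this, with made-up key names:

import java.util.Map;
import redis.clients.jedis.Jedis;

public class RedisSharedFlows {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // redis-server runs as its own process
            jedis.hset("flow:42", "part-1", "payload-bytes"); // each process adds the pieces it sees
            jedis.hset("flow:42", "part-2", "more-bytes");
            Map<String, String> flow = jedis.hgetAll("flow:42"); // reassemble the full flow
            System.out.println(flow);
        }
    }
}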
The old (legacy?) solution is JavaSpaces. However, from a software architect's point of view, I would say distributed caches are the replacement for that nowadays. In particular, take a look at Hazelcast and Infinispan.
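A minimal Hazelcast sketch (3.x-era API, with a made-up map name) shows how little code a shared map needs:

import java.util.Map;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HazelcastShared {
    public static void main(String[] args) {
        // Every JVM that creates an instance discovers the others and joins the cluster.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        Map<String, String> flows = hz.getMap("flows"); // a map shared across all member processes
        flows.put("flow:42", "reassembled-payload");
        System.out.println(flows.get("flow:42"));
        hz.shutdown();
    }
}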
From the performance viewpoint, I am not happy with the performance of the "big" distributed caching solutions when only a single in-memory cache is needed; see my writeup on the cache2k benchmarks page (Hazelcast still needs to be added there).
Anyway, please clarify your problem statement first, because your question falls into the XY problem category: you are not describing the actual problem, and your question just boils down to a "fast reliable distributed objects" solution. What kind of data comes in? At what rate? How is it accessed? What consistency guarantees need to be met, given that writing and reading happen in parallel?
By the term "flow of information" it sounds more like a complex event processing problem to me.
Shelve is an ultra-simple NoSQL persistence layer that allows you to trivially persist a mapping of objects. It's a commonly used package in Python because it makes it trivial to add persistence to any application.
Its simplistic nature means it's somewhat limited, but it's surprisingly useful: you can map any arbitrary hashable key onto any serializable object.
Does something like this exist for Android? I'm writing a very simple app, and I've noticed that I'm spending a lot of time faffing around with table structures and select & insert statements. That's the sort of thing I almost never do in Python, since I'd usually have some kind of NoSQL alternative.
I'm not expecting it to work exactly the same way; clearly Python and Java are languages with very different characteristics. I just want something that's nearly as simple to use and requires less manual SQL faffing.
One more thing: this is a fairly trivial app, so I'd prefer to introduce the bare minimum of additional project dependencies. Preference will be given to solutions that require nothing more than the Android APIs.
You said preference goes to Android-API-only answers, so I probably won't get preference, but Couchbase Mobile is the best Android NoSQL solution I have ever come across.
http://www.couchbase.org/get/couchbase-mobile-for-android/current
You can use SharedPreferences (or its Jetpack successor, DataStore). It's pretty much what you need; you don't need full SQLite power for this.
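A shelve-like key-value wrapper over SharedPreferences needs nothing beyond the framework APIs. A minimal sketch (names are made up; non-string objects would have to be serialized first, e.g. to JSON):

import android.content.Context;
import android.content.SharedPreferences;

public class PrefsStore {
    private final SharedPreferences prefs;

    public PrefsStore(Context context) {
        // A private, file-backed key-value store managed by the framework.
        prefs = context.getSharedPreferences("app_store", Context.MODE_PRIVATE);
    }

    public void put(String key, String value) {
        prefs.edit().putString(key, value).apply(); // apply() persists asynchronously
    }

    public String get(String key) {
        return prefs.getString(key, null);
    }
}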
I'm looking for the best database software for a new open source application. The primary criterion is that it has to be lightning fast for searching among tens of thousands of entries. Ideally it would be entirely Java based, but simply having a Java API is OK. I'm looking to license under the GPL, so the project would have to be compatible with that. So far SQLite seems to be the most ubiquitous solution, but I don't want to overlook something else if it could turn out to be better.
When I search the general internet, most results seem to be for object databases. I don't care whether the database is object-based or relational, and I don't think I care if it's "NoSQL". I have lots of experience with MySQL, but I'm not terribly afraid of learning a new query language or interface if it's faster that way. The main kind of data this will be managing is filenames with at least 20 metadata fields attached; I'd want to have multiple datasets with the same fields, and it would be nice to also store some application preferences in the database.
I see from some responses that there may be confusion about my (former) use of "embedded" in the title. I want to clarify that I mean "embedded in the application and redistributed", not "in use on an embedded device". The application currently targets full-scale computers, although one reason for "ideally it would be entirely Java based" is a dreamy aspiration of creating an Android version.
Ultimately it really depends on your application. SQLite is not designed to be as robust as standard client/server databases like Oracle and MySQL. The SQLite FAQ says the following on the subject:
However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.
That being said, SQLite is very fast, but this depends on how you'll be using it and on what platform. If you are running on an embedded device you may see significantly different performance than on a regular desktop/server, which is why it's hard to give an exact answer. SQLite does see significant performance gains from not abiding by the standard client/server model.
Your best bet is to pick a few, like SQLite, PostgreSQL, and MySQL, and see the performance implications of each by running tests that simulate common scenarios you will encounter in your application.
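As a starting point for such tests, here is a minimal sketch using the xerial sqlite-jdbc driver (an assumption; any JDBC driver works the same way), with a made-up schema approximating the filenames-plus-metadata data described above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqliteSmokeTest {
    public static void main(String[] args) throws Exception {
        // The whole database is a single file shipped alongside the application.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:files.db")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, size INTEGER, mtime INTEGER)");
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT OR REPLACE INTO files VALUES (?, ?, ?)")) {
                ps.setString(1, "photo.jpg");
                ps.setLong(2, 123456L);
                ps.setLong(3, System.currentTimeMillis());
                ps.executeUpdate();
            }
            try (ResultSet rs = conn.createStatement().executeQuery(
                    "SELECT name, size FROM files WHERE size > 100000")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getLong("size"));
                }
            }
        }
    }
}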
Take a look at http://www.polepos.org/ ; there is a benchmark there which claims that http://www.db4o.com/ is one of the fastest embedded DBs.
I have personally worked with db4o and it's very nice, and it's licensed under the GPL, so it should fit your needs.
I'm currently trying out db4o (the Java version) and I pretty much like what I see. But I can't help wondering how it performs in a real live (web) environment. Does anyone have any experiences (good or bad) to share about running db4o?
We run the DB4O .NET version in a large client/server project.
Our experience is that you can potentially get much better performance than with typical relational databases.
However, you really have to tweak your objects to get this kind of performance. For example, if you've got a list containing a lot of objects, DB4O's activation of those lists is slow. There are a number of ways to get around this problem, for example by inverting the relationship.
Another pain is activation. When you retrieve or delete an object from DB4O, by default it will activate the whole object tree. For example, loading a Foo will load Foo.Bar.Baz.Bat, etc., until there's nothing left to load. While this is nice from a programming standpoint, performance slows down the more nesting you have in your objects. To improve performance, you can tell DB4O how many levels deep to activate, which is time-consuming to do if you've got a lot of objects.
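In the Java version, capping the activation depth looks roughly like this (a sketch against the 7.x+ embedded API; the depth of 2 is just an example):

import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.config.EmbeddedConfiguration;

public class Db4oActivationConfig {
    public static void main(String[] args) {
        EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
        config.common().activationDepth(2); // activate 2 levels instead of the whole object tree
        ObjectContainer db = Db4oEmbedded.openFile(config, "data.db4o");
        try {
            // Query results now come back activated only 2 levels deep; deeper
            // references stay unloaded until db.activate(obj, depth) is called.
        } finally {
            db.close();
        }
    }
}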
Another area of pain was text searching. DB4O's text searching is far, far slower than SQL full-text indexing. (They'll tell you this outright on their site.) The good news is, it's easy to set up a text-searching engine on top of DB4O. On our project, we've hooked up Lucene.NET to index the text fields we want.
Some APIs don't seem to work, such as the GetField APIs, which are useful in applying database upgrades. (For example, if you've renamed a property and want to upgrade your existing objects in the database, you need to use these "reflection" APIs to find objects in the database.) Other APIs, such as the [Index] attribute, don't work in the stable 6.4 version; you must instead specify indexes using Configure().Index("someField"), which is not strongly typed.
We've witnessed performance degrade as the database grows. We have a 1 GB database right now, and things are still fast, but not nearly as fast as when we started with a tiny database.
We've found another issue where Db4O.GetByID will close the database if the ID doesn't exist anymore in the database.
We've found the Native Query syntax (the most natural, language-integrated syntax for queries) is far, far slower than the less-friendly SODA queries. So instead of typing:
// C# syntax for "Find all MyFoos with Bar == 23".
// (Note the Java syntax is more verbose using the Predicate class.)
IList<MyFoo> results = db4o.Query<MyFoo>(input => input.Bar == 23);
Instead of that nice query code, you have to write an ugly SODA query, which is string-based and not strongly typed.
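For the Java version the same contrast looks roughly like this (a sketch; MyFoo and its bar field are the hypothetical class from the C# snippet above):

import java.util.List;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;
import com.db4o.query.Predicate;
import com.db4o.query.Query;

class MyFoo {
    private int bar;
    public int getBar() { return bar; }
}

public class Db4oQueries {
    // Native Query: strongly typed, but slow when it can't be optimized into
    // index access and falls back to instantiating every stored MyFoo.
    static List<MyFoo> nativeQuery(ObjectContainer db) {
        return db.query(new Predicate<MyFoo>() {
            @Override
            public boolean match(MyFoo foo) {
                return foo.getBar() == 23;
            }
        });
    }

    // SODA: field names as strings, not strongly typed, but maps directly to indexes.
    static ObjectSet<MyFoo> sodaQuery(ObjectContainer db) {
        Query query = db.query();
        query.constrain(MyFoo.class);
        query.descend("bar").constrain(23);
        return query.execute();
    }
}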
For .NET folks, they've recently introduced a LINQ-to-DB4O provider, which provides the best syntax yet. However, it remains to be seen whether its performance will be on par with the ugly SODA queries.
DB4O support has been decent: we've talked to them on the phone a number of times and received helpful info. Their user forums are next to worthless, though; almost all questions go unanswered. Their JIRA bug tracker receives a lot of attention, so if you've got a nagging bug, file it on JIRA and it often will get fixed. (We've had two bugs fixed, and another one that got patched in a half-assed way.)
If all this hasn't scared you off, let me say that we're very happy with DB4O despite the problems we've encountered. The performance we've gotten has blown away some O/RM frameworks we tried. I recommend it.
Update, July 2015: keep in mind this answer was written back in 2008. While I appreciate the upvotes, the world has changed since then, and this information may not be as reliable as it was when it was written.
Most native queries can be, and are, efficiently converted into SODA queries behind the scenes, so that should not make a difference. Using NQ is of course preferred, as you remain in the realm of a strongly typed language. If you have problems getting NQ to use indexes, please feel free to post your problem to the db4o forums and we'll try to help you out.
Goran
The main problem I've encountered with it is reporting: there just doesn't seem to be any way to run efficient reports against a db4o data source.
Judah, it sounds like you are not using transparent activation, which is a feature of the latest production version (7.4)? Perhaps you could specify the version you are using, as there may be other issues that are now resolved in the latest version.