key-value store suggestion - java

I need a very basic key-value store for Java. I started with a HashMap, but it seems that HashMap is somewhat space inefficient (I'm storing ~20 million records, which seems to require ~6GB RAM).
The map is Map<Integer,String>, and so I'm considering using GNU Trove TIntObjectHashMap<byte[]>, and storing the map value as an ascii byte array rather than String.
As an alternative to that, is there a key-value store that only requires adding jar files, does not hold the entire map in RAM at once, and is still reasonably fast?
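The "ASCII byte array instead of String" idea from the question can be tried with the JDK alone before bringing in Trove. The following is only a minimal sketch: the class name `AsciiValueMap` is made up for illustration, and it still uses a plain `HashMap` with boxed `Integer` keys, so it only shows the value-side conversion, not the key-side saving a primitive-keyed map would add.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper: stores values as ASCII bytes instead of String objects.
class AsciiValueMap {
    private final Map<Integer, byte[]> map = new HashMap<>();

    void put(int key, String value) {
        // Characters outside ASCII would be replaced by '?' here,
        // so this only works if the values really are ASCII.
        map.put(key, value.getBytes(StandardCharsets.US_ASCII));
    }

    String get(int key) {
        byte[] raw = map.get(key);
        return raw == null ? null : new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        AsciiValueMap m = new AsciiValueMap();
        m.put(42, "hello");
        System.out.println(m.get(42));
    }
}
```

Every `get` allocates a new String, so this trades some CPU for memory.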

BabuDB
BabuDB is an embedded non-relational database system. Its lean and simple design allows it to persistently store large amounts of key-value pairs without the overhead and complexity of similar approaches such as BerkeleyDB.
License: New BSD license, Language: Java
JDBM2
JDBM2 provides HashMap and TreeMap which are backed by disk storage.
License: Apache License 2.0, Language: Java
Banana DB
Banana DB is a self-contained key/value pair database implemented in Java.
License: Apache License 2.0, Language: Java
I've tried BabuDB and JDBM2 and they work fine. BabuDB is a little bit more difficult to set up, but potentially delivers higher performance than JDBM2.
These are all databases that persist data on disk. There are also solutions for holding a large map in memory (ehcache, hazelcast, ...).

Use Berkeley DB.
Berkeley DB stores object graphs, objects in collections, or simple binary key/value data directly in a B-tree on disk. This simple, highly efficient approach removes all the unnecessary overhead in ORM solutions. Using the Direct Persistence Layer (DPL), Java developers annotate classes with storage information, much like JPA. This approach is familiar, efficient, and fast. The DPL reduces the complexity of data storage while not sacrificing speed.
This should definitely give you huge gains in memory and speed, while not increasing the complexity of your application. Enjoy!

http://www.mapdb.org/ is what you are looking for. It's a rocking fast persistent implementation of java.util.Map.
Features
Concurrent
MapDB has record level locking and state-of-art concurrent engine. Its performance scales nearly linearly with number of cores. Data can be written by multiple parallel threads.
Fast
MapDB has outstanding performance rivaled only by native DBs. It is result of more than a decade of optimizations and rewrites.
ACID
MapDB optionally supports ACID transactions with full MVCC isolation. MapDB uses write-ahead-log or append-only store for great write durability.
Flexible
MapDB can be used everywhere from in-memory cache to multi-terabyte database. It also has number of options to trade durability for write performance. This makes it very easy to configure MapDB to exactly fit your needs.
Hackable
MapDB is component based, most features (instance cache, async writes, compression) are just class wrappers. It is very easy to introduce new functionality or component into MapDB.
SQL Like
MapDB was developed as faster alternative to SQL engine. It has number of features which makes transition from relational database easier: secondary indexes/collections, autoincremental sequential ID, joins, triggers, composite keys…
Low disk-space usage
MapDB has number of features (serialization, delta key packing…) to minimize disk used by its store. It also has very fast compression and custom serializers. We take disk-usage seriously and do not waste single byte.

Consider Koloboke Collections, which is up to 2 times faster than Trove when configured to consume the same memory as Trove, according to various tests:
Time - memory tradeoff with the example of Java Maps
Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove
Alternatively, you can think of it as consuming considerably less memory when configured to be as fast as Trove.
If you want to persist the map between JVM runs with very quick bootstrap, you might also be interested in Chronicle-Map, which stores Strings in UTF-8 by default (so you don't need to bother with String <-> byte[] conversions as with Koloboke/Trove). Chronicle-Map is very fast for a persisted key-value store, but a bit slower than Koloboke and even Trove.

Just wanted to reference some more open source options that became available over time since this question was first asked.
Apache 2, BTree, Apache Directory Project JDBM replacement effort:
http://directory.apache.org/mavibot/
MPL2/EPL1, RTree, MVStore, H2 Storage Engine:
http://www.h2database.com/html/mvstore.html
Apache 2, Xodus Environments, JetBrains YouTrack and Hub storage engine:
https://github.com/JetBrains/xodus

The map is Map<Integer,String>, and so I'm considering using GNU Trove TIntObjectHashMap<byte[]>, and storing the map value as an ascii byte array rather than String.
This doesn't entirely make sense because a TIntObjectHashMap is not a Map. However, the approach is sound.
Do you know what kind of space savings I can expect over HashMap for Trove?
The best answer is to try it out.
But here are some rough estimates (assuming a 32-bit JVM):
HashMap keys would need to be Integer instances. They will occupy ~18 bytes per instance + 4 bytes per reference. Total 24 bytes.
Trove keys would be 4 byte int values.
String values would be 20 bytes + 12 bytes + 2 * number of "characters".
Byte array values would be 12 bytes + 1 * number of "characters".
I haven't examined the details of the respective hash table internal data structures.
That probably amounts to around 50% memory saving, though it depends critically on the average length of the value "strings". (The longer they are, the more they will dominate the space usage.)
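To see how the value length shifts the balance, the per-entry figures quoted above can be plugged into a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are invented, the constants are the answer's rough 32-bit estimates, and the hash tables' own internal overhead is deliberately left out, as in the estimate itself.

```java
// Rough per-entry estimate using the figures quoted in the answer (32-bit JVM).
// Hash table internals are NOT included, so real numbers will be higher.
class MapMemoryEstimate {

    // HashMap<Integer, String>: Integer key (24 bytes) + String value (20 + 12 + 2 * chars).
    static long hashMapEntryBytes(int chars) {
        long key = 24;
        long value = 20 + 12 + 2L * chars;
        return key + value;
    }

    // Trove map with int keys and byte[] values: int key (4 bytes) + byte[] (12 + chars).
    static long troveEntryBytes(int chars) {
        long key = 4;
        long value = 12 + chars;
        return key + value;
    }

    public static void main(String[] args) {
        for (int chars : new int[] {10, 20, 100}) {
            long hashMap = hashMapEntryBytes(chars);
            long trove = troveEntryBytes(chars);
            double saving = 100.0 * (hashMap - trove) / hashMap;
            System.out.printf("%d chars: HashMap ~%d bytes, Trove ~%d bytes, saving ~%.0f%%%n",
                    chars, hashMap, trove, saving);
        }
    }
}
```

Multiplying the per-entry numbers by the ~20 million records from the question gives a first idea of the total, but measuring a real run remains the only reliable answer.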
FWIW, Trove publish their own benchmarks here. They don't look very convincing, but you should be able to dig out their benchmark code and modify it to better match your use-case.

You can use Xodus KV.
That is a key-value store used in production by JetBrains in the YouTrack product.
It provides snapshot isolation with readers not competing with writers.
JetBrains actively supports it.
Xodus also has an entity store solution which, along with ORM implemented in Kotlin, can be used as a primary database in your project.
There are plans to implement SQL language, which will allow using Xodus as the primary database for projects written in other languages.

Related

Pure Java alternative to database / cache for storing records

I have created an application sold to customers, some of which are hardware manufacturers with fixed constraints (slow CPU). The app has to be in java, so that it can be easily installed as a single package.
The application is multithreaded and maintains audio records. In this particular case all we have is INSERT SOMEDATA FOR RECORD, each record representing an audio file (and this can be done by different threads), and then later on we have SELECT SOMEDATA WHERE IDS in (x, y, z) by a single thread; the third step is that we DELETE all the data in this table.
The main constraint is CPU: a slow, single CPU. Memory is also a constraint, but only in that the application is designed to process an unlimited number of files, so even with lots of memory it would eventually run out if everything were stored in memory rather than on disk.
In my Java application I started off using the H2 database to store all my data. But the software has to run on some slow single cpu servers so I want to reduce the cpu cycles used, and one area I want to look again is the database.
In many cases I am inserting data into the database simply to keep it off the heap, because otherwise I would run out of memory; later on we retrieve the data, and we never have to UPDATE it.
So I considered using a cache like ehCache but that has two problems:
It doesn't guarantee the data will not be thrown away (If the cache gets full)
I can only retrieve records one at a time, whereas with a relational database I can retrieve a batch of records; this looks like a potential bottleneck.
What is an alternative that solves these issues?
You want to retrieve records in batches quickly and not lose any data, but you don't need optimized queries or updates, and you want to use CPU and memory resources as effectively as possible:
Why don't you simply store your records in a file? The operating system uses any free memory for caching. So when you access your file frequently, the OS will do its best to keep as much content as possible in memory. The OS does this job anyway, so this type of caching costs you no additional CPU and no single line of code.
The only scenarios where it could make sense to invest more in optimization would be:
a) Your process or other processes make heavy use of the file system and pollute the file cache
b) Serialization / deserialization is too expensive
In case of a):
Define your priorities. An explicit cache (in heap or off-heap) can help you to keep some content of selected files in memory. But this memory will not be available anymore for the OS's file cache. So while you speed up access to one file, you potentially slow down access to other files.
In case of b):
Measure performance first, before you optimize anything. Usually disk access is the bottleneck, and that's something you cannot change without replacing hardware. If you still want to optimize (e.g. because GC eats up CPU due to a very high number of temporarily created objects; I guess that with only one core the serial GC will be in use), then I suggest taking a closer look at Google FlatBuffers.
You started with the most complex solution for your problem, a database. I suggest starting at the other end of the spectrum and keeping it as simple as possible.
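As an illustration of the "simply store your records in a file" idea, here is a minimal sketch of an append-only record file using only the JDK. The class name `RecordLog` and the record layout (id, length, payload) are assumptions made for this example, not something from the question.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical append-only log: each record is [int id][int length][payload bytes].
class RecordLog {
    private final Path file;

    RecordLog(Path file) {
        this.file = file;
    }

    // synchronized so that several writer threads can append safely.
    synchronized void append(int id, byte[] payload) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                Files.newOutputStream(file, StandardOpenOption.CREATE, StandardOpenOption.APPEND)))) {
            out.writeInt(id);
            out.writeInt(payload.length);
            out.write(payload);
        }
    }

    // Reads the whole file sequentially; the OS file cache does the caching.
    synchronized Map<Integer, byte[]> readAll() throws IOException {
        Map<Integer, byte[]> records = new LinkedHashMap<>();
        if (!Files.exists(file)) {
            return records;
        }
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            while (true) {
                int id;
                try {
                    id = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // clean end of the log
                }
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                records.put(id, payload);
            }
        }
        return records;
    }
}
```

Opening the file on every `append` keeps the sketch short; a long-running writer would more likely keep one stream open. Deleting all data afterwards is just a matter of deleting the file.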
UPDATE:
The question has been edited in the meanwhile and requirements have changed. A new requirement is now that it has to be possible to read selected records by IDs.
Possible extensions:
Store each record in an own file and use the key as file name
Store all records in one file and use a file-based HashMap implementation like MapDB's HTreeMap implementation.
Independent from the chosen extension, the operating system's file cache will do its best to hold as much content as possible in main memory.
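The first extension (one file per record, key as file name) could look roughly like the sketch below. The class name `FilePerRecordStore` and the `.rec` suffix are made up for illustration; the batch lookup mirrors the `SELECT ... WHERE IDS in (x, y, z)` step from the question.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical store: one file per record, named after the record id.
class FilePerRecordStore {
    private final Path dir;

    FilePerRecordStore(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    private Path fileFor(int id) {
        return dir.resolve(id + ".rec");
    }

    // Different ids map to different files, so writers for different ids do not collide.
    void put(int id, byte[] data) throws IOException {
        Files.write(fileFor(id), data);
    }

    // Batch lookup by id; ids without a file are simply absent from the result.
    Map<Integer, byte[]> getAll(Collection<Integer> ids) throws IOException {
        Map<Integer, byte[]> found = new HashMap<>();
        for (int id : ids) {
            Path p = fileFor(id);
            if (Files.exists(p)) {
                found.put(id, Files.readAllBytes(p));
            }
        }
        return found;
    }

    // The "DELETE all the data" step: remove every record file.
    void clear() throws IOException {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.rec")) {
            for (Path p : files) {
                Files.delete(p);
            }
        }
    }
}
```

With very many small records, a single directory full of tiny files can itself become slow on some file systems, which is where the second extension (one file plus a file-based map) becomes more attractive.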
Some ideas that can help:
You say that you're running on a single CPU and want to find a substitute for H2. So the claim is that H2 "consumes" a lot of CPU power and the application is "slow". But what if that is because of a slow disk rather than the CPU? After all, databases store their data on disk, and disks can be slow. To check this theory, map the disk to a RAM-backed drive (on Linux this is an easy task) and measure again with the same CPU.
If you come to the conclusion that H2 is indeed CPU intensive for your use cases, it may be worth investing some time in optimizing the queries; this is much cheaper than replacing the database.
Now, if you can't stay with H2, consider Lucene, which is really optimized for this "append-only" use case (I understand that you have an "append-only" flow because you said "later on we retrieve the data, we never have to UPDATE the data"). Having said that, Lucene should also have its own threads that handle indexing, so some CPU overhead is expected anyway. However, chances are that Lucene will be faster for this use case. The price is that you won't get "easy" queries, because Lucene doesn't implement the relational model (which may partly be why it is faster); in particular you won't have JOINs or transaction management. It's possible to query by conditions on a single table as in an RDBMS; you don't have to get "top hits" as you describe.
From your question and the comments made on Mark Bramnik's answer I understood this:
CPU constraint: very slow cpu, solution should not be cpu intensive
Memory constraint: Not all data can be in memory
Disk constraint: very slow disk, solution should not read/write lots of data from disk
These are very strict constraints. Usually you trade CPU against memory, or memory against disk. In your case all of them are constrained. You mentioned you looked at ehCache; however, I think this solution (and possibly others such as memcached) is no more lightweight than H2.
One solution you could try is MappedByteBuffer. This class makes it possible to have parts of a file in memory and will swap those parts when needed. But this comes at a cost, it is not an easy beast to tame. You will need to write your own algorithm to locate the data you need. Please consider how much time it will take you to get it working vs the additional cost of a bigger machine. Sometimes better hardware is the solution.
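To give an idea of what "write your own algorithm to locate the data" means in the simplest case, here is a sketch that maps a file and addresses fixed-size records by index. The class name `MappedRecords` and the 64-byte record size are assumptions for this example; variable-length records would need an extra index, which is where the real work starts.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical fixed-size record file accessed through a MappedByteBuffer.
// Not thread-safe: callers must synchronize externally.
class MappedRecords implements AutoCloseable {
    static final int RECORD_SIZE = 64; // assumption: every record is exactly 64 bytes

    private final FileChannel channel;
    private final MappedByteBuffer buffer;

    MappedRecords(Path file, int maxRecords) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        // Mapping READ_WRITE beyond the current size grows the file to that size.
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) RECORD_SIZE * maxRecords);
    }

    void write(int index, byte[] record) {
        if (record.length != RECORD_SIZE) {
            throw new IllegalArgumentException("record must be " + RECORD_SIZE + " bytes");
        }
        buffer.position(index * RECORD_SIZE); // locate the record by simple arithmetic
        buffer.put(record);
    }

    byte[] read(int index) {
        byte[] record = new byte[RECORD_SIZE];
        buffer.position(index * RECORD_SIZE);
        buffer.get(record);
        return record;
    }

    @Override
    public void close() throws IOException {
        buffer.force(); // flush changes to disk
        channel.close();
    }
}
```

A single mapping is limited to 2GB, so an "unlimited number of files" would require several mappings or files; the operating system decides which pages of the mapping actually stay in RAM.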
Relational databases like Oracle are decades old (41 years), can you imagine how many CPU cycles were available back then? Based on research from 1970 and well understood by professionals, tested, documented, reliable, consistent (checksums), maintainable (backups with zero data loss), performant if used correctly (all kinds of indexes), accessible securely over the network, scalable, etc but apparently Not Invented Here.
Nowadays there are even many free Open Source databases like PostgreSQL that have very modest requirements, offer the potential to easily implement new requirements in the future (which are hard to predict), and are, with some effort, interchangeable with other databases (JDBC, JPA).
Yes, there is some overhead, but typically hardware is cheaper than changing your architecture late in the project, and CPU cycles are not an expensive resource anymore (think Raspberry Pi, smartphones, etc.).

Storing in Hashtables

I am working on an application that might potentially get thousands and thousands of messages (perhaps millions). And I want to store these messages in a hashtable for easy lookup since each message has an id. Is this a good idea? If not, what's the best data structure or way to go about this? Thank you.
Is this a good idea?
Keeping an unbounded amount of data in an in-memory data structure is a bad idea. You will eventually run out of memory, and your application will crash.
If you are able to discard old "messages" so that you can place a reasonable bound on the amount of memory the application needs, then this could be a viable solution.
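If discarding old messages is acceptable, one simple way to bound the map with the JDK alone is a `LinkedHashMap` that evicts its oldest entry once a limit is reached. This is only a sketch: the class name `BoundedMessageStore` is invented, and "oldest inserted goes first" is an assumed policy, not a requirement from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded store: keeps at most maxEntries messages,
// dropping the oldest inserted one when the limit is exceeded.
class BoundedMessageStore<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedMessageStore(int maxEntries) {
        super(16, 0.75f, false); // false = insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedMessageStore<Integer, String> store = new BoundedMessageStore<>(2);
        store.put(1, "first");
        store.put(2, "second");
        store.put(3, "third"); // pushes out message 1
        System.out.println(store.keySet());
    }
}
```

`LinkedHashMap` is not thread-safe, so a multi-threaded application would wrap it with `Collections.synchronizedMap` or guard it with its own lock.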
However, as the comments point out, there are other solutions (distributed memory caches, SQL databases, NoSQL databases, etcetera) that could well be better, depending on how much data there is and how fast access really needs to be.
Using a Map (data is stored in main memory) is simple, but it should be the least preferred and least realistic option, as you need to implement/reinvent the logic for data expiration, clustering, etc. yourself.
Using a caching framework (data is stored in main memory) can be chosen only if you have an idea of how large the data is and how long it needs to reside in the cache (i.e., when the data can expire and be removed). This option limits the data size to the maximum size of the JVM heap.
Using a database (data is stored on disk) is the ideal option for holding millions of records, but it comes at a cost, as disk operations take more time than in-memory operations.

H2 performance recommendations

I'm currently working with a somewhat larger database, and though I have no specific issues, I would like some recommendations, if anyone has any.
The database is 2.2 gigabyte (after recreation/compacting). It contains about 50 tables. One of those tables contains a blob plus some metadata. It currently has about 22000 rows. If I remove the blobs from the table (UPDATE table SET blob = null), the database size is reduced to about 200 megabyte (after recreation/compacting). The metadata is accessed a lot, the blobs however are not that often needed.
The database URL I currently use is:
jdbc:h2:D:/data;AUTO_SERVER=true;MVCC=true;CACHE_SIZE=524288
It runs in our Java VM which has 4GB max heap.
Some things I was wondering:
Would running H2 in a separate process have any impact on performance (for better or for worse)?
Would it help to have the blobs in a separate table with a 1-1 relation to the metadata? I could imagine it would help with the caching, not having the blobs in the way?
The internet seems divided on whether to include blobs in a database or write them to files on a filesystem with a link in the DB. Any H2-specific advice here?
The answer for you depends on the growth rate of your blob data. If, for example, your data set is going to grow at 10% per week, then there is little point in trying to extend the use of H2 to store blob data (as it will quickly outpace the available heap memory). If instead the blob data is the biggest it will ever be, then attempting to use H2 might make sense.
To answer your questions about H2:
1) Running H2 in a separate process will allow H2 to claim the majority of heap space, which makes controlling the heap available to H2 much more manageable. However, you'll also be adding the maintenance overhead of having a separate process to maintain and monitor. So the answer is "it depends on your operating environment and goals". If you have the people and time, running H2 in a separate process might make sense. But if that's true, then you should probably consider just running an appropriate blob storage platform instead.
2) Yes, you're correct that storing the blobs in a separate table would help with caching - in the case that you don't often need the blobs. It should also help with retrieval times, as H2 won't have to read past the blobs to find the metadata.
3) Note that "the internet" represents many thousands of people with almost as many different specific use cases. You'll need to filter down your use case into requirements, and then apply the logic you glean from others.
4) My personal advice is, if you're trying to make a scalable and maintainable platform - use the right tools. H2, or any other relational database, is most often not the right tool for storing many large blobs. I'd recommend that you investigate using a key/value store.

Keeping data in database or in session

I'm in the early stages of doing a web project which will require working with arrays containing around 500 elements of custom object type. Objects will likely contain between 10 and 40 fields (based on user input), mostly booleans, strings and floats. I'm gonna use PHP for this project, but I'm also interested to know how to treat this problem in Java.
I know that "premature optimization is the root of all evil", but I think I need to decide now, how do I handle those arrays. Do I keep them in the Session object or do I store them in the database (mySQL) and keep just a minimum amount of keys in the session. Keeping data in the session would make application work faster, but when visitor numbers start growing I risk using up too much memory. On the other hand reading and writing from and into database all the time will degrade performance.
I'd like to know where the line is between those two approaches. How do I decide when it's too much data to keep inside session?
When I face a problem like this I try to estimate the size of per user data that I want to keep fast.
In your case, suppose for example that you have 500 elements with 40 fields each, and each field takes 50 bytes (averaging over texts, numbers, dates, etc.). That means keeping about 1MB per user in memory for this storage, or about 1GB for every 1000 users for this cache alone.
Depending on your server's resources you can look for bottlenecks: 1000 users also consume CPU, memory, DB and disk accesses, so is the 1GB really the problem in this scenario? If yes, keep the data in the DB; if not, keep it in memory.
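The estimate is easy to redo with your own numbers. A tiny sketch (the class and method names are invented; the 50 bytes per field is the rough average assumed above, and JVM object overhead is ignored):

```java
// Back-of-the-envelope session size estimate; ignores per-object JVM overhead.
class SessionMemoryEstimate {

    static long bytesPerUser(int elements, int fieldsPerElement, int avgFieldBytes) {
        return (long) elements * fieldsPerElement * avgFieldBytes;
    }

    static long bytesForUsers(long bytesPerUser, int users) {
        return bytesPerUser * users;
    }

    public static void main(String[] args) {
        long perUser = bytesPerUser(500, 40, 50);    // 1,000,000 bytes, about 1MB
        long total = bytesForUsers(perUser, 1000);   // 1,000,000,000 bytes, about 1GB
        System.out.println(perUser + " bytes per user, " + total + " bytes for 1000 users");
    }
}
```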
Another option is to use an in-memory DB or a distributed cache solution that does it all for you, at some cost:
architectural complexity
possibly licence costs
I would be surprised if you had that amount of unique data for each user. Ideally, some of this data would be shared across users, and you could have some kind of application-level cache that stores the most recently used entries, and transparently fetches them from the database if they're missing.
This kind of design is relatively straightforward to implement in Java, but somewhat more involved (and possibly less efficient) with PHP since it doesn't have built-in support for application state.
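On the Java side, such an application-level cache can be sketched with a `LinkedHashMap` in access order plus a loader function that stands in for the database query. The class name `ReadThroughCache` and the loader are assumptions for illustration; a real application would probably use an existing caching library instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache: keeps the most recently used entries
// and falls back to a loader (e.g. a database query) on a miss.
class ReadThroughCache<K, V> {
    private final Function<K, V> loader;
    private final Map<K, V> cache;

    ReadThroughCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true: iteration order goes from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict the least recently used entry
            }
        };
    }

    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key); // transparent fetch on a miss
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Usage would be along the lines of `new ReadThroughCache<Integer, String>(1000, id -> loadFromDatabase(id))`, where `loadFromDatabase` is a placeholder for whatever query the application already has.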

When is BIG, big enough for a database?

I'm developing a Java application that has performance at its core.
I have a list of some 40,000 "final" objects,
i.e., I have an initialization input data of 40,000 vectors.
This data is unchanged throughout the program's run.
I am always performing lookups against a single ID property to retrieve the proper vectors.
Currently I am using a HashMap over a sub-sample of 1,000 vectors, but I'm not sure it will scale to production.
When is BIG, actually big enough for a use of DB?
One more thing, an SQLite DB is a viable option as no concurrency is involved,
so I guess the "threshold" for db use, is perhaps lower.
I think you're asking whether a HashMap with 40,000 entries in will be okay. The answer is yes - unless you really don't have enough memory, that should be absolutely fine. If you're writing a performance-sensitive app, then putting a large amount of fast memory in the machine running the app is likely to be an efficient way of boosting performance anyway.
There won't be very much overhead for each HashMap entry, so if you've got enough space to store the objects themselves in memory, it's unlikely that the overhead of the map would cause a problem.
Is there any reason why you can't just test this with a reasonable amount of data?
If you really have no more requirements than:
Read data at start-up
Put data in a map by a single ID (no need for joins, queries against different fields, substring matches etc)
Fetch data from map
... then using a full-blown database would be a huge amount of overkill, IMO.
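Testing this with a realistic amount of data takes only a few lines. The sketch below fills a `HashMap` with 40,000 dummy vectors and looks one up; the class name, the vector dimension, and the placeholder contents are all assumptions, since the question doesn't say what the vectors look like.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for "read data at start-up, put it in a map by ID, fetch from map".
class VectorLookup {

    static Map<Integer, double[]> load(int count, int dimensions) {
        // Pre-size the map so it never has to rehash while loading.
        Map<Integer, double[]> byId = new HashMap<>((int) (count / 0.75f) + 1);
        for (int id = 0; id < count; id++) {
            double[] vector = new double[dimensions];
            vector[0] = id; // placeholder data; real code would read the input file here
            byId.put(id, vector);
        }
        return byId;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Map<Integer, double[]> vectors = load(40_000, 8);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("Loaded " + vectors.size() + " vectors in " + elapsedMicros + " us");
        System.out.println("First component of vector 123: " + vectors.get(123)[0]);
    }
}
```

Raising `count` and watching heap usage is a quick way to see where the in-memory approach stops being comfortable on a given machine.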
As long as you're loading the data set into memory at the beginning of the program and keeping it there, and you don't have any complex queries, some sort of serialization/deserialization seems more feasible than a full-blown database.
You could start a DB with as few as 100 records (or fewer). There is no general rule for when the amount of data is large enough to store in a database. It is more a matter of whether storing this data in a database gives you any benefit (performance boost, easier programming, more flexible options for your users).
When the benefits are greater than the cost of implementation put it in a database.
There is no set size for a Collection vs a Database. It highly depends on what you want to do with the data. Size is less important.
You can have a Map with a billion entries.
There's no such thing as 'big enough for a database'. The question is whether there are enough advantages in using a database to overcome the costs.
Having said that, 40,000 isn't 'big' ;-) Unless the objects are huge or you have complex query requirements I would start with an in-memory implementation. But if you expect to scale this number up over time it might be better to use the database from the beginning.
One option that you might want to consider is the Oracle Berkeley DB Java Edition library. It's a simple JAR file that can read/write data to persistent storage. Because of its small footprint and ease of use, it's used for applications running on small to very large data sets. It's designed to be linked into the application, so that it's embedded and doesn't require complex client/server installation or protocol stacks.
What's even better is that it's extremely scalable (which works well if you end up with larger data sets than you expect), is very fast, and supports both a Java Collections API and a Direct Persistence Layer API (POJO-like). So you can use it seamlessly with Java Collections.
Berkeley DB Java Edition was designed specifically with Java application developers in mind. It's designed to be simple to use, light weight in terms of resources required, but very fast, scalable and reliable.
You can find more information about Oracle Berkeley DB Java Edition here
Regards,
Dave
