Embeddable disk-based key-value store - java

We are working on a project that will be distributed as a single jar file. We need a key-value store with the following properties:
Embeddable into our jar file, so no additional installation.
Can hold up to tens of millions of pairs
Memory efficient. That means less than 100 MB of memory for 50M pairs
Both keys and values are of simple types: long, int, small byte[]
Free license for commercial use is a bonus
We do not need concurrency, ACID or such advanced stuff.
Amortized lookup time below 100 microseconds.
Any suggestions other than BerkeleyDB or JDBM2/3?

GNU Trove offers a number of maps (e.g. TIntIntHashMap) that are more memory-efficient than standard Java maps because they use primitive types. I doubt you can get significantly more memory-efficient than this unless you know something about what you are storing. Trove is more or less LGPL, so it's probably safe for you to use. I don't know if it specifically meets your exact specifications, but I think it's worth trying when you can fit things in RAM.
When you might need to overflow to disk, Ehcache is a good choice. You can specify that after a certain number of entries it will store values on disk (new in version 2.5: you can instead specify a limit on the amount of RAM used, if you don't know the exact number of entries).

Look at NoSQL implementations; CouchDB, Cassandra and others are pretty good.
Do a Google search to compare them and you will find what you want.
My favourite is MongoDB, but unfortunately it's not Java based.

Related

'Big dictionary' implementation in Java

I am in the middle of a Java project which will be using a 'big dictionary' of words. By 'dictionary' I mean certain numbers (int) assigned to Strings. And by 'big' I mean a file of the order of 100 MB. The first solution that I came up with is probably the simplest possible. At initialization I read in the whole file and create a large HashMap which will be later used to look strings up.
Is there an efficient way to do it without the need of reading the whole file at initialization? Perhaps not, but what if the file is really large, let's say in the order of the RAM available? So basically I'm looking for a way to look things up efficiently in a large dictionary stored in memory.
Thanks for the answers so far, as a result I've realised I could be more specific in my question. As you've probably guessed the application is to do with text mining, in particular representing text in a form of a sparse vector (although some had other inventive ideas :)). So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible. Initial overhead of 'reading' the dictionary file or indexing it into a database is not as important as long as the string look-up time is optimized. Again, let's assume that the dictionary size is big, comparable to the size of RAM available.
Consider ChronicleMap (https://github.com/OpenHFT/Chronicle-Map) in a non-replicated mode. It is an off-heap Java Map implementation, or, from another point of view, a superlightweight NoSQL key-value store.
What it offers out of the box that is useful for your task:
Persistence to disk via memory mapped files (see comment by Michał Kosmulski)
Lazy load (disk pages are loaded only on demand) -> fast startup
If your data volume is larger than available memory, operating system will unmap rarely used pages automatically.
Several JVMs can use the same map, because off-heap memory is shared on OS level. Useful if you do the processing within a map-reduce-like framework, e.g. Hadoop.
Strings are stored in UTF-8 form -> ~50% memory savings if strings are mostly ASCII (as maaartinus noted)
int or long values take just 4 (8) bytes, as if you had a primitive-specialized map implementation.
Very little per-entry memory overhead, much less than in standard HashMap and ConcurrentHashMap
Good configurable concurrency via lock striping, if you already need it or are going to parallelize text processing in the future.
Once your data structure is anywhere from a few hundred MB to the size of your RAM, you're better off not initializing a data structure at run-time, but rather using a database which supports indexing (which most do these days). Indexing is going to be one of the only ways you can ensure fast retrieval of text once your file gets that large and you're running up against the -Xmx setting of your JVM. If your file is as large as, or larger than, your maximum heap size and you try to load all of it, you're inevitably going to run out of memory and crash your JVM.
As for having to read the whole file at initialization: you're going to have to do this eventually so that you can efficiently search and analyze the text in your code. If you know that you're only going to be searching a certain portion of your file at a time, you can implement lazy loading. If not, you might as well bite the bullet and load your entire file into the DB in the beginning. You can parallelize this process if there are other parts of your code's execution that don't depend on it.
As stated in a comment, a Trie will save you a lot of memory.
You should also consider using bytes instead of chars as this saves you a factor of 2 for plain ASCII text or when using your national charset as long as it has no more than 256 different letters.
At first glance, combining this low-level optimization with tries makes no sense, as with them the node size is dominated by the pointers. But there's a way if you want to go low level.
So what is critical for usage is to be able to look strings up in the dictionary, obtain their keys as fast as possible.
Then forget any database, as they're damn slow when compared to HashMaps.
If it doesn't fit into memory, the cheapest solution is usually to get more of it. Otherwise, consider loading only the most common words and doing something slower for the others (e.g., a memory mapped file).
I was asked to point to a good trie implementation, especially off-heap. I'm not aware of any.
Assuming the OP needs no mutability, especially no mutability of keys, it all looks very simple.
I guess the whole dictionary could easily be packed into a single ByteBuffer. Assuming mostly ASCII and with some bit hacking, an arrow (an edge between two trie nodes) would need 1 byte per arrow label character and 1-5 bytes for the child pointer. The child pointer would be relative (i.e., the difference between the current node and the child), which would make most of them fit into a single byte when stored in a base 128 encoding.
I can only guess the total memory consumption, but I'd say, something like <4 bytes per word. The above compression would slow the lookup down, but still nowhere near what a single disk access needs.
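The base 128 encoding of relative child pointers mentioned above can be sketched as follows. This is only an illustration of the encoding step, not a trie implementation; the class name Varint and its methods are made up for this example. Offsets below 128 take one byte, and any non-negative int takes at most five, which matches the 1-5 byte estimate.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: base 128 (varint) encoding of non-negative offsets. */
class Varint {
    /** Writes value using 7 payload bits per byte; a set high bit means more bytes follow. */
    static void write(ByteBuffer buf, int value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative offset: " + value);
        }
        while ((value & ~0x7F) != 0) {
            buf.put((byte) ((value & 0x7F) | 0x80)); // low 7 bits plus continuation flag
            value >>>= 7;
        }
        buf.put((byte) value); // final byte, high bit clear
    }

    /** Reads a value written by write(), advancing the buffer position. */
    static int read(ByteBuffer buf) {
        int result = 0;
        int shift = 0;
        byte b;
        do {
            b = buf.get();
            result |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return result;
    }
}
```

A node would store Varint.write(buf, childPosition - nodePosition) for each arrow, assuming children are laid out after their parents so that the difference is never negative.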
It sounds too big to store in memory. Either store it in a relational database (easy, and with an index on the hash, fast), or a NoSQL solution, like Solr (small learning curve, very fast).
Although NoSQL is very fast, if you really want to tweak performance, and there are entries that are far more frequently looked up than others, consider using a limited size cache to hold the most recently used (say) 10000 lookups.
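A limited-size cache of recent lookups like the one suggested above can be built on java.util.LinkedHashMap in access order. This is a minimal sketch; the class name LruCache is chosen for this example, and the capacity (10000 in the text) is passed in by the caller.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Fixed-capacity cache that evicts the least recently used entry. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration runs from least to most recently accessed
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after every insertion; returning true drops the eldest entry
        return size() > maxEntries;
    }
}
```

A lookup would first try cache.get(word) and only fall back to the slower store on a miss, then cache.put(word, id) the result. Note that LinkedHashMap is not thread-safe, so concurrent use would need external synchronization.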

Binary Search Tree vs a MultiMap

The problem I have to solve is that I have to put IP address prefixes and the data associated with them into a tree so they can be queried later. I'm reading these addresses from a file that may contain as many as 16 million records. The file could have duplicates, and I have to store those too.
I wrote my own binary search tree, but learned that a TreeMap in Java is implemented using a Red Black tree, but a TreeMap can't contain duplicates.
I want the query to take O(log n) time.
The data structure needs to be in RAM, so I'm also not sure how I'm going to store 16 million nodes.
I wanted to ask: Would it be too much of a performance hit to use a library like Guava to insert the IPs into Multimaps? Or is there a better way to do this?
Using an existing library that is tested, documented and well maintained is usually good practice.
It will also help you learn more about guava. Once you start using it "for just one thing", you will most likely realize there is much more you can use to make your life a bit easier.
Also, an alternative is using a TreeMap<Key,List<MyClass>> rather than TreeMap<Key,MyClass> as a custom implementation of a Multimap.
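A minimal sketch of that TreeMap-of-lists idea, using only the JDK. The class name TreeMultimapSketch is invented for this example; Guava's own multimap classes are not used here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

/** Sorted multimap sketch: every key maps to a list, so duplicates are kept. */
class TreeMultimapSketch<K extends Comparable<K>, V> {
    private final TreeMap<K, List<V>> map = new TreeMap<>();

    /** O(log n) insert; creates the value list on first use of a key. */
    void put(K key, V value) {
        map.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** O(log n) lookup; returns an empty list for unknown keys. */
    List<V> get(K key) {
        List<V> values = map.get(key);
        return values == null ? Collections.<V>emptyList() : values;
    }

    int distinctKeys() {
        return map.size();
    }
}
```

Each distinct key pays for one extra ArrayList, so this costs the most memory when most keys occur only once.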
Regarding memory - you should try to minimize your data as much as possible: use efficient data structures and avoid wasteful Strings. For storing IPs, for example, there are cheaper alternatives, so exploit them.
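One cheaper alternative for IPs, assuming the addresses are IPv4 (the question does not say), is to pack each address into a single 4-byte int instead of keeping the dotted String. The class name Ipv4 and its methods are made up for this sketch.

```java
/** Hypothetical helper that stores an IPv4 address in one int. */
class Ipv4 {
    /** Packs dotted-quad text such as 192.168.0.1 into a single int. */
    static int parse(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + dotted);
        }
        int packed = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + dotted);
            }
            packed = (packed << 8) | octet; // most significant octet first
        }
        return packed;
    }

    /** Converts a packed address back to dotted-quad text. */
    static String format(int packed) {
        return ((packed >>> 24) & 0xFF) + "." + ((packed >>> 16) & 0xFF) + "."
                + ((packed >>> 8) & 0xFF) + "." + (packed & 0xFF);
    }
}
```

Because Java ints are signed, addresses from 128.0.0.0 upwards become negative; use Integer.compareUnsigned (or Integer.toUnsignedLong) when ordering them.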
Also note - the OS will be able to offer you more memory than the RAM you have, by using virtual memory (on a 64-bit machine it is most likely to be more than enough). However, it will most likely be less efficient than a data structure designed for disk (such as a B+ tree).
Alternatives:
As alternatives to the TreeMap - you might be interested in other data structures (each with its advantages and disadvantages):
hash table - implemented as HashMap in Java. Your type will then be HashMap<Key,List<Value>>. It allows O(1) average case query, but might decay to O(n) worst case. It also does not allow efficient range queries.
trie or its more space efficient version - radix tree. Allows O(1) access to each key (the cost is bounded by the key length, which is fixed for IP addresses), but is usually less space efficient than the alternatives. With this approach, you will implement the Map interface with the DS, and your type will be Map<Key,List<Value>>
B+ tree, which is much more optimized for disk - if your data is too large to fit in RAM after all.
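To make the trie alternative from the list above concrete, here is a sketch of a binary trie over IPv4 prefixes. It assumes that queries are longest-prefix lookups and that addresses are packed into ints, neither of which the question states; the class PrefixTrie and its methods are hypothetical. Duplicate prefixes are kept in a per-node list.

```java
import java.util.ArrayList;
import java.util.List;

/** Binary trie keyed by the leading bits of a packed IPv4 address. */
class PrefixTrie<V> {
    private static final class Node<T> {
        Node<T> zero;
        Node<T> one;
        List<T> values; // null unless some prefix ends at this node
    }

    private final Node<V> root = new Node<>();

    /** Stores value under the first prefixLength bits (0 to 32) of address. */
    void insert(int address, int prefixLength, V value) {
        Node<V> node = root;
        for (int i = 0; i < prefixLength; i++) {
            boolean bit = ((address >>> (31 - i)) & 1) == 1;
            if (bit) {
                if (node.one == null) node.one = new Node<>();
                node = node.one;
            } else {
                if (node.zero == null) node.zero = new Node<>();
                node = node.zero;
            }
        }
        if (node.values == null) node.values = new ArrayList<>();
        node.values.add(value); // duplicates are appended, not overwritten
    }

    /** Returns the values of the longest stored prefix covering address, or an empty list. */
    List<V> longestMatch(int address) {
        Node<V> node = root;
        List<V> best = root.values;
        for (int i = 0; i < 32 && node != null; i++) {
            node = ((address >>> (31 - i)) & 1) == 1 ? node.one : node.zero;
            if (node != null && node.values != null) best = node.values;
        }
        return best == null ? new ArrayList<>() : best;
    }
}
```

A lookup visits at most 32 nodes regardless of how many prefixes are stored. The per-node object overhead is what makes this less space efficient than a radix tree, which merges chains of single-child nodes.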

Replacing a huge dump file with an efficient lookup Java key-value text store

I have a huge dump file - 12GB of text containing millions of entries. Each entry has a numeric id, some text, and other irrelevant properties. I want to convert this file into something that will provide an efficient look-up. That is, given an id, it would return the text quickly. The limitations:
Embedded in Java, preferably without an external server or foreign language dependencies.
Reads from and writes to disk, not in-memory - I don't have 12GB of RAM.
Does not blow up too much - I don't want to turn a 12GB file into a 200GB index. I don't need full text search, sorting, or anything fancy - Just key-value lookup.
Efficient - It's a lot of data and I have just one machine, so speed is an issue. Tools that can store large batches and/or work well with several threads are preferred.
Storing more than one field is nice, but not a must. The main concern is the text.
Your recommendations are welcomed!
I would use Java Chronicle or something like it (partly because I wrote it) because it is designed to access large amounts of data (larger than your machine's memory) somewhat randomly.
It can store any number of fields in text or binary formats (or a combination if you wish). It adds 8 bytes per record you want to be able to randomly access. It doesn't support deleting records (you can mark them for reuse), but you can update and add new records.
It can only have a single writer thread, but it can be read by any number of threads on the same machine (even different processes).
It doesn't support batching but it can read/write millions of entries per second with typical sub microsecond latency (except for random reads/writes which are not in memory)
It uses next to no heap (<1 MB for TBs of data)
It uses an id which is sequential but you can build a table to do just that translation.
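A translation table of that kind could look like the sketch below: it maps the dump's own numeric ids to sequential record positions using two primitive arrays (12 bytes per entry) and a binary search. The class name IdTranslationTable is made up for this example, it is not part of Java Chronicle, and it assumes the ids are unique.

```java
import java.util.Arrays;

/** Translates arbitrary record ids into sequential record positions. */
class IdTranslationTable {
    private final long[] sortedIds;
    private final int[] positions; // positions[i] is the sequential index of sortedIds[i]

    /** externalIds[p] is the id of the record stored at sequential position p. */
    IdTranslationTable(long[] externalIds) {
        int n = externalIds.length;
        Integer[] order = new Integer[n];
        for (int p = 0; p < n; p++) {
            order[p] = p;
        }
        // Sort positions by the id found at each position
        Arrays.sort(order, (a, b) -> Long.compare(externalIds[a], externalIds[b]));
        sortedIds = new long[n];
        positions = new int[n];
        for (int i = 0; i < n; i++) {
            positions[i] = order[i];
            sortedIds[i] = externalIds[order[i]];
        }
    }

    /** Returns the sequential position for id, or -1 if the id is unknown. */
    int positionOf(long id) {
        int i = Arrays.binarySearch(sortedIds, id);
        return i >= 0 ? positions[i] : -1;
    }
}
```

The temporary Integer[] is only needed while building; for tens of millions of ids a primitive sort over packed values would avoid that boxing.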
BTW: You can buy 32 GB for less than $200. Perhaps it's time to get more memory ;)
Why not use JavaDB - the database bundled with the JDK (versions 6 through 8)?
It'll store the info on disk, and be efficient in terms of lookups, provided you index properly. It'll run in-JVM, so you don't need a separate server/service. You talk to it using standard JDBC.
I suspect it'll be pretty efficient. This database has a long history (it is Apache Derby, which originated as IBM's Cloudscape) and will have had a lot of effort expended on it in terms of robustness and efficiency.
You'll obviously need to do an initial onboarding of the data to create the database, but that's a one-off task.

Bitcask ok for simple and high performant file store?

I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of xml-files in a batch-process. XML files may be up to a few megs large, most in the 100KB-range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no means to access a bitcask repo via java (or is there?)
So my question boils down to:
Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents?
Are there any viable alternatives to Bitcask available via Java? (BerkeleyDB comes to mind...)
(for Riak specialists) Does Riak add much overhead in implementation, management and resources compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
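The model discussed in this thread (append-only writes plus an in-memory key directory) can be sketched in plain Java as below. This is not Bitcask and not a binding to it; the class AppendOnlyStore is invented for illustration, and it omits what a real store needs, such as record headers with keys and checksums, rebuilding the directory on restart, and merging.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Append-only value log with an in-memory key directory (Bitcask-style sketch). */
class AppendOnlyStore implements AutoCloseable {
    private final RandomAccessFile file;
    // key -> {offset, length} of the latest value written for that key
    private final Map<String, long[]> keyDir = new HashMap<>();

    AppendOnlyStore(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        long offset = file.length();
        file.seek(offset); // always write at the end of the log
        file.write(bytes);
        // Overwriting a key only repoints the directory; the old bytes become wasted space
        keyDir.put(key, new long[] {offset, bytes.length});
    }

    String get(String key) throws IOException {
        long[] location = keyDir.get(key);
        if (location == null) {
            return null;
        }
        byte[] bytes = new byte[(int) location[1]];
        file.seek(location[0]); // one seek per lookup
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

The sketch shows why overwrites matter: each put for an existing key leaves the previous value behind in the file, and only a merge pass that copies live values would reclaim that space.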
I am not sure on the status of a Java version/wrapper.

HashMap alternatives for memory-efficient data storage

I've currently got a spreadsheet type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal. The overhead seems to use 5x more memory than the data itself.
This question asks about efficient collections libraries, and the answer was use Google Collections. My follow up is "which part?". I've been reading through the documentation but don't feel like it gives a very good sense of which classes are a good fit for this. (I'm also open to other libraries or suggestions).
So I'm looking for something that will let me store dense spreadsheet-type data with minimal memory overhead.
My columns are currently referenced by Field objects, rows by their indexes, and values are Objects, almost always Strings.
Some columns will have a lot of repeated values.
Primary operations are to update or remove records based on values of certain fields, and also adding/removing/combining columns.
I'm aware of options like H2 and Derby but in this case I'm not looking to use an embedded database.
EDIT: If you're suggesting libraries, I'd also appreciate it if you could point me to a particular class or two in them that would apply here. Whereas Sun's documentation usually includes information about which operations are O(1), which are O(N), etc, I'm not seeing much of that in third-party libraries, nor really any description of which classes are best suited for what.
Some columns will have a lot of repeated values
immediately suggests to me the possible use of the FlyWeight pattern, regardless of the solution you choose for your collections.
Trove collections pay particular attention to the space they occupy (I think they also have tailored data structures if you stick to primitive types).
Otherwise you can try Apache collections. Just do your benchmarks!
In any case, if you've got many references to the same elements, try to design a suitable pattern (like flyweight).
Chronicle Map could have overhead of less than 20 bytes per entry (see a test proving this). For comparison, java.util.HashMap's overhead varies from 37-42 bytes with -XX:+UseCompressedOops to 58-69 bytes without compressed oops (reference).
Additionally, Chronicle Map stores keys and values off-heap, so it doesn't store Object headers, which are not accounted as HashMap's overhead above. Chronicle Map integrates with Chronicle-Values, a library for generation of flyweight implementations of interfaces, the pattern suggested by Brian Agnew in another answer.
So I'm assuming that you have a map of Map<ColumnName,Column>, where the column is actually something like ArrayList<Object>.
A few possibilities -
Are you completely sure that memory is an issue? If you're just generally worried about size it'd be worth confirming that this will really be an issue in a running program. It takes an awful lot of rows and maps to fill up a JVM.
You could test your data set with different types of maps in the collections. Depending on your data, you can also initialize maps with preset size/load factor combinations that may help. I've messed around with this in the past, you might get a 30% reduction in memory if you're lucky.
What about storing your data in a single matrix-like data structure (an existing library implementation or something like a wrapper around a List of Lists), with a single map that maps column keys to matrix columns?
Assuming all your rows have most of the same columns, you can just use an array for each row, and a Map<ColumnKey, Integer> to look up which column refers to which cell. This way you have only 4-8 bytes of overhead per cell.
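A sketch of that layout, with one Object[] per row and a single shared column-index map. The class name RowTable is hypothetical, and String column keys stand in for the Field objects from the question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Row-based table: one array per row plus one shared column-name-to-index map. */
class RowTable {
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    RowTable(String... columnNames) {
        for (int i = 0; i < columnNames.length; i++) {
            columnIndex.put(columnNames[i], i);
        }
    }

    /** Appends an empty row and returns its index. */
    int addRow() {
        rows.add(new Object[columnIndex.size()]);
        return rows.size() - 1;
    }

    void set(int row, String column, Object value) {
        rows.get(row)[indexOf(column)] = value;
    }

    Object get(int row, String column) {
        return rows.get(row)[indexOf(column)];
    }

    private int indexOf(String column) {
        Integer index = columnIndex.get(column);
        if (index == null) {
            throw new IllegalArgumentException("unknown column: " + column);
        }
        return index;
    }
}
```

The per-cell cost is one array slot holding a reference, while the map is paid for once per table rather than once per row as with an ArrayList of HashMaps.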
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.
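A String pool along those lines can be as small as a map from each distinct value to its first instance. The class name StringPool is made up for this sketch; it is an explicit pool, separate from the JVM's own String.intern().

```java
import java.util.HashMap;
import java.util.Map;

/** Returns one shared instance for each distinct String value. */
class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the pooled instance equal to s, adding s to the pool if it is new. */
    String canonical(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.putIfAbsent(s, s);
        return existing == null ? s : existing;
    }

    int size() {
        return pool.size();
    }
}
```

Passing every cell value through canonical() before storing it means a column with many repeated values holds many references to a few String objects. The pool itself keeps those Strings alive, so it should be discarded together with the data it serves.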
EDIT: You can structure your data as either row-based or column-based. If it's row-based (one array of cells per row), adding/removing a row is just a matter of adding/removing that array. If it's column-based, you can have an array per column. This can make handling primitive types much more efficient, i.e. you can have one column which is int[] and another which is double[]; it's much more common for an entire column to have the same data type than for a whole row.
However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.
(Something I do is have row-based data and add columns to the end, assuming that if a row isn't long enough, the column has a default value. This avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
Guava does include a Table interface and a hash-based implementation. Seems like a natural fit to your problem. Note that this is still marked as beta.
keeps its data in an ArrayList of HashMaps
Well, this part seems terribly inefficient to me. An empty HashMap will already allocate 16 * size of a pointer bytes (16 is the default initial capacity), plus some variables for the hash object (14 + psize). If you have a lot of sparsely filled rows, this could be a big problem.
One option would be to use a single large hash with a composite key (combining row and column). However, that doesn't make operations on whole rows very efficient.
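One way to form such a composite key, assuming rows and columns are both identified by int indexes, is to pack the two ints into one long. The class name SparseCells is invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Single hash map for all cells, keyed by row and column packed into a long. */
class SparseCells {
    private final Map<Long, Object> cells = new HashMap<>();

    /** Row goes in the high 32 bits, column in the low 32 bits. */
    static long key(int row, int column) {
        return ((long) row << 32) | (column & 0xFFFFFFFFL);
    }

    void set(int row, int column, Object value) {
        cells.put(key(row, column), value);
    }

    Object get(int row, int column) {
        return cells.get(key(row, column));
    }

    int size() {
        return cells.size();
    }
}
```

Only occupied cells cost memory, but each one still pays for a boxed Long key and a map entry, and reading a whole row means probing every column index, which is the drawback noted above.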
Also, since you don't mention the operation of adding cell, you can create hashes with only necessary inner storage (initialCapacity parameter).
I don't know much about google collections, so can't help there. Also, if you find any useful optimization, please do post here! It would be interesting to know.
I've been experimenting with using the SparseObjectMatrix2D from the Colt project. My data is pretty dense but their Matrix classes don't really offer any way to enlarge them, so I went with a sparse matrix set to the maximum size.
It seems to use roughly 10% less memory and loads about 15% faster for the same data, as well as offering some clever manipulation methods. Still interested in other options though.
From your description, it seems that instead of an ArrayList of HashMaps you rather want a (Linked)HashMap of ArrayList (each ArrayList would be a column).
I'd add a double map from field-name to column-number, and some clever getters/setters that never throw IndexOutOfBoundsException.
You can also use an ArrayList<ArrayList<Object>> (basically a jagged, dynamically growing matrix) and keep the mapping to field (column) names outside.
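Some columns will have a lot of repeated values
I doubt this matters much if they are Strings and are interned (string literals are interned automatically; strings built at run time are not unless you call intern()), because your collection would then store only references to the shared instances.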
Why don't you try using a cache implementation like EHCache?
This turned out to be very effective for me, when I hit the same situation.
You can simply store your collection within the EHcache implementation.
There are configurations like:
Maximum bytes to be used from Local heap.
Once the bytes used by your application exceed the limit configured in the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a Least Recently Used algorithm.
You can be fairly sure of avoiding out of memory errors by using this type of cache implementation.
It only increases the IO operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are a lot of settings you can tune to your requirements.
For me, Apache Commons Collections did not save any space. Here are two similar heap dumps taken just before the OOME, comparing the Java 11 HashMap to the Apache Commons HashedMap:
The Apache Commons HashedMap doesn't appear to make any meaningful difference.
