I basically want to store a hashtable on disk so I can query it later. My program is written in Java.
The hashtable maps from String to List.
There are a lot of key-value stores out there, but after doing a lot of research and reading, it's not clear which one is best for my purposes. Here are some things that are important to me.
Simple key-value store which allows you to retrieve a value with a single key.
Good Java client that is documented well.
Dataset is small and there is no need for advanced features. Again, I want it to be simple.
I have looked into Redis and MongoDB. Both look promising but not ideal for my purposes.
Any info would be appreciated.
If your dataset is small and you want it to be SIMPLE, why don't you serialize your HashMap to a file or an RDBMS and load it in your application?
How do you want to "query" your hashmap? Key approximation? Value 'likeness'? I don't know; it seems overkill to me to maintain a key-value store just for the sake of it.
What you are looking for is a library that supports object prevalence. These libraries are designed to be simple and fast, providing a collection-like API. Below are a few such libraries that let you work with collections but use disk storage behind the scenes; a rough Prevayler sketch follows the list.
space4j
Advagato
Prevayler
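For illustration, here is a rough Prevayler sketch (not taken from any of these projects' docs; the class names, directory, and sample data are placeholders, and it assumes the usual Prevayler 2.x API). The idea is that the map stays in memory while every change goes through a serializable transaction that Prevayler journals to disk:

import org.prevayler.Prevayler;
import org.prevayler.PrevaylerFactory;
import org.prevayler.Transaction;
import java.io.Serializable;
import java.util.*;

// the "prevalent system": the in-memory map we want persisted
class StringListMap extends HashMap<String, List<String>> implements Serializable {}

// a journaled transaction that adds one entry
class PutEntry implements Transaction {
    private final String key;
    private final ArrayList<String> value;
    PutEntry(String key, ArrayList<String> value) { this.key = key; this.value = value; }
    public void executeOn(Object system, Date executionTime) {
        ((StringListMap) system).put(key, value);
    }
}

Prevayler prevayler = PrevaylerFactory.createPrevayler(new StringListMap(), "prevalence-dir");
prevayler.execute(new PutEntry("someKey", new ArrayList<>(Arrays.asList("a", "b"))));
StringListMap map = (StringListMap) prevayler.prevalentSystem();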
Before providing any sort of answers, I'd start by asking myself why do I need to store this hashtable on disk as according to your description the data set is small and so I assume it can fit into memory. If it is just to be able to reuse this structure after restarting your application, then you can probably use any sort of format to persist it.
Second, you don't give any reasons why Redis or MongoDB is not ideal. Based on your (short) list of three requirements, I would have said Redis is probably your best bet:
good Java clients
not only able to store lists, but it also supports operations on the list values, so the data is not opaque (see the sketch just below)
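For the Redis route, a minimal sketch with the Jedis client (just one of those Java clients; the key name and server address are placeholders):

import redis.clients.jedis.Jedis;
import java.util.List;

Jedis jedis = new Jedis("localhost", 6379);            // connect to a local Redis server
jedis.rpush("mykey", "a", "b", "c");                   // append values to the list stored under one key
List<String> values = jedis.lrange("mykey", 0, -1);    // read the whole list back
jedis.close();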
The only reason I could suppose for eliminating Redis is that you are looking for strict ACID characteristics. If that's what you are after, then you could take a look at BerkeleyDB JE. It has been around for a while and the documentation is good.
Check out JDBM2 - http://code.google.com/p/jdbm2/
I worked on the JDBM 1 code base, and have been impressed with what I've seen in jdbm2
Chronicle Map should be a perfect fit: it's an embeddable key-value store written in pure Java, so it acts as the best possible "client" (though in fact there is no "client" or "server"; you just open your database and have full read/update in-process access to it).
Chronicle Map resides in a single file. This file can be moved around the filesystem, and even sent to another machine with a different OS and/or architecture, and it will still be an openable Chronicle Map database.
To create or open a data store (if the database file is non-existent, it is created, otherwise an existing store is accessed):
ChronicleMap<String, List<Point>> map = ChronicleMap
        .of(String.class, (Class<List<Point>>) (Class) List.class)
        .averageKey("range")                          // a typical key, used to size entries
        .averageValue(asList(of(0, 0), of(1, 1)))     // a typical value (of() standing in for your Point factory)
        .entries(10_000)                              // the expected maximum number of entries
        .createPersistedTo(myDatabaseFile);           // the on-disk database file
Then you can work with the created ChronicleMap object just as with a simple HashMap, without bothering with key and value serialization.
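For example, a short usage sketch (again with of() standing in for your own Point factory; everything you put is persisted to the backing file):

map.put("range", asList(of(0, 0), of(1, 1)));   // stored straight to the memory-mapped file
List<Point> points = map.get("range");
map.close();                                    // flush and release the file when you are done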
Related
Initial problem:
I have the following issue: I am joining 2 CSVs using Java. While I can "stream" one of the CSVs (read it in, process it, and write it out line by line), the smaller one has to reside in memory (a HashMap, to be precise), as I need to look up the keys of each row of the big CSV while going through it. The problem: if the "small" CSV is too large to keep in memory, I run into OutOfMemory errors.
While I know that I could avoid these issues by just reading both CSVs into a DB and perform the join there, it is infeasible in my application to do so. Is there a Java wrapper (or some other sort of object) which would allow me to keep only the HashMap's keys in memory, and put all of its values into a temp file on disk (in a self-managed fashion)?
Update:
Following the comments from ThomasKläger and JacobG, I solved the problem in the following way:
Use a HashMap to store each row's key together with that row's start and end positions, obtained via RandomAccessFile's .getFilePointer().
While going through the large CSV, I now use the HashMap to look up the matching rows' positions, .seek(pos) to them, and read them.
This is a working solution, thanks a lot.
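For reference, a rough sketch of this approach (the file name, the key-in-the-first-column layout, and the variable names are assumptions; this version indexes only the start offset of each row):

import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

RandomAccessFile raf = new RandomAccessFile("small.csv", "r");
Map<String, Long> index = new HashMap<>();               // key -> start offset of its row

long pos = raf.getFilePointer();
String line;
while ((line = raf.readLine()) != null) {
    String key = line.substring(0, line.indexOf(','));   // assumes the key is the first column
    index.put(key, pos);
    pos = raf.getFilePointer();
}

// later, while streaming the big CSV:
Long offset = index.get(someKey);                        // someKey comes from the current big-CSV row
if (offset != null) {
    raf.seek(offset);
    String matchingRow = raf.readLine();                 // re-read the matching small-CSV row on demand
}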
According to what you describe, you need something like off-heap collections, for example the MapDB library (http://www.mapdb.org/). From its description:
MapDB provides Java Maps, Sets, Lists, Queues and other collections backed by off-heap or on-disk storage. It is a hybrid between java collection framework and embedded database engine.
I have a large amount of data I need to store in a Map<String, int...>. I need to be able to perform the following functions:
containsKey(String key)
get(String key).add(int value)
put(String key, singletonList(int value))
entrySet().iterator()
I originally was just using a HashMap<String, ArrayList<Integer>>, where singletonList is a function that creates a new ArrayList<Integer> and adds the given value to it. However, this strategy does not scale well to the amount of data I am using, as it stores everything in RAM, and my RAM is not big enough to hold all the data.
My next idea was to just dump everything into a file. However, this would mean that get, containsKey, and put would become very expensive operations, which is not at all desirable. Of course, I could keep everything sorted, but that is often difficult in a large file.
I was wondering if there is a better strategy out there.
Using an embedded database engine (key-value store), such as MapDB, is one way to go.
From MapDB's official website:
MapDB is an embedded database engine. It provides Java collections backed by a disk or in-memory database store. MapDB has excellent performance, comparable to java.util.HashMap and other collections, but is not limited by GC. It is also a very flexible engine with many storage backends, cache algorithms, and so on. And finally, MapDB is a single pure-Java 400 KB JAR that only depends on JRE 6+ or Android 2.1+.
If that works for you, you can start from here.
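If it helps, a minimal sketch with the MapDB 3.x API (the file name and map name are placeholders; int[] values are used here as a simplification of the integer lists in the question):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;
import java.util.Map;

DB db = DBMaker.fileDB("data.db").make();                     // on-disk store
Map<String, int[]> map = db
        .hashMap("ids", Serializer.STRING, Serializer.INT_ARRAY)
        .createOrOpen();

map.put("someKey", new int[]{1, 2, 3});
int[] values = map.get("someKey");
db.close();                                                   // flush and release the file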
Why don't you try ObjectOutputStream to store the map to a file? You can then use ObjectInputStream to read the map back.
I hope it helps.
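For example, a minimal sketch of that approach (the file name is a placeholder, and the map's keys and values must all be Serializable):

import java.io.*;
import java.util.List;
import java.util.Map;

// write the map to disk
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("map.ser"))) {
    out.writeObject(map);
}

// read it back later
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("map.ser"))) {
    @SuppressWarnings("unchecked")
    Map<String, List<Integer>> restored = (Map<String, List<Integer>>) in.readObject();
}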
Is there a simple way of having a file backed Map?
The contents of the map are updated regularly, with some being deleted, as well as some being added. To keep the data that is in the map safe, persistence is needed. I understand a database would be ideal, but sadly due to constraints a database can't be used.
I have tried:
Writing the whole contents of the map to a file each time it gets updated. This worked, but it obviously has the drawback that the whole file is rewritten on every update; the map is expected to hold anywhere from a couple of entries to ~2000. There were also some concurrency issues (i.e. writes happening out of order resulted in loss of data).
Using a RandomAccessFile and keeping a pointer to each entry's start byte so that each entry can be looked up using seek(). Again, this had a similar issue: changing an entry would involve updating all of the references after it.
Ideally, the solution would involve some sort of caching, so that only the most recently accessed entries are kept in memory.
Is there such a thing? Or is it available via a third party jar? Someone suggested Oracle Coherence, but I can't seem to find much on how to implement that, and it seems a bit like using a sledgehammer to crack a nut.
You could look into MapDB which was created with this purpose in mind.
MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap-memory. It is a fast and easy to use embedded Java database engine.
Yes, Oracle Coherence can do all of that, but it may be overkill if that's all you're doing.
One way to do this is to "overflow" from RAM to disk:
BinaryStore diskstore = new BerkeleyDBBinaryStore("mydb", ...);      // disk-backed store
SimpleSerializationMap mapDisk = new SimpleSerializationMap(diskstore);
LocalCache mapRAM = new LocalCache(100 * 1024 * 1024);               // 100MB in RAM
OverflowMap cache = new OverflowMap(mapRAM, mapDisk);                // RAM first, overflow to disk
Starting in version 3.7, you can also transparently overflow from RAM journal to flash journal. While you can configure it in code (as per above), it's generally just a line or two of config and then you ask for the cache to be configured on your behalf, e.g.
// simplest example; you'd probably use a builder pattern or a configurable cache factory
NamedCache cache = CacheFactory.getCache("mycache");
For more information, see the doc available from http://coherence.oracle.com/
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
jdbm2 looks promising; I've never used it, but it seems to be a candidate to meet your requirements:
JDBM2 provides HashMap and TreeMap backed by disk storage. It is a very easy and fast way to persist your data. JDBM2 also has minimal hardware requirements and is highly embeddable (the jar is only 145 KB).
You'll find many more solutions if you look for key/value stores.
I'm looking for an on-disk implementation of java.util.Map. Nothing too fancy, just something that I can point at a directory or file and have it store its contents there, in some way it chooses. Does anyone know of such a thing?
You could have a look at the Disk-Backed-map project.
A library that implements a disk backed map in Java
A small library that provides a disk-backed map implementation for storing a large number of key-value pairs. The in-memory map implementations (HashMap, Hashtable) max out around 3-4 million keys/GB of memory for very simple key/value pairs, and in most cases the limit is much lower. A disk-backed map, on the other hand, can store between 16 million (64-bit JVM) and 20 million (32-bit JVM) keys/GB, regardless of the size of the key/value pairs.
If you are looking for key-object structures to persist data, then NoSQL databases are a very good choice. You'll find that some of them, such as MongoDB or Redis, scale and perform well for big datasets, and apart from hash-based lookups they provide interesting query and transactional features.
In essence, these types of systems are Map implementations, and it shouldn't be too complicated to implement your own adapter that implements java.util.Map to bridge to them.
MapDB (mapdb.org) does exactly what you are looking for. Besides a disk-backed TreeMap and HashMap, it gives you other collection types.
Its maps are also thread-safe and have really good performance.
See Features
Chronicle Map is a modern and very fast solution to this problem. It implements the ConcurrentMap interface and persists the data to disk (under the hood, this is done by mapping Chronicle Map's memory to a file).
You could use a simple EHCache implementation. The nice thing about EHCache is that it can be very simple to implement :-)
I take it you've ruled out serialising / deserialising an actual Map instance?
This seems like a relatively new open-source solution to the problem. I've used it and like it so far:
https://github.com/jankotek/JDBM4
I need to store some data that follows the simple pattern of mapping an "id" to a full table (with multiple rows) of several columns (i.e. some integer values [u, v, w]). The size of one of these tables would be a couple of KB. Basically what I need is to store a persistent cache of some intermediary results.
This could quite easily be implemented as simple sql, but there's a couple of problems, namely I need to compress the size of this structure on disk as much as possible. (because of amount of values I'm storing) Also, it's not transactional, I just need to write once and simply read the contents of the entire table, so a relational DB isn't actually a very good fit.
I was wondering if anyone had any good suggestions? For some reason I can't seem to come up with something decent atm. Especially something with an API in java would be nice.
This sounds like a job for.... new ObjectOutputStream(new FileOutputStream(STORAGE_DIR + "/" + key + ".dat")); !!
Seriously - the simplest method is to just create a file for each data table that you want to store, serialize the data into it, and look it up using the key as the filename when you want to read it back.
On a decent file system, writes can be made atomic (by writing to a temp file and then renaming it); read/write speed is measured in tens of MBit/second; lookups can be made very efficient by creating a simple directory tree like STORAGE_DIR + "/" + key.substring(0,2) + "/" + key.substring(0,4) + "/" + key, which should still be efficient with millions of entries and even more so if your file system uses indexed directories; and lastly, it's trivial to implement a memory-backed LRU cache on top of this for even faster retrievals.
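A rough sketch of such a store (STORAGE_DIR is the base directory mentioned above; the helper name and the Serializable value type are assumptions, and the temp-file-then-rename is what makes the write atomic):

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.*;

void store(String key, Serializable table) throws IOException {
    Path dir = Paths.get(STORAGE_DIR, key.substring(0, 2), key.substring(0, 4));
    Files.createDirectories(dir);                              // build the directory tree lazily
    Path tmp = Files.createTempFile(dir, key, ".tmp");
    try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(tmp))) {
        out.writeObject(table);                                // serialize the whole table
    }
    Files.move(tmp, dir.resolve(key + ".dat"),                 // atomic rename into place
               StandardCopyOption.ATOMIC_MOVE, StandardCopyOption.REPLACE_EXISTING);
}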
Regarding compression - you can use Jakarta's commons-compress to apply gzip or even bzip2 compression to the data before you store it. But this is an optimization problem, and depending on your application and available disk space you may be better off investing the CPU cycles elsewhere.
Here is a sample implementation that I made: http://geek.co.il/articles/geek-storage.zip. It uses a simple interface (which is far from clean; it's just a demonstration of the concept) that offers methods for storing and retrieving objects from a cache with a set maximum size. A cache miss is transferred to a user implementation for handling, and the cache periodically checks that it doesn't exceed the storage requirements and removes old data.
I also included a MySQL-backed implementation for completeness, and a benchmark to compare the disk-based and MySQL-based implementations. On my home machine (an old Athlon 64), the disk implementation scores better than twice as fast as the MySQL implementation in the enclosed benchmark (9.01 seconds vs. 18.17 seconds). Even though the DB implementation could probably be tweaked for slightly better performance, I believe it demonstrates the problem well enough.
Feel free to use this as you see fit.
I'd use EHCache; it's used by Hibernate and other Java EE libraries, and it's really simple and efficient:
To add a table:
List<List<Integer>> myTable = new ArrayList<>(...);
cache.put(new Element("myId", myTable));
To read:
List<List<Integer>> myTable = (List<List<Integer>>) cache.get("myId").getObjectValue();
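For completeness, a rough sketch of obtaining the cache object used above with the Ehcache 2.x API (the cache name is a placeholder; in practice you would usually configure the cache, including its disk persistence, via ehcache.xml):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

CacheManager manager = CacheManager.create();    // default configuration
manager.addCache("myCache");                     // create a cache with the default settings
Cache cache = manager.getCache("myCache");

cache.put(new Element("myId", myTable));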
Have you looked at Berkeley DB? That sounds like it may fit the bill.
Edit:
I forgot to add that you can gzip the values themselves before you store them, then just unzip them when you retrieve them.
Apache Derby might be a good fit if you want something embedded (not a separate server).
There is a list of other options at Lightweight Data Bases in Java
It seems that key=>value databases are what you are looking for.
Maybe SuperCSV is the best framework for you!
If you don't want to use a relational database, you can use JAXB to store your objects as XML files!
There is also a way to do this with other libraries like XStream.
If you prefer XML, then use JAXB or XStream. Otherwise you should have a look at CSV libraries such as SuperCSV. People who can live with serialized Java files can use the default persistence mechanism like Guss said. Direct Java persistence may be the fastest way.
You can use JOAFIP (http://joafip.sourceforge.net/).
It lets you put your whole data model in a file and access and update it without reloading everything into memory.
If you have a couple of KB, I don't understand why you need to "compress the size of this structure on disk as much as possible" Given that 181 MB of disk space costs 1 cent, I would suggest that anything less than this isn't worth spending too much time worrying about.
However, to answer your question, you can compress the file as you write it. As well as ObjectOutputStream, you can use XMLEncoder to serialize your map. This will be more compact than just using ObjectOutputStream, and if you decompress the file you will be able to read or edit the data.
XMLEncoder xe = new XMLEncoder(
new GZIPOutputStream(
new FileOutputStream(filename+".xml.gz")));
xe.writeObject(map);
xe.close();
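And reading it back is the mirror image (java.beans.XMLDecoder with a GZIPInputStream; the same placeholder filename):

XMLDecoder xd = new XMLDecoder(
        new GZIPInputStream(
                new FileInputStream(filename + ".xml.gz")));
Map<?, ?> map = (Map<?, ?>) xd.readObject();
xd.close();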