File backed Java map

Is there a simple way of having a file backed Map?
The contents of the map are updated regularly, with some being deleted, as well as some being added. To keep the data that is in the map safe, persistence is needed. I understand a database would be ideal, but sadly due to constraints a database can't be used.
I have tried:
Writing the whole contents of the map to file each time it gets updated. This worked, but obviously has the drawback that the whole file is rewritten each time, the contents of the map are expected to be anywhere from a couple of entries to ~2000. There are also some concurrency issues (i.e. writing out of order results in loss of data).
Using a RandomAccessFile and keeping a pointer to each entry's start byte so that each entry can be looked up using seek(). Again this had a similar issue to before: changing an entry would mean updating the offsets of every entry after it.
Ideally, the solution would involve some sort of caching, so that only the most recently accessed entries are kept in memory.
Is there such a thing? Or is it available via a third party jar? Someone suggested Oracle Coherence, but I can't seem to find much on how to implement that, and it seems a bit like using a sledgehammer to crack a nut.

You could look into MapDB which was created with this purpose in mind.
MapDB provides concurrent Maps, Sets and Queues backed by disk storage
or off-heap-memory. It is a fast and easy to use embedded Java
database engine.
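A minimal sketch of what that looks like, assuming the MapDB 3.x API (file name and map name here are just illustrative):
import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;
import java.util.concurrent.ConcurrentMap;

// open (or create) a file-backed store; transactionEnable() adds a write-ahead log for crash safety
DB db = DBMaker.fileDB("entries.db").transactionEnable().make();
ConcurrentMap<String, String> map = db
        .hashMap("entries", Serializer.STRING, Serializer.STRING)
        .createOrOpen();

map.put("someKey", "someValue");   // used like any ConcurrentMap
db.commit();                       // flush changes to disk
db.close();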

Yes, Oracle Coherence can do all of that, but it may be overkill if that's all you're doing.
One way to do this is to "overflow" from RAM to disk:
BinaryStore diskstore = new BerkeleyDBBinaryStore("mydb", ...);
SimpleSerializationMap mapDisk = new SimpleSerializationMap(diskstore);
LocalCache mapRAM = new LocalCache(100 * 1024 * 1024); // 100MB in RAM
OverflowMap cache = new OverflowMap(mapRAM, mapDisk);
Starting in version 3.7, you can also transparently overflow from RAM journal to flash journal. While you can configure it in code (as per above), it's generally just a line or two of config and then you ask for the cache to be configured on your behalf, e.g.
// simplest example; you'd probably use a builder pattern or a configurable cache factory
NamedCache cache = CacheFactory.getCache("mycache");
For more information, see the doc available from http://coherence.oracle.com/
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

jdbm2 looks promising, never used it but it seems to be a candidate to meet your requirements:
JDBM2 provides HashMap and TreeMap which are backed by disk storage. It is a very easy and fast way to persist your data. JDBM2 also has minimal hardware requirements and is highly embeddable (the jar is only 145 KB).
You'll find many more solutions if you look for key/value stores.

Related

Creating a very, very large Map in Java

Using Java I would like to create a Map that can grow and grow and potentially be larger than the size of the memory available. Now obviously using a standard POJO HashMap we're going to run out of memory and the JVM will crash. So I was thinking along the lines of a Map that if it becomes aware of memory running low, it can write the current contents to disk.
Has anyone implemented anything like this or knows of any existing solutions out there?
What I'm trying to do is read a very large ASCII file (say 50Gb) a line at a time. Each line contains a key and a value. Keys can be duplicated in the file. I'll then store each line in a Map, which is Keys to a List of values. This Map is the object that will just grow and grow.
Any advice greatly appreciated.
Phil
Update:
Thanks for all the comments and advice everyone. With the problem that I described, a database is the correct, scalable solution. I should have stated that this is a temporary Map that needs to be created and used for a short period of time to aid in the parsing of a file. In this case, Michael's suggestion to "store only the line number instead of the actual value" is the most appropriate. Marking Michael's answer(s) as the recommended solution.
I think you are looking for a database.
A NoSQL database will probably be easy to set up, and it is more akin to a map.
Check BerkeleyDB Java edition, now from Oracle.
It has a map-like interface and can be embedded, so no complex setup is needed.
Sounds like dumping your huge file into a DB.
Well, I had a similar situation. But in my case everything was in TXT format and every line throughout the file had the same format, so what I did was split the file into several pieces (each of a size my JVM could handle), then processed the files one by one.
Alternatively, you can load your data directly into a database.
Seriously, choose a simple database as advised. It's not overhead — you don't have to use JPA or whatnot, just plain JDBC with native SQL. Derby or HSQL, for example, can run in embedded mode, no need to define users, access rights, start the server separately.
The "overhead" will stab you in the back when you've plodden far into the hash map solution and it turns out that you need yet another optimization to avoid the OutOfMemoryException, or the file is not 50 GB, but 75... Really, don't go there.
If you're just wanting to build up the map for data processing (rather than random access in response to requests), then MapReduce may be what you want, with no need to work with a database.
Edit: Note that although many MapReduce introductions focus on the ability to run many nodes, you should still get benefit from sidestepping the requirement to hold all the data in memory on one machine.
How much memory do you have? Unless you have enough memory to keep most of the data in memory, it's going to be so slow it may as well have failed. A program which is heavily paging can be 1000x slower or more. Some PCs have 16-24 GB and you might consider getting more memory.
Let's assume there are enough duplicates that you can keep most of the data in memory. I suggest you use a byte-based String class of your own making, since you have ASCII data, and store your values as another of these "String" types (with a separator). You may find you can keep the working data set in memory.
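A sketch of that byte-based String idea (my own illustration, assuming keys and values are pure ASCII); it roughly halves the memory per character compared with Java's UTF-16 char arrays:
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// compact ASCII-backed string, usable as a HashMap key
final class AsciiString {
    private final byte[] bytes;

    AsciiString(String s) {
        this.bytes = s.getBytes(StandardCharsets.US_ASCII);  // 1 byte per char instead of 2
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof AsciiString && Arrays.equals(bytes, ((AsciiString) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}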
I use BerkeleyDB for this, though it is more complicated than a Map (they do have a Map wrapper, which I don't really recommend for anything but simple applications).
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It is also available in Maven http://www.oracle.com/technetwork/database/berkeleydb/downloads/maven-087630.html
<dependencies>
    <dependency>
        <groupId>com.sleepycat</groupId>
        <artifactId>je</artifactId>
        <version>3.3.75</version>
    </dependency>
</dependencies>
<repositories>
    <repository>
        <id>oracleReleases</id>
        <name>Oracle Released Java Packages</name>
        <url>http://download.oracle.com/maven</url>
        <layout>default</layout>
    </repository>
</repositories>
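A rough sketch of the low-level JE API (the directory, database name, and keys are illustrative):
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import java.io.File;
import java.nio.charset.StandardCharsets;

EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
Environment env = new Environment(new File("bdb-data"), envConfig);   // directory must exist

DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setAllowCreate(true);
Database db = env.openDatabase(null, "myMap", dbConfig);

DatabaseEntry key = new DatabaseEntry("someKey".getBytes(StandardCharsets.UTF_8));
db.put(null, key, new DatabaseEntry("someValue".getBytes(StandardCharsets.UTF_8)));

DatabaseEntry found = new DatabaseEntry();
if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
    String value = new String(found.getData(), StandardCharsets.UTF_8);
}

db.close();
env.close();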
It also has the disadvantage of vendor lock-in (i.e. you are tied to this tool, though there may be Map wrappers for other databases as well).
So just choose according to your needs.
Most cache APIs work like maps and support overflow to disk. Ehcache, for example, supports that. Or follow this tutorial for Guava.

Bitcask ok for simple and high performant file store?

I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of XML files in a batch process. XML files may be up to a few MB in size, most in the 100 KB range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem like overkill for this:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no way to access a Bitcask repo via Java (or is there?)
So my question boils down to:
is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents
are there any viable alternatives to Bitcask available via Java? (BerkeleyDB comes to mind...)
(for riak specialists) Is Riak much overhead implementation/management/resource wise compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.

Key-Value Database with Java client

I basically want to store a hashtable on disk so I can query it later. My program is written in Java.
The hashtable maps from String to List.
There are a lot of key-value stores out there, but after doing a lot of research/reading, it's not clear which one is the best for my purposes. Here are some things that are important to me.
Simple key-value store which allows you to retrieve a value with a single key.
Good Java client that is documented well.
Dataset is small and there is no need for advanced features. Again, I want it to be simple.
I have looked into Redis and MongoDB. Both look promising but not ideal for my purposes.
Any info would be appreciated.
If your dataset is small and you want it to be SIMPLE, why don't you serialize your hashmap to a file or an RDBMS and load it in your application?
How do you want to "query" your hashmap? Key approximation? Value 'likeness'? I don't know; it seems overkill to me to maintain a key-value store just for the sake of it.
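If plain serialization is enough, a minimal sketch (the file name is illustrative):
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

Map<String, List<String>> table = new HashMap<>();
table.put("someKey", Arrays.asList("a", "b", "c"));

// write the whole map to disk
try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("table.ser"))) {
    out.writeObject(table);
}

// read it back on the next start-up
try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("table.ser"))) {
    @SuppressWarnings("unchecked")
    Map<String, List<String>> loaded = (Map<String, List<String>>) in.readObject();
}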
What you are looking for is a library that supports object prevalence. These libraries are designed to be simple and fast, providing a collection-like API. Below are a few such libraries that allow you to work with collections but behind the scenes use disk storage.
space4j
Advagato
Prevayler
Before providing any sort of answer, I'd start by asking myself why I need to store this hashtable on disk at all, as according to your description the data set is small, so I assume it can fit into memory. If it is just to be able to reuse this structure after restarting your application, then you can probably use any sort of format to persist it.
Second, you don't provide any reasons for Redis or MongoDB not being ideal. Based on your (short) 3 requirements, I would have said Redis is probably your best bet:
good Java clients
not only able to store lists, but also supports operations on the list values (so data is not opaque)
The only reason I could suppose for eliminating Redis is that you are looking for strict ACID characteristics. If that's what you are looking for, then you could probably take a look at BerkeleyDB JE. It has been around for a while and the documentation is good.
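For reference, storing a String-to-List mapping through a Redis client would look roughly like this (Jedis is just one possible client; the key name is illustrative):
import redis.clients.jedis.Jedis;
import java.util.List;

Jedis jedis = new Jedis("localhost");
jedis.rpush("someKey", "value1", "value2", "value3");   // append values to the list stored at the key
List<String> values = jedis.lrange("someKey", 0, -1);   // read the whole list back
jedis.close();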
Check out JDBM2 - http://code.google.com/p/jdbm2/
I worked on the JDBM 1 code base, and have been impressed with what I've seen in jdbm2
Chronicle Map should be a perfect fit. It's an embeddable key-value store written in pure Java, so it acts as the best possible "client" (though actually there is no "client" or "server"; you just open your database and have full read/update in-process access to it).
Chronicle Map resides in a single file. This file can be moved around the filesystem, and even sent to another machine with a different OS and/or architecture, and still be an openable Chronicle Map database.
To create or open a data store (if the database file is non-existent, it is created, otherwise an existing store is accessed):
ChronicleMap<String, List<Point>> map = ChronicleMap
.of(String.class, (Class<List<Point>>) (Class) List.class)
.averageKey("range")
.averageValue(asList(of(0, 0), of(1, 1)))
.entries(10_000)
.createPersistedTo(myDatabaseFile);
Then you can work with created ChronicleMap object just as with a simple HashMap, not bothering with keys and values serialization.
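For example, a short follow-up on the map created in the snippet above (the key is illustrative):
List<Point> points = map.get("range");   // plain Map operations, backed by the file
map.close();                             // ChronicleMap is Closeable; close it when done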

how to handle large lists of data

We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase memory limits, we hesitate to do so since it requires having a high allocation when most of the time it's not necessary.
We are considering using a customized java.util.List implementation to spool to disk when we hit peak loads like this, but under lighter circumstances will remain in memory.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is, essentially, a batch processor that loads in data from multiple database tables and conducts extensive business logic on it. All of the data in the list is required since aggregate operations are part of the logic.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
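A sketch of that idea with Guava's AbstractIterator; here the record source is a flat file with tab-separated key/value lines purely for illustration, since the real source in the question is a set of database tables:
import com.google.common.collect.AbstractIterator;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Iterator;
import java.util.Map;

final BufferedReader reader = new BufferedReader(new FileReader("huge-input.txt"));
Iterator<Map.Entry<String, String>> entries = new AbstractIterator<Map.Entry<String, String>>() {
    @Override
    protected Map.Entry<String, String> computeNext() {
        try {
            String line = reader.readLine();
            if (line == null) {
                reader.close();
                return endOfData();              // signals the end of the iteration
            }
            String[] parts = line.split("\t", 2);
            return new AbstractMap.SimpleEntry<>(parts[0], parts[1]);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
};
// records are only read from the source as the iterator is advanced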
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it is hard to tell from what you've described.
I'm an optimist and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3-5 days.
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.

Efficient persistent storage for simple id to table of values map for java

I need to store some data that follows the simple pattern of mapping an "id" to a full table (with multiple rows) of several columns (i.e. some integer values [u, v, w]). The size of one of these tables would be a couple of KB. Basically what I need is to store a persistent cache of some intermediary results.
This could quite easily be implemented as simple SQL, but there are a couple of problems. Namely, I need to compress the size of this structure on disk as much as possible (because of the number of values I'm storing). Also, it's not transactional; I just need to write once and simply read the contents of the entire table, so a relational DB isn't actually a very good fit.
I was wondering if anyone had any good suggestions? For some reason I can't seem to come up with something decent at the moment. Especially something with an API in Java would be nice.
This sounds like a job for.... new ObjectOutputStream(new FileOutputStream(STORAGE_DIR + "/" + key + ".dat")); !!
Seriously - the simplest method is to just create a file for each data table that you want to store, serialize the data into it, and look it up using the key as the filename when you want to read it.
On a decent file system writes can be made atomic (by writing to a temp file and then renaming it); read/write speed is measured in tens of MBit/second; lookups can be made very efficient by creating a simple directory tree like STORAGE_DIR + "/" + key.substring(0,2) + "/" + key.substring(0,4) + "/" + key, which should still be efficient with millions of entries and even more so if your file system uses indexed directories; lastly it's trivial to implement a memory-backed LRU cache on top of this for even faster retrievals.
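A bare-bones sketch of that layout (the directory name is illustrative and error handling is omitted; the temp-file-then-rename trick follows the description above):
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class FileStore {
    static final File STORAGE_DIR = new File("storage");

    // storage/<first 2 chars>/<first 4 chars>/<key>.dat (assumes keys are at least 4 chars long)
    static File fileFor(String key) {
        File dir = new File(STORAGE_DIR, key.substring(0, 2) + File.separator + key.substring(0, 4));
        dir.mkdirs();
        return new File(dir, key + ".dat");
    }

    static void store(String key, Serializable table) throws IOException {
        File target = fileFor(key);
        File tmp = new File(target.getPath() + ".tmp");
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tmp))) {
            out.writeObject(table);
        }
        tmp.renameTo(target);   // write to a temp file, then rename into place
    }

    static Object load(String key) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileFor(key)))) {
            return in.readObject();
        }
    }
}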
Regarding compression - you can use Jakarta's commons-compress to apply gzip or even bzip2 compression to the data before you store it. But this is an optimization problem, and depending on your application and available disk space you may be better off investing the CPU cycles elsewhere.
Here is a sample implementation that I made: http://geek.co.il/articles/geek-storage.zip. It uses a simple interface (which is far from clean - it's just a demonstration of the concept) that offers methods for storing and retrieving objects from a cache with a set maximum size. A cache miss is passed to a user implementation for handling, and the cache will periodically check that it doesn't exceed the storage requirements and will remove old data.
I also included a MySQL-backed implementation for completeness and a benchmark to compare the disk-based and MySQL-based implementations. On my home machine (an old Athlon 64) the disk version scores more than twice as fast as the MySQL implementation in the enclosed benchmark (9.01 seconds vs. 18.17 seconds). Even though the DB implementation could probably be tweaked for slightly better performance, I believe it demonstrates the problem well enough.
Feel free to use this as you see fit.
I'd use EHCache, it's used by Hibernate and other Java EE libraries, and is really simple and efficient:
To add a table:
List<List<Integer>> myTable = new ArrayList<>();   // fill with your rows of [u, v, w] values
cache.put(new Element("myId", myTable));
To read:
List<List<Integer>> myTable = (List<List<Integer>>) cache.get("myId").getObjectValue();
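For completeness, the cache instance used above would typically come from a CacheManager; a small sketch, assuming a cache named "tables" is defined in ehcache.xml on the classpath:
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;

CacheManager cacheManager = CacheManager.create();   // reads ehcache.xml from the classpath
Cache cache = cacheManager.getCache("tables");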
Have you looked at Berkeley DB? That sounds like it may fit the bill.
Edit:
I forgot to add you can gzip the values themselves before you store them. Then just unzip them when you retrieve them.
Apache Derby might be a good fit if you want something embedded (not a separate server).
There is a list of other options at Lightweight Data Bases in Java
It seems that key=>value databases are the thing you are searching for.
Maybe SuperCSV is the best framework for you!
If you don't want to use a relational database, you can use JAXB to store your Objects as XML files!
There is also a way with other libraries like XStream
If you prefer XML, then use JAXB or XStream. Otherwise you should have a look at CSV libraries such as SuperCSV. People who can live with serialized Java files can use the default persistence mechanism, as Guss said. Direct Java persistence may be the fastest way.
You can use JOAFIP http://joafip.sourceforge.net/
It lets you put your whole data model in a file, and you can access and update it without reloading everything into memory.
If you have a couple of KB, I don't understand why you need to "compress the size of this structure on disk as much as possible". Given that 181 MB of disk space costs 1 cent, I would suggest that anything less than this isn't worth spending too much time worrying about.
However, to answer your question, you can compress the file as you write it. As well as ObjectOutputStream, you can use XMLEncoder to serialize your map. This will be more compact than just using ObjectOutputStream, and if you decompress the file you will be able to read or edit the data.
XMLEncoder xe = new XMLEncoder(
new GZIPOutputStream(
new FileOutputStream(filename+".xml.gz")));
xe.writeObject(map);
xe.close();
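Reading it back is the mirror image (a short sketch using XMLDecoder and GZIPInputStream):
XMLDecoder xd = new XMLDecoder(
                    new GZIPInputStream(
                        new FileInputStream(filename + ".xml.gz")));
Map<?, ?> restored = (Map<?, ?>) xd.readObject();   // same map written above
xd.close();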
