Using Java, I would like to create a Map that can grow and grow and potentially become larger than the available memory. Obviously a standard in-memory HashMap will run out of memory and the JVM will die with an OutOfMemoryError. So I was thinking along the lines of a Map that, when it becomes aware of memory running low, can write its current contents to disk.
Has anyone implemented anything like this or knows of any existing solutions out there?
What I'm trying to do is read a very large ASCII file (say 50 GB) a line at a time. Each line contains a key and a value. Keys can be duplicated in the file. I'll then store each line in a Map that maps each key to a list of values. This Map is the object that will just grow and grow.
Any advice greatly appreciated.
Phil
Update:
Thanks for all the comments and advice, everyone. For the problem as I described it, a database is the correct, scalable solution. I should have stated that this is a temporary Map that needs to be created and used for a short period of time to aid in the parsing of a file. In this case, Michael's suggestion to "store only the line number instead of the actual value" is the most appropriate, so I'm marking Michael's answer(s) as the recommended solution.
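For anyone curious, here is a minimal sketch of that line-number idea under my own assumptions (the tab-separated key/value format and the class name are illustrative, not taken from my actual parser): a first pass records, for each key, only the line numbers it appears on, rather than the values themselves.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyToLineIndex {

    // key -> line numbers on which that key occurs (much smaller than storing the values)
    private final Map<String, List<Long>> index = new HashMap<>();

    /** First pass over the big file: remember only the line number for each key. */
    public void build(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.US_ASCII)) {
            String line;
            long lineNo = 0;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');            // assumes key<TAB>value per line
                String key = line.substring(0, tab);
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(lineNo);
                lineNo++;
            }
        }
    }

    public List<Long> lineNumbersFor(String key) {
        return index.getOrDefault(key, List.of());
    }
}

A second sequential pass (or a RandomAccessFile with byte offsets instead of line numbers) can then pull back the actual values for just the keys you care about.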
I think you are looking for a database.
A NoSQL database will probably be easy to set up, and it is more akin to a map.
Check BerkeleyDB Java edition, now from Oracle.
It has a map-like interface and can be embedded, so no complex setup is needed.
Sounds like a case for dumping your huge file into a DB.
Well, I had a similar situation. In my case everything was in a TXT file, and every line throughout the file had the same format. So what I did was split the file into several pieces (each small enough for my JVM to handle), and then processed the files one by one.
Alternatively, you can load your data directly into a database.
Seriously, choose a simple database as advised. It's not much overhead: you don't have to use JPA or whatnot, just plain JDBC with native SQL. Derby or HSQL, for example, can run in embedded mode, so there is no need to define users or access rights, or to start a server separately.
The "overhead" will stab you in the back when you've plodden far into the hash map solution and it turns out that you need yet another optimization to avoid the OutOfMemoryException, or the file is not 50 GB, but 75... Really, don't go there.
If you're just wanting to build up the map for data processing (rather than random access in response to requests), then MapReduce may be what you want, with no need to work with a database.
Edit: Note that although many MapReduce introductions focus on the ability to run many nodes, you should still get benefit from sidestepping the requirement to hold all the data in memory on one machine.
How much memory do you have? Unless you have enough memory to keep most of the data in memory, it's going to be so slow that it may as well have failed. A program which is heavily paging can be 1000x slower or more. Some PCs have 16-24 GB, and you might consider getting more memory.
Let's assume there are enough duplicates that you can keep most of the data in memory. Since you have ASCII data, I suggest you use a byte-based String class of your own making for the keys, and store your values as another of these "String" types (with a separator). You may find you can keep the working data set in memory.
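To illustrate the byte-based "String" idea (my own sketch, not an existing class; it is less of a win on Java 9+ where compact strings already store Latin-1 text as bytes), here is a small wrapper around a byte[] that can be used as a HashMap key:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Compact ASCII key: one byte per character, with equals/hashCode so it works as a HashMap key. */
public final class AsciiKey {
    private final byte[] bytes;
    private final int hash;

    public AsciiKey(String s) {
        this.bytes = s.getBytes(StandardCharsets.US_ASCII);
        this.hash = Arrays.hashCode(bytes);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof AsciiKey && Arrays.equals(bytes, ((AsciiKey) o).bytes);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}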
I use BerkeleyDB for this, though it is more complicated than a Map (they do have a Map wrapper, but I don't really recommend it for anything but simple applications).
http://www.oracle.com/technetwork/database/berkeleydb/overview/index.html
It is also available in Maven http://www.oracle.com/technetwork/database/berkeleydb/downloads/maven-087630.html
<dependencies>
    <dependency>
        <groupId>com.sleepycat</groupId>
        <artifactId>je</artifactId>
        <version>3.3.75</version>
    </dependency>
</dependencies>

<repositories>
    <repository>
        <id>oracleReleases</id>
        <name>Oracle Released Java Packages</name>
        <url>http://download.oracle.com/maven</url>
        <layout>default</layout>
    </repository>
</repositories>
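For a feel of the plain (non-Map-wrapper) BDB JE API, here is a rough sketch under my own naming (the environment directory and the key/value strings are placeholders, and the directory must already exist):

import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class BdbExample {
    public static void main(String[] args) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(new File("bdb-env"), envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        Database db = env.openDatabase(null, "keyToValues", dbConfig);

        // put
        DatabaseEntry key = new DatabaseEntry("someKey".getBytes(StandardCharsets.US_ASCII));
        DatabaseEntry value = new DatabaseEntry("someValue".getBytes(StandardCharsets.US_ASCII));
        db.put(null, key, value);

        // get
        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(new String(found.getData(), StandardCharsets.US_ASCII));
        }

        db.close();
        env.close();
    }
}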
It also has the disadvantage of vendor lock-in (i.e. you are tied to this tool, though there may be similar Map wrappers for other databases).
So just choose according to your needs.
Most cache APIs work like maps and support overflow to disk. Ehcache, for example, supports that. Or follow this tutorial for Guava.
I'm looking for the fastest approach, in Java, to store ~1 billion records of ~250 bytes each (storage will happen only once) and then being able to read it multiple times in a non-sequential order.
The source records are generated as simple Java value objects, and I would like to read them back in the same format.
For now my best guess is to store these objects in a flat file using a fast serialization library such as Kryo, and then use a Java FileChannel for direct random access to read records at specific positions in the file. When storing the data, I will keep a HashMap (also to be saved on disk) with the position in the file of each record, so that I know where to read it.
Also, there is no need to optimize disk space. My key concern is to optimize read performance, while having a reasonable write performance (that, again, will happen only once).
One last detail: while the records are all of the same type (the same Java value object), their size in bytes is variable (e.g. they contain strings).
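A bare-bones sketch of what I have in mind (assuming Kryo 5-style APIs; the Record class and its fields are placeholders for my real value object): each record is serialized with Kryo, appended via a FileChannel, and its offset and length are kept in an in-memory index.

import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashMap;
import java.util.Map;

public class RecordStore {

    /** Placeholder for the real value object. */
    public static class Record {
        public long id;
        public String payload;
    }

    private static final class Slot {
        final long offset;
        final int length;
        Slot(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final Kryo kryo = new Kryo();
    private final FileChannel channel;
    private final Map<Long, Slot> index = new HashMap<>(); // record id -> position in the file
    private long writePos = 0;

    public RecordStore(Path file) throws IOException {
        kryo.register(Record.class);
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    /** Storage pass: append one record and remember where it went. */
    public void write(Record r) throws IOException {
        Output out = new Output(256, -1);                 // buffer grows as needed
        kryo.writeObject(out, r);
        byte[] bytes = out.toBytes();
        channel.write(ByteBuffer.wrap(bytes), writePos);
        index.put(r.id, new Slot(writePos, bytes.length));
        writePos += bytes.length;
    }

    /** Read pass: non-sequential access through the in-memory index. */
    public Record read(long id) throws IOException {
        Slot slot = index.get(id);
        ByteBuffer buf = ByteBuffer.allocate(slot.length);
        channel.read(buf, slot.offset);                   // a real version should loop until full
        return kryo.readObject(new Input(buf.array()), Record.class);
    }
}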
Is there any better approach than what I mentioned above? Any hint or suggestion would be greatly appreciated!
Many thanks,
Thomas
You can use Apache Lucene; it will take care of everything you have mentioned above :)
It is super fast, and you can look up results more quickly than ever.
Apache Lucene persists objects in files and indexes them. We have used it in a couple of apps and it is super fast.
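Roughly speaking (this uses Lucene 5+ era APIs, and the field names are invented for illustration), you would index each record's id and store the serialized record bytes alongside it as a stored field:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

import java.io.IOException;
import java.nio.file.Paths;

public class LuceneRecordStore {
    public static void main(String[] args) throws IOException {
        Directory dir = FSDirectory.open(Paths.get("record-index"));

        // write once: index the id, store the serialized record bytes alongside it
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            byte[] serializedRecord = new byte[]{1, 2, 3};   // placeholder for the real bytes
            Document doc = new Document();
            doc.add(new StringField("id", "42", Field.Store.YES));
            doc.add(new StoredField("payload", serializedRecord));
            writer.addDocument(doc);
        }

        // read many times: exact-match lookup by id, then pull the stored bytes back out
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("id", "42")), 1);
            Document found = searcher.doc(hits.scoreDocs[0].doc);
            BytesRef payload = found.getBinaryValue("payload");
            System.out.println("payload bytes: " + payload.length);
        }
    }
}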
You could just use an embedded Derby database. It's written in Java and you can run it embedded within your process, so there is no overhead of inter-process or network communication. It will store the data and allow you to query it, handling all the complexity and indexing for you.
Is there a simple way of having a file backed Map?
The contents of the map are updated regularly, with some being deleted, as well as some being added. To keep the data that is in the map safe, persistence is needed. I understand a database would be ideal, but sadly due to constraints a database can't be used.
I have tried:
Writing the whole contents of the map to file each time it gets updated. This worked, but obviously has the drawback that the whole file is rewritten each time; the contents of the map are expected to be anywhere from a couple of entries to ~2000. There are also some concurrency issues (i.e. writes happening out of order result in loss of data).
Using a RandomAccessFile and keeping a pointer to each entry's start byte, so that each entry can be looked up using seek(). Again this had a similar issue to before: changing an entry would involve updating all of the offsets after it.
Ideally, the solution would involve some sort of caching, so that only the most recently accessed entries are kept in memory.
Is there such a thing? Or is it available via a third party jar? Someone suggested Oracle Coherence, but I can't seem to find much on how to implement that, and it seems a bit like using a sledgehammer to crack a nut.
You could look into MapDB, which was created with this purpose in mind.
MapDB provides concurrent Maps, Sets and Queues backed by disk storage or off-heap memory. It is a fast and easy to use embedded Java database engine.
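A minimal sketch, assuming the MapDB 3.x style API (the file and map names are arbitrary):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

import java.util.concurrent.ConcurrentMap;

public class MapDbExample {
    public static void main(String[] args) {
        // file-backed DB; data survives restarts
        DB db = DBMaker.fileDB("map.db").transactionEnable().make();

        ConcurrentMap<String, String> map =
                db.hashMap("myMap", Serializer.STRING, Serializer.STRING).createOrOpen();

        map.put("someKey", "someValue");
        db.commit();   // persist the change
        System.out.println(map.get("someKey"));

        db.close();
    }
}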
Yes, Oracle Coherence can do all of that, but it may be overkill if that's all you're doing.
One way to do this is to "overflow" from RAM to disk:
BinaryStore diskstore = new BerkeleyDBBinaryStore("mydb", ...);
SimpleSerializationMap mapDisk = new SimpleSerializationMap(diskstore);
LocalCache mapRAM = new LocalCache(100 * 1024 * 1024); // 100 MB in RAM
OverflowMap cache = new OverflowMap(mapRAM, mapDisk);
Starting in version 3.7, you can also transparently overflow from RAM journal to flash journal. While you can configure it in code (as per above), it's generally just a line or two of config and then you ask for the cache to be configured on your behalf, e.g.
// simplest example; you'd probably use a builder pattern or a configurable cache factory
NamedCache cache = CacheFactory.getCache("mycache");
For more information, see the doc available from http://coherence.oracle.com/
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.
JDBM2 looks promising; I've never used it, but it seems to be a candidate to meet your requirements:
JDBM2 provides a HashMap and TreeMap which are backed by disk storage. It is a very easy and fast way to persist your data. JDBM2 also has minimal hardware requirements and is highly embeddable (the jar is only 145 KB).
You'll find many more solutions if you look for key/value stores.
I'm filling up the JVM Heap Space.
Changing JVM parameters to give it more heap space, or changing something in my algorithm so the code does not use so much space, are the two most commonly recommended options.
But, if those two have already been tried and applied, and I still get out of memory exceptions, I'd like to see what the other options are.
I found out about this example of "Using a memory mapped file for a huge matrix" and a library called HugeCollections which are an interesting way to solve my problem. Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one.
My question is, is there any other library doing this, or a good way of achieving it (having collection objects (lists and sets) memory mapped)?
You don't say what sort of collections you're using, or the way that you're using them, so it's hard to give recommendations. However, here are a few things to keep in mind:
Keeping the objects on the Java heap will always be the simplest option, and RAM is relatively cheap.
Blindly moving to memory-mapped data is very likely to give horrendous performance, especially if you're moving around in the file and/or making lots of changes. Hash-based collection types are the worst, as they deliberately scatter data, destroying locality. Tree-based collection types are generally a better choice, and linear collections can go both ways.
Once you move off-heap, you need a way to translate your objects to/from Java. Object serialization is the easiest, but adds lots of overhead. Binary objects accessed via byte buffers are usually a better choice, but you need to be thread-conscious.
You also have to manage your own garbage collection for off-heap objects. Not a problem if all you're doing is creating/updating, but quickly becomes a pain if you're deleting.
If you have a lot of data, and need to access that data in varied ways, a database is probably your best bet.
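To make the byte-buffer option above concrete, here is a bare-bones memory-mapped file using only the JDK (the file name and sizes are made up; this keeps a fixed-size array of longs off the heap):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedLongArray {
    public static void main(String[] args) throws IOException {
        long count = 100_000_000L;            // 100M longs = 800 MB, lives outside the heap
        long sizeBytes = count * Long.BYTES;

        try (FileChannel channel = FileChannel.open(Paths.get("longs.dat"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {

            // note: a single MappedByteBuffer is limited to 2 GB; larger files need several mappings
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);

            buffer.putLong(42 * Long.BYTES, 123456789L);          // write element 42
            System.out.println(buffer.getLong(42 * Long.BYTES));  // read it back
        }
    }
}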
"Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one"
I agree and I wrote it. ;)
I suggest you look at https://github.com/peter-lawrey/Java-Chronicle which is higher performance and has been used a fair bit. It is really designed for List & Queue, but you could use it for a Map or Set with additional data structures.
Depending on your requirements, you could write your own library. E.g. for time-series data I wrote a different library, which unfortunately is not open source, but it can load tables of 500+ GB pretty cleanly.
"it's not in any Maven repo"
Neither is this one, but I would be happy for someone to add it.
Sounds like you're either having trouble with a memory leak, or trying to put too large an Object into memory.
Have you tried making a rough estimate of the amount of memory needed to load your data?
Assuming you have no memory leaks or other issues, and really need so much storage that you can't fit it in the heap (which I find unlikely), you have basically only one option:
Don't put your data on the heap. Simple as that. Which method you use to move your data out is very dependent on your requirements (what kind of data, how frequent the updates, and how much of it is there really?).
Note: You can use very large heaps with a 64-bit VM and if necessary enlarge the swap space of the OS. It may be the simplest solution to just brutally increase the maximum heap size (even if it means lots of swapping). I certainly would try that first in the situation you outlined.
I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of XML files in a batch process. XML files may be up to a few megabytes; most are in the 100 KB range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux-Distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no means to access a Bitcask repo via Java (or is there?).
So my question boils down to:
Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents?
Are there any viable alternatives to Bitcask available from Java? (BerkeleyDB comes to mind...)
(For Riak specialists) Is Riak much more overhead implementation-, management- and resource-wise compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note the above assumes that the XML documents are updated relatively frequently. If updates are rare and / or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.
I need to store some data that follows the simple pattern of mapping an "id" to a full table (with multiple rows) of several columns (i.e. some integer values [u, v, w]). The size of one of these tables would be a couple of KB. Basically what I need is to store a persistent cache of some intermediary results.
This could quite easily be implemented with simple SQL, but there are a couple of problems. Namely, I need to compress this structure on disk as much as possible (because of the number of values I'm storing). Also, it's not transactional; I just need to write once and then simply read the contents of the entire table, so a relational DB isn't actually a very good fit.
I was wondering if anyone had any good suggestions? For some reason I can't seem to come up with something decent at the moment. Something with an API in Java would be especially nice.
This sounds like a job for.... new ObjectOutputStream(new FileOutputStream(STORAGE_DIR + "/" + key + ".dat")); !!
Seriously - the simplest method is to just create a file for each data table that you want to store, serialize the data into it, and look it up using the key as the filename when you want to read.
On a decent file system, writes can be made atomic (by writing to a temp file and then renaming it); read/write speed is measured in tens of MBit/second; lookups can be made very efficient by creating a simple directory tree like STORAGE_DIR + "/" + key.substring(0,2) + "/" + key.substring(0,4) + "/" + key, which should still be efficient with millions of entries and even more efficient if your file system uses indexed directories; lastly it's trivial to implement a memory-backed LRU cache on top of this for even faster retrievals.
Regarding compression - you can use Jakarta's commons-compress to apply gzip or even bzip2 compression to the data before you store it. But this is an optimization problem, and depending on your application and available disk space you may be better off investing the CPU cycles elsewhere.
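A stripped-down sketch of the approach (the directory-tree scheme follows the naming above, gzip comes from the JDK rather than commons-compress, it assumes keys are at least four characters, and error handling is omitted):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FilePerKeyStore {
    private final File storageDir;

    public FilePerKeyStore(File storageDir) {
        this.storageDir = storageDir;
    }

    // STORAGE_DIR/ab/abcd/abcdef.dat.gz - spreads entries over a small directory tree
    private File fileFor(String key) {
        File dir = new File(storageDir, key.substring(0, 2) + "/" + key.substring(0, 4));
        dir.mkdirs();
        return new File(dir, key + ".dat.gz");
    }

    public void put(String key, Serializable table) throws IOException {
        File tmp = new File(fileFor(key).getPath() + ".tmp");
        try (ObjectOutputStream out =
                     new ObjectOutputStream(new GZIPOutputStream(new FileOutputStream(tmp)))) {
            out.writeObject(table);
        }
        tmp.renameTo(fileFor(key));   // near-atomic replace on most file systems
    }

    public Object get(String key) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                     new ObjectInputStream(new GZIPInputStream(new FileInputStream(fileFor(key))))) {
            return in.readObject();
        }
    }
}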
Here is a sample implementation that I made: http://geek.co.il/articles/geek-storage.zip. It uses a simple interface (which is far from being clean - it's just a demonstration of the concept) that offers methods for storing and retrieving objects from a cache with a set maximum size. A cache miss is transferred to a user implementation for handling, and the cache will periodically check that it doesn't exceed the storage requirements and will remove old data.
I also included a MySQL-backed implementation for completeness, and a benchmark to compare the disk-based and MySQL-based implementations. On my home machine (an old Athlon 64), the disk implementation scores more than twice as fast as the MySQL implementation in the enclosed benchmark (9.01 seconds vs. 18.17 seconds). Even though the DB implementation could probably be tweaked for slightly better performance, I believe it demonstrates the problem well enough.
Feel free to use this as you see fit.
I'd use EHCache; it's used by Hibernate and other Java EE libraries, and is really simple and efficient:
To add a table:
List<List<Integer>> myTable = new ArrayList<>();
cache.put(new Element("myId", myTable));
To read:
List<List<Integer>> myTable = (List<List<Integer>>) cache.get("myId").getObjectValue();
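For completeness, here is one way the cache itself might be created and given disk overflow (this assumes the Ehcache 2.x API; the cache name and sizes are arbitrary, and newer Ehcache versions configure disk tiering differently):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.config.CacheConfiguration;

public class EhcacheSetup {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();    // falls back to the default configuration

        // keep up to 1000 tables on the heap, overflow the rest to disk
        Cache cache = new Cache(new CacheConfiguration("tables", 1000)
                .overflowToDisk(true)
                .eternal(true));
        manager.addCache(cache);

        // ... the put/get snippets above can now run against this cache ...

        manager.shutdown();
    }
}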
Have you looked at Berkeley DB? That sounds like it may fit the bill.
Edit:
I forgot to add that you can gzip the values themselves before you store them. Then just gunzip them when you retrieve them.
Apache Derby might be a good fit if you want something embedded (not a separate server).
There is a list of other options at Lightweight Data Bases in Java
It seems that key=>value databases are what you are looking for.
Maybe SuperCSV is the best framework for you!
If you don't want to use a relational database, you can use JAXB to store your objects as XML files!
There is also a way to do this with other libraries such as XStream.
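XStream in particular needs almost no setup. A minimal sketch (the Table class stands in for your own value object, and the allow-list calls are only needed on newer XStream versions with the security framework enabled):

import com.thoughtworks.xstream.XStream;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class XStreamExample {

    /** Stand-in for the real per-id table. */
    public static class Table {
        public List<List<Integer>> rows = new ArrayList<>();
    }

    public static void main(String[] args) throws IOException {
        XStream xstream = new XStream();
        // newer XStream versions require an explicit allow-list for deserialization
        xstream.allowTypes(new Class[]{Table.class});
        xstream.allowTypesByWildcard(new String[]{"java.util.**"});

        Table table = new Table();
        table.rows.add(new ArrayList<>(Arrays.asList(1, 2, 3)));
        table.rows.add(new ArrayList<>(Arrays.asList(4, 5, 6)));

        // one XML file per id
        Path file = Paths.get("table-myId.xml");
        Files.write(file, xstream.toXML(table).getBytes(StandardCharsets.UTF_8));

        // read it back
        Table loaded = (Table) xstream.fromXML(
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
        System.out.println(loaded.rows);
    }
}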
If you prefer XML, then use JAXB or XStream. Otherwise you should have a look at CSV libraries such as SuperCSV. People who can live with serialized Java files can use the default persistence mechanism, as Guss said. Direct Java persistence may be the fastest way.
You can use JOAFIP http://joafip.sourceforge.net/
It lets you put your whole data model in a file, and you can access and update it without reloading everything into memory.
If you have a couple of KB per table, I don't understand why you need to "compress the size of this structure on disk as much as possible". Given that 181 MB of disk space costs 1 cent, I would suggest that anything less than this isn't worth spending too much time worrying about.
However, to answer your question, you can compress the file as you write it. As well as ObjectOutputStream, you can use XMLEncoder to serialize your map. This will be more compact than just using ObjectOutputStream, and if you decompress the file you will be able to read or edit the data.
XMLEncoder xe = new XMLEncoder(
        new GZIPOutputStream(
                new FileOutputStream(filename + ".xml.gz")));
xe.writeObject(map);
xe.close();
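And a sketch of the corresponding read-back, assuming the same file name (java.beans.XMLDecoder plus java.util.zip.GZIPInputStream):

// needs: java.beans.XMLDecoder, java.util.zip.GZIPInputStream, java.io.FileInputStream
XMLDecoder xd = new XMLDecoder(
        new GZIPInputStream(
                new FileInputStream(filename + ".xml.gz")));
Map<?, ?> map = (Map<?, ?>) xd.readObject();
xd.close();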