We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase memory limits, we hesitate to do so to since it requires having a high allocation when most times it's not necessary.
We are considering using a customized java.util.List implementation to spool to disk when we hit peak loads like this, but under lighter circumstances will remain in memory.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is, essentially a batch processor that loads in data from multiple database tables and conducts extensive business logic on it. All of the data in the list is required since aggregate operations are part of the logic done.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straight forward, or the worst of your nightmares it is hard to tell from what you've described.
I'm optimist and I think that using a ORM framework ( such as Hibernate ) would solve your problem in about 3 - 5 days
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
Related
What is the fastest way to populate a Hazelcast Data Grid. Reading through documentation I can see couple of variants:
Use multithreading and IMap.set
Use multithreading and IMap.putAll
Use a Distributed Execution in order to start populating the grid from all participants.
My performance benchmark shows that IMap.putAll is faster than IMap.Set. But it is stated in the Hazelcasty Documentation that IMap.putAll does not come with guarantees that everything will be inserted atomically.
Can someone clarify a little bit about what would be the fastest way to populate a data grid with data ?
Is variant number 3 good ?
I would see the same three options. Anyhow as you mentioned, option two does not guarantee that everything was put into the map atomically but if you just load data and wait for all threads to finish loading data using IMap::putAll you should be fine.
Apart from that IMap::set would be the alternative. In any case you want to multithread the loading process. I would play around a bit with different thread numbers and loading data from a client is normally recommended to keep nodes free for storage operations.
I personally never benchmarked your third option, anyhow it would be possible as well. Just not sure it is worth the additional work.
How much data do you want to load that you're concerned it could be slow? Do you already know that loading is slow? Do you use Java Serialization (this is a huge performance killer)? Do you use indexes (those have to be generated while putting data)?
There's normally a lot of optimizations to apply to speed up, not only, data loading but also normal operation.
I'm filling up the JVM Heap Space.
Changing parameters to give more heap space to the JVM, or changing something in my algorithm in the code not to use so much space are two of the most recommended options.
But, if those two have already been tried and applied, and I still get out of memory exceptions, I'd like to see what the other options are.
I found out about this example of "Using a memory mapped file for a huge matrix" and a library called HugeCollections which are an interesting way to solve my problem. Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one.
My question is, is there any other library doing this, or a good way of achieving it (having collection objects (lists and sets) memory mapped)?
You don't say what sort of collections you're using, or the way that you're using them, so it's hard to give recommendations. However, here are a few things to keep in mind:
Keeping the objects on the Java heap will always be the simplest option, and RAM is relatively cheap.
Blindly moving to memory-mapped data is very likely to give horrendous performance, especially if you're moving around in the file and/or making lots of changes. Hash-based collection types are the worst, as they work by distributing data. Tree-based collection types are generally a better choice, and linear collections can go both ways.
Once you move off-heap, you need a way to translate your objects to/from Java. Object serialization is the easiest, but adds lots of overhead. Binary objects accessed via byte buffers are usually a better choice, but you need to be thread-conscious.
You also have to manage your own garbage collection for off-heap objects. Not a problem if all you're doing is creating/updating, but quickly becomes a pain if you're deleting.
If you have a lot of data, and need to access that data in varied ways, a database is probably your best bet.
Unluckily, the library hasn't seen an update for over a year, and it's not in any Maven repo - so for me it's not a really reliable one I agree and I wrote it. ;)
I suggest you look at https://github.com/peter-lawrey/Java-Chronicle which is higher performance has been used a bit. It really design for List & Queue but you could use it for a Map or Set with additional data structures.
Depending on your requirements, you could write your own library. e.g. for time series data I wrote a different library which is not open source unfortunately but can load tables of 500+ GB pretty cleanly.
it's not in any Maven repo
Neither is this one but would be happy for someone to add it.
Sounds like you're either having trouble with a memory leak, or trying to put too large an Object into memory.
Have you tried making a rough estimate of the amount of memory needed to load your data?
Assuming you have no memory leaks or other issues and really need that much storage that you can't fit it in the heap (which I find unlikely) you have basically only one option:
Don't put your data on the heap. Simple as that. Now which method you use to move your data out is very dependend on your requirements (what kind of data, frequency of updates and how much is it really?).
Note: You can use very large heaps with a 64-bit VM and if necessary enlarge the swap space of the OS. It may be the simplest solution to just brutally increase the maximum heap size (even if it means lots of swapping). I certainly would try that first in the situation you outlined.
I have a Huge data file and I only need specific data from this file, and later on, I will be using these data frequently.
So which of these two methods would be more efficient :
save this data in global variables (maybe LinkedList) and use them every time I need
save them in a file, and read the file every time I need the data
I should mention that these data could be a huge amount of integers.
Which of the mentioned two ways would give better performance with respect to speed and memory ?
If the file I/O overhead is not an issue for you: Save them in a file and create an index file mapping keys to file positions so you do not have to read your huge file.
If the data fits in your RAM and you want to be able to access it quickly - go by the first approach (but maybe without an index file) but read the data into memory at startup or when needed the first time.
As long as it fits in memory, working with memory is surely some orders of magnitude faster. But do not use LinkedList - it has a huge overhead. And do not use any standard Collection at all since it means boxing and blows the memory overhead by a factor 3 at least.
You could use int[] or a specialized collection for primitive types.
I'd recommend using a file via java.nio.IntBuffer. This way the data reside primarily on the disk but get mapped into memory too.
Probably the first one.
But there really isn't enough information there to answer you properly.
Firstly a linked list is fine if you only ever traverse it in order. However, if you need random access to it (5th element, then 100th, then 12th, then 45th...), it's lousy, and you'd be better with an ArrayList or something. Secondly, if you're storing lots of ints, if you use one of the standard Java collections, each int will be boxed, which may present a performance overhead.
Then you haven't said what 'huge' means. Thousands? Millions?
So, yeah, you need to say what kind of numbers you're dealing with, and what the access patterns are likely to be. And is the 'filtering' step a one-off--or is it done quite frequently?
It depends on system spec, if you are designing your app for one machine - the task is simple, elsewhere you should take into account memory and/or disk space limit on client's computer.
I think you cannot compare these two attitudes performance, as each one has it's own benefits and drawbacks. I'm certain that there are some algorithms available that you could further investigate, connected with reading part of a file into the memory, or creating a cache (when you read a number from a file, store it in memory, so next time you load it - it will be stored in memory).
I am trying to better understand when I should and should not use Iterators. To me, whenever I have a potentially large amount of data to iterate through, I write an Iterator for it. If it also lends itself to the Iterator interface, then it seems like a win.
I was reading a little bit that there is a lot of overhead with using an Iterator.
A good example of where I used an Iterator was to iterate through a bunch of SQL scripts to execute one query at a time, reading it in, then executing it.
Is there another performance trade off I should be aware of? Before I used iterators, I would read the entire String of SQL commands to execute into an ArrayList, and the iterate through that. If the import is rather large (like for geolocation data, then the server tends to get bogged down).
Walter
I think your question is when you should 'stream' input rather than load it all into memory and the process it. It's not really a question of using Iterator or not I think.
"It depends," of course, though in your given example it sounds like streaming the input rather than loading it all into memory is a clear win, so iterate indeed.
The benefit of loading into memory is usually that the code is simpler, and maybe you get some benefit from loading large chunks into memory at once rather than reading bits at a time. The benefit of "streaming" is that you limit your memory requirements, and, gain performance associated with that.
As a very crude rule of thumb, I wouldn't load anything like this into memory unless I were sure it was under 100K or so.
A good example of where I used an Iterator was to iterate through a bunch of SQL scripts to execute one query at a time, reading it in, then executing it.
In this scenario the overhead of an Iterator is likely dwarfed by the time it takes to run the queries.
Before I used iterators, I would read the entire String of SQL commands to execute into an ArrayList, and the iterate through that. If the import is rather large (like for geolocation data, then the server tends to get bogged down).
Any particular reason you need to collect them all into an ArrayList? You could just execute them one by one as you read the statements.
Iterators are particularly suited for streaming cases where the data is loaded/created on the fly/lazily. They do not require the data to be completely in memory upfront.
Alright. So I have a very large amount of binary data (let's say, 10GB) distributed over a bunch of files (let's say, 5000) of varying lengths.
I am writing a Java application to process this data, and I wish to institute a good design for the data access. Typically what will happen is such:
One way or another, all the data will be read during the course of processing.
Each file is (typically) read sequentially, requiring only a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of each file simultaneously, or the middle few kilobytes of each file simultaneously, etc.
There are times when the application will want random access to a byte or two here and there.
Currently I am using the RandomAccessFile class to read into byte buffers (and ByteBuffers). My ultimate goal is to encapsulate the data access into some class such that it is fast and I never have to worry about it again. The basic functionality is that I will be asking it to read frames of data from specified files, and I wish to minimize the I/O operations given the considerations above.
Examples for typical access:
Give me the first 10 kilobytes of all my files!
Give me byte 0 through 999 of file F, then give me byte 1 through 1000, then give me 2 through 1001, etc, etc, ...
Give me a megabyte of data from file F starting at such and such byte!
Any suggestions for a good design?
Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then, let the OS worry about the details of caching, read, flushing etc.
#Will
Pretty good results. Reading a large binary file quick comparison:
Test 1 - Basic sequential read with RandomAccessFile.
2656 ms
Test 2 - Basic sequential read with buffering.
47 ms
Test 3 - Basic sequential read with MappedByteBuffers and further frame buffering optimization.
16 ms
Wow. You are basically implementing a database from scratch. Is there any possibility of importing the data into an actual RDBMS and just using SQL?
If you do it yourself you will eventually want to implement some sort of caching mechanism, so the data you need comes out of RAM if it is there, and you are reading and writing the files in a lower layer.
Of course, this also entails a lot of complex transactional logic to make sure your data stays consistent.
I was going to suggest that you follow up on Eric's database idea and learn how databases manage their buffers—effectively implementing their own virtual memory management.
But as I thought about it more, I concluded that most operating systems are already a better job of implementing file system caching than you can likely do without low-level access in Java.
There is one lesson from database buffer management that you might consider, though. Databases use an understanding of the query plan to optimize the management strategy.
In a relational database, it's often best to evict the most-recently-used block from the cache. For example, a "young" block holding a child record in a join won't be looked at again, while the block containing its parent record is still in use even though it's "older".
Operating system file caches, on the other hand, are optimized to reuse recently used data (and reading ahead of the most recently used data). If your application doesn't fit that pattern, it may be worth managing the cache yourself.
You may want to take a look at an open source, simple object database called jdbm - it has a lot of this kind of thing developed, including ACID capabilities.
I've done a number of contributions to the project, and it would be worth a review of the source code if nothing else to see how we solved many of the same problems you might be working on.
Now, if your data files are not under your control (i.e. you are parsing text files generated by someone else, etc...) then the page-structured type of storage that jdbm uses may not be appropriate for you - but if all of these files are files that you are creating and working with, it may be worth a look.
#Eric
But my queries are going to be much, much simpler than anything I can do with SQL. And wouldn't a database access be much more expensive than a binary data read?
This is to answer the part about minimizing I/O traffic. On the Java side, all you can really do is wrap your readers in BufferedReaders. Aside from that, your operating system will handle other optimizations like keeping recently-read data in the page cache and doing read-ahead on files to speed up sequential reads. There's no point in doing additional buffering in Java (although you'll still need a byte buffer to return the data to the client).
I had someone recommend hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be pretty nice, and might have some marketplace traction.
I would step back and ask yourself why you are using files as your system of record, and what gains that gives you over using a database. A database certainly gives you the ability to structure your data. Given the SQL standard, it might be more maintainable in the long run.
On the other hand, your file data may not be structured so easily within the constraints of a database. The largest search company in the world :) doesn't use a database for their business processing. See here and here.