We have an application in which an XML string is created from a stored proc resultset and transformed using XSLT to return to the calling servlet. This works fine with smaller datasets but causes an out-of-memory error with large amounts of data. What would be the ideal solution in this case?
XSLT transformations, in general, require the entire dataset to be loaded into memory, so the easiest thing is to get more memory.
If you can rewrite your XSLT, there is Streaming Transformations for XML (STX), which allows incremental processing of the data.
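The answer above points at STX; a closely related option is XSLT 3.0 streaming. A rough sketch only, assuming Saxon's s9api is on the classpath, that the stylesheet has been rewritten with a streamable mode, and that true streaming needs Saxon-EE; the file names are placeholders:

    import java.io.File;
    import javax.xml.transform.stream.StreamSource;
    import net.sf.saxon.s9api.Processor;
    import net.sf.saxon.s9api.Serializer;
    import net.sf.saxon.s9api.Xslt30Transformer;
    import net.sf.saxon.s9api.XsltCompiler;
    import net.sf.saxon.s9api.XsltExecutable;

    public class StreamingTransform {
        public static void main(String[] args) throws Exception {
            Processor processor = new Processor(true);   // 'true' requests licensed (EE) features
            XsltCompiler compiler = processor.newXsltCompiler();
            XsltExecutable executable =
                    compiler.compile(new StreamSource(new File("streamable.xsl")));
            Xslt30Transformer transformer = executable.load30();

            Serializer out = processor.newSerializer(new File("result.xml"));
            // With <xsl:mode streamable="yes"/> in the stylesheet, applyTemplates reads the
            // source incrementally, so the whole document never sits in memory at once.
            transformer.applyTemplates(new StreamSource(new File("huge-input.xml")), out);
        }
    }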
If you're processing the entire XML document at once then it sounds like you'll need to allocate more memory to the Java heap. But that only works up to the defined maximum heap size. Do you know a reasonable maximum data set size or is it unbounded?
Why do you need the database to generate the XML?
A few important things to note:
You mentioned that the application works fine with small data sets but runs out of memory with large ones. You need to identify whether it is the creation of the dataset or its transfer within the same process that causes the out-of-memory error.
You are doing something that keeps many objects alive in memory. Re-check your code and null out references to large objects explicitly once you are done with them; this makes life easier for the garbage collector. You can also experiment with the JVM's MaxPermSize setting, which on older JVMs gives you additional space for interned strings.
This approach has a limitation: even if you can transfer a large dataset for a single user, it may still run out of memory with multiple concurrent users.
A suggestion that might work for you:
Break this into an asynchronous process: make the creation of the large dataset one process and the downloading of that dataset a separate one.
While making the dataset available for download, you can control memory consumption very well by streaming it, as in the sketch below.
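As an illustration of the download side, a minimal servlet sketch that copies a previously generated result file to the response in small chunks; the path, parameter name and content type here are assumptions, not part of the original question:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class ResultDownloadServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // Hypothetical location where the asynchronous job wrote the transformed XML.
            Path result = Paths.get("/data/exports", req.getParameter("jobId") + ".xml");
            resp.setContentType("text/xml");
            try (InputStream in = Files.newInputStream(result);
                 OutputStream out = resp.getOutputStream()) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);   // only one 8 KB buffer lives in memory at a time
                }
            }
        }
    }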
Related
I am developing a text analysis program that represents documents as arrays of "feature counts" (e.g., occurrences of a particular token) within some pre-defined feature space. These arrays are stored in an ArrayList after some processing.
I am testing the program on a 64 MB dataset with 50,000 records. The program worked fine with small data sets, but now it consistently throws an "out of memory" Java heap exception when I start loading the arrays into an ArrayList object (using the .add(double[]) method). Depending on how much memory I allocate to the heap, I will get this exception somewhere between the 1,000th and the 3,000th addition to the ArrayList, far short of my 50,000 entries. It became clear to me that I cannot store all this data in RAM and operate on it as usual.
However, I'm not sure which data structures are best suited to let me access and perform calculations on the entire dataset when only part of it can be loaded into RAM.
I was thinking that serializing the data to disk and storing the locations in a hashmap in RAM would be useful. However, I have also seen discussions on caching and buffered processing.
I'm 100% sure this is a common CS problem, so I'm sure there are several clever ways that this has been addressed. Any pointers would be appreciated :-)
You have plenty of choices:
Increase heap size (-Xmx) to several gigabytes.
Do not use boxed collections; use fastutil instead - that should decrease your memory use roughly 4x. http://fastutil.di.unimi.it/
Process your data in batches or sequentially - do not keep the whole dataset in memory at once (see the sketch after this list).
Use a proper database. There are even in-process databases like HSQL; your mileage may vary.
Process your data via map-reduce, perhaps with something that can run locally, like Pig.
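As an illustration of the batch/sequential option above, a sketch that keeps only one record's feature array alive at a time and accumulates aggregates instead of holding all 50,000 arrays in an ArrayList; the feature count, file name and extraction logic are placeholders for whatever the real code does:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class SequentialFeaturePass {
        // Hypothetical feature-space size and input file; adapt to the real extraction code.
        private static final int FEATURES = 10_000;

        public static void main(String[] args) throws IOException {
            double[] featureSums = new double[FEATURES];   // aggregate, not one array per record
            long records = 0;
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("records.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    double[] counts = extractFeatures(line);   // placeholder for the real tokenizer
                    for (int i = 0; i < FEATURES; i++) {
                        featureSums[i] += counts[i];
                    }
                    records++;   // the per-record array becomes garbage here
                }
            }
            System.out.println("processed " + records + " records");
        }

        private static double[] extractFeatures(String record) {
            double[] counts = new double[FEATURES];
            // ... tokenize `record` and increment the matching feature slots ...
            return counts;
        }
    }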
How about using Apache Spark (great for in-memory cluster computing)? This would help scale your infrastructure as your dataset gets larger.
What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - the expansion is too large (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (limited by Java's maximum array length of 2^31 - 1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm that requires multiple reads. During each pass over the file there is a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read the file multiple times rather than keep it all in memory.
The problem you have here is that you cannot mmap() the way the system call of the same name can; the syscall can map up to 2^64 bytes, whereas FileChannel#map() cannot map more than Integer.MAX_VALUE (2^31 - 1) bytes in a single mapping.
However, what you can do is wrap a FileChannel in a class and create several "map ranges" covering the whole file.
I have done "nearly" such a thing, except more complicated: largetext. More complicated because I also have to handle character decoding, and the decoded text has to be loaded into memory, unlike your case, where you read raw bytes. Less complicated because I had a defined JDK interface to implement and you don't.
You can, however, use nearly the same technique with Guava and a RangeMap<Long, MappedByteBuffer>.
In that project I implement CharSequence; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want - or whatever suits you. Your main problem will be defining that interface; I suspect what CharSequence does is not what you want.
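For what it's worth, a minimal sketch of the "map ranges" idea using Guava's RangeMap; the 1 GB chunk size is an arbitrary choice, the single-byte accessor is only illustrative, and reads that straddle a chunk boundary would need extra handling:

    import com.google.common.collect.Range;
    import com.google.common.collect.RangeMap;
    import com.google.common.collect.TreeRangeMap;
    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.Map;

    public class MappedRanges {
        private static final long CHUNK = 1L << 30;   // 1 GB per mapping, safely below Integer.MAX_VALUE

        private final RangeMap<Long, MappedByteBuffer> ranges = TreeRangeMap.create();

        public MappedRanges(String path) throws IOException {
            try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
                long size = channel.size();
                for (long pos = 0; pos < size; pos += CHUNK) {
                    long len = Math.min(CHUNK, size - pos);
                    // Mappings stay valid after the channel is closed.
                    ranges.put(Range.closedOpen(pos, pos + len),
                               channel.map(FileChannel.MapMode.READ_ONLY, pos, len));
                }
            }
        }

        // Read one byte at an absolute file offset; records spanning two chunks need extra care.
        public byte get(long offset) {
            Map.Entry<Range<Long>, MappedByteBuffer> entry = ranges.getEntry(offset);
            return entry.getValue().get((int) (offset - entry.getKey().lowerEndpoint()));
        }
    }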
Meh, I may even have a go at it some day; largetext is quite an exciting project and this looks like the same kind of thing, except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory creates such mappings with only a small part in memory and the rest backed by a file; such an implementation would also exploit the principle of locality: the most recently queried part of the file would be kept in memory for faster access.
See also here.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only have sophisticated query methods on your data, but you could also mangle the data using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program will call the dictionary, and the dictionary will refer you to the big fat file.
I am reading a single XML file of size 2.6 GB; the JVM is given 6 GB.
However, I am still getting a heap space out-of-memory error.
What am I doing wrong here?
For reference, I output the max memory and free memory properties of the JVM:
Max memory was shown as approximately 5.6 GB, but free memory was shown as only 90 MB. Why is only 90 MB shown as free, especially when I have not even started any processing and have only just started the program?
In general, when converting structured text to the corresponding data structures in Java you need a lot more space than the size of the input file. There is a lot of overhead associated with the various data structures that are used, apart from the space required for the strings.
For example, each String instance has an additional overhead of about 32-40 bytes - not to mention that each character is stored in two bytes, which effectively doubles the space requirements for ASCII-encoded XML.
Then you have additional overhead when storing the String in a structure. For example, in order to store a String instance in a Map you will need about 16-32 bytes of additional overhead, depending on the implementation and how you measure the usage.
It is quite possible that 6GB is just not enough to store a parsed 2.6GB XML file at once...
Bottom line:
If you are loading such a large XML file in memory (e.g. using a DOM parser) you are probably doing something wrong. A stream-based parser such as SAX should have far more modest requirements.
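For instance, a minimal SAX sketch that walks a large file without ever building a tree; the element name and file name are placeholders:

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class StreamingXmlCount {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            final long[] recordCount = {0};
            parser.parse(new File("big.xml"), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    // Only the current element's data is in memory at any time.
                    if ("record".equals(qName)) {   // hypothetical element name
                        recordCount[0]++;
                    }
                }
            });
            System.out.println("records: " + recordCount[0]);
        }
    }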
Alternatively consider transforming the XML file into a more usable file format, such as an embedded database - or even an actual server-based database. That would allow you to process far larger documents without issues.
You should avoid loading the entire XML into memory at once and instead use a specialized class that can deal with large amounts of XML.
There are potentially several different issues here.
But for starters:
1) If you're on a 64-bit OS, make sure you're using a 64-bit JVM
2) Make sure your code closes all resources you open as promptly as possible.
3) Explicitly set references to large objects you're done with to "null".
... AND ...
4) Familiarize yourself with JConsole or VisualVM:
http://www.ibm.com/developerworks/library/j-5things7/
http://visualvm.java.net/api-quickstart.html
You can't load a 2.6 GB XML file as an in-memory document with just 6 GB. As jhordo suggests, the ratio is more likely to be 12 to 1. This is because every byte turns into a 16-bit character, and every tag, attribute and value turns into a String with at least 32 bytes of overhead.
Instead, what you should do is use SAX or another event-based parser to process the file progressively. That way it only keeps as much data as you need to retain. If you can process everything in one pass, you won't need to retain anything.
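If you prefer a pull model over SAX callbacks, StAX does the same progressive processing. A minimal sketch, with the file name as a placeholder and the per-event handling left to your data:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxScan {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("big.xml")) {
                XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
                long elements = 0;
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        elements++;
                        // Read attributes/text here, then discard what you no longer need.
                    }
                }
                reader.close();
                System.out.println("elements: " + elements);
            }
        }
    }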
I am looking for a simple way to store and retrieve millions of xml files. Currently everything is done in a filesystem, which has some performance issues.
Our requirements are:
Ability to store millions of XML files in a batch process. XML files may be up to a few megabytes in size, most in the 100 KB range.
Very fast random lookup by id (e.g. document URL)
Accessible by both Java and Perl
Available on the most important Linux distros and Windows
I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem like overkill for our needs:
No clustering required
No daemon ("service") required
No clever search functionality required
Having delved deeper into Riak, I have found Bitcask (see the intro), which seems like exactly what I want. The basics described in the intro are really intriguing. But unfortunately there is no way to access a Bitcask repository via Java (or is there?).
So my question boils down to:
Is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents?
Are there any viable alternatives to Bitcask available via Java? (Berkeley DB comes to mind...)
(For Riak specialists) Is Riak much overhead, implementation-, management- and resource-wise, compared to "naked" Bitcask?
I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.
The problem is in Bitcask's data file merging process. This involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, this is an insane amount of data copying.
Note that the above assumes the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of space "waste", then merging may only need to be done rarely, or not at all.
Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.
Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
I am not sure on the status of a Java version/wrapper.
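To make the model concrete, here is a toy Java sketch of the Bitcask idea - an append-only data file plus an in-memory key directory. It illustrates the model only; it is not Bitcask's actual file format or API, and it omits CRCs, tombstones and merging:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.HashMap;
    import java.util.Map;

    public class AppendOnlyXmlStore {
        private final FileChannel log;
        private final Map<String, long[]> keyDir = new HashMap<>();   // id -> {offset, length}

        public AppendOnlyXmlStore(Path dataFile) throws IOException {
            log = FileChannel.open(dataFile, StandardOpenOption.CREATE,
                                   StandardOpenOption.READ, StandardOpenOption.WRITE);
        }

        public void put(String id, byte[] xml) throws IOException {
            long offset = log.size();
            log.write(ByteBuffer.wrap(xml), offset);          // sequential append
            keyDir.put(id, new long[] {offset, xml.length});  // latest entry wins on overwrite
        }

        public String get(String id) throws IOException {
            long[] entry = keyDir.get(id);
            ByteBuffer buffer = ByteBuffer.allocate((int) entry[1]);
            log.read(buffer, entry[0]);                       // one positional read per lookup
            return new String(buffer.array(), StandardCharsets.UTF_8);
        }
    }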
We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we can increase the memory limits, we hesitate to do so since it means reserving a high allocation even though it is unnecessary most of the time.
We are considering using a customized java.util.List implementation that spools to disk when we hit peak loads like this, but remains in memory under lighter circumstances.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
The application is, essentially, a batch processor that loads data from multiple database tables and applies extensive business logic to it. All of the data in the list is required, since aggregate operations are part of the logic.
I just came across this post which offers a very good option: STXXL equivalent in Java
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful utilities like these with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory -- instead, records are read from your source only as the iterator is advanced.
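A minimal sketch of that idea using Guava's AbstractIterator, assuming the data comes from a JDBC query; the table and column names are made up, and the caller is responsible for closing the connection when done:

    import com.google.common.collect.AbstractIterator;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Streams rows one at a time instead of loading them all into a List.
    public class RowIterator extends AbstractIterator<String> {
        private final ResultSet rs;

        public RowIterator(Connection connection) throws SQLException {
            Statement stmt = connection.createStatement();
            rs = stmt.executeQuery("SELECT payload FROM batch_input");   // placeholder query
        }

        @Override
        protected String computeNext() {
            try {
                return rs.next() ? rs.getString(1) : endOfData();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }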
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage. It may be pretty straightforward, or the worst of your nightmares; it is hard to tell from what you've described.
I'm an optimist, and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3-5 days.
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
I would also question why you need to load all of the data into memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.