How to handle a large XML file in Java (around 5 GB)

My application needs to use data from an XML file that is up to 5 GB in size. I load the data into Image classes from the XML. The Image class has many attributes, like Path, Name, MD5, Hash, and other information of that kind.
The 5 GB file holds data for around 50 million images. When I parse the XML, all of that data is loaded into the app and the same number of Image objects is created, and I then perform different operations and calculations on them.
My problem is that when I parse such a huge file, my memory gets eaten up; I assume all the data is being loaded into RAM. Due to the complexity of the code, I'm unable to provide the whole code. Is there an efficient way to handle such a huge number of objects? I have researched all night without success. Can someone point me in the right direction?
Thanks

You need some sort of pipeline that passes the data on to its actual destination without ever storing it all in memory at once.
I don't know how your code does the parsing, but you don't need to keep all the data in memory.
Here is a very good answer for an implementation for reading large XML files.
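As a hedged sketch of that streaming approach (not the linked answer itself), here is a minimal StAX loop; the element name "image" and the attribute names are assumptions about the file layout. It hands each image's data to a processing step and never builds the whole document in memory:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.FileInputStream;

public class StreamingImageParser {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("images.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "image".equals(reader.getLocalName())) {
                    // Read only the attributes of the current <image> element,
                    // hand them off, and let them become garbage afterwards.
                    String path = reader.getAttributeValue(null, "path");
                    String md5  = reader.getAttributeValue(null, "md5");
                    process(path, md5);   // pipeline step: aggregate, write to a DB, etc.
                }
            }
            reader.close();
        }
    }

    private static void process(String path, String md5) {
        // placeholder for whatever per-image calculation is needed
    }
}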

If you're using SAX, but you are eating up memory, then you are doing something wrong, and there is no way we can tell you what you are doing wrong without seeing your code.
I suggest using JVisualVM to get a heap dump and see what objects are using up the memory, and then investigating the part of your application that creates those objects.

Related

High volume data processing using lower RAM size without standard java libraries

In one of the technical discussions, I was asked for a solution to the scenarios below without using any standard Java libraries that handle similar situations:
How do you process an array that does not fit in the memory available to the JVM?
How do you process a big file, say 20 GB, that does not fit in the available memory?
I can possibly think of the following solutions:
Get the length of the array and process it in parts (for example, four iterations over slices of length/4).
Split the original file into multiple parts using the split command (or something similar in the respective OS). Process the individual smaller files and generate an intermediate result (for example, a data aggregation). Once the individual files are processed, either process all the intermediate result files in one go or apply the same iterative processing again, depending on their combined size (see the sketch below).
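A minimal sketch of that split/iterate idea, assuming a line-oriented file named big-file.txt and a trivial aggregation; only the running intermediate result is kept in memory:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ChunkedProcessor {
    public static void main(String[] args) throws IOException {
        long runningTotal = 0;      // intermediate result; stays tiny regardless of file size
        int chunkSize = 100_000;    // lines per chunk, tune to the available heap
        int linesInChunk = 0;

        try (BufferedReader reader = new BufferedReader(new FileReader("big-file.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                runningTotal += line.length();   // placeholder aggregation
                if (++linesInChunk == chunkSize) {
                    // chunk boundary: flush or merge intermediate state here if needed
                    linesInChunk = 0;
                }
            }
        }
        System.out.println("Aggregated result: " + runningTotal);
    }
}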
However, I would like to know if there is a better approach. Also, if such questions are inappropriate for this forum, please let me know and I will delete the question.
I did find some articles via Google, but they describe solutions already provided by libraries, hence posting it here.
One solution could be off-heap memory, which can be backed by direct memory or by disk. You can get some inspiration here:
https://mechanical-sympathy.blogspot.com/2012/10/compact-off-heap-structurestuples-in.html
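For illustration, a tiny sketch of plain off-heap storage with a direct ByteBuffer (no extra libraries; the record layout is made up): the data lives outside the Java heap, so it adds no GC pressure:

import java.nio.ByteBuffer;

public class OffHeapExample {
    public static void main(String[] args) {
        int records = 1_000_000;
        int recordSize = Long.BYTES;   // one long per record in this toy layout
        ByteBuffer offHeap = ByteBuffer.allocateDirect(records * recordSize);

        // write records into native (off-heap) memory
        for (int i = 0; i < records; i++) {
            offHeap.putLong(i * recordSize, i * 2L);
        }

        // random-access read without creating any objects per record
        long value = offHeap.getLong(42 * recordSize);
        System.out.println(value);   // prints 84
    }
}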

Fastest way to create a trie (JSON) from a 4GB file, using only 1GB of ram?

Perhaps I'm doing this the wrong way:
I have a 4GB (33million lines of text) file, where each line has a string in it.
I'm trying to create a trie -> The algorithm works.
The problem is that Node.js has a process memory limit of 1.4GB, so the moment I process 5.5 million lines, it crashes.
To get around this, I tried the following:
Instead of 1 Trie, I create many Tries, each having a range of the alphabet.
For example:
aTrie ---> all words starting with a
bTrie ---> all words starting with b...
etc...
But the problem is, I still can't keep all the objects in memory while reading the file, so each time I read a line, I load / unload a trie from disk. When there is a change I delete the old file, and write the updated trie from memory to disk.
This is SUPER SLOW! Even on my macbook pro with SSD.
I've considered writing this in Java, but then the problem of converting Java objects to JSON comes up (same problem with C++ etc.).
Any suggestions ?
You may extend the memory limit that the node process uses by specifying the option below (the size is in MB):
node --max_old_space_size=4096
for more options please see:
https://github.com/thlorenz/v8-flags/blob/master/flags-0.11.md
Instead of using 26 Tries you could use a hash function to create an arbitrary number of sub-Tries. This way, the amount of data you have to read from disk is limited to the size of your sub-Trie that you determine. In addition, you could cache the recently used sub-Tries in memory and then persist the changes to disk asynchronously in the background if IO is still a problem.
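A rough Java sketch of that sharding idea (class and method names are invented, and the disk load/evict step is left as a stub); the shard is chosen by hashing a short prefix so words sharing a prefix stay in the same sub-Trie:

import java.util.HashMap;
import java.util.Map;

public class ShardedTrie {
    static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        boolean isWord;
    }

    private final TrieNode[] shards;

    ShardedTrie(int shardCount) {
        shards = new TrieNode[shardCount];
    }

    private TrieNode shardFor(String word) {
        // hash a short prefix so words sharing it land in the same sub-Trie
        String prefix = word.substring(0, Math.min(2, word.length()));
        int index = Math.floorMod(prefix.hashCode(), shards.length);
        if (shards[index] == null) {
            // in a real implementation this is where a shard would be loaded
            // from disk (and a rarely used one evicted/persisted)
            shards[index] = new TrieNode();
        }
        return shards[index];
    }

    void insert(String word) {
        TrieNode node = shardFor(word);
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new TrieNode());
        }
        node.isWord = true;
    }

    boolean contains(String word) {
        TrieNode node = shardFor(word);
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }

    public static void main(String[] args) {
        ShardedTrie trie = new ShardedTrie(256);
        trie.insert("hello");
        System.out.println(trie.contains("hello"));   // true
        System.out.println(trie.contains("hell"));    // false
    }
}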

Processing a large (GB) file, quickly and multiple times (Java)

What options are there for processing large files quickly, multiple times?
I have a single file (min 1.5 GB, but can be upwards of 10-15 GB) that needs to be read multiple times - on the order of hundreds to thousands of times. The server has a large amount of RAM (64+ GB) and plenty of processors (24+).
The file will be sequential, read-only. Files are encrypted (sensitive data) on disk. I also use MessagePack to deserialize them into objects during the read process.
I cannot store the objects created from the file in memory - the expansion is too large (a 1.5 GB file turns into a 35 GB in-memory object array). The file can't be stored as a single byte array either (limited by Java's maximum array length of 2^31-1).
My initial thought is to use a memory mapped file, but that has its own set of limitations.
The idea is to get the file off the disk and into memory for processing.
The large volume of data is for a machine learning algorithm, that requires multiple reads. During the calculation of each file pass, there's a considerable amount of heap usage by the algorithm itself, which is unavoidable, hence the requirement to read it multiple times.
The problem you have here is that you cannot mmap() the way the system call of the same name does; the syscall can map up to 2^64 bytes, while FileChannel#map() cannot map more than Integer.MAX_VALUE bytes (about 2^31) in a single mapping.
However, what you can do is wrap a FileChannel into a class and create several "map ranges" covering all the file.
I have done "nearly" such a thing except more complicated: largetext. More complicated because I have to do the decoding process to boot, and the text which is loaded must be so into memory, unlike you who reads bytes. Less complicated because I have a define JDK interface to implement and you don't.
You can however use nearly the same technique using Guava and a RangeMap<Long, MappedByteBuffer>.
I implement CharSequence in the project above; I suggest that you implement a LargeByteMapping interface instead, from which you can read whatever parts you want; or whatever else suits you. Your main problem will be defining that interface; I suspect what CharSequence does is not what you want.
Meh, I may even have a go at it some day; largetext is quite an exciting project, and this looks like the same kind of thing, except less complicated, ultimately!
One could even imagine a LargeByteMapping implementation where a factory creates such mappings with only a small part of the file in memory and the rest on disk; such an implementation would also exploit locality: the most recently queried part of the file would be kept in memory for faster access.
See also here.
EDIT I feel some more explanation is needed here... A MappedByteBuffer will NOT EAT HEAP SPACE!!
It will eat address space only; it is nearly the equivalent of a ByteBuffer.allocateDirect(), except it is backed by a file.
And a very important distinction needs to be made here; all of the text above supposes that you are reading bytes, not characters!
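A stripped-down sketch of the "several map ranges" idea using plain NIO instead of Guava (fixed 1 GiB chunks, read-only, single-byte reads only, error handling omitted):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ChunkedFileMapping implements AutoCloseable {
    private static final long CHUNK_SIZE = 1L << 30;   // 1 GiB per mapping, safely under the int limit

    private final FileChannel channel;
    private final MappedByteBuffer[] chunks;

    public ChunkedFileMapping(String path) throws IOException {
        channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ);
        long size = channel.size();
        int nChunks = (int) ((size + CHUNK_SIZE - 1) / CHUNK_SIZE);
        chunks = new MappedByteBuffer[nChunks];
        for (int i = 0; i < nChunks; i++) {
            long offset = (long) i * CHUNK_SIZE;
            long length = Math.min(CHUNK_SIZE, size - offset);
            chunks[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
        }
    }

    // absolute read of one byte anywhere in the file
    public byte get(long position) {
        int chunk = (int) (position / CHUNK_SIZE);
        int offsetInChunk = (int) (position % CHUNK_SIZE);
        return chunks[chunk].get(offsetInChunk);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}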
Figure out how to structure the data. Get a good book about NoSQL and find the appropriate database (wide-column, graph, etc.) for your scenario. That's what I'd do. You'd not only get sophisticated query methods on your data, you could also mangle it using distributed map-reduce implementations doing whatever you want. Maybe that's what you want (you even dropped the big-data bomb).
How about creating "a dictionary" as the bridge between your program and the target file? Your program calls the dictionary, and the dictionary refers it to the big fat file.

Reading a file using a buffer

I'm not sure if I'm asking this question right, but I want to make some sort of lyrics player using subtitle files. As I also want it to be compatible with larger files (say 10,000 lines), it's not a good idea to load the whole file before you play it: that costs time and keeps unnecessary amounts of data in RAM. That's why I want to load it the way online videos do, for example (they keep a few minutes in RAM and discard what has already been played, all while playing). I believe this is called buffering.
My question is: are there any pre-made I/O classes in Java that allow this sort of thing? I know a lot of classes with "buffer" in their name, but I have little to no idea what they do or how they differ from the classes without "buffer" in their name.
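No answer is reproduced here, but as a rough sketch of what the standard library already offers: BufferedReader keeps only a small internal buffer (8 KB by default) and hands you one line at a time, so you can keep just the window of lyrics you currently need (the file name and window size below are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

public class LyricsWindow {
    public static void main(String[] args) throws IOException {
        Deque<String> window = new ArrayDeque<>();
        int windowSize = 50;   // keep only the lines currently needed for display

        try (BufferedReader reader = new BufferedReader(new FileReader("lyrics.srt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                window.addLast(line);
                if (window.size() > windowSize) {
                    window.removeFirst();   // discard what has already been "played"
                }
                // render / schedule the current window here
            }
        }
    }
}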

Are all .class files in my Java application loaded into memory after application start?

I am making an app for Android; in my Activity I need to load an array of about 10,000 strings. Loading them from the database was slow, so I decided to put them directly into .java files (as private fields). I have about 20 of these classes containing string arrays, and my question is: are all the classes loaded into memory when my application starts? If so, the Activity that needs these strings would load quickly, but the application as a whole would have a slow start...
Is there another way to load a 10,000-string array from a file very quickly?
UPDATE:
Why do I need these strings? My Android app lets you find "journeys" in Prague's public transit - you choose a departure stop and an arrival stop and it finds your journey (have a look here). My app has a suggestions feature - you enter the letter "c" as your departure stop and a suggestions ListView appears with stops starting with "c". The strings are needed for these suggestions. Fetching the suggestions from the database is slow (about 400 ms on a G1).
First, 400ms to perform a simple database query is really slow. So slow that I'd suspect that there is some problem in your database schema (e.g. indices) or your database connection configuration.
But if you are serious about not using a database, there are a couple of alternatives to what you are currently doing:
Arrange that the classes containing the arrays are lazily loaded as required, using Class.forName(...). If you implement it right, it should be possible for the garbage collector to reclaim the classes after they have been loaded and the strings have been added to your primary data structure.
Turn the 10,000 Strings into a flat file and put the file into your app's JAR file. Then use Class.getResourceAsStream(...) to open the file and read it into the in-memory array (see the sketch after this list).
As above, but using an indexed file and replacing the array with a data structure that lets you read Strings from the file lazily. (This will be a bit more complicated, but if you are worried about the memory consumed by the 10,000 Strings, it will help address that.)
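A small sketch of option 2, assuming a bundled resource named /stops.txt (the name is made up): the strings live in a flat file inside the JAR and are read into an array only when this method is called:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StopNames {
    static String[] load() throws IOException {
        // "/stops.txt" is assumed to be bundled alongside the classes in the JAR
        try (InputStream in = StopNames.class.getResourceAsStream("/stops.txt");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            List<String> stops = new ArrayList<>(10_000);
            String line;
            while ((line = reader.readLine()) != null) {
                stops.add(line);
            }
            return stops.toArray(new String[0]);
        }
    }
}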
A class is loaded only when it is first referenced.
Though you need an array of 10,000 entries, you may not need them all at once. This is where the concept of paging comes in; this link indicates that paging is often done on Android. Initially keep only a small part of the array in memory and, as you need more, load it in and unload any previous data that is no longer wanted.
For example, in any table the user sees at best around 50 records at once and then has to scroll (assuming the screen is not the size of an IMAX movie theatre). When he scrolls, load the next chunk of data and unload whatever is now invisible to the user.
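A bare-bones sketch of that paging idea; loadPage is a placeholder for whatever reads one block of rows from the file or database:

import java.util.HashMap;
import java.util.Map;

public class PagedStrings {
    private static final int PAGE_SIZE = 50;          // roughly one screenful
    private final Map<Integer, String[]> cache = new HashMap<>();

    String get(int index) {
        int page = index / PAGE_SIZE;
        String[] rows = cache.get(page);
        if (rows == null) {
            cache.clear();                            // crude eviction: keep only the visible page
            rows = loadPage(page);
            cache.put(page, rows);
        }
        return rows[index % PAGE_SIZE];
    }

    private String[] loadPage(int page) {
        // placeholder: read rows [page*PAGE_SIZE, page*PAGE_SIZE + PAGE_SIZE) from disk/DB
        String[] rows = new String[PAGE_SIZE];
        for (int i = 0; i < PAGE_SIZE; i++) {
            rows[i] = "row " + (page * PAGE_SIZE + i);
        }
        return rows;
    }
}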
When is a Type Loaded? This is a surprisingly tricky question to answer, due in large part to the significant flexibility afforded by the JVM spec to JVM implementations. Loading must be performed before linking, and linking must be performed before initialization. The VM spec does stipulate the timing of initialization: it strictly requires that a type be initialized on its first active use (see Appendix A for a list of what constitutes an "active use"). This means that loading (and linking) of a type MUST be performed at or before that type's first active use.
From http://www.developer.com/java/other/article.php/2248831/Java-Class-Loading-The-Basics.htm
I don't think you will be happy maintaining 10K strings hardcoded in Java files.
Rather, check whether you are using the right database for your problem and whether your indices are set up correctly. A missing or wrong index can cause really poor performance.
Additionally, you should limit the number of results returned by the query, but make sure you don't fetch the entries one by one.
If nothing fits, you can still preload the strings from the database at startup.
You could preload, say, 10 entries for each character. When a character is keyed in, you can preload the entries starting with that character followed by the next one, and so on.
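As a hedged Android sketch of "query by prefix and limit the results" (table and column names are assumptions): a LIKE query with a LIMIT, backed by an index on the name column, is usually what makes suggestion lookups fast:

import android.database.Cursor;
import android.database.sqlite.SQLiteDatabase;
import java.util.ArrayList;
import java.util.List;

public class StopSuggestions {
    // assumes a table "stops" with an indexed TEXT column "name"
    static List<String> suggest(SQLiteDatabase db, String prefix) {
        List<String> result = new ArrayList<>();
        Cursor cursor = db.query(
                "stops",
                new String[] {"name"},
                "name LIKE ?",
                new String[] {prefix + "%"},
                null, null,
                "name ASC",
                "10");                 // never fetch more than the list will show
        try {
            while (cursor.moveToNext()) {
                result.add(cursor.getString(0));
            }
        } finally {
            cursor.close();
        }
        return result;
    }
}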
