I am using the latest POI 3.5 for Excel reading. I have MS Office 2007 installed, and for that POI provides XSSF for processing the data.
For 15,000 lines of data it executes properly, but when the data grows to 30,000, 100,000 or 200,000 lines, it throws a Java heap space exception.
Code is below :
FileInputStream UATinput = new FileInputStream(UATFilePath);
BufferedInputStream uatBufferedInputStream = new BufferedInputStream(UATinput);
XSSFWorkbook UATworkbook = new XSSFWorkbook(uatBufferedInputStream);
I am getting the exception on the last line (Java heap space).
I have increased the heap size using -Xms256m -Xmx1536m, but for larger files it still throws the Java heap space exception.
Can anybody help me out with this exception for the XSSFWorkbook?
Instead of reading the entire file into memory, try using the eventusermodel API.
This is a very memory-efficient way to read large files. It works on the principle of a SAX parser (as opposed to DOM), in the sense that it calls callback methods when particular data structures are encountered. It can get a little tricky, as it expects you to know the nitty-gritty of the underlying data format.
Here you can find a good tutorial on this topic
Hope this helps!
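For reference, a minimal sketch of that approach (the file name is a placeholder, and the empty DefaultHandler stands in for your own ContentHandler; real code would also pull in the shared strings table via XSSFReader.getSharedStringsTable(), as the tutorial shows):

import java.io.InputStream;
import java.util.Iterator;

import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EventModelSketch {
    public static void main(String[] args) throws Exception {
        OPCPackage pkg = OPCPackage.open("big.xlsx");
        XSSFReader reader = new XSSFReader(pkg);
        XMLReader sheetParser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The handler receives SAX callbacks for every XML element in the sheet;
        // this no-op DefaultHandler is a placeholder for your own row/cell logic
        sheetParser.setContentHandler(new DefaultHandler());

        Iterator<InputStream> sheets = reader.getSheetsData();
        while (sheets.hasNext()) {
            try (InputStream sheet = sheets.next()) {
                sheetParser.parse(new InputSource(sheet));
            }
        }
        pkg.close();
    }
}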
It's true, guys: after switching to the eventusermodel API, my performance was awesome. Please write to me if you have any issues: djeakandane#gmail.com
If you use XSSFWorkbook, POI has to create a memory model containing your whole Excel file, thus a huge memory consumption. Maybe you could use the Event API which isn't as simple as the user API but allows lower memory consumption.
By the way you could also set a bigger value for -Xmx...
The other thing to watch in your own code is how many objects you are "new"ing. If you are creating a lot of objects as you read through cells, it could exhaust the heap as well. Make sure you are being careful with the number of objects you create.
As others have said, your best bet is to switch over to the Event API.
One thing that'll make a small difference though is to not wrap your file in an input stream! XSSF will happily accept a File as the input, and that's a lower memory footprint than an InputStream. That's because POI needs random access to the contents, and with an input stream the only way to do that is to buffer the whole contents into memory. With a File, it can just seek around. Using a File rather than an InputStream will save you a little over the size of the file worth of memory.
If you can, you should pass a File. If memory is tight, write your InputStream to a file and use that!
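A minimal sketch of the File-based route (the file name is a placeholder; this assumes a POI version where OPCPackage can be opened from a File):

import java.io.File;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class OpenFromFile {
    public static void main(String[] args) throws Exception {
        // Opening from a File lets POI seek around in it instead of buffering everything into memory
        OPCPackage pkg = OPCPackage.open(new File("big.xlsx"), PackageAccess.READ);
        XSSFWorkbook workbook = new XSSFWorkbook(pkg);
        // ... read from the workbook ...
        pkg.close();
    }
}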
Try this one: -Xms256m -Xmx512m.
You should really look into processing the XML data grid behind the XLSX format directly. That will free you from the heap space problems.
Here is the tutorial:
Check both links below.
http://poi.apache.org/spreadsheet/how-to.html
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi/xssf/eventusermodel/examples/FromHowTo.java
Some basic knowledge of parsing and the use of the SAX-XML project is required.
The JVM runs with a fixed amount of available memory. Once this memory is exceeded you will receive "java.lang.OutOfMemoryError". The JVM tries to make an intelligent choice about the available memory at startup (see the Java settings for details), but you can override the default with the following settings.
To tune performance you can use certain parameters in the JVM.
-Xms1024m - Sets the minimum available memory for the JVM to 1024 megabytes.
-Xmx1800m - Sets the maximum available memory for the JVM to 1800 megabytes. The Java application cannot use more heap memory than defined via this parameter.
If you start your Java program from the command line use for example the following setting: java -Xmx1024m YourProgram.
You can use SXSSF, a low-memory-footprint streaming API built on top of XSSF: http://poi.apache.org/spreadsheet/how-to.html#sxssf
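Note that SXSSF streams on the write side: it keeps memory low when you are generating large .xlsx files rather than reading them. A minimal sketch (the file name, sheet name and cell values are placeholders):

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfSketch {
    public static void main(String[] args) throws Exception {
        // Keep only 100 rows in memory at a time; older rows are flushed to a temporary file
        SXSSFWorkbook workbook = new SXSSFWorkbook(100);
        Sheet sheet = workbook.createSheet("data");
        for (int r = 0; r < 200000; r++) {
            Row row = sheet.createRow(r);
            for (int c = 0; c < 10; c++) {
                Cell cell = row.createCell(c);
                cell.setCellValue("row " + r + " col " + c);
            }
        }
        try (FileOutputStream out = new FileOutputStream("big-out.xlsx")) {
            workbook.write(out);
        }
        workbook.dispose(); // removes the temporary files backing the streamed rows
    }
}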
Related
I have a Java project with a huge set of XML files (>500). Reading these files at runtime leads to performance issues.
Is there an option to load all the XML files to RAM and read from there instead of the disk?
I know there are products like RamDisk but this one is a commercial tool.
Can I copy XML files to main memory and read from main memory using any existing Java API / libraries?
I would first try memory-mapped files, as provided by RandomAccessFile and FileChannel in the standard Java library. This way the OS will be able to keep the frequently used file content in memory, effectively achieving what you want.
You can use in-memory databases to store the intermediate files (the XML files). This gives you the speed of RAM with the convenience of a database.
For reference use the following links:
http://www.mcobject.com/in_memory_database
Usage of H2 as an in-memory database:
http://www.javatips.net/blog/2014/07/h2-in-memory-database-example
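A minimal sketch of the H2 route (this assumes the H2 driver is on the classpath; the table and column names are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2InMemorySketch {
    public static void main(String[] args) throws Exception {
        // jdbc:h2:mem:xmlcache lives purely in RAM and disappears when the last connection closes
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:xmlcache", "sa", "")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE xml_files (name VARCHAR(255) PRIMARY KEY, content CLOB)");
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO xml_files (name, content) VALUES (?, ?)")) {
                ins.setString(1, "example.xml");
                ins.setString(2, "<root>...</root>"); // in real code, read the file from disk once
                ins.executeUpdate();
            }
            try (PreparedStatement sel = con.prepareStatement(
                    "SELECT content FROM xml_files WHERE name = ?")) {
                sel.setString(1, "example.xml");
                try (ResultSet rs = sel.executeQuery()) {
                    while (rs.next()) {
                        String xml = rs.getString(1); // parse this instead of re-reading the disk
                    }
                }
            }
        }
    }
}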
Use the java.io.RandomAccessFile class. It behaves like a large array of bytes stored in the file system. Instances of this class support both reading and writing to a random access file.
I would also suggest memory-mapping the file, so its contents are served from the OS page cache instead of being copied into the heap.
RandomAccessFile file = new RandomAccessFile("wiki.txt", "r");
FileChannel channel = file.getChannel();
// Map read-only, matching the "r" mode the file was opened with
MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
And then you can read the buffer as usual.
Have you considered creating an object structure for these files and serializing them? Java object serialization and deserialization is much faster than parsing XML. This assumes, again, that these 500 or so XML files don't get modified between reads.
here is an article which talks about serializing and deserializing.
If the concern is loading file content into memory, consider the ByteArrayInputStream and ByteArrayOutputStream classes, or even a ByteBuffer; these can hold the bytes in memory.
Java object serialization/deserialization is not faster than XML writing and parsing in general. When large numbers of objects are involved Java serialization/deserialization can actually be very inefficient, because it tracks each individual object (so that repeated references aren't serialized more than once). This is great for networks of objects, but for simple tree structures it adds a lot of overhead with no gains.
Your best approach is probably to just use a fast technique for processing the XML (such as javax.xml.stream.XMLStreamReader). Unless the files are huge, 30-40 seconds to load the XML files is way out of line; you're probably using an inefficient approach to processing the XML, such as loading them into a DOM. You can also try reading multiple files in parallel (for example, by using Java 8 parallel Streams).
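A minimal StAX sketch (the file name is a placeholder and the element handling is left to you):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = Files.newInputStream(Paths.get("example.xml"))) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String element = reader.getLocalName();
                    // pull out only the elements and attributes you actually need here
                }
            }
            reader.close();
        }
    }
}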
Looks like your main issue is large number of files and RAM is not an issue. Can you confirm?
Is it possible that you do a preprocessing step where you append all these files using some kind of separator and create a big file? This way you can increase the block size of your reads and avoid the performance penalty of disk seeks.
Have you thought about compressing the XML files and reading in those compressed XML files? Compressed XML could be as little as 3-5% the size of the original or better. You can uncompress it when it is visible to users and then store it compressed again for further reading.
Here is a library I found that might help:
zip4j
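zip4j is one option; the GZIP streams built into the JDK can do the same job, as in this sketch (file names are placeholders):

import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipXmlSketch {
    public static void main(String[] args) throws Exception {
        // Compress the XML file once ...
        try (InputStream in = Files.newInputStream(Paths.get("example.xml"));
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(Paths.get("example.xml.gz")))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        // ... then hand the decompressing stream to your XML parser instead of a plain FileInputStream
        try (InputStream compressed = new GZIPInputStream(Files.newInputStream(Paths.get("example.xml.gz")))) {
            // parse from 'compressed' here
        }
    }
}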
It all depends on whether you read the data more than once.
Assume we use some sort of Java-based RAM disk (it would actually be some sort of buffer or byte array).
Further assume that processing the data takes less time than reading it. You have to read it at least once, so it would make no difference whether you first read it from disk into memory and then process it from memory, or simply process it while reading.
If you would read a file more than once, you could read all the files into memory (various options, Buffer, Byte-Arrays, custom FileSystem, ...).
In case processing takes longer than reading (which seems not to be the case), you could pre-fetch the files from disk using a separate thread - and process the data from memory using another thread.
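A sketch of that pre-fetching idea (the directory name is a placeholder; one thread reads the files into a bounded queue while the main thread processes them):

import java.io.File;
import java.nio.file.Files;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PrefetchSketch {
    public static void main(String[] args) throws Exception {
        File[] xmlFiles = new File("xml-dir").listFiles(); // placeholder directory of XML files
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(10); // bounded, so reading can't race too far ahead

        // Reader thread: pre-fetches file contents from disk
        Thread reader = new Thread(() -> {
            try {
                for (File f : xmlFiles) {
                    queue.put(Files.readAllBytes(f.toPath()));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        reader.start();

        // Main thread: processes the data from memory while the reader keeps fetching
        for (int i = 0; i < xmlFiles.length; i++) {
            byte[] content = queue.take();
            // parse 'content' here
        }
        reader.join();
    }
}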
Friends,
In my application I came across a scenario where the user may request a report download as a flat file, which may have a maximum of 17 lakh records (around 650 MB) of data. During this request my application server either stops serving other threads or throws an out of memory exception.
As of now I am iterating through the result set and printing it to the file.
When I googled this, I came across an API named OpenCSV. I tried that too, but I didn't see any improvement in performance.
Please help me out on this.
Thanks for the quick response guys. Here I added my code snippet:
try {
    response.setContentType("application/csv");
    PrintWriter dout = response.getWriter();
    while (rs.next()) {
        dout.print(dataRow); // here I am printing my ResultSet tuples into the flat file (dataRow stands for the row's column values)
        dout.print("\r\n");
        dout.flush();
    }
} catch (Exception e) {
    // error handling omitted
}
OpenCSV will cleanly deal with the eccentricities of the CSV format, but a large report is still a large report. Take a look at the specific memory error; it sounds like you need to increase the heap or the max PermGen space (it will depend on the error). Without any adjusting, the JVM will only occupy a fixed amount of RAM (in my experience this number is 64 MB).
If you only stream the data from the ResultSet to the file without using big buffers, this should be possible. But maybe you are first collecting the data in a growing list before sending it to the file? You should investigate this.
Please specify your question more otherwise we have to speculate.
The CSV format isn't limited by memory; well, maybe only while pre-populating the data for the CSV, but that can be done efficiently as well, for example by querying subsets of rows from the DB using LIMIT/OFFSET and immediately writing them to the file, instead of hauling the entire DB table contents into Java's memory before writing any line. Also note that the Excel limit on the number of rows in one "sheet" has increased to about one million in the newer format.
Most decent DBs have an export-to-CSV function which can undoubtedly do this task much more efficiently. In the case of MySQL, for example, you can use SELECT ... INTO OUTFILE for this.
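A sketch of that streaming approach (the query, column names and fetch size are made up; the point is that rows go straight from the ResultSet to the writer):

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvExportSketch {
    // Streams rows straight from the ResultSet to the writer instead of collecting them in memory first
    static void export(Connection con, PrintWriter out) throws Exception {
        try (PreparedStatement ps = con.prepareStatement("SELECT col1, col2, col3 FROM report_data")) {
            ps.setFetchSize(1000); // hint to the driver to fetch rows in batches
            // (for MySQL, a fetch size of Integer.MIN_VALUE is what actually enables row-by-row streaming)
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.print(rs.getString(1));
                    out.print(',');
                    out.print(rs.getString(2));
                    out.print(',');
                    out.print(rs.getString(3));
                    out.print("\r\n");
                }
            }
            out.flush();
        }
    }
}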
I am reading an Excel sheet into R with XLConnect. It works very well. However, if I re-run the command (after changing values in the Excel file, for example), the function runs out of memory.
The file/sheet I am reading has 18 columns and 363 rows of numeric data.
The error message is
Error: OutOfMemoryError (Java): Java heap space
which appears on the second (identical) run of a readWorksheetFromFile call. I am trying to produce an MWE by repeatedly running the input call from this example, but the error does not seem to be reproducible with that file.
The Excel file I am using has many interconnected sheets and is about 3 MB. The sheet that I am reading is also linked to others, but I have set useCachedValues = TRUE.
It seems to me that, after executing the first call, the Java memory is not cleared. The second call then attempts to fill more data into memory, which causes the call to fail. Is it possible to force a garbage collection on the Java memory? Currently, the only solution is restarting the R session, which is not practical for my clients.
I know that expanding the Java memory might solve this, but that strikes me as a clumsy solution. I would prefer to find a way to dump the memory from previous calls.
I have also tried using the more verbose loadWorkbook and readWorksheet functions. The same error occurs.
Let me know if there is any other useful information you may require!
You should have a look at
?xlcFreeMemory
and
?xlcMemoryReport
which are also mentioned in the XLConnect package documentation, if you are doing multiple runs and want to clean up in between.
How do I load a file into main memory?
I read the files using
BufferedReader buf = new BufferedReader(new FileReader(fileName));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
What is the advantage of loading the file directly into memory?
How do we do that in Java?
I found some examples on Scanner or RandomAccessFile methods. Do they load the files into memory? Should I use them? Which of the two should I use ?
Thanks in advance!!!
BufferedReader buf = new BufferedReader(new FileReader(fileName));
I presume that this is reading the file line by line from the disk. What is the advantage of this?
Not exactly. It is reading the file in chunks whose size is the default buffer size (8k bytes I think).
The advantage is that you don't need a huge heap to read a huge file. This is a significant issue since the maximum heap size can only be specified at JVM startup (with Hotspot Java).
You also don't consume the system's physical / virtual memory resources to represent the huge heap.
What is the advantage of loading the file directly into memory?
It reduces the number of system calls, and may read the file faster. How much faster depends on a number of factors. And you have the problem of dealing with really large files.
How do we do that in Java?
Find out how large the file is.
Allocate a byte (or character) array big enough.
Use the relevant read(byte[], int, int) or read(char[], int, int) method to read the entire file.
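A sketch of those three steps (the file name is a placeholder; DataInputStream.readFully handles the read loop for you, and on Java 7+ Files.readAllBytes does the same thing in one call):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

public class ReadWholeFile {
    public static void main(String[] args) throws Exception {
        File file = new File("example.txt");            // 1. find out how large the file is
        byte[] content = new byte[(int) file.length()]; // 2. allocate a big enough array
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(content);                      // 3. read the entire file into the array
        }
        // 'content' now holds the whole file in memory
    }
}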
You can also use a memory-mapped file ... but that requires using the Buffer APIs which can be a bit tricky to use.
I found some examples on Scanner or RandomAccessFile methods. Do they load the files into memory?
No, and no.
Should I use them? Which of the two should I use ?
Do they provide the functionality that you require? Do you need to read / parse text-based data? Do you need to do random access on a binary data?
Under normal circumstances, you should choose your I/O APIs based primarily on the functionality that you require, and secondarily on performance considerations. Using a BufferedInputStream or BufferedReader is usually enough to get acceptable* performance if you intend to parse the data as you read it. (But if you actually need to hold the entire file in memory in its original form, then a BufferedXxx wrapper class actually makes reading a bit slower.)
* - Note that acceptable performance is not the same as optimal performance, but your client / project manager probably would not want you to waste time writing code that performs optimally ... if this is not a stated requirement.
If you're reading in the file and then parsing it, walking from beginning to end once to extract your data, then not referencing the file again, a buffered reader is about as "optimal" as you'll get. You can "tune" the performance somewhat by adjusting the buffer size -- a larger buffer will read larger chunks from the file. (Make the buffer a power of 2 -- eg 262144.) Reading in an entire large file (larger than, say, 1mb) will generally cost you performance in paging and heap management.
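For example, the buffer size can be set through the second constructor argument (the file name and the 256 KB size are just illustrative):

import java.io.BufferedReader;
import java.io.FileReader;

public class TunedBufferSketch {
    public static void main(String[] args) throws Exception {
        // The second constructor argument sets the buffer size (here 256 KB, a power of two)
        try (BufferedReader reader = new BufferedReader(new FileReader("big.txt"), 262144)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process each line as you go
            }
        }
    }
}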
I have a servlet that users post an XML file to.
I read that file using:
String xml = request.getParameter("...");
Now say that the XML document is 10 KB; since I created the variable xml, I am now using 10 KB of memory for that variable, correct?
Now I need to parse that XML (using Xerces), and I converted it to an input stream when passing it to the SAX parser's parse method (http://docs.oracle.com/javase/1.5.0/docs/api/javax/xml/parsers/SAXParser.html).
So if I convert a string to a stream, is that doubling my memory usage?
Need some clarifications on this.
If I connect to my process with VisualVM or JConsole, can I see whether I am using additional memory as I step through the code in my debugger?
I want to make sure I am not doing this inefficiently, as this endpoint will be hit hard.
10,000 bytes of text generally turns into 20 KB, because a Java String stores each character as 2 bytes.
When you process text you generally need 2-10x more memory, as you will be doing something with that information, such as creating a data structure.
This means you could need 200 KB. However, given that on a PC this represents about a cent's worth of memory, I wouldn't worry about it normally. If you have a severely resource-limited device, I would consider moving the processing to another device, like a server.
I think you might be optimizing your code before actually seeing it running. The JVM is very good and fast to recover unused memory.
But answering your question: String xml = request.getParameter("..."); doesn't double the memory, it just allocates an extra 4 or 8 bytes (depending on whether the JVM is using compressed pointers) for the reference.
Parsing the XML is different: the SAX parser is very memory efficient, so it won't use too much memory. I think it is around 20 bytes per Handler, plus any instance variables that you have ... and obviously any extra objects that you might generate in the handler.
So the code you have looks like it's as memory efficient as it can get (depending of what you have in your handlers, of course).
Unless you're working on embedding that code in a device or running it 100k times a second, I would suggest you not to optimize anything unless you're sure you need to optimize it. The JVM has some crazy advanced logic to optimize code and the garbage collector is very fast to recover short lived objects.
If users can post massive files back to your servlet, then it is best not to use the getParameter() methods and instead handle the stream directly - the Apache Commons FileUpload library.
That way you can use the SAX Parser on the InputStream (and the whole text does not need to be loaded into memory before processing) - as you would have to do with the String based solution.
This approach scales well and requires only a tiny amount of memory per request compared to the String xml = getParameter(...) solution.
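A sketch of that approach using the Commons FileUpload streaming API together with a SAX parser (this assumes commons-fileupload is on the classpath; the empty DefaultHandler is a stand-in for your own handler):

import java.io.InputStream;

import javax.servlet.http.HttpServletRequest;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.apache.commons.fileupload.FileItemIterator;
import org.apache.commons.fileupload.FileItemStream;
import org.apache.commons.fileupload.servlet.ServletFileUpload;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingUploadSketch {
    // Parses the uploaded XML directly from the request stream, never holding it all as a String
    void handleUpload(HttpServletRequest request) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        ServletFileUpload upload = new ServletFileUpload();
        FileItemIterator items = upload.getItemIterator(request);
        while (items.hasNext()) {
            FileItemStream item = items.next();
            if (!item.isFormField()) {
                try (InputStream stream = item.openStream()) {
                    saxParser.parse(new InputSource(stream), new DefaultHandler()); // plug in your handler
                }
            }
        }
    }
}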
You will write code like this:
saxParser.parse(new InputSource(new StringReader(xml)), handler); // handler is your DefaultHandler
You first need to create a StringReader around xml. This won't double your memory usage; the StringReader class merely wraps the xml variable and returns it character by character when requested.
InputSource is even thinner - it simply wraps provided Reader or InputStream. So in short: no, your String won't be copied, your implementation is pretty good.
No, you won't get 2 copies of the string, doubling your memory. Other things might double that memory, but the string itself won't be duplicated.
Yes, you can connect VisualVM or JConsole to see what happens to memory and threads.