Java - read, process, write with a large Excel file

I have a large spreadsheet. It has 10 sheets, each with 1m rows. With Java, I need to run an algorithm for each row, return a value for each row, and insert it back into the Excel file.
My idea was to load the file into RAM, do the calculation for each row, store the results in a list, and insert them back into Excel in order, but I didn't anticipate the issues of dealing with data of this size.
I tried XSSF, and it wasn't able to load such a large file. After waiting for a few hours it gave me an OOM error.
I tried increasing the heap in Run -> Run Configurations -> Arguments, and in Control Panel -> Java. It didn't work.
I tried using the following StreamingReader code and it didn't work:
// excel-streaming-reader (com.monitorjbl:xlsx-streamer) on top of Apache POI
import java.io.FileInputStream;
import org.apache.poi.ss.usermodel.Workbook;
import com.monitorjbl.xlsx.StreamingReader;

FileInputStream in = new FileInputStream("D:\\work\\calculatepi\\sampleresult.xlsx");
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows kept in memory at any time
        .bufferSize(4096)     // read-ahead buffer size for the InputStream
        .open(in);            // returns a streaming, read-only Workbook
I'm really at a loss and not sure what to do. Is there no easy way to do this?

It is not only about the configuration of that library. It is also about the memory that you give to your JVM! Try increasing the heap space of the JVM, see here for example.
Beyond that: I think you should do two things:
make experiments with smaller sheets. Create one that has only 100 rows, then maybe 10K, then 100K. Measure the memory consumption, and from there
see if there are other APIs/libraries that allow you to read/write individual rows without pulling the whole file into memory (a sketch follows below)
and if none of that works, maybe you have to use a completely different design, such as having some sort of "service". You then write some VB script code that you run inside Excel, which simply calls that service for each row to fetch the result. Or, ideally: do not misuse Excel as a database. This is similar to using a sports car to transport a huge amount of goods just because you already have that sports car; it would still be more appropriate to get yourself a truck instead. In other words: consider moving your data into a real database. In the long run, everything you do will be "easier" then!
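For the "read/write individual rows" route, one combination that should stay within a small, fixed amount of memory is the streaming reader from the question for input plus POI's SXSSFWorkbook for output. A rough sketch, assuming the excel-streaming-reader library (com.monitorjbl:xlsx-streamer), whose open() returns a regular POI Workbook in recent versions. Note that it writes the results to a new file rather than inserting them into the original, since streaming APIs cannot edit a workbook in place; processRow and the output path are placeholders:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import com.monitorjbl.xlsx.StreamingReader;

public class StreamCalc {
    public static void main(String[] args) throws Exception {
        SXSSFWorkbook output = new SXSSFWorkbook(100);   // keep only ~100 rows per sheet in memory while writing
        try (InputStream in = new FileInputStream("D:\\work\\calculatepi\\sampleresult.xlsx");
             Workbook input = StreamingReader.builder()
                     .rowCacheSize(100)
                     .bufferSize(4096)
                     .open(in)) {

            for (Sheet inSheet : input) {                 // sheets are streamed one at a time
                Sheet outSheet = output.createSheet(inSheet.getSheetName());
                int rowNum = 0;
                for (Row row : inSheet) {                 // rows are streamed, never all in memory
                    double result = processRow(row);      // per-row algorithm goes here
                    outSheet.createRow(rowNum++).createCell(0).setCellValue(result);
                }
            }
            try (OutputStream out = new FileOutputStream("D:\\work\\calculatepi\\results.xlsx")) {
                output.write(out);
            }
        } finally {
            output.dispose();   // delete the temporary files SXSSF spilled rows to
        }
    }

    private static double processRow(Row row) {
        // placeholder: read whichever cells the algorithm needs
        Cell first = row.getCell(0);
        return first == null ? 0.0 : first.getNumericCellValue();
    }
}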

Related

How to handle a large XML file in Java (around 5 GB)

My application needs to use data from an XML file which is up to 5 GB in size. I load the data from the XML into Image classes. The Image class has many attributes, like path, name, MD5 hash, and other information like that.
The 5 GB file holds data for around 50 million images. When I parse the XML, all of that data is loaded into the app and the same number of Image objects is created, and I perform different operations and calculations on them.
My problem is that when I parse such a huge file, my memory gets eaten up. I guess all the data is being loaded into RAM. Due to the complexity of the code, I'm unable to provide the whole code. Is there an efficient way to handle such a huge number of objects? I did research all night but didn't get anywhere. Can someone point me in the right direction?
Thanks
You need some sort of pipeline to pass the data on to its actual destination without ever storing it all in memory at once.
I don't know how your code is doing the parsing, but you don't need to store all of the data in memory.
Here is a very good answer on how to implement reading of large XML files.
If you're using SAX, but you are eating up memory, then you are doing something wrong, and there is no way we can tell you what you are doing wrong without seeing your code.
I suggest using JVisualVM to get a heap dump and see what objects are using up the memory, and then investigating the part of your application that creates those objects.
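As an illustration of the streaming idea, here is a minimal StAX sketch (SAX works the same way in spirit) that visits one image element at a time without ever holding all 50 million of them; the element and attribute names (image, path, md5), the file name, and the process method are assumptions, not taken from the question:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamImages {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (FileInputStream in = new FileInputStream("images.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "image".equals(reader.getLocalName())) {
                    // read only the attributes of this one element...
                    String path = reader.getAttributeValue(null, "path");
                    String md5 = reader.getAttributeValue(null, "md5");
                    process(path, md5);   // ...do the calculation, then let the data be garbage collected
                }
            }
            reader.close();
        }
    }

    private static void process(String path, String md5) {
        // placeholder for whatever operation/calculation you do per image
    }
}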

How to speed up Excel reading/writing

I am using Apache POI to read/write an Excel file for my company as an intern here. My program goes through the Excel file, which is a big grid with computer names across the top row and user names down the left column: 240 computers and 342 users. sheet[computer][user] is 0 in all cells, and the program calls PsLoggedOn for each computer, takes the username(s) currently logged on, and increments their 0, so after running it for a month it shows who is logged in the most to each computer. So far it runs in about 25 minutes, since I use a socket to check socket.connect before actually calling PsLoggedOn.
Without reading or writing to the Excel file at all, just making the PsLoggedOn calls to each computer takes about 9 minutes, so the reading and writing apparently take 10-15 minutes. The thing is, I am calling PsLoggedOn on a computer, then opening the Excel file to find the [x][y] spot of that [computer][user], writing a +=1 to it, and then closing it. So I suppose the reason it is taking this long is because it opens and closes the file so much? I could be completely wrong. But I can't think of a way to make this faster by reading/writing all at once and only opening and closing the file once. Any ideas?
Normally Apache POI is very fast. If you are running into issues, you might want to check the points below:
POI's logging might be on; you need to turn it off.
You can add this -D option to your JVM settings to do so:
-Dorg.apache.poi.util.POILogger=org.apache.poi.util.NullLogger
You may be setting your VM heap to a low value; try increasing it.
Prefer XLS over XLSX.
Get HSQLDB (or another in-process database, but this is what I've used in the past). Add it to your build.
You can now create either a file-based or in-memory database (I would use file-based, as it lets you persist state between runs) simply by using JDBC. Create a table with the columns User, Computer, Count
In your reading thread(s), INSERT or UPDATE your table whenever you find a user with PSLoggedon
Once your data collection is complete, you can SELECT Computer, User, Count FROM Data ORDER BY Computer, User (or switch the order depending on your Excel file layout), loop through the ResultSet, and write the results directly, as sketched below.
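A minimal JDBC sketch of that HSQLDB approach, assuming a file-based database; the table and column names are illustrative (User and Count are renamed to avoid clashing with SQL keywords), and the sample computer/user values are placeholders:

import java.sql.*;

public class UsageDb {
    public static void main(String[] args) throws Exception {
        // File-based HSQLDB database; the "usagedb" files are created next to the program.
        try (Connection conn = DriverManager.getConnection("jdbc:hsqldb:file:usagedb", "SA", "");
             Statement st = conn.createStatement()) {

            st.execute("CREATE TABLE IF NOT EXISTS usage_data("      // HSQLDB 2.x syntax
                    + "computer VARCHAR(100), username VARCHAR(100), login_count INT, "
                    + "PRIMARY KEY (computer, username))");

            // Called from the reading thread(s) whenever PsLoggedOn reports a user (placeholder values).
            recordLogin(conn, "PC-042", "jsmith");

            // After collection is complete, read everything back in the order the spreadsheet needs.
            try (ResultSet rs = st.executeQuery(
                    "SELECT computer, username, login_count FROM usage_data ORDER BY computer, username")) {
                while (rs.next()) {
                    // write rs.getString(1), rs.getString(2), rs.getInt(3) into the sheet here
                }
            }
            st.execute("SHUTDOWN");   // flush the file-based database cleanly
        }
    }

    static void recordLogin(Connection conn, String computer, String user) throws SQLException {
        // Try an UPDATE first; if no row was touched, INSERT a new one.
        try (PreparedStatement upd = conn.prepareStatement(
                "UPDATE usage_data SET login_count = login_count + 1 WHERE computer = ? AND username = ?")) {
            upd.setString(1, computer);
            upd.setString(2, user);
            if (upd.executeUpdate() == 0) {
                try (PreparedStatement ins = conn.prepareStatement(
                        "INSERT INTO usage_data(computer, username, login_count) VALUES (?, ?, 1)")) {
                    ins.setString(1, computer);
                    ins.setString(2, user);
                    ins.executeUpdate();
                }
            }
        }
    }
}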
This is an old question, but from what I see:
Since you are sampling and using Excel, is it safe to assume that consistency and atomicity aren't critical? You're just estimating fractional usage and don't care if a user logged in and logged out between observations.
Is the Excel file stored over a slow network link? Opening and closing a file 240 times could add significant overhead. How about the following:
You need to open the Excel file once to get the list of computers. At that time, just snapshot the entire contents of the matrix into a Map<ComputerName, Map<UserName, Count>>. Also get a List<ComputerName> and List<UserName> to remember the row/column headings. The entire spreadsheet has fewer than 90,000 integers, so there is no need to bring in heavy database machinery.
9 minutes for 240 computers, single-threaded, is roughly 2.25 seconds per computer. Is that the expected throughput of PSLoggedOn? Can you create a thread pool and query all 240 computers at once or in a small number of rounds?
Then, parse the results, increment your map, and dump it back to the Excel file. Is there a possibility that you might see new users that were not previously in the Excel file? Those will need to be added to the Map and to List<UserName>. (A rough sketch of the map-plus-thread-pool approach follows.)
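A rough sketch of the map-plus-thread-pool idea; the pool size, the psLoggedOn wrapper, and how the map is initially populated from the spreadsheet are all placeholders:

import java.util.*;
import java.util.concurrent.*;

public class ParallelPoll {
    public static void main(String[] args) throws Exception {
        // counts: computer -> (user -> count); populated beforehand by snapshotting the spreadsheet (placeholder here)
        Map<String, Map<String, Integer>> counts = new ConcurrentHashMap<>();
        List<String> computers = new ArrayList<>(counts.keySet());   // row/column headings kept separately

        ExecutorService pool = Executors.newFixedThreadPool(20);     // 20 concurrent polls; tune as needed
        List<Future<?>> futures = new ArrayList<>();
        for (String computer : computers) {
            futures.add(pool.submit(() -> {
                for (String user : psLoggedOn(computer)) {            // hypothetical wrapper around PsLoggedOn
                    counts.computeIfAbsent(computer, c -> new ConcurrentHashMap<>())
                          .merge(user, 1, Integer::sum);              // thread-safe increment
                }
            }));
        }
        for (Future<?> f : futures) f.get();   // wait for all computers to be polled
        pool.shutdown();

        // ...now open the Excel file once and write the counts back.
    }

    private static List<String> psLoggedOn(String computer) {
        // placeholder: run PsLoggedOn (or your socket check plus the call) and parse the logged-on users
        return Collections.emptyList();
    }
}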

Parsing a 20 GB input file into an ArrayList

I need to sort a 20 GB file (which consists of random numbers) in ascending order, but I don't understand what technique I should use. I tried to use an ArrayList in my Java program, but it runs out of memory. Increasing the heap size didn't work either; I guess 20 GB is just too big. Can anybody guide me on how I should proceed?
You should use an external sorting algorithm; do not try to fit this in memory.
http://en.wikipedia.org/wiki/External_sorting
If you think it is too complex, try this:
include the H2 database in your project
create a new on-disk database (it will be created automatically on the first connection)
create a simple table where the numbers will be stored
read the data number by number and insert it into the database (do not forget to commit every 1,000 numbers or so)
select the numbers with an ORDER BY clause :)
use a JDBC ResultSet to fetch the results on the fly and write them to an output file
The H2 database is simple, works very well with Java, and can be embedded in your JAR (it does not need any kind of installation or setup). A minimal sketch of these steps follows.
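A minimal sketch of those steps, assuming a plain text input with one number per line; the file names, table name, and fetch size are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.sql.*;

public class SortWithH2 {
    public static void main(String[] args) throws Exception {
        // On-disk H2 database; the database file is created automatically on the first connection.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./sortdb", "sa", "");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS numbers(n BIGINT)");

            conn.setAutoCommit(false);
            try (BufferedReader in = new BufferedReader(new FileReader("numbers.txt"));
                 PreparedStatement insert = conn.prepareStatement("INSERT INTO numbers(n) VALUES (?)")) {
                String line;
                int count = 0;
                while ((line = in.readLine()) != null) {
                    insert.setLong(1, Long.parseLong(line.trim()));
                    insert.addBatch();
                    if (++count % 1000 == 0) {     // commit every ~1,000 numbers, as suggested
                        insert.executeBatch();
                        conn.commit();
                    }
                }
                insert.executeBatch();
                conn.commit();
            }

            // Let the database do the sorting and stream the results to the output file.
            st.setFetchSize(10_000);
            try (ResultSet rs = st.executeQuery("SELECT n FROM numbers ORDER BY n");
                 PrintWriter out = new PrintWriter("sorted.txt")) {
                while (rs.next()) {
                    out.println(rs.getLong(1));
                }
            }
        }
    }
}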
You don't really need any special tools for this. This is a textbook case for external merge sort: you read in parts of the large file at a time (say 100M), sort them, and write the sorted results out to temporary files. Read in another part, sort it, write it out, until there's nothing left to sort. Then you read the sorted chunks back in, a smaller piece at a time (say 10M each), and merge them. The tricky part is merging those sorted pieces together in the right way. Read the external sorting page on Wikipedia as well, as already mentioned. Also, here's an implementation in Java that does this kind of external merge sorting.
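For comparison, here is a bare-bones external merge sort sketch in plain Java, again assuming one number per line; the chunk size and file names are arbitrary:

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort {
    private static final int CHUNK_SIZE = 1_000_000;   // numbers per in-memory chunk

    public static void main(String[] args) throws IOException {
        List<Path> chunks = splitAndSort(Paths.get("numbers.txt"));
        merge(chunks, Paths.get("sorted.txt"));
    }

    // Phase 1: read the big file in chunks, sort each chunk, write it to a temp file.
    private static List<Path> splitAndSort(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<Long> buffer = new ArrayList<>(CHUNK_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(Long.parseLong(line.trim()));
                if (buffer.size() == CHUNK_SIZE) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunkFiles.add(writeSortedChunk(buffer));
        }
        return chunkFiles;
    }

    private static Path writeSortedChunk(List<Long> buffer) throws IOException {
        Collections.sort(buffer);
        Path tmp = Files.createTempFile("chunk", ".txt");   // temp files are merged (and can be deleted) later
        try (BufferedWriter writer = Files.newBufferedWriter(tmp)) {
            for (Long n : buffer) {
                writer.write(n.toString());
                writer.newLine();
            }
        }
        return tmp;
    }

    // Phase 2: k-way merge of the sorted chunk files using a priority queue.
    private static void merge(List<Path> chunkFiles, Path output) throws IOException {
        PriorityQueue<ChunkReader> queue =
                new PriorityQueue<>(Comparator.comparingLong((ChunkReader c) -> c.current));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            for (Path chunk : chunkFiles) {
                BufferedReader r = Files.newBufferedReader(chunk);
                readers.add(r);
                ChunkReader cr = new ChunkReader(r);
                if (cr.advance()) queue.add(cr);
            }
            while (!queue.isEmpty()) {
                ChunkReader smallest = queue.poll();
                writer.write(Long.toString(smallest.current));
                writer.newLine();
                if (smallest.advance()) queue.add(smallest);   // re-insert if that chunk has more numbers
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    // Wraps one sorted chunk file and exposes its current smallest value.
    private static class ChunkReader {
        final BufferedReader reader;
        long current;
        ChunkReader(BufferedReader reader) { this.reader = reader; }
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) return false;
            current = Long.parseLong(line.trim());
            return true;
        }
    }
}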

Java Heap Size and OutOfMemoryError

I am trying to read a file (tab-separated or CSV) in Java with roughly 3m rows; I have also raised the virtual machine memory to -Xmx6g. The code works fine up to 400K rows for the tab-separated file and slightly fewer for the CSV file. There are many LinkedHashMaps and Vectors involved, and I call System.gc() after every few hundred rows in order to free memory and garbage values. However, my code gives the following error after 400K rows:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Vector.<init>(Vector.java:111)
at java.util.Vector.<init>(Vector.java:124)
at java.util.Vector.<init>(Vector.java:133)
at cleaning.Capture.main(Capture.java:110)
Your attempt to load the whole file is fundamentally ill-fated. You may optimize all you want, but you'll just be pushing the upper limit slightly higher. What you need is to eradicate the limit itself.
There is only a negligible chance that you actually need the whole contents in memory all at once. You probably need to calculate something from that data, so you should start working out a way to do that calculation chunk by chunk, each time being able to throw away the processed chunk.
If your data is deeply intertwined, preventing you from serializing your calculation, then the reasonable recourse is, as HovercraftFOE mentions above, transferring the data into a database and working from there, indexing everything you need, normalizing it, etc.
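As an illustration of the chunk-by-chunk idea, here is a minimal sketch that streams a tab-separated file one row at a time and keeps only a running aggregate; the file name, delimiter, column index, and the aggregate itself are placeholders for whatever your calculation actually needs:

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamRows {
    public static void main(String[] args) throws Exception {
        long rowCount = 0;
        double runningTotal = 0;   // whatever aggregate the calculation needs

        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.tsv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                // update the calculation from this one row, then let the row be garbage collected
                runningTotal += Double.parseDouble(fields[2]);
                rowCount++;
            }
        }
        System.out.println("rows=" + rowCount + " total=" + runningTotal);
    }
}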

Need a suggestion on my approach: reading a file which is being written continuously?

I have one CSV file which is being written continuously by a script. It writes a timestamp and some other data per row. I have to read the latest data first.
Currently I am using RandomAccessFile in Java to read the file in reverse. But as it is written continuously, I have to read the new data with priority. I am keeping track of which timestamp has been sent and doing the work accordingly, but it results in unnecessary scanning operations.
Is there any better way to deal with this scenario?
Thanks in advance,
You could consider having one thread that reads new lines as they appear and pushes them onto a stack of unprocessed rows, and a second thread that pops the stack and processes the new rows in reverse order.
Depending on how long it takes to process a new row compared to how quickly they are generated, this might be sufficient. If new rows are generated faster than you can process them then this approach probably won't work - the stack will get too big and you'll run out of memory. In that case, depending on your requirements, you might be able to get away with a size-limited stack that discards old entries.
Two ideas:
Use a fixed size record format instead of CSV. Then you can tell exactly what offsets your records are at instead of having to seek around looking for newlines.
If that isn't possible, have a thread that reads items from the file and pushes them onto a stack. Another thread pops items from the stack and processes them. Because it's a stack, it'll always be dealing with the most recent available item. You'll need to figure out how you want to deal with cases where the stack gets too big. Do you just want to throw away items that are too old? (A rough sketch of this reader-thread/stack setup follows.)
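A rough sketch of that reader-thread/stack setup, using a bounded LinkedBlockingDeque so the oldest rows are discarded when the consumer falls behind; the file name, polling interval, and capacity are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

public class TailStack {
    public static void main(String[] args) throws Exception {
        // Bounded "stack": newest rows are taken first; oldest are dropped when it fills up.
        BlockingDeque<String> stack = new LinkedBlockingDeque<>(10_000);

        Thread reader = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
                while (true) {
                    String line = in.readLine();
                    if (line == null) {
                        Thread.sleep(200);              // no new data yet; poll again shortly
                        continue;
                    }
                    while (!stack.offerFirst(line)) {   // stack full: discard the oldest entry
                        stack.pollLast();
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
        reader.setDaemon(true);
        reader.start();

        // Consumer: always takes the most recently read row first.
        while (true) {
            String newest = stack.pollFirst(1, TimeUnit.SECONDS);
            if (newest != null) {
                process(newest);
            }
        }
    }

    private static void process(String csvRow) {
        // placeholder: parse the timestamp and do the actual work
    }
}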
If you have access to the original script, write the record to a database, in addition to the CSV file. Then you can do whatever you want with the database; access the last record, run a report, etc.
If your application is running in a Unix environment, you could run
tail -f /csv-file | custom-program
custom-program would simply accept standard input and echo that to a socket connection with your Java program.
I'm assuming that your Java program is some sort of server app that can't be started from the command line like that. If starting it that way would actually be okay, then you could replace custom-program with your Java program (see the sketch below).
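If your program can be started from the command line, the Java side of that pipe is tiny; a minimal sketch (the processing in handle is a placeholder):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StdinTail {
    public static void main(String[] args) throws Exception {
        // Reads whatever `tail -f` pipes in, newest lines as they arrive.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                handle(line);   // placeholder for your processing
            }
        }
    }

    private static void handle(String csvRow) {
        // parse the timestamp and data here
    }
}

Run it as: tail -f /csv-file | java StdinTail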
It results in unnecessary scanning operations.
I presume that you are referring to the overheads of seeking to some point, and then finding the next valid CSV row start position by reading until you hit the next newline.
I can think of three ways to do this that may be more efficient than what you are currently doing:
Read the entire file, parsing out the rows in the forward direction and storing the rows in memory. Then process the in-memory rows in reverse order.
Scan the file from the beginning looking for row starts, and store the row start positions in memory. Then iterate through the positions in reverse order, seeking to each one to read the corresponding row. (You can do the input more efficiently by processing multiple rows in each seek.)
Map the file into memory using a MappedByteBuffer; then you can step through the byte buffer forwards or backwards to find the row boundaries.
The first approach requires that you can buffer the entire file in memory, but has the lower I/O overhead because you read the file just once with a minimum number of system calls. The third approach has the same issue, though you could map an extremely large file into memory in (large) sections to reduce the memory requirements.
But ultimately, there is no simple and efficient way of reading a file backwards in Java.
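As a sketch of the second option (recording row-start offsets, then seeking to them in reverse), here is a minimal RandomAccessFile version; note that RandomAccessFile.readLine() decodes bytes as Latin-1 and is not fast, so treat this as an outline rather than tuned code, and the file name is a placeholder:

import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class ReverseRows {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("data.csv", "r")) {
            // Pass 1: record the byte offset where every row starts.
            List<Long> rowStarts = new ArrayList<>();
            rowStarts.add(0L);
            while (file.readLine() != null) {
                rowStarts.add(file.getFilePointer());
            }
            rowStarts.remove(rowStarts.size() - 1);   // last entry is end-of-file, not a row start

            // Pass 2: visit the rows newest-first by seeking to each start position.
            for (int i = rowStarts.size() - 1; i >= 0; i--) {
                file.seek(rowStarts.get(i));
                process(file.readLine());
            }
        }
    }

    private static void process(String row) {
        // placeholder: handle one CSV row
    }
}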
