Create pivot table from huge xlsx files using Java

I have an issue while trying to create 2 pivot tables from one large (~100 MB) xlsx file using Java and Apache POI. I have to process two sheets. I get an OOM error after nearly 10 minutes, when processing the last line of the following code:
File lFile = new File("workbook.xlsx");
try {
System.out.println("Debug 1");
XSSFWorkbook myWb = new XSSFWorkbook(lFile);
...
I'm not sure that SXSSF or other formats would help, as I don't really know how to work around this issue. I could split the file into smaller ones, but splitting has the same problem: I would still have to load the file in POI, and I still want 1 pivot table per sheet (the large one).
Is it possible to read the large file using something else, to copy the read data into split files, and to create the pivot table in a new file using multiple files?
The fact is that my final xlsx file needs access to the data from both sheets, as the pivot tables are used to browse that data easily with filters (and to check counts on some filters).
Every idea is welcome!
Thanks.

Related

In limited memory, use Apache POI to write data in specified rows and columns to an excel file that already has a large amount of data

I want to append data to an existing Excel file. My code is as follows:
XSSFWorkbook book = new XSSFWorkbook("/my.xlsx");
Sheet sheet = book.getSheetAt(1);
//do something
But this file already has a lot of data, so I get an out-of-memory error.
I know it is possible to use SXSSFWorkbook, but this doesn't work for me, because it still loads the existing data into memory:
new SXSSFWorkbook(new XSSFWorkbook("/my.xlsx"));
How can I solve this problem with limited memory? thanks in advance.
As mentioned above, I hope that I can write data in specified rows and columns to an excel file that already has a large amount of data in limited memory.

Java: Read 10,000 excel files and write to 1 master file in Java using Apache POI

I searched in Google but couldn't find a proper answer for my problem mentioned below. Pardon me if this is a duplicate question, but I couldn't find any proper answer.
So, coming to the question. I have to read multiple Excel files in Java and generate a final Excel report file out of these multiple files.
There are 2 folders:
Source folder: It contains multiple Excel files (probably 10,000 files)
Destination folder: This folder will have one final Master Excel file after reading all the files from Source folder.
For each Excel file read from Source folder, the master file in the Destination folder will have 1 row each.
I am planning to use Apache POI to read and write excel files in Java.
I know it's easy to read and write files in Java using POI, but my question is: given that there are almost 10,000 files to read and write into 1 single Master file, what is the best approach, considering the time taken and the CPU used by the program? Reading 1 file at a time will be too time-consuming.
So, I am planning to use threads to process files in batches of, say, 100 files at a time. Can anybody please point me to some resources or suggest how to proceed with this requirement?
Edited:
I have already written the program to read and write the file using POI. The code for the same is mentioned below:
// Loop through the directory, fetching each file.
File sourceDir = new File("SourceFolder");
System.out.println("The current directory is = " + sourceDir);
if (sourceDir.exists()) {
    if (sourceDir.isDirectory()) {
        String[] filesInsideThisDir = sourceDir.list();
        numberOfFiles = filesInsideThisDir.length;
        for (String filename : filesInsideThisDir) {
            System.out.println("(processFiles) The file name to read is = " + filename);
            // Read each file
            readExcelFile(filename);
            // Write the data
            writeMasterReport();
        }
    } else {
        System.out.println("(processFiles) Source directory specified is not a directory.");
    }
} else {
}
Here, the SourceFolder contains all the Excel files to read. I am looping through this folder fetching 1 file at a time, reading the contents and then writing to 1 Master Excel file.
The readExcelFile() method is reading every excel file, and creating a List which contains the data for each row to be written to Master excel file.
The writeMasterReport() method is writing the data read from every excel file.
The program is running fine. My question is, is there any way I can optimize this code by using Threads for reading through the files? I know that there is only 1 file to write to, and that the writing cannot be done in parallel. If the sourceFolder contains 10,000 files, reading and writing this way will take a lot of time to execute.
The size of each input file will be around a few hundred KB.
So, my question is, can we use Threads to read the files in batches, say 100 or 500 files per thread, and then write the data for each thread? I know the write part will need to be synchronized. This way at least the read and write time will be minimized. Please let me know your thoughts on this.
With 10k files of ~100 KB each, we're talking about reading roughly 1 GB of data. If the processing is not overly complex (which seems to be the case), your bottleneck will be IO.
So it most probably does not make sense to parallelize reading and processing the files, as IO throughput has an upper limit.
Parallelizing would have made sense if processing were complex/the bottleneck. It does not seem to be the case here.
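If you would rather measure this than assume it, the structure the question describes (read in parallel, write from one thread) can be sketched roughly as below. This is only a sketch under assumptions: ParallelReadSketch and the reader parameter are hypothetical names, and reader stands in for whatever readExcelFile() does to turn one file into one master row. Timing it with 1 thread and then with several should show whether IO really is the limit on your machine.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ParallelReadSketch {

    // Applies reader to every file on a fixed-size thread pool.
    // Results are returned in the same order as the input list.
    static <T> List<T> readAll(List<Path> files, int threads, Function<Path, T> reader) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> pending = new ArrayList<>();
            for (Path file : files) {
                Callable<T> task = () -> reader.apply(file);
                pending.add(pool.submit(task));
            }
            List<T> rows = new ArrayList<>();
            for (Future<T> future : pending) {
                rows.add(future.get()); // blocks until that file has been processed
            }
            return rows;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because only the calling thread collects the rows, the master file can be written once at the end without any synchronization on the write side.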

"Zip bomb detected" exception thrown by Apache-POI while opening existing xlsx files with pivot tables

I am trying to open an existing xlsx file (MS Excel 2010) to append more data to it using Apache POI (v 3.15).
The existing xlsx file (size 700 KB) contains a number of tabs with pivot tables, charts etc.
File file = new File(FILE_PATH);
OPCPackage opcPackage = OPCPackage.open(file);
XSSFWorkbook wbk = new XSSFWorkbook(opcPackage);
But the exception thrown is as below,
Caused by: java.io.IOException: Zip bomb detected! The file would
exceed the max. ratio of compressed file size to the size of the
expanded data. This may indicate that the file is used to inflate
memory usage and thus could pose a security risk. You can adjust this
limit via ZipSecureFile.setMinInflateRatio() if you need to work with
files which exceed this limit. Counter: 819241, cis.counter: 8192,
ratio: 0.009999499536766349Limits: MIN_INFLATE_RATIO: 0.01 at
org.apache.poi.openxml4j.util.ZipSecureFile$ThresholdInputStream.advance(ZipSecureFile.java:257)
I have tried the following changes:
1. Tried to change "ZipSecureFile.setMinInflateRatio()" to fix this, but the JVM crashes with a heap space error (even though I allocated more than 4 GB).
2. Tried to use an InputStream and WorkbookFactory.create to create the workbook, and tried to open it as SXSSF.
But none of this has worked for me. Does anyone have any idea?
So I found a solution to the problem, but as with many Microsoft-related issues, I couldn't pin down the root cause.
As this template was provided by analysts who had worked on it before (by just deleting the data bits), I felt the Excel file itself might be the issue.
What I did was start a new spreadsheet and recreate the same pivot tables one by one from scratch in each tab (a bit of manual work, but it didn't take long).
The final template was 60 KB, so there must have been some hidden data/"something invisible" in the old file that was causing this to fail.
I managed to use this template to create new excel sheet and successfully added data around 600k rows in seconds. Awesome!
A bit late to answer this,
but I hope it will help people facing the same kind of problem.
For me, what worked is
ZipSecureFile.setMinInflateRatio(0.0);
You can have a look at the description here:
https://community.pega.com/node/715956
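The numbers in the exception message show how narrowly the check failed. Assuming the reported ratio is cis.counter divided by Counter (compressed bytes over expanded bytes, which is consistent with the printed value), the arithmetic can be reproduced in plain Java. InflateRatioCheck is a hypothetical name and no POI classes are involved.

```java
final class InflateRatioCheck {

    // Ratio of compressed bytes read to expanded bytes produced.
    static double ratio(long compressedBytes, long expandedBytes) {
        return (double) compressedBytes / expandedBytes;
    }

    public static void main(String[] args) {
        // cis.counter and Counter, copied from the exception message above.
        double r = ratio(8192, 819241);
        double minInflateRatio = 0.01; // MIN_INFLATE_RATIO from the same message
        System.out.println("ratio = " + r + ", below limit: " + (r < minInflateRatio));
    }
}
```

The ratio comes out just under 0.01, so any lower limit lets this file through. A limit of 0.0 can never be undercut, which as far as I can tell switches the ratio check off entirely, so it is probably best reserved for files you trust.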

How to find a specific string in 1 or 2 sheets of a excel file using Apache POI?

I need to find a specific string (an id or a name, for example) in 1 sheet of an Excel file.
This is the basic need.
Later on we need to find a user across several Excel sheets, copy the whole record identified by that code, and send it to a JTable in the frame.
Are you looking for a high-level search function or something? I don't think that exists.
As you load the sheets, you might consider just adding the interesting columns to a HashMap if you can use exact matches, otherwise just iterate over the sheets/columns/rows and search manually.
You could create some mid-level tooling to do this. A "Sheet Indexer" perhaps, that takes a sheet and a list of columns and then lets you do lookups. Even if you have to write code to iterate over everything manually, you shouldn't worry too much about speed. The number of sheets/rows is very unlikely to get large enough to affect performance.
We actually have a lot of tooling built around POI, including an ORM layer that lets us load from spreadsheets using annotations, just like Hibernate. We called it "son of poi" aka "poison".
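A "Sheet Indexer" along those lines could look roughly like this. It is a minimal sketch, not the tooling mentioned above: SheetIndexer is a hypothetical class, and it assumes each row has already been copied out of the sheet into a String[] (for example while iterating over the rows with POI), so the sketch itself needs only the JDK.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SheetIndexer {

    // Maps a cell value in the indexed column to all rows that contain it.
    private final Map<String, List<String[]>> byKey = new HashMap<>();

    // rows: one String[] per spreadsheet row; keyColumn: zero-based column to index.
    SheetIndexer(List<String[]> rows, int keyColumn) {
        for (String[] row : rows) {
            if (keyColumn < row.length) { // skip rows too short to have the key column
                byKey.computeIfAbsent(row[keyColumn], k -> new ArrayList<>()).add(row);
            }
        }
    }

    // Exact-match lookup; returns an empty list when the key is absent.
    List<String[]> find(String key) {
        return byKey.getOrDefault(key, Collections.emptyList());
    }
}
```

Feeding the rows of several sheets into the same indexer gives one lookup across all of them, and the String[] that comes back is the whole record, ready to be added to a table model.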

Inserting data in RandomAccessFile and updating index

I've got a RandomAccessFile in Java where I manage some data. Simplified:
At the start of the file I have an index (an 8-byte long value per dataset, which represents the offset where the real data can be found).
So if I want to know where I can find the data of dataset no. 3, for example, I read 8 bytes at offset 2*8 (index positions start at 0).
A dataset itself consists of 4 bytes which represent the size of the dataset, followed by all the bytes belonging to the dataset.
That works fine as long as I always rewrite the whole file.
It's pretty important here that dataset no. 3 could have been written as the first entry in the file, so the index is ordered but the data itself is not.
If I insert a new dataset, I always append it to the end of the file. But the number of datasets that can be in one file is limited: if I can store 100 datasets in the file, there will always be 100 entries in the index. If the offset read from the index for a dataset is 0, the dataset is new and will be appended to the file.
But there's one case which is not working for me yet. If I read dataset no. 3 from the file, add some data to it in my application, and want to update it in the file, I have no idea how to do this.
If it has the same length as before, I can simply overwrite the old data. But if the new dataset has more bytes than the old one, I'll have to move all the data in the file that comes after this dataset and update the index entries for those datasets.
Any idea how to do that?
Or is there maybe a better way to manage storing these datasets in a file?
PS: Yes, of course I thought of using a database, but this is not applicable for my project. I really do need simple files.
You can't easily insert data into the middle of a file. You'd basically have to read all the remaining data, write the "new" data and then rewrite the "old" data. Alternatively, you could potentially invalidate the old "slot" (potentially allowing it to be reused later) and then just write the whole new record to the end of the file. Your file format isn't really clear to me to be honest, but fundamentally you need to be aware that you can't insert (or delete) in the middle of a file.
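The "read the remaining data, write the new data, rewrite the old data" approach looks roughly like this. It is a sketch under assumptions: FileInsert is a hypothetical helper, the tail after the insert position is assumed to fit in memory, and in the asker's format every index entry that points past the insert position would also have to be increased by the number of inserted bytes.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class FileInsert {

    // Inserts data at position pos by rewriting everything that follows it.
    static void insert(RandomAccessFile file, long pos, byte[] data) throws IOException {
        // Read the tail that currently starts at pos (assumed to fit in memory).
        byte[] tail = new byte[(int) (file.length() - pos)];
        file.seek(pos);
        file.readFully(tail);
        // Write the new bytes, then put the old tail back behind them.
        file.seek(pos);
        file.write(data);
        file.write(tail);
    }

    // Usage example on a temporary file: inserts "cd" into "abef".
    static String demo() {
        try {
            File tmp = File.createTempFile("insert", ".bin");
            tmp.deleteOnExit();
            try (RandomAccessFile file = new RandomAccessFile(tmp, "rw")) {
                file.write("abef".getBytes(StandardCharsets.US_ASCII));
                insert(file, 2, "cd".getBytes(StandardCharsets.US_ASCII));
                byte[] all = new byte[(int) file.length()];
                file.seek(0);
                file.readFully(all);
                return new String(all, StandardCharsets.US_ASCII); // "abcdef"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost grows with the amount of data behind the insert position, which is why appending the changed record to the end is usually the cheaper option.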
I've got a RandomAccessFile in Java where i manage some data.
Stop right there. You have a file. You are presently accessing it via RandomAccessFile in Java. However your entire question relates to the file itself, not to RandomAccessFile or Java. You have a major file design problem, as you are assuming facilities like inserting into the middle of a file that don't exist in any filesystem I have used since about 1979.
As the others have answered too, there's no real possibility of making the file longer/shorter in the middle without rewriting the whole file. There are some workarounds, and maybe one of them will work for you after all.
Limit all datasets to a fixed length.
Delete by changing/removing the index and add by always adding to the end of the file. Update by removing the old dataset and adding the new dataset to the end if the new dataset is longer. Compress the file from time to time by actually deleting the "ignored datasets" and moving all valid datasets together (rewriting everything).
If you can't limit the datasets to a fixed length and you intend to update a dataset by making it longer, you can also leave a pointer at the end of the first part of a dataset and continue it later in the file. That gives you a structure like a linked list. If a lot of editing takes place, it would make sense here too to rearrange and compress the file from time to time.
Most solutions have a data overhead but file size is usually not the problem and as mentioned you can let some method "clean it up".
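The second workaround (overwrite in place when the new data fits, otherwise append and repoint the index) can be sketched as follows. This is a minimal sketch, not a complete design: DatasetFile is a hypothetical class, slots are zero-based (so dataset no. 3 from the question is slot 2), an index offset of 0 means "not stored yet" as in the question, and compaction of the dead space is left out.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class DatasetFile implements AutoCloseable {
    private final RandomAccessFile file;
    private final int slots;

    DatasetFile(File path, int slots) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.slots = slots;
        if (file.length() == 0) {
            file.write(new byte[8 * slots]); // zeroed index: offset 0 means "not stored yet"
        }
    }

    // Returns the bytes of the dataset in the given slot, or null if the slot is empty.
    byte[] read(int slot) throws IOException {
        long offset = offsetOf(slot);
        if (offset == 0) {
            return null;
        }
        file.seek(offset);
        byte[] data = new byte[file.readInt()]; // 4-byte size, then the payload
        file.readFully(data);
        return data;
    }

    // Overwrites in place if the new data is not longer than the old record,
    // otherwise appends to the end of the file and repoints the index entry.
    void write(int slot, byte[] data) throws IOException {
        long offset = offsetOf(slot);
        boolean fits = false;
        if (offset != 0) {
            file.seek(offset);
            fits = data.length <= file.readInt();
        }
        long target = fits ? offset : file.length(); // when appending, the old record becomes dead space
        file.seek(target);
        file.writeInt(data.length);
        file.write(data);
        if (!fits) {
            file.seek(8L * slot);
            file.writeLong(target);
        }
    }

    private long offsetOf(int slot) throws IOException {
        if (slot < 0 || slot >= slots) {
            throw new IndexOutOfBoundsException("slot " + slot);
        }
        file.seek(8L * slot);
        return file.readLong();
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    // Usage example: a longer update of slot 2 is appended and the index follows it.
    static String demo() {
        try {
            File tmp = File.createTempFile("datasets", ".bin");
            tmp.deleteOnExit();
            try (DatasetFile store = new DatasetFile(tmp, 100)) {
                store.write(2, "short".getBytes(StandardCharsets.US_ASCII));
                store.write(2, "much longer".getBytes(StandardCharsets.US_ASCII));
                return new String(store.read(2), StandardCharsets.US_ASCII); // "much longer"
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Nothing behind the changed dataset ever has to move, so only one index entry is touched per update. The price is the dead space left by replaced records, which the occasional "clean it up" pass would reclaim by rewriting the file.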
PS: I hope it's ok to answer such old questions - I couldn't find anything about it in the help center and I'm relatively new here.
