APACHE POI - Removing duplicate values from hundreds of previous Excel files - Java

My company calls customers and offers them products. Every month it receives 20 new Excel files, each containing thousands of phone numbers, and a staff of callers works through them.
The problem is that sometimes a "new" Excel file contains phone numbers that already appeared in one of the Excel files received a few months earlier, so customers get annoyed at being called again.
I was asked if I could help out somehow. So I thought I would build a program that saves all the phone numbers in a text file named STORED.txt:
It has two cases:
1. If STORED.txt is empty: add every phone number from the Excel file to STORED.txt.
2. If STORED.txt is not empty: for each phone number in the Excel file, check whether it already exists in STORED.txt.
The problem with this approach is that after many runs STORED.txt already holds millions of values, so a new Excel file with 100,000 values requires on the order of 100,000 * millions of comparisons, which makes for an enormous runtime.
I'm wondering if there's a better approach to this issue, one that avoids the O(S*E) complexity (where S is the number of stored phones and E is the number of new Excel phones). Any ideas?
By the way, when reading the text file I first store all the values in a Java ArrayList (O(S) time) and only then check, for each phone number in the Excel file, whether it exists in the list using ArrayList.contains(value). I wonder whether that is efficient, or whether there is a less time-consuming way to do it.
Any suggestions would be very welcome. Thanks!
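Replacing the ArrayList with a HashSet removes the linear scan: building the set is O(S) and each lookup is O(1) on average, so a whole run becomes roughly O(S + E) instead of O(S*E). Below is a minimal sketch of that idea, assuming STORED.txt holds one phone number per line; phonesFromExcel is only a placeholder for the existing POI reading code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhoneDeduplicator {

    public static void main(String[] args) throws IOException {
        Path stored = Paths.get("STORED.txt");

        // Load all previously seen numbers into a HashSet: O(S) to build, O(1) per lookup.
        Set<String> seen = new HashSet<>();
        if (Files.exists(stored)) {
            seen.addAll(Files.readAllLines(stored, StandardCharsets.UTF_8));
        }

        // Placeholder for the POI code that extracts the numbers from the new workbook.
        List<String> newPhones = phonesFromExcel("new_month.xlsx");

        // Keep only the numbers not seen before, and remember them for the next run.
        List<String> fresh = new ArrayList<>();
        for (String phone : newPhones) {
            if (seen.add(phone)) {      // add() returns false if the number was already present
                fresh.add(phone);
            }
        }

        Files.write(stored, fresh, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    private static List<String> phonesFromExcel(String fileName) {
        return List.of();               // stand-in for the Apache POI reading logic
    }
}

A few million phone-number strings fit comfortably in a HashSet on an ordinary heap; if the data ever outgrows memory, the same idea carries over to an embedded database with a unique index on the number.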

Related

Create pivot table from huge xlsx files using Java

I have an issue while trying to create 2 pivot tables from one large (~100MB) xlsx file using Java POI. I have to process two sheets. I get an OOM error after nearly 10 minutes, when processing the last line of the following code:
File lFile = new File("workbook.xlsx");
try {
    System.out.println("Debug 1");
    XSSFWorkbook myWb = new XSSFWorkbook(lFile);
    ...
I'm not sure whether SXSSF or the other formats could help me, as I don't really see how to work around this issue. I could split the file into smaller ones, but the problem stays the same: I still have to load the file in POI, and I still want one pivot table per sheet (the large one).
Is it possible to read the large file with something else, copy the data into split files, and create the pivot tables in a new file from those multiple files?
The point is that my final xlsx file needs access to the data from both sheets, as the pivot tables are used to explore that data with filters (as well as to check counts on some filters).
Every idea is welcome!
Thanks.
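One commonly used way around the memory blow-up on the reading side is POI's streaming event API (XSSFReader plus a SAX handler) instead of loading the whole XSSFWorkbook; SXSSF only helps with writing. The sketch below just walks every sheet and hands each cell to a callback, so the data can be aggregated or copied out without ever building the full object model. Class and handler names are illustrative, and the SheetContentsHandler interface differs slightly between POI versions.

import java.io.File;
import java.io.InputStream;

import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class StreamingSheetReader {

    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open(new File("workbook.xlsx"))) {
            ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg);
            XSSFReader reader = new XSSFReader(pkg);
            StylesTable styles = reader.getStylesTable();

            XSSFReader.SheetIterator sheets = (XSSFReader.SheetIterator) reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    System.out.println("Processing sheet: " + sheets.getSheetName());
                    SAXParserFactory factory = SAXParserFactory.newInstance();
                    factory.setNamespaceAware(true);
                    XMLReader parser = factory.newSAXParser().getXMLReader();
                    parser.setContentHandler(
                            new XSSFSheetXMLHandler(styles, strings, new CellCollector(), false));
                    parser.parse(new InputSource(sheet));
                }
            }
        }
    }

    // Receives every cell as it is parsed; accumulate whatever the pivot tables need here.
    private static class CellCollector implements SheetContentsHandler {
        @Override public void startRow(int rowNum) { }
        @Override public void endRow(int rowNum) { }
        @Override public void cell(String cellReference, String formattedValue, XSSFComment comment) {
            // e.g. collect counts per filter value, or copy rows into smaller output files
        }
        @Override public void headerFooter(String text, boolean isHeader, String tagName) { }
    }
}

The pivot tables themselves would still have to be created with the normal XSSF API in a (hopefully much smaller) output workbook, so this mainly helps if the aggregation or splitting can be done while streaming.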

Java: Read 10,000 excel files and write to 1 master file in Java using Apache POI

I searched on Google but couldn't find a proper answer to the problem described below. Pardon me if this is a duplicate question.
So, coming to the question. I have to read multiple Excel files in Java and generate a final Excel report file out of these multiple files.
There are 2 folders:
Source folder: contains multiple Excel files (probably 10,000 of them).
Destination folder: will hold one final master Excel file after all the files from the Source folder have been read.
For each Excel file read from the Source folder, the master file in the Destination folder will get one row.
I am planning to use Apache POI to read and write the Excel files in Java.
I know it's easy to read and write files in Java using POI, but my question is: given this scenario, where there are almost 10,000 files to read and then one single master file to write, what is the best approach, considering the time taken and the CPU used by the program? Reading one file at a time will be too time-consuming.
So I am planning to use threads to process the files in batches of, say, 100 files at a time. Can anybody please point me to some resources or suggest how to proceed with this requirement?
Edited:
I have already written the program to read and write the file using POI. The code for the same is mentioned below:
// Loop through the directory, fetching each file.
File sourceDir = new File("SourceFolder");
System.out.println("The current directory is = " + sourceDir);
if (sourceDir.exists()) {
    if (sourceDir.isDirectory()) {
        String[] filesInsideThisDir = sourceDir.list();
        numberOfFiles = filesInsideThisDir.length;
        for (String filename : filesInsideThisDir) {
            System.out.println("(processFiles) The file name to read is = " + filename);
            // Read each file
            readExcelFile(filename);
            // Write the data
            writeMasterReport();
        }
    } else {
        System.out.println("(processFiles) Source directory specified is not a directory.");
    }
} else {
}
Here, the SourceFolder contains all the Excel files to read. I am looping through this folder, fetching one file at a time, reading its contents and then writing to one master Excel file.
The readExcelFile() method reads one Excel file and creates a List containing the data for the row to be written to the master Excel file.
The writeMasterReport() method writes the data read from each Excel file.
The program is running fine. My question is: is there any way I can optimize this code by using threads to read the files? I know that there is only one file to write to, and that cannot be done in parallel. If the SourceFolder contains 10,000 files, reading and writing this way will take a lot of time to execute.
The size of each input file will be around a few hundred KB.
So, my question is: can we use threads to read the files in batches, say 100 or 500 files per thread, and then write the data for each thread? I know the write part will need to be synchronized. This way at least the read and write time will be minimized. Please let me know your thoughts on this.
With 10k files of ~100 KB each we're talking about reading roughly 1 GB of data. If the processing is not overly complex (which seems to be the case), your bottleneck will be IO.
So it most probably does not make sense to parallelize reading and processing the files, as IO has an upper limit.
Parallelizing would have made sense if the processing were complex and the bottleneck. That does not seem to be the case here.

Several FileOutputStreams at a time?

The situation is that:
I have a CSV file with records (usually 10k, but up to 1M records).
I will process each record (very basic arithmetic, with 5 basic select queries to the DB for every record).
Each record (now processed) will then be written to a file, BUT not the same file every time; a record CAN be written to another file instead.
Basically I have 1 input file but several possible output files (around 1-100 of them).
The process itself is basic, so I am focusing on how I should handle the records.
Which option is appropriate for this situation?
Store several Lists, one per possible output file, and then write out each List at the end?
Or, to avoid several very large Lists, write each record to its respective output file immediately after processing it? But this requires keeping several streams open at the same time.
Please enlighten me on this. Thanks.
The second option is fine: create the file output streams on demand, and keep them open as long as it takes (track them in a Map, for example).
The operating system may restrict how many open file handles it allows, but that limit is usually well beyond a couple of hundred files.
A third option:
You could also just append to files, FileOutputStream allows that option in the constructor:
new FileOutputStream(File file, boolean append)
This is less performant than keeping the FileOutputStreams open, but works as well.
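A minimal sketch of the second option, with one lazily created stream per output file kept in a Map (class and method names are illustrative):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RecordRouter implements AutoCloseable {

    // One stream per output file, created the first time a record is routed there.
    private final Map<String, OutputStream> streams = new HashMap<>();

    public void write(String outputFile, String processedRecord) throws IOException {
        OutputStream out = streams.get(outputFile);
        if (out == null) {
            out = new BufferedOutputStream(new FileOutputStream(outputFile, true)); // append mode
            streams.put(outputFile, out);
        }
        out.write((processedRecord + System.lineSeparator()).getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public void close() throws IOException {
        for (OutputStream out : streams.values()) {
            out.close();   // flushes the buffer and releases the file handle
        }
    }
}

Wrapping the processing loop in try-with-resources around a RecordRouter instance guarantees that all 1-100 streams are flushed and closed in one place when the run ends.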

Organizing data in Java (on a Mac)

I'm writing a tool to analyze stock market data. For this I download data and then save everything corresponding to a stock as a double[][] 20*100000 array in a data.bin file on my hard drive. I know I should put it in some database, but performance-wise this is simply the best method.
Now here is my problem: I need to do updates and search on the data:
Updates: I have to append new data to the end of the array as time progresses.
Search: I want to iterate over different data files to find a minimum or calculate moving averages etc.
I could do both by reading the whole file in, updating it in memory and writing it back, or by scanning a specific area... but this is somewhat overkill, since I don't need all of the data.
So my question is: is there a library (in Java) or something similar that lets me open/read/change parts of the binary file without having to read the whole file? Or search through the file starting at a specific position?
RandomAccessFile allows seeking to a particular position in a file and updating parts of the file, or adding new data to the end, without rewriting everything. See the tutorial here: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
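For instance, if the doubles in data.bin are laid out row-major with a fixed number of values per row (the actual layout of the file may of course differ), reading one value or appending a new row looks roughly like this sketch:

import java.io.IOException;
import java.io.RandomAccessFile;

public class PriceFile {

    private static final int ROW_LENGTH = 20;     // values per row; adjust to the real layout
    private static final int DOUBLE_BYTES = 8;

    // Read a single value without loading the rest of the file.
    static double read(RandomAccessFile raf, long row, int col) throws IOException {
        raf.seek((row * ROW_LENGTH + col) * DOUBLE_BYTES);
        return raf.readDouble();
    }

    // Append one new row of values at the end of the file as time progresses.
    static void append(RandomAccessFile raf, double[] row) throws IOException {
        raf.seek(raf.length());
        for (double v : row) {
            raf.writeDouble(v);
        }
    }
}

Open the file once with new RandomAccessFile("data.bin", "rw"), reuse it for many reads, and close it when done; for minimum or moving-average scans you can also seek to the start of the relevant region and read only that slice.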
You could try looking at Random Access Files:
Tutorial: http://docs.oracle.com/javase/tutorial/essential/io/rafs.html
API: http://docs.oracle.com/javase/6/docs/api/java/io/RandomAccessFile.html
... but you will still need to figure out the exact positions you want to read in a binary file.
You might want to consider moving to a database, maybe a small embedded one like H2 (http://www.h2database.com)

Inserting data in RandomAccessFile and updating index

I've got a RandomAccessFile in Java in which I manage some data. Simplified:
At the start of the file there is an index (one 8-byte long value per dataset, representing the offset at which the real data can be found).
So if I want to know where to find the data of dataset no. 3, for example, I read 8 bytes at offset (2*8). (Indexing starts at 0.)
A dataset itself consists of 4 bytes representing the size of the dataset, followed by all the bytes belonging to the dataset.
This works fine as long as I always rewrite the whole file.
It's important here that dataset no. 3 could have been written as the first entry in the file, so the index is ordered but the data itself is not.
If I insert a new dataset, I always append it to the end of the file. The number of datasets that can be in one file is limited, though: if I can store 100 datasets in the file, there will always be 100 entries in the index. If the offset read from the index for a dataset is 0, the dataset is new and will be appended to the file.
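For reference, reading a dataset under this layout looks roughly like the following sketch (MAX_DATASETS and the class name are illustrative):

import java.io.IOException;
import java.io.RandomAccessFile;

public class IndexedFile {

    private static final int MAX_DATASETS = 100;   // fixed number of index entries

    // Returns the bytes of dataset id, or null if its index entry is still 0 (not written yet).
    static byte[] readDataset(RandomAccessFile raf, int id) throws IOException {
        if (id < 0 || id >= MAX_DATASETS) {
            throw new IllegalArgumentException("No such dataset: " + id);
        }
        raf.seek(id * 8L);                 // one 8-byte offset per dataset at the start of the file
        long offset = raf.readLong();
        if (offset == 0) {
            return null;                   // dataset has not been written yet
        }
        raf.seek(offset);
        int length = raf.readInt();        // 4-byte size prefix
        byte[] data = new byte[length];
        raf.readFully(data);
        return data;
    }
}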
But there's one case which is not working for me yet. If I read dataset no. 3 from the file, add some data to it in my application and want to update it in the file, I have no idea how to do this.
If it has the same length as before, I can simply overwrite the old data. But if the new dataset has more bytes than the old one, I would have to move all the data in the file that comes after this dataset and update the index entries for those datasets.
Any idea how to do that?
Or is there maybe a better way to manage storing these datasets in a file?
PS: Yes, of course I thought of using a database, but that is not applicable for my project. I really do need simple files.
You can't easily insert data into the middle of a file. You'd basically have to read all the remaining data, write the "new" data and then rewrite the "old" data. Alternatively, you could invalidate the old slot (potentially allowing it to be reused later) and just write the whole new record to the end of the file. Your file format isn't really clear to me, to be honest, but fundamentally you need to be aware that you can't insert (or delete) in the middle of a file.
I've got a RandomAccessFile in Java where i manage some data.
Stop right there. You have a file. You are presently accessing it via RandomAccessFile in Java. However your entire question relates to the file itself, not to RandomAccessFile or Java. You have a major file design problem, as you are assuming facilities like inserting into the middle of a file that don't exist in any filesystem I have used since about 1979.
As the others have answered, there's no real way to make the middle of the file longer or shorter without rewriting the whole thing. There are some workarounds, though, and maybe one of them works for you after all:
Limit all datasets to a fixed length.
Delete by changing/removing the index entry, and add by always appending to the end of the file. Update by removing the old dataset and adding the new dataset to the end if the new one is longer. Compact the file from time to time by actually deleting the "ignored" datasets and moving all valid datasets together (rewriting everything).
If you can't limit the datasets to a fixed length and you intend to update a dataset making it longer, you can also leave a pointer at the end of the first part of a dataset and continue it later in the file. That gives you a structure like a linked list. If a lot of editing takes place, it also makes sense here to rearrange and compact the file once in a while.
Most of these solutions have some data overhead, but file size is usually not the problem, and as mentioned you can let some method "clean it up".
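A minimal sketch of the append-and-repoint update under the layout from the question (8-byte offsets at the start, each dataset prefixed by a 4-byte length); the old copy simply becomes dead space until a compaction pass rewrites the file:

import java.io.IOException;
import java.io.RandomAccessFile;

public class DatasetUpdater {

    private static final int MAX_DATASETS = 100;            // fixed index size from the question
    private static final long INDEX_BYTES = MAX_DATASETS * 8L;

    // Replaces dataset id by appending the new bytes and repointing its index entry.
    static void updateDataset(RandomAccessFile raf, int id, byte[] newData) throws IOException {
        long newOffset = Math.max(raf.length(), INDEX_BYTES); // append after the index/existing data
        raf.seek(newOffset);
        raf.writeInt(newData.length);    // 4-byte length prefix
        raf.write(newData);              // payload

        raf.seek(id * 8L);               // repoint the index entry for this dataset
        raf.writeLong(newOffset);
    }
}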
PS: I hope it's ok to answer such old questions - I couldn't find anything about it in the help center and I'm relatively new here.
