Loading huge data from a file into a database using Java?

I am able to load a huge text file into a database; the file has 33,264,591 lines.
I used a normal BufferedReader to read it line by line and push the data.
This takes an enormous amount of time: almost 3 hours to read line by line and insert into the database.
Could someone suggest a better way to insert the data quickly using Java?
Thank you in advance

Well, before going any further, I would suggest using a profiler to find out why it takes so much time. Once you know where the problem is, it will be easier to fix.

I believe the best way to read huge files is with a BufferedReader, line by line, which is what you are doing. I wonder, though, whether you are inserting the data in the same loop in which you read the file. The one optimization I can think of for your scenario is to do the database inserts in a separate thread, so that file reading is never blocked by a delay in a DB insert. DB inserts gradually slow down as the table grows, so moving them to a separate thread is a good idea.
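A minimal sketch of that reader/writer split, assuming a JDBC URL, an existing single-column lines table, and a queue capacity that are all illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ReaderWriterSplit {
    private static final String POISON = new String("EOF"); // sentinel, compared by reference

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        // Writer thread: drains the queue and does the inserts.
        Thread writer = new Thread(() -> {
            try (Connection con = DriverManager.getConnection("jdbc:h2:mem:demo"); // illustrative URL
                 PreparedStatement ps = con.prepareStatement("INSERT INTO lines(line) VALUES (?)")) {
                String line;
                while ((line = queue.take()) != POISON) {
                    ps.setString(1, line);
                    ps.executeUpdate();
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        // Reader (main thread): only ever blocks on a full queue, never on the DB.
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                queue.put(line);
            }
        }
        queue.put(POISON); // tell the writer we are done
        writer.join();
    }
}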

Do batch inserts instead of inserting one row at a time.
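A hedged sketch of such batching with PreparedStatement.addBatch() and executeBatch(); the batch size, table and column names are illustrative:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchInsert {
    // Flushes a batch every 1000 rows instead of one round trip per row.
    static void insertBatched(Connection con, Iterable<String> rows) throws SQLException {
        con.setAutoCommit(false); // commit per batch, not per row
        try (PreparedStatement ps = con.prepareStatement("INSERT INTO lines(line) VALUES (?)")) {
            int count = 0;
            for (String row : rows) {
                ps.setString(1, row);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch(); // flush the remainder
            con.commit();
        }
    }
}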

Related

J2EE: file download issues if the same file is used in the backend?

A webapp in my project provides "download CSV" functionality based on a search by the end user. It does the following:
It opens a file named "download.csv" (not using File.createTempFile(String prefix, String suffix, File directory), but always just "download.csv"), writes rows of data from a SQL recordset to it, and then uses FileUtils to copy that file's content to the servlet's OutputStream.
The recordset is based on search criteria, such as 1st Jan to 30th March.
Can this lead to a case where the file ends up containing the contents of two users who submit different date ranges or other filters at the same time, so that the JVM processes the requests concurrently?
Right now we are in dev and there is very little data.
I know we could write automated tests for this, but I wanted to understand the theory.
I suggested using the OutputStream of the HTTP response: pass it to the service layer as a plain OutputStream and write to it directly, or wrap it in a BufferedWriter and write to that.
The only downside is that the data will reach the client more slowly than a file copy, since the more data there is in the recordset, the longer it takes to iterate through it. But the total request time should be less, shouldn't it? (Writing to the response stream should take about the same time as writing to the file, and you save the extra copy from the file to the servlet output stream.)
Has anyone done testing around this and got test cases or solutions to share?
Well, that is a tricky question if you really want to go into the depth of both parts.
Concurrency
As you wrote, this "same name" approach can lead to a race condition on a multi-threaded system (which almost all systems are nowadays). I have seen code written like this and it can cause a lot of trouble. The resulting file can contain not only lines from both searches but interleaved characters as well.
Examples:
Thread 1 wants to write: 123456789\n
Thread 2 wants to write: abcdefghi\n
The output could vary in ways like these:
1st case:
123456789
abcdefghi
2nd case:
1234abcd56789
efghi
I would definitely use at least unique names (UUID.randomUUID()) to "hot-fix" the problem.
Disk IO
Disk IO is a tricky thing if you go in depth; speeds can vary across a wide range. In the JVM you can have both blocking and non-blocking IO. A blocking write may wait until the data is really on the disk, while the non-blocking kind does some "magic" to flush the file later. There is plenty of good reading on this topic.
TL;DR: As a rule of thumb, it is better to keep things in memory (if they fit) and not bother with the disk. If you use per-request memory for this, you avoid the concurrency problem as well. So in your case it would be better to rewrite the given part to use memory only and write straight to the output.
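A minimal sketch of that rewrite, assuming a plain servlet environment and an already-open ResultSet (column positions and the filename are illustrative):

import java.io.IOException;
import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.SQLException;
import javax.servlet.http.HttpServletResponse;

public class CsvStreamer {
    // Streams rows straight to the response; there is no shared temp file,
    // so two concurrent requests can never see each other's rows.
    static void streamCsv(HttpServletResponse response, ResultSet rs)
            throws IOException, SQLException {
        response.setContentType("text/csv");
        response.setHeader("Content-Disposition", "attachment; filename=\"download.csv\"");
        PrintWriter out = response.getWriter(); // buffered by the container
        while (rs.next()) {
            out.print(rs.getString(1));
            out.print(',');
            out.println(rs.getString(2)); // two columns, purely for illustration
        }
        out.flush();
    }
}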

Java - Sorting and csv: good practice with huge data

I need to sort a huge CSV file (10+ million records) with several algorithms in Java, but I have problems with the amount of memory.
Basically I have a huge CSV file where every record has 4 fields of different types (String, int, double).
I need to load this CSV into some structure and then sort it by every field.
My idea was: write a Record class (with its own fields), read the CSV file line by line, make a new Record object for every line, and put them all into an ArrayList. Then call my sorting algorithms on each field.
It doesn't work: I get an OutOfMemoryError when I try to load all the Record objects into my ArrayList.
This way I create tons of objects, which I think is not a good idea.
What should I do with this huge amount of data? Which method/data structure would be less expensive in terms of memory usage?
My point is just to use the sorting algorithms and see how they behave with a big data set; saving the sorted result to a file is not important.
I know there are libraries for CSV, but I have to implement this without external libraries.
Thank you very much! :D
Cut your file into pieces (depending on its size) and look into merge sort. That way you can sort even big files without using a lot of memory; it is what databases do when they have to perform huge sorts.
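A hedged sketch of such an external merge sort over whole CSV lines; the Comparator is where the per-field ordering from the question would plug in (chunk size and temp-file names are illustrative):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    static final int CHUNK = 1_000_000; // lines per in-memory chunk; tune to your heap

    // Phase 1: read fixed-size chunks, sort each in memory, spill to temp files.
    public static void sort(Path input, Path output, Comparator<String> cmp) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buf = new ArrayList<>(CHUNK);
            String line;
            while ((line = in.readLine()) != null) {
                buf.add(line);
                if (buf.size() == CHUNK) {
                    chunks.add(writeSorted(buf, cmp));
                    buf.clear();
                }
            }
            if (!buf.isEmpty()) {
                chunks.add(writeSorted(buf, cmp));
            }
        }
        merge(chunks, output, cmp);
    }

    static Path writeSorted(List<String> buf, Comparator<String> cmp) throws IOException {
        buf.sort(cmp);
        Path tmp = Files.createTempFile("chunk", ".csv");
        Files.write(tmp, buf);
        return tmp;
    }

    // Phase 2: k-way merge; the priority queue holds one "head" line per chunk.
    static void merge(List<Path> chunks, Path output, Comparator<String> cmp) throws IOException {
        class Head {
            String line;
            BufferedReader in;
        }
        PriorityQueue<Head> pq = new PriorityQueue<>((a, b) -> cmp.compare(a.line, b.line));
        for (Path p : chunks) {
            BufferedReader in = Files.newBufferedReader(p);
            String first = in.readLine();
            if (first != null) {
                Head h = new Head();
                h.line = first;
                h.in = in;
                pq.add(h);
            } else {
                in.close();
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!pq.isEmpty()) {
                Head h = pq.poll();
                out.write(h.line);
                out.newLine();
                String next = h.in.readLine();
                if (next != null) {
                    h.line = next;
                    pq.add(h);
                } else {
                    h.in.close();
                }
            }
        }
    }
}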
I would use an in-memory database such as H2 in in-memory mode (jdbc:h2:mem:), so everything stays in RAM and isn't flushed to disk (provided you have enough RAM; if not, you might want to use the file-based URL). Create your table there and write every row from the CSV into it. Provided you set up the indexes properly, sorting and grouping will be a breeze with standard SQL.
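A minimal H2 sketch along those lines (the h2 jar must be on the classpath; table and column names are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class H2SortDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:csvsort")) {
            try (Statement st = con.createStatement()) {
                st.execute("CREATE TABLE rec(a VARCHAR, b INT, c DOUBLE, d VARCHAR)");
                st.execute("CREATE INDEX idx_b ON rec(b)"); // index the sort column
            }
            try (PreparedStatement ps = con.prepareStatement("INSERT INTO rec VALUES (?,?,?,?)")) {
                // one addBatch() per CSV line, executeBatch() every few thousand rows
            }
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT * FROM rec ORDER BY b")) {
                while (rs.next()) {
                    // consume the rows in sorted order
                }
            }
        }
    }
}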

Downloading a ResultSet as a flat file using Java

Friends,
In my application I came across a scenario where the user may request a report download as a flat file, which may contain up to 17 lakh (1.7 million) records, around 650 MB of data. During this request, either my application server stops serving other threads or an out-of-memory error occurs.
As of now I iterate through the result set and print it to the file.
When I googled this, I came across an API named OpenCSV. I tried it too, but I didn't see any improvement in performance.
Please help me out with this.
Thanks for the quick responses, guys. Here is my code snippet:
try {
    response.setContentType("application/csv");
    PrintWriter dout = response.getWriter();
    while (rs.next()) {
        dout.print(dataRow); // dataRow: the ResultSet tuple formatted as one line (placeholder from the original post)
        dout.print("\r\n");
        dout.flush(); // flushing after every single row forces many tiny writes
    }
}
OpenCSV will cleanly deal with the eccentricities of the CSV format, but a large report is still a large report. Take a look at the specific memory error; it sounds like you need to increase the heap or the max PermGen space (which one depends on the error). Without any tuning, the JVM will only occupy a fixed amount of RAM (in my experience, 64 MB).
If you only stream the data from the result set to the file without using big buffers, this should be possible. But maybe you are first collecting the data in a growing list before sending it to the file? You should investigate that.
Please specify your question in more detail; otherwise we have to speculate.
Generating CSV isn't limited by memory; at worst, pre-populating the data for the CSV is, and even that can be done efficiently, for example by querying subsets of rows from the DB using LIMIT/OFFSET and writing each subset to the file immediately, instead of hauling the entire DB table contents into Java's memory before writing a single line. (Bear in mind that Excel limits the number of rows in one "sheet" to about one million.)
Most decent DBs have an export-to-CSV function which can undoubtedly do this task much more efficiently. In the case of MySQL, for example, you can use SELECT ... INTO OUTFILE for this (LOAD DATA INFILE is the import counterpart).
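If the export has to stay in Java, a paging sketch of the LIMIT/OFFSET idea from the answer above (MySQL-style SQL; table, columns and page size are illustrative):

import java.io.Writer;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PagedCsvExport {
    // Fetches one page at a time, writes it out, and releases it before the next.
    static void exportPaged(Connection con, Writer out) throws Exception {
        final int pageSize = 10_000;
        String sql = "SELECT col1, col2 FROM report ORDER BY id LIMIT ? OFFSET ?";
        for (int offset = 0; ; offset += pageSize) {
            int rows = 0;
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setInt(1, pageSize);
                ps.setInt(2, offset);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        out.write(rs.getString(1) + "," + rs.getString(2) + "\r\n");
                        rows++;
                    }
                }
            }
            if (rows < pageSize) {
                break; // last (possibly partial) page
            }
        }
        out.flush();
    }
}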

Several FileOutputStreams at a time?

The situation is this:
I have a CSV file with records (usually 10k, but up to 1m records).
I will process each record (very basic arithmetic, with 5 basic select queries to the DB for every record).
Each processed record will then be written to a file, but not the same file every time: a record can be written to another file instead.
Basically I have 1 input file but several possible output files (around 1-100 possible output files).
The processing itself is basic, so I am focusing on how I should handle the records.
Which option is appropriate for this situation?
Store several Lists, one per possible output file, and then write out each List at the end?
Or, to avoid several very large Lists, write each record to its respective output file immediately after processing it? But that requires keeping many streams open at the same time.
Please enlighten me on this. Thanks.
The second option is fine: create the file output streams on demand and keep them open as long as needed (track them in a Map, for example, as sketched below).
The operating system may restrict how many open file handles it allows, but those limits are usually well beyond a couple of hundred files.
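A minimal sketch of that bookkeeping (the file name doubles as the map key; everything else is illustrative):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class OutputRouter implements AutoCloseable {
    private final Map<String, BufferedWriter> writers = new HashMap<>();

    // Opens a writer lazily the first time a file name is seen.
    public void write(String fileName, String record) throws IOException {
        BufferedWriter w = writers.get(fileName);
        if (w == null) {
            w = new BufferedWriter(new FileWriter(fileName));
            writers.put(fileName, w);
        }
        w.write(record);
        w.newLine();
    }

    // Closes (and thereby flushes) every stream once, at the end of the run.
    @Override
    public void close() throws IOException {
        for (BufferedWriter w : writers.values()) {
            w.close();
        }
    }
}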
A third option:
You could also just append to the files; FileOutputStream supports that via a constructor flag:
new FileOutputStream(File file, boolean append)
This is less performant than keeping the FileOutputStreams open, but works as well.
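A small sketch of the append variant, opening and closing the stream per record:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class AppendingWriter {
    // No long-lived handles; pays the open/close cost on every record instead.
    static void appendRecord(String fileName, String record) throws IOException {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(fileName, true))) {
            w.write(record);
            w.write(System.lineSeparator());
        }
    }
}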

Inserting data in RandomAccessFile and updating index

I've got a RandomAccessFile in Java in which I manage some data. Simplified:
At the start of the file there is an index (one 8-byte long value per dataset, representing the offset where the real data can be found).
So if I want to know where to find the data of dataset no. 3, for example, I read 8 bytes at offset (2*8). (Indexing starts at 0.)
A dataset itself consists of 4 bytes representing the size of the dataset, followed by all the bytes belonging to it.
This works fine as long as I always rewrite the whole file.
It's important here that dataset no. 3 could have been written as the first entry in the file: the index is ordered, but the data itself is not.
When I insert a new dataset, I always append it to the end of the file. The number of datasets in one file is limited, though: if a file can store 100 datasets, there will always be 100 entries in the index. If the offset read from the index for a dataset is 0, the dataset is new and will be appended to the file.
But there's one case which doesn't work for me yet. If I read dataset no. 3 from the file, add some data to it in my application, and then want to update it in the file, I have no idea how to do this.
If it has the same length as before, I can simply overwrite the old data. But if the new dataset has more bytes than the old one, I would have to move all the data behind this dataset and update the index entries for those datasets.
Any idea how to do that?
Or is there maybe a better way to manage storing these datasets in a file?
PS: Yes, of course I thought of using a database, but that is not applicable for my project. I really do need plain files.
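For reference, a minimal read path matching the layout described above (8-byte index entries at the start of the file, a 4-byte size prefix per dataset, 0-based dataset numbers):

import java.io.IOException;
import java.io.RandomAccessFile;

public class DatasetFile {
    // Returns the payload of dataset n, or null if the index entry is still 0.
    static byte[] readDataset(RandomAccessFile raf, int n) throws IOException {
        raf.seek((long) n * 8);     // index entry for dataset n
        long offset = raf.readLong();
        if (offset == 0) {
            return null;            // dataset has not been written yet
        }
        raf.seek(offset);
        int length = raf.readInt(); // 4-byte size prefix
        byte[] data = new byte[length];
        raf.readFully(data);
        return data;
    }
}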
You can't easily insert data into the middle of a file. You'd basically have to read all the remaining data, write the "new" data and then rewrite the "old" data. Alternatively, you could invalidate the old slot (potentially allowing it to be reused later) and just write the whole new record at the end of the file. Your file format isn't really clear to me, to be honest, but fundamentally you need to be aware that you can't insert (or delete) in the middle of a file.
I've got a RandomAccessFile in Java in which I manage some data.
Stop right there. You have a file. You are presently accessing it via RandomAccessFile in Java, but your entire question relates to the file itself, not to RandomAccessFile or Java. You have a major file design problem, because you are assuming facilities, such as inserting into the middle of a file, that don't exist in any filesystem I have used since about 1979.
As the others answered, there is no real way to make the file longer or shorter without rewriting the whole thing. There are some workarounds, though, and maybe one of these approaches will work for you:
Limit all datasets to a fixed length.
Delete by changing/removing the index entry, and add by always appending to the end of the file. Update by removing the old dataset and appending the new one to the end if it is longer (see the sketch after this list). Compact the file from time to time by actually deleting the "ignored" datasets and moving all valid datasets together (rewriting everything).
If you can't limit the datasets to a fixed length and you intend to update a dataset so that it becomes longer, you can also leave a pointer at the end of the first part of a dataset and continue it later in the file. That gives you a structure like a linked list. If a lot of editing takes place, it also makes sense here to rearrange and compact the file.
Most of these solutions carry some data overhead, but file size is usually not the problem, and as mentioned you can have a method that cleans it up.
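A sketch of the update-by-append strategy from the list above (same layout as in the question; the compaction pass is left out):

import java.io.IOException;
import java.io.RandomAccessFile;

public class DatasetUpdate {
    // The old bytes stay in place as dead space; only the index entry is repointed.
    static void updateDataset(RandomAccessFile raf, int n, byte[] newData) throws IOException {
        long newOffset = raf.length();  // always append at the end
        raf.seek(newOffset);
        raf.writeInt(newData.length);   // 4-byte size prefix
        raf.write(newData);
        raf.seek((long) n * 8);         // repoint the 8-byte index entry
        raf.writeLong(newOffset);
        // the old dataset is now unreachable; a later compaction pass can
        // reclaim the space by rewriting the file
    }
}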
PS: I hope it's ok to answer such old questions - I couldn't find anything about it in the help center and I'm relatively new here.
