Assume I have a StringBuffer with values "1 \n 2 \n 3 \n...etc" where \n is a line break.
How would I add these values to an existing CSV file as a column using Java? Specifically, this would be the last column.
For example, let's say I have a CSV file that looks like this:
5, 2, 5
2, 3, 1
3, 5, 2
..
etc.
Given that StringBuffer, the output after using the method to add the column to the CSV file should look like this:
5, 2, 5, 1
2, 3, 1, 2
3, 5, 2, 3
..
etc.
I also plan to add columns with thousands of values, so I am looking for something that does not have high memory consumption.
Thanks ahead of time.
Edit: Columns may have different lengths. I see people saying to append the new values at the end of each line. The problem is that this would add the values to the wrong columns, and I cannot have that happen. I thank you all for your suggestions though, as they were very good.
Edit 2: I have received critique about my use of StringBuffer and yes, I agree: if this problem were isolated, I would also suggest StringBuilder. The context of this problem is a program with synchronized threads (acting as scenarios) collecting response times for a range of concurrent threads. The concurrent threads execute concurrent queries against a database, and once a query has been executed, the result is appended to a StringBuffer. All the response times for each synchronized thread are appended to a StringBuffer and written to a CSV document. There can be several threads with the same response time. I could use StringBuilder, but then I would have to manually synchronize the threads appending the response times; in my case I do not think it would make much of a difference in performance and it would add an unnecessary amount of code. I hope this helps and, once again, I thank you all for your concerns and suggestions. If after reading this you are still not convinced that I should use StringBuffer, then I ask that we please take this discussion offline.
Edit 3: I have figured out how to work around the issue of adding the columns when the rows are different sizes. I simply add commas for every missing column (also note that my rows grow with each added column). It looks like @BorisTheSpider's conceptual solution actually works with this modification. The problem is that I am not sure how to add the text at the end of each line. My code so far (I removed code to conserve space):
//Before this loop there is a statement that creates test.csv (the file has no values before this loop runs).
for (int p = 0; p < (max + 1); p = p + inc) {
    threadThis2(p);
    // threadThis2 appends several comma-delimited values to the StringBuffer.
    // p represents the number of threads/queries to execute at the same time.
    comma = p / inc; // how many commas to put if there is nothing on the line.
    for (int i = 0; i < comma; i++) {
        commas.append(",");
    }
    br = new BufferedReader(new FileReader("test.csv"));
    List<String> avg = Arrays.asList(sb.toString().split(", "));
    for (int i = 0; i < avg.size(); i++) {
        if (br.readLine() == null) {
            w.write(commas.toString() + avg.get(i) + ", \n");
        } else {
            w.write(avg.get(i) + ", \n");
        }
    }
    br.close();
    sb.setLength(0);
    commas.setLength(0);
}
Please note this code is in its early stages (I will of course declare all the variables outside the for loop later on). So far this code works. The problem is that the columns are not side by side, which is what I want. I understand I may be required to create temporary files but I need to approach this problem very carefully as I might need to have a lot of columns in the future.
Apparently there are two basic requirements:
Append a column to an existing CSV file
Allow concurrent operation
To achieve Requirement #1, the original file has to be read and rewritten as a new file, including the new column, irrespective of its location (i.e., in a StringBuffer or elsewhere).
The best (and only generic) way of reading a CSV file would be via a mature and field-proven library, such as OpenCSV, which is lightweight and commercially-friendly, given its Apache 2.0 license. Otherwise, one has to either do many simplifications (e.g., always assume single-line CSV records), or re-invent the wheel by implementing a new CSV parser.
In either case, a simple algorithm is needed, e.g.:
Initialize a CSV reader or parser object from the library used (or from whatever custom solution is used), supplying the existing CSV file and the necessary parameters (e.g., field separator).
Read the input file record-by-record, via the reader or parser, as a String[] or List<String> structure.
Manipulate the structure returned for every record to add or delete any extra fields (columns), in memory.
Add blank fields (i.e., just extra separators, 1 per field), if desired or needed.
Use a CSV writer from the library (or manually implement a writer) to write the new record to the output file.
Append a newline character at the end of each record written to the output file.
Repeat for all the records in the original CSV file.
This approach is also scalable, as it does not require any significant in-memory processing.
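As a rough illustration, here is a minimal sketch of that algorithm using OpenCSV's CSVReader/CSVWriter (the file names and the newColumn list are placeholders, and the exact package name and exception signatures depend on the OpenCSV version in use):
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.Arrays;
import java.util.List;

public class AppendColumn {
    public static void main(String[] args) throws Exception {
        // e.g. the contents of the StringBuffer, split on line breaks
        List<String> newColumn = Arrays.asList("1", "2", "3");

        try (CSVReader reader = new CSVReader(new FileReader("test.csv"));
             CSVWriter writer = new CSVWriter(new FileWriter("test-out.csv"))) {
            String[] record;
            int row = 0;
            while ((record = reader.readNext()) != null) {
                String[] extended = Arrays.copyOf(record, record.length + 1);
                // pad with an empty field if the new column is shorter than the file
                extended[record.length] = row < newColumn.size() ? newColumn.get(row) : "";
                writer.writeNext(extended);
                row++;
            }
        }
        // afterwards, replace test.csv with test-out.csv (e.g. via Files.move)
    }
}
Note that CSVWriter quotes fields by default; the separator, quote character, and line ending can all be configured through its constructor if plain unquoted output is preferred.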
For Requirement #2, there are many ways of supporting concurrency and in this scenario it is more efficient to do it in a tailored manner (i.e., "manually" in the application), as opposed to relying on a thread-safe data structure like StringBuffer.
I have two large txt files of around 150 MB each. I want to read some data from each line of file1 and scan through all the lines of file2 until I find the matching data. If the matching data is not found, I want to output that line to another file.
I want the program to use as little memory as possible. Time is not a constraint.
Edit1
I have tried a couple of options:
Option 1: I read file2 using BufferedReader, Scanner, and Apache Commons FileUtils.lineIterator. I loaded the data of file2 into a HashMap by reading each line, then read file1 one line at a time and compared against the data in the HashMap. If it didn't match, I wrote the line to file3.
Option 2: Read file2 once for every record in file1, using the three readers mentioned above. After every pass I had to close the file and read it again. I am wondering what the best way is. Is there any other option I can look into?
I have to make some assumptions about the file.
I am going to assume the lines are long, and you want the lines that are not the same in the 2 files.
I would read the files 4 times (2 times per file).
Of course, it's not as efficient as reading it 2 times (1 time per file), but reading it 2 times means lots of memory is used.
Pseudo code for 1st read of each file:
Map<MyComparableByteArray, Long> digestMap = new HashMap<>();
try (BufferedReader br = ...)
{
long lineNr = 0;
String line;
while ((line = br.readLine()) != null)
{
digestMap.put(CreateDigest(line), lineNr);
}
}
If the digests are different/unique, I know that the line does not occur in the other file.
If the digests are the same, we will need to check the lines and actually compare them to make sure that they are really the same - this can occur during the second read.
We also need to be careful about which digest we choose.
If we choose a short digest (e.g., MD5), we might run into lots of collisions, but this is appropriate for files with short lines, and we will need to handle the collisions separately (i.e., convert the map to a Map<digest, List<lineNr>> structure).
If we choose a long digest (e.g., SHA-512), we won't run into lots of collisions (though it is still safer to handle them as mentioned above), BUT we will have the problem of not saving as much memory unless the file lines are very long.
So the general technique is:
Read each file and generate hashes.
Compare the hashes to mark the lines that need to be compared.
Read each file again and generate the output. Recheck all collisions found by the hashes in this step.
By the way, MyComparableByteArray is a custom wrapper around a byte[], to enable it to be a HashMap key (i.e., by implementing the equals() and hashCode() methods). A raw byte[] cannot be used as a key, because arrays use identity-based equals() and hashCode(). There are 2 ways to handle this:
custom wrapper as I've mentioned - this will be more efficient than the alternative.
convert it to a string using base64. This will make the memory usage around 2.5x worse than option 1, but does not need the custom code.
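For reference, a minimal sketch of such a wrapper (matching the class name used above) could be:
import java.util.Arrays;

// Wraps a digest byte[] so it can be used as a HashMap key.
final class MyComparableByteArray {
    private final byte[] bytes;

    MyComparableByteArray(byte[] bytes) {
        this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MyComparableByteArray
                && Arrays.equals(bytes, ((MyComparableByteArray) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}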
I'm writing a file translator for my company that grabs data from a source file and writes a bunch of delimited records to a target file. The records have the form:
HEADER*REC 1*REC 2*REC 3*REC 4
If a record is empty, and there is another record that can come after it, then the value is not printed, but the delimiter is included, e.g.:
HEADER*REC 1**REC 3*REC 4
If a record is empty, and it is the last record in the series, then the value and the delimiter are omitted, e.g.:
HEADER*REC 1*REC 2*REC 3
I was trying to think of a nice way to describe this in code, other than (pseudocode):
if last record is empty
print this
otherwise
print this other thing
I guess the code isn't too ugly, but I'd like a nicer solution. I'm using a StringBuilder to write the data for each transaction (each set of records corresponds to a transaction, so I can iterate through a TransactionSet object), and if I can, I try to avoid copious switch/if statements. If anyone knows of a nicer or more elegant way to do this, I would love to hear it.
EDIT: Clarified block of pseudocode
You can do it like this
System.out.print("HEADER");
StringBuilder sep = new StringBuilder();
for(String rec: headings) {
sep.append("*");
if(rec != null && !rec.isEmpty()) {
System.out.print(sep + rec);
sep.setLength(0);
}
}
System.out.println();
This way, the separators are only printed when a non-empty record comes after them.
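For example, a small self-contained demo of the same idea (the sample records are made up):
public class DelimiterDemo {
    public static void main(String[] args) {
        String[] records = {"REC 1", "", "REC 3", ""}; // REC 2 and REC 4 are empty

        System.out.print("HEADER");
        StringBuilder sep = new StringBuilder();
        for (String rec : records) {
            sep.append("*");
            if (rec != null && !rec.isEmpty()) {
                System.out.print(sep + rec); // flush any accumulated separators
                sep.setLength(0);
            }
        }
        System.out.println(); // prints: HEADER*REC 1**REC 3
    }
}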
I have a semicolon-delimited input file where the first column is a fixed-width, 3-character code and the remaining columns are string data.
001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
I want to split the file above into a number of files based on the distinct values in the first column.
For example, there are three different values in the first column above, so I will split the file into three files: 001.txt, 002.txt, and 003.txt.
Each output file should contain the item count as the first line and the data as the remaining lines.
There are five 001 rows, so 001.txt will be:
5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
Similarly, the 002 file will have 4 as its first line followed by 4 lines of data, and the 003 file will have 5 as its first line followed by 5 lines of data.
What would be the most efficient way to achieve this, considering a very large input file with more than 100,000 rows?
I have written the code below to read lines from the file:
try {
    BufferedReader br = new BufferedReader(new FileReader(this.inputFilePath));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        String[] tokens = strLine.split(";");
        // ... process tokens ...
    }
    br.close();
} catch (IOException e) {
    e.printStackTrace();
}
for each line
extract chunk name, e.g 001
look for file named "001-tmp.txt"
if one exists, read the first line - it gives you the current number of lines; increment the value, seek back to position 0, and overwrite the count in place (note that writeUTF prefixes the string with a 2-byte length, so for a plain text file writeBytes with a fixed-width field, for example 10 characters, is the safer choice - see the sketch after this list)
if one does not exist, create it and write 1 as the first line, padded to the 10-character width
append current line to the file
close current file
proceed with next line of source file
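A rough sketch of that seek-based update, assuming a fixed-width 10-character count field on the first line (the file naming is illustrative, and writeBytes is used instead of writeUTF so the file stays plain text):
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ChunkAppender {

    // Appends one data line to <code>.txt, keeping the running line count on line 1.
    static void appendLine(String code, String data) throws IOException {
        File f = new File(code + ".txt");
        boolean isNew = !f.exists();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            int count;
            if (isNew) {
                count = 1;
            } else {
                String firstLine = raf.readLine();            // the padded count field
                count = Integer.parseInt(firstLine.trim()) + 1;
            }
            raf.seek(0);
            raf.writeBytes(String.format("%-10d", count) + "\n"); // overwrite the count in place
            raf.seek(raf.length());                               // jump to the end of the file
            raf.writeBytes(data + "\n");
        }
    }
}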
One of the solutions that comes to mind is to keep a Map of open writers and only open every file once. But you won't be able to do this because you have around 100,000 (1 lakh) rows, and no OS will allow you that many open file descriptors.
So one way is to open each file in append mode, write to it, and close it again. But because of the huge number of open/close calls, the process may slow down. You can test it for yourself though.
If the above does not provide satisfying results, you may try a mix of approaches 1 and 2, whereby you keep at most 100 files open at any time and only close a file when a new file that is not already open needs to be written to, as sketched below.
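One way to cap the number of simultaneously open files is an LRU-style cache of writers, for example a LinkedHashMap in access order that closes the least recently used writer once the limit is exceeded (a sketch; the limit of 100 and the file naming are arbitrary):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;

public class WriterCache {
    private static final int MAX_OPEN = 100;

    // Access-ordered map: the least recently used writer is evicted (and closed) first.
    private final Map<String, Writer> writers =
            new LinkedHashMap<String, Writer>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Writer> eldest) {
                    if (size() > MAX_OPEN) {
                        try {
                            eldest.getValue().close();
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                        return true;
                    }
                    return false;
                }
            };

    Writer writerFor(String code) throws IOException {
        Writer w = writers.get(code);
        if (w == null) {
            // Open in append mode so an evicted-and-reopened file keeps its contents.
            w = new BufferedWriter(new FileWriter(code + ".txt", true));
            writers.put(code, w);
        }
        return w;
    }

    void closeAll() throws IOException {
        for (Writer w : writers.values()) {
            w.close();
        }
        writers.clear();
    }
}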
First, create a HashMap<String, ArrayList<String>> map to collect all the data from the file.
Second, use strLine.split(";", 2) instead of strLine.split(";"). The result will be an array of length 2: the first element is the code and the second is the data.
Then, add decoded string to the map:
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);
At the end, scan map.keySet() and, for each key, create a file named after that key and write the list's size and the list's contents to it.
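The write-out step could then look roughly like this (a sketch, assuming the usual java.io imports, that the enclosing method declares IOException, and that map is the HashMap<String, ArrayList<String>> built above):
for (Map.Entry<String, ArrayList<String>> entry : map.entrySet()) {
    try (PrintWriter out = new PrintWriter(new BufferedWriter(
            new FileWriter(entry.getKey() + ".txt")))) {
        out.println(entry.getValue().size()); // item count on the first line
        for (String data : entry.getValue()) {
            out.println(data);                // remaining lines: the data part
        }
    }
}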
For each three character code, you're going to have a list of input lines. To me the obvious solution would be to use a Map, with String keys (your three character codes) pointing to the corresponding List that contains all of the lines.
For each of those keys, you'd create a file with the relevant name, the first line would be the size of the list, and then you'd iterate over it to write the remaining lines.
I guess you are not fixed to three files, so I suggest you create a map of writers with your three-character code as the key and the writer as the value.
For each line you read, you select or create the required writer and write the line to it. You also need a second map to maintain the line counts for all files.
Once you are done reading the source file, flush and close all writers and read the files one by one again. This time you just add the line count at the front of the file. To my knowledge there is no other way but to rewrite the entire file, because it is not directly possible to prepend anything to a file without buffering and rewriting it. I suggest you use a temporary file for this.
This answer applies only if your file is too large to be stored fully in memory. If storing it is possible, there are faster solutions, such as collecting the contents fully in StringBuffer objects before writing them out to files.
I have a general question on your opinion about my "technique".
There are 2 text files (file_1 and file_2) that need to be compared to each other. Both are huge (3-4 gigabytes, with 30,000,000 to 45,000,000 lines each).
My idea is to read several lines (as many as possible) of file_1 to the memory, then compare those to all lines of file_2. If there's a match, the lines from both files that match shall be written to a new file. Then go on with the next 1000 lines of file_1 and also compare those to all lines of file_2 until I went through file_1 completely.
But this sounds actually really, really time consuming and complicated to me.
Can you think of any other method to compare those two files?
How long do you think the comparison could take?
For my program, time does not matter that much. I have no experience in working with such huge files, therefore I have no idea how long this might take. It shouldn't take more than a day though. ;-) But I am afraid my technique could take forever...
Another question that just came to my mind: how many lines would you read into memory? As many as possible? Is there a way to determine the number of possible lines before actually trying it?
I want to read as many as possible (because I think that's faster), but I've run out of memory quite often.
Thanks in advance.
EDIT
I think I have to explain my problem a bit more.
The purpose is not to see if the two files in general are identical (they are not).
There are some lines in each file that share the same "characteristic".
Here's an example:
file_1 looks somewhat like this:
mat1 1000 2000 TEXT //this means the range is from 1000 - 2000
mat1 2040 2050 TEXT
mat3 10000 10010 TEXT
mat2 20 500 TEXT
file_2 looks like this:
mat3 10009 TEXT
mat3 200 TEXT
mat1 999 TEXT
TEXT refers to characters and digits that are of no interest to me; mat can go from mat1 to mat50 and they are in no particular order; also there can be 1000x mat2 (but the numbers in the next column are different). I need to find the fitting lines such that: matX is the same in both compared lines and the number mentioned in file_2 fits into the range mentioned in file_1.
So in my example I would find one match: line 3 of file_1 and line 1 of file_2 (because both are mat3 and 10009 is between 10000 and 10010).
I hope this makes it clear to you!
So my question is: how would you search for the matching lines?
Yes, I use Java as my programming language.
EDIT
I now split the huge files first so that I have no problems with running out of memory. I also think it is faster to compare (many) smaller files to each other than those two huge files. After that I can compare them the way I mentioned above. It may not be the perfect way, but I am still learning ;-)
Nonetheless, all your approaches were very helpful to me; thank you for your replies!
I think your approach is rather reasonable.
I can imagine different strategies -- for example, you could sort both files before comparing (there are efficient implementations of file sort, and the Unix sort utility can sort files of several GBs in minutes), and, once sorted, you can compare the files sequentially, reading line by line.
But this is a rather complex way to go -- you either need to run an external program (sort) or write a comparably efficient implementation of file sort in Java yourself, which is by itself not an easy task. So, for the sake of simplicity, I think your way of chunked reads is very promising.
As for how to find a reasonable block size -- first of all, "the more, the better" may not be correct -- I think the total running time will approach some constant line asymptotically, so you may get close to that line faster than you think; you need to benchmark this.
Next -- you may read lines into a buffer like this:
final List<String> lines = new ArrayList<>();
try {
    final List<String> block = new ArrayList<>(BLOCK_SIZE);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        final String line = ...; // read a line from the file
        block.add(line);
    }
    lines.addAll(block);
} catch (OutOfMemoryError ooe) {
    // break
}
So you read as many lines as you can, leaving the last BLOCK_SIZE worth of free memory. BLOCK_SIZE should be big enough for the rest of your program to run without an OOM.
In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.
As you have said you run out of memory however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read in the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more, but you have to work with your memory constraints.
EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.
Also, with regard to searching for the matching lines: I think the best way would be to do some processing on file_1. Create a HashMap<String, List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1. Then write a function like this (ignoring error checking):
boolean isInRange(String material, int value)
{
List<Range> ranges = hashMapName.get(material);
for (Range range : ranges)
{
if (value >= range.getStart() && value <= range.getEnd())
{
return true;
}
}
return false;
}
and call it for each (parsed) line of file_2.
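A minimal sketch of the Range wrapper and the file_1 pre-processing it relies on (class and method names are illustrative):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Range {
    private final int start;
    private final int end;

    Range(int start, int end) {
        this.start = start;
        this.end = end;
    }

    int getStart() { return start; }
    int getEnd()   { return end; }
}

class RangeIndex {
    // Maps "mat1".."mat50" to the ranges defined for it in file_1.
    static Map<String, List<Range>> buildIndex(String file1Path) throws IOException {
        Map<String, List<Range>> index = new HashMap<>();
        try (BufferedReader br = new BufferedReader(new FileReader(file1Path))) {
            String line;
            while ((line = br.readLine()) != null) {
                // expected layout: "mat1 1000 2000 TEXT..."
                String[] parts = line.split("\\s+", 4);
                Range range = new Range(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
                index.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(range);
            }
        }
        return index;
    }
}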
Now that you've given us more specifics, the approach I would take relies upon pre-partitioning, and optionally, sorting before searching for matches.
This should eliminate a substantial number of comparisons that wouldn't otherwise match anyway in the naive, brute-force approach. For the sake of argument, let's peg both files at 40 million lines each.
Partitioning: Read through file_1 and send all lines starting with mat1 to file_1_mat1, and so on. Do the same for file_2. This is trivial with a little grep, or should you wish to do it programmatically in Java it's a beginner's exercise.
That's one pass through the two files, for a total of 80 million lines read, yielding two sets of 50 files of 800,000 lines each on average.
Sorting: For each partition, sort according to the numeric value in the second column only (the lower bound from file_1 and the actual number from file_2). Even if 800,000 lines can't fit into memory I suppose we can adapt 2-way external merge sort and perform this faster (fewer overall reads) than a sort of the entire unpartitioned space.
Comparison: Now you just have to iterate once through both pairs of file_1_mat1 and file_2_mat1, without need to keep anything in memory, outputting matches to your output file. Repeat for the rest of the partitions in turn. No need for a final 'merge' step (unless you're processing partitions in parallel).
Even without the sorting stage the naive comparison you're already doing should work faster across 50 pairs of files with 800,000 lines each rather than with two files with 40 million lines each.
There is a tradeoff: if you read a big chunk of the file, you save disk seek time, but you may read information you will not need, since a difference may already be encountered in the first lines.
You should probably run some experiments (benchmarks) with varying chunk sizes to find out what the optimal chunk to read is in the average case.
Not sure how good an answer this would be - but have a look at this page: http://c2.com/cgi/wiki?DiffAlgorithm - it summarises a few diff algorithms. The Hunt-McIlroy algorithm is probably the best implementation. From that page there is also a link to a Java implementation of GNU diff. However, I think an implementation in C/C++ compiled to native code will be much faster. If you're stuck with Java, you may want to consider JNI.
Indeed, that could take a while. You have to make 1,200,000,000 line comparisons.
There are several possibilities to speed that up by an order of magnitude:
One would be to sort file2 and do kind of a binary search on file level.
Another approach: compute a checksum of each line and search on that. Depending on the average line length, the file in question would be much smaller, and you really can do a binary search if you store the checksums in a fixed format (i.e., as a long).
The number of lines you read at once from file_1 does not matter, however. This is micro-optimization in the face of great complexity.
If you want a simple approach: you can hash both of the files and compare the hashes. But it's probably faster (especially if the files differ) to use your approach. About the memory consumption: just make sure you use enough memory; using no buffer for this kind of thing is a bad idea.
And all those answers about hashes, checksums etc: those are not faster. You have to read the whole file in both cases. With hashes/checksums you even have to compute something...
What you can do is sort each individual file, e.g. with the UNIX sort or something similar in Java. You can then read the sorted files one line at a time to perform a merge.
I have never worked with such huge files, but this is my idea and it should work.
You could look into hashing, using SHA-1 hashing.
Import the following
import java.io.FileInputStream;
import java.security.MessageDigest;
Once your text file has been loaded, loop through each line and at the end print out the hash. The example links below go into more depth.
StringBuffer myBuffer = new StringBuffer("");
// mdbytes is the byte[] digest returned by MessageDigest.digest(); convert it to hex
for (int i = 0; i < mdbytes.length; i++) {
    myBuffer.append(Integer.toString((mdbytes[i] & 0xff) + 0x100, 16).substring(1));
}
System.out.println("Computed Hash = " + myBuffer.toString());
SHA Code example focusing on Text File
SO Question about computing SHA in JAVA (Possibly helpful)
Another sample of hashing code.
Simply read each file separately; if the hash value for each file is the same at the end of the process, then the two files are identical. If not, then they differ.
Then, if you get a different value, you can do the very time-consuming line-by-line check.
Overall, it seems that reading line by line would take forever. I would do that if you are trying to find each individual difference, but I think hashing would be quicker for checking whether they are the same.
SHA checksum
If you want to know exactly whether the files are different or not, then there isn't a better solution than yours -- comparing sequentially.
However, you can use some heuristics that tell you, with some probability, whether the files are identical.
1) Check file size; that's the easiest.
2) Take a random file position and compare block of bytes starting at this position in the two files.
3) Repeat step 2) to achieve the needed probability.
You should measure how many reads (and what block size) are useful for your program.
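A sketch of steps 1) and 2), using RandomAccessFile to read a block at a random common offset from both files (the block size and number of samples are parameters to tune):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.Random;

class ProbabilisticCompare {
    // Returns false as soon as a sampled block differs; true means "probably identical".
    static boolean probablyEqual(String pathA, String pathB, int samples, int blockSize)
            throws IOException {
        try (RandomAccessFile a = new RandomAccessFile(pathA, "r");
             RandomAccessFile b = new RandomAccessFile(pathB, "r")) {
            if (a.length() != b.length()) {
                return false; // step 1): different sizes, definitely different
            }
            Random rnd = new Random();
            byte[] bufA = new byte[blockSize];
            byte[] bufB = new byte[blockSize];
            for (int i = 0; i < samples; i++) {
                long maxPos = Math.max(0, a.length() - blockSize);
                long pos = (long) (rnd.nextDouble() * maxPos);
                int len = (int) Math.min(blockSize, a.length() - pos);
                a.seek(pos);
                b.seek(pos);
                a.readFully(bufA, 0, len);
                b.readFully(bufB, 0, len);
                if (!Arrays.equals(bufA, bufB)) {
                    return false; // step 2): a sampled block differs
                }
            }
            return true; // step 3): all samples matched
        }
    }
}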
My solution would be to produce an index of one file first, then use that to do the comparison. This is similar to some of the other answers in that it uses hashing.
You mention that the number of lines is up to about 45 million. This means that you could (potentially) store an index which uses 16 bytes per entry (128 bits) and it would use about 45,000,000*16 = ~685MB of RAM, which isn't unreasonable on a modern system. There are overheads in using the solution I describe below, so you might still find you need to use other techniques such as memory mapped files or disk based tables to create the index. See Hypertable or HBase for an example of how to store the index in a fast disk-based hash table.
So, in full, the algorithm would be something like:
Create a hash map which maps Long to a List of Longs (HashMap<Long, List<Long>>)
Get the hash of each line in the first file (Object.hashCode should be sufficient)
Get the offset in the file of the line so you can find it again later
Add the offset to the list of lines with matching hashCodes in the hash map
Compare each line of the second file to the set of line offsets in the index
Keep any lines which have matching entries
EDIT:
In response to your edited question, this wouldn't really help in itself. You could just hash the first part of the line, but it would only create 50 different entries. You could then create another level in the data structure though, which would map the start of each range to the offset of the line it came from.
So something like index.get("mat32") would return a TreeMap of ranges. You could look for the range preceding the value you are looking for with lowerEntry(). Together this would give you a pretty fast check to see whether a given matX/number combination falls in one of the ranges you are checking for.
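In code, that per-material lookup might look something like this (a sketch, assuming the index maps each matX string to a TreeMap from range start to range end; floorEntry is used instead of lowerEntry so a value equal to a range's lower bound still matches):
import java.util.Map;
import java.util.TreeMap;

class RangeLookup {
    // index: material -> (range start -> range end), built while reading file_1
    static boolean inAnyRange(Map<String, TreeMap<Integer, Integer>> index,
                              String material, int value) {
        TreeMap<Integer, Integer> ranges = index.get(material);
        if (ranges == null) {
            return false;
        }
        // largest range start that is <= value; its end tells us whether value fits
        Map.Entry<Integer, Integer> candidate = ranges.floorEntry(value);
        return candidate != null && value <= candidate.getValue();
    }
}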
Try to trade memory consumption for disk consumption.
I mean: divide each file into parts of loadable size and compare those parts. This may take some extra time, but it will keep you safe from the memory limits.
What about using source control like Mercurial? I don't know, maybe it isn't exactly what you want, but this is a tool that is designed to track changes between revisions. You can create a repository, commit the first file, then overwrite it with the other one and commit the second one:
hg init some_repo
cd some_repo
cp ~/huge_file1.txt .
hg ci -Am "Committing first huge file."
cp ~/huge_file2.txt huge_file1.txt
hg ci -m "Committing second huge file."
From here you can get a diff, telling you what lines differ. If you could somehow use that diff to determine what lines were the same, you would be all set.
That's just an idea, someone correct me if I'm wrong.
I would try the following: for each file that you are comparing, create temporary files (I refer to them as partial files later) on disk representing each alphabetic letter, plus an additional file for all other characters. Then read the whole file line by line and, while doing so, insert each line into the partial file that corresponds to the letter it starts with. Since you have done that for both files, you can now limit the comparison to loading two smaller files at a time: a line starting with A, for example, can appear only in one partial file, and no partial file needs to be compared more than once. If the resulting files are still very large, you can apply the same methodology to the partial files being compared, this time creating files according to their second letter. The trade-off here is the temporary use of a large amount of disk space until the process is finished. In this process, the approaches mentioned in other posts here can help in dealing with the partial files more efficiently.
I am trying to retrieve the data from the table and convert each row into CSV format like
s12, james, 24, 1232, Salaried
The code below does the job, but it takes a long time with tables exceeding 100,000 rows.
Please advise on optimization techniques:
while (rset1.next() != false) {
    sr = sr + "\n";
    for (int j = 1; j <= rsMetaData.getColumnCount(); j++) {
        if (j < 5) {
            sr = sr + rset1.getString(j).toString() + ",";
        } else {
            sr = sr + rset1.getString(j).toString();
        }
    }
}
/SR
Two approaches, in order of preference:
Stream the output
PrintWriter csvOut = ... // Construct a write from an outputstream, say to a file
while (rs.next())
csvOut.println(...) // Write a single line
(note that you should ensure that your Writer / OutputStream is buffered, although many are by default)
Use a StringBuilder
StringBuilder sb = new StringBuilder();
while (rs.next())
sb.append(...) // Write a single line
The idea here is that appending Strings in a loop is a bad idea. Imagine that you have a string. In Java, Strings are immutable. That means that to append to a string, you have to copy the entire string and then write more to the end. Since you are appending things a little bit at a time, you end up making many copies of the string that aren't really useful.
If you're writing to a File, it's most efficient just to write directly out with a stream or a Writer. Otherwise you can use the StringBuilder which is tuned to be much more efficient for appending many small strings together.
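Putting the streaming option together with the original loop, a sketch could look like this (the method signature and output file handling are illustrative; CSV quoting/escaping of field values is ignored for brevity):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

class CsvExport {
    static void export(ResultSet rs, String outFile) throws SQLException, IOException {
        ResultSetMetaData meta = rs.getMetaData();
        int columnCount = meta.getColumnCount();
        try (PrintWriter csvOut = new PrintWriter(new BufferedWriter(new FileWriter(outFile)))) {
            while (rs.next()) {
                StringBuilder line = new StringBuilder();
                for (int j = 1; j <= columnCount; j++) {
                    if (j > 1) {
                        line.append(',');
                    }
                    line.append(rs.getString(j)); // note: a NULL column prints as "null"
                }
                csvOut.println(line); // one row per line, streamed straight to disk
            }
        }
    }
}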
I'm no Java expert, but I think it's always bad practice to use something like getColumnCount() in a conditional check. This is because the method is called again on every iteration to see what the column count is, instead of just referencing a stored number. Instead, set a variable equal to that number and compare j against the variable.
You might want to use a StringBuilder to build the string, that's much more efficient when you're doing a lot of concatenation. Also if you have that much data, you might want to consider writing it directly to wherever you're going to put it instead of building it in memory at first, if that's a file or a socket, for example.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
sr.append('\n');
for (int j = 1; j <= columnCount; j++) {
sr.append(rset1.getString(j));
if (j < 5) {
sr.append(',');
}
}
}
As a completely different, but undoubtedly the most optimal, alternative, use the DB-provided export facilities. It's unclear which DB you're using, but as per your question history you seem to be doing a lot with Oracle. In this case, you can export a table into a CSV file using UTL_FILE.
See also:
Generating CSV files using Oracle
Stored procedure example on Ask Tom
As the other answers say, stop appending to a String. In Java, String objects are immutable, so each append must do a full copy of the string, turning this into an O(n^2) operation.
The other big slowdown is the fetch size. By default, the driver is likely to fetch one row at a time. Even if this takes 1 ms, that limits you to a thousand rows per second. A remote database, even on the same network, will be much worse. Try calling setFetchSize(1000) on the Statement. Beware that setting the fetch size too big can cause out-of-memory errors with some database drivers.
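For example (a sketch; connection is assumed to be an existing java.sql.Connection, and 1000 is just a starting point to tune):
Statement stmt = connection.createStatement();
stmt.setFetchSize(1000); // ask the driver to fetch rows in batches of 1000
ResultSet rset1 = stmt.executeQuery("SELECT ... FROM ...");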
I don't believe minor code changes are going to make a substantive difference. I'd surely use a StringBuffer however.
He's going to be reading a million rows over a wire, assuming his database is on a separate machine. First, if performance is unacceptable, I'd run that code on the database server and clip the network out of the equation. If it's the sort of code that gets run once a week as a batch job that may be ok.
Now, what are you going to do with the StringBuffer or String once it is fully loaded from the database? We're looking at a String that could be 50 Mbyte long.
This should be an iota faster, since it removes the unneeded (j < 5) check.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
    for (int j = 1; j < columnCount; j++) {
        sr.append(rset1.getString(j)).append(",");
    }
    // I suspect the 'if (j < 5)' really meant, "if we aren't on the last
    // column then tack on a comma." So we always tack it on above and
    // write the last column and a newline now.
    sr.append(rset1.getString(columnCount)).append("\n");
}
Another answer is to change the select so it returns a comma-separated string. Then we read the single-column result and append it to the StringBuffer.
I forget the syntax now, but something like:
select column1 || ',' || column2 || ',' ... from table;
Now we don't need the loop-and-comma concatenation business.
StringBuilder sr = new StringBuilder();
while (rset1.next()) {
    sr.append(rset1.getString(1)).append("\n");
}
}