I am trying to retrieve the data from a table and convert each row into CSV format, like
s12, james, 24, 1232, Salaried
The code below does the job, but takes a long time with tables exceeding 100,000 rows.
Please advise on optimization techniques:
while (rset1.next()) {
    sr = sr + "\n";
    for (int j = 1; j <= rsMetaData.getColumnCount(); j++) {
        if (j < 5) {
            sr = sr + rset1.getString(j) + ",";
        } else {
            sr = sr + rset1.getString(j);
        }
    }
}
Two approaches, in order of preference:
Stream the output
PrintWriter csvOut = ... // Construct a writer from an OutputStream, say to a file
while (rs.next())
csvOut.println(...) // Write a single line
(note that you should ensure that your Writer / OutputStream is buffered, although many are by default)
Use a StringBuilder
StringBuilder sb = new StringBuilder();
while (rs.next())
sb.append(...) // Write a single line
The idea here is that appending Strings in a loop is a bad idea. Imagine that you have a string. In Java, Strings are immutable. That means that to append to a string you have to copy the entire string and then write more to the end. Since you are appending things a little bit at a time, you will have many many copies of the string which aren't really useful.
If you're writing to a File, it's most efficient just to write directly out with a stream or a Writer. Otherwise you can use the StringBuilder which is tuned to be much more efficient for appending many small strings together.
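Putting the streaming approach together for the original loop, a minimal sketch might look like the following; the output file name and the use of FileWriter/BufferedWriter are assumptions of mine, while rset1 and rsMetaData come from the question:
PrintWriter csvOut = new PrintWriter(new BufferedWriter(
        new FileWriter("export.csv")));             // hypothetical output file
int columnCount = rsMetaData.getColumnCount();      // hoisted out of the loop
StringBuilder line = new StringBuilder();
while (rset1.next()) {
    line.setLength(0);                              // reuse one builder per row
    for (int j = 1; j <= columnCount; j++) {
        if (j > 1) {
            line.append(',');
        }
        line.append(rset1.getString(j));
    }
    csvOut.println(line);                           // each row goes straight to the (buffered) file
}
csvOut.close();
Nothing accumulates in memory here; each row is written as soon as it is read.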
I'm no Java expert, but I think it's bad practice to use something like getColumnCount() in a loop condition. This is because on each iteration it calls that method again to see what the column count is, instead of just referencing a fixed number. Instead, set a variable equal to that number and compare j against the variable.
You might want to use a StringBuilder to build the string, that's much more efficient when you're doing a lot of concatenation. Also if you have that much data, you might want to consider writing it directly to wherever you're going to put it instead of building it in memory at first, if that's a file or a socket, for example.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
    sr.append('\n');
    for (int j = 1; j <= columnCount; j++) {
        sr.append(rset1.getString(j));
        if (j < 5) {
            sr.append(',');
        }
    }
}
As a completely different, but undoubtedly the most optimal alternative, use the DB-provided export facilities. It's unclear which DB you're using, but as per your question history you seem to be doing a lot with Oracle. In that case, you can export a table into a CSV file using UTL_FILE.
See also:
Generating CSV files using Oracle
Stored procedure example on Ask Tom
As the other answers say, stop appending to a String. In Java, String objects are immutable, so each append must do a full copy of the string, turning this into an O(n^2) operation.
The other big slowdown is fetch size. By default, the driver is likely to fetch one row at a time. Even if this takes 1ms, that limits you to a thousand rows per second. A remote database, even on the same network, will be much worse. Try calling setFetchSize(1000) on the Statement. Beware that setting the fetch size too big can cause out of memory errors with some database drivers.
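For example, a minimal sketch of where the fetch size would be set, assuming a plain java.sql.Statement (the connection variable and the query text are illustrative, not from the question):
Statement stmt = connection.createStatement();
stmt.setFetchSize(1000);                 // hint: fetch 1000 rows per round trip to the database
ResultSet rset1 = stmt.executeQuery("SELECT * FROM employee");   // hypothetical query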
I don't believe minor code changes are going to make a substantive difference. I'd surely use a StringBuffer however.
He's going to be reading a million rows over a wire, assuming his database is on a separate machine. First, if performance is unacceptable, I'd run that code on the database server and clip the network out of the equation. If it's the sort of code that gets run once a week as a batch job that may be ok.
Now, what are you going to do with the StringBuffer or String once it is fully loaded from the database? We're looking at a String that could be 50 Mbyte long.
This should be an iota faster since it removes the unneeded (j < 5) check.
StringBuilder sr = new StringBuilder();
int columnCount = rsMetaData.getColumnCount();
while (rset1.next()) {
    for (int j = 1; j < columnCount; j++) {
        sr.append(rset1.getString(j)).append(",");
    }
    // I suspect the 'if (j < 5)' really meant, "if we aren't on the last
    // column then tack on a comma." So we always tack it on above and
    // write the last column and a newline now.
    sr.append(rset1.getString(columnCount)).append("\n");
}
Another option is to change the select so that it returns a comma-separated string. Then we read the single-column result and append it to the StringBuilder.
I forget the syntax now, but something like:
select column1 || ',' || column2 || ',' ... from table;
Now we don't need the loop and the comma-concatenation business.
StringBuilder sr = new StringBuilder();
while (rset1.next()) {
    sr.append(rset1.getString(1)).append("\n");
}
Related
I'm trying to create a JSON-like format to load components from files and while writing the parser I've run into an interesting performance question.
The parser reads the file character by character, so I have a LinkedList as a buffer. After reaching the end of a key (:) or a value (,), the buffer has to be emptied and a string constructed from it.
My question is what is the most efficient way to do this.
My two best bets would be:
for (int i = 0, n = buff.size(); i < n; i++)
    value += buff.removeFirst().toString();
and
value = new String((char[]) buff.toArray(new char[buff.size()]));
Instead of guessing this you should write a benchmark. Take a look at How do I write a correct micro-benchmark in Java to understand how to write a benchmark with JMH.
Your for loop would be inefficient as you are concatenating one-letter Strings using the + operator. This leads to creating and immediately throwing away intermediate String objects. You should use a StringBuilder if you plan to concatenate in a loop.
The second option will not compile as written: toArray only accepts object arrays, so it cannot produce a primitive char[] from a LinkedList<Character>. If you do go through toArray, pass a zero-length array (new Character[0]) as per the Arrays of Wisdom of the Ancients article, which dives into the internal details of the JVM.
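As a minimal compilable sketch of both options, assuming the buffer is a LinkedList<Character> named buff (everything else here is illustrative):
// Option 1, fixed: drain the buffer into a StringBuilder instead of concatenating with +.
StringBuilder sb = new StringBuilder(buff.size());
while (!buff.isEmpty()) {
    sb.append(buff.removeFirst());
}
value = sb.toString();

// Option 2, fixed: go through a Character[] (passing a zero-length array to toArray).
Character[] chars = buff.toArray(new Character[0]);
StringBuilder sb2 = new StringBuilder(chars.length);
for (char c : chars) {
    sb2.append(c);
}
value = sb2.toString();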
I have a very simple piece of code which iterates over a list of database objects and appends certain properties to a StringBuilder. The results sometimes go over 100K entries, so the append operations run more than 100K times.
My problem is that there is no way I can reduce the number of iterations, as I need the data, but the StringBuilder keeps taking up heap space and throws an OutOfMemoryError.
Has anyone encountered such a situation, and is there a solution to this problem or an alternative to StringBuilder?
It is quite possible that what I am doing is simply wrong, so even though the code is quite simple, I will post it.
StringBuilder endResult = new StringBuilder();
if (dbObjects != null && !dbObjects.isEmpty()) {
    for (DBObject dBO : dbObjects) {
        endResult.append("Starting to write" + dBO.getOutPut() + "content");
        endResult.append(dBO.getResults());
    }
    endResult.append("END");
}
Like I said, it's quite possible that I will have 100,000 results from the DB.
You shouldn't do something like this when using StringBuilder:
endResult.append("Starting to write" + dBO.getOutPut() + "content");
The above statement will do string concatenation. Use the append() method like:
endResult.append("Starting to write").append(dBO.getOutPut()).append("content");
My program reads a text file line by line in a while loop. It then processes each line and extracts some information to be written in the output. Everything it does inside the while loop is O(1) except two ArrayList indexOf() method calls which I suppose are O(N). The program runs at a reasonable pace (1M lines per 100 seconds) in the beginning but over time it slows down dramatically. I have 70 M lines in the input file so the loop iterates 70 million times. In theory this should take about 2 hours but in practice it takes 13 hours. Where is the problem?
Here is the code snippet:
BufferedReader corpus = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("MyCorpus.txt"), "UTF-8"));
Writer outputFile = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("output.txt"), "UTF-8"));
List<String> words = new ArrayList<>();
// words is being updated with relevant values here
LinkedHashMap<String, Integer> DIC = new LinkedHashMap<>();
// DIC is being updated with relevant key-value pairs here
int notFound = 0;
String line = "";
while ((line = corpus.readLine()) != null) {
    String[] parts = line.split(" ");
    if (DIC.containsKey(parts[0]) && DIC.containsKey(parts[1])) {
        int firstIndexPlusOne = words.indexOf(parts[0]) + 1;
        int secondIndexPlusOne = words.indexOf(parts[1]) + 1;
        outputFile.write(firstIndexPlusOne + " " + secondIndexPlusOne + " " + parts[2] + "\n");
    } else {
        notFound++;
        outputFile.write("NULL\n");
    }
}
outputFile.close();
I am assuming you add words to your words ArrayList as you go.
You correctly state that words.indexOf is O(N) and that is the cause of your issue. As N increases (you add words to the list) these operations take longer and longer.
To avoid this, keep your list sorted and use binarySearch.
To keep it sorted, use binarySearch on each word to work out where to insert it. This takes the lookup from O(N) to O(log N).
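A minimal sketch of that idea, assuming words is an ArrayList<String> kept in sorted order (the helper name is illustrative, not from the question):
// Hypothetical helper: look up a word in the sorted list, inserting it if absent.
static int indexOfSorted(List<String> words, String word) {
    int pos = Collections.binarySearch(words, word);   // O(log N) lookup
    if (pos < 0) {
        pos = -pos - 1;              // binarySearch returns -(insertionPoint) - 1 when absent
        words.add(pos, word);        // inserting here keeps the list sorted
    }
    return pos;
}
Note that inserting into the middle of the list shifts the positions of the words that sort after it, so a word's numeric index is no longer stable as new words arrive; if stable indexes are needed, the HashMap approach suggested further down avoids that.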
I think words is meant to collect unique words, hence use a Set.
Set<String> words = new HashSet<>();
Map<String, Integer> DIC = new HashMap<>();
Also, DIC seems to be something like a frequency table, in which case dic.keySet() would be the same as words. A LinkedHashMap maintains an extra list to keep the entries in insertion order.
Writing the pieces separately, instead of first concatenating them into one new string, is faster. Note that Writer.write(int) writes a single character, so the indices have to be converted to strings explicitly:
outputFile.write(String.valueOf(firstIndexPlusOne));
outputFile.write(" ");
outputFile.write(String.valueOf(secondIndexPlusOne));
outputFile.write(" ");
outputFile.write(parts[2]);
outputFile.write("\n");
I think one of your problems is that line:
outputFile.write(firstIndexPlusOne +" "+secondIndexPlusOne+" "+parts[2]+"\n");
Since strings are immutable, you are cluttering the memory. Also, maybe try flushing the write buffer on every turn of the loop; it might improve things a bit (my hypothesis here).
Try something like:
String line = "";
StringBuilder sb = new StringBuilder();
while ...
...
sb.append(firstIndexPlusOne);
sb.append(" ");
sb.append(secondIndexPlusOne);
sb.append(" ");
sb.append(parts[2]);
sb.append("\n");
outputFile.write(sb.toString());
sb.setLength(0);
outputFile.flush();
Also, maybe a good read: Tuning Java I/O Performance (Oracle)
If the corpus and the word list are both sorted, the linear search performed by the words.indexOf(..) call becomes slower with each iteration.
Building a HashMap from word to index before processing the corpus would even things out. It might be a good idea to do so as an optimization, even if that is not the root problem.
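A minimal sketch of that idea, assuming words and parts as declared in the question (the wordIndex name is illustrative):
// Built once, before the main loop over the corpus.
Map<String, Integer> wordIndex = new HashMap<>();
for (int i = 0; i < words.size(); i++) {
    wordIndex.put(words.get(i), i);
}
// Inside the loop, two O(1) lookups replace the O(N) words.indexOf(...) calls:
Integer first = wordIndex.get(parts[0]);
Integer second = wordIndex.get(parts[1]);
if (first != null && second != null) {
    outputFile.write((first + 1) + " " + (second + 1) + " " + parts[2] + "\n");
}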
Assuming that you update neither words nor DIC in your loop, obviously the most runtime is consumed when DIC.containsKey(parts[0]) && DIC.containsKey(parts[1]) evaluates to true.
If your question is "why is it slowing down" rather than "how can I speed it up", I'd suggest that you take the first 10M lines of your file, copy them into another file, and duplicate them so you end up with 70M lines consisting of copies of your first 10M lines. Then execute your code. If it slows down even though the same content is examined again and again, you may check the other answers regarding string builders and such.
If you don't experience the slowing down, then obviously it depends on the actual content of your 70M-line file. Probably, for the remaining 60M lines of your original file, DIC.containsKey(parts[0]) && DIC.containsKey(parts[1]) evaluates to true more often, and therefore the inner branch is executed more often, taking more time.
In the latter case, I doubt that you can offset the I/O load by tweaking individual writes enough to gain performance, but of course I may be very wrong there. You'd have to try. But first, I'd recommend exploring the source of the problem, which I think lies in the structure of the file's content. Once you understand how your code performs with respect to the given input, you can try to optimize (although I would try to keep the whole string in memory and write its contents in one operation after the loop, instead of performing very many small write operations).
Assume I have a StringBuffer with values "1 \n 2 \n 3 \n...etc" where \n is a line break.
How would I add these values to an existing CSV file as a column using Java? Specifically, this would be the last column.
For example, let's say I have a CSV file that looks like this:
5, 2, 5
2, 3, 1
3, 5, 2
..
etc.
The output should look like this given the StringBuffer after using the method to add the column to the csv file:
5, 2, 5, 1
2, 3, 1, 2
3, 5, 2, 3
..
etc.
I also plan to add columns with 1000s of values so I am looking for something that does not have high memory consumption.
Thanks ahead of time.
Edit: Columns may be different sizes. I see people saying to add the new values at the end of each line. The problem is that this would add the values to the wrong columns, and I cannot have that happen. I thank you all for your suggestions though, as they were very good.
Edit 2: I have received critique about my use of StringBuffer, and yes, I agree: if this problem were isolated, I would also suggest StringBuilder. The context of this problem is a program that has synchronized threads (acting as scenarios) collecting response times for a range of concurrent thread counts. The concurrent threads execute concurrent queries against a database, and once a query has been executed, the result is appended to a StringBuffer. All the response times for each synchronized thread are appended to a StringBuffer and written to a CSV document. There can be several threads with the same response time. I could use StringBuilder, but then I would have to manually synchronize the threads appending the response times, and in my case I do not think it would make much of a difference in performance and would add an unnecessary amount of code. I hope this helps, and I once again thank you all for your concerns and suggestions. If after reading this you are still not convinced that I should use StringBuffer, then I ask that we please take this discussion offline.
Edit 3: I have figured out how to work around the issue of adding the columns when the rows are different sizes. I simply add commas for every missing column (also note that my rows grow with each added column). It looks like @BorisTheSpider's conceptual solution actually works with this modification. The problem is that I am not sure how to add the text at the end of each line. My code so far (I removed code to conserve space):
//Before this code there is a statement to create a test.csv file (this file has no values before this loop occurs).
for (int p = 0; p < (max + 1); p = p + inc) {
    threadThis2(p);
    // threadThis2 appends to the StringBuffer with several comma delimited values.
    // p represents the number of threads/queries to execute at the same time.
    comma = p / inc; // how many commas to put if there is nothing on the line.
    for (int i = 0; i < comma; i++) {
        commas.append(",");
    }
    br = new BufferedReader(new FileReader("test.csv"));
    List<String> avg = Arrays.asList(sb.toString().split(", "));
    for (int i = 0; i < avg.size(); i++) {
        if (br.readLine() == null) {
            w.write(commas.toString() + avg.get(i).toString() + ", \n");
        } else {
            w.write(avg.get(i).toString() + ", \n");
        }
    }
    br.close();
    sb.setLength(0);
    commas.setLength(0);
}
Please note this code is in its early stages (I will of course declare all the variables outside the for loop later on). So far this code works. The problem is that the columns are not side by side, which is what I want them to be. I understand I may be required to create temporary files, but I need to approach this problem very carefully as I might need a lot of columns in the future.
Apparently there are two basic requirements:
Append a column to an existing CSV file
Allow concurrent operation
To achieve Requirement #1, the original file has to be read and rewritten as a new file, including the new column, irrespective of its location (i.e., in a StringBuffer or elsewhere).
The best (and only generic) way of reading a CSV file would be via a mature and field-proven library, such as OpenCSV, which is lightweight and commercially-friendly, given its Apache 2.0 license. Otherwise, one has to either do many simplifications (e.g., always assume single-line CSV records), or re-invent the wheel by implementing a new CSV parser.
In either case, a simple algorithm is needed, e.g.:
Initialize a CSV reader or parser object from the library used (or from whatever custom solution is used), supplying the existing CSV file and the necessary parameters (e.g., field separator).
Read the input file record-by-record, via the reader or parser, as a String[] or List<String> structure.
Manipulate the structure returned for every record to add or delete any extra fields (columns), in memory.
Add blank fields (i.e., just extra separators, 1 per field), if desired or needed.
Use a CSV writer from the library (or manually implement a writer) to write the new record to the output file.
Append a newline character at the end of each record written to the output file.
Repeat for all the records in the original CSV file.
This approach is also scalable, as it does not require any significant in-memory processing.
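A minimal sketch of the algorithm above, assuming OpenCSV's CSVReader/CSVWriter classes with their readNext()/writeNext() methods (the file names, the newColumn list, and the throws clause are illustrative, not part of the answer):
// Appends one value from newColumn to every record of in.csv and writes the result to out.csv.
// Records beyond the size of newColumn receive a blank field instead.
static void appendColumn(List<String> newColumn) throws Exception {
    try (CSVReader reader = new CSVReader(new FileReader("in.csv"));
         CSVWriter writer = new CSVWriter(new FileWriter("out.csv"))) {
        String[] record;
        int row = 0;
        while ((record = reader.readNext()) != null) {
            String extra = row < newColumn.size() ? newColumn.get(row) : "";
            String[] widened = Arrays.copyOf(record, record.length + 1);
            widened[record.length] = extra;
            writer.writeNext(widened);   // the writer emits the separators and the record terminator
            row++;
        }
    }
}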
For Requirement #2, there are many ways of supporting concurrency and in this scenario it is more efficient to do it in a tailored manner (i.e., "manually" in the application), as opposed to relying on a thread-safe data structure like StringBuffer.
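One possible shape of that tailored approach, offered as a sketch and an assumption rather than something prescribed above: give each scenario thread its own unsynchronized StringBuilder and merge the results after the threads finish, so no shared StringBuffer is contended during the measurements:
static String collectResponseTimes(int threadCount) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);
    List<Future<String>> partials = new ArrayList<>();
    for (int t = 0; t < threadCount; t++) {
        partials.add(pool.submit(() -> {
            StringBuilder local = new StringBuilder();
            // ... run this scenario's queries and append each response time to 'local' ...
            return local.toString();
        }));
    }
    StringBuilder merged = new StringBuilder();
    for (Future<String> f : partials) {
        merged.append(f.get());   // get() waits for the task; no shared buffer, no locking
    }
    pool.shutdown();
    return merged.toString();
}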
I am decoding a byte file made by Huffman encoding. I turn the bytes into a string and then search for the values I have been given by the Huffman tree. I have a hashtable with the encoded value and the byte value of the original file. Here is my code.
for (int i = 0, j = 1; j <= encodedString.length(); j++) {
    if (huffEncodeTable.get(encodedString.substring(i, j)) != null) {
        decodedString.append(huffEncodeTable.get(encodedString.substring(i, j)));
        i = j;
    }
}
It's pretty simple: it's a loop that iterates over the whole string. The problem comes when the string is too large; with compressed files larger than 100 KB, it takes a really long time to process them. So I want to know if there is a way to make this process faster, or if it would be better to store my encoded values in another structure instead of the hashtable.
huffEncodeTable -> hashtable
encodedString -> String with the huffman values
decodedString -> The String that will represent the original bytes of the original file
A couple of suggestions:
Every time you append to a String, a new String is created. You should use StringBuilder instead. This is probably the main problem, as I see it.
Also, I'd use hashtable.containsKey instead of get to check for a key's existence. I doubt it impacts your performance much though.
You also might save a bit of time if you store the results of the call to substring, and so only call it once.
So, something like.
StringBuilder sb = new StringBuilder();
String currentString;
for (int i = 0, j = 1; j <= encodedString.length(); j++) {
    currentString = encodedString.substring(i, j);
    if (huffEncodeTable.containsKey(currentString)) {
        sb.append(huffEncodeTable.get(currentString));
        i = j;
    }
}
return sb.toString(); // Or whatever you do with it.
Using substring for different lengths of strings would really slow things down. In Java 7 it takes a copy of the original string, creating two objects. You are much better off creating one substring and doing a search against a NavigableMap.
Using a NavigableMap will allow you to find the longest matching string in one operation and reduce the number of strings you need to store in the map.
Note: even so, the size of the Map will be O(N^2), where N is the maximum string length you can look back, so you have to place a sensible limit on the size of N.
Note 2: You will be lucky to get within a tenth of the speed of the built-in Huffman code (which is written for you, is standard, and works), so if performance matters, use that.