Using StringBuilder to process csv files to save heap space

I am reading a csv file that has about 50,000 lines and is 1.1 MiB in size (and can grow larger).
In Code1, I use String to process the csv, while in Code2 I use StringBuilder (only one thread executes the code, so there are no concurrency issues).
Using StringBuilder makes the code a little bit harder to read than using the normal String class.
Am I prematurely optimizing things with StringBuilder in Code2 to save a bit of heap space and memory?
Code1
fr = new FileReader(file);
BufferedReader reader = new BufferedReader(fr);
String line = reader.readLine();
while ( line != null )
{
int separator = line.indexOf(',');
String symbol = line.substring(0, separator);
int begin = separator;
separator = line.indexOf(',', begin+1);
String price = line.substring(begin+1, separator);
// Publish this update
publisher.publishQuote(symbol, price);
// Read the next line of fake update data
line = reader.readLine();
}
Code2
fr = new FileReader(file);
BufferedReader reader = new BufferedReader(fr);
StringBuilder stringBuilder = new StringBuilder(reader.readLine());
while( stringBuilder.toString() != null ) {
int separator = stringBuilder.toString().indexOf(',');
String symbol = stringBuilder.toString().substring(0, separator);
int begin = separator;
separator = stringBuilder.toString().indexOf(',', begin+1);
String price = stringBuilder.toString().substring(begin+1, separator);
publisher.publishQuote(symbol, price);
stringBuilder.replace(0, stringBuilder.length(), reader.readLine());
}
Edit
I eliminated the toString() calls, so fewer String objects will be produced.
Code3
while( stringBuilder.length() > 0 ) {
int separator = stringBuilder.indexOf(",");
String symbol = stringBuilder.substring(0, separator);
int begin = separator;
separator = stringBuilder.indexOf(",", begin+1);
String price = stringBuilder.substring(begin+1, separator);
publisher.publishQuote(symbol, price);
Thread.sleep(10);
stringBuilder.replace(0, stringBuilder.length(), reader.readLine());
}
Also, the original code was downloaded from http://www.devx.com/Java/Article/35246/0/page/1

Will the optimized code increase performance of the app? - my question
The second code sample will not save you any memory nor any computation time. I am afraid you might have misunderstood the purpose of StringBuilder, which is really meant for building strings - not reading them.
Within the loop of your second code sample, every single line contains the expression stringBuilder.toString(), essentially turning the buffered string into a String object over and over again. Your actual string operations are done against these objects. Not only is the first code sample easier to read, it is almost certainly the more performant of the two.
Am I prematurely optimizing things with StringBuilder? - your question
Unless you have profiled your application and have come to the conclusion that these very lines cause a notable slowdown, yes. Unless you are really sure that something will be slow (e.g. if you recognize high computational complexity), you definitely want to do some profiling before you start making optimizations that hurt the readability of your code.
What kind of optimizations could be done to this code? - my question
If you have profiled the application and decided this is the right place for an optimization, you should consider looking into the features offered by the Scanner class. This might give you both better performance (profiling will tell you if this is true) and simpler code.

Am I prematurely optimizing things with StringBuilder in Code2 to save a bit of heap space and memory?
Most probably: yes. But, only one way to find out: profile your code.
Also, I'd use a proper CSV parser instead of what you're doing now: http://ostermiller.org/utils/CSV.html

Code2 is actually less efficient than Code1 because every time you call stringBuilder.toString() you're creating a new java.lang.String instance (in addition to the existing StringBuilder object). This is less efficient in terms of space and time due to the object creation overhead.
Assigning the contents of readLine() directly to a String and then splitting that String will typically be performant enough. You could also consider using the Scanner class.
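As a rough illustration of the plain-String approach described above, here is a minimal sketch. It assumes a "symbol,price,..." layout like the one in the question; the class name QuoteParser, the method names and the sample data are made up for the example, and a StringReader stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class QuoteParser {
    // Reads "symbol,price,..." lines and returns one "symbol=price" entry per line.
    static List<String> parse(BufferedReader reader) throws IOException {
        List<String> quotes = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            // A limit of 3 leaves any remaining columns unsplit in the last element.
            String[] fields = line.split(",", 3);
            if (fields.length < 2) {
                continue; // skip lines without at least two columns
            }
            quotes.add(fields[0] + "=" + fields[1]);
        }
        return quotes;
    }

    // Convenience wrapper for in-memory text (hypothetical helper for the example).
    static List<String> parseText(String csv) {
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseText("IBM,101.5,x\nSUNW,5.25,y\n"));
    }
}
```

Each line becomes one short-lived String plus a small array, which is usually cheap enough. This sketch does not handle quoted fields or embedded commas; a real CSV parser does.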
Memory Saving Tip
If you encounter multiple repeating tokens in your input consider using String.intern() to ensure that each identical token references the same String object; e.g.
String[] tokens = parseTokens(line);
for (String token : tokens) {
// Construct business object referencing interned version of token.
BusinessObject bo = new BusinessObject(token.intern());
// Add business object to collection, etc.
}

StringBuilder is usually used like this:
StringBuilder sb = new StringBuilder();
sb.append("You").append(" can chain ")
.append(" your ").append(" strings ")
.append("for better readability.");
String myString = sb.toString(); // only call once when you are done
System.out.println(sb); // also calls sb.toString().. print myString instead

StringBuilder has several advantages:
StringBuffer's operations are synchronized but StringBuilder's are not, so using StringBuilder will improve performance in single-threaded scenarios.
Once the buffer has been expanded, it can be reused by invoking setLength(0) on the object. Interestingly, if you step through with a debugger and examine the contents of the StringBuilder, you will see that the old characters still exist even after invoking setLength(0). The StringBuilder simply resets its internal length counter; the next time you append characters, the counter advances again and the old characters are overwritten.
If you are not really sure about the length of the string, it is better to use StringBuilder, because once the buffer has been expanded you can reuse the same buffer for strings of smaller or equal size.
StringBuffer and StringBuilder are almost the same in all operations, except that StringBuffer is synchronized and StringBuilder is not.
If you don't have multithreading, it is better to use StringBuilder.
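The buffer reuse described above could look roughly like this. It is only a sketch: the class BuilderReuse, the method labelAll and the initial capacity of 32 are invented for the example.

```java
class BuilderReuse {
    // Builds one label per name, reusing a single StringBuilder for every iteration.
    static String[] labelAll(String[] names) {
        StringBuilder sb = new StringBuilder(32); // initial capacity is an arbitrary guess
        String[] out = new String[names.length];
        for (int i = 0; i < names.length; i++) {
            sb.setLength(0); // drop the previous contents but keep the internal buffer
            sb.append('#').append(i).append(' ').append(names[i]);
            out[i] = sb.toString(); // copies the current characters into a new String
        }
        return out;
    }

    public static void main(String[] args) {
        for (String label : labelAll(new String[] {"alpha", "beta"})) {
            System.out.println(label);
        }
    }
}
```

Note that toString() still creates a new String each time; what is saved is the repeated allocation and growth of the builder's internal array.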

Related

How can I efficiently use StringBuilder?

In the past, I've always used printf to format printing to the console but the assignment I currently have (creating an invoice report) wants us to use StringBuilder, but I have no idea how to do so without simply using " " for every gap needed. For example... I'm supposed to print this out
Invoice  Customer  Salesperson       Subtotal   Fees       Taxes      Discount   Total
INV001   Company   Eccleston, Chris  $ 2357.60  $   40.00  $  190.19  $ -282.91  $ 2304.88
But I don't know how to get everything to line up using the StringBuilder. Any advice?

StringBuilder aims to reduce the overhead associated with creating strings.
As you may or may not know, strings are immutable. What this means is that something like
String a = "foo";
String b = "bar";
String c = a + b;
String d = c + c;
creates a new string for each line. If all we are concerned about is the final string d, the line with string c is wasting space because it creates a new String object when we don't need it.
StringBuilder simply delays actually building the String object until you call .toString(). At that point, it converts an internal char[] to an actual string.
Let's take another example.
String foo() {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 100; i++)
sb.append(i);
return sb.toString();
}
Here, we only create one string. StringBuilder will keep track of the chars you have added to your string in its internal char[] value. Note that value.length will generally be larger than the total chars you have added to your StringBuilder, but value might run out of room for what you're appending if the string you are building gets too big. When that happens, it'll resize, which just means replacing value with a larger char[], and copying over the old values to the new array, along with the chars of whatever you appended.
Finally, when you call sb.toString(), the StringBuilder will call a String constructor that takes an argument of a char[].
That means only one String object was created, and we only needed enough memory for our char[] and to resize it.
Compare with the following:
String foo() {
String toReturn = "";
for (int i = 0; i < 100; i++)
toReturn += "" + i;
return toReturn;
}
Here, we have 101 string objects created (maybe more, I'm unsure). We only needed one though! This means that at every iteration, we discard the string that toReturn previously referred to and create another one.
With a large string, especially, this is very expensive, because at every iteration you first need to acquire as much memory as the new string needs and then discard as much memory as the old string had. It's not a big deal when things are kept short, but when you're working with entire files this can easily become a problem.
In a nutshell: if you're appending or removing information before finalizing an output, use a StringBuilder. If your strings are very short, I think it is OK to just concatenate normally for convenience, but it is up to you to define what "short" is.
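For the alignment problem in the question, one possible approach is to keep the StringBuilder for accumulating the report and let String.format do the padding for each row. The sketch below is only an illustration: the class name InvoiceReport, the column widths and the reduced set of columns are assumptions, not part of the assignment.

```java
import java.util.Locale;

class InvoiceReport {
    // Three left-aligned text columns followed by two right-aligned amount columns.
    private static final String HEADER_FORMAT = "%-8s %-10s %-18s %11s %11s\n";
    private static final String ROW_FORMAT = "%-8s %-10s %-18s $%10.2f $%10.2f\n";

    static String buildReport() {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format(HEADER_FORMAT,
                "Invoice", "Customer", "Salesperson", "Subtotal", "Total"));
        // Locale.ROOT keeps '.' as the decimal separator regardless of the system locale.
        sb.append(String.format(Locale.ROOT, ROW_FORMAT,
                "INV001", "Company", "Eccleston, Chris", 2357.60, 2304.88));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildReport());
    }
}
```

A negative width flag (%-18s) pads on the right, a plain width (%11s, %10.2f) pads on the left, so every row comes out the same length as long as no value exceeds its column width.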

Java large String returned from findWithinHorizon converted to InputStream

I have written an application which, in one of its modules, parses a huge file and saves it chunk by chunk into a database.
First of all, the following code works; my main goals are to reduce memory usage and improve general performance.
The following code snippet is a small part of the big picture, but YourKit profiling shows it is the most problematic: the lines marked with /*HERE*/ allocate a huge amount of memory.
....
Scanner fileScanner = new Scanner(file,"UTF-8");
String scannedFarm;
try{
Pattern p = Pattern.compile("(?:^.++$(?:\\r?+\\n)?+){2,100000}+",Pattern.MULTILINE);
String [] tableName = null;
/*HERE*/while((scannedFarm = fileScanner.findWithinHorizon(p, 0)) != null){
boolean continuePrevStream = false;
Scanner scanner = new Scanner(scannedFarm);
String[] tmpTableName = scanner.nextLine().split(getSeparator());
if (tmpTableName.length==2){
tableName = tmpTableName;
}else{
if (tableName==null){
continue;
}
continuePrevStream = true;
}
scanner.close();
/*HERE*/ InputStream is = new ByteArrayInputStream(scannedFarm.getBytes("UTF-8"));
....
It is acceptable to allocate a huge amount of memory, since the String is large (I need it to be such a large chunk). My main problem is that the same allocation happens twice as a result of getBytes.
So my question is: is there a way to transfer the findWithinHorizon result directly to an InputStream without allocating memory twice?
Is there a more efficient way to achieve the same functionality?

Not exactly the same approach but instead of findWithinHorizon, you could try reading each line and searching for the pattern within the line context. This is sure to reduce memory pressure as you're not buffering the whole file as the API states:
If horizon is 0, then the horizon is ignored and this method continues
to search through the input looking for the specified pattern without
bound. In this case it may buffer all of the input searching for the
pattern.
Something like:
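while (fileScanner.hasNextLine()) {
    String line = fileScanner.nextLine();
    // linePattern is a hypothetical single-line Pattern, not the multi-line one above
    if (linePattern.matcher(line).find()) {
        // process the matching line
    }
}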

which code is more efficient?

Which of the following is an efficient way to reverse words in a string?
public String Reverse(StringTokenizer st){
    String[] words = new String[st.countTokens()];
    int i = 0;
    while(st.hasMoreTokens()){
        words[i] = st.nextToken(); i++;
    }
    String output = "";
    for(int j = words.length-1; j >= 0; j--)
        output += words[j]+" ";
    return output;
}
OR
public String Reverse(StringTokenizer st, String output){
if(!st.hasMoreTokens()) return output;
output = st.nextToken()+" "+output;
return Reverse(st, output);}
public String ReverseMain(StringTokenizer st){
return Reverse(st, "");}
While the first way seems more readable and straightforward, there are two loops in it. In the second method, I've tried doing it in a tail-recursive way, but I am not sure whether Java optimizes tail-recursive code.

You could do this in just one loop:
public String Reverse(StringTokenizer st){
    int length = st.countTokens();
    String[] words = new String[length];
    int i = length - 1;
    while(i >= 0){
        words[i] = st.nextToken(); i--;
    }
    return String.join(" ", words); // join the reversed array (Java 8+)
}

But I am not sure whether java does optimize tail-recursive code.
It doesn't. Or at least the Sun/Oracle Java implementations don't, up to and including Java 7.
References:
"Tail calls in the VM" by John Rose @ Oracle.
Bug 4726340 - RFE: Tail Call Optimization
I don't know whether this makes one solution faster than the other. (Test it yourself ... taking care to avoid the standard micro-benchmarking traps.)
However, the fact that Java doesn't implement tail-call optimization means that the 2nd solution is liable to run out of stack space if you give it a string with a large (enough) number of words.
Finally, if you are looking for a more space-efficient way to implement this, there is a clever way that uses just a StringBuilder.
Create a StringBuilder from your input String
Reverse the characters in the StringBuilder using reverse().
Step through the StringBuilder, identifying the start and end offset of each word. For each start/end offset pair, reverse the characters between the offsets. (You have to do this using a loop.)
Turn the StringBuilder back into a String.
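The steps above can be sketched as follows. The class and method names are invented for the example, and it assumes that words are separated by single spaces and that the text contains no surrogate pairs (swapping individual chars would break them).

```java
class ReverseWords {
    // Reverses the characters of sb between start (inclusive) and end (exclusive).
    static void reverseRange(StringBuilder sb, int start, int end) {
        for (int i = start, j = end - 1; i < j; i++, j--) {
            char tmp = sb.charAt(i);
            sb.setCharAt(i, sb.charAt(j));
            sb.setCharAt(j, tmp);
        }
    }

    static String reverseWords(String input) {
        StringBuilder sb = new StringBuilder(input);
        sb.reverse(); // step 2: the whole text is now backwards, word order included
        int start = 0;
        for (int i = 0; i <= sb.length(); i++) {
            // A space or the end of the text closes the current word.
            if (i == sb.length() || sb.charAt(i) == ' ') {
                reverseRange(sb, start, i); // step 3: restore the spelling of this word
                start = i + 1;
            }
        }
        return sb.toString(); // step 4
    }

    public static void main(String[] args) {
        System.out.println(reverseWords("one two three")); // prints "three two one"
    }
}
```

Apart from the final String, the only extra storage is the one StringBuilder, and there is no recursion, so the number of words is not limited by stack depth.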

You can test this by timing both of them on a large number of inputs.
E.g., reverse 100,000,000 strings and see how many seconds each takes. You can compare the system timestamps at the start and end of each run to get the difference between the two functions.

StringTokenizer is not deprecated but if you read the current JavaDoc...
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
String[] strArray = str.split(" ");
StringBuilder sb = new StringBuilder();
for (int i = strArray.length - 1; i >= 0; i--)
    sb.append(strArray[i]).append(" ");
String reversedWords = sb.substring(0, sb.length() - 1); // strip trailing space

Why use append() instead of + [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Why to use StringBuffer in Java instead of the string concatenation operator
What is the advantage or aim of doing this
int a = 42;
StringBuffer sb = new StringBuffer(40);
String s = sb.append("a = ").append(a).append("!").toString();
System.out.println(sb);
result > a = 42!
instead of
int a = 42;
String s = "a = " + a + "!";
System.out.println(s);

In your scenario, I'm not sure there is a difference, because all of your "+" operators are on one line (which only creates a String once). In general, though, Strings are immutable: concatenation never modifies a String but creates a new one and discards the old one, whereas a StringBuffer is modified in place.
So ultimately, you will have more efficient code if you use StringBuffers (and generally StringBuilders). If you google "String vs. StringBuffer vs. StringBuilder" you can find many articles detailing the statistics.

Efficiency. String concatenation in Java uses StringBuilders in the background anyway, so in some cases you can eke out a bit of efficiency by controlling that yourself.

Just run the code 10,000 times and measure the time. It should be obvious.
Some background information: String is immutable while StringBuilder is not. So every time you concatenate a String you have to copy an array.
PS: Sometimes the compiler optimizes things, though. If you make your variables static final, the concatenation may be folded into a single String at compile time, with no concatenation at run time.
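To make the static final remark concrete, here is a small sketch (the class and member names are invented). When every operand of + is a compile-time constant, the Java Language Specification treats the whole expression as a constant, so it ends up as the same interned String as the equivalent literal; a concatenation that involves a non-constant value is built at run time and yields a new object.

```java
class ConstantFolding {
    static final String PREFIX = "a = ";
    static final int VALUE = 42;
    // All operands are compile-time constants, so this is itself a constant expression.
    static final String FOLDED = PREFIX + VALUE + "!";

    static boolean foldedIsLiteral() {
        // Identity comparison on purpose: both sides refer to the same interned String.
        return FOLDED == "a = 42!";
    }

    static boolean runtimeConcatIsLiteral(int value) {
        // 'value' is a parameter, not a constant, so this String is created at run time.
        String s = PREFIX + value + "!";
        return s == "a = 42!";
    }

    public static void main(String[] args) {
        System.out.println(foldedIsLiteral());
        System.out.println(runtimeConcatIsLiteral(42));
    }
}
```

The == comparisons are used here only to show object identity; normal code should compare strings with equals().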

First of all, StringBuffer is synchronized, so you would typically use StringBuilder. + was reimplemented to use StringBuilder a while ago.
Second, as @Riggy mentioned, Java actually does optimize + operators as long as they occur in a single expression. But if you were to do:
String s = "";
s += a;
s += b;
s += c;
s += d;
Then the effective code would become:
String s ="";
s = new StringBuilder(s).append(a).toString();
s = new StringBuilder(s).append(b).toString();
s = new StringBuilder(s).append(c).toString();
s = new StringBuilder(s).append(d).toString();
which is less efficient than
String s = new StringBuilder().append(a).append(b).append(c).append(d).toString();

Because of compiler optimizations, it may or may not make any difference in your app. You'll have to run comparison speed tests to see.
But before you obsess about performance, get the program working right. "Premature optimization is the root of all evil."

How to Reassign value of StringBuffer?

How can we reassign the value of a StringBuffer or StringBuilder variable?
StringBuffer sb=new StringBuffer("teststr");
Now i have to change the value of sb to "testString" without emptying the contents.
I am looking for a method that can do this assignment directly, without a separate memory allocation. I think we can do it only after emptying the contents.

sb.setLength(0);
sb.append("testString");

It should first be mentioned that StringBuilder is generally preferred to StringBuffer. From StringBuffer's own API:
As of release JDK 5, this class has been supplemented with an equivalent class designed for use by a single thread, StringBuilder. The StringBuilder class should generally be used in preference to this one, as it supports all of the same operations but it is faster, as it performs no synchronization.
That said, I will stick to StringBuffer for the rest of the answer because that's what you're asking about; everything that StringBuffer does, StringBuilder also does, except synchronization, which is generally unneeded. So unless you're using the buffer in multiple threads, switching to StringBuilder is a simple task.
The question
StringBuffer sb = new StringBuffer("teststr");
"Now i have to change the value of sb to "testString" without emptying the contents"
So you want sb to have the String value "testString" in its buffer? There are many ways to do this, and I will list some of them to illustrate how to use the API.
The optimal solution: it performs the minimum edit from "teststr" to "testString". It's impossible to do it any faster than this.
StringBuffer sb = new StringBuffer("teststr");
sb.setCharAt(4, 'S');
sb.append("ing");
assert sb.toString().equals("testString");
This needlessly overwrites "tr" with "tr".
StringBuffer sb = new StringBuffer("teststr");
sb.replace(4, sb.length(), "String");
assert sb.toString().equals("testString");
This involves shifts due to deleteCharAt and insert.
StringBuffer sb = new StringBuffer("teststr");
sb.deleteCharAt(4);
sb.insert(4, 'S');
sb.append("ing");
assert sb.toString().equals("testString");
This is a bit different now: it doesn't magically know that it has "teststr" that it needs to edit to "testString"; it assumes only that the StringBuffer contains at least one occurrence of "str" somewhere, and that it needs to be replaced by "String".
StringBuffer sb = new StringBuffer("strtest");
int idx = sb.indexOf("str");
sb.replace(idx, idx + 3, "String");
assert sb.toString().equals("Stringtest");
Let's say now that you want to replace ALL occurrences of "str" and replace it with "String". A StringBuffer doesn't have this functionality built-in. You can try to do it yourself in the most efficient way possible, either in-place (probably with a 2-pass algorithm) or using a second StringBuffer, etc.
But instead I will use the replace(CharSequence, CharSequence) from String. This will be more than good enough in most cases, and is definitely a lot more clear and easier to maintain. It's linear in the length of the input string, so it's asymptotically optimal.
String before = "str1str2str3";
String after = before.replace("str", "String");
assert after.equals("String1String2String3");
Discussions
"I am looking for the method to assign value later by using previous memory location"
The exact memory location shouldn't really be a concern for you; in fact, both StringBuilder and StringBuffer will reallocate their internal buffers to different memory locations whenever necessary. The only way to prevent that would be to call ensureCapacity (or set the capacity through the constructor) so that the internal buffer is always big enough and never needs to be reallocated.
However, even if StringBuffer does reallocate its internal buffer once in a while, it should not be a problem in most cases. Most data structures that grow dynamically (ArrayList, HashMap, etc.) do so in a way that preserves algorithmically optimal operations, taking advantage of cost amortization. I will not go through amortized analysis here, but unless you're building real-time systems etc., this shouldn't be a problem for most applications.
Obviously I'm not aware of the specifics of your need, but there is a fear of premature optimization since you seem to be worrying about things that most people have the luxury of never having to worry about.
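If avoiding reallocation really matters, the capacity methods mentioned above can be used roughly like this. It is a sketch with invented names; the numbers 64 and 100 are arbitrary.

```java
class CapacityDemo {
    // Returns true if the buffer allocated by the constructor was never replaced.
    static boolean keepsCapacity() {
        StringBuilder sb = new StringBuilder(64); // room for 64 chars up front
        int before = sb.capacity();
        sb.append("teststr");
        sb.setLength(0);          // empty the builder; capacity is unaffected
        sb.append("testString");  // still fits, so no reallocation is needed
        return sb.capacity() == before;
    }

    // Grows a default-sized builder so that at least 100 chars fit.
    static int ensured() {
        StringBuilder sb = new StringBuilder();
        sb.ensureCapacity(100);
        return sb.capacity();
    }

    public static void main(String[] args) {
        System.out.println(keepsCapacity());
        System.out.println(ensured() >= 100);
    }
}
```

capacity() only reports the size of the internal array; it says nothing about where that array lives in memory, which the garbage collector is free to change anyway.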

What do you mean by "reassign"? You can empty the contents by using setLength(0) and then start appending new content, if that's what you mean.
Edit: For changing parts of the content, you can use replace().
Generally, this kind of question can be easily answered by looking at the API doc of the classes in question.

You can use a StringBuilder in place of a StringBuffer, which is typically what people do if they can (StringBuilder isn't synchronized so it is faster but not threadsafe). If you need to initialize the contents of one with the other, use the toString() method to get the string representation. To recycle an existing StringBuilder or StringBuffer, simply call setLength(0).
Edit
You can overwrite a range of elements with the replace() function. To change the entire value to newval, you would use buffer.replace(0,buffer.length(),newval). See also:
StringBuilder
StringBuffer

You might be looking for the replace() method of the StringBuffer:
StringBuffer sb=new StringBuffer("teststr");
sb.replace(0, sb.length(), "newstr"); // the end index is exclusive
Internally, it removes the original string, then inserts the new string, but it may save you a step from this:
StringBuffer sb=new StringBuffer("teststr");
sb.delete(0, sb.length());
sb.append("newstr");
Using setLength(0) truncates the existing StringBuffer to zero length (sb still refers to the same object), which gives the same result:
StringBuffer sb=new StringBuffer("teststr");
// Truncate sb to zero length; the variable is not reassigned
sb.setLength(0);
sb.append("newstr");
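One detail worth checking when using replace() or delete() this way is that the end index is exclusive, so the whole content is the range 0 to sb.length(). A quick sketch with made-up class and method names shows the difference:

```java
class BufferReset {
    // Replaces the entire content: the range [0, length) covers every character.
    static String replaceWhole(String initial, String replacement) {
        StringBuffer sb = new StringBuffer(initial);
        sb.replace(0, sb.length(), replacement);
        return sb.toString();
    }

    // Off by one: the range [0, length - 1) leaves the last original character behind.
    static String replaceOffByOne(String initial, String replacement) {
        StringBuffer sb = new StringBuffer(initial);
        sb.replace(0, sb.length() - 1, replacement);
        return sb.toString();
    }

    // setLength(0) empties the same object; no new StringBuffer is created.
    static boolean sameObjectAfterSetLength() {
        StringBuffer sb = new StringBuffer("teststr");
        StringBuffer before = sb;
        sb.setLength(0);
        sb.append("newstr");
        return before == sb && sb.toString().equals("newstr");
    }

    public static void main(String[] args) {
        System.out.println(replaceWhole("teststr", "newstr"));    // prints "newstr"
        System.out.println(replaceOffByOne("teststr", "newstr")); // prints "newstrr"
        System.out.println(sameObjectAfterSetLength());           // prints "true"
    }
}
```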

Indeed, I think replace() is the best way. I checked the Java-Source code. It really overwrites the old characters.
Here is the source code from replace():
public AbstractStringBuffer replace(int start, int end, String str)
{
if (start < 0 || start > count || start > end)
throw new StringIndexOutOfBoundsException(start);
int len = str.count;
// Calculate the difference in 'count' after the replace.
int delta = len - (end > count ? count : end) + start;
ensureCapacity_unsynchronized(count + delta);
if (delta != 0 && end < count)
VMSystem.arraycopy(value, end, value, end + delta, count - end);
str.getChars(0, len, value, start);
count += delta;
return this;
}

Changing entire value of StringBuffer:
StringBuffer sb = new StringBuffer("word");
sb.setLength(0); // setting its length to 0 for making the object empty
sb.append("text");
This is how you can change the entire value of StringBuffer.

You can convert to/from a String, as follows:
StringBuffer buf = new StringBuffer();
buf.append("s1");
buf.append("s2");
StringBuilder sb = new StringBuilder(buf.toString());
// Now sb, contains "s1s2" and you can further append to it
