Existing solution to "smart" initial capacity for StringBuilder - java

I have a piece of logging- and tracing-related code which is called often throughout the application, especially when tracing is switched on. A StringBuilder is used to build the strings. The strings have a reasonable maximum length, I suppose on the order of hundreds of characters.
Question: Is there an existing library to do something like this:
// in reality, StringBuilder is final,
// would have to create a delegating version instead,
// which is quite a big class because of all the append() overloads
public class SmarterBuilder extends StringBuilder {
    private final AtomicInteger capRef;

    SmarterBuilder(AtomicInteger capRef) {
        // optionally save memory at the expense of worst-case resizes:
        // super(capRef.get() * 3 / 4);
        super(capRef.get());
        this.capRef = capRef;
    }

    public void syncCap() {
        // call when the string is fully built
        int cap;
        do {
            cap = capRef.get();
            if (cap >= length()) break;
        } while (!capRef.compareAndSet(cap, length()));
    }
}
To take advantage of this, my logging-related class would have a shared capRef variable with suitable scope.
(Bonus Question: I'm curious, is it possible to do syncCap() without looping?)
Motivation: I know the default capacity of StringBuilder is always too little. I could (and currently do) throw in an ad-hoc initial capacity value of 100, which still results in a resize in some number of cases, but not always. However, I do not like magic numbers in the source code, and this feature is a case of "optimize once, use in every project".

Make sure you do performance measurements to verify that you really are getting some benefit for the extra work.
As an alternative to a StringBuilder-like class, consider a StringBuilderFactory. It could provide two static methods, one to get a StringBuilder, and the other to be called when you finish building a string. You could pass it a StringBuilder as argument, and it would record the length. The getStringBuilder method would use statistics recorded by the other method to choose the initial size.
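A minimal sketch of such a factory, assuming a single shared maximum-length statistic; the class and method names (StringBuilderFactory, get, recordLength) are made up for illustration:

import java.util.concurrent.atomic.AtomicInteger;

public final class StringBuilderFactory {
    // largest final length observed so far; 16 mirrors StringBuilder's default
    private static final AtomicInteger maxLen = new AtomicInteger(16);

    // returns a builder sized to the largest length seen so far
    public static StringBuilder get() {
        return new StringBuilder(maxLen.get());
    }

    // call when the string is fully built; records its length
    public static void recordLength(StringBuilder sb) {
        maxLen.accumulateAndGet(sb.length(), Math::max);
    }
}

A fancier version could record a histogram of lengths and size builders to a chosen percentile instead of the maximum.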
There are two ways you could avoid looping in syncCap:
Synchronize.
Ignore failures.
The argument for ignoring failures in this situation is that you only need a random sampling of the actual lengths. If another thread is updating at the same time, you are still getting a reasonably up-to-date view of the string lengths anyway.
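A sketch of the "ignore failures" variant for the question's SmarterBuilder: one CAS attempt, no loop, and a lost race simply drops that sample:

public void syncCap() {
    int cap = capRef.get();
    if (cap < length()) {
        capRef.compareAndSet(cap, length()); // failure deliberately ignored
    }
}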

You could store the length of each built string in a statistics array, run your app, and at shutdown take the 90th percentile of your string lengths (sort all length values, and take the value at array pos = sortedLengths.size() * 0.9).
That way you get an initial StringBuilder size into which 90% of your strings will fit.
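A minimal sketch of that offline calculation, assuming the lengths were collected into an int[] during the run:

import java.util.Arrays;

// returns the value at the given percentile (e.g. 0.9 for 90%);
// assumes at least one length was recorded
static int percentileCapacity(int[] recordedLengths, double percentile) {
    int[] sorted = recordedLengths.clone();
    Arrays.sort(sorted);
    int pos = (int) (sorted.length * percentile);
    return sorted[Math.min(pos, sorted.length - 1)];
}

Used as, e.g., new StringBuilder(percentileCapacity(lengths, 0.9)).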
Update
The value could be hard-coded (like Java does with the value 10 in ArrayList), read from a config file, or calculated automatically in a test phase. But the percentile calculation is not free, so it is best to run your project for some time, measure the 90th percentile on the fly inside the SmartBuilder, output that value from time to time, and later change the property file to use it.
That way you would get optimal results for each project.
Or if you go one step further: Let your smart Builder update that value from time to time in the config file.
But all this is only worth the effort for data with some millions of entries, like digital road maps, etc.

Related

How to implement a LIMITLESS String and StringBuilder

Java's String and StringBuilder are limited to a length of Integer.MAX_VALUE. In most use cases this is more than adequate, but I have just encountered a use case in which I need to handle and return a String greater than 2,684,354,560 characters.
This is required for capturing an incoming stream of characters, in which I do not have control over the size of the stream, nor do I have the option of re-architecting the solution. What I can do at most is replace a method in an existing module, or introduce a new class that replaces String and StringBuilder in that method.
As a temporary workaround, to prevent the OutOfMemory error thrown when the StringBuilder length exceeds Integer.MAX_VALUE, I implemented the following safeAppend():
private void safeAppend(StringBuilder ret, String current) {
    if ((long) ret.length() + current.length() > Integer.MAX_VALUE) {
        String truncateLeadingPart;
        if (current.length() < ret.length()) {
            truncateLeadingPart = ret.substring(current.length());
        }
        else {
            int startIndex = (int) ((long) ret.length() + current.length() - Integer.MAX_VALUE);
            truncateLeadingPart = ret.substring(Math.min(ret.length(), startIndex));
        }
        ret.setLength(0);
        ret.append(truncateLeadingPart);
    }
    ret.append(current);
}
This method truncates the leading part and always keeps the trailing 2,147,483,647 characters. However, this workaround/safeguard proved inadequate for the task at hand, because we cannot afford to lose any data captured from the stream.
What is a recommended approach to implementing a String and StringBuilder that are NOT limited by an int max size?
A limit of a long max size could be sufficient. Also, a single LimitlessString class that can be appended efficiently like StringBuilder is also adequate.
You won't be able to use String or StringBuffer, as the 32-bit length is baked into the interface. That's also true of arrays and NIO buffers, unfortunately (there have been proposals to fix this, but nothing at the time of writing).
Obviously streaming or using random file access would be a good solution if that is possible.
You are left with implementing something else. Ropes use a binary tree to represent composition of string parts. More common is to use an array of arrays, or for better GC an array of directly allocated (or memory-mapped file) NIO buffers. Someone remarked a few years ago that this area of Computer Science still has scope for more PhDs.
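For illustration, a minimal sketch of the array-of-arrays idea: a long-indexed, append-only character sequence. The class name echoes the question; the chunk size and API are assumptions:

import java.util.ArrayList;
import java.util.List;

final class LimitlessString {
    private static final int CHUNK = 1 << 20; // 1M chars per chunk
    private final List<char[]> chunks = new ArrayList<>();
    private long length = 0;

    public void append(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            int offset = (int) (length % CHUNK);
            if (offset == 0) chunks.add(new char[CHUNK]); // grow by one chunk, no copying
            chunks.get(chunks.size() - 1)[offset] = s.charAt(i);
            length++;
        }
    }

    public char charAt(long index) {
        return chunks.get((int) (index / CHUNK))[(int) (index % CHUNK)];
    }

    public long length() {
        return length;
    }
}

Unlike StringBuilder, growing never copies existing data; it just allocates one more chunk. Bulk appends via System.arraycopy would be the obvious next improvement.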
Well, if you really, really need to extend the String/StringBuilder classes in such a way, you have to either create a new class that won't extend String/StringBuilder (because they are marked final), or change the JRE binaries to make String/StringBuilder non-final. Either way, both solutions suck: they will lead to huge support effort and generate a lot of WTFs in the future.
String and StringBuilder are final classes and cannot be patched. StringWriter would have been better.
Nice would have been:
not using two-byte chars, but bytes (a CharBuffer on top of a ByteBuffer);
compressing (GZIPOutputStream);
(as you did) periodically removing a huge chunk to a file or such.
[An aside] In newer Java (9+) there is internal support for single-byte string encodings, which would not allow more characters but would use half the memory.
You'll still hit resizing on appending, so the system will slow down.

Spark streaming gc setup questions

My logic is as follows.
Use createDirectStream to get a topic by log type in Kafka.
After repartition, the log is processed through various processing.
Create a single string using combineByKey for each log type (use StringBuilder).
Finally, save to HDFS by log type.
There are a lot of operations that add strings, so GC happens frequently.
How is it better to set up GC in this situation?
There are various logic, but I think there is a problem in doing combineByKey.
rdd.combineByKey[StringBuilder](
  (s: String) => new StringBuilder(s),
  (sb: StringBuilder, s: String) => sb.append(s),
  (sb1: StringBuilder, sb2: StringBuilder) => sb1.append(sb2)
).mapValues(_.toString)
The simplest thing with greatest impact you can do with that combineByKey expression is to size the StringBuilder you create so that it does not have to expand its backing character array as you merge string values into it; the resizing amplifies the allocation rate and wastes memory bandwidth by copying from old to new backing array. As a guesstimate, I would say pick the 90th percentile of string length of the resulting data set's records.
A second thing to look at (after collecting some statistics on your intermediate values) would be for your combiner function to pick the StringBuilder instance that has room to fit in the other one when you call sb1.append(sb2).
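A sketch of that combiner idea in plain Java, independent of Spark's API; the capacity checks are the point:

// merge b into whichever builder already has room, preserving a-then-b order
static StringBuilder merge(StringBuilder a, StringBuilder b) {
    if (a.capacity() - a.length() >= b.length()) {
        return a.append(b); // a has spare room: no backing-array growth
    }
    if (b.capacity() - b.length() >= a.length()) {
        return b.insert(0, a); // b has room: shifts b's content but avoids reallocation
    }
    return a.append(b); // neither fits: accept one resize
}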
A good thing to take care of would be to use Java 8; it has optimizations that make a significant difference when there's heavy work on strings and string buffers.
Last but not least, profile to see where you are actually spending your cycles. This workload (excluding any additional custom processing you are doing) shouldn't need to promote a lot of objects (if any) to old generation, so you should make sure that young generation has ample size and is collected in parallel.
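For example, with standard JVM flags passed through Spark's configuration (the sizes are placeholders to be tuned, not recommendations):

spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC -Xmn2g" \
  ...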

Reading large files for a simulation (Java crashes with out of heap space)

For a school assignment, I need to create a simulation of memory accesses. First I need to read 1 or more trace files. Each contains one memory address per access. Example:
0 F001CBAD
2 EEECA89F
0 EBC17910
...
Where the first integer indicates a read/write etc., and then the hex memory address follows. With this data I am supposed to run a simulation, so the idea I had was to parse the data into an ArrayList<Trace> (for now I am using Java), with Trace being a simple class containing the memory address and the access type (just a String and an integer). After that I plan to loop through these array lists to process them.
The problem is that it runs out of heap space even while parsing. Each trace file is ~200 MB and I have up to 8, meaning a minimum of ~1.6 GB of data I am trying to "cache". What baffles me is that I am only parsing 1 file and Java is using 2 GB according to my task manager...
What is a better way of doing this?
A code snippet can be found at Code Review
The answer I gave on Code Review is the same one you should use here...
But, because duplication appears to be OK, I'll duplicate the answer here.
The issue is almost certainly in the structure of your Trace class and its memory efficiency. You should ensure that the instrType and hexAddress are stored as memory-efficient structures. The instrType appears to be an int, which is good; just make sure that it is declared as an int in the Trace class.
The more likely problem is the size of the hexAddress String. You may not realise it, but Strings are notorious for 'leaking' memory. In this case, you have a line and you think you are just getting the hex string from it... but in reality, the hexString can contain the entire line (on JVMs before Java 7u6, substring shared the parent string's char[] rather than copying it). For example, look at the following code:
import java.util.StringTokenizer;

public class SToken {
    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("99 bottles of beer");
        int instrType = Integer.parseInt(tokenizer.nextToken());
        String hexAddr = tokenizer.nextToken();
        System.out.println(instrType + hexAddr);
    }
}
Now set a break-point in your IDE (I use Eclipse) and run it, and you will see that hexAddr contains a char[] array for the entire line, with an offset of 3 and a count of 7.
Because of the way that String substring and other constructs work, they can consume huge amounts of memory for short strings (in theory that memory is shared with other strings, though). As a consequence, you are essentially storing the entire file in memory!
At a minimum, you should change your code to:
hexAddr = new String(tokenizer.nextToken().toCharArray());
But even better would be:
long hexAddr = parseHexAddress(tokenizer.nextToken());
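For completeness, a plausible one-line implementation of that hypothetical helper: the addresses are at most 8 hex digits, so they always fit in a long without sign issues:

static long parseHexAddress(String token) {
    return Long.parseLong(token, 16);
}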
Like rolfl, I answered your question on Code Review. The biggest issue, to me, is reading everything into memory first and then processing it. You need to read a fixed amount, process that, and repeat until finished.
Try using java.nio.ByteBuffer instead of java.util.ArrayList<Trace>. It should also reduce the memory usage.
import java.nio.ByteBuffer;

class TraceList {
    // 5 bytes per record: 1 byte operation type + 4 bytes address
    private static final int RECORD_SIZE = 5;
    private final ByteBuffer buffer;

    public TraceList(int maxTraces) {
        // allocate one flat byte buffer for all records
        buffer = ByteBuffer.allocate(maxTraces * RECORD_SIZE);
    }

    public void put(byte operationType, int address) {
        // append one record to the byte buffer
        buffer.put(operationType);
        buffer.putInt(address);
    }

    public Trace get(int index) {
        // read one record from the byte buffer by absolute index
        int offset = index * RECORD_SIZE;
        byte type = buffer.get(offset);
        int address = buffer.getInt(offset + 1);
        return new Trace(type, address);
    }
}

Someone told me that it is saving memory to use numbers directly instead of static final int fields, is that true?

In my Android project there are many constants representing bundle extra keys, Handler message arguments, dialog ids, and so on.
Someone on my team uses plain numbers for this, like:
handler.sendMessage(handler.obtainMessage(MESSAGE_OK, 1, 0));
handler.sendMessage(handler.obtainMessage(MESSAGE_OK, 2, 0));
handler.sendMessage(handler.obtainMessage(MESSAGE_OK, 3, 0));
in handler:
switch (msg.arg1) {
    case 1:
        break;
    case 2:
        break;
    case 3:
        break;
}
He said too many static final constants cost a lot of memory, but I think his solution makes the code hard to read and refactor.
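For comparison, a sketch of the same dispatch with named constants (the names here are invented):

static final int ARG_STARTED = 1;
static final int ARG_PROGRESS = 2;
static final int ARG_FINISHED = 3;

handler.sendMessage(handler.obtainMessage(MESSAGE_OK, ARG_STARTED, 0));

switch (msg.arg1) {
    case ARG_STARTED:
        break;
    case ARG_PROGRESS:
        break;
    case ARG_FINISHED:
        break;
}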
I have read this question and googled a lot and failed to find an answer.
java: is using a final static int = 1 better than just a normal 1?
I hope someone could show me the memory cost of static finals.
Sorry for my poor English.
You shouldn't bother to change it to literals; it will make your code less readable and less maintainable.
In the long run you will benefit from this "loss" of memory.
Technically, he is right - static int fields do cost some additional memory.
However, the cost is negligible. It's an int, plus the associated metadata for the reflection support. The benefits of using meaningful names that make your code more readable, and that ensure the semantics of the number are well known and consistent everywhere it is used, clearly outweigh that cost.
You can do a simple test. Write a small application that calls handler.sendMessage 1000 times with different number literals, build it, and note down the size of the .dex file. Then replace the 1000 literals with 1000 static int constants and do the same. Compare the two sizes and you will get an idea of the order of magnitude of additional memory your app will need. (And just for completeness, post the numbers here as a comment :-))
It saves a very small amount of memory - basically just the extra metadata required to record the extra constant in the relevant class and refer to it from other classes.
It is NOT worth worrying about this, unless you are extremely memory constrained.
Using well-named static final constants rather than mysterious magic numbers is much better for your code maintainability and sanity in the long run.

Efficient pattern search from a buffer in Java?

I've been trying to figure out an efficient way to avoid reading twice from a file when using a buffer to search for a pattern of bytes. I've chosen to implement Runnable so I can divide the task across concurrent threads. My code looks something like this:
// constructor: initializes local variables.
public BytePatternSearcher(RandomAccessFile raFile, byte[] pattern, int bufferSize, long startPos, long endPos);

public void run()
{
    while (amountToRead - raFile.read(buffer) > 0)
    {
        // search code
    }
}
Now I've hit a snag: my algorithm works in simple cases, but not in complex ones. I assumed that no pattern starts inside another match already being checked, that the pattern is shorter than the buffer, and so on, limiting the search to one scan at a time while iterating through the file. Naturally, this is not a very robust solution. Suppose I have a pattern 'xxxxx' (length 5), my file is 'xxxxxxyxxxxxx', and my buffer size is 2 (x and y represent certain byte values). The pattern appears 4 times, and each check requires more than twice the buffer length.
How do I go about working things out without checking the same byte more than once for all cases?
Wikipedia has an entry for the Boyer–Moore string search algorithm, which also contains some sample implementations.
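For illustration, a sketch of buffered scanning that handles matches spanning buffer boundaries by carrying over the last pattern.length - 1 bytes between reads. A naive byte comparison stands in for Boyer–Moore, and overlapping matches are counted because the scan advances one byte at a time:

import java.io.IOException;
import java.io.RandomAccessFile;

static int countMatches(RandomAccessFile raFile, byte[] pattern, int bufferSize) throws IOException {
    // room for one chunk plus the carried-over tail
    byte[] buf = new byte[Math.max(bufferSize, pattern.length) + pattern.length - 1];
    int carried = 0; // bytes kept from the previous chunk
    int count = 0;
    int read;
    while ((read = raFile.read(buf, carried, buf.length - carried)) > 0) {
        int valid = carried + read;
        // scan every window that fits entirely in the valid region
        for (int i = 0; i + pattern.length <= valid; i++) {
            int j = 0;
            while (j < pattern.length && buf[i + j] == pattern[j]) j++;
            if (j == pattern.length) count++;
        }
        // keep the tail so cross-boundary matches are not missed;
        // it is shorter than the pattern, so nothing is counted twice
        carried = Math.min(pattern.length - 1, valid);
        System.arraycopy(buf, valid - carried, buf, 0, carried);
    }
    return count;
}

On the question's example ('xxxxx' against 'xxxxxxyxxxxxx') this finds all 4 occurrences regardless of buffer size.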
