How to implement a LIMITLESS String and StringBuilder

Java's String and StringBuilder are limited to a length of Integer.MAX_VALUE. In most use cases this is more than adequate, but I have just encountered a use case in which I need to handle and return a String greater than 2,684,354,560 characters.
This is required for capturing an incoming stream of characters, in which I do not have control over the size of the stream, nor do I have the option of re-architecting the solution. What I can do at most is replace a method in an existing module, or introduce a new class that replaces String and StringBuilder in that method.
As a temporary workaround, to prevent the OutOfMemoryError thrown when the StringBuilder length exceeds Integer.MAX_VALUE, I implemented the following safeAppend():
private void safeAppend(StringBuilder ret, String current) {
    // Would the combined length overflow an int?
    if ((long) ret.length() + current.length() > Integer.MAX_VALUE) {
        String truncateLeadingPart;
        if (current.length() < ret.length()) {
            // Drop as many leading characters as are about to be appended.
            truncateLeadingPart = ret.substring(current.length());
        }
        else {
            int startIndex = (int) ((long) ret.length() + current.length() - Integer.MAX_VALUE);
            truncateLeadingPart = ret.substring(Math.min(ret.length(), startIndex));
        }
        ret.setLength(0);
        ret.append(truncateLeadingPart);
    }
    ret.append(current);
}
This method truncates the leading part and always keeps the trailing 2,147,483,647 characters. However, this workaround/safeguard proved to be inadequate for the task at hand because we cannot afford to lose any data captured from the stream.
What is a recommended approach to implementing a String and StringBuilder that are NOT limited by an int max size?
A limit of a long max size could be sufficient. Also, a single LimitlessString class that can be appended efficiently like StringBuilder is also adequate.

You won't be able to use String or StringBuffer, as the 32-bit length is baked into the interface. That's also true of arrays and NIO buffers, unfortunately (there have been proposals to fix this, but nothing at the time of writing).
Obviously streaming or using random file access would be a good solution if that is possible.
You are left with implementing something else. Ropes use a binary tree to represent composition of string parts. More common is to use an array of arrays, or for better GC an array of directly allocated (or memory-mapped file) NIO buffers. Someone remarked a few years ago that this area of Computer Science still has scope for more PhDs.
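As a rough illustration of the array-of-arrays idea, here is a minimal sketch. The class name ChunkedCharBuilder and its methods are made up for this example (they are not a JDK or library API), and it only supports appending and indexed reads with a long index:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only character store whose length is a long (array of arrays). */
final class ChunkedCharBuilder {
    private final int chunkSize;
    private final List<char[]> chunks = new ArrayList<>();
    private long length = 0;

    ChunkedCharBuilder(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    /** Appends s, allocating a new fixed-size chunk whenever the last one is full. */
    ChunkedCharBuilder append(CharSequence s) {
        int copied = 0;
        while (copied < s.length()) {
            int offset = (int) (length % chunkSize);
            if (offset == 0) {
                // Either nothing is stored yet or the last chunk is exactly full.
                chunks.add(new char[chunkSize]);
            }
            char[] chunk = chunks.get(chunks.size() - 1);
            int n = Math.min(chunkSize - offset, s.length() - copied);
            for (int i = 0; i < n; i++) {
                chunk[offset + i] = s.charAt(copied + i);
            }
            copied += n;
            length += n;
        }
        return this;
    }

    long length() {
        return length;
    }

    char charAt(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return chunks.get((int) (index / chunkSize))[(int) (index % chunkSize)];
    }
}
```

Existing chunks are never copied when the store grows, which is the main advantage over one big resizable array. Whether heap arrays, direct buffers or memory-mapped files are the right backing store depends on how much data you actually expect.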

Well, if you really, really need to extend the String/StringBuilder classes in such a way, you have to either create a new class that does not extend String/StringBuilder (because they are marked as final), or change the JRE binaries to make String/StringBuilder non-final. Either way, both solutions suck: they will lead to a huge support effort and will generate a lot of WTFs in the future.

String and StringBuilder are final classes and cannot be patched. StringWriter would have been better.
Nice would have been:
not using two-byte chars, but bytes (CharBuffer upon ByteBuffer);
compressing (GZIPOutputStream);
(as you did) periodically remove a huge chunk to a file or such;
[An aside] newer Java versions (9 and later) store strings that contain only Latin-1 characters with one byte per character; this does not allow more characters, but it uses half the memory.
You'll meet resizing on appending, so the system will slow down.

Related

from InputStream to List<String>, why java is allocating space twice in the JVM?

I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet :
try (BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
    List<String> data = reader.lines()
            .collect(Collectors.toList());
}
The problem is that the file itself is less than 2GB, but when I look at the memory, the JVM is allocating twice the size of the file:
Also, here are the heaviest objects in memory :
So what I understand is that Java is allocating twice the memory needed for the operation: once to put the content of the file in a byte array and once more to instantiate the string list.
My question is: can we optimize that and avoid using twice the memory needed?

tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there's a good number of caveats on this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead process and write away the data as you read it, thus never having more than a handful of records in memory at once.
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
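The second suggestion (not keeping the whole thing in memory) could look roughly like the sketch below. It assumes, purely for illustration, that the per-line work is finding the longest line; LineStats and longestLine are hypothetical names, and your real processing step would go where the Math.max call is:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class LineStats {
    /** Reads one line at a time, so only the current line is held in memory. */
    static int longestLine(Reader source) {
        int longest = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Placeholder for the real per-line processing.
                longest = Math.max(longest, line.length());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return longest;
    }
}
```

In the question's setup the Reader would be the InputStreamReader wrapped around the zip stream. Each line becomes garbage as soon as the next one is read, so memory use stays flat regardless of file size.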

In Java, how to copy data from String to char[]/byte[] efficiently?

I need to copy the content of many large, different Strings into a static char array and use that array frequently in a performance-critical job, so it's important to avoid allocating too much new space.
For the reason above, str.toCharArray() was banned, since it allocates space for every String.
As we all know, charAt(i) is slower and more complex than indexing an array with square brackets [i]. So I want to use byte[] or char[].
The good news is that there's a str.getBytes(srcBegin, srcEnd, dst, dstBegin). The bad news is that it was (or is to be?) deprecated.
So how can we finish this demanding job?

I believe you want getChars(int, int, char[], int). That will copy the characters into the specified array, and I'd expect it to do it "as efficiently as reasonably possible".
You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.
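A minimal sketch of how getChars can fill one reusable array, assuming single-threaded use; CharScratch and load are names invented for this example:

```java
/** Hypothetical reusable scratch buffer; not thread-safe. */
final class CharScratch {
    private char[] buffer = new char[16];

    /**
     * Copies s into the shared buffer and returns the number of valid chars.
     * A new array is allocated only when s does not fit into the current one.
     */
    int load(String s) {
        if (s.length() > buffer.length) {
            buffer = new char[Math.max(s.length(), buffer.length * 2)];
        }
        s.getChars(0, s.length(), buffer, 0);
        return s.length();
    }

    /** The backing array; only the first load(...) chars are meaningful. */
    char[] buffer() {
        return buffer;
    }
}
```

After the buffer has grown to the size of the largest string seen so far, further calls to load allocate nothing. The caller has to track the returned length because the array may contain stale characters from an earlier, longer string.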

A small stocktaking:
String does Unicode text; it can be normalized (java.text.Normalizer).
int[] code points are Unicode symbols.
char[] holds UTF-16 code units (2 bytes per char); sometimes 2 chars are needed for one code point: a surrogate pair.
byte[] is for binary data. Holding Unicode text in UTF-8 is relatively compact when most of the text is ASCII or Latin-1.
Processing might be done on a ByteBuffer, CharBuffer, IntBuffer.
When dealing with Asian scripts, int code points probably is most feasible.
Otherwise bytes seem best.
Code points (or chars) also make sense when the Character class is utilized for classification of Unicode blocks and scripts, digits in several scripts, emoji, whatever.
For performance, bytes are often best because they are the most compact; UTF-8 probably.
One cannot deal efficiently with memory allocation here. getBytes should be used with a Charset, and almost always some kind of conversion happens. As newer Java versions can keep a byte array instead of a char array internally for strings that fit Latin-1 (ISO-8859-1), even getting at an internal char array would not work. And new arrays are created.
What one can do, is using fast ByteBuffers.
Alternatively for lingual analysis one can use databases, maybe graph databases. At least something which can exploit parallelism.

You are pretty much restricted to the APIs offered within the String class, and obviously, that deprecated method is supposed to be replaced with getBytes() (or an overload that lets you specify a charset).
In other words: that problem you are talking about "having many large strings, that need to go into arrays" can't be solved easily.
Thus a distinct non-answer: look into your design. If performance is really critical, then do not create those many large strings upfront!
In other words: if your measurements convince you that you do have real performance issue, then adapt your design as needed. Maybe there is a chance that in the place where your strings are "coming" in ... you already do not use String objects, but something that works better for you, later on, performance wise.
But of course: that will lead to a complex, error prone solution, where you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and it actually sits in the place you think it sits.

str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because sometimes you don't have to convert the entire string I suppose you could substring() first, but I'm not sure how badly that would impact your code's efficiency, if at all. Or if it's all the same to you if you store it in char[] then you can use getChars(int,int,char[],int) which is not deprecated.

Existing solution to "smart" initial capacity for StringBuilder

I have a piece of logging and tracing related code, which is called often throughout the code, especially when tracing is switched on. StringBuilder is used to build a String. The Strings have a reasonable maximum length, I suppose on the order of hundreds of chars.
Question: Is there existing library to do something like this:
// in reality, StringBuilder is final,
// would have to create delegated version instead,
// which is quite a big class because of all the append() overloads
public class SmarterBuilder extends StringBuilder {
    private final AtomicInteger capRef;
    SmarterBuilder(AtomicInteger capRef) {
        // optionally save memory at the expense of worst-case resizes:
        // super(capRef.get() * 3 / 4);
        super(capRef.get());
        this.capRef = capRef;
    }
    public void syncCap() {
        // call when string is fully built
        int cap;
        do {
            cap = capRef.get();
            if (cap >= length()) break;
        } while (!capRef.compareAndSet(cap, length()));
    }
}
To take advantage of this, my logging-related class would have a shared capRef variable with suitable scope.
(Bonus Question: I'm curious, is it possible to do syncCap() without looping?)
Motivation: I know the default capacity of StringBuilder is always too small. I could (and currently do) throw in an ad-hoc initial capacity value of 100, which results in a resize in some number of cases, but not always. However, I do not like magic numbers in the source code, and this feature is a case of "optimize once, use in every project".

Make sure you do the performance measurements to make sure you really are getting some benefit for the extra work.
As an alternative to a StringBuilder-like class, consider a StringBuilderFactory. It could provide two static methods, one to get a StringBuilder, and the other to be called when you finish building a string. You could pass it a StringBuilder as argument, and it would record the length. The getStringBuilder method would use statistics recorded by the other method to choose the initial size.
There are two ways you could avoid looping in syncCap:
Synchronize.
Ignore failures.
The argument for ignoring failures in this situation is that you only need a random sampling of the actual lengths. If another thread is updating at the same time you are getting an up-to-date view of the string lengths anyway.
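A minimal sketch of the factory idea, under a few assumptions of mine: the names get, finish and capacityHint are invented, the statistic kept is simply the maximum length seen so far, and Java 8 or later is available for AtomicInteger.accumulateAndGet. Note that accumulateAndGet hides the retry loop rather than removing it:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical factory that sizes new builders from previously observed lengths. */
final class StringBuilderFactory {
    // Largest finished length seen so far; starts at StringBuilder's default capacity.
    private static final AtomicInteger maxSeen = new AtomicInteger(16);

    private StringBuilderFactory() {
    }

    /** Returns a builder pre-sized to the largest string built so far. */
    static StringBuilder get() {
        return new StringBuilder(maxSeen.get());
    }

    /** Call when the string is complete; records its length and returns the String. */
    static String finish(StringBuilder sb) {
        // Atomic "max" update; the compare-and-set retry happens inside the JDK.
        maxSeen.accumulateAndGet(sb.length(), Math::max);
        return sb.toString();
    }

    static int capacityHint() {
        return maxSeen.get();
    }
}
```

Using the maximum means one unusually long message inflates every later builder; a percentile or a decaying average would be more forgiving, at the cost of more bookkeeping.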

You could store the length of each string in a statistics array. Run your app, and at shutdown take the 90th percentile of the string lengths (sort all length values and take the value at array position sortedLengths.size() * 0.9).
That way you get an initial StringBuilder size into which 90% of your strings will fit.
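The percentile step could be sketched as follows. LengthStats and percentile are hypothetical names, and the sketch uses the nearest-rank definition (smallest value that covers at least the given fraction of samples), which may differ by one position from the plain size * 0.9 index described above:

```java
import java.util.Arrays;

final class LengthStats {
    /**
     * Nearest-rank percentile of the recorded lengths.
     * fraction is expected to be in the range (0, 1], e.g. 0.9 for the 90th percentile.
     */
    static int percentile(int[] lengths, double fraction) {
        if (lengths.length == 0) {
            throw new IllegalArgumentException("no samples recorded");
        }
        int[] sorted = lengths.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(fraction * sorted.length);
        // Clamp so that out-of-range fractions cannot index outside the array.
        int index = Math.max(0, Math.min(sorted.length - 1, rank - 1));
        return sorted[index];
    }
}
```

Calling LengthStats.percentile(recordedLengths, 0.9) then gives a candidate initial capacity.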
Update
The value could be hard-coded (as Java does with the value 10 in ArrayList), read from a config file, or calculated automatically in a test phase. But the percentile calculation is not free, so it is best to run your project for some time, measure the 90th percentile on the fly inside the SmartBuilder, output it from time to time, and later change the property file to use that value.
That way you would get optimal results for each project.
Or if you go one step further: Let your smart Builder update that value from time to time in the config file.
But this all is not worth the effort, you would do that only for data that have some millions entries, like digital road maps, etc.

Is there an easier way to change BufferedReader to string?

Right Now I have
;; buffer->string: BufferedReader -> String
(defn buffer->string [buffer]
  (loop [line (.readLine buffer) sb (StringBuilder.)]
    (if (nil? line)
      (.toString sb)
      (recur (.readLine buffer) (.append sb line)))))
This is too slow.
Edit:
I have a BufferedReader.
When I try to do (str BufferedReader) it gives me "java.io.BufferedReader@1ce784b".
The above loop is too slow and I run out of memory.

(clojure.contrib.duck-streams/slurp* your-buffer) ; is what you want
Your code is slow because buffer isn't hinted.

I don't know Clojure, so I can't tell if you have some detail wrong in your code, but using StringBuffer and appending the input line by line is the correct way to do it (well, using a StringBuilder initialized to the expected final size if known would bring significant but not dramatic improvements).
If you run out of memory, then maybe the content of your BufferedReader is simply too large to fit into your memory and there is no way to have it as a single string - in that case, you'll either have to increase your heap size or find a way to process the data one small chunk at a time.
BTW, if you know the size of your input, a more efficient method would be to use a CharBuffer and fill it by using Reader.read() (you'll have to pay attention to the return value and call it in a loop).

buffer.ToString()? Or in your case, maybe (.toString buffer)?

In Java you would do something like:
// requires java.io.BufferedReader and java.io.IOException
public String getStringFromBuffer(BufferedReader bRead) throws IOException {
    String line;
    StringBuffer theText = new StringBuffer();
    // readLine() strips the line terminator, so add it back
    while ((line = bRead.readLine()) != null) {
        theText.append(line).append('\n');
    }
    return theText.toString();
}

I don't know Clojure, just Java. Let's work from there.
Some points to consider:
If your target JVM version is >= 1.5 you can use StringBuilder instead of StringBuffer for a small performance improvement (no synchronization and you don't need it). Read about it here
http://java.sun.com/j2se/1.5.0/docs/api/java/lang/StringBuilder.html
But your big performance cost is probably on the buffer expansion. When you instantiate a StringBuffer/StringBuilder without using the constructor with the capacity argument you get a small capacity.
When starting with a small capacity (the internal buffer size) you have many expansions: every time you exceed that capacity, the internal buffer is reallocated to a larger one (roughly double the old capacity, or just large enough for the appended text if that is bigger), which means copying all previously held text to the new buffer.
This is very slow when you are appending more text to an already very large string.
If you have access to the size of the text you are reading (a file size would be an approximation) you can significantly reduce the amount of expansions.
I could also tell you to use read() method of the BufferedReader, the one with 3 arguments, this one:
BufferedReader.read(char[], int, int)
You could then use one of the String class constructors that accept a char array to convert the char buffer into a String:
new String(char[], int, int)
...however, I suspect that the performance improvement will not be that big, especially when compared with the one of reducing how many StringBuilder expansions you'll have.
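Both ideas (a pre-sized builder and block reads) can be combined in a few lines. This is only a sketch: ReaderText, readAll and the 8192-char block size are my own choices, and expectedChars is whatever estimate you have, such as the file size:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

final class ReaderText {
    /** Reads everything from reader; expectedChars is only a sizing hint. */
    static String readAll(Reader reader, int expectedChars) {
        // Pre-size the builder so that it rarely (ideally never) has to expand.
        StringBuilder sb = new StringBuilder(Math.max(16, expectedChars));
        char[] block = new char[8192];
        try {
            int n;
            // read(char[], int, int) returns the number of chars read, or -1 at end of stream.
            while ((n = reader.read(block, 0, block.length)) != -1) {
                sb.append(block, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Unlike the readLine() loop, this keeps the line terminators exactly as they appear in the input.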
Whatever the approach, you seem to have a memory capacity problem:
In the end you will need at least twice as much memory as the whole text occupies.
Whether you use the StringBuilder/StringBuffer approach or the other one, in the end you will have to copy the text contents to the new String holding the result.
In the end you will probably need to think outside this box:
Are you sure you only have a BufferedReader as a start and a String as an end? You should provide the broader picture!
If this is the broadest you have, you will need at least a JVM instance configured with more heap, since you will probably run out of memory with any of these solutions anyway.

use slurp to read (reasonably sized files) in
use spit to write them back out again.

Convert ASCII byte[] to String

I am trying to pass a byte[] containing ASCII characters to log4j, to be logged into a file using the obvious representation. When I simply pass in the byte[] it is of course treated as an object and the logs are pretty useless. When I try to convert them to strings using new String(byte[] data), the performance of my application is halved.
How can I efficiently pass them in, without incurring the approximately 30us time penalty of converting them to strings.
Also, why does it take so long to convert them?
Thanks.
Edit
I should add that I am optimising for latency here - and yes, 30us does make a difference! Also, these arrays vary from ~100 all the way up to a few thousand bytes.

ASCII is one of the few encodings that can be converted to/from UTF16 with no arithmetic or table lookups so it's possible to convert manually:
String convert(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length);
    for (int i = 0; i < data.length; ++i) {
        // bytes above 0x7F are negative in Java and are not ASCII
        if (data[i] < 0) throw new IllegalArgumentException();
        sb.append((char) data[i]);
    }
    return sb.toString();
}
But make sure it really is ASCII, or you'll end up with garbage.

What you want to do is delay processing of the byte[] array until log4j decides that it actually wants to log the message. This way you could log it at DEBUG level, for example, while testing and then disable it during production. For example, you could:
final byte[] myArray = ...;
Logger.getLogger(MyClass.class).debug(new Object() {
    @Override public String toString() {
        return new String(myArray);
    }
});
Now you don't pay the speed penalty unless you actually log the data, because the toString method isn't called until log4j decides it'll actually log the message!
Now I'm not sure what you mean by "the obvious representation" so I've assumed that you mean convert to a String by reinterpreting the bytes as the default character encoding. Now if you are dealing with binary data, this is obviously worthless. In that case I'd suggest using Arrays.toString(byte[]) to create a formatted string along the lines of
[54, 23, 65, ...]

If your data is in fact ASCII (i.e. 7-bit data), then you should be using new String(data, "US-ASCII") instead of depending on the platform default encoding. This may be faster than trying to interpret it as your platform default encoding (which could be UTF-8, which requires more introspection).
You could also speed this up by avoiding the Charset-Lookup hit each time, by caching the Charset instance and calling new String(data, charset) instead.
Having said that: it's been a very, very long time since I've seen real ASCII data in a production environment.
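On Java 7 and later the cached Charset does not have to be looked up by name at all, because java.nio.charset.StandardCharsets provides it as a constant. A minimal sketch (AsciiText and decode are names made up for this example):

```java
import java.nio.charset.StandardCharsets;

final class AsciiText {
    /** Decodes bytes as US-ASCII without a by-name charset lookup. */
    static String decode(byte[] data) {
        // Bytes outside the 7-bit range are not rejected here; the decoder
        // substitutes a replacement character for them.
        return new String(data, StandardCharsets.US_ASCII);
    }
}
```

Whether this is measurably faster than the platform default in your case is something only a benchmark on your JVM can tell.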

Halved performance? How large is this byte array? If it's for example 1MB, then there are certainly more factors to take into account than just "converting" from bytes to chars (which is supposed to be fast enough though). Writing 1MB of data instead of "just" 100 bytes (which the byte[].toString() may generate) to a log file is obviously going to take some time. The disk file system is not as fast as RAM.
You'll need to change the string representation of the byte array. Maybe to some more sensible information, e.g. the name associated with it (filename?), its length and so on. After all, what does that byte array actually represent?
Edit: I can't remember to have seen the "approximately 30us" phrase in your question, maybe you edited it in within 5 minutes after asking, but this is actually microoptimization and it should certainly not cause "halved performance" in general. Unless you write them a million times per second (still then, why would you want to do that? aren't you overusing the phenomenon "logging"?).

Take a look here: Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)
