How do I avoid the overhead associated with String.getBytes(Charset ch) - java

String.getBytes(Charset ch) allocates a new buffer on every call, since it returns a fresh byte[]. Is there a way to avoid this? I'd like to keep a reusable byte array and have the strings encoded into that buffer.

You can use the Charset and CharsetEncoder APIs directly, in particular calling encode(CharBuffer, ByteBuffer, boolean). However, I wouldn't expect it to end up being particularly pleasant code.
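A minimal sketch of that approach, assuming UTF-8 and a destination array that is large enough. The class name ReusableEncoder and the method encodeInto are made up for illustration. The two wrap() calls still create small wrapper objects, but no new byte[] is allocated.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: encodes Strings into a caller-supplied byte[].
class ReusableEncoder {
    // Assumption: UTF-8. The encoder is reused across calls, so it is not thread-safe.
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    /** Encodes s into dst starting at index 0 and returns the number of bytes written. */
    int encodeInto(String s, byte[] dst) throws CharacterCodingException {
        ByteBuffer out = ByteBuffer.wrap(dst); // view over dst, no copy
        CharBuffer in = CharBuffer.wrap(s);    // view over s, no copy
        encoder.reset();
        CoderResult result = encoder.encode(in, out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (dst too small) throws BufferOverflowException;
            // malformed or unmappable input throws a CharacterCodingException.
            result.throwException();
        }
        return out.position();
    }

    public static void main(String[] args) throws CharacterCodingException {
        ReusableEncoder enc = new ReusableEncoder();
        byte[] buffer = new byte[64];          // reused for every string
        int n = enc.encodeInto("h\u00e9llo", buffer);
        System.out.println(n);                 // 6 (the accented letter takes two bytes)
    }
}
```

Only the first n bytes of the array are valid after each call; anything beyond that is left over from earlier strings.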

If, like me, you haven't mastered ByteBuffer, then to complement Jon's answer you could also write your own OutputStream implementation that wraps your byte array, and use an OutputStreamWriter to write the String to this custom OutputStream.
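A rough sketch of the OutputStream idea, assuming UTF-8. FixedArrayOutputStream and FixedArrayDemo are hypothetical names. Note that OutputStreamWriter keeps an internal buffer of its own, so this avoids the result byte[] but is not allocation-free.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical fixed-capacity stream that writes into a caller-supplied array.
class FixedArrayOutputStream extends OutputStream {
    private final byte[] target;
    private int count;

    FixedArrayOutputStream(byte[] target) {
        this.target = target;
    }

    @Override
    public void write(int b) throws IOException {
        if (count >= target.length) throw new IOException("buffer full");
        target[count++] = (byte) b;
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        if (len > target.length - count) throw new IOException("buffer full");
        System.arraycopy(src, off, target, count, len);
        count += len;
    }

    /** Number of valid bytes written so far. */
    int size() {
        return count;
    }
}

class FixedArrayDemo {
    /** Encodes s into reusable (from index 0) and returns the encoded length. */
    static int encode(String s, byte[] reusable) throws IOException {
        FixedArrayOutputStream out = new FixedArrayOutputStream(reusable);
        // Assumption: UTF-8. Closing the writer flushes its internal buffer.
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(s);
        }
        return out.size();
    }
}
```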

You can use
getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin)
//Copies characters from this string into the destination character array.
and manage the array yourself. Note that this gives you chars, not encoded bytes, so you still have to encode them.

Related

Go from ByteBuffer to String directly but with no intermediary byte[]?

Is there an approach that avoids having to copy a byte[] out of a ByteBuffer with the ByteBuffer.get() operation?
I was looking at this post, Java: Converting String to and from ByteBuffer and associated problems,
and that approach creates an intermediate CharBuffer, which I don't want either.
I would like to go from ByteBuffer to String directly.
When I know I have an underlying byte[], this is easy with code like this:
new String(data, offset, length, charSet);
I was hoping for something similar with ByteBuffer. I am beginning to think this may not be possible? I need to decode N bytes of my ByteBuffer really.
This may be a bit of premature optimization but I am really just curious and wanted to test out the performance and squeeze every little bit out. (personal project really).
thanks,
Dean
Not really for a direct ByteBuffer, no. You need some intermediate object, because String doesn't take a ByteBuffer as a constructor argument, and you can't make a String wrap one (or even a char[]). If the buffer is non-direct, you can use the array() method to get a reference to the backing array (which isn't an intermediate array) and create a String out of that.
On the plus side, there are probably far more performance-sensitive places in your project.
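To make the two cases concrete, here is a hedged sketch. BufferStrings and readString are illustrative names, not an existing API. It reads from the backing array when one is accessible and falls back to Charset.decode, which goes through a CharBuffer, otherwise.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class BufferStrings {
    /** Decodes the next length bytes of buf (from its position) and advances the position. */
    static String readString(ByteBuffer buf, int length, Charset charset) {
        if (length > buf.remaining()) {
            throw new BufferUnderflowException();
        }
        String s;
        if (buf.hasArray()) {
            // Heap buffer: read straight from the backing array, no intermediate copy.
            int start = buf.arrayOffset() + buf.position();
            s = new String(buf.array(), start, length, charset);
        } else {
            // Direct (or read-only) buffer: an intermediate CharBuffer is unavoidable here.
            ByteBuffer view = buf.duplicate();
            view.limit(view.position() + length);
            s = charset.decode(view).toString();
        }
        buf.position(buf.position() + length);
        return s;
    }
}
```

The arrayOffset() term matters for buffers created with slice() or wrap(array, offset, length), where position 0 does not correspond to index 0 of the backing array.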

Any util class/method to take a large String and return an InputStream?

I am looking for some util class/method to take a large String and return an InputStream.
If the String is small, I can just do:
InputStream is = new ByteArrayInputStream(str.getBytes(<charset>));
But when the String is large (1MB, 10MB or more), a byte array 1 to 2 times (or more?) as large as my String is allocated on the spot. (And since you don't know exactly how many bytes to allocate before all the encoding is done, I think other arrays/buffers must be allocated before the final byte array.)
I have performance requirements, and want to optimize this operation.
Ideally I think, the class/method I am looking for would encode the characters on the fly one small block at a time as the InputStream is being consumed, thus no big surge of mem allocation.
Looking at the source code of apache commons IOUtils.toInputStream(..), I see that it also converts the String to a big byte array in one go.
And StringBufferInputStream is Deprecated, and does not do the job properly.
Is there such util class/method from anywhere? Or I can just write a couple of lines of code to do this?
The functional need for this is that, elsewhere, I am using a util method that takes an InputStream and streams out the bytes from it.
I haven't seen other people looking for something like this. Am I mistaken about something?
I have started writing a custom class for this, but would stop if there is a better/proper/right solution/correction to my need.
The Java built-in libraries assume you'd only need to go from chars to bytes in output, not input. The Apache Commons IO libraries have ReaderInputStream, however, which can wrap a StringReader to get what you want.
For me there is a fundamental problem. Why do you have such huge Strings in memory in the first place...
Anyway, you can try this:
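// Needs: java.io.ByteArrayInputStream, java.io.InputStream, java.nio.ByteBuffer,
// java.nio.CharBuffer and java.nio.charset.*
public static InputStream largeStringToBytes(final String tooLarge,
    final Charset charset) throws CharacterCodingException
{
    final CharsetEncoder encoder = charset.newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    // Use the configured encoder; charset.encode() would silently replace
    // unmappable characters instead of reporting them.
    final ByteBuffer buf = encoder.encode(CharBuffer.wrap(tooLarge));
    // The backing array may be longer than the encoded content,
    // so pass the valid range explicitly.
    return new ByteArrayInputStream(buf.array(),
        buf.arrayOffset() + buf.position(), buf.remaining());
}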
If you are passing the large string as a parameter, then its memory is already allocated: in Java a String always lives on the heap, and only a reference to it is passed. The only way I can see to avoid this would be to create a tree on disk from which you stream back one character at a time as you walk the tree. If you have multiple large strings, perhaps you can index them in a Trie or a DAWG and walk that structure. This would eliminate many of the duplicate characters between strings. But I would need to know more about what the strings represent to assist further.
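Implement your own String-backed input stream:
// Needs: java.io.IOException, java.io.InputStream
class StringifiedInputStream extends InputStream {
    private int idx = 0;
    private final String str;
    private final int len;

    StringifiedInputStream(String str) {
        this.str = str;
        this.len = str.length();
    }

    @Override
    public int read() throws IOException {
        if (idx >= len)
            return -1;
        // Mask to 0..255: a (byte) cast could return -1 mid-stream and look like EOF.
        // Only the low 8 bits of each char survive, so this is correct
        // for ISO-8859-1 text only.
        return str.charAt(idx++) & 0xFF;
    }
}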
This is slow, but it streams the bytes without duplicating the string into a byte array. Override read(byte[], int, int) as well if speed is an issue.
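The String-backed stream above treats each char as a single byte. A charset-aware variant that encodes one small block at a time, as the question asks for, could look roughly like this. EncodingStringInputStream is a hypothetical name, the 8192-byte buffer size is arbitrary, and replacing malformed or unmappable characters (as String.getBytes does) is an assumption.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical stream that encodes a String lazily through a small fixed buffer.
class EncodingStringInputStream extends InputStream {
    private final CharBuffer in;
    private final CharsetEncoder encoder;
    private final ByteBuffer out = ByteBuffer.allocate(8192);
    private boolean finished; // true once the encoder has been fully flushed

    EncodingStringInputStream(String s, Charset charset) {
        this.in = CharBuffer.wrap(s); // view over the String, no copy
        this.encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        out.flip(); // start with an empty, readable buffer
    }

    /** Refills the byte buffer if it is empty; returns false when no bytes are left. */
    private boolean fill() {
        while (!out.hasRemaining() && !finished) {
            out.clear();
            CoderResult r = encoder.encode(in, out, true);
            if (r.isUnderflow()) {
                // All chars consumed; flush any trailing bytes the encoder holds.
                r = encoder.flush(out);
                if (r.isUnderflow()) finished = true;
            }
            out.flip();
        }
        return out.hasRemaining();
    }

    @Override
    public int read() {
        return fill() ? (out.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] b, int off, int len) {
        if (len == 0) return 0;
        if (!fill()) return -1;
        int n = Math.min(len, out.remaining());
        out.get(b, off, n);
        return n;
    }
}
```

Peak extra memory stays at the size of the small buffer regardless of the String's length, at the cost of a little bookkeeping on each read.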

How to convert "java.nio.HeapByteBuffer" to String

I have a data structure java.nio.HeapByteBuffer[pos=71098 lim=71102 cap=94870], which I need to convert into an Int (in Scala). The conversion might look simple, but whichever approach I try, I do not get the right result. Could you please help me?
Here is my code snippet:
val v : ByteBuffer= map.get("company").get
val utf_str = new String(v, java.nio.charset.StandardCharsets.UTF_8)
println (utf_str)
the output is just "R" ??
I can't see how you can even get that to compile. String has constructors that accept another String or an array, but not a ByteBuffer or any of its parents.
To work with the nio buffer API you first write to a buffer, then do a flip before you read from the buffer. There are lots of good resources online about that, this one for example: http://tutorials.jenkov.com/java-nio/buffers.html
How to read that as a string depends entirely on how the characters are encoded inside the buffer. If they are two bytes per character (as chars are in Java/the JVM), you can convert your buffer to a character buffer by using asCharBuffer.
So, for example:
val byteBuffer = ByteBuffer.allocate(7).order(ByteOrder.BIG_ENDIAN);
byteBuffer.putChar('H').putChar('i').putChar('!')
byteBuffer.flip()
val charBuffer = byteBuffer.asCharBuffer
assert(charBuffer.toString == "Hi!")
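If the buffer instead holds UTF-8 text, or a raw 4-byte integer (the buffer in the question has exactly four bytes between pos and lim), the reads could be sketched as follows. Which of the two applies is an assumption, since the question does not say how the data was written. BufferValues is a made-up helper, written in Java; the same calls work from Scala.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BufferValues {
    /** Decodes the bytes between position and limit as UTF-8 without moving the position. */
    static String asUtf8(ByteBuffer buf) {
        // duplicate() shares the content but has its own position and limit.
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    /** Reads a 4-byte int at the current position, in the buffer's byte order, without moving it. */
    static int asInt(ByteBuffer buf) {
        return buf.getInt(buf.position()); // absolute get: the position is unchanged
    }
}
```

If the four bytes are really the digits of a number stored as text, Integer.parseInt(BufferValues.asUtf8(buf)) would be the conversion instead.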

Understanding Java character encodings [duplicate]

What's the difference between
"hello world".getBytes("UTF-8");
and
Charset.forName("UTF-8").encode("hello world").array();
?
The second code produces a byte array with 0-bytes at the end in most cases.
Your second snippet uses ByteBuffer.array(), which just returns the array backing the ByteBuffer. That may well be longer than the content written to the ByteBuffer.
Basically, I would use the first approach if you want a byte[] from a String :) You could use other ways of dealing with the ByteBuffer to convert it to a byte[], but given that String.getBytes(Charset) is available and convenient, I'd just use that...
Sample code to retrieve the bytes from a ByteBuffer:
ByteBuffer buffer = Charset.forName("UTF-8").encode("hello world");
byte[] array = new byte[buffer.remaining()]; // bytes between position and limit
buffer.get(array);
System.out.println(array.length); // 11
System.out.println(array[0]); // 104 (encoded 'h')

Compact a string into a smaller string?

This may sound foolish, but I'm wondering all the same...
Is it possible to take a string composed of a given character set and compress it by using a bigger character set, or by composing it into a number and then converting it back?
For example, if you had a string that you know would be composed of [a-z][A-Z][0-9]-_+=, could you turn that into a number, then swap it back using more characters in order to compress it?
This is an area I'm not familiar with. I still want to keep it as a string, just a shorter one (for displaying/echoing/etc., not memory).
I wouldn't bother doing that, unless the string is huge. You can then try to compress it with commons-compress or java.util.zip
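For the java.util.zip route, a minimal round-trip sketch might look like this. StringZip is a hypothetical helper and UTF-8 is assumed. The result is a byte[], not a String; turning it back into printable text (for example with Base64) adds roughly a third to its size, so this only pays off for long, repetitive input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class StringZip {
    /** Compresses the UTF-8 bytes of s with GZIP. */
    static byte[] compress(String s) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } // closing the stream writes the GZIP trailer
        return bytes.toByteArray();
    }

    /** Reverses compress(): inflates the data and decodes it as UTF-8. */
    static String decompress(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                bytes.write(chunk, 0, n);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```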
A String internally keeps an array of 16-bit characters (at least before Java 9 introduced compact strings), which for Western European languages is a waste. You can convert to UTF-8, which should give you roughly a 50% reduction, by doing
String myString = .....
byte[] data = myString.getBytes(StandardCharsets.UTF_8); // needs java.nio.charset.StandardCharsets
and hold onto it as a byte array.
Of course this is rather inconvenient if you actually want to use them as Strings, but if the point is long-term storage without much access, this would save you a bunch.
You would have to do the reverse to recreate a String.
String is a core built-in type (though not a primitive). You are unlikely to regain any space by converting it unless you use Java's zip library, and even that will not yield the performance benefits you are presumably seeking.
