In Java, how to copy data from String to char[]/byte[] efficiently?

I need to copy the content of many large, distinct Strings into a static char array and use that array frequently in a performance-critical job, so it's important to avoid allocating too much new space.
For that reason, str.toCharArray() is ruled out, since it allocates a new array for every String.
As we all know, charAt(i) is slower and more cumbersome than indexing with square brackets [i]. So I want to use a byte[] or char[].
The good news is that there's str.getBytes(srcBegin, srcEnd, dst, dstBegin). The bad news is that it was (or is to be?) deprecated.
So how can we accomplish this?

I believe you want getChars(int, int, char[], int). That will copy the characters into the specified array, and I'd expect it to do it "as efficiently as reasonably possible".
You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.
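As a minimal sketch of that suggestion: the class below keeps one shared char[] and refills it with getChars for each incoming String, growing it only when a longer String shows up. The class name, the initial size of 1024 and the growth policy are illustrative assumptions, not something from the question.

```java
public class ReusableCharBuffer {
    // One shared buffer; it is replaced only when a longer string arrives.
    private static char[] buffer = new char[1024];

    /** Copies s into the shared buffer and returns the number of valid chars. */
    static int load(String s) {
        int len = s.length();
        if (len > buffer.length) {
            // Grow geometrically so that reallocation stays rare.
            buffer = new char[Math.max(len, buffer.length * 2)];
        }
        s.getChars(0, len, buffer, 0); // copies straight into the existing array
        return len;
    }

    static char[] buffer() {
        return buffer;
    }

    public static void main(String[] args) {
        int n = load("hello");
        int vowels = 0;
        for (int i = 0; i < n; i++) { // plain [i] access on the reused array
            char c = buffer[i];
            if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') vowels++;
        }
        System.out.println(vowels); // prints 2
    }
}
```

Only the first n entries are valid after each load; anything beyond that is left over from an earlier, longer String.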

A small stocktaking:
String holds Unicode text; it can be normalized with java.text.Normalizer.
int[] can hold Unicode code points, one per element.
char[] holds UTF-16 code units (2 bytes per char); some code points need two chars, a surrogate pair.
byte[] is for binary data. Holding Unicode text as UTF-8 is relatively compact when much of it is ASCII or Latin-1.
Processing might be done on a ByteBuffer, CharBuffer, IntBuffer.
When dealing with Asian scripts, int code points are probably the most practical.
Otherwise bytes seem best.
Code points (or chars) also make sense when the Character class is utilized for classification of Unicode blocks and scripts, digits in several scripts, emoji, whatever.
Performance will usually be best with bytes, as they are often the most compact representation; probably UTF-8.
You cannot fully avoid memory allocation. getBytes should be used with a Charset, and almost always some kind of conversion happens. Newer Java versions can store a String internally as a byte array instead of a char array when the text fits an encoding like Latin-1 (ISO-8859-1), so even getting hold of the internal array would not help. New arrays are created either way.
What one can do is use fast ByteBuffers.
Alternatively, for linguistic analysis one can use databases, maybe graph databases; at least something that can exploit parallelism.
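One way the ByteBuffer idea could look, as a sketch only: a CharsetEncoder writes the String's UTF-8 bytes into a ByteBuffer that is allocated once and reused. The class name and the fixed capacity are assumptions for illustration; CharBuffer.wrap creates a small view object per call rather than copying the text, and whether this beats getBytes(Charset) in practice is something to measure.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class ReusableUtf8Encoder {
    private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    private final ByteBuffer out; // allocated once, reused for every string

    ReusableUtf8Encoder(int capacityBytes) {
        out = ByteBuffer.allocate(capacityBytes);
    }

    /** Encodes s into the reused buffer and returns it, ready for reading. */
    ByteBuffer encode(String s) {
        out.clear();
        encoder.reset();
        // wrap() makes a read-only view over the String, not a copy of its chars.
        CoderResult result = encoder.encode(CharBuffer.wrap(s), out, true);
        if (result.isUnderflow()) {
            result = encoder.flush(out);
        }
        if (!result.isUnderflow()) {
            // Overflow (buffer too small) or malformed input such as a lone surrogate.
            throw new IllegalStateException("encoding failed: " + result);
        }
        out.flip();
        return out;
    }
}
```

The returned buffer is only valid until the next encode call, so the caller has to finish with it before encoding the next String.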

You are pretty much restricted to the APIs offered by the String class, and obviously that deprecated method is supposed to be replaced with getBytes() (or an overload that lets you specify a charset).
In other words: the problem you are describing, "having many large strings that need to go into arrays", can't be solved easily.
Thus a distinct non-answer: look into your design. If performance is really critical, then do not create those many large strings upfront!
In other words: if your measurements convince you that you do have a real performance issue, then adapt your design as needed. Maybe, at the place where your strings are "coming in", you can already avoid String objects and use something that works better for you later on, performance-wise.
But of course: that will lead to a complex, error prone solution, where you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and it actually sits in the place you think it sits.

str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because sometimes you don't have to convert the entire string I suppose you could substring() first, but I'm not sure how badly that would impact your code's efficiency, if at all. Or if it's all the same to you if you store it in char[] then you can use getChars(int,int,char[],int) which is not deprecated.

Related

From InputStream to List<String>: why is Java allocating space twice in the JVM?

I am currently trying to process a large txt file (a bit less than 2GB) containing lines of strings.
I am loading all its content from an InputStream to a List<String>. I do that via the following snippet :
try (BufferedReader reader = new BufferedReader(new InputStreamReader(zipInputStream))) {
    List<String> data = reader.lines()
                              .collect(Collectors.toList());
}
The problem is that the file itself is less than 2 GB, but when I look at the memory, the JVM is allocating twice the size of the file.
I also looked at the heaviest objects in memory.
So what I understand is that Java is allocating twice the memory needed for the operation: once to put the content of the file in a byte array and once to instantiate the string list.
My question is: can we optimize that and avoid using twice the memory needed?

tl;dr String objects can take 2 bytes per character.
The long answer: conceptually a String is a sequence of char. Each char will represent one Codepoint (or half of one, but we can ignore that detail for now).
Each codepoint tends to represent a character (sometimes multiple codepoints make up one "character", but that's another detail we can ignore for this answer).
That means that if you read a 2 GB text file that was stored with a single-byte encoding (usually a member of the ISO-8859-* family) or variable-byte encoding (mostly UTF-8), then the size in memory in Java can easily be twice the size on disk.
Now there are a good number of caveats to this, primarily that Java can (as an internal, invisible operation) use a single byte for each character in a String if and only if the characters used allow it (effectively, if they fit into the fixed internal encoding that the JVM picked for this). But that didn't seem to happen for you.
What can you do to avoid that? That depends on what your use-case is:
Don't use String to store the data in the first place. Odds are that this data is actually representing some structure, and if you parse it into a dedicated format, you might get away with way less memory usage.
Don't keep the whole thing in memory: more often than not, you don't actually need everything in memory at once. Instead, process and write away the data as you read it, thus never having more than a handful of records in memory at once.
Build your own string-like data type for your specific use-case. While building a full string replacement is a massive undertaking, if you know what subset of features you need it might actually be a quite surmountable challenge.
Try to make sure that the data is stored as compact strings, if possible, by figuring out why that's not already happening (this requires digging deep into the details of your JVM).
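To illustrate the "don't keep the whole thing in memory" option from the list above, here is a minimal sketch that handles one line at a time instead of collecting a List<String>. The class name is made up, the counting is a stand-in for whatever per-record work you really need, and UTF-8 is an assumed file encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamingLines {
    /** Counts non-empty lines while holding only the current line in memory. */
    static long countNonEmptyLines(InputStream in) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) {
                    count++; // stand-in for real per-record processing
                }
                // line becomes garbage here, so the heap never holds the whole file
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, peak memory depends on the longest line and the reader's buffer, not on the size of the file.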

What is the drawback for using Strings for non-String specific data?

I know this might be a kind of "silly" question. I have created software applications before where I initialized basically all of my variables as strings, and saved them in my database as VARCHARs. Then, I would gather them from the database and convert them as needed. Is there any reason this is not an efficient method for initializing variables and saving them in my database?
I know that for extremely large applications, this can cause an issue with computing time, because I am unnecessarily converting variables that could have been initialized as the appropriate type to begin with. But, for smaller applications, is this "okay" to do?

Some reasons to use proper types
1. Least surprise. If developers are going to grab numerical data from your database, they would find it weird that you're storing them as strings.
2. Developer convenience. Another is the nuisance of having to parse the data into the correct type every time. If you just store it as the correct type, then you would save people the trouble of having to put
int age = 0;
try {
    age = Integer.parseInt(ageStr);
} catch (NumberFormatException e) {
    throw new RuntimeException(e);
}
all over the code.
3. Data quality. The code example above hints at a third problem. Now it's possible for somebody to store "no_age" or "foo" or something in the column, which is a data quality issue. The best way to deal with errors is to make them impossible in the first place.
4. Storage efficiency. Storage efficiency is a factor as well. Different types have different ways of encoding data, and strings are not an efficient way to store numbers, bits, etc.
5. Network efficiency. If you store data in wasteful formats, then that often translates to unnecessary network utilization. This is why binary formats are generally more efficient than text formats like JSON or XML. But web services don't typically treat network efficiency as the driving engineering concern.
6. Processing efficiency. If the data is inherently numeric, then forcing everybody to parse it incurs processing cost.
7. Different types support different rules. In his answer, Hightower makes the good point that different types have special rules for ordering, which impacts ranges and sorts. I like this point because it impacts actual program behavior, whereas the concerns I mention above might be more academic for small apps with a single developer.
An example illustrating the efficiency benefit
Suppose you want to store eight bits. If you were to store that as a string you might have "TFFTFFTF", which under UTF-8 and ASCII would take 64 bits (8 chars x 8 bits per char) to store eight bits of actual information. Relatively speaking that's a big difference.
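A small sketch of that comparison, with a hypothetical pack helper that turns the eight-flag string into a single byte (first flag in the highest bit, which is just one possible convention):

```java
import java.nio.charset.StandardCharsets;

public class BitStorage {
    /** Packs exactly eight 'T'/'F' flags into one byte, first flag in the high bit. */
    static byte pack(String flags) {
        if (flags.length() != 8) {
            throw new IllegalArgumentException("need exactly 8 flags");
        }
        int bits = 0;
        for (int i = 0; i < 8; i++) {
            char c = flags.charAt(i);
            if (c != 'T' && c != 'F') {
                throw new IllegalArgumentException("bad flag: " + c);
            }
            bits = (bits << 1) | (c == 'T' ? 1 : 0);
        }
        return (byte) bits;
    }

    public static void main(String[] args) {
        String flags = "TFFTFFTF";
        // Text form: 8 characters at 8 bits each in UTF-8.
        System.out.println(flags.getBytes(StandardCharsets.UTF_8).length * 8); // prints 64
        // Packed form: the same information in 8 bits.
        System.out.println(Integer.toBinaryString(pack(flags) & 0xFF)); // prints 10010010
    }
}
```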
Incidentally, even if your data is numeric, it's not good to just use BIGINT, for example. The different types of integer in a database have different storage requirements and so you should think about the number of bits you actually need, use unsigned representations if appropriate (no reason to waste a sign bit on numbers that can't be negative), etc. Wrong choices tend to add up quickly as you create new foreign keys that have to be BIGINTs now, new rows that all have a bunch of BIGINTs, etc. Your storage and backup requirements end up being needlessly demanding.
So. Is it "OK" to use strings?
These efficiency concerns may not matter at all for something small, which is what you were asking. Or there may be reasons to prefer an inefficient format over one that's more efficient, as my JSON/XML example above suggests. So as far as whether it's "OK", I can't answer that, but hopefully the considerations above give you some tools to make that decision yourself.
Still I'd try to get into the habit of using the right type, and I certainly wouldn't go out of my way to store things as strings without some reason. In bitset cases I could see potentially avoiding having to deal with bit manipulation, which can be tricky til you get the hang of it. (But some databases have special bitset types.) You mention not knowing the type and maybe that's a plausible reason in some cases, though I would lean more on refactoring here.

There are some reasons. For example, think about searching for a time range. This is easy using datetime fields, but not with strings, because you would have to do it in your application.
Another point is that sorting on a varchar behaves differently from sorting on an int field. As a varchar, 10 comes before 2; as an int, it comes after.
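The same ordering effect can be reproduced in plain Java, which may make it easier to see. This is only an in-memory analogy (class and method names are made up); a database sorts according to its own collation, but digit strings compare character by character there too.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortOrderDemo {
    /** Sorts the values as text, the way a string column would be compared. */
    static List<String> sortedAsText(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        Collections.sort(copy); // lexicographic: compares character by character
        return copy;
    }

    /** Parses the values first and sorts them numerically. */
    static List<Integer> sortedAsNumbers(List<String> values) {
        List<Integer> numbers = new ArrayList<>();
        for (String v : values) {
            numbers.add(Integer.parseInt(v));
        }
        Collections.sort(numbers);
        return numbers;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("10", "2", "1");
        System.out.println(sortedAsText(raw));    // prints [1, 10, 2]
        System.out.println(sortedAsNumbers(raw)); // prints [1, 2, 10]
    }
}
```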

Performance of HashMap

I have to process 450 unique strings about 500 million times. Each string has unique integer identifier. There are two options for me to use.
1. I can append the identifier with the string and on arrival of the string I can split the string to get the identifier and use it.
2. I can store the 450 strings in HashMap<String, Integer> and on arrival of the string, I can query HashMap to get the identifier.
Can someone suggest which option will be more efficient in terms of processing?

It all depends on the sizes of the strings, etc.
You can do all sorts of things.
You can use a binary search to get the index in a list, and at that index is the identifier.
You can hash just the first 2 characters, rather than the entire string, that would likely be faster than the binary search, assuming the strings have an OK distribution.
You can use the first character, or first two characters, if they're unique, as a "perfect index" into a 256- or 64K-entry array that points to the identifier.
Also, if your identifier is numeric, it's better to pre-calculate that, rather than convert it on the fly all the time. Text -> Binary is actually rather expensive (Binary -> Text is worse). So it's probably nice to avoid that if possible.
But it behooves you to work the problem. 1 million anything at 1 ms each is roughly 17 minutes of processing. At 500m, every microsecond wasted per item adds up to 8+ minutes extra of processing. You may well not care, but just demonstrating that at these scales "every little bit helps".
So, don't take our words for it, test different things to find what gives you the best result for your work set, and then go with that. Also consider excessive object creation, and avoiding that. Normally, I don't give it a second thought. Object creation is fast, but a nano-second is a nano-second.
If you're working in Java, and you don't REALLY need Unicode (i.e. you're working with single characters of the 0-255 range), I wouldn't use strings at all. I'd work with raw bytes. Strings are based on Java characters, which are UTF-16. Java Readers convert UTF-8 into UTF-16 every. single. time. 500 million times. Yup! Another few nano-seconds per string. At this scale each nano-second per item is half a second in total, and 8 microseconds per item adds over an hour to your processing.
So, again, look in all the corners.
Or, don't, write it easy, fire it up, run it over the weekend and be done with it.

If each String has a unique identifier then retrieval is O(1) only in case of hashmaps.
I wouldn't suggest the first method because you are splitting every string for 450*500m, unless your order is one string 500m times and then on to the next. As Will said, appending a number to the strings and then retrieving it might seem straightforward but is not recommended.
So if your data is static (just the 450 strings), put them in a HashMap and experiment with it. Good luck.

Use HashMap<Integer, String>. Splitting a string to get the identifier is an expensive operation because it involves creating new Strings.

I don't think anyone is going to be able to give you a convincing "right" answer, especially since you haven't provided all of the background / properties of the computation. (For example, the average length of the strings could make a lot of difference.)
So I think your best bet would be to write a benchmark ... using the actual strings that you are going to be processing.
I'd also look for a way to extract and test the "unique integer identifier" that doesn't entail splitting the string.
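For what it's worth, here is a sketch of both options side by side, so that they can be dropped into a benchmark. It assumes a made-up wire format in which the identifier follows a '#' at the end of the string ("alpha#42"); the question doesn't say how the identifier would be appended. The parser walks the digits in place, so no substrings are created, and it does no overflow checking.

```java
import java.util.HashMap;
import java.util.Map;

public class IdLookup {
    /** Option 1: read the digits after the last '#' without creating substrings. */
    static int parseTrailingId(String s) {
        int sep = s.lastIndexOf('#');
        if (sep < 0 || sep == s.length() - 1) {
            throw new IllegalArgumentException("no id in: " + s);
        }
        int id = 0;
        for (int i = sep + 1; i < s.length(); i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new IllegalArgumentException("bad digit in: " + s);
            }
            id = id * 10 + digit; // no overflow check in this sketch
        }
        return id;
    }

    /** Option 2: build the string-to-identifier table once, then look up on arrival. */
    static Map<String, Integer> buildTable(String[] strings, int[] ids) {
        Map<String, Integer> table = new HashMap<>(strings.length * 2);
        for (int i = 0; i < strings.length; i++) {
            table.put(strings[i], ids[i]);
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(parseTrailingId("alpha#42")); // prints 42
        Map<String, Integer> table =
            buildTable(new String[] {"alpha", "beta"}, new int[] {42, 7});
        System.out.println(table.get("beta")); // prints 7
    }
}
```

Which of the two is faster depends on the string lengths and on how the strings arrive, so time both with your real data rather than trusting either guess.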

Splitting the string should work faster if you write your code well enough. In fact if you already have the int-id, I see no reason to send only the string and maintain a mapping.
Putting into HashMap would need hashing the incoming string every time. So you are basically comparing the performance of the hashing function vs the code you write to append (prepending might be a bit more tricky) on sending end and to parse on receiving end.
OTOH, only 450 strings aren't a big deal, and if you're into it, writing your own hashing algo/function would actually be the most elegant and performant.

What's the fastest way in java to insert characters into a string?

I'm writing a routine that takes a string and formats it as quoted printable. And it's got to be as fast as possible. My first attempt copied characters from one stringbuffer to another encoding and line wrapping along the way. Then I thought it might be quicker to just modify the original stringbuffer rather than copy all that data which is mostly identical. Turns out the inserts are far worse than copying, the second version (with the stringbuffer inserts) was 8 times slower, which makes sense, as it must be moving a lot of memory.
What I was hoping for was some kind of gap buffer data structure so the inserts wouldn't involve physically moving all the characters in the rest of the stringbuffer.
So any suggestions about the fastest way to rip through a string inserting characters every once in a while?
Suggestions to use the standard mimeutils library are not helpful because I'm also dot escaping the string so it can be dumped out to an smtp server in one shot.

At the end, your gap data structure would have to be transformed into a String, which would need assembling all the chunks in a single array by appending them to a StringBuilder.
So using a StringBuilder directly will be faster. I don't think you'll find a faster technique than that. Make sure to initialize the StringBuilder with a large enough size to avoid copies of the whole buffer once the capacity is exhausted.
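A rough sketch of that single-pass approach with a presized StringBuilder. This is a deliberately simplified encoder, not full quoted-printable: it only escapes '=' as "=3D" and inserts soft line breaks ("=" followed by CRLF), and it assumes the input contains no line breaks of its own and that maxLine is at least 4. The class name and the capacity estimate are illustrative.

```java
public class SinglePassEncoder {
    /** Escapes '=' and soft-wraps so that no output line exceeds maxLine chars. */
    static String encode(String in, int maxLine) {
        // Presize with some slack so the builder rarely has to grow and copy.
        StringBuilder out = new StringBuilder(in.length() + in.length() / 8 + 16);
        int col = 0; // characters already written on the current output line
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            int width = (c == '=') ? 3 : 1;
            if (col + width > maxLine - 1) { // keep one column for the soft-break '='
                out.append("=\r\n");
                col = 0;
            }
            if (c == '=') {
                out.append("=3D");
            } else {
                out.append(c);
            }
            col += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("a=b", 76)); // prints a=3Db
    }
}
```

Every character is appended exactly once and nothing is ever shifted, which is what makes this cheaper than inserting into the original buffer.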

So taking the advice of some of the other answers here I've been writing many versions of this function, seeing what goes quickest and for future reference if anybody can gain from what I found:
1) The slowest: stringbuffer.append() but we knew that.
2) Almost twice as fast: stringbuilder.append(). locks are very expensive it seems.
3) another 20% faster is.... copying from one char[] to another.
4) and finally, coming in three times faster than even that... a JNI call to the exact same code compiled in C that copies from one char array to another.
You may consider #4 cheating, but cheaters win. It is by far the fastest way to go.
There is a risk of the GetCharArrayElements call causing the java char array to be copied so it can be handed to the C program, but I can't tell if that's happening, and it's still wicked fast compared to any java implementation.

I think a good balance between speed and coding grace would be using Matcher.appendReplacement. Formulate a regex that will catch all insertion points. In a loop you use find, analyze Matcher.group() to see what exactly has matched, and use your program logic to decide what to give to appendReplacement.
In any case, it is important not to copy the text over char by char. You must copy in the largest chunks possible.
The Matcher API is quite unfortunately bound to the StringBuffer, but, as you find, that only steals the final 5% from you.
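Roughly what that loop could look like. The pattern here is an assumption for illustration: it catches '=' and characters in the 0x7F-0xFF range and replaces each with "=XX", and it assumes the input contains nothing above 0xFF. The text between matches is copied in whole chunks by appendReplacement and appendTail.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexEscaper {
    // Insertion points for this sketch: '=' and anything from 0x7F to 0xFF.
    private static final Pattern NEEDS_ESCAPE = Pattern.compile("[=\\x7F-\\xFF]");

    static String escape(String in) {
        Matcher m = NEEDS_ESCAPE.matcher(in);
        StringBuffer out = new StringBuffer(in.length() + 16);
        while (m.find()) {
            char c = m.group().charAt(0);
            String hex = String.format("=%02X", (int) c);
            // quoteReplacement guards against '$' and '\' being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(hex));
        }
        m.appendTail(out); // copies whatever follows the last match in one go
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a=b")); // prints a=3Db
    }
}
```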

Compact a string into a smaller string?

This may sound foolish, but I'm wondering all the same...
Is it possible to take a string composed of a given character set and compress it by using a bigger character set, or by composing it into a number and then converting it back?
For example, if you had a string that you know would be composed of [a-z][A-Z][0-9]-_+=, could you turn that into a number, then swap it back using more characters in order to compress it?
This is an area I'm not familiar with. I still want to keep it as a string, just a shorter one (for displaying/echoing/etc., not memory).

I wouldn't bother doing that, unless the string is huge. You can then try to compress it with commons-compress or java.util.zip.
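For the java.util.zip route, a minimal round-trip sketch using Deflater and Inflater (class and method names are made up). Note that the result is a byte[], not a displayable string, and that short inputs can come out larger than they went in because of the format's overhead.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZipString {
    static byte[] compress(String s) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(s.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end(); // release native resources
        return out.toByteArray();
    }

    static String decompress(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[256];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```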

A String internally keeps an array of 16-bit characters, which for Western European languages is a waste. You can convert to UTF-8, which should give you about a 50% reduction for mostly-ASCII text, by doing
String myString = .....
// needs: import java.nio.charset.StandardCharsets;
byte[] data = myString.getBytes(StandardCharsets.UTF_8);
and hold onto it as a byte array.
Of course this is rather inconvenient if you actually want to use them as Strings, but if the point is long-term storage, without much access, this would save you a bunch.
You would have to do the reverse to recreate a String.
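A short sketch of both directions, with made-up helper names. One caveat worth knowing: since Java 9, the JVM can already store Latin-1-only Strings with one byte per character (compact strings), so on newer JVMs the saving may be smaller than the 50% figure suggests.

```java
import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    /** String to UTF-8 bytes for long-term storage. */
    static byte[] pack(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** The reverse step: recreate the String from the stored bytes. */
    static String unpack(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "hello world";
        byte[] stored = pack(text);
        System.out.println(text.length() * 2); // prints 22 (bytes as UTF-16 code units)
        System.out.println(stored.length);     // prints 11 (bytes as UTF-8)
        System.out.println(unpack(stored).equals(text)); // prints true
    }
}
```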

String is a core built-in type (strictly speaking a class, not a primitive). You are unlikely to regain any space by converting unless you use Java's zip library, and even that will not yield the performance benefits you are presumably seeking.
