Strings transcoding in Java

Strings transcoding in Java - java

I've found a piece of code recently, which does the following:
String s = ... // whatever
...
s = new String(s.getBytes(myEncoding), myEncoding);
For me it appears to be absolutely non-sense.
Is it possible that under certain circumstances (some specific combination of locale settings, used technologies, etc.), this code will do something useful?
Thanks in advance

yes, that code is generally nonsense. yes, it's possible that that code could be doing "something" to the string (probably not anything good). generally speaking, if you have already incorrectly converted bytes to chars, trying to re-convert is rarely going to give you legitimate results. (there may be isolated instances where the right combination of character encodings may work).

Related

Java string concatenation optimisation is applied in this case?

Let's imagine I have a lib which contains the following simple method:
private static final String CONSTANT = "Constant";
public static String concatStringWithCondition(String condition) {
return "Some phrase" + condition + CONSTANT;
}
What if someone wants to use my method in a loop? As I understand, that string optimisation (where + gets replaced with StringBuilder or whatever is more optimal) is not working for that case? Or this is valid for strings initialised outside of the loop?
I'm using java 11 (Dropwizard).
Thanks.

No, this is fine.
The only case that string concatenation can be problematic is when you're using a loop to build one single string. Your method by itself is fine. Callers of your method can, of course, mess things up, but not in a way that's related to your method.

The code as written should be as efficient as making a StringBuilder and appending these 3 constants to it. There certainly is absolutely no difference at all between a literal ("Some phrase"), and an expression that the compiler can treat as a Compile Time Constant (which CONSTANT, here, clearly is - given that CONSTANT is static, final, not null, and of a CTCable type (All primitives and strings)).
However, is that 'efficient'? I doubt it - making a stringbuilder is not particularly cheap either. It's orders of magnitude cheaper than continually making new strings, sure, but there's always a bigger fish:
It doesn't matter
Computers are fast. Really, really fast. It is highly likely that you can write this incredibly badly (performance wise) and it still won't be measurable. You won't even notice. Less than a millisecond slower.
In general, anybody that worries about performance at this level simply lacks perspective and knowledge: If you apply that level of fretting to your java code and you have the knowledge to know what could in theory be non-perfectly-performant, you'll be sweating every 3rd character you ever type. That's no way to program. So, gain that perspective (or take it from me, "just git gud" is not exactly something you can do in a week - take it on faith for now, as you learn you can start verifying) - and don't worry about it. Unless you actually run into an actual situation where the code is slower than it feels like it could be, or slower than it needs to be, and then toss profilers and microbenchmark testing frameworks at it, and THEN, armed with all that information (and not before!), consider optimizing. The reports tell you what to optimize, because literally less than 1% of the code is responsible for 99% of the performance loss, so spending any time on code that isn't in that 1% is an utter waste of time, hence why you must get those reports first, or not start at all.
... or perhaps it does
But if it does matter, and it's really that 1% of the code that is responsible for 99% of the loss, then usually you need to go a little further than just 'optimize the method'. Optimize the entire pipeline.
What is happening with this string? Take that into consideration.
For example, let's say that it, itself, is being appended to a much bigger stringbuilder. In which case, making a tiny stringbuilder here is incredibly inefficient compared to rewriting the method to:
public static void concatStringWithCondition(StringBuilder sb, String condition) {
sb.append("Some phrase").append(condition).append(CONSTANT);
}
Or, perhaps this data is being turned into bytes using UTF_8 and then tossed onto a web socket. In that case:
private static final byte[] PREFIX = "Some phrase".getBytes(StandardCharsets.UTF_8);
private static final byte[] SUFFIX = "Some Constant".getBytes(StandardCharsets.UTF_8);
public void concatStringWithCondition(OutputStream out, String condition) {
out.write(PREFIX);
out.write(condition.getBytes(StandardCharsets.UTF_8));
out.write(SUFFIX);
}
and check if that outputstream is buffered. If not, make it buffered, that'll help a ton and would completely dwarf the cost of not using string concatenation. If the 'condition' string can get quite large, the above is no good either, you want a CharsetEncoder that encodes straight to the OutputStream, and may even want to replace all that with some ByteBuffer based approach.
Conclusion
Assume performance is never relevant until it is.
IF performance truly must be tackled, strap in, it'll take ages to do it right. Doing it 'wrong' (applying dumb rules of thumb that do not work) isn't useful. Either do it right, or don't do it.
IF you're still on bard, always start with profiler reports and use JMH to gather information.
Be prepared to rewrite the pipeline - change the method signatures, in order to optimize.
That means that micro-optimizing, which usually sacrifices nice abstracted APIs, is actively bad for performance - because changing pipelines is considerably more difficult if all code is micro-optimized, given that this usually comes at the cost of abstraction.
And now the circle is complete: Point 5 shows why the worrying about performance as you are doing in this question is in fact detrimental: It is far too likely that this worry results in you 'optimizing' some code in a way that doesn't actually run faster (because the JVM is a complex beast), and even if it did, it is irrelevant because the code path this code is on is literally only 0.01% or less of the total runtime expenditure, and in the mean time you've made your APIs worse and lack abstraction which would make any actually useful optimization much harder than it needs to be.
But I really want rules of thumb!
Allright, fine. Here are 2 easy rules of thumb to follow that will lead to better performance:
When in rome...
The JVM is an optimising marvel and will run the craziest code quite quickly anyway. However, it does this primarily by being a giant pattern matching machine: It finds recognizable code snippets and rewrites these to the fastest, most carefully tuned to juuust your combination of hardware machine code it can. However, this pattern machine isn't voodoo magic: It's got limited patterns. Which patterns do JVM makers 'ship' with their JVMs? Why, the common patterns, of course. Why include a pattern for exotic code virtually nobody ever writes? Waste of space.
So, write code the way java programmers tend to write it. Which very much means: Do not write crazy code just because you think it might be faster. It'll likely be slower. Just follow the crowd.
Trivial example:
Which one is faster:
List<String> list = new ArrayList<String>();
for (int i = 0; i < 10000; i++) list.add(someRandomName());
// option 1:
String[] arr = list.toArray(new String[list.size()]);
// option 2:
String[] arr = list.toArray(new String[0]);
You might think, obviously, option 1, right? Option 2 'wastes' a string array, making a 0-length array just to toss it in the garbage right after. But you'd be wrong: Option 2 is in fact faster (if you want an explanation: The JVM recognizes it, and does a hacky move: It makes an new string array that does not need to be initialized with all zeroes first. Normal java code cannot do this (arrays are neccessarily initialized blank, to prevent memory corruption issues), but specifically .toArray(new X[0])? Those pattern matching machines I told you about detect this and replace it with code that just blits the refs straight into a patch of memory without wasting time writing zeroes to it first.
It's a subtle difference that is highly unlikely to matter - it just highlights: Your instincts? They will mislead you every time.
Fortunately, .toArray(new X[0]) is common java code. And easier and shorter. So just write nice, convenient code that looks like how other folks write and you'd have gotten the right answer here. Without having to know such crazy esoterics as having to reason out how the JVM needs to waste time zeroing out that array and how hotspot / pattern matching might possibly eliminate this, thus making it faster. That's just one of 5 million things you'd have to know - and nobody can do that. Thus: Just write java code in simple, common styles.
Algorithmic complexity is a thing hotspot can't fix for you
Given an O(n^3) algorithm fighting an O(log(n) * n^2) algorithm, make n large enough and the second algorithm has to win, that's what big O notation means. The JVM can do a lot of magic but it can pretty much never optimize an algorithm into a faster 'class' of algorithmic complexity. You might be surprised at the size n has to be before algorithmic complexity dominates, but it is acceptable to realize that your algorithm can be fundamentally faster and do the work on rewriting it to this more efficient algorithm even without profiler reports and benchmark harnesses and the like.

In Java, how to copy data from String to char[]/byte[] efficiently?

I need to copy many big and different String strs' content to a static char array and use the array frequently in a efficiency-demanding job, thus it's important to avoid allocating too much new space.
For the reason above, str.toCharArray() was banned, since it allocates space for every String.
As we all know, charAt(i) is more slowly and more complex than using square brackets [i]. So I want to use byte[] or char[].
One good news is, there's a str.getBytes(srcBegin, srcEnd, dst, dstBegin). But the bad news is it was (or is to be?) deprecated.
So how can we finish this demanding job?

I believe you want getChars(int, int, char[], int). That will copy the characters into the specified array, and I'd expect it to do it "as efficiently as reasonably possible".
You should avoid converting between text and binary representations unless you really need to. Aside from anything else, that conversion itself is likely to be time-consuming.

A small stocktaking:
String does Unicode text; it can be normalized (java.text.Normalizer).
int[] code points are Unicode symbols
char[] is Unicode UTF-16BE (2 bytes per char), sometimes for a code point 2 chars are needed: a surrogate pair.
byte[] is for binary data. Holding Unicode text in UTF-8 is relative compact when there is much ASCII resp. Latin-1.
Processing might be done on a ByteBuffer, CharBuffer, IntBuffer.
When dealing with Asian scripts, int code points probably is most feasible.
Otherwise bytes seem best.
Code points (or chars) also make sense when the Character class is utilized for classification of Unicode blocks and scripts, digits in several scripts, emoji, whatever.
Performance would best be done in bytes as often most compact. UTF-8 probably.
One cannot efficiently deal with memory allocation. getBytes should be used with a Charset. Almost always a kind of conversion happens. As new java versions can keep a byte array instead of a char array for an encoding like Latin-1, ISO-8859-1, even using an internal char array would not do. And new arrays are created.
What one can do, is using fast ByteBuffers.
Alternatively for lingual analysis one can use databases, maybe graph databases. At least something which can exploit parallelism.

You are pretty much restricted to the APIs offered within the string class, and obviously, that deprecated method is supposed to be replaced with getBytes() (or an alternative that allows to specify a charset.
In other words: that problem you are talking about "having many large strings, that need to go into arrays" can't be solved easily.
Thus a distinct non-answer: look into your design. If performance is really critical, then do not create those many large strings upfront!
In other words: if your measurements convince you that you do have real performance issue, then adapt your design as needed. Maybe there is a chance that in the place where your strings are "coming" in ... you already do not use String objects, but something that works better for you, later on, performance wise.
But of course: that will lead to a complex, error prone solution, where you do a lot of "memory management" yourself. Thus, as said: measure first. Ensure that you have a real problem, and it actually sits in the place you think it sits.

str.getBytes(srcBegin, srcEnd, dst, dstBegin) is indeed deprecated. The relevant documentation recommends getBytes() instead. If you needed str.getBytes(srcBegin, srcEnd, dst, dstBegin) because sometimes you don't have to convert the entire string I suppose you could substring() first, but I'm not sure how badly that would impact your code's efficiency, if at all. Or if it's all the same to you if you store it in char[] then you can use getChars(int,int,char[],int) which is not deprecated.

What is the drawback for using Strings for non-String specific data?

I know this might be a kind of "silly" question. I have created software applications before where I initialized basically all of my variables as strings, and saved them in my database as VARCHARs. Then, I would gather them from the database and convert them as needed. Is there any reason this is not an efficient method for initializing variables and saving them in my database?
I know that for extremely large applications, this can cause an issue with computing time, because I am unnecessarily converting variables that could have been initialized as the appropriate type to begin with. But, for smaller applications, is this "okay" to do?

Some reasons to use proper types
1. Least surprise. If developers are going to grab numerical data from your database, they would find it weird that you're storing them as strings.
2. Developer convenience. Another is the nuisance of having to parse the data into the correct type every time. If you just store it as the correct type, then you would save people the trouble of having to put
int age = 0;
try {
age = Integer.parseInt(ageStr);
} catch (NumberFormatException e) {
throw new RuntimeException(e);
}
all over the code.
3. Data quality. The code example above hints at a third problem. Now it's possible for somebody to store "no_age" or "foo" or something in the column, which is a data quality issue. The best way to deal with errors is to make them impossible in the first place.
4. Storage efficiency. Storage efficiency is a factor as well. Different types have different ways of encoding data, and strings are not an efficient way to store numbers, bits, etc.
5. Network efficiency. If you store data in wasteful formats, then that often translates to unnecessary network utilization. This is why binary formats are generally more efficient than text formats like JSON or XML. But web services don't typically treat network efficiency as the driving engineering concern.
6. Processing efficiency. If the data is inherently numeric, then forcing everybody to parse it incurs processing cost.
7. Different types support different rules. In his answer, Hightower makes the good point that different types have special rules for ordering, which impacts ranges and sorts. I like this point because it impacts actual program behavior, whereas the concerns I mention above might be more academic for small apps with a single developer.
An example illustrating the efficiency benefit
Suppose you want to store eight bits. If you were to store that as a string you might have "TFFTFFTF", which under UTF-8 and ASCII would take 64 bits (8 chars x 8 bits per char) to store eight bits of actual information. Relatively speaking that's a big difference.
Incidentally, even if your data is numeric, it's not good to just use BIGINT, for example. The different types of integer in a database have different storage requirements and so you should think about the number of bits you actually need, use unsigned representations if appropriate (no reason to waste a sign bit on numbers that can't be negative), etc. Wrong choices tend to add up quickly as you create new foreign keys that have to be BIGINTs now, new rows that all have a bunch of BIGINTs, etc. Your storage and backup requirements end up being needlessly demanding.
So. Is it "OK" to use strings?
These efficiency concerns may not matter at all for something small, which is what you were asking. Or there may be reasons to prefer an inefficient format over one that's more efficient, as my JSON/XML example above suggests. So as far as whether it's "OK", I can't answer that, but hopefully the considerations above give you some tools to make that decision yourself.
Still I'd try to get into the habit of using the right type, and I certainly wouldn't go out of my way to store things as strings without some reason. In bitset cases I could see potentially avoiding having to deal with bit manipulation, which can be tricky til you get the hang of it. (But some databases have special bitset types.) You mention not knowing the type and maybe that's a plausible reason in some cases, though I would lean more on refactoring here.

There are some reasons. For examples, think about searching for a time range. This is easy to find using datetime fields. But not easy with strings, because you have to do it at your application.
Other point is sorting on a varchar will be different to a int type field. At varchar 10 is before 2, but at int it comes after that.

Is specifying String encoding when parsing byte[] really necessary?

Supposedly, it is "best practice" to specify the encoding when creating a String from a byte[]:
byte[] b;
String a = new String(b, "UTF-8"); // 100% safe
String b = new String(b); // safe enough
If I know my installation has default encoding of utf8, is it really necessary to specify the encoding to still be "best practice"?

Different use cases have to be distinguished here: If you get the bytes from an external source via some protocol with a specified encoding then always use the first form (with explicit encoding).
If the source of the bytes is the local machine, for example a local text file, the second form (without explicit encoding) is better.
Always keep in mind, that your program may be used on a different machine with a different platform encoding. It should work there without any changes.

If I know my installation has default encoding of utf8, is it really necessary to specify the encoding to still be "best practice"?
But do you know for sure that your installation will always have a default encoding of UTF-8? (Or at least, for as long as your code is used ...)
And do you know for sure that your code is never going to be used in a different installation that has a different default encoding?
If the answer to either of those is "No" (and unless you are prescient, it probably has to be "No") then I think that you should follow best practice ... and specify the encoding if that is what your application semantics requires:
If the requirement is to always encode (or decode) in UTF-8, then use "UTF-8".
If the requirement is to always encode (or decode) in using the platform default, then do that.
If the requirement is to support multiple encodings (or the requirement might change) then make the encoding name a configuration (or command line) parameter, resolve to a Charset object and use that.
The point of this "best practice" recommendation is to avoid a foreseeable problem that will arise if your platform's characteristics change. You don't think that is likely, but you probably can't be completely sure about it. But at the end of the day, it is your decision.
(The fact that you are actually thinking about whether "best practice" is appropriate to your situation is a GOOD THING ... in my opinion.)

filename encoding using java

I want to write a reversible Encoder along with the corresponding Decoder, so that any string may be encoded to a legal file name corresponding to file naming rules of the Unix file system.
How to achieve this?
Example:
"xyz.txt" would be a valid file name, while "xyz/.txt" would not.

tl;dr: Your approach is flawed. Stick with the limitations of the file system. They're pretty hard to gracefully overcome (especially without introducing your own, even weirder limitations).
It's not possible to make one that is strictly decodable. You're trying to map a larger domain to a smaller domain which means that the reverse mapping cannot be known-correctly reversible.
This is easy to demonstrate with a simple example: how do you encode / such that it can be reversed? "Easy," you might say, "I'll just replace with the token x." But now how do you know when an x is an actual x and when your x is a 'special' x that should be converted to /? You can't.
You can of course make a system that is very unlikely to have any accidental clashes. For example, rather than changing / to - (which would be very error prone), you could change it to ---.
Oh, also, for what it's worth, most unix file systems actually consider any characters other than / or a null char a valid character (more). Obviously using that is a pain in the ass though.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.