Is String.getBytes being safely used? - java

Currently I need to work with the bytes of a String in Java, and this has raised many questions about encodings and JVM implementation details. I would like to know whether what I'm doing makes sense, or whether it is redundant.
To begin with, I understand that at runtime a Java char in a String always represents a symbol in Unicode.
Secondly, the UTF-8 encoding is always able to encode any Unicode symbol. It follows that the snippet below will always return a byte[] without performing any replacement (see the getBytes documentation):
byte[] stringBytes = myString.getBytes(StandardCharsets.UTF_8);
Then, if stringBytes is used in a different JVM instance in the following way, it will always yield a string equivalent to myString.
new String(stringBytes, StandardCharsets.UTF_8);
Do you think that my understanding of getBytes is correct? If that is the case, how would you justify it? Am I missing a pathological case which could lead me not to get an equivalent version of myString?
Thanks in advance.
EDIT:
Would you agree that, by doing the following, any non-exceptional flow leads to a handled case, which allows us to successfully reconstruct the string?
EDIT:
Based on the answers, here is the solution that allows you to safely reconstruct strings when no exception is thrown. You still need to handle the exception somehow.
First, get the bytes using the encoder:
final CharsetEncoder encoder =
    StandardCharsets.UTF_8
        .newEncoder()
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .onMalformedInput(CodingErrorAction.REPORT);
// encode throws a CharacterCodingException if the input is malformed or
// contains an unmappable character, instead of silently substituting.
ByteBuffer byteBuffer = encoder.encode(CharBuffer.wrap(string));
// Copy out only the encoded bytes: the buffer's backing array may be
// larger than its content, so calling array() directly would include junk.
byte[] stringBytes = new byte[byteBuffer.remaining()];
byteBuffer.get(stringBytes);
Second, construct the string using the bytes given by the encoder (non-exceptional path):
new String(stringBytes, StandardCharsets.UTF_8);
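Putting it together, here is a minimal self-contained sketch of this approach (the class and helper names are mine, for illustration only):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public final class StrictUtf8 {
    // Encodes the string to UTF-8, throwing instead of replacing
    // malformed input (e.g. unpaired surrogates).
    static byte[] encodeStrictly(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8
                .newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .onMalformedInput(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = encodeStrictly("héllo");
        // Reconstructing on the other side is now safe.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}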

it will always yield a string equivalent to myString.
Well, not always. Few things in this world happen always.
One edge case I can think of is that myString could be an "invalid" string when you call getBytes. For example, it could contain an unpaired surrogate:
String myString = "\uD83D";
How often this will happen heavily depends on what you are doing with myString, so I'll let you think about that on your own.
If myString contains an unpaired surrogate, getBytes encodes a question mark character in its place:
// prints "?"
System.out.println(
    new String(myString.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)
);
I wouldn't say a ? is "equivalent" to a malformed string.
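As a quick sanity check, here is a sketch using the strict encoder from the question's edit; with REPORT configured, this input throws rather than substituting a ?:
CharsetEncoder strict = StandardCharsets.UTF_8
        .newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
try {
    strict.encode(CharBuffer.wrap("\uD83D")); // unpaired high surrogate
} catch (CharacterCodingException e) {
    // A MalformedInputException lands here instead of "?" being encoded.
    System.out.println("rejected: " + e);
}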
See also: Is an instance of a Java string always valid UTF-16?

Related

bytes of a string are transformed between creation of the string and getBytes()

I have run into unexpected behavior and I'm wondering whether it is expected, and what the reason behind it is. I create a new String from a byte array, and when I get the byte array back using the same encoding, it is not the same.
byte[] bytes = new byte[24];
new Random().nextBytes(bytes);
assertEquals( // fails
    DatatypeConverter.printHexBinary(bytes),
    DatatypeConverter.printHexBinary(new String(bytes, UTF_8).getBytes(UTF_8))
);
Not every random byte array is valid UTF-8; in fact, I'd say few of them are. So when creating the string, some characters will be converted to U+FFFD, which signals that there was an error in decoding the original bytes. Those will then, of course, look different when converted back to bytes.
If you want a clean round-trip, don't put in data that isn't valid. Or you could use an encoding like Latin-1 instead, where every byte is valid and thus stays the same. But putting random non-text data in a string is rarely a useful or good idea. This isn't C, where there is no distinction between binary data and text.
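A minimal sketch of the Latin-1 round-trip mentioned above (every byte value maps to exactly one char and back unchanged; assumes java.util.Arrays and java.util.Random are imported):
byte[] bytes = new byte[24];
new Random().nextBytes(bytes);
// ISO-8859-1 maps each byte 0x00-0xFF to a single char, so the
// round-trip is lossless even for arbitrary binary data.
byte[] roundTripped = new String(bytes, StandardCharsets.ISO_8859_1)
        .getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.equals(bytes, roundTripped)); // true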
You're using randomly generated bytes to create a String. There is no guarantee that these randomly generated bytes will be valid UTF-8 (or any encoding). If you look at the documentation of String(byte[],Charset) you'll see:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.
This means that the bytes going in, if not valid, won't necessarily be the same bytes that come out; even when using the same Charset.
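For instance, a sketch: 0xC3 alone is a truncated UTF-8 sequence, so the decoder substitutes U+FFFD and the original byte is lost:
byte[] invalid = { (byte) 0xC3 }; // lead byte of a two-byte sequence, cut short
String decoded = new String(invalid, StandardCharsets.UTF_8);
System.out.println(decoded.equals("\uFFFD")); // true: the replacement character
System.out.println(decoded.getBytes(StandardCharsets.UTF_8).length); // 3, not 1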

Importance of specifying the encoding in getBytes in Java

I understand the need to specify an encoding when converting a byte[] to a String in Java using an appropriate format (hex, Base64, etc.), because the default encoding may not be the same on different platforms. But I am not sure I understand the same need when converting a String to bytes. So this question is to wrap my head around why a character set must be specified when transferring Strings over the web.
Consider the following code in Java.
Note: the String in the example below is not read from a file or another resource; it is created directly.
1: String message = "a good message";
2: byte[] encryptedMsgBytes = encrypt(key, message.getBytes());
3: String base64EncodedMessage = new String (Base64.encodeBase64(encryptedMsgBytes));
I need to send this over the web using Http Post & will be received & processed (decrypted, converted from base64 etc.) at other end.
Based on my reading, the recommended practice is to use .getBytes("UTF-8") on line 2, i.e. message.getBytes("UTF-8"), and the same approach is recommended at the other end when processing the data, as shown on line 7 below:
4: String base64EncodedMsg = ... // received over HTTP POST
5: byte[] encryptedMsgBytes = Base64.decodeBase64(base64EncodedMsg);
6: byte[] decryptedMsgBytes = decrypt(aesKey, "AES", encryptedMsgBytes);
7: String originalMsg = new String(decryptedMsgBytes, "UTF-8");
Given that Java's internal in-memory string representation is UTF-16 (excluding UTF-8 during serialization and file saving), do we really need this if the decryption is also done in Java? (Note: this is not a practical assumption, just for the sake of discussion, to understand the need to specify the encoding.) Since in the JVM the String 'message' on line 1 is represented using UTF-16, wouldn't the .getBytes() method without an encoding argument always return the UTF-16 bytes? Or is that incorrect, and does .getBytes() without an encoding return raw bytes? Since the internal representation is UTF-16, why would the default character encoding of a particular JVM matter?
If it did indeed return UTF-16, would there be any need to use new String(decryptedMsgBytes, "UTF-8") at the other end?
wouldn't the .getBytes() method without specifying the encoding
always return the UTF-16 bytes ?
This is incorrect. Per the Javadoc, this uses the platform's default charset:
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.
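A short sketch of the difference (what the first call returns depends on the machine's default charset, which is exactly the portability problem):
String message = "a good message";
byte[] platformBytes = message.getBytes(); // uses Charset.defaultCharset(), varies per machine
byte[] utf8Bytes = message.getBytes(StandardCharsets.UTF_8); // identical on every JVM
System.out.println(Charset.defaultCharset()); // e.g. windows-1252 or UTF-8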

UTF-8 Encoding ; Only some Japanese characters are not getting converted

I am getting the parameter value from a Jersey web service, and it contains Japanese characters.
Here, 'japaneseString' is the web-service parameter containing the Japanese text.
String name = new String(japaneseString.getBytes(), "UTF-8");
However, I am able to convert a few string literals successfully, while some of them are creating problems.
The following were successfully converted:
1) アップル
2) 赤
3) 世丕且且世两上与丑万丣丕且丗丕
4) 世世丗丈
While these didn't:
1) ひほわれよう
2) 存在する
When I investigated further, I found that these two strings are getting converted into junk characters.
1) Input: ひほわれよう Output : �?��?��?れよ�?�
2) Input: 存在する Output: 存在�?�る
Any idea why some of the japanese characters are not converted properly?
Thanks.
You are mixing concepts here.
A String is just a sequence of characters (chars); a String in itself has no encoding at all. For what it's worth, replace characters in the above with carrier pigeons. Same thing. A carrier pigeon has no encoding. Neither does a char. (1)
What you are doing here:
new String(x.getBytes(), "UTF-8")
is a "poor man's encoding/decoding process". You will probably have noticed that there are two versions of .getBytes(): one where you pass a charset as an argument and the other where you don't.
If you don't, and that is what happens here, it means you will get the result of the encoding process using your default character set; and then you try and re-decode this byte sequence using UTF-8.
Don't do that. Just take in the string as it comes. If, however, you have trouble reading the original byte stream into a string, it means you use a Reader with the wrong charset. Fix that part.
For more information, read this link.
(1) the fact that, in fact, a char is a UTF-16 code unit is irrelevant to this discussion
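A sketch of why this "poor man's" round-trip mangles text, simulating a machine whose default charset is ISO-8859-1 by naming it explicitly:
String original = "héllo";
// Stand-in for x.getBytes() on a platform whose default charset is ISO-8859-1:
byte[] bytes = original.getBytes(StandardCharsets.ISO_8859_1);
// Re-decoding those bytes as UTF-8 corrupts the non-ASCII character:
System.out.println(new String(bytes, StandardCharsets.UTF_8)); // prints h�llo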
Try setting the JVM parameter file.encoding to UTF-8 when starting Tomcat (the JVM).
E.g.: -Dfile.encoding=UTF-8
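To check what a given JVM is actually using, something like this works:
System.out.println(System.getProperty("file.encoding"));
System.out.println(java.nio.charset.Charset.defaultCharset());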
I concur with @fge.
Clarification
In Java, String/char/Reader/Writer handle (Unicode) text and can combine all the scripts in the world.
byte[]/InputStream/OutputStream hold binary data, which needs an indication of some encoding to be converted to a String.
In your case, japaneseString should already be a correct String, or it should be substituted by the original byte[].
Traps in Java
Encoding is often an optional parameter, which then defaults to the platform encoding. You fell into that trap too:
String s = "...";
byte[] b = s.getBytes();                       // Platform encoding, non-portable.
byte[] b = s.getBytes("UTF-8");                // Explicit.
byte[] b = s.getBytes(StandardCharsets.UTF_8); // Explicit, better (constants exist for UTF-8, ISO-8859-1, ...).
In general, avoid the overloaded methods without an encoding parameter, as they produce current-computer-only data: non-portable. For completeness: the classes FileReader/FileWriter should be avoided, as they do not even offer an encoding parameter.
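A sketch of the portable alternative for files (the file name is hypothetical; InputStreamReader takes the charset explicitly):
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
    System.out.println(reader.readLine());
}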
Error
japaneseString is already wrong, so you have to read it in correctly in the first place.
It could have been read erroneously as Windows-1252 (Windows Latin-1) and then suffered when being recoded to UTF-8. Evidently only some cases get messed up.
Maybe you had:
String japaneseString = new String(bytes);
instead of:
String japaneseString = new String(bytes, StandardCharsets.UTF_8);
At the end:
String name = japaneseString;
Show the code for reading japaneseString for further help.

Does Class URLDecoder.decode handle double encoding?

I am trying to debug a flaky Java application. I can't (easily) debug it in the only way I know how: putting a log statement in, re-compiling, and checking the logs.
(I don't have access to a reliable set of source code, and I'm not a Java developer.)
The actual question:
If I did this:
str = URLDecoder.decode("%25C3%2596");
What would be in str?
Would it realize that this is double-encoded and handle it, i.e. turn it into %C3%96 and then decode that? (Which decodes into a German umlaut.)
Thanks
--Justin Wyllie
From the Java API URLDecoder:
A sequence of the form "%xy" will be treated as representing a byte where xy is the two-digit hexadecimal representation of the 8 bits.
So my guess would be - most likely not.
You could however call the decode method twice.
str = URLDecoder.decode(URLDecoder.decode("%25C3%2596"));
str = URLDecoder.decode("%25C3%2596");
The result of this operation is system-dependent (which is the reason that method is deprecated).
The result of this call:
str = URLDecoder.decode("%25C3%2596", "UTF-8");
...would be %C3%96 which is Ö in percent-encoded UTF-8. The API does not try to recursively decode any percent-signs.
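A sketch of the two-step decode with an explicit charset (each call strips one layer of percent-encoding):
String once = URLDecoder.decode("%25C3%2596", "UTF-8");
System.out.println(once);  // %C3%96
String twice = URLDecoder.decode(once, "UTF-8");
System.out.println(twice); // Ö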

Need help removing strange characters from string

Currently, when I make a signature using java.security.Signature, it passes back a string.
I can't seem to use this string, since it contains special characters that can only be seen when I copy the string into Notepad++; if I remove those special characters, I can use the remainder of the string in my program.
In Notepad++ they look like black boxes with the words ACK GS STX SI SUB ETB BS VT.
I don't really understand what they are, so it's hard to tell how to get rid of them.
Is there a function I can run to remove these and potentially similar characters?
When I use the Base64 class supplied in the posts, I can't get back to a signature:
System.out.println(signature);
String base64 = Base64.encodeBytes(sig);
System.out.println(base64);
String sig2 = new String (Base64.decode(base64));
System.out.println(sig2);
gives the output
”zÌý¥y]žd”xKmËY³ÕN´Ìå}ÏBÊNÈ›`Αrp~jÖüñ0…Rõ…•éh?ÞÀ_û_¥ÂçªsÂk{6H7œÉ/”âtTK±Ï…Ã/Ùê²
lHrM/aV5XZ5klHhLbctZs9VOtMzlfc9Cyk7Im2DOkXJwfmoG1vzxMIVS9YWV6Wg/HQLewF/7X6XC56pzwmt7DzZIN5zJL5TidFRLsc+Fwy/Z6rIaNA2uVlCh3XYkWcu882tKt2RySSkn1heWhG0IeNNfopAvbmHDlgszaWaXYzY=
[B#15356d5
The odd characters are there because cryptographic signatures produce bytes rather than strings. Consequently if you want a printable representation you should Base64 encode it (here's a public domain implementation for Java).
Stripping the non-printing characters from a cryptographic signature will render it useless as you will be unable to use it for verification.
Update:
[B#15356d5
This is the result of toString called on a byte array. "[" means array, "B" means byte, and "15356d5" is the array's identity hash code in hex, not its contents. You should be passing the array you get out of decode to Signature.verify (http://java.sun.com/j2se/1.4.2/docs/api/java/security/Signature.html#verify(byte[])).
Something like:
// Signature objects come from the static factory, not a constructor:
Signature sig = Signature.getInstance("DSA");
sig.initVerify(key);
sig.update(signedData); // the data that was originally signed
boolean valid = sig.verify(Base64.decode(base64)); // <-- decoded bytes go here
How are you "making" the signature? If you use the sign method, you get back a byte array, not a string. That's not a binary representation of some text, it's just arbitrary binary data. That's what you should use, and if you need to convert it into a string you should use a base64 conversion to avoid data corruption.
If I understand your problem correctly, you need to get rid of characters with codes below 32, except maybe char 9 (tab), char 10 (new line) and char 13 (carriage return).
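A sketch of that stripping (a regex over the control range, keeping tab, LF and CR; as noted in the edit below, don't do this to a signature):
String cleaned = input.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "");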
Edit: I agree with the others as handling a crypto output like this is not what you usually want.
