byte array to string in java - java

SecureRandom random = SecureRandom.getInstance("SHA1PRNG");
byte[] salt = new byte[16];
random.nextBytes(salt);
I would like to convert salt to a string to store/read. I don't seem to be able to get this to work. I have read that I need to use the right encoding but I'm not sure what encoding to use. I have tried the following but get junk:
String s = new String(salt, "UTF-8");
String s = new String(salt, "UTF-16");
String s = new String(salt);
Edit: For context, I'm trying to work through and understand this code. I'm trying to view the salt and password so I can monkey with the code.

You need to use Base64 (Apache Commons) class or sun.misc.BASE64Encoder/BASE64Decode to encode the byte array.

Like AVD says, the solution is to use Base64 encoding or some other binary-as-text encoding. (For example, Hex encoding.)
Why? Because binary data is not text!
What you are currently doing is telling the String constructor that the bytes are text that has been correctly encoded as UTF-8 or UTF-16 or (in the last case) the platform's default encoding. This is patently false. The "junk" you are seeing is what you get if you attempt to decode random binary stuff as text.
Worse still, the decoding process is probably lossy when you apply it to random binary data. For instance, some sequences of bytes are simply invalid if you try to treat them as UTF-8. (The spec for UTF-8 says so!) When the UTF-8 decoder sees one of these invalid sequences, it replaces it with a character (such as a '?') that means "invalid character". If you then turn the characters in the string back into bytes, you will get a different byte sequence to the one that you started with. That's probably a disaster for your use-case.

Related

Java- How to verify if Thai characters are encoded correctly from UTF-8 to TIS620

Get input string in UTF-8, I applied TIS620 encoding and created new string from it now how to retain the bytes? since UTF-8 represents Thai char in 3 bytes where as TIS620 in 1 byte. I've requirement where the backend system stores characters in string as 1 byte only so default UTF-8 breaks it.
How to convert String character encoding from UTF-8 to TIS620?
How to retain the byte size while passing it to backend system?
If the string is reassigned to new String , Does character encoding is retained or it again gets converted to UTF-16 (Java default)?
Is it possible in Java? Any lib/utility which can be integrated?
I've tried below code and can check that post TIS620 the byte count matches the character count i.e.1 byte/char. But if encodedString gets new String assignment will it loose TIS620 format?
(Convert String with encoding UTF-8 to TIS620 (Thai encoding) in Java.What are the ways to do it and it there any data loss?)
public String encode() {
try {
String input = " "ใบใบใบใบ"";
byte [] encodedBytes= input.getBytes("TIS620");
String encodedString = new String(encodedBytes,"TIS620");
}catch (UnsupportedEncodingException e){
//Encoding failed
}
}
Expected result is, if I convert 5 Thai character from UTF-8 format to TIS620 the byte count should be converted and retained from 15 (UTF-8) to 5 (TIS620)?
A String in Java is always encoded in UTF-16, no matter how it was constructed. Or put differently: as soon as you have a String object, you should not care about which encoding it has. The encoding only comes back into the picture once you want to go back towards a byte[] (or OutputStream or the like).
This is correct and almost certainly exactly what you want to do. You should not try to work around that fact.
If you need to write the string to disk or send it to some other system in some specific encoding then you can get that encoded data from the String by using getBytes() as you did in your sample code.
In other words:
A String object in Java can not "have TIS620" encoding. A byte[] can contain TIS620 encoded data and you create that from a String using .getBytes("TIS620").
If you pass the encoded byte[] to the other system, it will have the correct byte size, simply because it was created with the correct encoding.
String always uses UTF-16. Creating a String with the content "ใบใบใบใบ" from UTF-8 data and from TIS620 data will produce exactly identical String objects, there's no way to know what encoding was used to create them.
InputStreamReader, OutputStreamWriter and comparable classes can also be passed an encoding to decode/encode with that encoding respectively. Other than that, no special handling is required.
Java's text datatypes (String, char and Character)—same goes for .NET, JavaScript, VB4/5/6/A/Script, …) always use the UTF-16 character encoding of the Unicode character set.
Many interfaces, bindings, drivers, data adaptors, and what not, understand that the text datatype is UTF-16 and which character encoding the target needs and so does a conversion itself. As long as you are using Java datatypes, if you have text encoding as UTF-8 or TIS620, you would typically use a byte array.
That it for straightforward text as text.
Now, if you had an array of arbitrary bytes and you want to write it into a text context, you could use Base64. Such a function takes a byte array and returns a String (UTF-16 encoded, of course). But since the characters used are supported by every character set, there would be no loss of data to convert the data to using whichever character encoding is needed.
People do like dealing with text datatypes so the above scheme is great. But for some reason, instead of Base64, some people use what I call Base256. They have an array of bytes (very often created from encoding text with a character encoding) and they apply an encoding function to convert the bytes to text, choosing to encode by decoding with a character encoding. You need to identify if that's what you are dealing with and if so, which character encoding was co-opted as a Base256 encoding. (Often the character encoding used for this is ISO 8859-1.)

bytes of a string is transformed between creation of the string and getBytes()

I have an unexpected behavior and I'm wondering if it is expected behavior and what's the reason behind that? I create a new String using a byte array and when I get back the byte array using the same encoding the byte array is not the same.
byte[] bytes = new byte[24];
new Random().nextBytes(bytes);
assertEquals( // fails
DatatypeConverter.printHexBinary(bytes),
DatatypeConverter.printHexBinary(new String(bytes, UTF_8).getBytes(UTF_8))
);
Not every random byte array is valid UTF-8. in fact, I'd say few of them are. So when creating the string you will have some characters converted to U+FFFD which signals that there was an error in deciding the original bytes. Those will then of course look different when converting back to bytes.
If you want a clean round-trip, don't put data in that isn't valid. Or you could use an encoding like Latin-1 instead where every byte is valid and thus stays the same. But generally putting random data that is not text in a string is rarely a useful or good idea. This isn't C where there is no distinction between binary data and text.
You're using randomly generated bytes to create a String. There is no guarantee that these randomly generated bytes will be valid UTF-8 (or any encoding). If you look at the documentation of String(byte[],Charset) you'll see:
This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.
This means that the bytes going in, if not valid, won't necessarily be the same bytes that come out; even when using the same Charset.

String encoding (UTF-8) JAVA

Could anyone please help me out here. I want to know the difference in below two string formatting. I am trying to encode the string to UTF-8. which one is the correct method.
String string2 = new String(string1.getBytes("UTF-8"), "UTF-8"));
OR
String string3 = new String(string1.getBytes(),"UTF-8"));
ALSO if I use above two code together i.e.
line 1 :string1 = new String(string1.getBytes("UTF-8"), "UTF-8"));
line 2 :string1 = new String(string1.getBytes(),"UTF-8"));
Will the value of string1 will be the same in both the lines?
PS: Purpose of doing all this is to send Japanese text in web service call.
So I want to send it with UTF-8 encoding.
According to the javadoc of String#getBytes(String charsetName):
Encodes this String into a sequence of bytes using the named charset,
storing the result into a new byte array.
And the documentation of String(byte[] bytes, Charset charset)
Constructs a new String by decoding the specified array of bytes using
the specified charset.
Thus getBytes() is opposite operation of String(byte []). The getBytes() encodes the string to bytes, and String(byte []) will decode the byte array and convert it to string. You will have to use same charset for both methods to preserve the actual string value. I.e. your second example is wrong:
// This is wrong because you are calling getBytes() with default charset
// But converting those bytes to string using UTF-8 encoding. This will
// mostly work because default encoding is usually UTF-8, but it can fail
// so it is wrong.
new String(string1.getBytes(),"UTF-8"));
String and char (two-bytes UTF-16) in java is for (Unicode) text.
When converting from and to byte[]s one needs the Charset (encoding) of those bytes.
Both String.getBytes() and new String(byte[]) are short cuts that use the default operating system encoding. That almost always is wrong for crossplatform usages.
So use
byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");
Or better, not throwing an UnsupportedCharsetException:
byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);
(Android does not know StandardCharsets however.)
The same holds for InputStreamReader, OutputStreamWriter that bridge binary data (InputStream/OutputStream) and text (Reader, Writer).
Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.
Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].
There are no UTF-8-encoded strings in Java.
If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.
If it doesn't take a string, then pass it string1.GetBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.

Converting String from One Charset to Another

I am working on converting a string from one charset to another and read many example on it and finally found below code, which looks nice to me and as a newbie to Charset Encoding, I want to know, if it is the right way to do it .
public static byte[] transcodeField(byte[] source, Charset from, Charset to) {
return new String(source, from).getBytes(to);
}
To convert String from ASCII to EBCDIC, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("US-ASCII"), Charset.forName("Cp1047"))));
And to convert from EBCDIC to ASCII, I have to do:
System.out.println(new String(transcodeField(ebytes,
Charset.forName("Cp1047"), Charset.forName("US-ASCII"))));
The code you found (transcodeField) doesn't convert a String from one encoding to another, because a String doesn't have an encoding¹. It converts bytes from one encoding to another. The method is only useful if your use case satisfies 2 conditions:
Your input data is bytes in one encoding
Your output data needs to be bytes in another encoding
In that case, it's straight forward:
byte[] out = transcodeField(inbytes, Charset.forName(inEnc), Charset.forName(outEnc));
If the input data contains characters that can't be represented in the output encoding (such as converting complex UTF8 to ASCII) those characters will be replaced with the ? replacement symbol, and the data will be corrupted.
However a lot of people ask "How do I convert a String from one encoding to another", to which a lot of people answer with the following snippet:
String s = new String(source.getBytes(inputEncoding), outputEncoding);
This is complete bull****. The getBytes(String encoding) method returns a byte array with the characters encoded in the specified encoding (if possible, again invalid characters are converted to ?). The String constructor with the 2nd parameter creates a new String from a byte array, where the bytes are in the specified encoding. Now since you just used source.getBytes(inputEncoding) to get those bytes, they're not encoded in outputEncoding (except if the encodings use the same values, which is common for "normal" characters like abcd, but differs with more complex like accented characters éêäöñ).
So what does this mean? It means that when you have a Java String, everything is great. Strings are unicode, meaning that all of your characters are safe. The problem comes when you need to convert that String to bytes, meaning that you need to decide on an encoding. Choosing a unicode compatible encoding such as UTF8, UTF16 etc. is great. It means your characters will still be safe even if your String contained all sorts of weird characters. If you choose a different encoding (with US-ASCII being the least supportive) your String must contain only the characters supported by the encoding, or it will result in corrupted bytes.
Now finally some examples of good and bad usage.
String myString = "Feng shui in chinese is 風水";
byte[] bytes1 = myString.getBytes("UTF-8"); // Bytes correct
byte[] bytes2 = myString.getBytes("US-ASCII"); // Last 2 characters are now corrupted (converted to question marks)
String nordic = "Här är några merkkejä";
byte[] bytes3 = nordic.getBytes("UTF-8"); // Bytes correct, "weird" chars take 2 bytes each
byte[] bytes4 = nordic.getBytes("ISO-8859-1"); // Bytes correct, "weird" chars take 1 byte each
String broken = new String(nordic.getBytes("UTF-8"), "ISO-8859-1"); // Contains now "Här är några merkkejä"
The last example demonstrates that even though both of the encodings support the nordic characters, they use different bytes to represent them and using the wrong encoding when decoding results in Mojibake. Therefore there's no such thing as "converting a String from one encoding to another", and you should never use the broken example.
Also note that you should always specify the encoding used (with both getBytes() and new String()), because you can't trust that the default encoding is always the one you want.
As a last issue, Charset and Encoding aren't the same thing, but they're very much related.
¹ Technically the way a String is stored internally in the JVM is in UTF-16 encoding up to Java 8, and variable encoding from Java 9 onwards, but the developer doesn't need to care about that.
NOTE
It's possible to have a corrupted String and be able to uncorrupt it by fiddling with the encoding, which may be where this "convert String to other encoding" misunderstanding originates from.
// Input comes from network/file/other place and we have misconfigured the encoding
String input = "Här är några merkkejä"; // UTF-8 bytes, interpreted wrongly as ISO-8859-1 compatible
byte[] bytes = input.getBytes("ISO-8859-1"); // Get each char as single byte
String asUtf8 = new String(bytes, "UTF-8"); // Recreate String as UTF-8
If no characters were corrupted in input, the string would now be "fixed". However the proper approach is to use the correct encoding when reading input, not fix it afterwards. Especially if there's a chance of it becoming corrupted.

Does an array of bytes with negative values lose information when converted to String?

I've got a code like this where in the encoding i convert the letters to bytes and then flip them with unary bitwise complement ~ at the end convert it to String.
After that i want to decrypt it with a similar method. The problem is that for two similar input Strings (but not the same) i get the same encoded String with the same hashcode.
Does the String(bytes) method lose the information because the bytes are negative or can i retrieve it somehow without changing my encryption part?
thanx
static String encrypt(String s){
byte[] bytes=s.getBytes();
byte[] enc=new byte[bytes.length];
for (int i=0;i<bytes.length;i++){
enc[i]=(byte) ~bytes[i];
}
return new String(enc);
}
static String decrypt(String s){
...
You should never use new String(...) to encode arbitrary binary data. That's not what it's there for.
Additionally, you should only very rarely use the default platform encoding, which is what you get when you call String.getBytes() and new String(byte[]) without specifying an encoding.
In general, encryption converts binary data to binary data. The normal process of encrypting a string to a string is therefore:
Convert the string into bytes with a known encoding (e.g. UTF-8)
Encrypt the binary data
Convert the encrypted binary data back into a string using base64.
Base64 is used to encode arbitrary binary data as ASCII data in a lossless fashion. Decryption is just a matter of reversing the steps:
Convert the base64 text back to a byte array
Decrypt the byte array
Decode the decrypted byte array as a string using UTF-8
(Note that what you've got currently is not really encryption - it's obfuscation at best.)
Your effectively converting arbitrary byte data into a String.
That's not what that constructor is for.
The String constructor that takes a byte[] is meant to convert text in the platform default encoding into a String. Since what you have is not text, the behaviour will be "bad".
If, for example, your platform default encoding is a 8-bit encoding (such as ISO-8859-*), then you'll "only" get random characters.
If your platform default encoding is UTF-8 you'll probably get random characters and some replacement characters for malformed byte sequences.
To summarize: don't do that. I can't tell you what to do instead, since it's not obvious what you're trying to achieve.

Categories