Since today I've been facing a really weird error related to byte[] to String conversion.
Here is the code:
private static final byte[] test_key = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38, 118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};
public static void main(String[] args) {
byte[] encryptedArray = xor("ciao".getBytes(), test_key);
System.out.println("Encrypted arrray: " + Arrays.toString(encryptedArray));
final String encrypted = new String(encryptedArray);
System.out.println("Length: " + new String(encryptedArray).length());
System.out.println(Arrays.toString(encrypted.getBytes()));
System.out.println("Encrypted value: " + encrypted);
System.out.println("Decrypted value: " + new String(xor(encrypted.getBytes(), test_key)));
}
private static byte[] xor(byte[] data, byte[] key) {
byte[] result = new byte[data.length];
for (int i = 0; i < data.length; i++) {
result[i] = (byte) (data[i] ^ key[i % key.length]);
}
return result;
}
My output is:
Encrypted arrray: [-13, -82, -78, 18]
Length: 2
[-17, -65, -67, 18]
Encrypted value: �
Decrypted value: xno
Why does length() return 2? What am I missing?
There is no 1-to-1 mapping between byte and char; the mapping depends on the charset you use. A String is logically a sequence of chars, so converting between chars and bytes requires a character encoding, which specifies the mapping from chars to bytes and vice versa. Here, new String(encryptedArray) decodes the bytes in encryptedArray using the platform default charset (UTF-8 in your case). Your bytes are not a valid UTF-8 sequence, so the invalid part is replaced with the replacement character U+FFFD and the original bytes are lost.
If you want to use a String and get back the exact bytes, Base64-encode encryptedArray and build the String from the result:
String encoded = new String(Base64.getEncoder().encode(encryptedArray));
To retrieve the bytes, just decode:
Base64.getDecoder().decode(encoded);
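As a minimal sketch of the full round trip, the class below XORs "ciao", Base64-encodes the result with java.util.Base64 (Java 8+), decodes it again, and XORs back. The class name, the helper names toText/fromText, and the shortened 4-byte key are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {
    // XOR each data byte with the key, repeating the key as needed.
    static byte[] xor(byte[] data, byte[] key) {
        byte[] result = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            result[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return result;
    }

    // Binary -> text: Base64 output contains only ASCII characters.
    static String toText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Text -> binary: restores exactly the bytes that were encoded.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] key = {-112, -57, -45, 125}; // shortened, hypothetical key
        byte[] encrypted = xor("ciao".getBytes(StandardCharsets.UTF_8), key);
        String encoded = toText(encrypted);
        byte[] restored = fromText(encoded);
        System.out.println(Arrays.equals(encrypted, restored));                     // true
        System.out.println(new String(xor(restored, key), StandardCharsets.UTF_8)); // ciao
    }
}
```

The String never holds the raw XOR output, only its Base64 text, so no charset decoding step can damage it.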
I just thought of a good way of showing what happens by simply replacing the new String(byte[]) method by another one, which is why I will answer the question. This one performs the same basic action as the constructor, with one change: it throws an exception if any invalid characters are found.
private static final byte[] test_key = {-112, -57, -45, 125, 91, 126, -118, 13, 83, -60, -119, 57, 38, 118, -115, -52, -92, 39, -24, 75, 59, -21, 88, 84, 66, -125};
public static void main(String[] args) throws Exception {
byte[] encryptedArray = xor("ciao".getBytes(), test_key);
System.out.println("Encrypted arrray: " + Arrays.toString(encryptedArray));
final String encrypted = new String(encryptedArray);
// original
System.out.println("Length: " + new String(encryptedArray).length());
// replacement
System.out.println("Length: " + decode(encryptedArray).length());
System.out.println(Arrays.toString(encrypted.getBytes()));
System.out.println("Encrypted value: " + encrypted);
System.out.println("Decrypted value: " + new String(xor(encrypted.getBytes(), test_key)));
}
private static String decode(byte[] encryptedArray) throws CharacterCodingException {
var decoder = Charset.defaultCharset().newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
var decoded = decoder.decode(ByteBuffer.wrap(encryptedArray));
return decoded.toString();
}
private static byte[] xor(byte[] data, byte[] key) {
byte[] result = new byte[data.length];
for (int i = 0; i < data.length; i++) {
result[i] = (byte) (data[i] ^ key[i % key.length]);
}
return result;
}
The method is called decode because that's what you are actually doing: you are decoding the bytes to a text. A character encoding is the encoding of characters as bytes, which means that the opposite must be decoding after all.
As you will see, the above will first print out 2 if your platform's default encoding is UTF-8 (Linux, Android, macOS). On Windows, which uses the Windows-1252 charset instead (a single-byte encoding that extends Latin-1, which itself extends ASCII), you can get the same result by replacing Charset.defaultCharset() with StandardCharsets.UTF_8. The decode method, however, will generate the following exception:
java.nio.charset.MalformedInputException: Input length = 3
at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:274)
at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:815)
at StackExchange/com.stackexchange.so.ShowBadEncoding.decode(ShowBadEncoding.java:36)
at StackExchange/com.stackexchange.so.ShowBadEncoding.main(ShowBadEncoding.java:24)
Now maybe you'd expect 4 here, the size of the byte array. But note that UTF-8 characters may be encoded over multiple bytes. The error occurs not on the entire string, but on the last character it is trying to read. Obviously it is expecting a longer encoding based on the previous byte values.
If you replace REPORT with the default decoding action REPLACE (heh) you will see that the result is identical to the constructor, and length() will now return the value 2 again.
Of course, Topaco is correct when he says you need to use base 64 encoding. This encodes bytes to characters instead so that all of the meaning of the bytes is maintained, and the reverse is of course the decoding of text back to bytes.
The elements of a String are not bytes, they are chars. A char is not a byte.
There are many ways of converting a char to a sequence of bytes (i.e., many character-set encodings).
Not every sequence of chars can be converted to a sequence of bytes; there is not always a mapping for every char. It depends on your chosen character-set encoding.
Not every sequence of bytes can be converted to a String; the bytes have to be syntactically valid for the specified character set.
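The last point can be checked with a small round-trip test. This is only a sketch: the class and method names are made up for illustration, and the byte values are the XOR output printed in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {
    // True if decoding the bytes to a String and encoding them again
    // with the same charset gives back exactly the same bytes.
    static boolean survivesRoundTrip(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return Arrays.equals(bytes, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] encrypted = {-13, -82, -78, 18}; // the XOR output from the question
        // The first three bytes are a truncated UTF-8 sequence, so they are replaced.
        System.out.println(survivesRoundTrip(encrypted, StandardCharsets.UTF_8));      // false
        // ISO-8859-1 maps every byte value to exactly one char.
        System.out.println(survivesRoundTrip(encrypted, StandardCharsets.ISO_8859_1)); // true
    }
}
```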
There are a lot of questions on this topic with the same solution, but it doesn't work for me. I have a simple test with an encryption. The encryption/decryption itself works (as long as I handle this test with the byte array itself and not as Strings). The problem is that I don't want to handle it as a byte array but as a String, but when I encode the byte array to a String and back, the resulting byte array differs from the original byte array, so the decryption doesn't work anymore. I tried the following parameters in the corresponding String methods: UTF-8, UTF8, UTF-16. None of them work. The resulting byte array differs from the original. Any ideas why this is so?
Encrypter:
public class NewEncrypter
{
private String algorithm = "DESede";
private Key key = null;
private Cipher cipher = null;
public NewEncrypter() throws NoSuchAlgorithmException, NoSuchPaddingException
{
key = KeyGenerator.getInstance(algorithm).generateKey();
cipher = Cipher.getInstance(algorithm);
}
public byte[] encrypt(String input) throws Exception
{
cipher.init(Cipher.ENCRYPT_MODE, key);
byte[] inputBytes = input.getBytes("UTF-16");
return cipher.doFinal(inputBytes);
}
public String decrypt(byte[] encryptionBytes) throws Exception
{
cipher.init(Cipher.DECRYPT_MODE, key);
byte[] recoveredBytes = cipher.doFinal(encryptionBytes);
String recovered = new String(recoveredBytes, "UTF-16");
return recovered;
}
}
This is the test where I try it:
public class NewEncrypterTest
{
@Test
public void canEncryptAndDecrypt() throws Exception
{
String toEncrypt = "FOOBAR";
NewEncrypter encrypter = new NewEncrypter();
byte[] encryptedByteArray = encrypter.encrypt(toEncrypt);
System.out.println("encryptedByteArray:" + encryptedByteArray);
String decoded = new String(encryptedByteArray, "UTF-16");
System.out.println("decoded:" + decoded);
byte[] encoded = decoded.getBytes("UTF-16");
System.out.println("encoded:" + encoded);
String decryptedText = encrypter.decrypt(encoded); //Exception here
System.out.println("decryptedText:" + decryptedText);
assertEquals(toEncrypt, decryptedText);
}
}
It is not a good idea to store encrypted data in Strings because they are for human-readable text, not for arbitrary binary data. For binary data it's best to use byte[].
However, if you must do it you should use an encoding that has a 1-to-1 mapping between bytes and characters, that is, where every byte sequence can be mapped to a unique sequence of characters, and back. One such encoding is ISO-8859-1, that is:
String decoded = new String(encryptedByteArray, "ISO-8859-1");
System.out.println("decoded:" + decoded);
byte[] encoded = decoded.getBytes("ISO-8859-1");
System.out.println("encoded:" + java.util.Arrays.toString(encoded));
String decryptedText = encrypter.decrypt(encoded);
Other common encodings that don't lose data are hexadecimal and base64. Older versions of the standard API don't define classes for them, so there you need a helper library; newer versions do (java.util.Base64 since Java 8, java.util.HexFormat since Java 17).
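For illustration, hexadecimal encoding is simple enough to sketch by hand without any library. The class and method names below are hypothetical, and the sample bytes are the first four bytes of an encrypted array shown later in this thread.

```java
class HexCodec {
    // Two lowercase hex digits per byte; & 0xFF avoids sign extension.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Inverse of toHex: every pair of hex digits becomes one byte.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {90, -40, -39, -56};
        String hex = toHex(data);
        System.out.println(hex);                                          // 5ad8d9c8
        System.out.println(java.util.Arrays.equals(data, fromHex(hex))); // true
    }
}
```

Hex doubles the size of the data, whereas Base64 grows it by about a third, which is why Base64 is usually preferred for larger payloads.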
With UTF-16 the program would fail for two reasons:
String.getBytes("UTF-16") adds a byte order mark (BOM) to the output to identify the order of the bytes. You should use UTF-16LE or UTF-16BE for this to not happen.
Not all sequences of bytes can be mapped to characters in UTF-16. First, text encoded in UTF-16 must have an even number of bytes. Second, UTF-16 has a mechanism for encoding Unicode characters beyond U+FFFF. This means that, for example, there are sequences of 4 bytes that map to only one Unicode character. For this to be possible, the first 2 bytes of the 4 don't encode any character in UTF-16 on their own.
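Both effects are easy to observe. The sketch below (class and helper names are illustrative assumptions) prints the bytes produced for the single character "A" by the three UTF-16 variants, then pushes an odd-length byte array through a decode/encode round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16Bom {
    // Encodes the text and renders the resulting bytes as a readable list.
    static String encodedBytes(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        // "UTF-16" encodes big-endian and prepends the BOM bytes FE FF.
        System.out.println(encodedBytes("A", StandardCharsets.UTF_16));   // [-2, -1, 0, 65]
        System.out.println(encodedBytes("A", StandardCharsets.UTF_16BE)); // [0, 65]
        System.out.println(encodedBytes("A", StandardCharsets.UTF_16LE)); // [65, 0]

        // Three bytes cannot form a whole number of 16-bit code units,
        // so the round trip cannot reproduce the input.
        byte[] odd = {0, 65, 0};
        byte[] back = new String(odd, StandardCharsets.UTF_16BE)
                .getBytes(StandardCharsets.UTF_16BE);
        System.out.println(Arrays.equals(odd, back)); // false
    }
}
```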
The accepted solution will not work if your String has some non-typical characters such as š, ž, ć, Ō, ō, Ū, etc.
Following code worked nicely for me.
byte[] myBytes = Something.getMyBytes();
// These signatures match android.util.Base64; NO_WRAP omits line breaks in the output.
String encodedString = Base64.encodeToString(myBytes, Base64.NO_WRAP);
byte[] decodedBytes = Base64.decode(encodedString, Base64.NO_WRAP);
Now, I found another solution too...
public class NewEncrypterTest
{
@Test
public void canEncryptAndDecrypt() throws Exception
{
String toEncrypt = "FOOBAR";
NewEncrypter encrypter = new NewEncrypter();
byte[] encryptedByteArray = encrypter.encrypt(toEncrypt);
String encoded = String.valueOf(Hex.encodeHex(encryptedByteArray));
byte[] byteArrayToDecrypt = Hex.decodeHex(encoded.toCharArray());
String decryptedText = encrypter.decrypt(byteArrayToDecrypt);
System.out.println("decryptedText:" + decryptedText);
assertEquals(toEncrypt, decryptedText);
}
}
Your problem is that you cannot build a UTF-16 (or any other encoding) String from an arbitrary byte array (see UTF-16 on Wikipedia). It is up to you, however, to serialize and deserialize the encrypted byte array without any loss, in order to, say, persist it, and make use of it later. Here's the modified client code that should give you some insight of what's actually happening with the byte arrays:
public static void main(String[] args) throws Exception {
String toEncrypt = "FOOBAR";
NewEncrypter encrypter = new NewEncrypter();
byte[] encryptedByteArray = encrypter.encrypt(toEncrypt);
System.out.println("encryptedByteArray:" + Arrays.toString(encryptedByteArray));
String decoded = new String(encryptedByteArray, "UTF-16");
System.out.println("decoded:" + decoded);
byte[] encoded = decoded.getBytes("UTF-16");
System.out.println("encoded:" + Arrays.toString(encoded));
String decryptedText = encrypter.decrypt(encryptedByteArray); // NOT the "encoded" value!
System.out.println("decryptedText:" + decryptedText);
}
This is the output:
encryptedByteArray:[90, -40, -39, -56, -90, 51, 96, 95, -65, -54, -61, 51, 6, 15, -114, 88]
decoded:<some garbage>
encoded:[-2, -1, 90, -40, -1, -3, 96, 95, -65, -54, -61, 51, 6, 15, -114, 88]
decryptedText:FOOBAR
The decryptedText is correct, when restored from the original encryptedByteArray. Please note that the encoded value is not the same as encryptedByteArray, due to the data loss during the byte[] -> String("UTF-16")->byte[] conversion.
I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.
How can I convert this string to something readable?
I tried this:
import java.io.UnsupportedEncodingException;
public class Program {
public void run() throws UnsupportedEncodingException {
final String windows1251String = getWindows1251String();
System.out.println("String (Windows-1251): " + windows1251String);
final String readableString = convertString(windows1251String);
System.out.println("String (converted): " + readableString);
}
private String convertString(String windows1251String) throws UnsupportedEncodingException {
return new String(windows1251String.getBytes(), "UTF-8");
}
private String getWindows1251String() {
final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
return new String(bytes);
}
public static void main(final String[] args) throws UnsupportedEncodingException {
final Program program = new Program();
program.run();
}
}
The variable bytes contains the data shown in my debugger; it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copied and pasted that array here.
This doesn't work - readableString contains garbage.
How can I fix it, i. e. make sure that the Windows-1251 string is decoded properly?
Update 1 (30.07.2015 12:45 MSK): When I change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.
Update 2: Another attempt:
Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.
Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251, but UTF-8.
public static String guessEncoding(byte[] bytes) {
String DEFAULT_ENCODING = "UTF-8";
org.mozilla.universalchardet.UniversalDetector detector =
new org.mozilla.universalchardet.UniversalDetector(null);
detector.handleData(bytes, 0, bytes.length);
detector.dataEnd();
String encoding = detector.getDetectedCharset();
System.out.println("Detected encoding: " + encoding);
detector.reset();
if (encoding == null) {
encoding = DEFAULT_ENCODING;
}
return encoding;
}
I fixed this problem by modifying the piece of code that reads the text from the web site.
private String readContent(final String urlAsString) {
final StringBuilder content = new StringBuilder();
BufferedReader reader = null;
InputStream inputStream = null;
try {
final URL url = new URL(urlAsString);
inputStream = url.openStream();
reader =
new BufferedReader(new InputStreamReader(inputStream));
String inputLine;
while ((inputLine = reader.readLine()) != null) {
content.append(inputLine);
}
} catch (final IOException exception) {
exception.printStackTrace();
} finally {
IOUtils.closeQuietly(reader);
IOUtils.closeQuietly(inputStream);
}
return content.toString();
}
I changed the line
new BufferedReader(new InputStreamReader(inputStream));
to
new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));
and then it worked.
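A compact variant of the same fix, as a sketch: the reader is given the charset explicitly and closed by try-with-resources. The class name, the unchecked exception wrapping, and the in-memory test bytes are assumptions for illustration; in the real program the stream would come from url.openStream(). It also assumes the JVM provides the windows-1251 charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ExplicitCharsetReader {
    // Reads all lines from the stream, decoding bytes with the given charset
    // instead of the platform default.
    static String readContent(InputStream in, Charset charset) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // The Russian word "privet" encoded in Windows-1251 (hypothetical page content).
        byte[] page = {(byte) 0xEF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        String text = readContent(new ByteArrayInputStream(page), Charset.forName("windows-1251"));
        System.out.println(text.equals("\u043f\u0440\u0438\u0432\u0435\u0442")); // true
    }
}
```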
(In the light of updates I deleted my original answer and started again)
The text which appears
пїЅпїЅпїЅпїЅпїЅпїЅ
is an accurate decoding of these byte values
-17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67
(Padded at either end with 32, which is space.)
So either
1) The text is garbage or
2) The text is supposed to look like that or
3) The encoding is not Windows-1251
This line is notably wrong
return new String(windows1251String.getBytes(), "UTF-8");
Extracting the bytes out of a string and constructing a new string from that is not a way of "converting" between encodings. Both the input String and the output String use UTF-16 encoding internally (and you don't normally even need to know or care about that). The only times other encodings come into play are when text data is stored OUTSIDE of a string object - ie in your initial byte array. Conversion occurs when the String is constructed and then it is done. There is no conversion from one String type to another - they are all the same.
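To make the point concrete, here is a hedged sketch of what a conversion between encodings looks like when it is done on the bytes rather than on a String: decode with the source charset, then encode with the target charset. The class and method names are illustrative, and the code assumes the JVM provides the windows-1251 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcode {
    // Decode the bytes using the encoding they are actually in,
    // then encode the resulting text in the target encoding.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Six Cyrillic letters, one byte each in Windows-1251.
        byte[] source = {(byte) 0xEF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        byte[] utf8 = transcode(source, cp1251, StandardCharsets.UTF_8);
        // Each of these letters needs two bytes in UTF-8.
        System.out.println(utf8.length); // 12
    }
}
```

If the bytes have already been decoded with the wrong charset and replaced by U+FFFD, as in the array from the question, the information is gone and no later conversion can restore it.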
The fact that this
return new String(bytes);
does the same as this
return new String(bytes, "Windows-1251");
suggests that Windows-1251 is the platform's default encoding. (Which is further supported by your timezone being MSK)
Just to make sure you understand 100% how Java deals with char and byte.
byte[] input = new byte[1];
// values > 127 become negative when you store them in a byte.
input[0] = (byte)239; // the array contains value -17 now.
// but all 256 values are preserved.
// If you convert them to integers, you should use their unsigned value
// (casting alone isn't enough).
int output = input[0] & 0xFF; // output is 239 again
// you shouldn't cast directly from a single byte to a char,
// because char is 16-bit but you only want to use 1 byte; the sign of a negative value is extended into the high byte and breaks it.
char corrupted = (char) input[0]; // char-code: 65519 (2 bytes are used)
char corruptedViaInt = (char) ((int) input[0]); // char-code: 65519 as well (2 bytes are used)
// just casting to an integer/character is ok for values <= 0x7F though
// values <= 0x7F are always positive, even when cast to byte
// AND the first 7 bits in any ASCII-based encoding (e.g. windows-1251) are identical.
byte simple = (byte) 'a';
char chr = (char) simple; // will result in 'a' again
// But it's still more reliable to use the & 0xFF conversion,
// because it ensures that your character can never be greater than char code 255 (a single byte), even when the byte is unexpectedly negative (> 0x7F).
char masked = (char) (simple & 0xFF); // also results in 'a'
// for value 239 (which is 0xEF) that doesn't work though.
// a Java char is 16-bit encoded internally, following the Unicode character set.
// characters 0x00 to 0x7F are identical in most encodings,
// but e.g. 0xEF in windows-1251 does not match 0xEF in UTF-16.
// so, this is a bad idea.
char mismatched = (char) (input[0] & 0xFF);
// And that's something you can only fix by using encodings.
// It's good practice to really just ALWAYS specify the encoding.
// the encoding indicates what your byte[] is encoded in NOW.
// your bytes will be converted to 16-bit characters.
String text = new String(input, "from-encoding");
// if you want to change that text back to bytes, use an encoding !!
// this time the encoding specified is the TARGET encoding.
byte[] encoded = text.getBytes("to-encoding");
I hope this helps.
As for the displayed values:
I can confirm that the byte[] is displayed correctly. I checked them in the Windows-1251 code page. (byte -17 = int 239 = 0xEF = char 'п')
In other words, your byte values are incorrect, or it's a different source-encoding.
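The sign-extension behaviour described in the comments above can be reproduced with a few lines. This is only an illustrative sketch; the class name and the unsigned helper are not from the original answer.

```java
class ByteToChar {
    // Interprets the byte as an unsigned value in the range 0..255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 239;
        System.out.println(b);                        // -17
        System.out.println(unsigned(b));              // 239
        // A direct cast sign-extends first: -17 becomes 0xFFEF.
        System.out.println((int) (char) b);           // 65519
        // Masking keeps the value at 239, which is U+00EF in Unicode,
        // not the Cyrillic letter that 0xEF stands for in Windows-1251.
        System.out.println((int) (char) unsigned(b)); // 239
    }
}
```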
If I convert a byte to a char in Java, I have the following problem:
In NetBeans everything is OK, but if I run the program from the Windows cmd I have a problem with the charset. I don't know why.
What should I do?
Code:
char tmp = (byte) charFromByteInt; // it's byte in int variable
Judging by your output image, it's a charset problem.
The following snippet returns the same byte values for your input string with both charsets.
String testString = "TEST for stackoverflow";
byte[] bytes = testString.getBytes(StandardCharsets.UTF_8);
System.out.println("bytes = " + Arrays.toString(bytes));
bytes = testString.getBytes(StandardCharsets.ISO_8859_1);
System.out.println("bytes = " + Arrays.toString(bytes));
output
bytes = [84, 69, 83, 84, 32, 102, 111, 114, 32, 115, 116, 97, 99, 107, \
111, 118, 101, 114, 102, 108, 111, 119]
As your output in the console contains more characters than the input string, you should check the source (from where you take the byte / int values) of the output.
I did a POC using the Apache Commons Codec Base64 library, where I hashed a string using SHA. (This can be ignored.)
Step 1 - I printed byte array for that string.
Step 2 - Encoded the byte array and printed its value.
Step 3 - Decoded the encoded value and printed it.
public static void main(String[] args)
{
MessageDigest messageDigest = null;
String ALGORITHM = "SHA";
try
{
messageDigest = MessageDigest.getInstance(ALGORITHM);
byte[] arr = "admin1!".getBytes();
byte[] arr2 = messageDigest.digest(arr);
System.out.println(arr2);
String encoded = Base64.encodeBase64String(arr2);
System.out.println(encoded);
byte[] decoded = Base64.decodeBase64(encoded);
System.out.println(decoded);
}
catch (NoSuchAlgorithmException e)
{
e.printStackTrace();
}
}
Expected result : Step 1 and Step 3 should produce same output. But I am not getting that.
Output :
[B@5ca801b0
90HMfRqqpfwRJge0anZat98BTdI=
[B@68d448a1
Your program is all good and fine. Just one mistake.
System.out.println(byteArray); prints the array's type and identity hash code (for example [B@5ca801b0), not its contents. (Note: arrays are objects in Java, not primitive types.)
You should use System.out.println(Arrays.toString(byteArray)); instead and you will get same value for both steps 1 and 3.
As per javadocs Arrays.toString(byte[] a) returns a string representation of the contents of the specified array.
Your code after changes will be :
public static void main(String[] args)
{
MessageDigest messageDigest = null;
String ALGORITHM = "SHA";
try
{
messageDigest = MessageDigest.getInstance(ALGORITHM);
byte[] arr = "admin1!".getBytes();
byte[] arr2 = messageDigest.digest(arr);
System.out.println(Arrays.toString(arr2));
String encoded = Base64.encodeBase64String(arr2);
System.out.println(encoded);
byte[] decoded = Base64.decodeBase64(encoded);
System.out.println(Arrays.toString(decoded));
}
catch (NoSuchAlgorithmException e)
{
e.printStackTrace();
}
}
and output will be :
[-9, 65, -52, 125, 26, -86, -91, -4, 17, 38, 7, -76, 106, 118, 90, -73, -33, 1, 77, -46]
90HMfRqqpfwRJge0anZat98BTdI=
[-9, 65, -52, 125, 26, -86, -91, -4, 17, 38, 7, -76, 106, 118, 90, -73, -33, 1, 77, -46]
Note that the byte array values are the same.
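Instead of comparing the printed arrays by eye, the check can be done in code with Arrays.equals. The sketch below swaps in the JDK's own java.util.Base64 (Java 8+) for the Commons Codec class used above; the class and helper names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;

class DigestRoundTrip {
    // SHA-1 digest of the UTF-8 bytes of the input.
    static byte[] sha1(String input) {
        try {
            return MessageDigest.getInstance("SHA-1")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] digest = sha1("admin1!");
        String encoded = Base64.getEncoder().encodeToString(digest);
        byte[] decoded = Base64.getDecoder().decode(encoded);

        System.out.println(digest);                         // [B@... (type and identity hash, not contents)
        System.out.println(Arrays.toString(digest));        // the actual byte values
        System.out.println(Arrays.equals(digest, decoded)); // true
    }
}
```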
I am currently facing an error called Bad Base64Coder input character at ...
Here is my code in java.
String nonce2 = strNONCE;
byte[] nonceBytes1 = Base64Coder.decode(nonce2);
System.out.println("nonceByte1 value : " + nonceBytes1);
The problem now is I get Bad Base64Coder input character error and the nonceBytes1 value is printed as null. I am trying to decode the nonce2 from Base64Coder. My strNONCE value is 16
/** Generating nonce value */
public static String generateNonce() {
try {
byte[] nonce = new byte[16];
Random rand;
rand = SecureRandom.getInstance ("SHA1PRNG");
rand.nextBytes(nonce);
//convert byte array to string.
strNONCE = new String(nonce);
}catch (NoSuchAlgorithmException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return strNONCE;
}
//convert byte array to string.
strNONCE = new String(nonce);
That is not going to work. You need to base64 encode it.
strNONCE = Base64Coder.encode(nonce);
It simply looks like you're confusing some independent concepts and are pretty new to Java as well. Base64 is a type of encoding which converts "human unreadable" byte arrays into "human readable" strings (encoding) and the other way round (decoding). It is usually used to transfer or store binary data as characters where that is strictly required (due to the protocol or the storage type).
SecureRandom is not an encoder or decoder. It returns a random value which is in no way correlated with a certain cipher or encoder. Here are some extracts from the links given above:
ran·dom
adj.
1. Having no specific pattern, purpose, or objective
Cipher
In cryptography, a cipher (or cypher) is an algorithm for performing encryption or decryption — a series of well-defined steps that can be followed as a procedure.
Encoding
Encoding is the process of transforming information from one format into another. The opposite operation is called decoding.
I'd strongly recommend that you sort those concepts out for yourself (click the links to learn more about them) and not lump them all together. Here's at least an SSCCE which shows how you can properly encode/decode a (random) byte array using base64 (and how to show arrays as strings, a human-readable format):
package com.stackoverflow.q2535542;
import java.security.SecureRandom;
import java.util.Arrays;
import org.apache.commons.codec.binary.Base64;
public class Test {
public static void main(String[] args) throws Exception {
// Generate random bytes and show them.
byte[] bytes = new byte[16];
SecureRandom.getInstance("SHA1PRNG").nextBytes(bytes);
System.out.println(Arrays.toString(bytes));
// Base64-encode bytes and show them.
String base64String = Base64.encodeBase64String(bytes);
System.out.println(base64String);
// Base64-decode string and show bytes.
byte[] decoded = Base64.decodeBase64(base64String);
System.out.println(Arrays.toString(decoded));
}
}
(using Commons Codec Base64 by the way)
Here's an example of the output:
[14, 52, -34, -74, -6, 72, -127, 62, -37, 45, 55, -38, -72, -3, 123, 23]
DjTetvpIgT7bLTfauP17Fw==
[14, 52, -34, -74, -6, 72, -127, 62, -37, 45, 55, -38, -72, -3, 123, 23]
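If Commons Codec is not on the classpath, the same idea can be sketched with the JDK's java.util.Base64 (Java 8+). The class below is a hypothetical rewrite of the nonce generator from the question, not the asker's actual code: it returns the nonce as Base64 text, so decoding it later yields the original random bytes.

```java
import java.security.SecureRandom;
import java.util.Base64;

class NonceGenerator {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Generates numBytes random bytes and returns them as Base64 text.
    static String generateNonce(int numBytes) {
        byte[] nonce = new byte[numBytes];
        RANDOM.nextBytes(nonce);
        return Base64.getEncoder().encodeToString(nonce);
    }

    // Recovers the raw nonce bytes from the Base64 text.
    static byte[] decodeNonce(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        String nonce = generateNonce(16);
        System.out.println(nonce.length());              // 24 characters for 16 bytes
        System.out.println(decodeNonce(nonce).length);   // 16
    }
}
```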
A base64 encoded string would only have printable characters in it. You're generating strNONCE directly from random bytes, so it will have non-printable characters in it.
What exactly is it you're trying to do?