Converting bytes with Charset results in diamonds at end of string? - java

I am currently storing a String as an array of bytes. However, when I try to use the following code to convert the bytes back to a String using Charset, I have diamonds at the end:
byte[] testbytes = "abc123".getBytes(); // tried getBytes("UTF-8"/StandardCharsets.UTF_8) too
Charset charset = Charset.forName("UTF-8"); // ISO-8859-1 has no diamonds
CharBuffer charBuffer = charset.decode( ByteBuffer.wrap( Arrays.copyOfRange(testbytes,0,testbytes.length) ) );
System.out.println("converted = " + String.valueOf(charBuffer.array()) );
// returns this - abc123����������
If I set the encoding to ISO-8859-1 instead, it converts fine. I thought it might be the encoding of the source code file but opening that in Notepad++ suggests it is also in UTF-8.
Am I missing something or is this just a problem with Android Studio's Logcat window?
- Edit 1 -
Further testing shows that 3-character strings do not have this padding-at-the-end problem. With longer strings, Charset.decode seems to pad out the char array with '\u0000' values up to some break point.
String.valueOf ends up printing the padded characters as diamonds, while creating a new String object removes the padding. However, I would like to avoid using String at all to convert a byte array to a char array, because the values are sensitive.
- Edit 2 -
It appears the above happens if you call charset.decode() again, so I'm guessing there is a buffer being appended to, but I'm not sure at what point. I tried clearing it with charBuffer.clear(), but the second block of code's output appears to be the same, i.e. 3 chars + 2 spaces + 6 new chars.
String test1 = "123";
byte[] test1b = test1.getBytes();
char[] expected1 = test1.toCharArray();
CharBuffer charBuffer = charset.decode( ByteBuffer.wrap( test1b ) );
char[] actual1 = charBuffer.array(); // size 3, correct
String test2 = "123456";
byte[] test2b = test2.getBytes();
char[] expected2 = test2.toCharArray();
CharBuffer charBuffer2 = charset.decode( ByteBuffer.wrap( test2b ) );
char[] actual2 = charBuffer2.array(); // size 11, padded with '\u0000'

Did you try to use the String constructor that receives an array of bytes?
Like:
byte[] testbytes = "abc123".getBytes(StandardCharsets.UTF_8);
String stringDecoded = new String(testbytes, StandardCharsets.UTF_8);
Maybe it can solve your problem.
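If the goal is to avoid creating a String at all (as noted in Edit 1), keep in mind that CharBuffer.array() returns the whole backing array, whose capacity can be larger than the number of decoded characters; the decoded content only runs up to the buffer's limit, and the unused capacity is what shows up as '\u0000' padding. A minimal sketch that copies just the decoded characters and then wipes the buffer (the helper name decodeToChars is only for illustration):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

static char[] decodeToChars(byte[] bytes) {
    CharBuffer charBuffer = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(bytes));
    // Copy only the decoded region (position..limit), not the whole backing array.
    char[] chars = new char[charBuffer.remaining()];
    charBuffer.get(chars);
    // Overwrite the backing array so the sensitive data does not linger in the buffer.
    Arrays.fill(charBuffer.array(), '\u0000');
    return chars;
}
Printing the returned array (for example with System.out.println(decodeToChars(testbytes))) should then show abc123 with no trailing diamonds.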

Related

When encoding a UTF-8 string with a 3-byte character, how come the length only increases by 1

So, to make sure I understand encoding correctly, I wrote a sample test:
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class TestEncoding {

    public static void main(String[] args) throws UnsupportedEncodingException {
        TestEncoding testEncoding = new TestEncoding();
        testEncoding.isLengthDifferenceBetweenUTF16UTF32();
    }

    private void isLengthDifferenceBetweenUTF16UTF32() throws UnsupportedEncodingException {
        String eightBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8));
        String sixteenBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_16LE));
        String thirtyTwoBitString = new String("Hi how are you?ࢤ".getBytes("UTF-32"));
        System.err.println("8 bit: " + eightBitString.length());
        System.err.println("16 bit: " + sixteenBitString.length());
        System.err.println("32 bit: " + thirtyTwoBitString.length());
    }
}
And then for output I get:
8 bit: 16
16 bit: 32
32 bit: 64
My question is: why didn't the special character ࢤ at the end of "Hi how are you?" make the length 18, i.e. 15 for "Hi how are you?" plus 3 for the special character?
String eightBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8));
String sixteenBitString = new String("Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_16LE));
String thirtyTwoBitString= new String("Hi how are you?ࢤ".getBytes("UTF-32"));
These lines take a string, convert it to bytes in the specified charset, and then convert those bytes back to a string using the JVM's default charset.
It depends on what that default is, but it is possible that the byte sequences are not valid in the default charset. In that case, the strings will contain placeholder chars for the invalid sequences. They look like � in the resulting string, which is a single char.
For example, if the default charset is UTF-8, these 3 strings are:
Hi how are you?ࢤ
Hi how are you?�
Hi how are you?�
If you want to compare the lengths of the byte representations in those charsets, don't convert back to strings:
byte[] eightBit = "Hi how are you?ࢤ".getBytes(StandardCharsets.UTF_8);
System.out.println(eightBit.length);
etc.
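A minimal sketch of that comparison, assuming ࢤ is the same 3-byte-in-UTF-8 character used in the question:
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) throws Exception {
        String s = "Hi how are you?ࢤ";
        System.out.println("UTF-8:    " + s.getBytes(StandardCharsets.UTF_8).length);    // 18 = 15 ASCII bytes + 3
        System.out.println("UTF-16LE: " + s.getBytes(StandardCharsets.UTF_16LE).length); // 32 = 16 chars * 2
        System.out.println("UTF-32:   " + s.getBytes("UTF-32").length);                  // 64 = 16 chars * 4
    }
}
The byte lengths differ per charset, but the string itself stays 16 chars; the 18 the question expected only appears when counting UTF-8 bytes, not String.length().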

Problem in putting string in bytebuffer java

When I put a string into a ByteBuffer, it adds some unknown characters to it.
Here is my code:
String request="HELLO";
ByteBuffer buffer=ByteBuffer.allocate(1024);
buffer.clear();
buffer.put(request.getBytes());
buffer.flip();
When I convert it to a string I get the following result: HELLO��������
The way I convert ByteBuffer to string is below:
new String(buffer.array())
When creating the string, you didn't take into account that only some of the bytes in the buffer hold valid data. The first 5 bytes contain "HELLO" encoded in the default charset; the rest are filled with zeros.
To convert a byte buffer to a string, use the Charset class:
CharBuffer cb = Charset.defaultCharset().decode(buffer);
String str = cb.toString();
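Putting that together with the original snippet, a minimal end-to-end sketch; decode reads from the buffer's position up to its limit, so only the five valid bytes are converted (UTF-8 is used explicitly here rather than the default charset):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

String request = "HELLO";
ByteBuffer buffer = ByteBuffer.allocate(1024);
buffer.put(request.getBytes(StandardCharsets.UTF_8));
buffer.flip(); // position = 0, limit = 5: only the written bytes are readable
String str = StandardCharsets.UTF_8.decode(buffer).toString();
System.out.println(str); // HELLO, with no trailing characters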

Java convert unicode code point to string

How can a UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted to a String in Java?
I have tried something like:
Character.toCodePoint((char) Integer.parseInt("D0", 16), (char) Integer.parseInt("93", 16));
but it does not convert to a valid code point.
That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).
Update
Here is a slightly more direct version of decoding the string, than provided by "sstan" in another answer. Of course both versions are good, so use whichever makes you more comfortable, or write your own version.
String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j += 3) {
    assert src.charAt(j) == '=';
    bytes[i] = (byte) (Character.digit(src.charAt(j + 1), 16) << 4 |
                       Character.digit(src.charAt(j + 2), 16));
}
String str = new String(bytes, StandardCharsets.UTF_8);
System.out.println(str);
Output
Газета
In UTF-8, a single character is not always encoded with the same amount of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes to be encoded. Therefore, it's definitely not a trivial matter to try to map UTF-8 bytes yourself to a Java char which uses UTF-16 encoding, where each char is encoded using 2 bytes. Not to mention that, depending on the character (code point > 0xffff), you may also have to worry about dealing with surrogate characters, which is just one more complication that you can easily get wrong.
All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.
Here is some sample code that shows one way this can be achieved:
public static void main(String[] args) throws Exception {
    String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";
    // Parse string into hex string tokens.
    String[] tokens = Arrays.stream(src.split("="))
            .filter(s -> s.length() != 0)
            .toArray(String[]::new);
    // Convert the hex string representations to a byte array.
    byte[] utf8bytes = new byte[tokens.length];
    for (int i = 0; i < utf8bytes.length; i++) {
        utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
    }
    // Convert UTF-8 bytes to Java String.
    String str = new String(utf8bytes, StandardCharsets.UTF_8);
    // Display string + individual unicode code points.
    System.out.println(str);
    str.codePoints().forEach(System.out::println);
}
Output:
Газета
1043
1072
1079
1077
1090
1072

random characters in byte to string conversion java

I am converting a byte[] into a String. Every time I convert the byte array to a String, it has a strange prefix character in front of it. I have tried different characters, uppercase, etc., and it still has the prefix.
When I write the bytes to system output, it still has the character.
System.out.write(theByteArray);
System.out.println(new String(theByteArray, "UTF-8"));
When I write the text to a file, it seems like the byte array printed flawlessly, but then I scan it and end up with the weird prefix symbol...
Text to be encrypted >
"aaaa"
Text when decrypted and converted to a string >
"aaaa"
The character seems to disappear when pasted here, so here is an image of it.
I am wanting to compare the given string to another string, kind of like decrypting a password, and comparing it to a database. If one matches, then it gives access.
Here is the code that generates this byte array.
Keep in mind, the byte array I am looking at is decData, and this is NOT my code.
byte[] encData;
byte[] decData;
File inFile = new File(fileName+ ".encrypted");
//Generate the cipher using pass:
Cipher cipher = FileEncryptor.makeCipher(pass, false);
//Read in the file:
FileInputStream inStream = new FileInputStream(inFile);
encData = new byte[(int)inFile.length()];
inStream.read(encData);
inStream.close();
//Decrypt the file data:
decData = cipher.doFinal(encData);
//Figure out how much padding to remove
int padCount = (int)decData[decData.length - 1];
//Naive check, will fail if plaintext file actually contained
//this at the end
//For robust check, check that padCount bytes at the end have same value
if (padCount >= 1 && padCount <= 8) {
    decData = Arrays.copyOfRange(decData, 0, decData.length - padCount);
}
FileOutputStream target = new FileOutputStream(new File(fileName + ".decrypted.txt"));
target.write(decData);
target.close();
It looks like the decrypted data (decData) contains a BOM, and I think Java, when reading in a stream with a BOM, will just treat the BOM as an ordinary character, which causes the "prefix". You can try the solution suggested here: Reading UTF-8 - BOM marker.
On the other hand, the byte order mark is optional and not recommended for UTF-8 encoding. So there are two questions to ask:
Is the original data encoded using UTF-8?
If it is, it might be worthwhile to find out how the BOM got into the original data in the first place.
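If the data does turn out to start with a UTF-8 BOM (the byte sequence 0xEF 0xBB 0xBF), a simple sketch for stripping it before building the String (the helper name stripUtf8Bom is only for illustration):
import java.util.Arrays;

static byte[] stripUtf8Bom(byte[] data) {
    // The UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF at the start of the data.
    if (data.length >= 3
            && data[0] == (byte) 0xEF
            && data[1] == (byte) 0xBB
            && data[2] == (byte) 0xBF) {
        return Arrays.copyOfRange(data, 3, data.length);
    }
    return data;
}
new String(stripUtf8Bom(decData), "UTF-8") should then produce the text without the prefix character.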

Remove Non-Ansi Chars from a UTF String and Keep Others

We have a Java lib accepting a UTF-8 string as input. But if there is any non-ANSI char in the input, the lib may crash. So we want to remove all non-ANSI chars from the string. How can that be done in Java?
Thanks,
Try this. I pulled it from here, so I haven't tested it:
// Create an encoder and decoder for the character encoding
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
CharsetEncoder encoder = charset.newEncoder();
// This line is the key to removing "unmappable" characters.
encoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
String result = inString;
try {
    // Convert the string to bytes in a ByteBuffer
    ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(inString));
    // Convert the bytes in the ByteBuffer back to a CharBuffer and then to a string.
    CharBuffer cbuf = decoder.decode(bbuf);
    result = cbuf.toString();
} catch (CharacterCodingException cce) {
    String errorMessage = "Exception during character encoding/decoding: " + cce.getMessage();
    cce.printStackTrace();
}
Take a look at String.codePointAt(index). That can give you the Unicode code point for a given character, and from there you could remove those outside your range.
How you handle the fact that a character has been removed is on your end, but keep in mind that the string you'll be sending to the library isn't necessarily the same as that provided by the client. This may or may not cause problems.
I'm not sure what you mean by ANSI here. Do you mean the Windows-1252 character encoding that people typically call ANSI? That's not ASCII and it's also not ISO-8859-1, so make sure you get your code pages correct.
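Building on the code point suggestion above, a sketch that keeps only code points up to U+007F (plain ASCII); the cutoff is an assumption, so adjust it if "ANSI" means Windows-1252 or ISO-8859-1 in your case:
// Keep only code points in the ASCII range; drop everything else.
static String keepAscii(String input) {
    StringBuilder sb = new StringBuilder(input.length());
    input.codePoints()
         .filter(cp -> cp <= 0x7F) // e.g. cp <= 0xFF would keep Latin-1 as well
         .forEach(sb::appendCodePoint);
    return sb.toString();
}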
