How does encoding/decoding bytes work in Java? - java

Little background: I'm doing cryptopals challenges and I finished https://cryptopals.com/sets/1/challenges/1 but realized I didn't learn what I guess is meant to be learned (or coded).
I'm using the Apache Commons Codec library for Hex and Base64 encoding/decoding. The goal is to decode the hex string and re-encode it to Base64. The "hint" at the bottom of the page says "Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing."
Here's my answer...
private static Hex forHex = new Hex();
private static Base64 forBase64 = new Base64();
public static byte[] hexDecode(String hex) throws DecoderException {
byte[] rawBytes = forHex.decode(hex.getBytes());
return rawBytes;
}
public static byte[] encodeB64(byte[] bytes) {
byte[] base64Bytes = forBase64.encode(bytes);
return base64Bytes;
}
public static void main(String[] args) throws DecoderException {
String hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
//decode hex String to byte[]
byte[] myHexDecoded = hexDecode(hex);
String myHexDecodedString = new String(myHexDecoded);
//Lyrics from Queen's "Under Pressure"
System.out.println(myHexDecodedString);
//encode myHexDecoded to Base64 encoded byte[]
byte[] myHexEncoded = encodeB64(myHexDecoded);
String myB64String = new String(myHexEncoded);
//"pretty printing" of base64
System.out.println(myB64String);
}
...but I feel like I cheated. I didn't learn how to decode bytes that were encoded as hex, and I didn't learn how to encode "pure" bytes to Base64, I just learned how to use a library to do something for me.
If I were to take a String in Java then get its bytes, how would I encode those bytes into hex? For example, the following code snip turns "Hello" (which is readable English) to the byte value of each character:
String s = "Hello";
char[] sChar = s.toCharArray();
byte[] sByte = new byte[sChar.length]
for(int i = 0; i < sChar.length; i++) {
sByte[i] = (byte) sChar[i];
System.out.println("sByte[" + i + "] = " +sByte[i]);
}
which yields sByte[0] = 72, sByte[1] = 101, sByte[2] = 108, sByte[3] = 108, sByte[4] = 111
Lets use 'o' as an example - I am guessing its decimal version is 111 - do I just take its decimal version and change that to its hex version?
If so, to decode, do I just take the the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?

to decode, do I just take the the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
No. You take the characters 2 at a time, transform the character '0' to the numeric value 0, the character '1' to the numeric value 1, ..., the character 'a' (or 'A', depending on which encoding you want to support) to the numeric value 10, ..., the character 'f' or 'F' to the numeric value 15.
Then you multiply the first numeric value by 16, and you add it to the second numeric value to get the unsigned integer value of your byte. Then you transform that unsigned integer value to a signed byte.
ASCII has nothing to do with this algorithm.
To see how it's done in practice, since commons-codec is open-source, you can just look at its implementation.

Related

Java: how to convert unicode string Emoji to Integer

I received an unicode string to contain Emoji code, example: "U+1F44F" (from Emoji table : http://apps.timwhitlock.info/emoji/tables/unicode).
I want to convert this string to an Integer how can I do that ?
I tried this, but it crashs:
int hex = Integer.parseInt(unicodeStr, 16);
Thanks guys!
The comment of #flakes gives the correct answere. The U+ only indicates that the following codepoint (or hex number) is a Unicode. The value you want to convert into an Integer is the codepoint, so you have to omit the 2 first characters with .substring(2)
You wil obtain the following code:
int hex = Integer.parseInt(unicodeStr.substring(2), 16);
Unicode numbers such "characters," code points, upto the 3 byte range, such as U+1F44F.
Java String has a constructor with code points.
int[] codepoints = { 0x1F44F };
String s = new String(codepoints, 0, codepoints.length);
public static String fromCodepoints(int... codepoints) {
return new String(codepoints, 0, codepoints.length);
}
s = fromCodepoints(0x1F44F, 0x102);
Java String contains Unicode as an internal array of chars. Every char '(2 bytes) being UTF-16 encoded. For lower ranges a char can be a code point. And U+0102 could be written as "\u0102" containing the char '\u0102'.
Note that emoji must be representable in the font.
Font font = ...
if (!font.canDisplay(0x1F44F)) {
...
}

Will the result of String.getBytes() ever contain zeros?

I have tried numerous Strings with random characters, and except empty string "", their .getBytes() byte arrays seem to never contain any 0 values (like {123, -23, 54, 0, -92}).
Is it always the case that their .getBytes() byte arrays always contain no nero except an empty string?
Edit: the previous test code is as follows. Now I learned that in Java 8 the result seems always "contains no 0" if the String is made up of (char) random.nextInt(65535) + 1; and "contains 0" if the String contains (char) 0.
private static String randomString(int length){
Random random = new Random();
char[] chars = new char[length];
for (int i = 0; i < length; i++){
int integer = random.nextInt(65535) + 1;
chars[i] = (char) (integer);
}
return new String(chars);
}
public static void main(String[] args) throws Exception {
for (int i = 1; i < 100000; i++){
String s1 = randomString(10);
byte[] bytes = s1.getBytes();
for (byte b : bytes) {
if (b == 0){
System.out.println("contains 0");
System.exit(0);
}
}
}
System.out.println("contains no 0");
}
It does depend on your platform local encoding. But in many encodings, the '\0' (null) character will result in getBytes() returning an array with a zero in it.
System.out.println("\0".getBytes()[0]);
This will work with the US-ASCII, ISO-8859-1 and the UTF-8 encodings:
System.out.println("\0".getBytes("US-ASCII")[0]);
System.out.println("\0".getBytes("ISO-8859-1")[0]);
System.out.println("\0".getBytes("UTF-8")[0]);
If you have a byte array and you want the string that corresponds to it, you can also do the reverse:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b);
However this will give different results for different encodings, and in some encodings it may be an invalid sequence.
And the characters in it may not be printable.
Your best bet is the ISO-8859-1 encoding, only the null character cannot be printed:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b, "ISO-8859-1");
System.out.println(s);
System.out.println((int) s.charAt(3));
Edit
In the code that you posted, it's also easy to get "contains 0" if you specify the UTF-16 encoding:
byte[] bytes = s1.getBytes("UTF-16");
It's all about encoding, and you haven't specified it. When you haven't passed it as an argument to the getBytes method, it takes your platform default encoding.
To find out what that is on your platform, run this:
System.out.println(System.getProperty("file.encoding"));
On MacOS, it's UTF-8; on Windows it's likely to be one of the Windows codepages like Cp-1252. You can also specify the platform default on the command line when you run Java:
java -Dfile.encoding=UTF16 <the rest>
If you run your code that way you'll also see that it contains 0.
Is it always the case that their .getBytes() byte arrays always contain no nero except an empty string?
No, there is no such guarantee. First, and most importantly, .getBytes() returns "a sequence of bytes using the platform's default charset". As such there is nothing preventing you from defining your own custom charset that explicitly encodes certain values as 0s.
More practically, many common encodings will include zero-bytes, notably to represent the NUL character. But even if your strings don't include NUL's its possible for the byte sequence to include 0s. In particular UTF-16 (which Java uses internally) represents all characters in two bytes, meaning ASCII characters (which only need one) are paired with a 0 byte.
You could also very easily test this yourself by trying to construct a String from a sequence of bytes containing 0s with an appropriate constructor, such as String(byte[] bytes) or String(byte[] bytes, Charset charset). For example (notice my system's default charset is UTF-8):
System.out.println("Default encoding: " + System.getProperty("file.encoding"));
System.out.println("Empty string: " + Arrays.toString("".getBytes()));
System.out.println("NUL char: " + Arrays.toString("\0".getBytes()));
System.out.println("String constructed from {0} array: " +
Arrays.toString(new String(new byte[]{0}).getBytes()));
System.out.println("'a' in UTF-16: " +
Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));
prints:
Default encoding: UTF-8
Empty string: []
NUL char: [0]
String constructed from {0} array: [0]
'a' in UTF-16: [-2, -1, 0, 97]

Why new String with UTF-8 contains more bytes

byte bytes[] = new byte[16];
random.nextBytes(bytes);
try {
return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
log.warn("Hash generation failed", e);
}
When I generate a String with given method, and when i apply string.getBytes().length it returns some other value. Max was 32. Why a 16 byte array ends up generating a another size byte string ?
But if i do string.length() it returns 16.
This is because your bytes are first converted to Unicode string, which attempts to create UTF-8 char sequence from these bytes. If a byte cannot be treated as ASCII char nor captured with next byte(s) to form legal unicode char, it is replaced by "�". Such char is transformed into 3 bytes when calling String#getBytes(), thus adding 2 extra bytes to resulting output.
If you're lucky to generate ASCII chars only, String#getBytes() will return 16-byte array, if no, resulting array may be longer. For example, the following code snippet:
byte[] b = new byte[16];
Arrays.fill(b, (byte) 190);
b = new String(b, "UTF-8").getBytes();
returns array of 48(!) bytes long.
Classical mistake born from the misunderstanding of the relationship between bytes and chars, so here we go again.
There is no 1-to-1 mapping between byte and char; it all depends on the character coding you use (in Java, that is a Charset).
Worse: given a byte sequence, it may or may not be encoded to a char sequence.
Try this for instance:
final byte[] buf = new byte[16];
new Random().nextBytes(buf);
final Charset utf8 = StandardCharsets.UTF_8;
final CharsetDecoder decoder = utf8.newDecoder()
.onMalformedInput(CodingErrorAction.REPORT);
decoder.decode(ByteBuffer.wrap(buf));
This is very likely to throw a MalformedInputException.
I know this is not exactly an answer but then you didn't clearly explain your problem; and the example above shows already that you have the wrong understanding between what a byte is and what a char is.
The generated bytes might contain valid multibyte characters.
Take this as example. The string contains only one character, but as byte representation it take three bytes.
String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));
String.length() return the length of the string in characters. The character Ω is one character whereas it's a 3 byte long in UTF-8.
If you change your code like this
Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());
The same bytes are interpreted with a different charset. And following the javadoc from String(byte[] b, String charset)
The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
If you look at the string you're producing, most of the random bytes you're generating do not form valid UTF-8 characters. The String constructor, therefore, replaces them with the unicode 'REPLACEMENT CHARACTER' �, which takes up 3 bytes, 0xFFFD.
As an example:
public static void main(String[] args) throws UnsupportedEncodingException
{
Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
printBytes(bytes);
final String s = new String(bytes, "UTF-8");
System.out.println(s);
printCharacters(s);
}
private static void printBytes(byte[] bytes)
{
for (byte aByte : bytes)
{
System.out.print(
Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
}
System.out.println();
}
private static void printCharacters(String s)
{
s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}
On a given run, I got this output:
30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c
0A��2�8��#JT&͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE
String.getBytes().length is likely to be longer, as it counts bytes needed to represent the string, while length() counts 2-byte code units.
read more here
This will try to create a String assuming the bytes are in UTF-8.
new String(bytes, "UTF-8");
This in general will go horribly wrong as UTF-8 multi-byte sequences can be invalid.
Like:
String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);
The second step:
byte[] bytes = s.getBytes();
will use the platform encoding (System.getProperty("file.encoding")). Better specify it.
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
One should realize, internally String will maintain Unicode, an array of 16-bit char in UTF-16.
One should entirely abstain from using String for byte[]. It will always involve a conversion, cost double memory and be error prone.

Converting string to binary and back again does not give the same string

I'm writing a Simplified DES algorithm to encrypt and subsequently decrypt a string. Suppose I have the initial character ( which has the binary value 00101000 which I get using the following algorithm:
public void getBinary() throws UnsupportedEncodingException {
byte[] plaintextBinary = text.getBytes("UTF-8");
for(byte b : plaintextBinary){
int val = b;
int[] tempBinRep = new int[8];
for(int i = 0; i<8; i++){
tempBinRep[i] = (val & 128) == 0 ? 0 : 1;
val <<= 1;
}
binaryRepresentations.add(tempBinRep);
}
}
After I perform the various permutations and shifts, ( and it's binary equivalent is transformed into 10001010 and it's ASCII equivalent Š. When I come around to decryption I pass the same character through the getBinary() method I now get the binary string 11000010 and another binary string 10001010 which translates into ASCII as x(.
Where is this rogue x coming from?
Edit: The full class can be found here.
You haven't supplied the decrypting code, so we can't know for sure, but I would guess you missed the encoding either when populating your String. Java Strings are encoded in UTF-16 by default. Since you're forcing UTF-8 when encrypting, I'm assuming you're doing the same when decrypting. The problem is, when you convert your encrypted bytes to a String for storage, if you let it default to UTF-16, you're probably ending up with a two-byte character because the 10001010 is 138, which is beyond the 127 range for ASCII charaters that get represented with a single byte.
So the "x" you're getting is the byte for the code page, followed by the actual character's byte. As suggested in the comments, you'd do better to just store the encrypted bytes as bytes, and not convert them to Strings until they're decrypted.

Convert Unicode to UTF-8 byte[] and save into string (Java)

I need to convert the character unicode to a byte[] representation and save into Srting, for example
U+1F601 -> \xF0\x9F\x98\x81
I dont have idea how can i do it this..
Anyone has idea?Thanks
int[] codepoints = { 0x1F601 }; // U+1F601
String s = new String(codepoints, 0, codepoints.length);
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); // As UTF-8 (Unicode) bytes
System.out.println(Arrays.toString(bytes));
So one first coposes the Unicode code points into a java String. Java Strings hold Unicode.
When one wants bytes, say in UTF-8 - a Unicode representation -, then one has to indicate the CharSet in which the bytes will be.

Categories