Will the result of String.getBytes() ever contain zeros? - java

I have tried numerous Strings with random characters, and except empty string "", their .getBytes() byte arrays seem to never contain any 0 values (like {123, -23, 54, 0, -92}).
Is it always the case that their .getBytes() byte arrays contain no zero, except for an empty string?
Edit: the previous test code is as follows. I have now learned that in Java 8 the result always seems to be "contains no 0" if the String is made up of (char) random.nextInt(65535) + 1, and "contains 0" if the String contains (char) 0.
private static String randomString(int length){
    Random random = new Random();
    char[] chars = new char[length];
    for (int i = 0; i < length; i++){
        int integer = random.nextInt(65535) + 1;
        chars[i] = (char) (integer);
    }
    return new String(chars);
}
public static void main(String[] args) throws Exception {
    for (int i = 1; i < 100000; i++){
        String s1 = randomString(10);
        byte[] bytes = s1.getBytes();
        for (byte b : bytes) {
            if (b == 0){
                System.out.println("contains 0");
                System.exit(0);
            }
        }
    }
    System.out.println("contains no 0");
}

It depends on your platform's default encoding. But in many encodings, the '\0' (null) character will result in getBytes() returning an array with a zero in it.
System.out.println("\0".getBytes()[0]);
This will work with the US-ASCII, ISO-8859-1 and the UTF-8 encodings:
System.out.println("\0".getBytes("US-ASCII")[0]);
System.out.println("\0".getBytes("ISO-8859-1")[0]);
System.out.println("\0".getBytes("UTF-8")[0]);
If you have a byte array and you want the string that corresponds to it, you can also do the reverse:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b);
However this will give different results for different encodings, and in some encodings it may be an invalid sequence.
And the characters in it may not be printable.
Your best bet is the ISO-8859-1 encoding, which maps every byte value to exactly one character; in this example only the null character cannot be printed:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b, "ISO-8859-1");
System.out.println(s);
System.out.println((int) s.charAt(3));
Edit
In the code that you posted, it's also easy to get "contains 0" if you specify the UTF-16 encoding:
byte[] bytes = s1.getBytes("UTF-16");
It's all about encoding, and you haven't specified it. When you haven't passed it as an argument to the getBytes method, it takes your platform default encoding.
To find out what that is on your platform, run this:
System.out.println(System.getProperty("file.encoding"));
On macOS, it's UTF-8; on Windows it's likely to be one of the Windows code pages, such as Cp1252. You can also specify the platform default on the command line when you run Java:
java -Dfile.encoding=UTF16 <the rest>
If you run your code that way you'll also see that it contains 0.
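To take the platform default out of the picture, the check can be run against explicitly named charsets. The following is only a minimal sketch; the class name ZeroByteCheck and the helper containsZero are made up for illustration and are not part of the question's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class ZeroByteCheck {

    /** Returns true if encoding s with the given charset yields at least one 0x00 byte. */
    static boolean containsZero(String s, Charset cs) {
        for (byte b : s.getBytes(cs)) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8, StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            // Compare a plain ASCII string with a string holding only the NUL character.
            System.out.println(cs + ": abc -> " + containsZero("abc", cs)
                    + ", NUL -> " + containsZero("\0", cs));
        }
    }
}
```

Passing the Charset explicitly makes the result independent of file.encoding, so the same program should behave the same way on every machine.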

Is it always the case that their .getBytes() byte arrays contain no zero, except for an empty string?
No, there is no such guarantee. First, and most importantly, .getBytes() returns "a sequence of bytes using the platform's default charset". As such there is nothing preventing you from defining your own custom charset that explicitly encodes certain values as 0s.
More practically, many common encodings will include zero bytes, notably to represent the NUL character. But even if your strings don't include NULs, it's possible for the byte sequence to include 0s. In particular, UTF-16 (which Java uses internally) represents every character in the Basic Multilingual Plane as two bytes (and all others as four), meaning ASCII characters (which only need one) are paired with a 0 byte.
You could also very easily test this yourself by trying to construct a String from a sequence of bytes containing 0s with an appropriate constructor, such as String(byte[] bytes) or String(byte[] bytes, Charset charset). For example (notice my system's default charset is UTF-8):
System.out.println("Default encoding: " + System.getProperty("file.encoding"));
System.out.println("Empty string: " + Arrays.toString("".getBytes()));
System.out.println("NUL char: " + Arrays.toString("\0".getBytes()));
System.out.println("String constructed from {0} array: " +
Arrays.toString(new String(new byte[]{0}).getBytes()));
System.out.println("'a' in UTF-16: " +
Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));
prints:
Default encoding: UTF-8
Empty string: []
NUL char: [0]
String constructed from {0} array: [0]
'a' in UTF-16: [-2, -1, 0, 97]

Related

US-ASCII string (de-)compression into/from a byte array (7 bits/character)

As we all know, ASCII uses 7 bits to encode each char, so I expected the number of bytes used to represent a text to always be less than its number of letters.
For example:
StringBuilder text = new StringBuilder();
IntStream.range(0, 160).forEach(x -> text.append("a")); // generate a 160-character text
int letters = text.length();
int bytes = text.toString().getBytes(StandardCharsets.US_ASCII).length;
System.out.println(letters); // expected 160, actual 160
System.out.println(bytes); // expected 140, actual 160
letters always equals bytes, but I expected letters > bytes.
The main problem: in the SMPP protocol the SMS body must be <= 140 bytes. With 7-bit ASCII encoding you could fit 160 letters (140*8/7 = 160), so I'd like the text to be encoded as 7-bit ASCII. We are using the JSMPP library.
Can anyone explain this to me and point me in the right direction? Thanks in advance (:
(160*8 - 160*7)/8 = 20, so you expect 20 fewer bytes to be used by the end of your script. However, the smallest addressable unit is the 8-bit byte, so even if you don't use all of its bits, you can't concatenate the leftover bit with another value. Each ASCII code therefore still occupies a full 8-bit byte, which is why you get the same number. For example, the lowercase "a" is 97 in ASCII:
01100001
Note that the leading zero is still there, even though it is not used. You can't just use it to store part of another value.
In conclusion, in plain ASCII the number of letters always equals the number of bytes.
(Or imagine putting size 7 object into size 8 boxes. You can't hack the objects to pieces, so the number of boxes must equal the number of objects - at least in this case.)
Here is a quick & dirty solution without any libraries, i.e. only JRE on-board means. It is not optimised for efficiency and does not check if the message is indeed US-ASCII, it just assumes it. It is just a proof of concept:
package de.scrum_master.stackoverflow;
import java.util.BitSet;
public class ASCIIConverter {
public byte[] compress(String message) {
BitSet bits = new BitSet(message.length() * 7);
int currentBit = 0;
for (char character : message.toCharArray()) {
for (int bitInCharacter = 0; bitInCharacter < 7; bitInCharacter++) {
if ((character & 1 << bitInCharacter) > 0)
bits.set(currentBit);
currentBit++;
}
}
return bits.toByteArray();
}
public String decompress(byte[] compressedMessage) {
BitSet bits = BitSet.valueOf(compressedMessage);
int numBits = 8 * compressedMessage.length - compressedMessage.length % 7;
StringBuilder decompressedMessage = new StringBuilder(numBits / 7);
for (int currentBit = 0; currentBit < numBits; currentBit += 7) {
char character = (char) bits.get(currentBit, currentBit + 7).toByteArray()[0];
decompressedMessage.append(character);
}
return decompressedMessage.toString();
}
public static void main(String[] args) {
String[] messages = {
"Hello world!",
"This is my message.\n\tAnd this is indented!",
" !\"#$%&'()*+,-./0123456789:;<=>?\n"
+ "#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_\n"
+ "`abcdefghijklmnopqrstuvwxyz{|}~",
"1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
+ "1234567890123456789012345678901234567890"
};
ASCIIConverter asciiConverter = new ASCIIConverter();
for (String message : messages) {
System.out.println(message);
System.out.println("--------------------------------");
byte[] compressedMessage = asciiConverter.compress(message);
System.out.println("Number of ASCII characters = " + message.length());
System.out.println("Number of compressed bytes = " + compressedMessage.length);
System.out.println("--------------------------------");
System.out.println(asciiConverter.decompress(compressedMessage));
System.out.println("\n");
}
}
}
The console log looks like this:
Hello world!
--------------------------------
Number of ASCII characters = 12
Number of compressed bytes = 11
--------------------------------
Hello world!
This is my message.
And this is indented!
--------------------------------
Number of ASCII characters = 42
Number of compressed bytes = 37
--------------------------------
This is my message.
And this is indented!
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
--------------------------------
Number of ASCII characters = 97
Number of compressed bytes = 85
--------------------------------
!"#$%&'()*+,-./0123456789:;<=>?
#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
--------------------------------
Number of ASCII characters = 160
Number of compressed bytes = 140
--------------------------------
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
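As a variation on the BitSet approach above, the packing can also be done with explicit shifts. One reason to consider this: BitSet.toByteArray() omits trailing all-zero bytes, so the packed length can come out shorter than ceil(7n/8) for some messages, and the character count then has to be guessed when unpacking. The sketch below sizes the array itself and takes the character count as an explicit parameter. The names SevenBitPacker, pack and unpack are hypothetical, the bit order is an arbitrary choice for this sketch, and non-ASCII input is not handled properly (getBytes(US_ASCII) replaces it with '?').

```java
import java.nio.charset.StandardCharsets;

// Hypothetical 7-bit packer, a sketch only; not the JSMPP API.
final class SevenBitPacker {

    /** Packs each ASCII character into 7 bits, least significant bit first. */
    static byte[] pack(String message) {
        byte[] ascii = message.getBytes(StandardCharsets.US_ASCII);
        byte[] packed = new byte[(ascii.length * 7 + 7) / 8]; // ceil(7n / 8)
        for (int i = 0; i < ascii.length; i++) {
            int value = ascii[i] & 0x7F;
            int byteIndex = (i * 7) / 8;
            int shift = (i * 7) % 8;
            packed[byteIndex] |= (byte) (value << shift);
            if (shift > 1) {
                // The 7 bits do not fit into the current byte; spill the rest.
                packed[byteIndex + 1] |= (byte) (value >>> (8 - shift));
            }
        }
        return packed;
    }

    /** Reverses pack(); the caller must supply the original character count. */
    static String unpack(byte[] packed, int charCount) {
        if ((charCount * 7 + 7) / 8 > packed.length) {
            throw new IllegalArgumentException("charCount too large for input");
        }
        StringBuilder sb = new StringBuilder(charCount);
        for (int i = 0; i < charCount; i++) {
            int byteIndex = (i * 7) / 8;
            int shift = (i * 7) % 8;
            int value = (packed[byteIndex] & 0xFF) >>> shift;
            if (shift > 1) {
                value |= (packed[byteIndex + 1] & 0xFF) << (8 - shift);
            }
            sb.append((char) (value & 0x7F));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String message = "Hello world!";
        byte[] packed = pack(message);
        System.out.println(message.length() + " chars -> " + packed.length + " bytes");
        System.out.println(unpack(packed, message.length()));
    }
}
```

Carrying the character count separately also removes the ambiguity between 7 and 8 characters, which both occupy 7 packed bytes.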
Based on the encoding type, Byte length would be different. Check the below example.
String text = "0123456789";
byte[] b1 = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(b1.length);
// prints "10"
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length);
// prints "10"
byte[] utf16= text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16.length);
// prints "22"
byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(latin1.length);
// prints "10"
Nope. In "modern" environments (since 3 or 4 decades ago), the ASCII character encoding for the ASCII character set uses 8 bit code units which are then serialized to one byte each. This is because we want to move and store data in "octets" (8-bit bytes). This character encoding happens to always have the high bit set to 0.
You could say there was, long ago, a 7-bit character encoding for the ASCII character set. Even then data might have been moved or stored as octets. The high bit would be used for some application-specific purpose such as parity. Some systems would zero it out in an attempt to increase interoperability, but in the end hindered interoperability by not being "8-bit safe". With strong Internet standards, such systems are almost all in the past.

How does encoding/decoding bytes work in Java?

Little background: I'm doing cryptopals challenges and I finished https://cryptopals.com/sets/1/challenges/1 but realized I didn't learn what I guess is meant to be learned (or coded).
I'm using the Apache Commons Codec library for Hex and Base64 encoding/decoding. The goal is to decode the hex string and re-encode it to Base64. The "hint" at the bottom of the page says "Always operate on raw bytes, never on encoded strings. Only use hex and base64 for pretty-printing."
Here's my answer...
private static Hex forHex = new Hex();
private static Base64 forBase64 = new Base64();
public static byte[] hexDecode(String hex) throws DecoderException {
byte[] rawBytes = forHex.decode(hex.getBytes());
return rawBytes;
}
public static byte[] encodeB64(byte[] bytes) {
byte[] base64Bytes = forBase64.encode(bytes);
return base64Bytes;
}
public static void main(String[] args) throws DecoderException {
String hex = "49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d";
//decode hex String to byte[]
byte[] myHexDecoded = hexDecode(hex);
String myHexDecodedString = new String(myHexDecoded);
//Lyrics from Queen's "Under Pressure"
System.out.println(myHexDecodedString);
//encode myHexDecoded to Base64 encoded byte[]
byte[] myHexEncoded = encodeB64(myHexDecoded);
String myB64String = new String(myHexEncoded);
//"pretty printing" of base64
System.out.println(myB64String);
}
...but I feel like I cheated. I didn't learn how to decode bytes that were encoded as hex, and I didn't learn how to encode "pure" bytes to Base64, I just learned how to use a library to do something for me.
If I were to take a String in Java then get its bytes, how would I encode those bytes into hex? For example, the following code snip turns "Hello" (which is readable English) to the byte value of each character:
String s = "Hello";
char[] sChar = s.toCharArray();
byte[] sByte = new byte[sChar.length];
for (int i = 0; i < sChar.length; i++) {
    sByte[i] = (byte) sChar[i];
    System.out.println("sByte[" + i + "] = " + sByte[i]);
}
which yields sByte[0] = 72, sByte[1] = 101, sByte[2] = 108, sByte[3] = 108, sByte[4] = 111
Let's use 'o' as an example - I am guessing its decimal version is 111 - do I just take its decimal version and change that to its hex version?
If so, to decode, do I just take the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
to decode, do I just take the characters in the hex String 2 at a time, decompose them to decimal values, then convert to ASCII? Will it always be ASCII?
No. You take the characters 2 at a time, transform the character '0' to the numeric value 0, the character '1' to the numeric value 1, ..., the character 'a' (or 'A', depending on which encoding you want to support) to the numeric value 10, ..., the character 'f' or 'F' to the numeric value 15.
Then you multiply the first numeric value by 16, and you add it to the second numeric value to get the unsigned integer value of your byte. Then you transform that unsigned integer value to a signed byte.
ASCII has nothing to do with this algorithm.
To see how it's done in practice, since commons-codec is open-source, you can just look at its implementation.
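The steps described above can be sketched without any library. This is only an illustration, not the commons-codec implementation; the class name HexSketch and its methods are made up, and Character.digit is used for the character-to-value step.

```java
// Hypothetical hand-rolled hex codec, for illustration only.
final class HexSketch {

    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    /** Decodes two hex characters at a time into one byte each. */
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int high = Character.digit(hex.charAt(2 * i), 16);     // '0'..'f' -> 0..15
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException("not a hex digit near index " + 2 * i);
            }
            // high * 16 + low is the unsigned value 0..255; the cast makes it a signed byte.
            out[i] = (byte) (high * 16 + low);
        }
        return out;
    }

    /** Encodes each byte as two lowercase hex characters. */
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            int unsigned = b & 0xFF; // undo sign extension
            sb.append(DIGITS[unsigned >>> 4]).append(DIGITS[unsigned & 0x0F]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = decode("48656c6c6f");
        System.out.println(java.util.Arrays.toString(raw));
        System.out.println(encode(raw));
    }
}
```

Note that nothing here interprets the bytes as text; whether they happen to be ASCII only matters if you later build a String from them.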

Why new String with UTF-8 contains more bytes

byte bytes[] = new byte[16];
random.nextBytes(bytes);
try {
return new String(bytes, "UTF-8");
} catch (UnsupportedEncodingException e) {
log.warn("Hash generation failed", e);
}
When I generate a String with the given method and call string.getBytes().length, it returns some other value; the maximum was 32. Why does a 16-byte array end up producing a string with a different byte length?
But if I call string.length(), it returns 16.
This is because your bytes are first converted to a Unicode string, which attempts to decode these bytes as a UTF-8 char sequence. If a byte can neither be treated as an ASCII char nor combined with the following byte(s) to form a legal Unicode char, it is replaced by "�". That char is encoded as 3 bytes (in UTF-8) when calling String#getBytes(), thus adding 2 extra bytes to the resulting output.
If you're lucky enough to generate only ASCII chars, String#getBytes() will return a 16-byte array; if not, the resulting array may be longer. For example, the following code snippet:
byte[] b = new byte[16];
Arrays.fill(b, (byte) 190);
b = new String(b, "UTF-8").getBytes();
returns an array that is 48(!) bytes long.
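To see that the growth comes from the decoding step and not from String itself, the same bytes can be round-tripped through two different charsets. This is a small sketch with made-up names (RoundTripSketch, roundTrip); both charsets are passed explicitly so the platform default plays no role.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
final class RoundTripSketch {

    /** Decodes raw with the charset, then encodes the resulting String with the same charset. */
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[16];
        Arrays.fill(raw, (byte) 190); // 0xBE is never valid on its own in UTF-8

        byte[] viaUtf8 = roundTrip(raw, StandardCharsets.UTF_8);
        byte[] viaLatin1 = roundTrip(raw, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip length:      " + viaUtf8.length);
        System.out.println("ISO-8859-1 round trip length: " + viaLatin1.length);
        System.out.println("ISO-8859-1 round trip intact: " + Arrays.equals(raw, viaLatin1));
    }
}
```

ISO-8859-1 maps every byte value to exactly one char, which is why that round trip preserves the data, while UTF-8 decoding substitutes the replacement character for each invalid byte.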
Classical mistake born from the misunderstanding of the relationship between bytes and chars, so here we go again.
There is no 1-to-1 mapping between byte and char; it all depends on the character coding you use (in Java, that is a Charset).
Worse: a given byte sequence may or may not be decodable into a char sequence.
Try this for instance:
final byte[] buf = new byte[16];
new Random().nextBytes(buf);
final Charset utf8 = StandardCharsets.UTF_8;
final CharsetDecoder decoder = utf8.newDecoder()
.onMalformedInput(CodingErrorAction.REPORT);
decoder.decode(ByteBuffer.wrap(buf));
This is very likely to throw a MalformedInputException.
I know this is not exactly an answer but then you didn't clearly explain your problem; and the example above shows already that you have the wrong understanding between what a byte is and what a char is.
The generated bytes might contain valid multibyte characters.
Take this as an example. The string contains only one character, but its byte representation takes three bytes.
String s = "Ω";
System.out.println("length = " + s.length());
System.out.println("bytes = " + Arrays.toString(s.getBytes("UTF-8")));
String.length() returns the length of the string in chars (UTF-16 code units), not bytes. The character Ω is a single char, whereas it is 3 bytes long in UTF-8.
If you change your code like this
Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
System.out.println("string = " + new String(bytes, "UTF-8").length());
System.out.println("string = " + new String(bytes, "ISO-8859-1").length());
The same bytes are interpreted with a different charset. And following the javadoc from String(byte[] b, String charset)
The length of the new String is a function of the charset, and hence may
not be equal to the length of the byte array.
If you look at the string you're producing, most of the random bytes you're generating do not form valid UTF-8 characters. The String constructor, therefore, replaces them with the Unicode 'REPLACEMENT CHARACTER' � (U+FFFD), which takes up 3 bytes in UTF-8 (EF BF BD).
As an example:
public static void main(String[] args) throws UnsupportedEncodingException
{
Random random = new Random();
byte bytes[] = new byte[16];
random.nextBytes(bytes);
printBytes(bytes);
final String s = new String(bytes, "UTF-8");
System.out.println(s);
printCharacters(s);
}
private static void printBytes(byte[] bytes)
{
for (byte aByte : bytes)
{
System.out.print(
Integer.toHexString(Byte.toUnsignedInt(aByte)) + " ");
}
System.out.println();
}
private static void printCharacters(String s)
{
s.codePoints().forEach(i -> System.out.println(Character.getName(i)));
}
On a given run, I got this output:
30 41 9b ff 32 f5 38 ec ef 16 23 4a 54 26 cd 8c
0A��2�8��#JT&͌
DIGIT ZERO
LATIN CAPITAL LETTER A
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
DIGIT TWO
REPLACEMENT CHARACTER
DIGIT EIGHT
REPLACEMENT CHARACTER
REPLACEMENT CHARACTER
SYNCHRONOUS IDLE
NUMBER SIGN
LATIN CAPITAL LETTER J
LATIN CAPITAL LETTER T
AMPERSAND
COMBINING ALMOST EQUAL TO ABOVE
String.getBytes().length is likely to be longer, as it counts bytes needed to represent the string, while length() counts 2-byte code units.
This will try to create a String assuming the bytes are in UTF-8.
new String(bytes, "UTF-8");
This in general will go horribly wrong as UTF-8 multi-byte sequences can be invalid.
Like:
String s = new String(new byte[] { -128 }, StandardCharsets.UTF_8);
The second step:
byte[] bytes = s.getBytes();
will use the platform encoding (System.getProperty("file.encoding")). Better specify it.
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
One should realize, internally String will maintain Unicode, an array of 16-bit char in UTF-16.
One should entirely abstain from using String for byte[]. It will always involve a conversion, cost double memory and be error prone.
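If a textual form of random bytes is needed (for example a token), one option is to keep the bytes as bytes and use Base64 only for display or transport. The sketch below assumes java.util.Base64 (Java 8+) and SecureRandom; the class and method names are hypothetical and not taken from the question's code.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical token helper, for illustration only.
final class TokenSketch {

    /** Renders arbitrary bytes as URL-safe Base64 text without padding. */
    static String toToken(byte[] raw) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    /** Recovers the exact original bytes from the token text. */
    static byte[] fromToken(String token) {
        return Base64.getUrlDecoder().decode(token);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[16];
        new SecureRandom().nextBytes(raw);

        String token = toToken(raw);
        System.out.println(token + " (" + token.length() + " chars)");
        System.out.println("lossless: " + Arrays.equals(raw, fromToken(token)));
    }
}
```

Unlike new String(bytes, "UTF-8"), this mapping is reversible for every possible byte value, at the cost of a longer text.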

Single UTF-8 char to byte

If I am converting a UTF-8 char to byte, will there ever be a difference in the result of these 3 implementations based on locale, environment, etc.?
byte a = "1".getBytes()[0];
byte b = "1".getBytes(Charset.forName("UTF-8"))[0];
byte c = '1';
Your first line is dependent on the environment, because it will encode the string using the default character encoding of your system, which may or may not be UTF-8.
Your second line will always produce the same result, no matter what the locale or the default character encoding of your system is. It will always use UTF-8 to encode the string.
Note that UTF-8 is a variable-length character encoding. Only the first 128 characters (U+0000 to U+007F) are encoded in one byte; all other characters take 2 to 4 bytes.
Your third line assigns a char constant to a byte, which is an implicit narrowing conversion of the char's numeric value. That value is the character's UTF-16 code unit, since Java's char stores UTF-16. For ASCII characters this code coincides with the single byte UTF-8 uses, so the result is the same as in the second line, but this is not true in general for other characters.
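A short sketch can make the difference between the second and third variants visible for characters outside ASCII. The class name Utf8LengthSketch and its helpers are made up for illustration; the sample characters are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class Utf8LengthSketch {

    /** Number of bytes the whole string occupies in UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** First byte of the string's UTF-8 encoding. */
    static byte firstUtf8Byte(String s) {
        return s.getBytes(StandardCharsets.UTF_8)[0];
    }

    public static void main(String[] args) {
        // DIGIT ONE, LATIN SMALL LETTER E WITH ACUTE, EURO SIGN, and an emoji (a surrogate pair)
        String[] samples = { "1", "\u00e9", "\u20ac", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println("UTF-8 length " + utf8Length(s)
                    + ", first UTF-8 byte " + firstUtf8Byte(s)
                    + ", (byte) charAt(0) " + (byte) s.charAt(0));
        }
    }
}
```

For the digit the two byte values agree; for the other samples the cast simply keeps the low 8 bits of the UTF-16 code unit, which is unrelated to the UTF-8 lead byte.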
In principle the question is already answered, but I cannot resist posting a little scribble for those who like to play around with code:
import java.nio.charset.Charset;
public class EncodingTest {
private static void checkCharacterConversion(String c) {
byte asUtf8 = c.getBytes(Charset.forName("UTF-8"))[0];
byte asDefaultEncoding = c.getBytes()[0];
byte directConversion = (byte)c.charAt(0);
if (asUtf8 != asDefaultEncoding) {
System.out.println(String.format(
"First char of %s has different result in UTF-8 %d and default encoding %d",
c, asUtf8, asDefaultEncoding));
}
if (asUtf8 != directConversion) {
System.out.println(String.format(
"First char of %s has different result in UTF-8 %d and direct as byte %d",
c, asUtf8, directConversion));
}
}
public static void main(String[] argv) {
// btw: first time I ever wrote a for loop with a char - feels weird to me
for (char c = '\0'; c <= '\u007f'; c++) {
String cc = new String(new char[] {c});
checkCharacterConversion(cc);
}
}
}
If you run this e.g. with:
java -Dfile.encoding="UTF-16LE" EncodingTest
you will get no output.
But of course every single byte (ok, except for the first) will be wrong if you try:
java -Dfile.encoding="UTF-16BE" EncodingTest
because in "big endian" the first byte is always zero for ASCII chars.
That is because in UTF-16 an ASCII character '\u00xy' is represented by two bytes: in UTF-16LE as [xy, 0] and in UTF-16BE as [0, xy].
However, only the first statement produces any output, so b and c are indeed the same for the 128 ASCII characters, because in UTF-8 they are encoded as a single byte. This will not be true for any further characters, however; they all have multi-byte representations in UTF-8.

My java class implementation of XOR encryption has gone wrong

I am new to java but I am very fluent in C++ and C# especially C#. I know how to do xor encryption in both C# and C++. The problem is the algorithm I wrote in Java to implement xor encryption seems to be producing wrong results. The results are usually a bunch of spaces and I am sure that is wrong. Here is the class below:
public final class Encrypter {
public static String EncryptString(String input, String key)
{
int length;
int index = 0, index2 = 0;
byte[] ibytes = input.getBytes();
byte[] kbytes = key.getBytes();
length = kbytes.length;
char[] output = new char[ibytes.length];
for(byte b : ibytes)
{
if (index == length)
{
index = 0;
}
int val = (b ^ kbytes[index]);
output[index2] = (char)val;
index++;
index2++;
}
return new String(output);
}
public static String DecryptString(String input, String key)
{
int length;
int index = 0, index2 = 0;
byte[] ibytes = input.getBytes();
byte[] kbytes = key.getBytes();
length = kbytes.length;
char[] output = new char[ibytes.length];
for(byte b : ibytes)
{
if (index == length)
{
index = 0;
}
int val = (b ^ kbytes[index]);
output[index2] = (char)val;
index++;
index2++;
}
return new String(output);
}
}
Strings in Java are Unicode - and Unicode strings are not general holders for bytes like ASCII strings can be.
You're taking a string and converting it to bytes without specifying what character encoding you want, so you're getting the platform default encoding - probably US-ASCII, UTF-8 or one of the Windows code pages.
Then you're performing arithmetic/logic operations on these bytes. (I haven't looked at what you're doing here - you say you know the algorithm.)
Finally, you're taking these transformed bytes and trying to turn them back into a string - that is, back into characters. Again, you haven't specified the character encoding (but you'll get the same as you got converting characters to bytes, so that's OK), but, most importantly...
Unless your platform default encoding uses a single byte per character (e.g. US-ASCII), then not all of the byte sequences you will generate represent valid characters.
So, two pieces of advice come from this:
Don't use strings as general holders for bytes
Always specify a character encoding when converting between bytes and characters.
In this case, you might have more success if you specifically give US-ASCII as the encoding. EDIT: This last sentence is not true (see comments below). Refer back to point 1 above! Use bytes, not characters, when you want bytes.
If you use non-ASCII strings as keys you'll get pretty strange results. Some of the bytes in the kbytes array will be negative. Sign extension then means that val will come out negative. The cast to char will then produce a character in the FF80-FFFF range.
These characters will certainly not be printable, and depending on what you use to check the output you may be shown "box" or some other replacement characters.
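Following the advice above, one possible shape is to XOR byte arrays and only turn the result into text through Base64. This is a sketch, not a fix endorsed by the answers; the class and method names are hypothetical, UTF-8 is an assumed choice for converting text to bytes, and a repeating-key XOR is of course not secure encryption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical byte-based XOR helper, for illustration only.
final class XorSketch {

    /** XORs data with the key, repeating the key as needed. */
    static byte[] xor(byte[] data, byte[] key) {
        if (key.length == 0) {
            throw new IllegalArgumentException("key must not be empty");
        }
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] ^ key[i % key.length]);
        }
        return out;
    }

    /** Text in, Base64 text out; the XOR itself only ever sees bytes. */
    static String encrypt(String plain, String key) {
        byte[] cipher = xor(plain.getBytes(StandardCharsets.UTF_8),
                            key.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(cipher);
    }

    static String decrypt(String base64, String key) {
        byte[] plain = xor(Base64.getDecoder().decode(base64),
                           key.getBytes(StandardCharsets.UTF_8));
        return new String(plain, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encrypted = encrypt("Hello, world", "secret");
        System.out.println(encrypted);
        System.out.println(decrypt(encrypted, "secret"));
    }
}
```

Because the ciphertext never passes through a char[] or a charset decoder, unprintable or invalid byte sequences cannot be mangled on the way back.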
