Bytes of a string in Java

In Java, if I have a String x, how can I calculate the number of bytes in that string?

A string is a sequence of characters (i.e. code points). The number of bytes needed to represent the string depends entirely on which encoding you use to turn it into bytes.
That said, you can turn the string into a byte array and then look at its size as follows:
// The input string for this test
final String string = "Hello World";
// Check length, in characters
System.out.println(string.length()); // prints "11"
// Check encoded sizes
final byte[] utf8Bytes = string.getBytes("UTF-8");
System.out.println(utf8Bytes.length); // prints "11"
final byte[] utf16Bytes= string.getBytes("UTF-16");
System.out.println(utf16Bytes.length); // prints "24"
final byte[] utf32Bytes = string.getBytes("UTF-32");
System.out.println(utf32Bytes.length); // prints "44"
final byte[] isoBytes = string.getBytes("ISO-8859-1");
System.out.println(isoBytes.length); // prints "11"
final byte[] winBytes = string.getBytes("CP1252");
System.out.println(winBytes.length); // prints "11"
So you see, even a simple "ASCII" string can have a different number of bytes in its representation, depending on which encoding is used. Use whichever character set you're interested in for your case as the argument to getBytes(). And don't fall into the trap of assuming that UTF-8 represents every character as a single byte, as that's not true either:
final String interesting = "\uF93D\uF936\uF949\uF942"; // Chinese ideograms
// Check length, in characters
System.out.println(interesting.length()); // prints "4"
// Check encoded sizes
final byte[] utf8Bytes = interesting.getBytes("UTF-8");
System.out.println(utf8Bytes.length); // prints "12"
final byte[] utf16Bytes= interesting.getBytes("UTF-16");
System.out.println(utf16Bytes.length); // prints "10"
final byte[] utf32Bytes = interesting.getBytes("UTF-32");
System.out.println(utf32Bytes.length); // prints "16"
final byte[] isoBytes = interesting.getBytes("ISO-8859-1");
System.out.println(isoBytes.length); // prints "4" (probably encoded "????")
final byte[] winBytes = interesting.getBytes("CP1252");
System.out.println(winBytes.length); // prints "4" (probably encoded "????")
(Note that if you don't provide a character set argument, the platform's default character set is used. This might be useful in some contexts, but in general you should avoid depending on defaults, and always use an explicit character set when encoding/decoding is required.)
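For instance, here is a minimal sketch (using java.nio.charset.Charset and StandardCharsets, both in the standard library) of checking the default and then encoding explicitly:
System.out.println(Charset.defaultCharset()); // whatever your platform default is, e.g. "UTF-8"
final byte[] explicitBytes = "Hello World".getBytes(StandardCharsets.UTF_8); // no checked exception
System.out.println(explicitBytes.length); // prints "11"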

If you're running with 64-bit references (note that this estimate describes the older String layout with offset/count fields and a separate char array; Java 7u6 removed those fields, and Java 9's compact strings can store Latin-1 text at one byte per character):
sizeof(string) =
8 + // object header used by the VM
8 + // 64-bit reference to char array (value)
8 + string.length() * 2 + // character array itself (object header + 16-bit chars)
4 + // offset integer
4 + // count integer
4 + // cached hash code
In other words:
sizeof(string) = 36 + string.length() * 2
On a 32-bit VM or a 64-bit VM with compressed OOPs (-XX:+UseCompressedOops), the references are 4 bytes. So the total would be:
sizeof(string) = 32 + string.length() * 2
This does not take into account the references to the string object.
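As a rough sketch, the estimate above could be wrapped in a hypothetical helper (valid only for the old offset/count String layout described here, not an exact measurement):
// Hypothetical estimator for the layout above
static long estimatedSizeOf(String s, boolean compressedRefs) {
    // 32 + 2n with 4-byte references, 36 + 2n with 64-bit references
    return (compressedRefs ? 32 : 36) + 2L * s.length();
}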

The pedantic answer (though not necessarily the most useful one, depending on what you want to do with the result) is:
string.length() * 2
Java strings are physically stored as UTF-16 code units, 2 bytes per code unit (at least prior to Java 9's compact strings), and String.length() measures the length in UTF-16 code units, so this is equivalent to:
final byte[] utf16Bytes= string.getBytes("UTF-16BE");
System.out.println(utf16Bytes.length);
And this will tell you the size of the internal char array, in bytes.
Note: "UTF-16" will give a different result from "UTF-16BE" as the former encoding will insert a BOM, adding 2 bytes to the length of the array.

According to How to convert Strings to and from UTF8 byte arrays in Java:
String s = "some text here";
byte[] b = s.getBytes("UTF-8");
System.out.println(b.length);

A String instance allocates a certain amount of bytes in memory. Maybe you're looking for something like sizeof("Hello World"), which would return the number of bytes allocated by the data structure itself?
In Java, there's usually no need for a sizeof function, because we never explicitly allocate memory to store a data structure. We can have a look at the String.java file for a rough estimation, and we see some ints, some references and a char[]. The Java language specification defines that a char ranges from 0 to 65535, so two bytes are sufficient to keep a single char in memory. But a JVM does not have to store one char in 2 bytes; it only has to guarantee that its implementation of char can hold values of the defined range.
So sizeof really does not make any sense in Java. But, assuming that we have a large String and that one char allocates two bytes, the memory footprint of a String object is at least 2 * str.length() bytes.

There's a method called getBytes(). Use it wisely.
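"Wisely" mostly means passing an explicit charset; a minimal sketch, assuming UTF-8 is the encoding you care about:
final byte[] bytes = x.getBytes(StandardCharsets.UTF_8); // explicit charset, no checked exception
System.out.println(bytes.length);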

Try this:
Bytes.toBytes(x).length
assuming you declared and initialized x beforehand. (Bytes here is a third-party helper, e.g. HBase's org.apache.hadoop.hbase.util.Bytes, which encodes the string as UTF-8.)

To avoid the try/catch (the java.nio.charset.StandardCharsets overload of getBytes doesn't throw a checked exception), use:
String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);
System.out.println(b.length);

Try this using Apache Commons:
String src = "Hello"; // This will work with any Serializable object
System.out.println(
    "Object Size:" + SerializationUtils.serialize((Serializable) src).length);
Note that this measures the Java-serialized form, which includes stream-header and type information, so it will be larger than the string's character bytes.

Related

Java, trying to create a specific network byte header based on length of command

I'm running into some trouble when attempting to create a network byte header. The header should be 2 bytes long, and simply encodes the length of the following command.
For example, the command String "HED>0123456789ABCDEF" is 20 characters long, which is 0x0014 as a signed two's-complement short. Creating the network byte header for this command works as long as the command is under 124 characters: the snippet of code below works out the byte header and prefixes the command with \u0000\u0014 when the command is under 124 characters.
However, for commands that are 124 characters or longer, the code in the if block doesn't work. I therefore looked into possible alternatives and tried a couple of things involving generating hex characters and setting them as the network byte header, but as they aren't bytes it's not going to work (as seen in the else block). Instead, the else block simply returns 0090 for a command that is 153 characters long, which is technically correct, but I'm not able to use this 'length' header the same way as the if block's length header.
public static void main(String[] args) {
    final String commandHeader = "HED>";
    final String command = "0123456789ABCDEF";
    short commandLength = (short) (commandHeader.length() + command.length());
    char[] array;
    if( commandLength < 124 )
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = new String( bb.array() ).toCharArray();
    }
    else
    {
        final ByteBuffer bb = ByteBuffer.allocate(2).putShort(commandLength);
        array = convertToHex(bb.array());
    }
    final String fullCommand = new String(array) + commandHeader + command;
    System.out.println( fullCommand );
}
private static char[] convertToHex(byte[] data) {
    final StringBuilder buf = new StringBuilder();
    for (byte b : data) {
        int halfByte = (b >>> 4) & 0x0F;
        int twoHalves = 0;
        do {
            if ((0 <= halfByte) && (halfByte <= 9))
                buf.append((char) ('0' + halfByte));
            else
                buf.append((char) ('a' + (halfByte - 10)));
            halfByte = b & 0x0F;
        } while (twoHalves++ < 1);
    }
    return buf.toString().toCharArray();
}
Furthermore, I have managed to get this working in Python 2 in the following three lines, no less! For a 153-character command this produces the network byte header \x00\x99:
msg_length = len(str_header + str_command)
command_length = pack('>h', msg_length)
command = command_length + str_header + str_command
Also simply replicated by running Python 2 and entering the following commands:
In [1]: import struct
In [2]: struct.pack('>h', 153)
Out[2]: '\x00\x99'
Any assistance, or light that could be shed to resolve this issue would be greatly appreciated.
The basic problem is that you (try to) convert fundamentally binary data to character data. Furthermore, you do it using the platform's default charset, which varies from machine to machine.
I think you have mischaracterized the problem slightly, however. I am confident that it arises when command.length() is at least 124, so that commandLength, which includes the length of commandHeader, too, is at least 128. You would also find that there are some (much) larger command lengths that worked, too.
The key point here is that when any of the bytes in the binary representation of the length has its most-significant bit set, that is meaningful to some character encodings, especially UTF-8, which is a common (but not universal) default. Unless you get very lucky, binary lengths containing any such bytes will not be correctly decoded into characters in UTF-8. Moreover, they may get decoded into characters successfully but differently on machines that use different charsets for the purpose.
You also have another, related inconsistency. You are formatting data for transmission over the network, which is a byte-oriented medium. The transmission will be a sequence of bytes. But you are measuring and reporting the number of characters in the decoded internal representation, not the number of bytes in the encoded representation that will go over the wire. The two counts are the same for your example command, but they would differ for some strings that you could express in Java.
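As a minimal demonstration of that corruption (a sketch, assuming a UTF-8 default charset): a length byte with its high bit set does not survive a bytes-to-String-to-bytes round trip:
final byte[] raw = { 0x00, (byte) 0x99 }; // the binary length 153
final String decoded = new String(raw, StandardCharsets.UTF_8); // 0x99 is not valid UTF-8 on its own
final byte[] reEncoded = decoded.getBytes(StandardCharsets.UTF_8);
System.out.println(reEncoded.length); // prints "4": 0x99 was decoded as U+FFFD, which re-encodes as 3 bytes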
Additionally, your code is inconsistent with your description of the format wanted. You say that the "network byte header" should be four bytes long, but your code emits only two.
You can address all these issues by taking character encoding explicitly into account, and by avoiding the unneeded and inappropriate conversion of raw binary data to character data. The ByteBuffer class you're already using can help with that. For example:
public static void main(String[] args) throws IOException {
    String commandHeader = "HED>";
    // a 128-byte command
    String command = "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF"
            + "0123456789ABCDEF";
    // Convert characters to bytes, and do so with a specified charset
    // Note that ALL Java implementations are required to support UTF-8
    byte[] commandHeaderBytes = commandHeader.getBytes("UTF-8");
    byte[] commandBytes = command.getBytes("UTF-8");
    // Measure the command length in bytes, since that's what the receiver
    // will need to know
    int commandLength = commandHeaderBytes.length + commandBytes.length;
    // Build the whole message in your ByteBuffer
    // Allow a 4-byte length field, per spec
    ByteBuffer bb = ByteBuffer.allocate(commandLength + 4);
    bb.putInt(commandLength)
      .put(commandHeaderBytes)
      .put(commandBytes);
    // DO NOT convert to a String or other character type. Output the
    // bytes directly.
    System.out.write(bb.array());
    System.out.println();
}

Will the result of String.getBytes() ever contain zeros?

I have tried numerous Strings with random characters, and except for the empty string "", their .getBytes() byte arrays never seem to contain any 0 values (like {123, -23, 54, 0, -92}).
Is it always the case that their .getBytes() byte arrays contain no zero, except for an empty string?
Edit: my previous test code is below. I have now learned that, in Java 8, the result always seems to be "contains no 0" if the String is made up of (char) (random.nextInt(65535) + 1), and "contains 0" if the String contains (char) 0.
private static String randomString(int length){
    Random random = new Random();
    char[] chars = new char[length];
    for (int i = 0; i < length; i++){
        int integer = random.nextInt(65535) + 1;
        chars[i] = (char) (integer);
    }
    return new String(chars);
}

public static void main(String[] args) throws Exception {
    for (int i = 1; i < 100000; i++){
        String s1 = randomString(10);
        byte[] bytes = s1.getBytes();
        for (byte b : bytes) {
            if (b == 0){
                System.out.println("contains 0");
                System.exit(0);
            }
        }
    }
    System.out.println("contains no 0");
}
It does depend on your platform's default encoding. But in many encodings, the '\0' (null) character will result in getBytes() returning an array with a zero in it:
System.out.println("\0".getBytes()[0]);
This will work with the US-ASCII, ISO-8859-1 and the UTF-8 encodings:
System.out.println("\0".getBytes("US-ASCII")[0]);
System.out.println("\0".getBytes("ISO-8859-1")[0]);
System.out.println("\0".getBytes("UTF-8")[0]);
If you have a byte array and you want the string that corresponds to it, you can also do the reverse:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b);
However, this will give different results for different encodings, and in some encodings the bytes may not form a valid sequence. The resulting characters may also not be printable. Your best bet is the ISO-8859-1 encoding, in which every byte value maps to a character; in this example only the null character is unprintable:
byte[] b = { 123, -23, 54, 0, -92 };
String s = new String(b, "ISO-8859-1");
System.out.println(s);
System.out.println((int) s.charAt(3));
Edit
In the code that you posted, it's also easy to get "contains 0" if you specify the UTF-16 encoding:
byte[] bytes = s1.getBytes("UTF-16");
It's all about the encoding, and you haven't specified one. When you don't pass one as an argument to the getBytes method, it uses your platform's default encoding.
To find out what that is on your platform, run this:
System.out.println(System.getProperty("file.encoding"));
On macOS it's UTF-8; on Windows it's likely to be one of the Windows code pages, like Cp1252. You can also specify the platform default on the command line when you run Java:
java -Dfile.encoding=UTF16 <the rest>
If you run your code that way you'll also see that it contains 0.
Is it always the case that their .getBytes() byte arrays contain no zero, except for an empty string?
No, there is no such guarantee. First, and most importantly, .getBytes() returns "a sequence of bytes using the platform's default charset". As such, there is nothing preventing you from defining your own custom charset that explicitly encodes certain values as 0s.
More practically, many common encodings will include zero bytes, notably to represent the NUL character. But even if your strings don't include NULs, it's possible for the byte sequence to include 0s. In particular, UTF-16 (which Java uses internally) represents BMP characters in two bytes, meaning ASCII characters (which only need one) are paired with a 0 byte.
You could also very easily test this yourself by trying to construct a String from a sequence of bytes containing 0s with an appropriate constructor, such as String(byte[] bytes) or String(byte[] bytes, Charset charset). For example (notice my system's default charset is UTF-8):
System.out.println("Default encoding: " + System.getProperty("file.encoding"));
System.out.println("Empty string: " + Arrays.toString("".getBytes()));
System.out.println("NUL char: " + Arrays.toString("\0".getBytes()));
System.out.println("String constructed from {0} array: " +
Arrays.toString(new String(new byte[]{0}).getBytes()));
System.out.println("'a' in UTF-16: " +
Arrays.toString("a".getBytes(StandardCharsets.UTF_16)));
prints:
Default encoding: UTF-8
Empty string: []
NUL char: [0]
String constructed from {0} array: [0]
'a' in UTF-16: [-2, -1, 0, 97]

Creating a string of a specific size (in bytes)

I have an application where I want to have a method that creates a string of a specified size in bytes.
Here's what I basically wrote. I just want to make sure that this produces, for example, a string of size bytes:
static String createMsg(int size){
    byte[] msgB = new byte[size];
    for(int i = 0; i < size; i++){
        msgB[i] = 105;
    }
    String x = new String(msgB);
    return x;
}
Thank you.
To create a string of a specific size in bytes, you must create a string containing only ASCII characters. All characters between 0x0 (0) and 0x7F (127) are ASCII characters, and each ASCII character takes one byte. Each one still takes one byte if your encoding is UTF-8, as the first 128 characters in UTF-8 and ASCII are the same, for compatibility reasons.
You can also rely on Apache Commons Lang 3 and use this code snippet to generate a string of a specific size:
RandomStringUtils.randomAscii(100);
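Either way, it's worth verifying the result; a small check, assuming an ASCII-compatible encoding such as UTF-8:
final String msg = createMsg(1024); // or RandomStringUtils.randomAscii(1024)
System.out.println(msg.getBytes(StandardCharsets.UTF_8).length); // prints "1024"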

char[] into ascii and decimal value (double) java

Based on this array:
final char[] charValue = { 'u', ' ', '}', '+' };
I want to print the double value and the ASCII value from it in Java. I can't find a proper solution for that on the internet; I only found how to convert a single Character into an Integer value. But what about many characters?
The main problem is that I have a large char[] with some double and int values stored in it. The double values are stored in 4 bytes and the integers in 1 or 2 bytes, so I have to read all of this and convert it into double or int values.
Thanks for your help
When Java was designed, C's char was being used for both binary bytes and text.
Java made a clear separation between binary data (byte[], InputStream/OutputStream) and Unicode text (char, String, Reader/Writer); hence Java has full Unicode support. Binary data (byte[]) needs an associated encoding in order to be convertible to text (char[]/String).
In Java a char[] is rarely used the way it is in C/C++, and it seems byte[] is what you intend, as you mention 4 elements being used for an int, etcetera. A char is 16 bits, containing UTF-16 text.
For this case one can use a ByteBuffer, either wrapping a byte[] or taken from a memory-mapped file.
Writing
ByteBuffer buf = ByteBuffer.allocate(13); // 13 bytes
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
buf.putInt(42); // at 0
buf.putDouble(Math.PI); // at 4
buf.put((byte) '0'); // at 12
buf.putDouble(4, 3.0); // at 4 overwrite PI
byte[] bytes = buf.array();
Reading
ByteBuffer buf = ByteBuffer.wrap(bytes);
buf.order(ByteOrder.LITTLE_ENDIAN); // Intel order
int a = buf.getInt();
double b = buf.getDouble();
byte c = buf.get();
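If your input really does arrive as a char[] in which each element holds a single byte value (an assumption based on the question), one sketch is to narrow it to a byte[] first and then wrap that:
// Hypothetical helper: only safe if every char is in the 0-255 range
static byte[] charsToBytes(char[] chars) {
    byte[] bytes = new byte[chars.length];
    for (int i = 0; i < chars.length; i++) {
        bytes[i] = (byte) chars[i]; // narrowing cast drops the high byte of each char
    }
    return bytes;
}
ByteBuffer buf = ByteBuffer.wrap(charsToBytes(charValue));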

Convert byte array to string with equivalent number of bytes

Is it possible to convert a byte array to a string but where the length of the string is exactly the same length as the number of bytes in the array? If I use the following:
byte[] data; // Fill it with data
data.toString();
The length of the string is different than the length of the array. I believe that this is because Java and/or Android takes some kind of default encoding into account. The values in the array can be negative as well. Theoretically it should be possible to convert any byte to some character. I guess I need to figure out how to specify an encoding that generates a fixed single byte width for each character.
EDIT:
I tried the following but it didn't work:
byte[] textArray; // Fill this with some text.
String textString = new String(textArray, "ASCII");
textArray = textString.getBytes("ASCII"); // textArray ends up with different data.
You can use the String constructor String(byte[] data) to create a string from the byte array. If you want to specify the charset as well, you can use String(byte[] data, Charset charset) constructor.
Try your code sample with US-ASCII or ISO-8859-1 in place of ASCII. "ASCII" itself is only an alias and isn't guaranteed to be recognized on Java or Android, while US-ASCII and ISO-8859-1 are charsets every implementation must support. Both are single-byte encodings, with the caveat that characters not in the character set are silently replaced (typically with '?'), so for arbitrary byte values you want ISO-8859-1, which covers all 256 of them.
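A quick round-trip check with ISO-8859-1 (a sketch; java.util.Arrays is assumed imported). Because every one of the 256 byte values maps to a character, the data survives intact and the lengths match:
final byte[] textArray = { 72, 101, -1, 0, 127 }; // arbitrary sample data
final String textString = new String(textArray, StandardCharsets.ISO_8859_1);
final byte[] roundTrip = textString.getBytes(StandardCharsets.ISO_8859_1);
System.out.println(Arrays.equals(textArray, roundTrip)); // prints "true"
System.out.println(textString.length() == textArray.length); // prints "true"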
This should work fine!
public static byte[] stringToByteArray(String pStringValue){
    int length = pStringValue.length();
    byte[] bytes = new byte[length];
    for(int index = 0; index < length; index++){
        char ch = pStringValue.charAt(index);
        bytes[index] = (byte) ch; // narrowing cast: only correct for chars in the 0-255 range
    }
    return bytes;
}
You can also use stringValue.getBytes(), which returns a byte array encoded with the platform's default charset (a getBytes(Charset) overload was added in JDK 1.6).
In case a null string is passed in, you need to handle that, either by letting the NullPointerException propagate or by handling it inside the method itself.
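For example, a null-safe variant might look like this (a sketch; returning an empty array is just one possible policy):
// Hypothetical null-safe wrapper around getBytes
public static byte[] toBytes(String value) {
    if (value == null) {
        return new byte[0]; // or throw a NullPointerException with a clear message
    }
    return value.getBytes(StandardCharsets.UTF_8);
}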
